* [v2 00/11] THP support for zone device page migration
@ 2025-07-30 9:21 Balbir Singh
2025-07-30 9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
` (12 more replies)
0 siblings, 13 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
This patch series adds support for THP migration of zone device pages.
To do so, the patches add support for large-order zone device folios,
including routines to set them up. Larger order pages provide a
speedup in both throughput and latency.
In my local testing (using lib/test_hmm) and a throughput test, the
series shows a 350% improvement in data transfer throughput and a
500% improvement in latency.
These patches build on the earlier posts by Ralph Campbell [1].
Two new flags are added to the migrate_vma API to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.
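For illustration, a minimal driver-side sketch of how the new selection
flag is meant to be used; the allocation and copy steps, and names such
as drvdata, are placeholders and not part of this series:

/*
 * Sketch only: select a PMD-aligned range for compound migration.
 * src_pfns/dst_pfns need HPAGE_PMD_NR entries each (allocation elided).
 */
unsigned long *src_pfns, *dst_pfns;
struct migrate_vma args = {
	.vma		= vma,
	.start		= addr,
	.end		= addr + HPAGE_PMD_SIZE,
	.src		= src_pfns,
	.dst		= dst_pfns,
	.pgmap_owner	= drvdata,	/* placeholder owner cookie */
	.flags		= MIGRATE_VMA_SELECT_SYSTEM |
			  MIGRATE_VMA_SELECT_COMPOUND,
};

if (!migrate_vma_setup(&args)) {
	/* allocate device memory and fill args.dst[] ... */
	migrate_vma_pages(&args);
	/* ... copy the data to the device ... */
	migrate_vma_finalize(&args);
}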
The series also adds zone device awareness to the (m)THP code, along
with fault handling of large zone device private pages. The page vma
walk and the rmap code are also made zone device aware. Support has
also been added for folios that might need to be split in the middle
of migration (when the src and dst do not agree on
MIGRATE_PFN_COMPOUND); this occurs when the src side of the migration
can migrate large pages, but the destination has not been able to
allocate large pages. The code supports and uses folio_split() when
migrating THP pages; this path is used when MIGRATE_VMA_SELECT_COMPOUND
is not passed as an argument to migrate_vma_setup().
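A rough sketch of the destination-side fallback that triggers this
split path; alloc_dev_huge_page() and alloc_dev_page() are placeholders
for driver-specific allocators:

/*
 * Sketch only: if the collected src entry is compound but no huge
 * destination page is available, fill base-page dst entries instead;
 * the core code then splits the source folio during migration.
 */
unsigned long j;

if (args.src[i] & MIGRATE_PFN_COMPOUND) {
	struct page *dpage = alloc_dev_huge_page();	/* placeholder */

	if (dpage) {
		args.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			      MIGRATE_PFN_COMPOUND;
	} else {
		/* fall back to order-0 destination pages */
		for (j = 0; j < HPAGE_PMD_NR; j++)
			args.dst[i + j] =
				migrate_pfn(page_to_pfn(alloc_dev_page()));
	}
}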
The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and for the folio split path. A new
throughput test has been added as well.
The nouveau dmem code has been enhanced to use the new THP migration
capability.
mTHP support:
The patches hard-code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various order sizes. With additional
refactoring of the code, support for different order sizes should be
possible.
The future plan is to post enhancements to support mTHP with a rough
design as follows:
1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize
The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver
support for multi-order pages and their migration.
HMM support for large folios:
Francois Dugast posted patches adding HMM handling support for large
folios [4]; the proposed changes can build on top of this series to
provide support for HMM fault handling.
References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
[3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
[4] https://lore.kernel.org/lkml/20250722193445.1588348-1-francois.dugast@intel.com/
These patches are built on top of mm/mm-stable.
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Changelog v2 [3] :
- Several review comments from David Hildenbrand were addressed; Mika,
Zi and Matthew also provided helpful review comments
- In paths where it makes sense a new helper
is_pmd_device_private_entry() is used
- anon_exclusive handling of zone device private pages in
split_huge_pmd_locked() has been fixed
- Patches that introduced helpers have been folded into where they
are used
- Zone device handling in mm/huge_memory.c has benefited from the code
and testing of Matthew Brost, who helped find bugs related to
copy_huge_pmd() and partial unmapping of folios.
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios
Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
callback is used to know when the head order needs to be reset.
Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM
Balbir Singh (11):
mm/zone_device: support large zone device private folios
mm/thp: zone_device awareness in THP handling code
mm/migrate_device: THP migration of zone device pages
mm/memory/fault: add support for zone device THP fault handling
lib/test_hmm: test cases and support for zone device private THP
mm/memremap: add folio_split support
mm/thp: add split during migration support
lib/test_hmm: add test case for split pages
selftests/mm/hmm-tests: new tests for zone device THP migration
gpu/drm/nouveau: add THP migration support
selftests/mm/hmm-tests: new throughput tests including THP
drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
include/linux/huge_mm.h | 19 +-
include/linux/memremap.h | 51 ++-
include/linux/migrate.h | 2 +
include/linux/mm.h | 1 +
include/linux/rmap.h | 2 +
include/linux/swapops.h | 17 +
lib/test_hmm.c | 432 ++++++++++++++----
lib/test_hmm_uapi.h | 3 +
mm/huge_memory.c | 358 ++++++++++++---
mm/memory.c | 6 +-
mm/memremap.c | 48 +-
mm/migrate_device.c | 517 ++++++++++++++++++---
mm/page_vma_mapped.c | 13 +-
mm/pgtable-generic.c | 6 +
mm/rmap.c | 22 +-
tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
19 files changed, 2040 insertions(+), 319 deletions(-)
--
2.50.1
* [v2 01/11] mm/zone_device: support large zone device private folios
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 9:50 ` David Hildenbrand
2025-07-30 9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
` (11 subsequent siblings)
12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add routines to support allocation of large order zone device folios,
along with helper functions to check whether a folio is device private
and to set zone device data.
When large folios are used, the existing page_free() callback in
pgmap is called when the folio is freed; this is true for both
PAGE_SIZE and higher order pages.
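For illustration, a hedged sketch of a driver page_free() callback that
copes with both cases; dummy_device_free_pfn() is a placeholder for the
driver's own bookkeeping:

/*
 * Sketch only: the callback still receives a page, but with large
 * folios it may now represent a higher order allocation, so return
 * every constituent page to the driver's pool.
 */
static void dummy_devmem_page_free(struct page *page)
{
	struct folio *folio = page_folio(page);
	unsigned long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)
		dummy_device_free_pfn(folio_pfn(folio) + i);
}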
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 10 ++++++++-
mm/memremap.c | 48 +++++++++++++++++++++++++++++-----------
2 files changed, 44 insertions(+), 14 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..a0723b35eeaa 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
}
#ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_folio_init(struct folio *folio, unsigned int order);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
void memunmap_pages(struct dev_pagemap *pgmap);
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ zone_device_folio_init(folio, 0);
+}
+
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd..3ca136e7455e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
struct dev_pagemap *pgmap = folio->pgmap;
+ unsigned int nr = folio_nr_pages(folio);
+ int i;
if (WARN_ON_ONCE(!pgmap))
return;
mem_cgroup_uncharge(folio);
- /*
- * Note: we don't expect anonymous compound pages yet. Once supported
- * and we could PTE-map them similar to THP, we'd have to clear
- * PG_anon_exclusive on all tail pages.
- */
if (folio_test_anon(folio)) {
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- __ClearPageAnonExclusive(folio_page(folio, 0));
+ for (i = 0; i < nr; i++)
+ __ClearPageAnonExclusive(folio_page(folio, i));
+ } else {
+ VM_WARN_ON_ONCE(folio_test_large(folio));
}
/*
@@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
+ if (folio_test_large(folio)) {
+ folio_unqueue_deferred_split(folio);
+
+ percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
+ }
+ pgmap->ops->page_free(&folio->page);
+ percpu_ref_put(&folio->pgmap->ref);
+ folio->page.mapping = NULL;
+ break;
case MEMORY_DEVICE_COHERENT:
if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
break;
- pgmap->ops->page_free(folio_page(folio, 0));
- put_dev_pagemap(pgmap);
+ pgmap->ops->page_free(&folio->page);
+ percpu_ref_put(&folio->pgmap->ref);
break;
case MEMORY_DEVICE_GENERIC:
@@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
}
}
-void zone_device_page_init(struct page *page)
+void zone_device_folio_init(struct folio *folio, unsigned int order)
{
+ struct page *page = folio_page(folio, 0);
+
+ VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
+ /*
+ * Only PMD level migration is supported for THP migration
+ */
+ WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
+
/*
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
- set_page_count(page, 1);
+ WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+ folio_set_count(folio, 1);
lock_page(page);
+
+ if (order > 1) {
+ prep_compound_page(page, order);
+ folio_set_large_rmappable(folio);
+ }
}
-EXPORT_SYMBOL_GPL(zone_device_page_init);
+EXPORT_SYMBOL_GPL(zone_device_folio_init);
--
2.50.1
* [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
2025-07-30 9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 11:16 ` Mika Penttilä
2025-07-30 20:05 ` kernel test robot
2025-07-30 9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
` (10 subsequent siblings)
12 siblings, 2 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Mika Penttilä, Matthew Brost, Francois Dugast,
Ralph Campbell
Make the THP handling code in the mm subsystem aware of zone device
pages. Although the code is designed to be generic when it comes to
handling splitting of pages, it currently works for THP page sizes
corresponding to HPAGE_PMD_NR.
Modify page_vma_mapped_walk() to return true when a zone device huge
entry is present, enabling try_to_migrate() and other migration code
paths to appropriately process the entry. page_vma_mapped_walk() will
return true for zone device private large folios only when
PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent callers that do
not handle zone device private pages from having to add such awareness.
The key callback that needs this flag is try_to_migrate_one(). The
other callers (page idle, damon) use the walk to set young/dirty bits,
which is not significant when it comes to pmd level bit harvesting.
pmd_pfn() does not work well with zone device entries; use
pfn_pmd_entry_to_swap() for checking and comparison of zone device
entries instead.
Zone device private entries, when split via munmap, go through a pmd
split but also need to go through a folio split. Deferred split does
not work if a fault is encountered, because fault handling involves
migration entries (via folio_migrate_mapping) and the folio sizes are
expected to be the same there. This introduces the need to split the
folio while handling the pmd split. Because the folio is still mapped,
calling folio_split() would cause lock recursion, so the
__split_unmapped_folio() code is used via a new wrapper helper,
split_device_private_folio(), which skips the checks around
folio->mapping, the swapcache and the need to go through unmap and
remap of the folio.
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 1 +
include/linux/rmap.h | 2 +
include/linux/swapops.h | 17 +++
mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
mm/page_vma_mapped.c | 13 +-
mm/pgtable-generic.c | 6 +
mm/rmap.c | 22 +++-
7 files changed, 278 insertions(+), 51 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..2a6f5ff7bca3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -345,6 +345,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
+int split_device_private_folio(struct folio *folio);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
bool uniform_split_supported(struct folio *folio, unsigned int new_order,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 20803fcb49a7..625f36dcc121 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -905,6 +905,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
#define PVMW_SYNC (1 << 0)
/* Look for migration entries rather than present PTEs */
#define PVMW_MIGRATION (1 << 1)
+/* Look for device private THP entries */
+#define PVMW_THP_DEVICE_PRIVATE (1 << 2)
struct page_vma_mapped_walk {
unsigned long pfn;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 64ea151a7ae3..2641c01bd5d2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
{
return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}
+
#else /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct page *page)
@@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
}
#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+ return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
+}
+
+#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+ return 0;
+}
+
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..e373c6578894 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,6 +72,10 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc);
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+ struct page *split_at, struct xa_state *xas,
+ struct address_space *mapping, bool uniform_split);
+
static bool split_underused_thp = true;
static atomic_t huge_zero_refcount;
@@ -1711,8 +1715,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
- VM_BUG_ON(!is_pmd_migration_entry(pmd));
- if (!is_readable_migration_entry(entry)) {
+ VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
+ !is_pmd_device_private_entry(pmd));
+
+ if (is_migration_entry(entry) &&
+ is_writable_migration_entry(entry)) {
entry = make_readable_migration_entry(
swp_offset(entry));
pmd = swp_entry_to_pmd(entry);
@@ -1722,6 +1729,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd = pmd_swp_mkuffd_wp(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
+
+ if (is_device_private_entry(entry)) {
+ if (is_writable_device_private_entry(entry)) {
+ entry = make_readable_device_private_entry(
+ swp_offset(entry));
+ pmd = swp_entry_to_pmd(entry);
+
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(*src_pmd))
+ pmd = pmd_swp_mkuffd_wp(pmd);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+
+ src_folio = pfn_swap_entry_folio(entry);
+ VM_WARN_ON(!folio_test_large(src_folio));
+
+ folio_get(src_folio);
+ /*
+ * folio_try_dup_anon_rmap_pmd does not fail for
+ * device private entries.
+ */
+ VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
+ &src_folio->page, dst_vma, src_vma));
+ }
+
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -2219,15 +2252,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
folio_remove_rmap_pmd(folio, page, vma);
WARN_ON_ONCE(folio_mapcount(folio) < 0);
VM_BUG_ON_PAGE(!PageHead(page), page);
- } else if (thp_migration_supported()) {
+ } else if (is_pmd_migration_entry(orig_pmd) ||
+ is_pmd_device_private_entry(orig_pmd)) {
swp_entry_t entry;
- VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
folio = pfn_swap_entry_folio(entry);
flush_needed = 0;
- } else
- WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+ if (!thp_migration_supported())
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+ if (is_pmd_device_private_entry(orig_pmd)) {
+ folio_remove_rmap_pmd(folio, &folio->page, vma);
+ WARN_ON_ONCE(folio_mapcount(folio) < 0);
+ }
+ }
if (folio_test_anon(folio)) {
zap_deposited_table(tlb->mm, pmd);
@@ -2247,6 +2287,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
folio_mark_accessed(folio);
}
+ /*
+ * Do a folio put on zone device private pages after
+ * changes to mm_counter, because the folio_put() will
+ * clean folio->mapping and the folio_test_anon() check
+ * will not be usable.
+ */
+ if (folio_is_device_private(folio))
+ folio_put(folio);
+
spin_unlock(ptl);
if (flush_needed)
tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2375,7 +2424,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct folio *folio = pfn_swap_entry_folio(entry);
pmd_t newpmd;
- VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+ VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
+ !folio_is_device_private(folio));
if (is_writable_migration_entry(entry)) {
/*
* A protection check is difficult so
@@ -2388,6 +2438,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
+ } else if (is_writable_device_private_entry(entry)) {
+ entry = make_readable_device_private_entry(
+ swp_offset(entry));
+ newpmd = swp_entry_to_pmd(entry);
} else {
newpmd = *pmd;
}
@@ -2834,6 +2888,44 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pmd_populate(mm, pmd, pgtable);
}
+/**
+ * split_device_private_folio - split a huge device private folio into
+ * smaller pages (of order 0), currently used by migrate_device logic to
+ * split folios for pages that are partially mapped
+ *
+ * @folio: the folio to split
+ *
+ * The caller has to hold the folio_lock and a reference via folio_get
+ */
+int split_device_private_folio(struct folio *folio)
+{
+ struct folio *end_folio = folio_next(folio);
+ struct folio *new_folio;
+ int ret = 0;
+
+ /*
+ * Split the folio now. In the case of device
+ * private pages, this path is executed when
+ * the pmd is split and since freeze is not true
+ * it is likely the folio will be deferred_split.
+ *
+ * With device private pages, deferred splits of
+ * folios should be handled here to prevent partial
+ * unmaps from causing issues later on in migration
+ * and fault handling flows.
+ */
+ folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
+ ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
+ VM_WARN_ON(ret);
+ for (new_folio = folio_next(folio); new_folio != end_folio;
+ new_folio = folio_next(new_folio)) {
+ folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
+ new_folio));
+ }
+ folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
+ return ret;
+}
+
static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long haddr, bool freeze)
{
@@ -2842,16 +2934,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
- bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
- bool anon_exclusive = false, dirty = false;
+ bool young, write, soft_dirty, uffd_wp = false;
+ bool anon_exclusive = false, dirty = false, present = false;
unsigned long addr;
pte_t *pte;
int i;
+ swp_entry_t swp_entry;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
+
+ VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+ && !(is_pmd_device_private_entry(*pmd)));
count_vm_event(THP_SPLIT_PMD);
@@ -2899,18 +2994,60 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}
- pmd_migration = is_pmd_migration_entry(*pmd);
- if (unlikely(pmd_migration)) {
- swp_entry_t entry;
+ present = pmd_present(*pmd);
+ if (unlikely(!present)) {
+ swp_entry = pmd_to_swp_entry(*pmd);
old_pmd = *pmd;
- entry = pmd_to_swp_entry(old_pmd);
- page = pfn_swap_entry_to_page(entry);
- write = is_writable_migration_entry(entry);
- if (PageAnon(page))
- anon_exclusive = is_readable_exclusive_migration_entry(entry);
- young = is_migration_entry_young(entry);
- dirty = is_migration_entry_dirty(entry);
+
+ folio = pfn_swap_entry_folio(swp_entry);
+ VM_WARN_ON(!is_migration_entry(swp_entry) &&
+ !is_device_private_entry(swp_entry));
+ page = pfn_swap_entry_to_page(swp_entry);
+
+ if (is_pmd_migration_entry(old_pmd)) {
+ write = is_writable_migration_entry(swp_entry);
+ if (PageAnon(page))
+ anon_exclusive =
+ is_readable_exclusive_migration_entry(
+ swp_entry);
+ young = is_migration_entry_young(swp_entry);
+ dirty = is_migration_entry_dirty(swp_entry);
+ } else if (is_pmd_device_private_entry(old_pmd)) {
+ write = is_writable_device_private_entry(swp_entry);
+ anon_exclusive = PageAnonExclusive(page);
+ if (freeze && anon_exclusive &&
+ folio_try_share_anon_rmap_pmd(folio, page))
+ freeze = false;
+ if (!freeze) {
+ rmap_t rmap_flags = RMAP_NONE;
+ unsigned long addr = haddr;
+ struct folio *new_folio;
+ struct folio *end_folio = folio_next(folio);
+
+ if (anon_exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+
+ folio_lock(folio);
+ folio_get(folio);
+
+ split_device_private_folio(folio);
+
+ for (new_folio = folio_next(folio);
+ new_folio != end_folio;
+ new_folio = folio_next(new_folio)) {
+ addr += PAGE_SIZE;
+ folio_unlock(new_folio);
+ folio_add_anon_rmap_ptes(new_folio,
+ &new_folio->page, 1,
+ vma, addr, rmap_flags);
+ }
+ folio_unlock(folio);
+ folio_add_anon_rmap_ptes(folio, &folio->page,
+ 1, vma, haddr, rmap_flags);
+ }
+ }
+
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
@@ -2996,30 +3133,49 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* Note that NUMA hinting access restrictions are not transferred to
* avoid any possibility of altering permissions across VMAs.
*/
- if (freeze || pmd_migration) {
+ if (freeze || !present) {
for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
pte_t entry;
- swp_entry_t swp_entry;
-
- if (write)
- swp_entry = make_writable_migration_entry(
- page_to_pfn(page + i));
- else if (anon_exclusive)
- swp_entry = make_readable_exclusive_migration_entry(
- page_to_pfn(page + i));
- else
- swp_entry = make_readable_migration_entry(
- page_to_pfn(page + i));
- if (young)
- swp_entry = make_migration_entry_young(swp_entry);
- if (dirty)
- swp_entry = make_migration_entry_dirty(swp_entry);
- entry = swp_entry_to_pte(swp_entry);
- if (soft_dirty)
- entry = pte_swp_mksoft_dirty(entry);
- if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
-
+ if (freeze || is_migration_entry(swp_entry)) {
+ if (write)
+ swp_entry = make_writable_migration_entry(
+ page_to_pfn(page + i));
+ else if (anon_exclusive)
+ swp_entry = make_readable_exclusive_migration_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_migration_entry(
+ page_to_pfn(page + i));
+ if (young)
+ swp_entry = make_migration_entry_young(swp_entry);
+ if (dirty)
+ swp_entry = make_migration_entry_dirty(swp_entry);
+ entry = swp_entry_to_pte(swp_entry);
+ if (soft_dirty)
+ entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
+ } else {
+ /*
+ * anon_exclusive was already propagated to the relevant
+ * pages corresponding to the pte entries when freeze
+ * is false.
+ */
+ if (write)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page + i));
+ /*
+ * Young and dirty bits are not propagated via swp_entry
+ */
+ entry = swp_entry_to_pte(swp_entry);
+ if (soft_dirty)
+ entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
+ }
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3046,7 +3202,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
pte_unmap(pte);
- if (!pmd_migration)
+ if (present)
folio_remove_rmap_pmd(folio, page, vma);
if (freeze)
put_page(page);
@@ -3058,8 +3214,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze)
{
+
VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
- if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
+ if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
+ (is_pmd_device_private_entry(*pmd)))
__split_huge_pmd_locked(vma, pmd, address, freeze);
}
@@ -3238,6 +3396,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
lockdep_assert_held(&lruvec->lru_lock);
+ if (folio_is_device_private(folio))
+ return;
+
if (list) {
/* page reclaim is reclaiming a huge page */
VM_WARN_ON(folio_test_lru(folio));
@@ -3252,6 +3413,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
list_add_tail(&new_folio->lru, &folio->lru);
folio_set_lru(new_folio);
}
+
}
/* Racy check whether the huge page can be split */
@@ -3727,7 +3889,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
/* Prevent deferred_split_scan() touching ->_refcount */
spin_lock(&ds_queue->split_queue_lock);
- if (folio_ref_freeze(folio, 1 + extra_pins)) {
+ if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
struct address_space *swap_cache = NULL;
struct lruvec *lruvec;
int expected_refs;
@@ -4603,7 +4765,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
return 0;
flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
- pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+ if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))
+ pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+ else
+ pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4653,6 +4818,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
folio_get(folio);
pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
+
+ if (folio_is_device_private(folio)) {
+ if (pmd_write(pmde))
+ entry = make_writable_device_private_entry(
+ page_to_pfn(new));
+ else
+ entry = make_readable_device_private_entry(
+ page_to_pfn(new));
+ pmde = swp_entry_to_pmd(entry);
+ }
+
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (is_writable_migration_entry(entry))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..246e6c211f34 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
if (!pmd_present(pmde)) {
- swp_entry_t entry;
+ swp_entry_t entry = pmd_to_swp_entry(pmde);
if (!thp_migration_supported() ||
!(pvmw->flags & PVMW_MIGRATION))
return not_found(pvmw);
- entry = pmd_to_swp_entry(pmde);
if (!is_migration_entry(entry) ||
!check_pmd(swp_offset_pfn(entry), pvmw))
return not_found(pvmw);
@@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(pmde);
+
+ if (is_device_private_entry(entry) &&
+ (pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
+ pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+ return true;
+ }
+
if ((pvmw->flags & PVMW_SYNC) &&
thp_vma_suitable_order(vma, pvmw->address,
PMD_ORDER) &&
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..604e8206a2ec 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
+ if (is_swap_pmd(pmdval)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmdval);
+
+ if (is_device_private_entry(entry))
+ goto nomap;
+ }
if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index f93ce27132ab..5c5c1c777ce3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2281,7 +2281,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+ PVMW_THP_DEVICE_PRIVATE);
bool anon_exclusive, writable, ret = true;
pte_t pteval;
struct page *subpage;
@@ -2326,6 +2327,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
while (page_vma_mapped_walk(&pvmw)) {
/* PMD-mapped THP migration entry */
if (!pvmw.pte) {
+ unsigned long pfn;
+
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_locked(vma, pvmw.address,
pvmw.pmd, true);
@@ -2334,8 +2337,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- subpage = folio_page(folio,
- pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+ /*
+ * Zone device private folios do not work well with
+ * pmd_pfn() on some architectures due to pte
+ * inversion.
+ */
+ if (is_pmd_device_private_entry(*pvmw.pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
+
+ pfn = swp_offset_pfn(entry);
+ } else {
+ pfn = pmd_pfn(*pvmw.pmd);
+ }
+
+ subpage = folio_page(folio, pfn - folio_pfn(folio));
+
VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
!folio_test_pmd_mappable(folio), folio);
--
2.50.1
* [v2 03/11] mm/migrate_device: THP migration of zone device pages
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
2025-07-30 9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-30 9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-31 16:19 ` kernel test robot
2025-07-30 9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
` (9 subsequent siblings)
12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Mika Penttilä, Matthew Brost, Francois Dugast,
Ralph Campbell
MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
migrate_vma_setup(), and MIGRATE_PFN_COMPOUND marks device pages that
are being migrated as compound pages during device pfn migration.
The migrate_device code paths go through the collect, setup
and finalize phases of migration.
The entries in the src and dst arrays passed to these functions still
remain at a PAGE_SIZE granularity. When a compound page is passed,
the first entry has the PFN along with MIGRATE_PFN_COMPOUND
and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), and the
remaining (HPAGE_PMD_NR - 1) entries are filled with 0's. This
representation allows for the compound page to be split into smaller
page sizes.
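For illustration, the src entries for one PMD-sized folio look roughly
like this:

/*
 * Only the first of the HPAGE_PMD_NR slots carries the pfn and flags;
 * the rest stay zero so the range can later be re-expressed as
 * individual base pages if the folio has to be split.
 */
src[0] = migrate_pfn(pfn) | MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE |
	 MIGRATE_PFN_COMPOUND;
src[1] = 0;
/* ... */
src[HPAGE_PMD_NR - 1] = 0;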
migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
aware. Two new helper functions, migrate_vma_collect_huge_pmd()
and migrate_vma_insert_huge_pmd_page(), have been added.
migrate_vma_collect_huge_pmd() can collect THP pages, but if for
some reason this fails, there is fallback support to split the folio
and migrate it.
migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page().
Support for splitting pages as needed for migration will follow in
later patches in this series.
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/migrate.h | 2 +
mm/migrate_device.c | 456 ++++++++++++++++++++++++++++++++++------
2 files changed, 395 insertions(+), 63 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index acadd41e0b5c..d9cef0819f91 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
#define MIGRATE_PFN_VALID (1UL << 0)
#define MIGRATE_PFN_MIGRATE (1UL << 1)
#define MIGRATE_PFN_WRITE (1UL << 3)
+#define MIGRATE_PFN_COMPOUND (1UL << 4)
#define MIGRATE_PFN_SHIFT 6
static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -147,6 +148,7 @@ enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+ MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
};
struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index e05e14d6eacd..4c3334cc3228 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -14,6 +14,7 @@
#include <linux/pagewalk.h>
#include <linux/rmap.h>
#include <linux/swapops.h>
+#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
if (!vma_is_anonymous(walk->vma))
return migrate_vma_collect_skip(start, end, walk);
+ if (thp_migration_supported() &&
+ (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+ (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+ IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+ migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages] = 0;
+ migrate->npages++;
+ migrate->cpages++;
+
+ /*
+ * Collect the remaining entries as holes, in case we
+ * need to split later
+ */
+ return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+ }
+
for (addr = start; addr < end; addr += PAGE_SIZE) {
migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
migrate->dst[migrate->npages] = 0;
@@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
return 0;
}
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
- unsigned long start,
- unsigned long end,
- struct mm_walk *walk)
+/**
+ * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
+ * folio for device private pages.
+ * @pmdp: pointer to pmd entry
+ * @start: start address of the range for migration
+ * @end: end address of the range for migration
+ * @walk: mm_walk callback structure
+ *
+ * Collect the huge pmd entry at @pmdp for migration and set the
+ * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
+ * migration will occur at HPAGE_PMD granularity
+ */
+static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
+ unsigned long end, struct mm_walk *walk,
+ struct folio *fault_folio)
{
+ struct mm_struct *mm = walk->mm;
+ struct folio *folio;
struct migrate_vma *migrate = walk->private;
- struct folio *fault_folio = migrate->fault_page ?
- page_folio(migrate->fault_page) : NULL;
- struct vm_area_struct *vma = walk->vma;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
- pte_t *ptep;
+ swp_entry_t entry;
+ int ret;
+ unsigned long write = 0;
-again:
- if (pmd_none(*pmdp))
+ ptl = pmd_lock(mm, pmdp);
+ if (pmd_none(*pmdp)) {
+ spin_unlock(ptl);
return migrate_vma_collect_hole(start, end, -1, walk);
+ }
if (pmd_trans_huge(*pmdp)) {
- struct folio *folio;
-
- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_trans_huge(*pmdp))) {
+ if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
spin_unlock(ptl);
- goto again;
+ return migrate_vma_collect_skip(start, end, walk);
}
folio = pmd_folio(*pmdp);
if (is_huge_zero_folio(folio)) {
spin_unlock(ptl);
- split_huge_pmd(vma, pmdp, addr);
- } else {
- int ret;
+ return migrate_vma_collect_hole(start, end, -1, walk);
+ }
+ if (pmd_write(*pmdp))
+ write = MIGRATE_PFN_WRITE;
+ } else if (!pmd_present(*pmdp)) {
+ entry = pmd_to_swp_entry(*pmdp);
+ folio = pfn_swap_entry_folio(entry);
+
+ if (!is_device_private_entry(entry) ||
+ !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+ (folio->pgmap->owner != migrate->pgmap_owner)) {
+ spin_unlock(ptl);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
- folio_get(folio);
+ if (is_migration_entry(entry)) {
+ migration_entry_wait_on_locked(entry, ptl);
spin_unlock(ptl);
- /* FIXME: we don't expect THP for fault_folio */
- if (WARN_ON_ONCE(fault_folio == folio))
- return migrate_vma_collect_skip(start, end,
- walk);
- if (unlikely(!folio_trylock(folio)))
- return migrate_vma_collect_skip(start, end,
- walk);
- ret = split_folio(folio);
- if (fault_folio != folio)
- folio_unlock(folio);
- folio_put(folio);
- if (ret)
- return migrate_vma_collect_skip(start, end,
- walk);
+ return -EAGAIN;
}
+
+ if (is_writable_device_private_entry(entry))
+ write = MIGRATE_PFN_WRITE;
+ } else {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+
+ folio_get(folio);
+ if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
+ spin_unlock(ptl);
+ folio_put(folio);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
+
+ if (thp_migration_supported() &&
+ (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+ (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+ IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+
+ struct page_vma_mapped_walk pvmw = {
+ .ptl = ptl,
+ .address = start,
+ .pmd = pmdp,
+ .vma = walk->vma,
+ };
+
+ unsigned long pfn = page_to_pfn(folio_page(folio, 0));
+
+ migrate->src[migrate->npages] = migrate_pfn(pfn) | write
+ | MIGRATE_PFN_MIGRATE
+ | MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages++] = 0;
+ migrate->cpages++;
+ ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
+ if (ret) {
+ migrate->npages--;
+ migrate->cpages--;
+ migrate->src[migrate->npages] = 0;
+ migrate->dst[migrate->npages] = 0;
+ goto fallback;
+ }
+ migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+ spin_unlock(ptl);
+ return 0;
+ }
+
+fallback:
+ spin_unlock(ptl);
+ if (!folio_test_large(folio))
+ goto done;
+ ret = split_folio(folio);
+ if (fault_folio != folio)
+ folio_unlock(folio);
+ folio_put(folio);
+ if (ret)
+ return migrate_vma_collect_skip(start, end, walk);
+ if (pmd_none(pmdp_get_lockless(pmdp)))
+ return migrate_vma_collect_hole(start, end, -1, walk);
+
+done:
+ return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct migrate_vma *migrate = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = start, unmapped = 0;
+ spinlock_t *ptl;
+ struct folio *fault_folio = migrate->fault_page ?
+ page_folio(migrate->fault_page) : NULL;
+ pte_t *ptep;
+
+again:
+ if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+ int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
+
+ if (ret == -EAGAIN)
+ goto again;
+ if (ret == 0)
+ return 0;
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -175,8 +287,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
- /* FIXME support THP */
- if (!page || !page->mapping || PageTransCompound(page)) {
+ if (!page || !page->mapping) {
mpfn = 0;
goto next;
}
@@ -347,14 +458,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
*/
int extra = 1 + (page == fault_page);
- /*
- * FIXME support THP (transparent huge page), it is bit more complex to
- * check them than regular pages, because they can be mapped with a pmd
- * or with a pte (split pte mapping).
- */
- if (folio_test_large(folio))
- return false;
-
/* Page from ZONE_DEVICE have one extra reference */
if (folio_is_zone_device(folio))
extra++;
@@ -385,17 +488,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
lru_add_drain();
- for (i = 0; i < npages; i++) {
+ for (i = 0; i < npages; ) {
struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
+ unsigned int nr = 1;
if (!page) {
if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
unmapped++;
- continue;
+ goto next;
}
folio = page_folio(page);
+ nr = folio_nr_pages(folio);
+
+ if (nr > 1)
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+
+
/* ZONE_DEVICE folios are not on LRU */
if (!folio_is_zone_device(folio)) {
if (!folio_test_lru(folio) && allow_drain) {
@@ -407,7 +517,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
if (!folio_isolate_lru(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
- continue;
+ goto next;
}
/* Drop the reference we took in collect */
@@ -426,10 +536,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
- continue;
+ goto next;
}
unmapped++;
+next:
+ i += nr;
}
for (i = 0; i < npages && restore; i++) {
@@ -575,6 +687,146 @@ int migrate_vma_setup(struct migrate_vma *args)
}
EXPORT_SYMBOL(migrate_vma_setup);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/**
+ * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
+ * at @addr. folio is already allocated as a part of the migration process with
+ * large page.
+ *
+ * @folio needs to be initialized and setup after it's allocated. The code bits
+ * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
+ * not support THP zero pages.
+ *
+ * @migrate: migrate_vma arguments
+ * @addr: address where the folio will be inserted
+ * @folio: folio to be inserted at @addr
+ * @src: src pfn which is being migrated
+ * @pmdp: pointer to the pmd
+ */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ pmd_t *pmdp)
+{
+ struct vm_area_struct *vma = migrate->vma;
+ gfp_t gfp = vma_thp_gfp_mask(vma);
+ struct folio *folio = page_folio(page);
+ int ret;
+ spinlock_t *ptl;
+ pgtable_t pgtable;
+ pmd_t entry;
+ bool flush = false;
+ unsigned long i;
+
+ VM_WARN_ON_FOLIO(!folio, folio);
+ VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
+
+ if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
+ return -EINVAL;
+
+ ret = anon_vma_prepare(vma);
+ if (ret)
+ return ret;
+
+ folio_set_order(folio, HPAGE_PMD_ORDER);
+ folio_set_large_rmappable(folio);
+
+ if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+ ret = -ENOMEM;
+ goto abort;
+ }
+
+ __folio_mark_uptodate(folio);
+
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (unlikely(!pgtable))
+ goto abort;
+
+ if (folio_is_device_private(folio)) {
+ swp_entry_t swp_entry;
+
+ if (vma->vm_flags & VM_WRITE)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page));
+ entry = swp_entry_to_pmd(swp_entry);
+ } else {
+ if (folio_is_zone_device(folio) &&
+ !folio_is_device_coherent(folio)) {
+ goto abort;
+ }
+ entry = folio_mk_pmd(folio, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
+ }
+
+ ptl = pmd_lock(vma->vm_mm, pmdp);
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ goto abort;
+
+ /*
+ * Check for userfaultfd but do not deliver the fault. Instead,
+ * just back off.
+ */
+ if (userfaultfd_missing(vma))
+ goto unlock_abort;
+
+ if (!pmd_none(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto unlock_abort;
+ flush = true;
+ } else if (!pmd_none(*pmdp))
+ goto unlock_abort;
+
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+ if (!folio_is_zone_device(folio))
+ folio_add_lru_vma(folio, vma);
+ folio_get(folio);
+
+ if (flush) {
+ pte_free(vma->vm_mm, pgtable);
+ flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
+ pmdp_invalidate(vma, addr, pmdp);
+ } else {
+ pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+ mm_inc_nr_ptes(vma->vm_mm);
+ }
+ set_pmd_at(vma->vm_mm, addr, pmdp, entry);
+ update_mmu_cache_pmd(vma, addr, pmdp);
+
+ spin_unlock(ptl);
+
+ count_vm_event(THP_FAULT_ALLOC);
+ count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+ count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+ return 0;
+
+unlock_abort:
+ spin_unlock(ptl);
+abort:
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ src[i] &= ~MIGRATE_PFN_MIGRATE;
+ return 0;
+}
+#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ pmd_t *pmdp)
+{
+ return 0;
+}
+#endif
+
/*
* This code closely matches the code in:
* __handle_mm_fault()
@@ -585,9 +837,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
- struct page *page,
+ unsigned long *dst,
unsigned long *src)
{
+ struct page *page = migrate_pfn_to_page(*dst);
struct folio *folio = page_folio(page);
struct vm_area_struct *vma = migrate->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -615,8 +868,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp))
- goto abort;
+
+ if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
+ int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
+ src, pmdp);
+ if (ret)
+ goto abort;
+ return;
+ }
+
+ if (!pmd_none(*pmdp)) {
+ if (pmd_trans_huge(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto abort;
+ folio_get(pmd_folio(*pmdp));
+ split_huge_pmd(vma, pmdp, addr);
+ } else if (pmd_leaf(*pmdp))
+ goto abort;
+ }
+
if (pte_alloc(mm, pmdp))
goto abort;
if (unlikely(anon_vma_prepare(vma)))
@@ -707,23 +977,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long i;
bool notified = false;
- for (i = 0; i < npages; i++) {
+ for (i = 0; i < npages; ) {
struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct address_space *mapping;
struct folio *newfolio, *folio;
int r, extra_cnt = 0;
+ unsigned long nr = 1;
if (!newpage) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
if (!page) {
unsigned long addr;
if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
/*
* The only time there is no vma is when called from
@@ -741,15 +1012,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
migrate->pgmap_owner);
mmu_notifier_invalidate_range_start(&range);
}
- migrate_vma_insert_page(migrate, addr, newpage,
+
+ if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
+ nr = HPAGE_PMD_NR;
+ src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ goto next;
+ }
+
+ migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
&src_pfns[i]);
- continue;
+ goto next;
}
newfolio = page_folio(newpage);
folio = page_folio(page);
mapping = folio_mapping(folio);
+ /*
+ * If THP migration is enabled, check if both src and dst
+ * can migrate large pages
+ */
+ if (thp_migration_supported()) {
+ if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+ (src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ !(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+
+ if (!migrate) {
+ src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND);
+ goto next;
+ }
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+ (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ }
+ }
+
+
if (folio_is_device_private(newfolio) ||
folio_is_device_coherent(newfolio)) {
if (mapping) {
@@ -762,7 +1065,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
if (!folio_test_anon(folio) ||
!folio_free_swap(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
}
} else if (folio_is_zone_device(newfolio)) {
@@ -770,7 +1073,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
* Other types of ZONE_DEVICE page are not supported.
*/
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
BUG_ON(folio_test_writeback(folio));
@@ -782,6 +1085,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
else
folio_migrate_flags(newfolio, folio);
+next:
+ i += nr;
}
if (notified)
@@ -943,10 +1248,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
int migrate_device_range(unsigned long *src_pfns, unsigned long start,
unsigned long npages)
{
- unsigned long i, pfn;
+ unsigned long i, j, pfn;
+
+ for (pfn = start, i = 0; i < npages; pfn++, i++) {
+ struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
+ unsigned int nr = 1;
- for (pfn = start, i = 0; i < npages; pfn++, i++)
src_pfns[i] = migrate_device_pfn_lock(pfn);
+ nr = folio_nr_pages(folio);
+ if (nr > 1) {
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ for (j = 1; j < nr; j++)
+ src_pfns[i+j] = 0;
+ i += j - 1;
+ pfn += j - 1;
+ }
+ }
migrate_device_unmap(src_pfns, npages, NULL);
@@ -964,10 +1282,22 @@ EXPORT_SYMBOL(migrate_device_range);
*/
int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
{
- unsigned long i;
+ unsigned long i, j;
+
+ for (i = 0; i < npages; i++) {
+ struct page *page = pfn_to_page(src_pfns[i]);
+ struct folio *folio = page_folio(page);
+ unsigned int nr = 1;
- for (i = 0; i < npages; i++)
src_pfns[i] = migrate_device_pfn_lock(src_pfns[i]);
+ nr = folio_nr_pages(folio);
+ if (nr > 1) {
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ for (j = 1; j < nr; j++)
+ src_pfns[i+j] = 0;
+ i += j - 1;
+ }
+ }
migrate_device_unmap(src_pfns, npages, NULL);
--
2.50.1
* [v2 04/11] mm/memory/fault: add support for zone device THP fault handling
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (2 preceding siblings ...)
2025-07-30 9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
` (8 subsequent siblings)
12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
When the CPU touches a zone device THP entry, the data needs to
be migrated back to the CPU. Call migrate_to_ram() on these pages
via the do_huge_pmd_device_private() fault handling helper.
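As an illustration (not part of this patch), a driver's migrate_to_ram()
callback can cover the whole PMD-sized folio on such a fault, roughly as
below. This is only a sketch, loosely modelled on the lib/test_hmm changes
later in this series; my_drvdata and the error handling details are
illustrative, not an existing API.
static vm_fault_t my_devmem_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned int order = folio_order(page_folio(vmf->page));
	unsigned long npages = 1UL << order;
	struct migrate_vma args = { 0 };
	vm_fault_t ret = 0;
	/* Cover the whole large folio, not just the faulting subpage */
	args.vma = vmf->vma;
	args.start = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
	args.end = args.start + (PAGE_SIZE << order);
	args.src = kcalloc(npages, sizeof(*args.src), GFP_KERNEL);
	args.dst = kcalloc(npages, sizeof(*args.dst), GFP_KERNEL);
	args.pgmap_owner = my_drvdata;	/* illustrative owner cookie */
	args.fault_page = vmf->page;
	args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
	if (order)
		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
	if (!args.src || !args.dst) {
		ret = VM_FAULT_OOM;
		goto out;
	}
	if (migrate_vma_setup(&args)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}
	/* ... allocate system folios, copy the device data, fill args.dst ... */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
out:
	kfree(args.src);
	kfree(args.dst);
	return ret;
}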
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 7 +++++++
mm/huge_memory.c | 36 ++++++++++++++++++++++++++++++++++++
mm/memory.c | 6 ++++--
3 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2a6f5ff7bca3..56fdcaf7604b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -475,6 +475,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
@@ -633,6 +635,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+ return 0;
+}
+
static inline bool is_huge_zero_folio(const struct folio *folio)
{
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e373c6578894..713dd433d352 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1271,6 +1271,42 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
}
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ vm_fault_t ret = 0;
+ spinlock_t *ptl;
+ swp_entry_t swp_entry;
+ struct page *page;
+
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ vma_end_read(vma);
+ return VM_FAULT_RETRY;
+ }
+
+ ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
+ page = pfn_swap_entry_to_page(swp_entry);
+ vmf->page = page;
+ vmf->pte = NULL;
+ if (trylock_page(vmf->page)) {
+ get_page(page);
+ spin_unlock(ptl);
+ ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
+ unlock_page(vmf->page);
+ put_page(page);
+ } else {
+ spin_unlock(ptl);
+ }
+
+ return ret;
+}
+
/*
* always: directly stall for all thp allocations
* defer: wake kswapd and fail if not immediately available
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..6c87f043eea1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6152,8 +6152,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
- VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(vmf.orig_pmd));
+ if (is_device_private_entry(
+ pmd_to_swp_entry(vmf.orig_pmd)))
+ return do_huge_pmd_device_private(&vmf);
+
if (is_pmd_migration_entry(vmf.orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
--
2.50.1
* [v2 05/11] lib/test_hmm: test cases and support for zone device private THP
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (3 preceding siblings ...)
2025-07-30 9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-31 11:17 ` kernel test robot
2025-07-30 9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
` (7 subsequent siblings)
12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Enhance the hmm test driver (lib/test_hmm) with support for
THP pages.
A new pool of free folios (free_folios) has been added to the dmirror
device; allocations are made from it when a THP zone device private
page is requested.
Add compound page awareness to the allocation function for both
normal migration and fault based migration. These routines also
copy folio_nr_pages() pages when moving data between system memory
and device memory.
args.src and args.dst, used to hold migration entries, are now
dynamically allocated (as they need to hold HPAGE_PMD_NR entries
or more).
Split and migrate support will be added in future patches in this
series.
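The free folio pool is a singly linked list threaded through the head
page's zone_device_data, mirroring the existing free_pages list. As a
rough sketch of the pool operations (the helper names below are
illustrative; the real code lives inline in dmirror_allocate_chunk()
and dmirror_devmem_alloc_page(), under mdevice->lock):
/* push a free THP-sized device folio onto the pool (mdevice->lock held) */
static void dmirror_folio_pool_push(struct dmirror_device *mdevice,
				    struct folio *folio)
{
	folio_page(folio, 0)->zone_device_data = mdevice->free_folios;
	mdevice->free_folios = folio;
}
/* pop one free folio, or NULL if the pool is empty (mdevice->lock held) */
static struct folio *dmirror_folio_pool_pop(struct dmirror_device *mdevice)
{
	struct folio *folio = mdevice->free_folios;
	if (folio)
		mdevice->free_folios =
			folio_page(folio, 0)->zone_device_data;
	return folio;
}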
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 12 ++
lib/test_hmm.c | 366 +++++++++++++++++++++++++++++++--------
2 files changed, 303 insertions(+), 75 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index a0723b35eeaa..0c5141a7d58c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
return is_device_private_page(&folio->page);
}
+static inline void *folio_zone_device_data(const struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+ return folio->page.zone_device_data;
+}
+
+static inline void folio_set_zone_device_data(struct folio *folio, void *data)
+{
+ VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+ folio->page.zone_device_data = data;
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 761725bc713c..4850f9026694 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -119,6 +119,7 @@ struct dmirror_device {
unsigned long calloc;
unsigned long cfree;
struct page *free_pages;
+ struct folio *free_folios;
spinlock_t lock; /* protects the above */
};
@@ -492,7 +493,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
}
static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
- struct page **ppage)
+ struct page **ppage, bool is_large)
{
struct dmirror_chunk *devmem;
struct resource *res = NULL;
@@ -572,20 +573,45 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
pfn_first, pfn_last);
spin_lock(&mdevice->lock);
- for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+ for (pfn = pfn_first; pfn < pfn_last; ) {
struct page *page = pfn_to_page(pfn);
+ if (is_large && IS_ALIGNED(pfn, HPAGE_PMD_NR)
+ && (pfn + HPAGE_PMD_NR <= pfn_last)) {
+ page->zone_device_data = mdevice->free_folios;
+ mdevice->free_folios = page_folio(page);
+ pfn += HPAGE_PMD_NR;
+ continue;
+ }
+
page->zone_device_data = mdevice->free_pages;
mdevice->free_pages = page;
+ pfn++;
}
+
+ ret = 0;
if (ppage) {
- *ppage = mdevice->free_pages;
- mdevice->free_pages = (*ppage)->zone_device_data;
- mdevice->calloc++;
+ if (is_large) {
+ if (!mdevice->free_folios) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+ *ppage = folio_page(mdevice->free_folios, 0);
+ mdevice->free_folios = (*ppage)->zone_device_data;
+ mdevice->calloc += HPAGE_PMD_NR;
+ } else if (mdevice->free_pages) {
+ *ppage = mdevice->free_pages;
+ mdevice->free_pages = (*ppage)->zone_device_data;
+ mdevice->calloc++;
+ } else {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
}
+err_unlock:
spin_unlock(&mdevice->lock);
- return 0;
+ return ret;
err_release:
mutex_unlock(&mdevice->devmem_lock);
@@ -598,10 +624,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
return ret;
}
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
+ bool is_large)
{
struct page *dpage = NULL;
struct page *rpage = NULL;
+ unsigned int order = is_large ? HPAGE_PMD_ORDER : 0;
+ struct dmirror_device *mdevice = dmirror->mdevice;
/*
* For ZONE_DEVICE private type, this is a fake device so we allocate
@@ -610,49 +639,55 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
* data and ignore rpage.
*/
if (dmirror_is_private_zone(mdevice)) {
- rpage = alloc_page(GFP_HIGHUSER);
+ rpage = folio_page(folio_alloc(GFP_HIGHUSER, order), 0);
if (!rpage)
return NULL;
}
spin_lock(&mdevice->lock);
- if (mdevice->free_pages) {
+ if (is_large && mdevice->free_folios) {
+ dpage = folio_page(mdevice->free_folios, 0);
+ mdevice->free_folios = dpage->zone_device_data;
+ mdevice->calloc += 1 << order;
+ spin_unlock(&mdevice->lock);
+ } else if (!is_large && mdevice->free_pages) {
dpage = mdevice->free_pages;
mdevice->free_pages = dpage->zone_device_data;
mdevice->calloc++;
spin_unlock(&mdevice->lock);
} else {
spin_unlock(&mdevice->lock);
- if (dmirror_allocate_chunk(mdevice, &dpage))
+ if (dmirror_allocate_chunk(mdevice, &dpage, is_large))
goto error;
}
- zone_device_page_init(dpage);
+ zone_device_folio_init(page_folio(dpage), order);
dpage->zone_device_data = rpage;
return dpage;
error:
if (rpage)
- __free_page(rpage);
+ __free_pages(rpage, order);
return NULL;
}
static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
struct dmirror *dmirror)
{
- struct dmirror_device *mdevice = dmirror->mdevice;
const unsigned long *src = args->src;
unsigned long *dst = args->dst;
unsigned long addr;
- for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
- src++, dst++) {
+ for (addr = args->start; addr < args->end; ) {
struct page *spage;
struct page *dpage;
struct page *rpage;
+ bool is_large = *src & MIGRATE_PFN_COMPOUND;
+ int write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+ unsigned long nr = 1;
if (!(*src & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
/*
* Note that spage might be NULL which is OK since it is an
@@ -662,17 +697,45 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
if (WARN(spage && is_zone_device_page(spage),
"page already in device spage pfn: 0x%lx\n",
page_to_pfn(spage)))
+ goto next;
+
+ dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+ if (!dpage) {
+ struct folio *folio;
+ unsigned long i;
+ unsigned long spfn = *src >> MIGRATE_PFN_SHIFT;
+ struct page *src_page;
+
+ if (!is_large)
+ goto next;
+
+ if (!spage && is_large) {
+ nr = HPAGE_PMD_NR;
+ } else {
+ folio = page_folio(spage);
+ nr = folio_nr_pages(folio);
+ }
+
+ for (i = 0; i < nr && addr < args->end; i++) {
+ dpage = dmirror_devmem_alloc_page(dmirror, false);
+ rpage = BACKING_PAGE(dpage);
+ rpage->zone_device_data = dmirror;
+
+ *dst = migrate_pfn(page_to_pfn(dpage)) | write;
+ src_page = pfn_to_page(spfn + i);
+
+ if (spage)
+ copy_highpage(rpage, src_page);
+ else
+ clear_highpage(rpage);
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
+ }
continue;
-
- dpage = dmirror_devmem_alloc_page(mdevice);
- if (!dpage)
- continue;
+ }
rpage = BACKING_PAGE(dpage);
- if (spage)
- copy_highpage(rpage, spage);
- else
- clear_highpage(rpage);
/*
* Normally, a device would use the page->zone_device_data to
@@ -684,10 +747,42 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
page_to_pfn(spage), page_to_pfn(dpage));
- *dst = migrate_pfn(page_to_pfn(dpage));
- if ((*src & MIGRATE_PFN_WRITE) ||
- (!spage && args->vma->vm_flags & VM_WRITE))
- *dst |= MIGRATE_PFN_WRITE;
+
+ *dst = migrate_pfn(page_to_pfn(dpage)) | write;
+
+ if (is_large) {
+ int i;
+ struct folio *folio = page_folio(dpage);
+ *dst |= MIGRATE_PFN_COMPOUND;
+
+ if (folio_test_large(folio)) {
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ struct page *dst_page =
+ pfn_to_page(page_to_pfn(rpage) + i);
+ struct page *src_page =
+ pfn_to_page(page_to_pfn(spage) + i);
+
+ if (spage)
+ copy_highpage(dst_page, src_page);
+ else
+ clear_highpage(dst_page);
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
+ }
+ continue;
+ }
+ }
+
+ if (spage)
+ copy_highpage(rpage, spage);
+ else
+ clear_highpage(rpage);
+
+next:
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
}
}
@@ -734,14 +829,17 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
const unsigned long *src = args->src;
const unsigned long *dst = args->dst;
unsigned long pfn;
+ const unsigned long start_pfn = start >> PAGE_SHIFT;
+ const unsigned long end_pfn = end >> PAGE_SHIFT;
/* Map the migrated pages into the device's page tables. */
mutex_lock(&dmirror->mutex);
- for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
- src++, dst++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn++, src++, dst++) {
struct page *dpage;
void *entry;
+ int nr, i;
+ struct page *rpage;
if (!(*src & MIGRATE_PFN_MIGRATE))
continue;
@@ -750,13 +848,25 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
if (!dpage)
continue;
- entry = BACKING_PAGE(dpage);
- if (*dst & MIGRATE_PFN_WRITE)
- entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
- entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
- if (xa_is_err(entry)) {
- mutex_unlock(&dmirror->mutex);
- return xa_err(entry);
+ if (*dst & MIGRATE_PFN_COMPOUND)
+ nr = folio_nr_pages(page_folio(dpage));
+ else
+ nr = 1;
+
+ WARN_ON_ONCE(end_pfn < start_pfn + nr);
+
+ rpage = BACKING_PAGE(dpage);
+ VM_WARN_ON(folio_nr_pages(page_folio(rpage)) != nr);
+
+ for (i = 0; i < nr; i++) {
+ entry = folio_page(page_folio(rpage), i);
+ if (*dst & MIGRATE_PFN_WRITE)
+ entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+ entry = xa_store(&dmirror->pt, pfn + i, entry, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ mutex_unlock(&dmirror->mutex);
+ return xa_err(entry);
+ }
}
}
@@ -829,31 +939,66 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
unsigned long start = args->start;
unsigned long end = args->end;
unsigned long addr;
+ unsigned int order = 0;
+ int i;
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
+ for (addr = start; addr < end; ) {
struct page *dpage, *spage;
spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE)) {
+ addr += PAGE_SIZE;
+ goto next;
+ }
if (WARN_ON(!is_device_private_page(spage) &&
- !is_device_coherent_page(spage)))
- continue;
+ !is_device_coherent_page(spage))) {
+ addr += PAGE_SIZE;
+ goto next;
+ }
+
spage = BACKING_PAGE(spage);
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
- pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
- page_to_pfn(spage), page_to_pfn(dpage));
+ order = folio_order(page_folio(spage));
+ if (order)
+ dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
+ order, args->vma, addr), 0);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+
+ /* Try with smaller pages if large allocation fails */
+ if (!dpage && order) {
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ return VM_FAULT_OOM;
+ order = 0;
+ }
+
+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
lock_page(dpage);
xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
copy_highpage(dpage, spage);
*dst = migrate_pfn(page_to_pfn(dpage));
if (*src & MIGRATE_PFN_WRITE)
*dst |= MIGRATE_PFN_WRITE;
+ if (order)
+ *dst |= MIGRATE_PFN_COMPOUND;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *src_page;
+ struct page *dst_page;
+
+ src_page = pfn_to_page(page_to_pfn(spage) + i);
+ dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dst_page, src_page);
+ }
+next:
+ addr += PAGE_SIZE << order;
+ src += 1 << order;
+ dst += 1 << order;
}
return 0;
}
@@ -879,11 +1024,14 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[32] = { 0 };
- unsigned long dst_pfns[32] = { 0 };
struct migrate_vma args = { 0 };
unsigned long next;
int ret;
+ unsigned long *src_pfns;
+ unsigned long *dst_pfns;
+
+ src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
+ dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
start = cmd->addr;
end = start + size;
@@ -902,7 +1050,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
ret = -EINVAL;
goto out;
}
- next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
if (next > vma->vm_end)
next = vma->vm_end;
@@ -912,7 +1060,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = dmirror_select_device(dmirror);
+ args.flags = dmirror_select_device(dmirror) | MIGRATE_VMA_SELECT_COMPOUND;
ret = migrate_vma_setup(&args);
if (ret)
@@ -928,6 +1076,8 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
out:
mmap_read_unlock(mm);
mmput(mm);
+ kvfree(src_pfns);
+ kvfree(dst_pfns);
return ret;
}
@@ -939,12 +1089,12 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[32] = { 0 };
- unsigned long dst_pfns[32] = { 0 };
struct dmirror_bounce bounce;
struct migrate_vma args = { 0 };
unsigned long next;
int ret;
+ unsigned long *src_pfns;
+ unsigned long *dst_pfns;
start = cmd->addr;
end = start + size;
@@ -955,6 +1105,18 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
if (!mmget_not_zero(mm))
return -EINVAL;
+ ret = -ENOMEM;
+ src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!src_pfns)
+ goto free_mem;
+
+ dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!dst_pfns)
+ goto free_mem;
+
+ ret = 0;
mmap_read_lock(mm);
for (addr = start; addr < end; addr = next) {
vma = vma_lookup(mm, addr);
@@ -962,7 +1124,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
ret = -EINVAL;
goto out;
}
- next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
if (next > vma->vm_end)
next = vma->vm_end;
@@ -972,7 +1134,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+ args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+ MIGRATE_VMA_SELECT_COMPOUND;
ret = migrate_vma_setup(&args);
if (ret)
goto out;
@@ -992,7 +1155,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
*/
ret = dmirror_bounce_init(&bounce, start, size);
if (ret)
- return ret;
+ goto free_mem;
mutex_lock(&dmirror->mutex);
ret = dmirror_do_read(dmirror, start, end, &bounce);
mutex_unlock(&dmirror->mutex);
@@ -1003,11 +1166,14 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
}
cmd->cpages = bounce.cpages;
dmirror_bounce_fini(&bounce);
- return ret;
+ goto free_mem;
out:
mmap_read_unlock(mm);
mmput(mm);
+free_mem:
+ kfree(src_pfns);
+ kfree(dst_pfns);
return ret;
}
@@ -1200,6 +1366,7 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
unsigned long i;
unsigned long *src_pfns;
unsigned long *dst_pfns;
+ unsigned int order = 0;
src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
@@ -1215,13 +1382,25 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
if (WARN_ON(!is_device_private_page(spage) &&
!is_device_coherent_page(spage)))
continue;
+
+ order = folio_order(page_folio(spage));
spage = BACKING_PAGE(spage);
- dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+ if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+ dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+ order), 0);
+ } else {
+ dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+ order = 0;
+ }
+
+ /* TODO Support splitting here */
lock_page(dpage);
- copy_highpage(dpage, spage);
dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
if (src_pfns[i] & MIGRATE_PFN_WRITE)
dst_pfns[i] |= MIGRATE_PFN_WRITE;
+ if (order)
+ dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ folio_copy(page_folio(dpage), page_folio(spage));
}
migrate_device_pages(src_pfns, dst_pfns, npages);
migrate_device_finalize(src_pfns, dst_pfns, npages);
@@ -1234,7 +1413,12 @@ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
{
struct dmirror_device *mdevice = devmem->mdevice;
struct page *page;
+ struct folio *folio;
+
+ for (folio = mdevice->free_folios; folio; folio = folio_zone_device_data(folio))
+ if (dmirror_page_to_chunk(folio_page(folio, 0)) == devmem)
+ mdevice->free_folios = folio_zone_device_data(folio);
for (page = mdevice->free_pages; page; page = page->zone_device_data)
if (dmirror_page_to_chunk(page) == devmem)
mdevice->free_pages = page->zone_device_data;
@@ -1265,6 +1449,7 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
mdevice->devmem_count = 0;
mdevice->devmem_capacity = 0;
mdevice->free_pages = NULL;
+ mdevice->free_folios = NULL;
kfree(mdevice->devmem_chunks);
mdevice->devmem_chunks = NULL;
}
@@ -1378,18 +1563,30 @@ static void dmirror_devmem_free(struct page *page)
{
struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;
+ struct folio *folio = page_folio(rpage);
+ unsigned int order = folio_order(folio);
- if (rpage != page)
- __free_page(rpage);
+ if (rpage != page) {
+ if (order)
+ __free_pages(rpage, order);
+ else
+ __free_page(rpage);
+ rpage = NULL;
+ }
mdevice = dmirror_page_to_device(page);
spin_lock(&mdevice->lock);
/* Return page to our allocator if not freeing the chunk */
if (!dmirror_page_to_chunk(page)->remove) {
- mdevice->cfree++;
- page->zone_device_data = mdevice->free_pages;
- mdevice->free_pages = page;
+ mdevice->cfree += 1 << order;
+ if (order) {
+ page->zone_device_data = mdevice->free_folios;
+ mdevice->free_folios = page_folio(page);
+ } else {
+ page->zone_device_data = mdevice->free_pages;
+ mdevice->free_pages = page;
+ }
}
spin_unlock(&mdevice->lock);
}
@@ -1397,11 +1594,10 @@ static void dmirror_devmem_free(struct page *page)
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args = { 0 };
- unsigned long src_pfns = 0;
- unsigned long dst_pfns = 0;
struct page *rpage;
struct dmirror *dmirror;
- vm_fault_t ret;
+ vm_fault_t ret = 0;
+ unsigned int order, nr;
/*
* Normally, a device would use the page->zone_device_data to point to
@@ -1412,21 +1608,38 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
dmirror = rpage->zone_device_data;
/* FIXME demonstrate how we can adjust migrate range */
+ order = folio_order(page_folio(vmf->page));
+ nr = 1 << order;
+
+ /*
+ * Consider a per-cpu cache of src and dst pfns, but with
+ * large number of cpus that might not scale well.
+ */
+ args.start = ALIGN_DOWN(vmf->address, (PAGE_SIZE << order));
args.vma = vmf->vma;
- args.start = vmf->address;
- args.end = args.start + PAGE_SIZE;
- args.src = &src_pfns;
- args.dst = &dst_pfns;
+ args.end = args.start + (PAGE_SIZE << order);
+
+ nr = (args.end - args.start) >> PAGE_SHIFT;
+ args.src = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
+ args.dst = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
args.pgmap_owner = dmirror->mdevice;
args.flags = dmirror_select_device(dmirror);
args.fault_page = vmf->page;
+ if (!args.src || !args.dst) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
+
+ if (order)
+ args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
if (ret)
- return ret;
+ goto err;
migrate_vma_pages(&args);
/*
* No device finalize step is needed since
@@ -1434,7 +1647,10 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
* invalidated the device page table.
*/
migrate_vma_finalize(&args);
- return 0;
+err:
+ kfree(args.src);
+ kfree(args.dst);
+ return ret;
}
static const struct dev_pagemap_ops dmirror_devmem_ops = {
@@ -1465,7 +1681,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
return ret;
/* Build a list of free ZONE_DEVICE struct pages */
- return dmirror_allocate_chunk(mdevice, NULL);
+ return dmirror_allocate_chunk(mdevice, NULL, false);
}
static void dmirror_device_remove(struct dmirror_device *mdevice)
--
2.50.1
* [v2 06/11] mm/memremap: add folio_split support
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (4 preceding siblings ...)
2025-07-30 9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
` (6 subsequent siblings)
12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
When a zone device page is split (via huge pmd folio split), the
driver callback for folio_split is invoked to let the device driver
know that the folio has been split into smaller order folios.
The HMM test driver has been updated to handle the split. Since the
test driver uses backing pages (which are used to create a mirror
device), it needs a mechanism to reorganize the backing pages back
into pages of the right order. This is supported by exporting
prep_compound_page().
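For a driver that keeps no separate backing pages, the callback can stay
close to trivial: propagate the pgmap, mapping and any per-folio driver
data from the head folio to each new tail folio, and treat the call with
a NULL tail as the end-of-split notification. A minimal sketch, assuming
the driver shares one zone_device_data blob across the split (only the
callback signature comes from this patch; the rest is illustrative):
static void my_devmem_folio_split(struct folio *head, struct folio *tail)
{
	if (!tail)
		return;		/* end-of-split notification for @head */
	/* new tail folios inherit the pgmap and the fake mapping */
	tail->pgmap = head->pgmap;
	tail->page.mapping = head->page.mapping;
	/* per-page driver state; shared here for simplicity */
	tail->page.zone_device_data = head->page.zone_device_data;
}
static const struct dev_pagemap_ops my_devmem_ops = {
	/* .page_free, .migrate_to_ram, ... */
	.folio_split	= my_devmem_folio_split,
};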
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 29 +++++++++++++++++++++++++++++
include/linux/mm.h | 1 +
lib/test_hmm.c | 35 +++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 9 ++++++++-
4 files changed, 73 insertions(+), 1 deletion(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0c5141a7d58c..20f4b5ebbc93 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,13 @@ struct dev_pagemap_ops {
*/
int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
unsigned long nr_pages, int mf_flags);
+
+ /*
+ * Used for private (un-addressable) device memory only.
+ * This callback is used when a folio is split into
+ * a smaller folio
+ */
+ void (*folio_split)(struct folio *head, struct folio *tail);
};
#define PGMAP_ALTMAP_VALID (1 << 0)
@@ -229,6 +236,23 @@ static inline void zone_device_page_init(struct page *page)
zone_device_folio_init(folio, 0);
}
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+ struct folio *new_folio)
+{
+ if (folio_is_device_private(original_folio)) {
+ if (!original_folio->pgmap->ops->folio_split) {
+ if (new_folio) {
+ new_folio->pgmap = original_folio->pgmap;
+ new_folio->page.mapping =
+ original_folio->page.mapping;
+ }
+ } else {
+ original_folio->pgmap->ops->folio_split(original_folio,
+ new_folio);
+ }
+ }
+}
+
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -263,6 +287,11 @@ static inline unsigned long memremap_compat_align(void)
{
return PAGE_SIZE;
}
+
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+ struct folio *new_folio)
+{
+}
#endif /* CONFIG_ZONE_DEVICE */
static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8e3a4c5b78ff..d0ecf8386dd9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1183,6 +1183,7 @@ static inline struct folio *virt_to_folio(const void *x)
void __folio_put(struct folio *folio);
void split_page(struct page *page, unsigned int order);
+void prep_compound_page(struct page *page, unsigned int order);
void folio_copy(struct folio *dst, struct folio *src);
int folio_mc_copy(struct folio *dst, struct folio *src);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 4850f9026694..a8d0d24b4b7a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1653,9 +1653,44 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
return ret;
}
+static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
+{
+ struct page *rpage = BACKING_PAGE(folio_page(head, 0));
+ struct page *rpage_tail;
+ struct folio *rfolio;
+ unsigned long offset = 0;
+
+ if (!rpage) {
+ tail->page.zone_device_data = NULL;
+ return;
+ }
+
+ rfolio = page_folio(rpage);
+
+ if (tail == NULL) {
+ folio_reset_order(rfolio);
+ rfolio->mapping = NULL;
+ folio_set_count(rfolio, 1);
+ return;
+ }
+
+ offset = folio_pfn(tail) - folio_pfn(head);
+
+ rpage_tail = folio_page(rfolio, offset);
+ tail->page.zone_device_data = rpage_tail;
+ rpage_tail->zone_device_data = rpage->zone_device_data;
+ clear_compound_head(rpage_tail);
+ rpage_tail->mapping = NULL;
+
+ folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
+ tail->pgmap = head->pgmap;
+ folio_set_count(page_folio(rpage_tail), 1);
+}
+
static const struct dev_pagemap_ops dmirror_devmem_ops = {
.page_free = dmirror_devmem_free,
.migrate_to_ram = dmirror_devmem_fault,
+ .folio_split = dmirror_devmem_folio_split,
};
static int dmirror_device_init(struct dmirror_device *mdevice, int id)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 713dd433d352..75b368e7e33f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2955,9 +2955,16 @@ int split_device_private_folio(struct folio *folio)
VM_WARN_ON(ret);
for (new_folio = folio_next(folio); new_folio != end_folio;
new_folio = folio_next(new_folio)) {
+ zone_device_private_split_cb(folio, new_folio);
folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
new_folio));
}
+
+ /*
+ * Mark the end of the folio split for device private THP
+ * split
+ */
+ zone_device_private_split_cb(folio, NULL);
folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
return ret;
}
@@ -3979,7 +3986,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
mapping, uniform_split);
-
/*
* Unfreeze after-split folios and put them back to the right
* list. @folio should be kept frozon until page cache
@@ -4030,6 +4036,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
__filemap_remove_folio(new_folio, NULL);
folio_put_refs(new_folio, nr_pages);
}
+
/*
* Unfreeze @folio only after all page cache entries, which
* used to point to it, have been updated with new folios.
--
2.50.1
* [v2 07/11] mm/thp: add split during migration support
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (5 preceding siblings ...)
2025-07-30 9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-31 10:04 ` kernel test robot
2025-07-30 9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
` (5 subsequent siblings)
12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Support splitting pages during THP zone device migration as needed.
The common case that arises is that after setup, during migration
the destination might not be able to allocate MIGRATE_PFN_COMPOUND
pages.
Add a new routine migrate_vma_split_pages() to support the splitting
of already isolated pages. The pages being migrated are already unmapped
and marked for migration during setup (via unmap). folio_split() and
__split_unmapped_folio() take additional arguments to avoid unmapping
and remapping these pages and unlocking/putting the folio.
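From the driver side nothing special is needed to request the split: if
a source entry carries MIGRATE_PFN_COMPOUND but the driver cannot
allocate a compound destination page, it fills the destination entries
without MIGRATE_PFN_COMPOUND and the core code splits the source folio
and migrates the pieces individually. A rough sketch of that fallback in
a driver's alloc-and-copy loop (my_alloc_device_folio() and
my_alloc_device_page() are illustrative; the hmm test driver does the
equivalent in the next patch):
/* inside the driver's alloc-and-copy loop, for one source entry */
if (src[i] & MIGRATE_PFN_COMPOUND) {
	struct folio *dfolio = my_alloc_device_folio(HPAGE_PMD_ORDER);
	if (dfolio) {
		dst[i] = migrate_pfn(folio_pfn(dfolio)) |
			 MIGRATE_PFN_COMPOUND;
	} else {
		unsigned long j;
		/*
		 * No large device page available: fall back to small
		 * pages, one per subpage.  Because dst[i] lacks
		 * MIGRATE_PFN_COMPOUND while src[i] has it, the core
		 * migrate code splits the source THP and migrates the
		 * pieces individually.
		 */
		for (j = 0; j < HPAGE_PMD_NR; j++) {
			struct page *dpage = my_alloc_device_page();
			dst[i + j] = dpage ?
				migrate_pfn(page_to_pfn(dpage)) : 0;
		}
	}
}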
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 11 +++++--
mm/huge_memory.c | 46 ++++++++++++++-------------
mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++++++-------
3 files changed, 91 insertions(+), 35 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 56fdcaf7604b..19e7e3b7c2b7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,9 +343,9 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
vm_flags_t vm_flags);
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order);
int split_device_private_folio(struct folio *folio);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order, bool unmapped);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -354,6 +354,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
bool warns);
int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order)
+{
+ return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
/*
* try_folio_split - try to split a @folio at @page using non uniform split.
* @folio: folio to be split
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 75b368e7e33f..1fc1efa219c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3538,15 +3538,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
new_folio->mapping = folio->mapping;
new_folio->index = folio->index + i;
- /*
- * page->private should not be set in tail pages. Fix up and warn once
- * if private is unexpectedly set.
- */
- if (unlikely(new_folio->private)) {
- VM_WARN_ON_ONCE_PAGE(true, new_head);
- new_folio->private = NULL;
- }
-
if (folio_test_swapcache(folio))
new_folio->swap.val = folio->swap.val + i;
@@ -3775,6 +3766,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
* @lock_at: a page within @folio to be left locked to caller
* @list: after-split folios will be put on it if non NULL
* @uniform_split: perform uniform split or not (non-uniform split)
+ * @unmapped: The pages are already unmapped, they are migration entries.
*
* It calls __split_unmapped_folio() to perform uniform and non-uniform split.
* It is in charge of checking whether the split is supported or not and
@@ -3790,7 +3782,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
*/
static int __folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct page *lock_at,
- struct list_head *list, bool uniform_split)
+ struct list_head *list, bool uniform_split, bool unmapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3840,13 +3832,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* is taken to serialise against parallel split or collapse
* operations.
*/
- anon_vma = folio_get_anon_vma(folio);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (!unmapped) {
+ anon_vma = folio_get_anon_vma(folio);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ anon_vma_lock_write(anon_vma);
}
mapping = NULL;
- anon_vma_lock_write(anon_vma);
} else {
unsigned int min_order;
gfp_t gfp;
@@ -3913,7 +3907,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
goto out_unlock;
}
- unmap_folio(folio);
+ if (!unmapped)
+ unmap_folio(folio);
/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
@@ -4000,10 +3995,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
next = folio_next(new_folio);
+ zone_device_private_split_cb(folio, new_folio);
+
expected_refs = folio_expected_ref_count(new_folio) + 1;
folio_ref_unfreeze(new_folio, expected_refs);
- lru_add_split_folio(folio, new_folio, lruvec, list);
+ if (!unmapped)
+ lru_add_split_folio(folio, new_folio, lruvec, list);
/*
* Anonymous folio with swap cache.
@@ -4037,6 +4035,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
folio_put_refs(new_folio, nr_pages);
}
+ zone_device_private_split_cb(folio, NULL);
/*
* Unfreeze @folio only after all page cache entries, which
* used to point to it, have been updated with new folios.
@@ -4060,11 +4059,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
local_irq_enable();
+ if (unmapped)
+ return ret;
+
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
if (!ret && is_anon)
remap_flags = RMP_USE_SHARED_ZEROPAGE;
+
remap_page(folio, 1 << order, remap_flags);
/*
@@ -4149,12 +4152,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* Returns -EINVAL when trying to split to an order that is incompatible
* with the folio. Splitting to order 0 is compatible with all folios.
*/
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order, bool unmapped)
{
struct folio *folio = page_folio(page);
- return __folio_split(folio, new_order, &folio->page, page, list, true);
+ return __folio_split(folio, new_order, &folio->page, page, list, true,
+ unmapped);
}
/*
@@ -4183,7 +4187,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct list_head *list)
{
return __folio_split(folio, new_order, split_at, &folio->page, list,
- false);
+ false, false);
}
int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 4c3334cc3228..49962ea19109 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -816,6 +816,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
src[i] &= ~MIGRATE_PFN_MIGRATE;
return 0;
}
+
+static int migrate_vma_split_pages(struct migrate_vma *migrate,
+ unsigned long idx, unsigned long addr,
+ struct folio *folio)
+{
+ unsigned long i;
+ unsigned long pfn;
+ unsigned long flags;
+ int ret = 0;
+
+ folio_get(folio);
+ split_huge_pmd_address(migrate->vma, addr, true);
+ ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
+ 0, true);
+ if (ret)
+ return ret;
+ migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+ flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+ pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+ for (i = 1; i < HPAGE_PMD_NR; i++)
+ migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+ return ret;
+}
#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -825,6 +848,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
{
return 0;
}
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+ unsigned long idx, unsigned long addr,
+ struct folio *folio)
+{}
#endif
/*
@@ -974,8 +1002,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
struct migrate_vma *migrate)
{
struct mmu_notifier_range range;
- unsigned long i;
+ unsigned long i, j;
bool notified = false;
+ unsigned long addr;
for (i = 0; i < npages; ) {
struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1017,12 +1046,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
nr = HPAGE_PMD_NR;
src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- goto next;
+ } else {
+ nr = 1;
}
- migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
- &src_pfns[i]);
+ for (j = 0; j < nr && i + j < npages; j++) {
+ src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+ migrate_vma_insert_page(migrate,
+ addr + j * PAGE_SIZE,
+ &dst_pfns[i+j], &src_pfns[i+j]);
+ }
goto next;
}
@@ -1044,7 +1077,14 @@ static void __migrate_device_pages(unsigned long *src_pfns,
MIGRATE_PFN_COMPOUND);
goto next;
}
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ nr = 1 << folio_order(folio);
+ addr = migrate->start + i * PAGE_SIZE;
+ if (migrate_vma_split_pages(migrate, i, addr,
+ folio)) {
+ src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND);
+ goto next;
+ }
} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1079,12 +1119,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
BUG_ON(folio_test_writeback(folio));
if (migrate && migrate->fault_page == page)
- extra_cnt = 1;
- r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
- if (r != MIGRATEPAGE_SUCCESS)
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- else
- folio_migrate_flags(newfolio, folio);
+ extra_cnt++;
+ for (j = 0; j < nr && i + j < npages; j++) {
+ folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+ newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+ r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+ if (r != MIGRATEPAGE_SUCCESS)
+ src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+ else
+ folio_migrate_flags(newfolio, folio);
+ }
next:
i += nr;
}
--
2.50.1
* [v2 08/11] lib/test_hmm: add test case for split pages
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (6 preceding siblings ...)
2025-07-30 9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
` (4 subsequent siblings)
12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add a new flag, HMM_DMIRROR_FLAG_FAIL_ALLOC, to emulate failure
to allocate a large page. This exercises the code paths involved
in split migration.
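Usage from the selftests is a single ioctl before the migration request;
the flag is consumed by the first large allocation attempt, which then
fails and forces the split path. Roughly, using the hmm_dmirror_cmd()
and hmm_migrate_sys_to_dev() helpers from
tools/testing/selftests/mm/hmm-tests.c:
	/* ask the next large device-page allocation to fail once */
	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
	ASSERT_EQ(ret, 0);
	/* this migration now has to split the THP and use small pages */
	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
	ASSERT_EQ(ret, 0);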
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
lib/test_hmm.c | 61 ++++++++++++++++++++++++++++++---------------
lib/test_hmm_uapi.h | 3 +++
2 files changed, 44 insertions(+), 20 deletions(-)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index a8d0d24b4b7a..341ae2af44ec 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -92,6 +92,7 @@ struct dmirror {
struct xarray pt;
struct mmu_interval_notifier notifier;
struct mutex mutex;
+ __u64 flags;
};
/*
@@ -699,7 +700,12 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
page_to_pfn(spage)))
goto next;
- dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+ if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+ dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+ dpage = NULL;
+ } else
+ dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+
if (!dpage) {
struct folio *folio;
unsigned long i;
@@ -959,44 +965,55 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
spage = BACKING_PAGE(spage);
order = folio_order(page_folio(spage));
-
if (order)
+ *dst = MIGRATE_PFN_COMPOUND;
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+
+ if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+ dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+ *dst &= ~MIGRATE_PFN_COMPOUND;
+ dpage = NULL;
+ } else if (order) {
dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
order, args->vma, addr), 0);
- else
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-
- /* Try with smaller pages if large allocation fails */
- if (!dpage && order) {
+ } else {
dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- return VM_FAULT_OOM;
- order = 0;
}
+ if (!dpage && !order)
+ return VM_FAULT_OOM;
+
pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
page_to_pfn(spage), page_to_pfn(dpage));
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage));
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- if (order)
- *dst |= MIGRATE_PFN_COMPOUND;
+
+ if (dpage) {
+ lock_page(dpage);
+ *dst |= migrate_pfn(page_to_pfn(dpage));
+ }
for (i = 0; i < (1 << order); i++) {
struct page *src_page;
struct page *dst_page;
+ /* Try with smaller pages if large allocation fails */
+ if (!dpage && order) {
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ lock_page(dpage);
+ dst[i] = migrate_pfn(page_to_pfn(dpage));
+ dst_page = pfn_to_page(page_to_pfn(dpage));
+ dpage = NULL; /* For the next iteration */
+ } else {
+ dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+ }
+
src_page = pfn_to_page(page_to_pfn(spage) + i);
- dst_page = pfn_to_page(page_to_pfn(dpage) + i);
xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ addr += PAGE_SIZE;
copy_highpage(dst_page, src_page);
}
next:
- addr += PAGE_SIZE << order;
src += 1 << order;
dst += 1 << order;
}
@@ -1514,6 +1531,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
dmirror_device_remove_chunks(dmirror->mdevice);
ret = 0;
break;
+ case HMM_DMIRROR_FLAGS:
+ dmirror->flags = cmd.npages;
+ ret = 0;
+ break;
default:
return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..f94c6d457338 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -37,6 +37,9 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_RELEASE _IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS _IOWR('H', 0x08, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC (1ULL << 0)
/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
--
2.50.1
* [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (7 preceding siblings ...)
2025-07-30 9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
` (3 subsequent siblings)
12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add new tests for migrating anon THP pages, including anon_huge,
anon_huge_zero and error cases involving forced splitting of pages
during migration.
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
tools/testing/selftests/mm/hmm-tests.c | 410 +++++++++++++++++++++++++
1 file changed, 410 insertions(+)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 141bf63cbe05..da3322a1282c 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -2056,4 +2056,414 @@ TEST_F(hmm, hmm_cow_in_device)
hmm_buffer_free(buffer);
}
+
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+ int val;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize a read-only zero huge page. */
+ val = *(int *)buffer->ptr;
+ ASSERT_EQ(val, 0);
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], 0);
+ /* If it asserts once, it probably will 500,000 times */
+ if (ptr[i] != 0)
+ break;
+ }
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Try freeing it. */
+ ret = madvise(map, size, MADV_FREE);
+ ASSERT_EQ(ret, 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(2 * size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, 2 * size);
+
+ old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], i);
+ if (ptr[i] != i)
+ break;
+ }
+
+ /* Try faulting back a single (PAGE_SIZE) page. */
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 2048);
+
+ /* unmap and remap the region to reset things. */
+ ret = munmap(old_ptr, 2 * size);
+ ASSERT_EQ(ret, 0);
+ old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate THP to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /*
+ * Force an allocation error when faulting back a THP resident in the
+ * device.
+ */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 2048);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(2 * size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, 2 * size);
+
+ old_ptr = mmap(NULL, 2 * size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ /* Try faulting back a single (PAGE_SIZE) page. */
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 0);
+
+ /* unmap and remap the region to reset things. */
+ ret = munmap(old_ptr, 2 * size);
+ ASSERT_EQ(ret, 0);
+ old_ptr = mmap(NULL, 2 * size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory (zero THP page). */
+ ret = *(int *)buffer->ptr;
+ ASSERT_EQ(ret, 0);
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Fault the device memory back and check it. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.50.1
* [v2 10/11] gpu/drm/nouveau: add THP migration support
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (8 preceding siblings ...)
2025-07-30 9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
` (2 subsequent siblings)
12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Change the code to add support for MIGRATE_VMA_SELECT_COMPOUND and to
handle page sizes appropriately in the migrate/evict code paths.
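As a quick illustration (a sketch only, not part of the patch; the helper
name is invented, while MIGRATE_VMA_SELECT_COMPOUND and MIGRATE_PFN_COMPOUND
come from the earlier patches in this series), the fault path now builds its
migrate_vma arguments roughly like this:

static void sketch_setup_compound_args(struct nouveau_drm *drm,
				       struct vm_fault *vmf,
				       struct migrate_vma *args)
{
	unsigned int order = folio_order(page_folio(vmf->page));

	args->vma = vmf->vma;
	/* Cover the whole (possibly huge) folio backing the faulting address. */
	args->start = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
	args->end = args->start + (PAGE_SIZE << order);
	args->pgmap_owner = drm->dev;
	args->fault_page = vmf->page;
	args->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
	/* Ask migrate_vma_setup() to collect the THP as a single entry. */
	if (order)
		args->flags |= MIGRATE_VMA_SELECT_COMPOUND;
}

The destination pfn for a compound source is then tagged with
MIGRATE_PFN_COMPOUND, as done in nouveau_dmem_migrate_to_ram() below.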
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++++++++++++--------
drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
3 files changed, 178 insertions(+), 77 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index ca4932a150e3..d3672d01e8b5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -83,9 +83,15 @@ struct nouveau_dmem {
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
+ struct folio *free_folios;
spinlock_t lock;
};
+struct nouveau_dmem_dma_info {
+ dma_addr_t dma_addr;
+ size_t size;
+};
+
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
@@ -112,10 +118,16 @@ static void nouveau_dmem_page_free(struct page *page)
{
struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
struct nouveau_dmem *dmem = chunk->drm->dmem;
+ struct folio *folio = page_folio(page);
spin_lock(&dmem->lock);
- page->zone_device_data = dmem->free_pages;
- dmem->free_pages = page;
+ if (folio_order(folio)) {
+ folio_set_zone_device_data(folio, dmem->free_folios);
+ dmem->free_folios = folio;
+ } else {
+ page->zone_device_data = dmem->free_pages;
+ dmem->free_pages = page;
+ }
WARN_ON(!chunk->callocated);
chunk->callocated--;
@@ -139,20 +151,28 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
}
}
-static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
- struct page *dpage, dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_folio(struct nouveau_drm *drm,
+ struct folio *sfolio, struct folio *dfolio,
+ struct nouveau_dmem_dma_info *dma_info)
{
struct device *dev = drm->dev->dev;
+ struct page *dpage = folio_page(dfolio, 0);
+ struct page *spage = folio_page(sfolio, 0);
- lock_page(dpage);
+ folio_lock(dfolio);
- *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, *dma_addr))
+ dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+ DMA_BIDIRECTIONAL);
+ dma_info->size = page_size(dpage);
+ if (dma_mapping_error(dev, dma_info->dma_addr))
return -EIO;
- if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
- NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
+ NOUVEAU_APER_HOST, dma_info->dma_addr,
+ NOUVEAU_APER_VRAM,
+ nouveau_dmem_page_addr(spage))) {
+ dma_unmap_page(dev, dma_info->dma_addr, page_size(dpage),
+ DMA_BIDIRECTIONAL);
return -EIO;
}
@@ -165,21 +185,38 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
struct nouveau_svmm *svmm;
- struct page *spage, *dpage;
- unsigned long src = 0, dst = 0;
- dma_addr_t dma_addr = 0;
+ struct page *dpage;
vm_fault_t ret = 0;
struct migrate_vma args = {
.vma = vmf->vma,
- .start = vmf->address,
- .end = vmf->address + PAGE_SIZE,
- .src = &src,
- .dst = &dst,
.pgmap_owner = drm->dev,
.fault_page = vmf->page,
- .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+ .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+ MIGRATE_VMA_SELECT_COMPOUND,
+ .src = NULL,
+ .dst = NULL,
};
-
+ unsigned int order, nr;
+ struct folio *sfolio, *dfolio;
+ struct nouveau_dmem_dma_info dma_info;
+
+ sfolio = page_folio(vmf->page);
+ order = folio_order(sfolio);
+ nr = 1 << order;
+
+ if (order)
+ args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
+ args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
+ args.vma = vmf->vma;
+ args.end = args.start + (PAGE_SIZE << order);
+ args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+ args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
+
+ if (!args.src || !args.dst) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
/*
* FIXME what we really want is to find some heuristic to migrate more
* than just one page on CPU fault. When such fault happens it is very
@@ -190,20 +227,26 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
if (!args.cpages)
return 0;
- spage = migrate_pfn_to_page(src);
- if (!spage || !(src & MIGRATE_PFN_MIGRATE))
- goto done;
-
- dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma, vmf->address);
- if (!dpage)
+ if (order)
+ dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER | __GFP_ZERO,
+ order, vmf->vma, vmf->address), 0);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma,
+ vmf->address);
+ if (!dpage) {
+ ret = VM_FAULT_OOM;
goto done;
+ }
- dst = migrate_pfn(page_to_pfn(dpage));
+ args.dst[0] = migrate_pfn(page_to_pfn(dpage));
+ if (order)
+ args.dst[0] |= MIGRATE_PFN_COMPOUND;
+ dfolio = page_folio(dpage);
- svmm = spage->zone_device_data;
+ svmm = folio_zone_device_data(sfolio);
mutex_lock(&svmm->mutex);
nouveau_svmm_invalidate(svmm, args.start, args.end);
- ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+ ret = nouveau_dmem_copy_folio(drm, sfolio, dfolio, &dma_info);
mutex_unlock(&svmm->mutex);
if (ret) {
ret = VM_FAULT_SIGBUS;
@@ -213,19 +256,33 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
nouveau_fence_new(&fence, dmem->migrate.chan);
migrate_vma_pages(&args);
nouveau_dmem_fence_done(&fence);
- dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_info.dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
done:
migrate_vma_finalize(&args);
+err:
+ kfree(args.src);
+ kfree(args.dst);
return ret;
}
+static void nouveau_dmem_folio_split(struct folio *head, struct folio *tail)
+{
+ if (tail == NULL)
+ return;
+ tail->pgmap = head->pgmap;
+ folio_set_zone_device_data(tail, folio_zone_device_data(head));
+}
+
static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
.page_free = nouveau_dmem_page_free,
.migrate_to_ram = nouveau_dmem_migrate_to_ram,
+ .folio_split = nouveau_dmem_folio_split,
};
static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage,
+ bool is_large)
{
struct nouveau_dmem_chunk *chunk;
struct resource *res;
@@ -274,16 +331,21 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
page = pfn_to_page(pfn_first);
spin_lock(&drm->dmem->lock);
- for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
- page->zone_device_data = drm->dmem->free_pages;
- drm->dmem->free_pages = page;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !is_large) {
+ for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+ page->zone_device_data = drm->dmem->free_pages;
+ drm->dmem->free_pages = page;
+ }
}
+
*ppage = page;
chunk->callocated++;
spin_unlock(&drm->dmem->lock);
- NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
- DMEM_CHUNK_SIZE >> 20);
+ NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+ DMEM_CHUNK_SIZE >> 20, is_large ? "THP " : "", pfn_first,
+ nouveau_dmem_page_addr(page));
return 0;
@@ -298,27 +360,37 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
}
static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
{
struct nouveau_dmem_chunk *chunk;
struct page *page = NULL;
+ struct folio *folio = NULL;
int ret;
+ unsigned int order = 0;
spin_lock(&drm->dmem->lock);
- if (drm->dmem->free_pages) {
+ if (is_large && drm->dmem->free_folios) {
+ folio = drm->dmem->free_folios;
+ drm->dmem->free_folios = folio_zone_device_data(folio);
+ chunk = nouveau_page_to_chunk(page);
+ chunk->callocated++;
+ spin_unlock(&drm->dmem->lock);
+ order = DMEM_CHUNK_NPAGES;
+ } else if (!is_large && drm->dmem->free_pages) {
page = drm->dmem->free_pages;
drm->dmem->free_pages = page->zone_device_data;
chunk = nouveau_page_to_chunk(page);
chunk->callocated++;
spin_unlock(&drm->dmem->lock);
+ folio = page_folio(page);
} else {
spin_unlock(&drm->dmem->lock);
- ret = nouveau_dmem_chunk_alloc(drm, &page);
+ ret = nouveau_dmem_chunk_alloc(drm, &page, is_large);
if (ret)
return NULL;
}
- zone_device_page_init(page);
+ zone_device_folio_init(folio, order);
return page;
}
@@ -369,12 +441,12 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
{
unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
unsigned long *src_pfns, *dst_pfns;
- dma_addr_t *dma_addrs;
+ struct nouveau_dmem_dma_info *dma_info;
struct nouveau_fence *fence;
src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
- dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL | __GFP_NOFAIL);
+ dma_info = kvcalloc(npages, sizeof(*dma_info), GFP_KERNEL | __GFP_NOFAIL);
migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
npages);
@@ -382,17 +454,28 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
for (i = 0; i < npages; i++) {
if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
struct page *dpage;
+ struct folio *folio = page_folio(
+ migrate_pfn_to_page(src_pfns[i]));
+ unsigned int order = folio_order(folio);
+
+ if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+ dpage = folio_page(
+ folio_alloc(
+ GFP_HIGHUSER_MOVABLE, order), 0);
+ } else {
+ /*
+ * _GFP_NOFAIL because the GPU is going away and there
+ * is nothing sensible we can do if we can't copy the
+ * data back.
+ */
+ dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+ }
- /*
- * _GFP_NOFAIL because the GPU is going away and there
- * is nothing sensible we can do if we can't copy the
- * data back.
- */
- dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
- nouveau_dmem_copy_one(chunk->drm,
- migrate_pfn_to_page(src_pfns[i]), dpage,
- &dma_addrs[i]);
+ nouveau_dmem_copy_folio(chunk->drm,
+ page_folio(migrate_pfn_to_page(src_pfns[i])),
+ page_folio(dpage),
+ &dma_info[i]);
}
}
@@ -403,8 +486,9 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
kvfree(src_pfns);
kvfree(dst_pfns);
for (i = 0; i < npages; i++)
- dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
- kvfree(dma_addrs);
+ dma_unmap_page(chunk->drm->dev->dev, dma_info[i].dma_addr,
+ dma_info[i].size, DMA_BIDIRECTIONAL);
+ kvfree(dma_info);
}
void
@@ -607,31 +691,35 @@ nouveau_dmem_init(struct nouveau_drm *drm)
static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
struct nouveau_svmm *svmm, unsigned long src,
- dma_addr_t *dma_addr, u64 *pfn)
+ struct nouveau_dmem_dma_info *dma_info, u64 *pfn)
{
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
unsigned long paddr;
+ bool is_large = false;
spage = migrate_pfn_to_page(src);
if (!(src & MIGRATE_PFN_MIGRATE))
goto out;
- dpage = nouveau_dmem_page_alloc_locked(drm);
+ is_large = src & MIGRATE_PFN_COMPOUND;
+ dpage = nouveau_dmem_page_alloc_locked(drm, is_large);
if (!dpage)
goto out;
paddr = nouveau_dmem_page_addr(dpage);
if (spage) {
- *dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
+ dma_info->dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, *dma_addr))
+ dma_info->size = page_size(spage);
+ if (dma_mapping_error(dev, dma_info->dma_addr))
goto out_free_page;
- if (drm->dmem->migrate.copy_func(drm, 1,
- NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
+ if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(page_folio(spage)),
+ NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST,
+ dma_info->dma_addr))
goto out_dma_unmap;
} else {
- *dma_addr = DMA_MAPPING_ERROR;
+ dma_info->dma_addr = DMA_MAPPING_ERROR;
if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
NOUVEAU_APER_VRAM, paddr))
goto out_free_page;
@@ -645,7 +733,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
return migrate_pfn(page_to_pfn(dpage));
out_dma_unmap:
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(dev, dma_info->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
out_free_page:
nouveau_dmem_page_free_locked(drm, dpage);
out:
@@ -655,27 +743,33 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
struct nouveau_svmm *svmm, struct migrate_vma *args,
- dma_addr_t *dma_addrs, u64 *pfns)
+ struct nouveau_dmem_dma_info *dma_info, u64 *pfns)
{
struct nouveau_fence *fence;
unsigned long addr = args->start, nr_dma = 0, i;
+ unsigned long order = 0;
- for (i = 0; addr < args->end; i++) {
+ for (i = 0; addr < args->end; ) {
+ struct folio *folio;
+
+ folio = page_folio(migrate_pfn_to_page(args->dst[i]));
+ order = folio_order(folio);
args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
- args->src[i], dma_addrs + nr_dma, pfns + i);
- if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+ args->src[i], dma_info + nr_dma, pfns + i);
+ if (!dma_mapping_error(drm->dev->dev, dma_info[nr_dma].dma_addr))
nr_dma++;
- addr += PAGE_SIZE;
+ i += 1 << order;
+ addr += (1 << order) * PAGE_SIZE;
}
nouveau_fence_new(&fence, drm->dmem->migrate.chan);
migrate_vma_pages(args);
nouveau_dmem_fence_done(&fence);
- nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+ nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i, order);
while (nr_dma--) {
- dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
- DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_info[nr_dma].dma_addr,
+ dma_info[nr_dma].size, DMA_BIDIRECTIONAL);
}
migrate_vma_finalize(args);
}
@@ -689,20 +783,24 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
{
unsigned long npages = (end - start) >> PAGE_SHIFT;
unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
- dma_addr_t *dma_addrs;
struct migrate_vma args = {
.vma = vma,
.start = start,
.pgmap_owner = drm->dev,
- .flags = MIGRATE_VMA_SELECT_SYSTEM,
+ .flags = MIGRATE_VMA_SELECT_SYSTEM
+ | MIGRATE_VMA_SELECT_COMPOUND,
};
unsigned long i;
u64 *pfns;
int ret = -ENOMEM;
+ struct nouveau_dmem_dma_info *dma_info;
if (drm->dmem == NULL)
return -ENODEV;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ max = max(HPAGE_PMD_NR, max);
+
args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
if (!args.src)
goto out;
@@ -710,8 +808,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
if (!args.dst)
goto out_free_src;
- dma_addrs = kmalloc_array(max, sizeof(*dma_addrs), GFP_KERNEL);
- if (!dma_addrs)
+ dma_info = kmalloc_array(max, sizeof(*dma_info), GFP_KERNEL);
+ if (!dma_info)
goto out_free_dst;
pfns = nouveau_pfns_alloc(max);
@@ -729,7 +827,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
goto out_free_pfns;
if (args.cpages)
- nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
+ nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_info,
pfns);
args.start = args.end;
}
@@ -738,7 +836,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
out_free_pfns:
nouveau_pfns_free(pfns);
out_free_dma:
- kfree(dma_addrs);
+ kfree(dma_info);
out_free_dst:
kfree(args.dst);
out_free_src:
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 6fa387da0637..b8a3378154d5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -921,12 +921,14 @@ nouveau_pfns_free(u64 *pfns)
void
nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages)
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift)
{
struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
args->p.addr = addr;
- args->p.size = npages << PAGE_SHIFT;
+ args->p.size = npages << page_shift;
+ args->p.page = page_shift;
mutex_lock(&svmm->mutex);
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
u64 *nouveau_pfns_alloc(unsigned long npages);
void nouveau_pfns_free(u64 *pfns);
void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages);
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift);
#else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
--
2.50.1
* [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (9 preceding siblings ...)
2025-07-30 9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
@ 2025-07-30 9:21 ` Balbir Singh
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
2025-08-05 21:34 ` Matthew Brost
12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30 9:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Add a new benchmark-style test to measure transfer bandwidth for
zone device memory operations.
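As a rough worked example of what the test reports (the numbers here are
illustrative, not measured): migrating a 256MB buffer in 50 ms averages out
to 0.25 GB / 0.05 s = 5 GB/s for that direction.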
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
tools/testing/selftests/mm/hmm-tests.c | 197 ++++++++++++++++++++++++-
1 file changed, 196 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index da3322a1282c..1325de70f44f 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -25,6 +25,7 @@
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
+#include <sys/time.h>
/*
@@ -207,8 +208,10 @@ static void hmm_buffer_free(struct hmm_buffer *buffer)
if (buffer == NULL)
return;
- if (buffer->ptr)
+ if (buffer->ptr) {
munmap(buffer->ptr, buffer->size);
+ buffer->ptr = NULL;
+ }
free(buffer->mirror);
free(buffer);
}
@@ -2466,4 +2469,196 @@ TEST_F(hmm, migrate_anon_huge_zero_err)
buffer->ptr = old_ptr;
hmm_buffer_free(buffer);
}
+
+struct benchmark_results {
+ double sys_to_dev_time;
+ double dev_to_sys_time;
+ double throughput_s2d;
+ double throughput_d2s;
+};
+
+static double get_time_ms(void)
+{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ return (tv.tv_sec * 1000.0) + (tv.tv_usec / 1000.0);
+}
+
+static inline struct hmm_buffer *hmm_buffer_alloc(unsigned long size)
+{
+ struct hmm_buffer *buffer;
+
+ buffer = malloc(sizeof(*buffer));
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ memset(buffer->mirror, 0xFF, size);
+ return buffer;
+}
+
+static void print_benchmark_results(const char *test_name, size_t buffer_size,
+ struct benchmark_results *thp,
+ struct benchmark_results *regular)
+{
+ double s2d_improvement = ((regular->sys_to_dev_time - thp->sys_to_dev_time) /
+ regular->sys_to_dev_time) * 100.0;
+ double d2s_improvement = ((regular->dev_to_sys_time - thp->dev_to_sys_time) /
+ regular->dev_to_sys_time) * 100.0;
+ double throughput_s2d_improvement = ((thp->throughput_s2d - regular->throughput_s2d) /
+ regular->throughput_s2d) * 100.0;
+ double throughput_d2s_improvement = ((thp->throughput_d2s - regular->throughput_d2s) /
+ regular->throughput_d2s) * 100.0;
+
+ printf("\n=== %s (%.1f MB) ===\n", test_name, buffer_size / (1024.0 * 1024.0));
+ printf(" | With THP | Without THP | Improvement\n");
+ printf("---------------------------------------------------------------------\n");
+ printf("Sys->Dev Migration | %.3f ms | %.3f ms | %.1f%%\n",
+ thp->sys_to_dev_time, regular->sys_to_dev_time, s2d_improvement);
+ printf("Dev->Sys Migration | %.3f ms | %.3f ms | %.1f%%\n",
+ thp->dev_to_sys_time, regular->dev_to_sys_time, d2s_improvement);
+ printf("S->D Throughput | %.2f GB/s | %.2f GB/s | %.1f%%\n",
+ thp->throughput_s2d, regular->throughput_s2d, throughput_s2d_improvement);
+ printf("D->S Throughput | %.2f GB/s | %.2f GB/s | %.1f%%\n",
+ thp->throughput_d2s, regular->throughput_d2s, throughput_d2s_improvement);
+}
+
+/*
+ * Run a single migration benchmark
+ * fd: file descriptor for hmm device
+ * use_thp: whether to use THP
+ * buffer_size: size of buffer to allocate
+ * iterations: number of iterations
+ * results: where to store results
+ */
+static inline int run_migration_benchmark(int fd, int use_thp, size_t buffer_size,
+ int iterations, struct benchmark_results *results)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages = buffer_size / sysconf(_SC_PAGESIZE);
+ double start, end;
+ double s2d_total = 0, d2s_total = 0;
+ int ret, i;
+ int *ptr;
+
+ buffer = hmm_buffer_alloc(buffer_size);
+
+ /* Map memory */
+ buffer->ptr = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+ if (buffer->ptr == MAP_FAILED)
+ return -1;
+
+ /* Apply THP hint if requested */
+ if (use_thp)
+ ret = madvise(buffer->ptr, buffer_size, MADV_HUGEPAGE);
+ else
+ ret = madvise(buffer->ptr, buffer_size, MADV_NOHUGEPAGE);
+
+ if (ret)
+ return ret;
+
+ /* Initialize memory to make sure pages are allocated */
+ ptr = (int *)buffer->ptr;
+ for (i = 0; i < buffer_size / sizeof(int); i++)
+ ptr[i] = i & 0xFF;
+
+ /* Warmup iteration */
+ ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ /* Benchmark iterations */
+ for (i = 0; i < iterations; i++) {
+ /* System to device migration */
+ start = get_time_ms();
+
+ ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ end = get_time_ms();
+ s2d_total += (end - start);
+
+ /* Device to system migration */
+ start = get_time_ms();
+
+ ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ end = get_time_ms();
+ d2s_total += (end - start);
+ }
+
+ /* Calculate average times and throughput */
+ results->sys_to_dev_time = s2d_total / iterations;
+ results->dev_to_sys_time = d2s_total / iterations;
+ results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+ (results->sys_to_dev_time / 1000.0);
+ results->throughput_d2s = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+ (results->dev_to_sys_time / 1000.0);
+
+ /* Cleanup */
+ hmm_buffer_free(buffer);
+ return 0;
+}
+
+/*
+ * Benchmark THP migration with different buffer sizes
+ */
+TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
+{
+ struct benchmark_results thp_results, regular_results;
+ size_t thp_size = 2 * 1024 * 1024; /* 2MB - typical THP size */
+ int iterations = 5;
+
+ printf("\nHMM THP Migration Benchmark\n");
+ printf("---------------------------\n");
+ printf("System page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
+
+ /* Test different buffer sizes */
+ size_t test_sizes[] = {
+ thp_size / 4, /* 512KB - smaller than THP */
+ thp_size / 2, /* 1MB - half THP */
+ thp_size, /* 2MB - single THP */
+ thp_size * 2, /* 4MB - two THPs */
+ thp_size * 4, /* 8MB - four THPs */
+ thp_size * 8, /* 16MB - eight THPs */
+ thp_size * 128, /* 256MB - one twenty eight THPs */
+ };
+
+ static const char *const test_names[] = {
+ "Small Buffer (512KB)",
+ "Half THP Size (1MB)",
+ "Single THP Size (2MB)",
+ "Two THP Size (4MB)",
+ "Four THP Size (8MB)",
+ "Eight THP Size (16MB)",
+ "One twenty eight THP Size (256MB)"
+ };
+
+ int num_tests = ARRAY_SIZE(test_sizes);
+
+ /* Run all tests */
+ for (int i = 0; i < num_tests; i++) {
+ /* Test with THP */
+ ASSERT_EQ(run_migration_benchmark(self->fd, 1, test_sizes[i],
+ iterations, &thp_results), 0);
+
+ /* Test without THP */
+ ASSERT_EQ(run_migration_benchmark(self->fd, 0, test_sizes[i],
+ iterations, &regular_results), 0);
+
+ /* Print results */
+ print_benchmark_results(test_names[i], test_sizes[i],
+ &thp_results, &regular_results);
+ }
+}
TEST_HARNESS_MAIN
--
2.50.1
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-07-30 9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-07-30 9:50 ` David Hildenbrand
2025-08-04 23:43 ` Balbir Singh
2025-08-05 4:22 ` Balbir Singh
0 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-30 9:50 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 30.07.25 11:21, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios
> and helper functions for zone device folios, to check if a folio is
> device private and helpers for setting zone device data.
>
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed, this is true for both
> PAGE_SIZE and higher order pages.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/memremap.h | 10 ++++++++-
> mm/memremap.c | 48 +++++++++++++++++++++++++++++-----------
> 2 files changed, 44 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 4aa151914eab..a0723b35eeaa 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
> }
>
> #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void zone_device_folio_init(struct folio *folio, unsigned int order);
> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> void memunmap_pages(struct dev_pagemap *pgmap);
> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>
> unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> + struct folio *folio = page_folio(page);
> +
> + zone_device_folio_init(folio, 0);
> +}
> +
> #else
> static inline void *devm_memremap_pages(struct device *dev,
> struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0ce0d8254bd..3ca136e7455e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
> void free_zone_device_folio(struct folio *folio)
> {
> struct dev_pagemap *pgmap = folio->pgmap;
> + unsigned int nr = folio_nr_pages(folio);
> + int i;
"unsigned long" is to be future-proof.
(folio_nr_pages() returns long and probably soon unsigned long)
[ I'd probably call it "nr_pages" ]
>
> if (WARN_ON_ONCE(!pgmap))
> return;
>
> mem_cgroup_uncharge(folio);
>
> - /*
> - * Note: we don't expect anonymous compound pages yet. Once supported
> - * and we could PTE-map them similar to THP, we'd have to clear
> - * PG_anon_exclusive on all tail pages.
> - */
> if (folio_test_anon(folio)) {
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> - __ClearPageAnonExclusive(folio_page(folio, 0));
> + for (i = 0; i < nr; i++)
> + __ClearPageAnonExclusive(folio_page(folio, i));
> + } else {
> + VM_WARN_ON_ONCE(folio_test_large(folio));
> }
>
> /*
> @@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
>
> switch (pgmap->type) {
> case MEMORY_DEVICE_PRIVATE:
> + if (folio_test_large(folio)) {
Could do "nr > 1" if we already have that value around.
> + folio_unqueue_deferred_split(folio);
I think I asked that already but maybe missed the reply: Should these
folios ever be added to the deferred split queue and is there any value
in splitting them under memory pressure in the shrinker?
My gut feeling is "No", because the buddy cannot make use of these
folios, but maybe there is an interesting case where we want that behavior?
> +
> + percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> + }
> + pgmap->ops->page_free(&folio->page);
> + percpu_ref_put(&folio->pgmap->ref);
Could you simply do a
percpu_ref_put_many(&folio->pgmap->ref, nr);
here, or would that be problematic?
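IOW, something like this (untested sketch):

	case MEMORY_DEVICE_PRIVATE:
		if (folio_test_large(folio))
			folio_unqueue_deferred_split(folio);
		pgmap->ops->page_free(&folio->page);
		percpu_ref_put_many(&folio->pgmap->ref, nr);
		folio->page.mapping = NULL;
		break;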
> + folio->page.mapping = NULL;
> + break;
> case MEMORY_DEVICE_COHERENT:
> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
> break;
> - pgmap->ops->page_free(folio_page(folio, 0));
> - put_dev_pagemap(pgmap);
> + pgmap->ops->page_free(&folio->page);
> + percpu_ref_put(&folio->pgmap->ref);
> break;
>
> case MEMORY_DEVICE_GENERIC:
> @@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
> }
> }
>
> -void zone_device_page_init(struct page *page)
> +void zone_device_folio_init(struct folio *folio, unsigned int order)
> {
> + struct page *page = folio_page(folio, 0);
> +
> + VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +
> + /*
> + * Only PMD level migration is supported for THP migration
> + */
Talking about something that does not exist yet (and is very specific)
sounds a bit weird.
Should this go into a different patch, or could we rephrase the comment
to be a bit more generic?
In this patch here, nothing would really object to "order" being
intermediate.
(also, this is a device_private limitation? shouldn't that check go
somewhere where we can perform this device-private limitation check?)
> + WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> +
> /*
> * Drivers shouldn't be allocating pages after calling
> * memunmap_pages().
> */
> - WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> - set_page_count(page, 1);
> + WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> + folio_set_count(folio, 1);
> lock_page(page);
> +
> + if (order > 1) {
> + prep_compound_page(page, order);
> + folio_set_large_rmappable(folio);
> + }
> }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(zone_device_folio_init);
--
Cheers,
David / dhildenb
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-30 11:16 ` Mika Penttilä
2025-07-30 11:27 ` Zi Yan
2025-07-30 20:05 ` kernel test robot
1 sibling, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 11:16 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi,
On 7/30/25 12:21, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages aware of zone
> device pages. Although the code is designed to be generic when it comes
> to handling splitting of pages, the code is designed to work for THP
> page sizes corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone device huge
> entry is present, enabling try_to_migrate() and other code migration
> paths to appropriately process the entry. page_vma_mapped_walk() will
> return true for zone device private large folios only when
> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
> not zone device private pages from having to add awareness. The key
> callback that needs this flag is try_to_migrate_one(). The other
> callbacks page idle, damon use it for setting young/dirty bits, which is
> not significant when it comes to pmd level bit harvesting.
>
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
> entries.
>
> Zone device private entries when split via munmap go through pmd split,
> but need to go through a folio split, deferred split does not work if a
> fault is encountered because fault handling involves migration entries
> (via folio_migrate_mapping) and the folio sizes are expected to be the
> same there. This introduces the need to split the folio while handling
> the pmd split. Because the folio is still mapped, but calling
> folio_split() will cause lock recursion, the __split_unmapped_folio()
> code is used with a new helper to wrap the code
> split_device_private_folio(), which skips the checks around
> folio->mapping, swapcache and the need to go through unmap and remap
> folio.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/huge_mm.h | 1 +
> include/linux/rmap.h | 2 +
> include/linux/swapops.h | 17 +++
> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
> mm/page_vma_mapped.c | 13 +-
> mm/pgtable-generic.c | 6 +
> mm/rmap.c | 22 +++-
> 7 files changed, 278 insertions(+), 51 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7748489fde1b..2a6f5ff7bca3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -345,6 +345,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> unsigned int new_order);
> +int split_device_private_folio(struct folio *folio);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 20803fcb49a7..625f36dcc121 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -905,6 +905,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> #define PVMW_SYNC (1 << 0)
> /* Look for migration entries rather than present PTEs */
> #define PVMW_MIGRATION (1 << 1)
> +/* Look for device private THP entries */
> +#define PVMW_THP_DEVICE_PRIVATE (1 << 2)
>
> struct page_vma_mapped_walk {
> unsigned long pfn;
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..2641c01bd5d2 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> {
> return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> }
> +
> #else /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> struct page *page)
> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> }
> #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>
> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> + return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> +}
> +
> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return swp_type(entry) >= MAX_SWAPFILES;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..e373c6578894 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -72,6 +72,10 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
> struct shrink_control *sc);
> static unsigned long deferred_split_scan(struct shrinker *shrink,
> struct shrink_control *sc);
> +static int __split_unmapped_folio(struct folio *folio, int new_order,
> + struct page *split_at, struct xa_state *xas,
> + struct address_space *mapping, bool uniform_split);
> +
> static bool split_underused_thp = true;
>
> static atomic_t huge_zero_refcount;
> @@ -1711,8 +1715,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> - if (!is_readable_migration_entry(entry)) {
> + VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
> + !is_pmd_device_private_entry(pmd));
> +
> + if (is_migration_entry(entry) &&
> + is_writable_migration_entry(entry)) {
> entry = make_readable_migration_entry(
> swp_offset(entry));
> pmd = swp_entry_to_pmd(entry);
> @@ -1722,6 +1729,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pmd = pmd_swp_mkuffd_wp(pmd);
> set_pmd_at(src_mm, addr, src_pmd, pmd);
> }
> +
> + if (is_device_private_entry(entry)) {
> + if (is_writable_device_private_entry(entry)) {
> + entry = make_readable_device_private_entry(
> + swp_offset(entry));
> + pmd = swp_entry_to_pmd(entry);
> +
> + if (pmd_swp_soft_dirty(*src_pmd))
> + pmd = pmd_swp_mksoft_dirty(pmd);
> + if (pmd_swp_uffd_wp(*src_pmd))
> + pmd = pmd_swp_mkuffd_wp(pmd);
> + set_pmd_at(src_mm, addr, src_pmd, pmd);
> + }
> +
> + src_folio = pfn_swap_entry_folio(entry);
> + VM_WARN_ON(!folio_test_large(src_folio));
> +
> + folio_get(src_folio);
> + /*
> + * folio_try_dup_anon_rmap_pmd does not fail for
> + * device private entries.
> + */
> + VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
> + &src_folio->page, dst_vma, src_vma));
> + }
> +
> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> mm_inc_nr_ptes(dst_mm);
> pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -2219,15 +2252,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_remove_rmap_pmd(folio, page, vma);
> WARN_ON_ONCE(folio_mapcount(folio) < 0);
> VM_BUG_ON_PAGE(!PageHead(page), page);
> - } else if (thp_migration_supported()) {
> + } else if (is_pmd_migration_entry(orig_pmd) ||
> + is_pmd_device_private_entry(orig_pmd)) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> - } else
> - WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> + if (!thp_migration_supported())
> + WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> + if (is_pmd_device_private_entry(orig_pmd)) {
> + folio_remove_rmap_pmd(folio, &folio->page, vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + }
> + }
>
> if (folio_test_anon(folio)) {
> zap_deposited_table(tlb->mm, pmd);
> @@ -2247,6 +2287,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + /*
> + * Do a folio put on zone device private pages after
> + * changes to mm_counter, because the folio_put() will
> + * clean folio->mapping and the folio_test_anon() check
> + * will not be usable.
> + */
> + if (folio_is_device_private(folio))
> + folio_put(folio);
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2424,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2388,6 +2438,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> + } else if (is_writable_device_private_entry(entry)) {
> + entry = make_readable_device_private_entry(
> + swp_offset(entry));
> + newpmd = swp_entry_to_pmd(entry);
> } else {
> newpmd = *pmd;
> }
> @@ -2834,6 +2888,44 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> pmd_populate(mm, pmd, pgtable);
> }
>
> +/**
> + * split_huge_device_private_folio - split a huge device private folio into
> + * smaller pages (of order 0), currently used by migrate_device logic to
> + * split folios for pages that are partially mapped
> + *
> + * @folio: the folio to split
> + *
> + * The caller has to hold the folio_lock and a reference via folio_get
> + */
> +int split_device_private_folio(struct folio *folio)
> +{
> + struct folio *end_folio = folio_next(folio);
> + struct folio *new_folio;
> + int ret = 0;
> +
> + /*
> + * Split the folio now. In the case of device
> + * private pages, this path is executed when
> + * the pmd is split and since freeze is not true
> + * it is likely the folio will be deferred_split.
> + *
> + * With device private pages, deferred splits of
> + * folios should be handled here to prevent partial
> + * unmaps from causing issues later on in migration
> + * and fault handling flows.
> + */
> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
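If nothing guarantees that expected count, I would at least expect the return
value to be checked, e.g. (untested):

	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
		return -EAGAIN;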
> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
It is confusing to call __split_unmapped_folio() when the folio is still mapped...
--Mika
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 11:16 ` Mika Penttilä
@ 2025-07-30 11:27 ` Zi Yan
2025-07-30 11:30 ` Zi Yan
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 11:27 UTC (permalink / raw)
To: Mika Penttilä
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
> Hi,
>
> On 7/30/25 12:21, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages aware of zone
>> device pages. Although the code is designed to be generic when it comes
>> to handling splitting of pages, the code is designed to work for THP
>> page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone device huge
>> entry is present, enabling try_to_migrate() and other code migration
>> paths to appropriately process the entry. page_vma_mapped_walk() will
>> return true for zone device private large folios only when
>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>> not zone device private pages from having to add awareness. The key
>> callback that needs this flag is try_to_migrate_one(). The other
>> callbacks page idle, damon use it for setting young/dirty bits, which is
>> not significant when it comes to pmd level bit harvesting.
>>
>> pmd_pfn() does not work well with zone device entries, use
>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>> entries.
>>
>> Zone device private entries when split via munmap go through pmd split,
>> but need to go through a folio split, deferred split does not work if a
>> fault is encountered because fault handling involves migration entries
>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>> same there. This introduces the need to split the folio while handling
>> the pmd split. Because the folio is still mapped, but calling
>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>> code is used with a new helper to wrap the code
>> split_device_private_folio(), which skips the checks around
>> folio->mapping, swapcache and the need to go through unmap and remap
>> folio.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 1 +
>> include/linux/rmap.h | 2 +
>> include/linux/swapops.h | 17 +++
>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>> mm/page_vma_mapped.c | 13 +-
>> mm/pgtable-generic.c | 6 +
>> mm/rmap.c | 22 +++-
>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>
<snip>
>> +/**
>> + * split_huge_device_private_folio - split a huge device private folio into
>> + * smaller pages (of order 0), currently used by migrate_device logic to
>> + * split folios for pages that are partially mapped
>> + *
>> + * @folio: the folio to split
>> + *
>> + * The caller has to hold the folio_lock and a reference via folio_get
>> + */
>> +int split_device_private_folio(struct folio *folio)
>> +{
>> + struct folio *end_folio = folio_next(folio);
>> + struct folio *new_folio;
>> + int ret = 0;
>> +
>> + /*
>> + * Split the folio now. In the case of device
>> + * private pages, this path is executed when
>> + * the pmd is split and since freeze is not true
>> + * it is likely the folio will be deferred_split.
>> + *
>> + * With device private pages, deferred splits of
>> + * folios should be handled here to prevent partial
>> + * unmaps from causing issues later on in migration
>> + * and fault handling flows.
>> + */
>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>
> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
Based on my off-list conversation with Balbir, the folio is unmapped on the
CPU side but still mapped in the device. folio_ref_freeze() is not aware of
the device-side mapping.
>
>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>
> Confusing to __split_unmapped_folio() if folio is mapped...
From the driver's point of view, __split_unmapped_folio() should probably be
renamed to __split_cpu_unmapped_folio(), since it only deals with the CPU-side
folio metadata for the split.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 00/11] THP support for zone device page migration
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (10 preceding siblings ...)
2025-07-30 9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
@ 2025-07-30 11:30 ` David Hildenbrand
2025-07-30 23:18 ` Alistair Popple
2025-07-31 8:41 ` Balbir Singh
2025-08-05 21:34 ` Matthew Brost
12 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-30 11:30 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 30.07.25 11:21, Balbir Singh wrote:
BTW, I keep getting confused by the topic.
Isn't this essentially
"mm: support device-private THP",
with the migration support just being a necessary requirement to
*enable* device-private THP?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 11:27 ` Zi Yan
@ 2025-07-30 11:30 ` Zi Yan
2025-07-30 11:42 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 11:30 UTC (permalink / raw)
To: Mika Penttilä
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 30 Jul 2025, at 7:27, Zi Yan wrote:
> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>
>> Hi,
>>
>> On 7/30/25 12:21, Balbir Singh wrote:
>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>> device pages. Although the code is designed to be generic when it comes
>>> to handling splitting of pages, the code is designed to work for THP
>>> page sizes corresponding to HPAGE_PMD_NR.
>>>
>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>> entry is present, enabling try_to_migrate() and other code migration
>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>> return true for zone device private large folios only when
>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>> not zone device private pages from having to add awareness. The key
>>> callback that needs this flag is try_to_migrate_one(). The other
>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>> not significant when it comes to pmd level bit harvesting.
>>>
>>> pmd_pfn() does not work well with zone device entries, use
>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>> entries.
>>>
>>> Zone device private entries when split via munmap go through pmd split,
>>> but need to go through a folio split, deferred split does not work if a
>>> fault is encountered because fault handling involves migration entries
>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>> same there. This introduces the need to split the folio while handling
>>> the pmd split. Because the folio is still mapped, but calling
>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>> code is used with a new helper to wrap the code
>>> split_device_private_folio(), which skips the checks around
>>> folio->mapping, swapcache and the need to go through unmap and remap
>>> folio.
>>>
>>> Cc: Karol Herbst <kherbst@redhat.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>> Cc: Shuah Khan <shuah@kernel.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Cc: Jane Chu <jane.chu@oracle.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 1 +
>>> include/linux/rmap.h | 2 +
>>> include/linux/swapops.h | 17 +++
>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>> mm/page_vma_mapped.c | 13 +-
>>> mm/pgtable-generic.c | 6 +
>>> mm/rmap.c | 22 +++-
>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>
>
> <snip>
>
>>> +/**
>>> + * split_huge_device_private_folio - split a huge device private folio into
>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>> + * split folios for pages that are partially mapped
>>> + *
>>> + * @folio: the folio to split
>>> + *
>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>> + */
>>> +int split_device_private_folio(struct folio *folio)
>>> +{
>>> + struct folio *end_folio = folio_next(folio);
>>> + struct folio *new_folio;
>>> + int ret = 0;
>>> +
>>> + /*
>>> + * Split the folio now. In the case of device
>>> + * private pages, this path is executed when
>>> + * the pmd is split and since freeze is not true
>>> + * it is likely the folio will be deferred_split.
>>> + *
>>> + * With device private pages, deferred splits of
>>> + * folios should be handled here to prevent partial
>>> + * unmaps from causing issues later on in migration
>>> + * and fault handling flows.
>>> + */
>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>
> Based on my off-list conversation with Balbir, the folio is unmapped in
> CPU side but mapped in the device. folio_ref_freeze() is not aware of
> device side mapping.
Maybe we should make it aware of the device-private mapping, so that the
process mirrors the CPU-side folio split: 1) unmap the device-private mapping,
2) freeze the device-private folio, 3) split the unmapped folio, 4) unfreeze,
5) remap the device-private mapping (rough sketch below).
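Something like this, where unmap_device_private_mapping() and
remap_device_private_mapping() are hypothetical driver-side helpers and the
refcount bookkeeping is hand-waved; a sketch of the ordering only, not code
from the series:

static int split_device_private_folio_sketch(struct folio *folio)
{
	int ret;

	/* 1) tear down the device-private mapping on the device side */
	unmap_device_private_mapping(folio);

	/* 2) with the device mapping gone, the freeze can actually be checked */
	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
		remap_device_private_mapping(folio);
		return -EAGAIN;
	}

	/* 3) split the metadata of the now fully unmapped folio */
	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);

	/* 4) unfreeze; the split-off tail folios need the same treatment */
	folio_ref_unfreeze(folio, 1);

	/* 5) re-establish the device-private mapping, now at order 0 */
	remap_device_private_mapping(folio);

	return ret;
}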
>
>>
>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>
>> Confusing to __split_unmapped_folio() if folio is mapped...
>
> From driver point of view, __split_unmapped_folio() probably should be renamed
> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
> folio meta data for split.
>
>
> Best Regards,
> Yan, Zi
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 11:30 ` Zi Yan
@ 2025-07-30 11:42 ` Mika Penttilä
2025-07-30 12:08 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 11:42 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 7/30/25 14:30, Zi Yan wrote:
> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>
>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>
>>> Hi,
>>>
>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>> device pages. Although the code is designed to be generic when it comes
>>>> to handling splitting of pages, the code is designed to work for THP
>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>
>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>> entry is present, enabling try_to_migrate() and other code migration
>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>> return true for zone device private large folios only when
>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>> not zone device private pages from having to add awareness. The key
>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>> not significant when it comes to pmd level bit harvesting.
>>>>
>>>> pmd_pfn() does not work well with zone device entries, use
>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>> entries.
>>>>
>>>> Zone device private entries when split via munmap go through pmd split,
>>>> but need to go through a folio split, deferred split does not work if a
>>>> fault is encountered because fault handling involves migration entries
>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>> same there. This introduces the need to split the folio while handling
>>>> the pmd split. Because the folio is still mapped, but calling
>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>> code is used with a new helper to wrap the code
>>>> split_device_private_folio(), which skips the checks around
>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>> folio.
>>>>
>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>> include/linux/huge_mm.h | 1 +
>>>> include/linux/rmap.h | 2 +
>>>> include/linux/swapops.h | 17 +++
>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>> mm/page_vma_mapped.c | 13 +-
>>>> mm/pgtable-generic.c | 6 +
>>>> mm/rmap.c | 22 +++-
>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>
>> <snip>
>>
>>>> +/**
>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>> + * split folios for pages that are partially mapped
>>>> + *
>>>> + * @folio: the folio to split
>>>> + *
>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>> + */
>>>> +int split_device_private_folio(struct folio *folio)
>>>> +{
>>>> + struct folio *end_folio = folio_next(folio);
>>>> + struct folio *new_folio;
>>>> + int ret = 0;
>>>> +
>>>> + /*
>>>> + * Split the folio now. In the case of device
>>>> + * private pages, this path is executed when
>>>> + * the pmd is split and since freeze is not true
>>>> + * it is likely the folio will be deferred_split.
>>>> + *
>>>> + * With device private pages, deferred splits of
>>>> + * folios should be handled here to prevent partial
>>>> + * unmaps from causing issues later on in migration
>>>> + * and fault handling flows.
>>>> + */
>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>> Based on my off-list conversation with Balbir, the folio is unmapped in
>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>> device side mapping.
> Maybe we should make it aware of device private mapping? So that the
> process mirrors CPU side folio split: 1) unmap device private mapping,
> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
> 5) remap device private mapping.
Ah, OK, this was obviously about a device private page here, never mind.
>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>> Confusing to __split_unmapped_folio() if folio is mapped...
>> From driver point of view, __split_unmapped_folio() probably should be renamed
>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>> folio meta data for split.
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Best Regards,
> Yan, Zi
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 11:42 ` Mika Penttilä
@ 2025-07-30 12:08 ` Mika Penttilä
2025-07-30 12:25 ` Zi Yan
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 12:08 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 7/30/25 14:42, Mika Penttilä wrote:
> On 7/30/25 14:30, Zi Yan wrote:
>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>
>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>
>>>> Hi,
>>>>
>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>> device pages. Although the code is designed to be generic when it comes
>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>
>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>> return true for zone device private large folios only when
>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>> not zone device private pages from having to add awareness. The key
>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>
>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>> entries.
>>>>>
>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>> but need to go through a folio split, deferred split does not work if a
>>>>> fault is encountered because fault handling involves migration entries
>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>> same there. This introduces the need to split the folio while handling
>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>> code is used with a new helper to wrap the code
>>>>> split_device_private_folio(), which skips the checks around
>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>> folio.
>>>>>
>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>> ---
>>>>> include/linux/huge_mm.h | 1 +
>>>>> include/linux/rmap.h | 2 +
>>>>> include/linux/swapops.h | 17 +++
>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>> mm/page_vma_mapped.c | 13 +-
>>>>> mm/pgtable-generic.c | 6 +
>>>>> mm/rmap.c | 22 +++-
>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>
>>> <snip>
>>>
>>>>> +/**
>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>> + * split folios for pages that are partially mapped
>>>>> + *
>>>>> + * @folio: the folio to split
>>>>> + *
>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>> + */
>>>>> +int split_device_private_folio(struct folio *folio)
>>>>> +{
>>>>> + struct folio *end_folio = folio_next(folio);
>>>>> + struct folio *new_folio;
>>>>> + int ret = 0;
>>>>> +
>>>>> + /*
>>>>> + * Split the folio now. In the case of device
>>>>> + * private pages, this path is executed when
>>>>> + * the pmd is split and since freeze is not true
>>>>> + * it is likely the folio will be deferred_split.
>>>>> + *
>>>>> + * With device private pages, deferred splits of
>>>>> + * folios should be handled here to prevent partial
>>>>> + * unmaps from causing issues later on in migration
>>>>> + * and fault handling flows.
>>>>> + */
>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>> device side mapping.
>> Maybe we should make it aware of device private mapping? So that the
>> process mirrors CPU side folio split: 1) unmap device private mapping,
>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>> 5) remap device private mapping.
> Ah ok this was about device private page obviously here, nevermind..
Still, isn't this reachable from the split_huge_pmd() paths while the folio is mapped into the CPU page tables as a huge device page by one or more tasks?
>
>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>> folio meta data for split.
>>>
>>>
>>> Best Regards,
>>> Yan, Zi
>> Best Regards,
>> Yan, Zi
>>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 12:08 ` Mika Penttilä
@ 2025-07-30 12:25 ` Zi Yan
2025-07-30 12:49 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 12:25 UTC (permalink / raw)
To: Mika Penttilä
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
> On 7/30/25 14:42, Mika Penttilä wrote:
>> On 7/30/25 14:30, Zi Yan wrote:
>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>
>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>
>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>> return true for zone device private large folios only when
>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>> not zone device private pages from having to add awareness. The key
>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>
>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>> entries.
>>>>>>
>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>> fault is encountered because fault handling involves migration entries
>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>> same there. This introduces the need to split the folio while handling
>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>> code is used with a new helper to wrap the code
>>>>>> split_device_private_folio(), which skips the checks around
>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>> folio.
>>>>>>
>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>> include/linux/huge_mm.h | 1 +
>>>>>> include/linux/rmap.h | 2 +
>>>>>> include/linux/swapops.h | 17 +++
>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>> mm/pgtable-generic.c | 6 +
>>>>>> mm/rmap.c | 22 +++-
>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>
>>>> <snip>
>>>>
>>>>>> +/**
>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> + * split folios for pages that are partially mapped
>>>>>> + *
>>>>>> + * @folio: the folio to split
>>>>>> + *
>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> + */
>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>> +{
>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>> + struct folio *new_folio;
>>>>>> + int ret = 0;
>>>>>> +
>>>>>> + /*
>>>>>> + * Split the folio now. In the case of device
>>>>>> + * private pages, this path is executed when
>>>>>> + * the pmd is split and since freeze is not true
>>>>>> + * it is likely the folio will be deferred_split.
>>>>>> + *
>>>>>> + * With device private pages, deferred splits of
>>>>>> + * folios should be handled here to prevent partial
>>>>>> + * unmaps from causing issues later on in migration
>>>>>> + * and fault handling flows.
>>>>>> + */
>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>> device side mapping.
>>> Maybe we should make it aware of device private mapping? So that the
>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>> 5) remap device private mapping.
>> Ah ok this was about device private page obviously here, nevermind..
>
> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
The folio only has migration entries pointing to it. From the CPU's perspective,
it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
folio by replacing the existing page table entries with migration entries,
and after that the folio is regarded as “unmapped”.
A migration entry is an invalid CPU page table entry, so it is not a CPU
mapping, IIUC.
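To spell out what “not a CPU mapping” means here, a minimal sketch using the
existing swapops.h helpers (the wrapper itself is only for illustration, it is
not something in the series):

static bool pmd_is_cpu_invisible_entry(pmd_t pmd)
{
	swp_entry_t entry;

	/* pmd_none() or a present pmd is not a swap-style entry */
	if (!is_swap_pmd(pmd))
		return false;

	entry = pmd_to_swp_entry(pmd);
	/*
	 * Both migration entries and device-private entries are
	 * non-present from the CPU's point of view.
	 */
	return is_migration_entry(entry) || is_device_private_entry(entry);
}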
>
>>
>>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>> folio meta data for split.
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>> Best Regards,
>>> Yan, Zi
>>>
>
> --Mika
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 12:25 ` Zi Yan
@ 2025-07-30 12:49 ` Mika Penttilä
2025-07-30 15:10 ` Zi Yan
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 12:49 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 7/30/25 15:25, Zi Yan wrote:
> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>
>> On 7/30/25 14:42, Mika Penttilä wrote:
>>> On 7/30/25 14:30, Zi Yan wrote:
>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>
>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>
>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>> return true for zone device private large folios only when
>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>
>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>> entries.
>>>>>>>
>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>> code is used with a new helper to wrap the code
>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>> folio.
>>>>>>>
>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>
>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>> ---
>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>> include/linux/rmap.h | 2 +
>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>> mm/rmap.c | 22 +++-
>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>
>>>>> <snip>
>>>>>
>>>>>>> +/**
>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>> + * split folios for pages that are partially mapped
>>>>>>> + *
>>>>>>> + * @folio: the folio to split
>>>>>>> + *
>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>> + */
>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>> +{
>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>> + struct folio *new_folio;
>>>>>>> + int ret = 0;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Split the folio now. In the case of device
>>>>>>> + * private pages, this path is executed when
>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>> + *
>>>>>>> + * With device private pages, deferred splits of
>>>>>>> + * folios should be handled here to prevent partial
>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>> + * and fault handling flows.
>>>>>>> + */
>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>>> device side mapping.
>>>> Maybe we should make it aware of device private mapping? So that the
>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>> 5) remap device private mapping.
>>> Ah ok this was about device private page obviously here, nevermind..
>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
> The folio only has migration entries pointing to it. From CPU perspective,
> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
> folio by replacing existing page table entries with migration entries
> and after that the folio is regarded as “unmapped”.
>
> The migration entry is an invalid CPU page table entry, so it is not a CPU
split_device_private_folio() is called for a device private entry, not a migration entry, AFAICS.
And it is called from split_huge_pmd() with freeze == false, i.e. from a pmd split, not from a folio split.
> mapping, IIUC.
>
>>>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>> folio meta data for split.
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Yan, Zi
>>>> Best Regards,
>>>> Yan, Zi
>>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 12:49 ` Mika Penttilä
@ 2025-07-30 15:10 ` Zi Yan
2025-07-30 15:40 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 15:10 UTC (permalink / raw)
To: Mika Penttilä
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
> On 7/30/25 15:25, Zi Yan wrote:
>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>
>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>
>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>
>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>> return true for zone device private large folios only when
>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>
>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>> entries.
>>>>>>>>
>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>> folio.
>>>>>>>>
>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>
>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>> ---
>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>>> +/**
>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>> + *
>>>>>>>> + * @folio: the folio to split
>>>>>>>> + *
>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>> + */
>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>> +{
>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>> + struct folio *new_folio;
>>>>>>>> + int ret = 0;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>> + * private pages, this path is executed when
>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>> + *
>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>> + * and fault handling flows.
>>>>>>>> + */
>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>>>> device side mapping.
>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>> 5) remap device private mapping.
>>>> Ah ok this was about device private page obviously here, nevermind..
>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>> The folio only has migration entries pointing to it. From CPU perspective,
>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>> folio by replacing existing page table entries with migration entries
>> and after that the folio is regarded as “unmapped”.
>>
>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>
> split_device_private_folio() is called for device private entry, not migrate entry afaics.
Yes, but from the CPU's perspective, both device private entries and migration
entries are invalid CPU page table entries, so the device private folio is
“unmapped” on the CPU side.
> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
I am not sure that is the right time to split the folio. The device private
folio could be kept unsplit at split_huge_pmd() time.
But from the CPU's perspective, a device private folio has no CPU mapping, so
no other CPU can access or manipulate the folio; it should be OK to split it.
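If it helps, that assumption could be made explicit with a guard along these
lines (hypothetical, not something in the patch):

	/*
	 * Only device-private folios, whose CPU page table entries are
	 * always non-present, should take this CPU-unmapped split path.
	 */
	if (!folio_is_device_private(folio))
		return -EINVAL;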
>
>> mapping, IIUC.
>>
>>>>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>> folio meta data for split.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>> Best Regards,
>>>>> Yan, Zi
>>>>>
>>> --Mika
>>
>> Best Regards,
>> Yan, Zi
>>
> --Mika
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 15:10 ` Zi Yan
@ 2025-07-30 15:40 ` Mika Penttilä
2025-07-30 15:58 ` Zi Yan
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 15:40 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 7/30/25 18:10, Zi Yan wrote:
> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>
>> On 7/30/25 15:25, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>
>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>
>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>
>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>> entries.
>>>>>>>>>
>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>> folio.
>>>>>>>>>
>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>> ---
>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> +/**
>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>> + *
>>>>>>>>> + * @folio: the folio to split
>>>>>>>>> + *
>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> + */
>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>> +{
>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>> + struct folio *new_folio;
>>>>>>>>> + int ret = 0;
>>>>>>>>> +
>>>>>>>>> + /*
>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>> + *
>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>> + * and fault handling flows.
>>>>>>>>> + */
>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>>>>> device side mapping.
>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>> 5) remap device private mapping.
>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>> The folio only has migration entries pointing to it. From CPU perspective,
>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>> folio by replacing existing page table entries with migration entries
>>> and after that the folio is regarded as “unmapped”.
>>>
>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
> Yes, but from CPU perspective, both device private entry and migration entry
> are invalid CPU page table entries, so the device private folio is “unmapped”
> at CPU side.
Yes, both are "swap entries", but there is a difference: the device private
ones contribute to the mapcount and refcount.
Another thing that might confuse is that v1 of the series had only
migrate_vma_split_pages(), which operated only on truly unmapped
(mapcount-wise) folios; that was a motivation for __split_unmapped_folio().
Now split_device_private_folio() operates on folios with mapcount != 0.
>
>
>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
> I am not sure that is the right timing of splitting a folio. The device private
> folio can be kept without splitting at split_huge_pmd() time.
Yes, this doesn't look quite right, and the
+ folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
call also looks suspicious.
Maybe split_device_private_folio() tries to solve some corner case, but it
would be good to elaborate on the exact conditions; there might be a better fix.
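To be explicit about why it looks suspicious: the freeze is simply assumed to
succeed. A checked variant would look something like this (the error handling
is hypothetical):

	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
		return -EBUSY;	/* unexpected extra references, bail out */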
>
> But from CPU perspective, a device private folio has no CPU mapping, no other
> CPU can access or manipulate the folio. It should be OK to split it.
>
>>> mapping, IIUC.
>>>
>>>>>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>> folio meta data for split.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yan, Zi
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>>
>>>> --Mika
>>> Best Regards,
>>> Yan, Zi
>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 15:40 ` Mika Penttilä
@ 2025-07-30 15:58 ` Zi Yan
2025-07-30 16:29 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 15:58 UTC (permalink / raw)
To: Mika Penttilä
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
> On 7/30/25 18:10, Zi Yan wrote:
>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>
>>> On 7/30/25 15:25, Zi Yan wrote:
>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>
>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>
>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>
>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>> entries.
>>>>>>>>>>
>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>> folio.
>>>>>>>>>>
>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>> ---
>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>> + *
>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>> + *
>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>> + */
>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>> +{
>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>> + int ret = 0;
>>>>>>>>>> +
>>>>>>>>>> + /*
>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>> + *
>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>> + */
>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>> device side mapping.
>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>> 5) remap device private mapping.
>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>> folio by replacing existing page table entries with migration entries
>>>> and after that the folio is regarded as “unmapped”.
>>>>
>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>> Yes, but from CPU perspective, both device private entry and migration entry
>> are invalid CPU page table entries, so the device private folio is “unmapped”
>> at CPU side.
>
> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
Right. That confused me when I was talking to Balbir and looking at v1.
When a device private folio is processed in __folio_split(), Balbir needed to
add code to skip CPU mapping handling code. Basically device private folios are
CPU unmapped and device mapped.
Here are my questions on device private folios:
1. How is mapcount used for device private folios? Why is it needed from the CPU
perspective? Can it be stored in a device-private-specific data structure?
2. When a device private folio is mapped on the device, can someone other than
the device driver manipulate it, assuming core-mm just skips device private
folios (barring the CPU access fault handling)?
Where I am going with this is: can device private folios be treated as unmapped folios
by the CPU, with only the device driver manipulating their mappings?
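For illustration, a minimal sketch of the 1)-5) flow I proposed above. folio_ref_freeze()/folio_ref_unfreeze(), folio_expected_ref_count() and __split_unmapped_folio() are the existing interfaces used by this patch; device_private_unmap()/device_private_remap() are hypothetical driver-side helpers that do not exist today, and refcount redistribution across the new folios after the split is hand-waved:

/* Sketch only, not a working implementation. */
static int split_device_private_folio_sketch(struct folio *folio)
{
	int ret;

	/* 1) tear down the device-side mappings (hypothetical helper) */
	device_private_unmap(folio);

	/* 2) freeze; this can fail if there are unexpected extra references */
	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
		return -EAGAIN;

	/* 3) split the now fully unmapped folio down to order 0 */
	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);

	/* 4) unfreeze (per-folio refcount redistribution omitted here) */
	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));

	/* 5) re-establish the device-side mappings (hypothetical helper) */
	device_private_remap(folio);

	return ret;
}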
>
> Also which might confuse is that v1 of the series had only
> migrate_vma_split_pages()
> which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
> Now,
> split_device_private_folio()
> operates on mapcount != 0 folios.
>
>>
>>
>>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
>> I am not sure that is the right timing of splitting a folio. The device private
>> folio can be kept without splitting at split_huge_pmd() time.
>
> Yes this doesn't look quite right, and also
> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
I wonder if we need to freeze a device private folio at all. Can anyone other than
the device driver change its refcount, given that the CPU just sees it as an unmapped folio?
>
> looks suspicious
>
> Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
> more the exact conditions, there might be a better fix.
>
>>
>> But from CPU perspective, a device private folio has no CPU mapping, no other
>> CPU can access or manipulate the folio. It should be OK to split it.
>>
>>>> mapping, IIUC.
>>>>
>>>>>>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>>> folio meta data for split.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 15:58 ` Zi Yan
@ 2025-07-30 16:29 ` Mika Penttilä
2025-07-31 7:15 ` David Hildenbrand
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 16:29 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 7/30/25 18:58, Zi Yan wrote:
> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>
>> On 7/30/25 18:10, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>
>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>
>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>
>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>> entries.
>>>>>>>>>>>
>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>> folio.
>>>>>>>>>>>
>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>> ---
>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>
>>>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>>>> +/**
>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>> + *
>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>> + *
>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>> + */
>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>> +{
>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>> +
>>>>>>>>>>> + /*
>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>> + *
>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>> + */
>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>> device side mapping.
>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>> 5) remap device private mapping.
>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>> folio by replacing existing page table entries with migration entries
>>>>> and after that the folio is regarded as “unmapped”.
>>>>>
>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>> Yes, but from CPU perspective, both device private entry and migration entry
>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>> at CPU side.
>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
> Right. That confused me when I was talking to Balbir and looking at v1.
> When a device private folio is processed in __folio_split(), Balbir needed to
> add code to skip CPU mapping handling code. Basically device private folios are
> CPU unmapped and device mapped.
>
> Here are my questions on device private folios:
> 1. How is mapcount used for device private folios? Why is it needed from CPU
> perspective? Can it be stored in a device private specific data structure?
Mostly the same way as for normal folios, for instance the rmap walk when migrating. I think it would make
the common code messier if it were not done that way, but it is certainly possible.
And not consuming pfns (address space) at all would have benefits.
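To make that concrete, a rough sketch (the helper name is made up; folio_mapped() and try_to_migrate() are the real interfaces): the unmap step of device migration reuses the generic rmap walk, which only works because device-private PTEs are accounted in the mapcount.

/* Sketch: device-private folios go through the ordinary rmap walk. */
static void sketch_unmap_for_migration(struct folio *folio)
{
	/*
	 * folio_mapped() sees device-private PTEs because they are counted
	 * in the mapcount; try_to_migrate() then walks the rmap and replaces
	 * each of them with a migration entry.
	 */
	if (folio_mapped(folio))
		try_to_migrate(folio, 0);
}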
> 2. When a device private folio is mapped on device, can someone other than
> the device driver manipulate it assuming core-mm just skips device private
> folios (barring the CPU access fault handling)?
>
> Where I am going is that can device private folios be treated as unmapped folios
> by CPU and only device driver manipulates their mappings?
>
Yes, they are not present from the CPU's point of view, but the mm still has bookkeeping on them. The private
page has no content that anyone could change while it is in the device; it is just a pfn.
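A small sketch of that bookkeeping, using existing interfaces only: given a non-present PTE, core mm can recognise a device private entry and get back to the struct folio behind it, which carries the usual refcount/mapcount.

/* Sketch: decode a device private PTE back to its folio. */
static struct folio *sketch_devpriv_pte_to_folio(pte_t pteval)
{
	swp_entry_t entry;

	if (pte_present(pteval) || pte_none(pteval))
		return NULL;

	entry = pte_to_swp_entry(pteval);
	if (!is_device_private_entry(entry))
		return NULL;

	/* the pfn stored in the entry points at the device private page */
	return page_folio(pfn_swap_entry_to_page(entry));
}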
>> Also which might confuse is that v1 of the series had only
>> migrate_vma_split_pages()
>> which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
>> Now,
>> split_device_private_folio()
>> operates on mapcount != 0 folios.
>>
>>>
>>>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
>>> I am not sure that is the right timing of splitting a folio. The device private
>>> folio can be kept without splitting at split_huge_pmd() time.
>> Yes this doesn't look quite right, and also
>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> I wonder if we need to freeze a device private folio. Can anyone other than
> device driver change its refcount? Since CPU just sees it as an unmapped folio.
>
>> looks suspicious
>>
>> Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
>> more the exact conditions, there might be a better fix.
>>
>>> But from CPU perspective, a device private folio has no CPU mapping, no other
>>> CPU can access or manipulate the folio. It should be OK to split it.
>>>
>>>>> mapping, IIUC.
>>>>>
>>>>>>>>>>> + ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>>> Confusing to __split_unmapped_folio() if folio is mapped...
>>>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>>>> folio meta data for split.
>
>
> Best Regards,
> Yan, Zi
>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-30 11:16 ` Mika Penttilä
@ 2025-07-30 20:05 ` kernel test robot
1 sibling, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-30 20:05 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Mika Penttilä, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi Balbir,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250730]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250730092139.3890844-3-balbirs%40nvidia.com
patch subject: [v2 02/11] mm/thp: zone_device awareness in THP handling code
config: i386-buildonly-randconfig-001-20250731 (https://download.01.org/0day-ci/archive/20250731/202507310343.ZipoyitU-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507310343.ZipoyitU-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507310343.ZipoyitU-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/rmap.c: In function 'try_to_migrate_one':
>> mm/rmap.c:2330:39: warning: unused variable 'pfn' [-Wunused-variable]
2330 | unsigned long pfn;
| ^~~
vim +/pfn +2330 mm/rmap.c
2273
2274 /*
2275 * @arg: enum ttu_flags will be passed to this argument.
2276 *
2277 * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs
2278 * containing migration entries.
2279 */
2280 static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
2281 unsigned long address, void *arg)
2282 {
2283 struct mm_struct *mm = vma->vm_mm;
2284 DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
2285 PVMW_THP_DEVICE_PRIVATE);
2286 bool anon_exclusive, writable, ret = true;
2287 pte_t pteval;
2288 struct page *subpage;
2289 struct mmu_notifier_range range;
2290 enum ttu_flags flags = (enum ttu_flags)(long)arg;
2291 unsigned long pfn;
2292 unsigned long hsz = 0;
2293
2294 /*
2295 * When racing against e.g. zap_pte_range() on another cpu,
2296 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
2297 * try_to_migrate() may return before page_mapped() has become false,
2298 * if page table locking is skipped: use TTU_SYNC to wait for that.
2299 */
2300 if (flags & TTU_SYNC)
2301 pvmw.flags = PVMW_SYNC;
2302
2303 /*
2304 * For THP, we have to assume the worse case ie pmd for invalidation.
2305 * For hugetlb, it could be much worse if we need to do pud
2306 * invalidation in the case of pmd sharing.
2307 *
2308 * Note that the page can not be free in this function as call of
2309 * try_to_unmap() must hold a reference on the page.
2310 */
2311 range.end = vma_address_end(&pvmw);
2312 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
2313 address, range.end);
2314 if (folio_test_hugetlb(folio)) {
2315 /*
2316 * If sharing is possible, start and end will be adjusted
2317 * accordingly.
2318 */
2319 adjust_range_if_pmd_sharing_possible(vma, &range.start,
2320 &range.end);
2321
2322 /* We need the huge page size for set_huge_pte_at() */
2323 hsz = huge_page_size(hstate_vma(vma));
2324 }
2325 mmu_notifier_invalidate_range_start(&range);
2326
2327 while (page_vma_mapped_walk(&pvmw)) {
2328 /* PMD-mapped THP migration entry */
2329 if (!pvmw.pte) {
> 2330 unsigned long pfn;
2331
2332 if (flags & TTU_SPLIT_HUGE_PMD) {
2333 split_huge_pmd_locked(vma, pvmw.address,
2334 pvmw.pmd, true);
2335 ret = false;
2336 page_vma_mapped_walk_done(&pvmw);
2337 break;
2338 }
2339 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
2340 /*
2341 * Zone device private folios do not work well with
2342 * pmd_pfn() on some architectures due to pte
2343 * inversion.
2344 */
2345 if (is_pmd_device_private_entry(*pvmw.pmd)) {
2346 swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
2347
2348 pfn = swp_offset_pfn(entry);
2349 } else {
2350 pfn = pmd_pfn(*pvmw.pmd);
2351 }
2352
2353 subpage = folio_page(folio, pfn - folio_pfn(folio));
2354
2355 VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
2356 !folio_test_pmd_mappable(folio), folio);
2357
2358 if (set_pmd_migration_entry(&pvmw, subpage)) {
2359 ret = false;
2360 page_vma_mapped_walk_done(&pvmw);
2361 break;
2362 }
2363 continue;
2364 #endif
2365 }
2366
2367 /* Unexpected PMD-mapped THP? */
2368 VM_BUG_ON_FOLIO(!pvmw.pte, folio);
2369
2370 /*
2371 * Handle PFN swap PTEs, such as device-exclusive ones, that
2372 * actually map pages.
2373 */
2374 pteval = ptep_get(pvmw.pte);
2375 if (likely(pte_present(pteval))) {
2376 pfn = pte_pfn(pteval);
2377 } else {
2378 pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
2379 VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
2380 }
2381
2382 subpage = folio_page(folio, pfn - folio_pfn(folio));
2383 address = pvmw.address;
2384 anon_exclusive = folio_test_anon(folio) &&
2385 PageAnonExclusive(subpage);
2386
2387 if (folio_test_hugetlb(folio)) {
2388 bool anon = folio_test_anon(folio);
2389
2390 /*
2391 * huge_pmd_unshare may unmap an entire PMD page.
2392 * There is no way of knowing exactly which PMDs may
2393 * be cached for this mm, so we must flush them all.
2394 * start/end were already adjusted above to cover this
2395 * range.
2396 */
2397 flush_cache_range(vma, range.start, range.end);
2398
2399 /*
2400 * To call huge_pmd_unshare, i_mmap_rwsem must be
2401 * held in write mode. Caller needs to explicitly
2402 * do this outside rmap routines.
2403 *
2404 * We also must hold hugetlb vma_lock in write mode.
2405 * Lock order dictates acquiring vma_lock BEFORE
2406 * i_mmap_rwsem. We can only try lock here and
2407 * fail if unsuccessful.
2408 */
2409 if (!anon) {
2410 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
2411 if (!hugetlb_vma_trylock_write(vma)) {
2412 page_vma_mapped_walk_done(&pvmw);
2413 ret = false;
2414 break;
2415 }
2416 if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
2417 hugetlb_vma_unlock_write(vma);
2418 flush_tlb_range(vma,
2419 range.start, range.end);
2420
2421 /*
2422 * The ref count of the PMD page was
2423 * dropped which is part of the way map
2424 * counting is done for shared PMDs.
2425 * Return 'true' here. When there is
2426 * no other sharing, huge_pmd_unshare
2427 * returns false and we will unmap the
2428 * actual page and drop map count
2429 * to zero.
2430 */
2431 page_vma_mapped_walk_done(&pvmw);
2432 break;
2433 }
2434 hugetlb_vma_unlock_write(vma);
2435 }
2436 /* Nuke the hugetlb page table entry */
2437 pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
2438 if (pte_dirty(pteval))
2439 folio_mark_dirty(folio);
2440 writable = pte_write(pteval);
2441 } else if (likely(pte_present(pteval))) {
2442 flush_cache_page(vma, address, pfn);
2443 /* Nuke the page table entry. */
2444 if (should_defer_flush(mm, flags)) {
2445 /*
2446 * We clear the PTE but do not flush so potentially
2447 * a remote CPU could still be writing to the folio.
2448 * If the entry was previously clean then the
2449 * architecture must guarantee that a clear->dirty
2450 * transition on a cached TLB entry is written through
2451 * and traps if the PTE is unmapped.
2452 */
2453 pteval = ptep_get_and_clear(mm, address, pvmw.pte);
2454
2455 set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
2456 } else {
2457 pteval = ptep_clear_flush(vma, address, pvmw.pte);
2458 }
2459 if (pte_dirty(pteval))
2460 folio_mark_dirty(folio);
2461 writable = pte_write(pteval);
2462 } else {
2463 pte_clear(mm, address, pvmw.pte);
2464 writable = is_writable_device_private_entry(pte_to_swp_entry(pteval));
2465 }
2466
2467 VM_WARN_ON_FOLIO(writable && folio_test_anon(folio) &&
2468 !anon_exclusive, folio);
2469
2470 /* Update high watermark before we lower rss */
2471 update_hiwater_rss(mm);
2472
2473 if (PageHWPoison(subpage)) {
2474 VM_WARN_ON_FOLIO(folio_is_device_private(folio), folio);
2475
2476 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
2477 if (folio_test_hugetlb(folio)) {
2478 hugetlb_count_sub(folio_nr_pages(folio), mm);
2479 set_huge_pte_at(mm, address, pvmw.pte, pteval,
2480 hsz);
2481 } else {
2482 dec_mm_counter(mm, mm_counter(folio));
2483 set_pte_at(mm, address, pvmw.pte, pteval);
2484 }
2485 } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
2486 !userfaultfd_armed(vma)) {
2487 /*
2488 * The guest indicated that the page content is of no
2489 * interest anymore. Simply discard the pte, vmscan
2490 * will take care of the rest.
2491 * A future reference will then fault in a new zero
2492 * page. When userfaultfd is active, we must not drop
2493 * this page though, as its main user (postcopy
2494 * migration) will not expect userfaults on already
2495 * copied pages.
2496 */
2497 dec_mm_counter(mm, mm_counter(folio));
2498 } else {
2499 swp_entry_t entry;
2500 pte_t swp_pte;
2501
2502 /*
2503 * arch_unmap_one() is expected to be a NOP on
2504 * architectures where we could have PFN swap PTEs,
2505 * so we'll not check/care.
2506 */
2507 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
2508 if (folio_test_hugetlb(folio))
2509 set_huge_pte_at(mm, address, pvmw.pte,
2510 pteval, hsz);
2511 else
2512 set_pte_at(mm, address, pvmw.pte, pteval);
2513 ret = false;
2514 page_vma_mapped_walk_done(&pvmw);
2515 break;
2516 }
2517
2518 /* See folio_try_share_anon_rmap_pte(): clear PTE first. */
2519 if (folio_test_hugetlb(folio)) {
2520 if (anon_exclusive &&
2521 hugetlb_try_share_anon_rmap(folio)) {
2522 set_huge_pte_at(mm, address, pvmw.pte,
2523 pteval, hsz);
2524 ret = false;
2525 page_vma_mapped_walk_done(&pvmw);
2526 break;
2527 }
2528 } else if (anon_exclusive &&
2529 folio_try_share_anon_rmap_pte(folio, subpage)) {
2530 set_pte_at(mm, address, pvmw.pte, pteval);
2531 ret = false;
2532 page_vma_mapped_walk_done(&pvmw);
2533 break;
2534 }
2535
2536 /*
2537 * Store the pfn of the page in a special migration
2538 * pte. do_swap_page() will wait until the migration
2539 * pte is removed and then restart fault handling.
2540 */
2541 if (writable)
2542 entry = make_writable_migration_entry(
2543 page_to_pfn(subpage));
2544 else if (anon_exclusive)
2545 entry = make_readable_exclusive_migration_entry(
2546 page_to_pfn(subpage));
2547 else
2548 entry = make_readable_migration_entry(
2549 page_to_pfn(subpage));
2550 if (likely(pte_present(pteval))) {
2551 if (pte_young(pteval))
2552 entry = make_migration_entry_young(entry);
2553 if (pte_dirty(pteval))
2554 entry = make_migration_entry_dirty(entry);
2555 swp_pte = swp_entry_to_pte(entry);
2556 if (pte_soft_dirty(pteval))
2557 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2558 if (pte_uffd_wp(pteval))
2559 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2560 } else {
2561 swp_pte = swp_entry_to_pte(entry);
2562 if (pte_swp_soft_dirty(pteval))
2563 swp_pte = pte_swp_mksoft_dirty(swp_pte);
2564 if (pte_swp_uffd_wp(pteval))
2565 swp_pte = pte_swp_mkuffd_wp(swp_pte);
2566 }
2567 if (folio_test_hugetlb(folio))
2568 set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
2569 hsz);
2570 else
2571 set_pte_at(mm, address, pvmw.pte, swp_pte);
2572 trace_set_migration_pte(address, pte_val(swp_pte),
2573 folio_order(folio));
2574 /*
2575 * No need to invalidate here it will synchronize on
2576 * against the special swap migration pte.
2577 */
2578 }
2579
2580 if (unlikely(folio_test_hugetlb(folio)))
2581 hugetlb_remove_rmap(folio);
2582 else
2583 folio_remove_rmap_pte(folio, subpage, vma);
2584 if (vma->vm_flags & VM_LOCKED)
2585 mlock_drain_local();
2586 folio_put(folio);
2587 }
2588
2589 mmu_notifier_invalidate_range_end(&range);
2590
2591 return ret;
2592 }
2593
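One possible way to silence the warning, sketched against the excerpt above (whether the author fixes it this way in the next revision is a separate question), is to scope the inner declaration to the only block that uses it:

		if (!pvmw.pte) {
-			unsigned long pfn;
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+			unsigned long pfn;
+#endif

			if (flags & TTU_SPLIT_HUGE_PMD) {

Alternatively, the inner declaration could simply be dropped and the pfn already declared at the top of try_to_migrate_one() reused, since that one is used unconditionally later in the function.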
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 00/11] THP support for zone device page migration
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
@ 2025-07-30 23:18 ` Alistair Popple
2025-07-31 8:41 ` Balbir Singh
1 sibling, 0 replies; 71+ messages in thread
From: Alistair Popple @ 2025-07-30 23:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
Jane Chu, Donet Tom, Ralph Campbell, Mika Penttilä,
Matthew Brost, Francois Dugast
On Wed, Jul 30, 2025 at 01:30:13PM +0200, David Hildenbrand wrote:
> On 30.07.25 11:21, Balbir Singh wrote:
>
> BTW, I keep getting confused by the topic.
>
> Isn't this essentially
>
> "mm: support device-private THP"
>
> and the support for migration is just a necessary requirement to *enable*
> device private?
Yes, that's a good point. Migration is one component, but there is also fault
handling, etc., so I think calling this "support device-private THP" makes sense.
- Alistair
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-30 16:29 ` Mika Penttilä
@ 2025-07-31 7:15 ` David Hildenbrand
2025-07-31 8:39 ` Balbir Singh
2025-07-31 11:26 ` Zi Yan
0 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31 7:15 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 30.07.25 18:29, Mika Penttilä wrote:
>
> On 7/30/25 18:58, Zi Yan wrote:
>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>
>>> On 7/30/25 18:10, Zi Yan wrote:
>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>
>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>
>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>> entries.
>>>>>>>>>>>>
>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>> folio.
>>>>>>>>>>>>
>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>> <snip>
>>>>>>>>>>
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>> + */
>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>> +
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>> device side mapping.
>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>> 5) remap device private mapping.
>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>> folio by replacing existing page table entries with migration entries
>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>
>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>> at CPU side.
>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>> Right. That confused me when I was talking to Balbir and looking at v1.
>> When a device private folio is processed in __folio_split(), Balbir needed to
>> add code to skip CPU mapping handling code. Basically device private folios are
>> CPU unmapped and device mapped.
>>
>> Here are my questions on device private folios:
>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>> perspective? Can it be stored in a device private specific data structure?
>
> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
> common code more messy if not done that way but sure possible.
> And not consuming pfns (address space) at all would have benefits.
>
>> 2. When a device private folio is mapped on device, can someone other than
>> the device driver manipulate it assuming core-mm just skips device private
>> folios (barring the CPU access fault handling)?
>>
>> Where I am going is that can device private folios be treated as unmapped folios
>> by CPU and only device driver manipulates their mappings?
>>
> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
> someone could change while in device, it's just pfn.
Just to clarify: a device-private entry, like a device-exclusive entry,
is a *page table mapping* tracked through the rmap -- even though such
entries are not present page table entries.
It would be better if they were present page table entries that are
PROT_NONE, but it's tricky to mark them as being "special"
device-private, device-exclusive, etc. Maybe there are ways to do that in
the future.
Maybe device-private could just be PROT_NONE, because we can identify
the entry type based on the folio. device-exclusive is harder ...
So consider device-private entries to be just like PROT_NONE present page
table entries. The refcount and mapcount are adjusted accordingly by the
rmap functions.
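A simplified sketch of what that accounting looks like when such an entry is installed for an anon folio (the helper below is made up; folio_add_anon_rmap_pte(), swp_entry_to_pte() and set_pte_at() are the real interfaces, and locking and flag handling are omitted):

static void sketch_install_devpriv_pte(struct vm_area_struct *vma,
				       unsigned long addr, pte_t *ptep,
				       struct folio *folio, struct page *page,
				       swp_entry_t entry)
{
	/* same refcount/mapcount bookkeeping as for a present PTE */
	folio_get(folio);
	folio_add_anon_rmap_pte(folio, page, vma, addr, RMAP_NONE);
	/* the entry itself is a non-present, swap-style PTE */
	set_pte_at(vma->vm_mm, addr, ptep, swp_entry_to_pte(entry));
}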
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-31 7:15 ` David Hildenbrand
@ 2025-07-31 8:39 ` Balbir Singh
2025-07-31 11:26 ` Zi Yan
1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-31 8:39 UTC (permalink / raw)
To: David Hildenbrand, Mika Penttilä, Zi Yan
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 7/31/25 17:15, David Hildenbrand wrote:
> On 30.07.25 18:29, Mika Penttilä wrote:
>>
>> On 7/30/25 18:58, Zi Yan wrote:
>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>> <snip>
>>>>>>>>>>>
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>> device side mapping.
>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>
>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>> at CPU side.
>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>> add code to skip CPU mapping handling code. Basically device private folios are
>>> CPU unmapped and device mapped.
>>>
>>> Here are my questions on device private folios:
>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>> perspective? Can it be stored in a device private specific data structure?
>>
>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>> common code more messy if not done that way but sure possible.
>> And not consuming pfns (address space) at all would have benefits.
>>
>>> 2. When a device private folio is mapped on device, can someone other than
>>> the device driver manipulate it assuming core-mm just skips device private
>>> folios (barring the CPU access fault handling)?
>>>
>>> Where I am going is that can device private folios be treated as unmapped folios
>>> by CPU and only device driver manipulates their mappings?
>>>
>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>> someone could change while in device, it's just pfn.
>
> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>
> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>
> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>
>
> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>
Thanks for clarifying on my behalf, I am just catching up with the discussion.
When I was referring to "mapped" in the discussion with Zi, I was talking about how touching the entry will cause a fault and a migration back; the entries can be considered unmapped in that sense, because the pages are mapped on the device. Device private entries are mapped into the page tables and have a refcount associated with the page/folio that represents the device entries.
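A simplified sketch of that fault-and-migrate-back path (condensed from the do_swap_page() handling of device private entries; error handling, locking and refcount checks are omitted):

static vm_fault_t sketch_devpriv_fault(struct vm_fault *vmf, swp_entry_t entry)
{
	struct page *page = pfn_swap_entry_to_page(entry);

	vmf->page = page;
	/* ask the owning driver to migrate the data back to system RAM */
	return page_pgmap(page)->ops->migrate_to_ram(vmf);
}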
Balbir Singh
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 00/11] THP support for zone device page migration
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
2025-07-30 23:18 ` Alistair Popple
@ 2025-07-31 8:41 ` Balbir Singh
2025-07-31 8:56 ` David Hildenbrand
1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-31 8:41 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 7/30/25 21:30, David Hildenbrand wrote:
> On 30.07.25 11:21, Balbir Singh wrote:
>
> BTW, I keep getting confused by the topic.
>
> Isn't this essentially
>
> "mm: support device-private THP"
>
> and the support for migration is just a necessary requirement to *enable* device private?
>
I agree, I can change the title, but the focus of the use case is to
support THP migration for improved latency and throughput. All of that
involves supporting device-private THP.
Balbir Singh
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 00/11] THP support for zone device page migration
2025-07-31 8:41 ` Balbir Singh
@ 2025-07-31 8:56 ` David Hildenbrand
0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31 8:56 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 31.07.25 10:41, Balbir Singh wrote:
> On 7/30/25 21:30, David Hildenbrand wrote:
>> On 30.07.25 11:21, Balbir Singh wrote:
>>
>> BTW, I keep getting confused by the topic.
>>
>> Isn't this essentially
>>
>> "mm: support device-private THP"
>>
>> and the support for migration is just a necessary requirement to *enable* device private?
>>
>
> I agree, I can change the title, but the focus of the use case is to
> support THP migration for improved latency and throughput. All of that
> involves support of device-private THP
Well, the subject as-is makes one believe that THP support for
zone-device pages is already there and that you are only adding
migration support.
That was the confusing part to me, because in the very first patch you
add ... THP support for (selected/private) zone device pages.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 07/11] mm/thp: add split during migration support
2025-07-30 9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
@ 2025-07-31 10:04 ` kernel test robot
0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-31 10:04 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: llvm, oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Hi Balbir,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20250731]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250730092139.3890844-8-balbirs%40nvidia.com
patch subject: [v2 07/11] mm/thp: add split during migration support
config: x86_64-randconfig-071-20250731 (https://download.01.org/0day-ci/archive/20250731/202507311724.mavZerV1-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507311724.mavZerV1-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507311724.mavZerV1-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/migrate_device.c:1082:5: error: statement requires expression of scalar type ('void' invalid)
1082 | if (migrate_vma_split_pages(migrate, i, addr,
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1083 | folio)) {
| ~~~~~~
1 error generated.
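A hedged guess at the cause, not confirmed against the patch: in this configuration the stub version of migrate_vma_split_pages() presumably returns void, while the caller at line 1082 tests its return value. If so, one way to reconcile the two is to give the stub the same int return type as the real function, along these lines (the parameter names and the config guard are assumptions based on the call site quoted in the error above):

/* stub used when THP migration is not configured (guard assumed) */
static int migrate_vma_split_pages(struct migrate_vma *migrate,
				   unsigned long idx, unsigned long addr,
				   struct folio *folio)
{
	return -ENOSYS;
}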
vim +1082 mm/migrate_device.c
999
1000 static void __migrate_device_pages(unsigned long *src_pfns,
1001 unsigned long *dst_pfns, unsigned long npages,
1002 struct migrate_vma *migrate)
1003 {
1004 struct mmu_notifier_range range;
1005 unsigned long i, j;
1006 bool notified = false;
1007 unsigned long addr;
1008
1009 for (i = 0; i < npages; ) {
1010 struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
1011 struct page *page = migrate_pfn_to_page(src_pfns[i]);
1012 struct address_space *mapping;
1013 struct folio *newfolio, *folio;
1014 int r, extra_cnt = 0;
1015 unsigned long nr = 1;
1016
1017 if (!newpage) {
1018 src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
1019 goto next;
1020 }
1021
1022 if (!page) {
1023 unsigned long addr;
1024
1025 if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
1026 goto next;
1027
1028 /*
1029 * The only time there is no vma is when called from
1030 * migrate_device_coherent_folio(). However this isn't
1031 * called if the page could not be unmapped.
1032 */
1033 VM_BUG_ON(!migrate);
1034 addr = migrate->start + i*PAGE_SIZE;
1035 if (!notified) {
1036 notified = true;
1037
1038 mmu_notifier_range_init_owner(&range,
1039 MMU_NOTIFY_MIGRATE, 0,
1040 migrate->vma->vm_mm, addr, migrate->end,
1041 migrate->pgmap_owner);
1042 mmu_notifier_invalidate_range_start(&range);
1043 }
1044
1045 if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
1046 (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
1047 nr = HPAGE_PMD_NR;
1048 src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
1049 } else {
1050 nr = 1;
1051 }
1052
1053 for (j = 0; j < nr && i + j < npages; j++) {
1054 src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
1055 migrate_vma_insert_page(migrate,
1056 addr + j * PAGE_SIZE,
1057 &dst_pfns[i+j], &src_pfns[i+j]);
1058 }
1059 goto next;
1060 }
1061
1062 newfolio = page_folio(newpage);
1063 folio = page_folio(page);
1064 mapping = folio_mapping(folio);
1065
1066 /*
1067 * If THP migration is enabled, check if both src and dst
1068 * can migrate large pages
1069 */
1070 if (thp_migration_supported()) {
1071 if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
1072 (src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
1073 !(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
1074
1075 if (!migrate) {
1076 src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
1077 MIGRATE_PFN_COMPOUND);
1078 goto next;
1079 }
1080 nr = 1 << folio_order(folio);
1081 addr = migrate->start + i * PAGE_SIZE;
> 1082 if (migrate_vma_split_pages(migrate, i, addr,
1083 folio)) {
1084 src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
1085 MIGRATE_PFN_COMPOUND);
1086 goto next;
1087 }
1088 } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
1089 (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
1090 !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
1091 src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
1092 }
1093 }
1094
1095
1096 if (folio_is_device_private(newfolio) ||
1097 folio_is_device_coherent(newfolio)) {
1098 if (mapping) {
1099 /*
1100 * For now only support anonymous memory migrating to
1101 * device private or coherent memory.
1102 *
1103 * Try to get rid of swap cache if possible.
1104 */
1105 if (!folio_test_anon(folio) ||
1106 !folio_free_swap(folio)) {
1107 src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
1108 goto next;
1109 }
1110 }
1111 } else if (folio_is_zone_device(newfolio)) {
1112 /*
1113 * Other types of ZONE_DEVICE page are not supported.
1114 */
1115 src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
1116 goto next;
1117 }
1118
1119 BUG_ON(folio_test_writeback(folio));
1120
1121 if (migrate && migrate->fault_page == page)
1122 extra_cnt++;
1123 for (j = 0; j < nr && i + j < npages; j++) {
1124 folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
1125 newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
1126
1127 r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
1128 if (r != MIGRATEPAGE_SUCCESS)
1129 src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
1130 else
1131 folio_migrate_flags(newfolio, folio);
1132 }
1133 next:
1134 i += nr;
1135 }
1136
1137 if (notified)
1138 mmu_notifier_invalidate_range_end(&range);
1139 }
1140
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 71+ messages in thread
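A note on the build error above: clang rejects testing the result of a
function that, in the failing configuration, is declared to return void.
A minimal sketch of the kind of change that resolves it, with the
parameter types inferred from the call site in __migrate_device_pages()
(an illustration only, not the fix applied in a later revision):
        /* sketch: whichever variant of migrate_vma_split_pages() a
         * given config builds, it returns a scalar so the call site
         * "if (migrate_vma_split_pages(...))" is valid everywhere */
        static int migrate_vma_split_pages(struct migrate_vma *migrate,
                                           unsigned long idx,
                                           unsigned long addr,
                                           struct folio *folio)
        {
                /* real splitting work in the THP-migration build; the
                 * fallback stub would simply return -EINVAL */
                return 0;
        }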
* Re: [v2 05/11] lib/test_hmm: test cases and support for zone device private THP
2025-07-30 9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-07-31 11:17 ` kernel test robot
0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-31 11:17 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
Hi Balbir,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250731]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250730092139.3890844-6-balbirs%40nvidia.com
patch subject: [v2 05/11] lib/test_hmm: test cases and support for zone device private THP
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250731/202507311818.V6HUiudq-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 8f09b03aebb71c154f3bbe725c29e3f47d37c26e)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507311818.V6HUiudq-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507311818.V6HUiudq-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> lib/test_hmm.c:1111:6: warning: variable 'dst_pfns' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
1111 | if (!src_pfns)
| ^~~~~~~~~
lib/test_hmm.c:1176:8: note: uninitialized use occurs here
1176 | kfree(dst_pfns);
| ^~~~~~~~
lib/test_hmm.c:1111:2: note: remove the 'if' if its condition is always false
1111 | if (!src_pfns)
| ^~~~~~~~~~~~~~
1112 | goto free_mem;
| ~~~~~~~~~~~~~
lib/test_hmm.c:1097:25: note: initialize the variable 'dst_pfns' to silence this warning
1097 | unsigned long *dst_pfns;
| ^
| = NULL
1 warning generated.
vim +1111 lib/test_hmm.c
1084
1085 static int dmirror_migrate_to_device(struct dmirror *dmirror,
1086 struct hmm_dmirror_cmd *cmd)
1087 {
1088 unsigned long start, end, addr;
1089 unsigned long size = cmd->npages << PAGE_SHIFT;
1090 struct mm_struct *mm = dmirror->notifier.mm;
1091 struct vm_area_struct *vma;
1092 struct dmirror_bounce bounce;
1093 struct migrate_vma args = { 0 };
1094 unsigned long next;
1095 int ret;
1096 unsigned long *src_pfns;
1097 unsigned long *dst_pfns;
1098
1099 start = cmd->addr;
1100 end = start + size;
1101 if (end < start)
1102 return -EINVAL;
1103
1104 /* Since the mm is for the mirrored process, get a reference first. */
1105 if (!mmget_not_zero(mm))
1106 return -EINVAL;
1107
1108 ret = -ENOMEM;
1109 src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
1110 GFP_KERNEL | __GFP_NOFAIL);
> 1111 if (!src_pfns)
1112 goto free_mem;
1113
1114 dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
1115 GFP_KERNEL | __GFP_NOFAIL);
1116 if (!dst_pfns)
1117 goto free_mem;
1118
1119 ret = 0;
1120 mmap_read_lock(mm);
1121 for (addr = start; addr < end; addr = next) {
1122 vma = vma_lookup(mm, addr);
1123 if (!vma || !(vma->vm_flags & VM_READ)) {
1124 ret = -EINVAL;
1125 goto out;
1126 }
1127 next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
1128 if (next > vma->vm_end)
1129 next = vma->vm_end;
1130
1131 args.vma = vma;
1132 args.src = src_pfns;
1133 args.dst = dst_pfns;
1134 args.start = addr;
1135 args.end = next;
1136 args.pgmap_owner = dmirror->mdevice;
1137 args.flags = MIGRATE_VMA_SELECT_SYSTEM |
1138 MIGRATE_VMA_SELECT_COMPOUND;
1139 ret = migrate_vma_setup(&args);
1140 if (ret)
1141 goto out;
1142
1143 pr_debug("Migrating from sys mem to device mem\n");
1144 dmirror_migrate_alloc_and_copy(&args, dmirror);
1145 migrate_vma_pages(&args);
1146 dmirror_migrate_finalize_and_map(&args, dmirror);
1147 migrate_vma_finalize(&args);
1148 }
1149 mmap_read_unlock(mm);
1150 mmput(mm);
1151
1152 /*
1153 * Return the migrated data for verification.
1154 * Only for pages in device zone
1155 */
1156 ret = dmirror_bounce_init(&bounce, start, size);
1157 if (ret)
1158 goto free_mem;
1159 mutex_lock(&dmirror->mutex);
1160 ret = dmirror_do_read(dmirror, start, end, &bounce);
1161 mutex_unlock(&dmirror->mutex);
1162 if (ret == 0) {
1163 if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
1164 bounce.size))
1165 ret = -EFAULT;
1166 }
1167 cmd->cpages = bounce.cpages;
1168 dmirror_bounce_fini(&bounce);
1169 goto free_mem;
1170
1171 out:
1172 mmap_read_unlock(mm);
1173 mmput(mm);
1174 free_mem:
1175 kfree(src_pfns);
1176 kfree(dst_pfns);
1177 return ret;
1178 }
1179
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 71+ messages in thread
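For the -Wsometimes-uninitialized warning above: the usual kernel idiom
is to NULL-initialize both pfn arrays so the shared exit label can free
them unconditionally. A minimal sketch of that shape (allocation flags
elided, kvfree() used to pair with kvcalloc(); not necessarily how the
series resolves it):
        unsigned long *src_pfns = NULL;
        unsigned long *dst_pfns = NULL; /* safe to free on every path */
        int ret = -ENOMEM;

        src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL);
        if (!src_pfns)
                goto free_mem;
        dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL);
        if (!dst_pfns)
                goto free_mem;
        /* ... migration work ... */
free_mem:
        kvfree(src_pfns);   /* kvfree() pairs with kvcalloc(), accepts NULL */
        kvfree(dst_pfns);
        return ret;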
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-31 7:15 ` David Hildenbrand
2025-07-31 8:39 ` Balbir Singh
@ 2025-07-31 11:26 ` Zi Yan
2025-07-31 12:32 ` David Hildenbrand
2025-08-01 0:49 ` Balbir Singh
1 sibling, 2 replies; 71+ messages in thread
From: Zi Yan @ 2025-07-31 11:26 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
> On 30.07.25 18:29, Mika Penttilä wrote:
>>
>> On 7/30/25 18:58, Zi Yan wrote:
>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>> <snip>
>>>>>>>>>>>
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>> device side mapping.
>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>
>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>> at CPU side.
>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>> add code to skip CPU mapping handling code. Basically device private folios are
>>> CPU unmapped and device mapped.
>>>
>>> Here are my questions on device private folios:
>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>> perspective? Can it be stored in a device private specific data structure?
>>
>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>> common code more messy if not done that way but sure possible.
>> And not consuming pfns (address space) at all would have benefits.
>>
>>> 2. When a device private folio is mapped on device, can someone other than
>>> the device driver manipulate it assuming core-mm just skips device private
>>> folios (barring the CPU access fault handling)?
>>>
>>> Where I am going is that can device private folios be treated as unmapped folios
>>> by CPU and only device driver manipulates their mappings?
>>>
>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>> someone could change while in device, it's just pfn.
>
> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>
> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>
> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>
>
> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
Thanks for the clarification.
So folio_mapcount() for device private folios should be treated the same
as normal folios, even if the corresponding PTEs are not accessible from CPUs.
Then I wonder if the device private large folio split should go through
__folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
remap. Otherwise, how can we prevent rmap changes during the split?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
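For readers following the freeze discussion quoted above:
folio_ref_freeze() is built on a compare-and-exchange of the folio
refcount, so it succeeds only when the count is exactly the expected
value at that instant. A simplified model (not the kernel's exact
implementation, which sits behind page_ref_freeze()):
        /* succeeds only if the refcount equals 'count'; any concurrent
         * reference (rmap, GUP, a speculative page-cache lookup) makes
         * the cmpxchg miss and the freeze fail */
        static inline bool freeze_model(atomic_t *refcount, int count)
        {
                return atomic_cmpxchg(refcount, count, 0) == count;
        }
This is why the question "why can't this freeze fail?" matters: unless
the folio is first unmapped (mapcount driven to zero), the expected
count computation and the freeze can race with rmap changes.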
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-31 11:26 ` Zi Yan
@ 2025-07-31 12:32 ` David Hildenbrand
2025-07-31 13:34 ` Zi Yan
2025-08-01 0:49 ` Balbir Singh
1 sibling, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31 12:32 UTC (permalink / raw)
To: Zi Yan
Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 31.07.25 13:26, Zi Yan wrote:
> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>
>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>
>>> On 7/30/25 18:58, Zi Yan wrote:
>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>> device side mapping.
>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>
>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>> at CPU side.
>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>> CPU unmapped and device mapped.
>>>>
>>>> Here are my questions on device private folios:
>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>> perspective? Can it be stored in a device private specific data structure?
>>>
>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>> common code more messy if not done that way but sure possible.
>>> And not consuming pfns (address space) at all would have benefits.
>>>
>>>> 2. When a device private folio is mapped on device, can someone other than
>>>> the device driver manipulate it assuming core-mm just skips device private
>>>> folios (barring the CPU access fault handling)?
>>>>
>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>> by CPU and only device driver manipulates their mappings?
>>>>
>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>> someone could change while in device, it's just pfn.
>>
>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>
>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>
>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>
>>
>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>
> Thanks for the clarification.
>
> So folio_mapcount() for device private folios should be treated the same
> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
> Then I wonder if the device private large folio split should go through
> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
> remap. Otherwise, how can we prevent rmap changes during the split?
That is what I would expect: Replace device-private by migration
entries, perform the migration/split/whatever, restore migration entries
to device-private entries.
That will drive the mapcount to 0.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
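A rough sketch of the ordering David describes, using helpers whose
signatures are stable and comments where the series-specific pieces
would go (an illustration of the expected flow, assuming the caller
holds the folio lock and one reference; this is not code from the
patch):
        static int device_private_split_sketch(struct folio *folio)
        {
                int expected;

                /* 1) unmap: the rmap walk replaces device-private
                 *    entries with migration entries, driving the
                 *    mapcount to zero */
                try_to_migrate(folio, 0);

                /* 2) freeze the now-unmapped folio; extra references
                 *    (GUP, speculative lookups) make this fail */
                expected = folio_expected_ref_count(folio);
                if (!folio_ref_freeze(folio, 1 + expected))
                        return -EAGAIN;

                /* 3) split the unmapped, frozen folio
                 *    (__split_unmapped_folio()-style work goes here) */

                /* 4) unfreeze; a real split would unfreeze each
                 *    resulting folio, only the head is shown here */
                folio_ref_unfreeze(folio, 1 + expected);

                /* 5) remap: restore migration entries to
                 *    device-private entries
                 *    (remove_migration_ptes()-style work goes here) */
                return 0;
        }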
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-31 12:32 ` David Hildenbrand
@ 2025-07-31 13:34 ` Zi Yan
2025-07-31 19:09 ` David Hildenbrand
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-31 13:34 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 31 Jul 2025, at 8:32, David Hildenbrand wrote:
> On 31.07.25 13:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>> folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>>
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>
> That is what I would expect: Replace device-private by migration entries, perform the migration/split/whatever, restore migration entries to device-private entries.
>
> That will drive the mapcount to 0.
Great. That matches my expectations as well. One potential optimization:
since a device private entry is already CPU inaccessible, the TLB flush
can be avoided.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
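The optimization Zi Yan mentions would amount to something like the
following at the point where a present mapping would otherwise be
invalidated (hypothetical placement; 'vma' and 'haddr' stand for the
usual split-path locals):
        /* device-private entries are never present in the CPU page
         * tables, so no TLB entry can exist for them */
        if (!folio_is_device_private(folio))
                flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);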
* Re: [v2 03/11] mm/migrate_device: THP migration of zone device pages
2025-07-30 9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-07-31 16:19 ` kernel test robot
0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-31 16:19 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Mika Penttilä, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi Balbir,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250731]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250730092139.3890844-4-balbirs%40nvidia.com
patch subject: [v2 03/11] mm/migrate_device: THP migration of zone device pages
config: x86_64-randconfig-122-20250731 (https://download.01.org/0day-ci/archive/20250731/202507312342.dmLxVgli-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507312342.dmLxVgli-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507312342.dmLxVgli-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> mm/migrate_device.c:769:13: sparse: sparse: incorrect type in assignment (different base types) @@ expected int [assigned] ret @@ got restricted vm_fault_t @@
mm/migrate_device.c:769:13: sparse: expected int [assigned] ret
mm/migrate_device.c:769:13: sparse: got restricted vm_fault_t
mm/migrate_device.c:130:25: sparse: sparse: context imbalance in 'migrate_vma_collect_huge_pmd' - unexpected unlock
mm/migrate_device.c:815:16: sparse: sparse: context imbalance in 'migrate_vma_insert_huge_pmd_page' - different lock contexts for basic block
vim +769 mm/migrate_device.c
689
690 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
691 /**
692 * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
693 * at @addr. folio is already allocated as a part of the migration process with
694 * large page.
695 *
696 * @folio needs to be initialized and setup after it's allocated. The code bits
697 * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
698 * not support THP zero pages.
699 *
700 * @migrate: migrate_vma arguments
701 * @addr: address where the folio will be inserted
702 * @folio: folio to be inserted at @addr
703 * @src: src pfn which is being migrated
704 * @pmdp: pointer to the pmd
705 */
706 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
707 unsigned long addr,
708 struct page *page,
709 unsigned long *src,
710 pmd_t *pmdp)
711 {
712 struct vm_area_struct *vma = migrate->vma;
713 gfp_t gfp = vma_thp_gfp_mask(vma);
714 struct folio *folio = page_folio(page);
715 int ret;
716 spinlock_t *ptl;
717 pgtable_t pgtable;
718 pmd_t entry;
719 bool flush = false;
720 unsigned long i;
721
722 VM_WARN_ON_FOLIO(!folio, folio);
723 VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
724
725 if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
726 return -EINVAL;
727
728 ret = anon_vma_prepare(vma);
729 if (ret)
730 return ret;
731
732 folio_set_order(folio, HPAGE_PMD_ORDER);
733 folio_set_large_rmappable(folio);
734
735 if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
736 count_vm_event(THP_FAULT_FALLBACK);
737 count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
738 ret = -ENOMEM;
739 goto abort;
740 }
741
742 __folio_mark_uptodate(folio);
743
744 pgtable = pte_alloc_one(vma->vm_mm);
745 if (unlikely(!pgtable))
746 goto abort;
747
748 if (folio_is_device_private(folio)) {
749 swp_entry_t swp_entry;
750
751 if (vma->vm_flags & VM_WRITE)
752 swp_entry = make_writable_device_private_entry(
753 page_to_pfn(page));
754 else
755 swp_entry = make_readable_device_private_entry(
756 page_to_pfn(page));
757 entry = swp_entry_to_pmd(swp_entry);
758 } else {
759 if (folio_is_zone_device(folio) &&
760 !folio_is_device_coherent(folio)) {
761 goto abort;
762 }
763 entry = folio_mk_pmd(folio, vma->vm_page_prot);
764 if (vma->vm_flags & VM_WRITE)
765 entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
766 }
767
768 ptl = pmd_lock(vma->vm_mm, pmdp);
> 769 ret = check_stable_address_space(vma->vm_mm);
770 if (ret)
771 goto abort;
772
773 /*
774 * Check for userfaultfd but do not deliver the fault. Instead,
775 * just back off.
776 */
777 if (userfaultfd_missing(vma))
778 goto unlock_abort;
779
780 if (!pmd_none(*pmdp)) {
781 if (!is_huge_zero_pmd(*pmdp))
782 goto unlock_abort;
783 flush = true;
784 } else if (!pmd_none(*pmdp))
785 goto unlock_abort;
786
787 add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
788 folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
789 if (!folio_is_zone_device(folio))
790 folio_add_lru_vma(folio, vma);
791 folio_get(folio);
792
793 if (flush) {
794 pte_free(vma->vm_mm, pgtable);
795 flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
796 pmdp_invalidate(vma, addr, pmdp);
797 } else {
798 pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
799 mm_inc_nr_ptes(vma->vm_mm);
800 }
801 set_pmd_at(vma->vm_mm, addr, pmdp, entry);
802 update_mmu_cache_pmd(vma, addr, pmdp);
803
804 spin_unlock(ptl);
805
806 count_vm_event(THP_FAULT_ALLOC);
807 count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
808 count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
809
810 return 0;
811
812 unlock_abort:
813 spin_unlock(ptl);
814 abort:
815 for (i = 0; i < HPAGE_PMD_NR; i++)
816 src[i] &= ~MIGRATE_PFN_MIGRATE;
817 return 0;
818 }
819 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
820 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
821 unsigned long addr,
822 struct page *page,
823 unsigned long *src,
824 pmd_t *pmdp)
825 {
826 return 0;
827 }
828 #endif
829
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 71+ messages in thread
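On the first sparse warning above: check_stable_address_space() returns
a __bitwise vm_fault_t, so assigning it to the plain-int 'ret' trips
sparse. The usual remedy is to keep the fault code in its own variable;
a minimal sketch (error propagation simplified, not necessarily how the
series resolves it):
        vm_fault_t fault;

        fault = check_stable_address_space(vma->vm_mm);
        if (fault) {
                ret = -EFAULT;          /* translate as appropriate */
                goto unlock_abort;      /* ptl is held at this point */
        }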
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-31 13:34 ` Zi Yan
@ 2025-07-31 19:09 ` David Hildenbrand
0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31 19:09 UTC (permalink / raw)
To: Zi Yan
Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 31.07.25 15:34, Zi Yan wrote:
> On 31 Jul 2025, at 8:32, David Hildenbrand wrote:
>
>> On 31.07.25 13:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>> folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>>
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
>> That is what I would expect: Replace device-private by migration entries, perform the migration/split/whatever, restore migration entries to device-private entries.
>>
>> That will drive the mapcount to 0.
>
> Great. That matches my expectations as well. One potential optimization could
> be that, since a device private entry is already CPU inaccessible, the TLB
> flush can be avoided.
Right, I would assume that is already done, or could easily be added.
Not using proper migration entries sounds like a hack that we shouldn't
start with. We should start with as few special cases as possible in
core-mm.
For example, as you probably implied, there is nothing stopping a
concurrent fork() or zap from messing with the refcount+mapcount.
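For what it's worth, the reason the flush can be skipped is simply that a
non-present entry is never cached by the TLB. A minimal sketch of that test
(my illustration, not code from the series):

	/*
	 * Illustrative helper: device-private and migration entries are
	 * non-present swap-style entries, so the CPU TLB cannot hold a
	 * translation for them and no flush is needed when replacing them.
	 */
	static inline bool pte_needs_tlb_flush(pte_t pteval)
	{
		return pte_present(pteval);
	}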
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-07-31 11:26 ` Zi Yan
2025-07-31 12:32 ` David Hildenbrand
@ 2025-08-01 0:49 ` Balbir Singh
2025-08-01 1:09 ` Zi Yan
2025-08-01 1:16 ` Mika Penttilä
1 sibling, 2 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-01 0:49 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand
Cc: Mika Penttilä, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 7/31/25 21:26, Zi Yan wrote:
> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>
>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>
>>> On 7/30/25 18:58, Zi Yan wrote:
>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>> device side mapping.
>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>
>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>> at CPU side.
>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>> CPU unmapped and device mapped.
>>>>
>>>> Here are my questions on device private folios:
>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>> perspective? Can it be stored in a device private specific data structure?
>>>
>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>> common code more messy if not done that way but sure possible.
>>> And not consuming pfns (address space) at all would have benefits.
>>>
>>>> 2. When a device private folio is mapped on device, can someone other than
>>>> the device driver manipulate it assuming core-mm just skips device private
>>>> folios (barring the CPU access fault handling)?
>>>>
>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>> by CPU and only device driver manipulates their mappings?
>>>>
>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>> someone could change while in device, it's just pfn.
>>
>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>
>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>
>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>
>>
>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>
> Thanks for the clarification.
>
> So folio_mapcount() for device private folios should be treated the same
> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
> Then I wonder if the device private large folio split should go through
> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
> remap. Otherwise, how can we prevent rmap changes during the split?
>
That is true in general; the special cases I mentioned are:
1. Split during migration (where the sizes on the source/destination do not
   match), so we need to split in the middle of migration. The entries
   there are already unmapped, hence the special handling (a rough sketch
   follows below).
2. The partial unmap case, where we need to split in the context of the unmap
   due to the issues mentioned in the patch. I expanded the folio split code
   for device private folios into its own helper, which does not
   need to do the xas/mapped/lru folio handling. During partial unmap the
   original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked()).
For (2), I spent some time examining the implications of not unmapping the
folios prior to the split; in the partial unmap path, once we split the PMD
the folios diverge. I did not run into any particular race with the tests either.
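For case (1), the shape from a driver's point of view is roughly the
following. This is a hypothetical sketch only; alloc_device_thp() and
split_unmapped_source_folio() are stand-in names for the steps, not existing
API:

	/*
	 * Hypothetical sketch of case (1), illustrative only: the source can
	 * migrate a THP but the destination cannot provide a large page, so
	 * the already-unmapped source folio is split and the migration
	 * continues with order-0 pages.
	 */
	dpage = alloc_device_thp(devmem);		/* hypothetical allocator */
	if (!dpage) {
		ret = split_unmapped_source_folio(folio);	/* hypothetical */
		if (ret)
			return ret;
		/* continue the migration with small pages */
	}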
Balbir Singh
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 0:49 ` Balbir Singh
@ 2025-08-01 1:09 ` Zi Yan
2025-08-01 7:01 ` David Hildenbrand
2025-08-01 1:16 ` Mika Penttilä
1 sibling, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-08-01 1:09 UTC (permalink / raw)
To: Balbir Singh
Cc: David Hildenbrand, Mika Penttilä, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 31 Jul 2025, at 20:49, Balbir Singh wrote:
> On 7/31/25 21:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>> folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>>
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
>
> That is true in general, the special cases I mentioned are:
>
> 1. split during migration (where we the sizes on source/destination do not
> match) and so we need to split in the middle of migration. The entries
> there are already unmapped and hence the special handling
In this case, all device private entries pointing to this device private
folio should be turned into migration entries and folio_mapcount() should
be 0. split_device_private_folio() handles this situation, although
the function name is not very descriptive. You might want to add a comment
to this function about its use and add a check to make sure folio_mapcount()
is 0.
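A minimal sketch of that check (my illustration, using the names from the
patch under discussion):

	/*
	 * Sketch only: the caller is expected to have converted every CPU
	 * entry for this folio into a migration entry, so the CPU mapcount
	 * must already have dropped to zero before the split proceeds.
	 */
	VM_WARN_ON_ONCE_FOLIO(folio_mapcount(folio), folio);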
> 2. Partial unmap case, where we need to split in the context of the unmap
> due to the isses mentioned in the patch. I expanded the folio split code
> for device private can be expanded into its own helper, which does not
> need to do the xas/mapped/lru folio handling. During partial unmap the
> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>
> For (2), I spent some time examining the implications of not unmapping the
> folios prior to split and in the partial unmap path, once we split the PMD
> the folios diverge. I did not run into any particular race either with the
> tests.
For the partial unmap case, you should be able to handle it in the same way
as a normal PTE-mapped large folio since, as David said, each device private
entry can be seen as a PROT_NONE entry. At PMD split, the PMD page table page
should be filled with device private PTEs, each of them pointing to the
corresponding subpage. When some of those PTEs are later unmapped, the rmap
code should take care of folio_mapcount().
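Conceptually, the PMD split would then do something like this for a device
private folio (a sketch only; the real code in __split_huge_pmd_locked() also
has to carry soft-dirty, uffd-wp and write bits):

	/*
	 * Sketch: replace the huge device-private entry with one
	 * device-private PTE per subpage, so that rmap and folio_mapcount()
	 * behave just as they do for ordinary PTE-mapped large folios.
	 */
	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		swp_entry_t entry;

		if (write)
			entry = make_writable_device_private_entry(
						page_to_pfn(page + i));
		else
			entry = make_readable_device_private_entry(
						page_to_pfn(page + i));
		set_pte_at(mm, addr, pte + i, swp_entry_to_pte(entry));
	}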
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 0:49 ` Balbir Singh
2025-08-01 1:09 ` Zi Yan
@ 2025-08-01 1:16 ` Mika Penttilä
2025-08-01 4:44 ` Balbir Singh
1 sibling, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-01 1:16 UTC (permalink / raw)
To: Balbir Singh, Zi Yan, David Hildenbrand
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi,
On 8/1/25 03:49, Balbir Singh wrote:
> On 7/31/25 21:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>> perspective? Can it be stored in a device private specific data structure?
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>> folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
> That is true in general, the special cases I mentioned are:
>
> 1. split during migration (where we the sizes on source/destination do not
> match) and so we need to split in the middle of migration. The entries
> there are already unmapped and hence the special handling
> 2. Partial unmap case, where we need to split in the context of the unmap
> due to the isses mentioned in the patch. I expanded the folio split code
> for device private can be expanded into its own helper, which does not
> need to do the xas/mapped/lru folio handling. During partial unmap the
> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>
> For (2), I spent some time examining the implications of not unmapping the
> folios prior to split and in the partial unmap path, once we split the PMD
> the folios diverge. I did not run into any particular race either with the
> tests.
1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio().
2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
It is vulnerable to races via rmap. And, for instance, this does not look right without checking the return value:
folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
You mention 2) is needed because of some later problems in the fault path after the pmd split. Would it be
possible to split the folio at fault time, then?
Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
instead?
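For illustration, the checked form would look roughly like this (a sketch;
the choice of error code is my assumption, not the patch's):

	/*
	 * folio_ref_freeze() can fail if someone else still holds a
	 * reference, so the result has to be acted upon rather than ignored.
	 */
	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
		return -EBUSY;	/* hypothetical error handling */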
> Balbir Singh
>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 1:16 ` Mika Penttilä
@ 2025-08-01 4:44 ` Balbir Singh
2025-08-01 5:57 ` Balbir Singh
` (2 more replies)
0 siblings, 3 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-01 4:44 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan, David Hildenbrand
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/1/25 11:16, Mika Penttilä wrote:
> Hi,
>
> On 8/1/25 03:49, Balbir Singh wrote:
>
>> On 7/31/25 21:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>> folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>
>> That is true in general, the special cases I mentioned are:
>>
>> 1. split during migration (where we the sizes on source/destination do not
>> match) and so we need to split in the middle of migration. The entries
>> there are already unmapped and hence the special handling
>> 2. Partial unmap case, where we need to split in the context of the unmap
>> due to the isses mentioned in the patch. I expanded the folio split code
>> for device private can be expanded into its own helper, which does not
>> need to do the xas/mapped/lru folio handling. During partial unmap the
>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>
>> For (2), I spent some time examining the implications of not unmapping the
>> folios prior to split and in the partial unmap path, once we split the PMD
>> the folios diverge. I did not run into any particular race either with the
>> tests.
>
> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>
> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>
> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>
I can add checks to make sure that the call does succeed.
> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
> possible to split the folio at fault time then?
So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large
but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() returning false
when folio_large_mapcount() is 0. There is also additional complexity with refcounts and the mapping.
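For context, the bail-out happens roughly here (paraphrasing mm/rmap.c as I
read it, not code from this patch):

	/*
	 * folio_get_anon_vma() only hands out an anon_vma for folios that
	 * are still mapped; a large folio whose large mapcount has dropped
	 * to zero looks unmapped to it and the caller gets NULL back.
	 */
	if (!folio_mapped(folio))
		goto out;		/* folio_get_anon_vma() returns NULL */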
> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
> instead?
>
>
Splitting during partial unmap causes recursive locking issues with the anon_vma when invoked from
the split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
migration requirements for fault handling.
Balbir Singh
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 4:44 ` Balbir Singh
@ 2025-08-01 5:57 ` Balbir Singh
2025-08-01 6:01 ` Mika Penttilä
2025-08-01 7:04 ` David Hildenbrand
2 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-01 5:57 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan, David Hildenbrand
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/1/25 14:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>> at CPU side.
>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>> CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>> common code more messy if not done that way but sure possible.
>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>
>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>> someone could change while in device, it's just pfn.
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>
>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
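As an illustration of that bookkeeping, installing a device-private entry looks
roughly like this (sketch modelled on the migrate_vma insert path; the variable
names are assumptions):

	swp_entry_t entry = make_writable_device_private_entry(page_to_pfn(page));

	/* rmap/refcount accounting happens as for a present anon PTE */
	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
	set_pte_at(vma->vm_mm, addr, ptep, swp_entry_to_pte(entry));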
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same
>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>> Then I wonder if the device private large folio split should go through
>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>
>>> That is true in general, the special cases I mentioned are:
>>>
>>> 1. split during migration (where the sizes on source/destination do not
>>> match) and so we need to split in the middle of migration. The entries
>>> there are already unmapped and hence the special handling
>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>    due to the issues mentioned in the patch. I expanded the folio split code
>>>    for device private into its own helper, which does not
>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to the split; in the partial unmap path, once we split the PMD,
>>> the folios diverge. I did not run into any particular race with the tests
>>> either.
>>
>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>
>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>
>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
>
> I can add checks to make sure that the call does succeed.
>
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
>
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling
> folio_split() on a partially unmapped folio fails because folio_get_anon_vma() fails, since
> folio_mapped() returns false once folio_large_mapcount() drops to zero. There is also additional
> complexity with ref counts and mapping.
>
Let me get back to you on this with data; I was playing around with CONFIG_MM_IDS and might
have different data from it.
>
>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>> instead?
>>
>>
>
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
> split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.
>
> Balbir Singh
>
Balbir
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 4:44 ` Balbir Singh
2025-08-01 5:57 ` Balbir Singh
@ 2025-08-01 6:01 ` Mika Penttilä
2025-08-01 7:04 ` David Hildenbrand
2 siblings, 0 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-01 6:01 UTC (permalink / raw)
To: Balbir Singh, Zi Yan, David Hildenbrand
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/1/25 07:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>> at CPU side.
>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>> CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>> common code more messy if not done that way but sure possible.
>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>
>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>> someone could change while in device, it's just pfn.
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>
>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same
>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>> Then I wonder if the device private large folio split should go through
>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>
>>> That is true in general, the special cases I mentioned are:
>>>
>>> 1. split during migration (where the sizes on source/destination do not
>>> match) and so we need to split in the middle of migration. The entries
>>> there are already unmapped and hence the special handling
>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>    due to the issues mentioned in the patch. I expanded the folio split code
>>>    for device private into its own helper, which does not
>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to split and in the partial unmap path, once we split the PMD
>>> the folios diverge. I did not run into any particular race either with the
>>> tests.
>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>
>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>
>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
> I can add checks to make sure that the call does succeed.
>
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling
> folio_split() on a partially unmapped folio fails because folio_get_anon_vma() fails, since
> folio_mapped() returns false once folio_large_mapcount() drops to zero. There is also additional
> complexity with ref counts and mapping.
Is this after the deferred split -> map_unused_to_zeropage flow which would leave the page unmapped? Maybe disable that for device pages?
>
>
>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>> instead?
>>
>>
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
> split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.
>
> Balbir Singh
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 1:09 ` Zi Yan
@ 2025-08-01 7:01 ` David Hildenbrand
0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-08-01 7:01 UTC (permalink / raw)
To: Zi Yan, Balbir Singh
Cc: Mika Penttilä, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 01.08.25 03:09, Zi Yan wrote:
> On 31 Jul 2025, at 20:49, Balbir Singh wrote:
>
>> On 7/31/25 21:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>> folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>>
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>
>>
>> That is true in general, the special cases I mentioned are:
>>
>> 1. split during migration (where the sizes on source/destination do not
>> match) and so we need to split in the middle of migration. The entries
>> there are already unmapped and hence the special handling
>
> In this case, all device private entries pointing to this device private
> folio should be turned into migration entries and folio_mapcount() should
> be 0. The split_device_private_folio() is handling this situation, although
> the function name is not very descriptive. You might want to add a comment
> to this function about its use and add a check to make sure folio_mapcount()
> is 0.
>
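A hedged sketch of that check at the top of split_device_private_folio() (the
error value is illustrative):

	/* all device private PTEs must already have become migration entries */
	if (folio_mapcount(folio))
		return -EBUSY;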
>> 2. Partial unmap case, where we need to split in the context of the unmap
>>     due to the issues mentioned in the patch. I expanded the folio split code
>>     for device private into its own helper, which does not
>> need to do the xas/mapped/lru folio handling. During partial unmap the
>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>
>> For (2), I spent some time examining the implications of not unmapping the
>> folios prior to split and in the partial unmap path, once we split the PMD
>> the folios diverge. I did not run into any particular race either with the
>> tests.
>
> For the partial unmap case, you should be able to handle it in the same way
> as a normal PTE-mapped large folio, since, like David said, each device private
> entry can be seen as a PROT_NONE entry. At PMD split, the PMD page table page
> should be filled with device private PTEs, each of them pointing to the
> corresponding subpage. When some of the PTEs are unmapped, the rmap code
> should take care of the folio_mapcount().
Right. In general, no splitting of any THP with a mapcount > 0
(folio_mapped()). It's a clear indication that you are doing something
wrong.
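As a rough sketch of Zi Yan's suggestion above, the PMD split would fill the
page table page with per-subpage device private entries, along these lines
(loop and variable names are illustrative, assuming a writable anon mapping):

	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		swp_entry_t swp_entry;

		swp_entry = make_writable_device_private_entry(
					page_to_pfn(page + i));
		set_pte_at(mm, addr, pte + i, swp_entry_to_pte(swp_entry));
	}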
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 4:44 ` Balbir Singh
2025-08-01 5:57 ` Balbir Singh
2025-08-01 6:01 ` Mika Penttilä
@ 2025-08-01 7:04 ` David Hildenbrand
2025-08-01 8:01 ` Balbir Singh
2 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-01 7:04 UTC (permalink / raw)
To: Balbir Singh, Mika Penttilä, Zi Yan
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 01.08.25 06:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>> at CPU side.
>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>> CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>> common code more messy if not done that way but sure possible.
>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>
>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>> someone could change while in device, it's just pfn.
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>
>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same
>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>> Then I wonder if the device private large folio split should go through
>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>
>>> That is true in general, the special cases I mentioned are:
>>>
>>> 1. split during migration (where the sizes on source/destination do not
>>> match) and so we need to split in the middle of migration. The entries
>>> there are already unmapped and hence the special handling
>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>    due to the issues mentioned in the patch. I expanded the folio split code
>>>    for device private into its own helper, which does not
>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to split and in the partial unmap path, once we split the PMD
>>> the folios diverge. I did not run into any particular race either with the
>>> tests.
>>
>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>
>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>
>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
>
> I can add checks to make sure that the call does succeed.
>
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
>
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling
> folio_split() on a partially unmapped folio fails because folio_get_anon_vma() fails, since
> folio_mapped() returns false once folio_large_mapcount() drops to zero. There is also additional
> complexity with ref counts and mapping.
I think you mean "Calling folio_split() on a *fully* unmapped folio
fails ..."
A partially mapped folio still has folio_mapcount() > 0 ->
folio_mapped() == true.
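For reference, folio_mapped() is essentially a thin wrapper over the mapcount
(simplified sketch of the current helper):

	static inline bool folio_mapped(const struct folio *folio)
	{
		return folio_mapcount(folio) >= 1;
	}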
>
>
>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>> instead?
>>
>>
>
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
> split_huge_pmd_locked() path.
Yes, that's very complicated.
> Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.
Can you elaborate on that?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 7:04 ` David Hildenbrand
@ 2025-08-01 8:01 ` Balbir Singh
2025-08-01 8:46 ` David Hildenbrand
0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-01 8:01 UTC (permalink / raw)
To: David Hildenbrand, Mika Penttilä, Zi Yan
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/1/25 17:04, David Hildenbrand wrote:
> On 01.08.25 06:44, Balbir Singh wrote:
>> On 8/1/25 11:16, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>
>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>
>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>
>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>> at CPU side.
>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>
>>>>>>>> Here are my questions on device private folios:
>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>
>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>>
>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>
>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>> someone could change while in device, it's just pfn.
>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>
>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>
>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>
>>>>>>
>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>> Thanks for the clarification.
>>>>>
>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>> Then I wonder if the device private large folio split should go through
>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>
>>>> That is true in general, the special cases I mentioned are:
>>>>
>>>> 1. split during migration (where the sizes on source/destination do not
>>>> match) and so we need to split in the middle of migration. The entries
>>>> there are already unmapped and hence the special handling
>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>> due to the issues mentioned in the patch. The folio split code for device
>>>> private folios has been expanded into its own helper, which does not
>>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>
>>>> For (2), I spent some time examining the implications of not unmapping the
>>>> folios prior to the split; in the partial unmap path, once we split the PMD
>>>> the folios diverge. I did not run into any particular race either with the
>>>> tests.
>>>
>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>
>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
>>> It is vulnerable to races through rmap. And for instance this does not look right without checking the return value:
>>>
>>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>
>>
>> I can add checks to make sure that the call does succeed.
>>
>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>> possible to split the folio at fault time then?
>>
>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>> on such a partially unmapped folio fails because folio_get_anon_vma() fails, since folio_mapped() returns
>> false once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.
>
> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>
> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>
Looking into this again at my end
>>
>>
>>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>>> instead?
>>>
>>>
>>
>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>> split_huge_pmd_locked() path.
>
> Yes, that's very complicated.
>
Yes and I want to avoid going down that path.
>> Deferred splits do not work for device private pages, due to the
>> migration requirements for fault handling.
>
> Can you elaborate on that?
>
If a folio has gone through deferred_split() and the split is still pending, and a fault is then handled on the
partially mapped folio, fault handling has to migrate the folio, and the code in folio_migrate_mapping()
assumes that the source and destination folio sizes are the same (via the reference and map count checks).
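To make the constraint concrete, this is roughly the kind of check that trips us up (a simplified sketch only, not the actual mm/migrate.c code; the sketch_ prefix marks it as illustrative):

/*
 * Simplified sketch of why a pending deferred split breaks fault-time
 * migration: the source is still a large folio, so the reference counts
 * and the folio order no longer match what the destination side was set
 * up for.
 */
static int sketch_migrate_mapping_check(struct folio *src, struct folio *dst)
{
        int expected = folio_expected_ref_count(src) + 1;

        if (folio_ref_count(src) != expected)
                return -EAGAIN;         /* unexpected extra references */

        if (folio_order(src) != folio_order(dst))
                return -EAGAIN;         /* sizes are assumed to be equal */

        return 0;
}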
Balbir
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 8:01 ` Balbir Singh
@ 2025-08-01 8:46 ` David Hildenbrand
2025-08-01 11:10 ` Zi Yan
0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-01 8:46 UTC (permalink / raw)
To: Balbir Singh, Mika Penttilä, Zi Yan
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 01.08.25 10:01, Balbir Singh wrote:
> On 8/1/25 17:04, David Hildenbrand wrote:
>> On 01.08.25 06:44, Balbir Singh wrote:
>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>
>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>> at CPU side.
>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>
>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>
>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>>>
>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>
>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>
>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>
>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>
>>>>>>>
>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>> Then I wonder if the device private large folio split should go through
>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>
>>>>> That is true in general, the special cases I mentioned are:
>>>>>
>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>> match) and so we need to split in the middle of migration. The entries
>>>>> there are already unmapped and hence the special handling
>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>> due to the issues mentioned in the patch. The folio split code for device
>>>>> private folios has been expanded into its own helper, which does not
>>>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>
>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>> folios prior to the split; in the partial unmap path, once we split the PMD
>>>>> the folios diverge. I did not run into any particular race either with the
>>>>> tests.
>>>>
>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>
>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
>>>> It is vulnerable to races through rmap. And for instance this does not look right without checking the return value:
>>>>
>>>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>
>>>
>>> I can add checks to make sure that the call does succeed.
>>>
>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>> possible to split the folio at fault time then?
>>>
>>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>>> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>> on such a partially unmapped folio fails because folio_get_anon_vma() fails, since folio_mapped() returns
>>> false once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.
>>
>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>
>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>
>
> Looking into this again at my end
>
>>>
>>>
>>>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>>>> instead?
>>>>
>>>>
>>>
>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>> split_huge_pmd_locked() path.
>>
>> Yes, that's very complicated.
>>
>
> Yes and I want to avoid going down that path.
>
>>> Deferred splits do not work for device private pages, due to the
>>> migration requirements for fault handling.
>>
>> Can you elaborate on that?
>>
>
> If a folio has gone through deferred_split() and the split is still pending, and a fault is then handled on the
> partially mapped folio, fault handling has to migrate the folio, and the code in folio_migrate_mapping()
> assumes that the source and destination folio sizes are the same (via the reference and map count checks).
If you hit a partially-mapped folio, instead of migrating, you would
actually want to split and then migrate I assume.
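Roughly this shape, conceptually (an untested sketch with made-up naming, glossing over the device-private locking issues discussed above):

/*
 * Hypothetical fault-side flow, not actual kernel code: when a CPU fault
 * hits a partially mapped device-private THP, split the folio first and
 * only then migrate the base page that faulted.
 */
static int sketch_split_before_migrate(struct folio *folio)
{
        int ret = 0;

        if (folio_test_large(folio)) {
                folio_get(folio);
                folio_lock(folio);
                ret = split_folio(folio);       /* returns 0 on success */
                folio_unlock(folio);
                folio_put(folio);
        }

        /* on success, continue with the existing migrate-to-ram path */
        return ret;
}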
--
Cheers,
David / dhildenb
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 8:46 ` David Hildenbrand
@ 2025-08-01 11:10 ` Zi Yan
2025-08-01 12:20 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-08-01 11:10 UTC (permalink / raw)
To: David Hildenbrand
Cc: Balbir Singh, Mika Penttilä, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
> On 01.08.25 10:01, Balbir Singh wrote:
>> On 8/1/25 17:04, David Hildenbrand wrote:
>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>
>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>> at CPU side.
>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>
>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>
>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>>>>
>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>
>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>
>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>
>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>
>>>>>>>>
>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>> Thanks for the clarification.
>>>>>>>
>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>
>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>
>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>> match) and so we need to split in the middle of migration. The entries
>>>>>> there are already unmapped and hence the special handling
>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>> due to the issues mentioned in the patch. The folio split code for device
>>>>>> private folios has been expanded into its own helper, which does not
>>>>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>
>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>> folios prior to the split; in the partial unmap path, once we split the PMD
>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>> tests.
>>>>>
>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>
>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
>>>>> It is vulnerable to races through rmap. And for instance this does not look right without checking the return value:
>>>>>
>>>>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>
>>>>
>>>> I can add checks to make sure that the call does succeed.
>>>>
>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>> possible to split the folio at fault time then?
>>>>
>>>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>>>> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>> on such a partially unmapped folio fails because folio_get_anon_vma() fails, since folio_mapped() returns
>>>> false once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.
>>>
>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>
>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>
>>
>> Looking into this again at my end
>>
>>>>
>>>>
>>>>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>>>>> instead?
>>>>>
>>>>>
>>>>
>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>> split_huge_pmd_locked() path.
>>>
>>> Yes, that's very complicated.
>>>
>>
>> Yes and I want to avoid going down that path.
>>
>>>> Deferred splits do not work for device private pages, due to the
>>>> migration requirements for fault handling.
>>>
>>> Can you elaborate on that?
>>>
>>
>> If a folio has gone through deferred_split() and the split is still pending, and a fault is then handled on the
>> partially mapped folio, fault handling has to migrate the folio, and the code in folio_migrate_mapping()
>> assumes that the source and destination folio sizes are the same (via the reference and map count checks).
>
> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
Yes, that is exactly what migrate_pages() does. And if the split fails, the migration
fails too. Device private folios should probably do the same thing, assuming
splitting a device private folio would always succeed.
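i.e. roughly this pattern (a sketch of the idea only, not the real mm/migrate.c code; try_one_folio() is a made-up placeholder):

/*
 * Sketch of the migrate_pages() fallback: if migrating a large folio
 * fails, try to split it and retry the pieces; if the split also fails,
 * the migration fails.
 */
static int sketch_migrate_one(struct folio *folio, struct list_head *split_list)
{
        int ret = try_one_folio(folio);         /* made-up placeholder */

        if (ret && folio_test_large(folio)) {
                folio_lock(folio);
                ret = split_folio_to_list(folio, split_list);
                folio_unlock(folio);
                /* on success, the resulting small folios are retried one by one */
        }
        return ret;
}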
Best Regards,
Yan, Zi
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 11:10 ` Zi Yan
@ 2025-08-01 12:20 ` Mika Penttilä
2025-08-01 12:28 ` Zi Yan
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-01 12:20 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/1/25 14:10, Zi Yan wrote:
> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>
>> On 01.08.25 10:01, Balbir Singh wrote:
>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>
>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>
>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>>>>>
>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>
>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>
>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>
>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>> Thanks for the clarification.
>>>>>>>>
>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>
>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>
>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>> match) and so we need to split in the middle of migration. The entries
>>>>>>> there are already unmapped and hence the special handling
>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>> due to the issues mentioned in the patch. The folio split code for device
>>>>>>> private folios has been expanded into its own helper, which does not
>>>>>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>
>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>> folios prior to the split; in the partial unmap path, once we split the PMD
>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>> tests.
>>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>>
>>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
>>>>>> It is vulnerable to races through rmap. And for instance this does not look right without checking the return value:
>>>>>>
>>>>>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>
>>>>> I can add checks to make sure that the call does succeed.
>>>>>
>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>> possible to split the folio at fault time then?
>>>>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>>>>> but not mapped (large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>>> on such a partially unmapped folio fails because folio_get_anon_vma() fails, since folio_mapped() returns
>>>>> false once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.
>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>
>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>
>>> Looking into this again at my end
>>>
>>>>>
>>>>>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>>>>>> instead?
>>>>>>
>>>>>>
>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>> split_huge_pmd_locked() path.
>>>> Yes, that's very complicated.
>>>>
>>> Yes and I want to avoid going down that path.
>>>
>>>>> Deferred splits do not work for device private pages, due to the
>>>>> migration requirements for fault handling.
>>>> Can you elaborate on that?
>>>>
>>> If a folio has gone through deferred_split() and the split is still pending, and a fault is then handled on the
>>> partially mapped folio, fault handling has to migrate the folio, and the code in folio_migrate_mapping()
>>> assumes that the source and destination folio sizes are the same (via the reference and map count checks).
>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
> Yes, that is exactly what migrate_pages() does. And if split fails, the migration
> fails too. Device private folio probably should do the same thing, assuming
> splitting device private folio would always succeed.
Hmm, afaics the normal folio_split() wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
device private pages; that can't work.
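Presumably the zeropage optimization would have to be skipped for device private folios, something along these lines (illustration only, hypothetical helper, not a tested patch):

/*
 * Illustration only: the shared zeropage trick only makes sense for
 * normal anon folios; a device private folio's contents live on the
 * device, so the CPU cannot check whether a subpage is zero-filled.
 */
static int sketch_remap_flags_after_split(struct folio *folio)
{
        return folio_is_device_private(folio) ? 0 : RMP_USE_SHARED_ZEROPAGE;
}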
>
> Best Regards,
> Yan, Zi
>
--Mika
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 12:20 ` Mika Penttilä
@ 2025-08-01 12:28 ` Zi Yan
2025-08-02 1:17 ` Balbir Singh
2025-08-02 10:37 ` Balbir Singh
0 siblings, 2 replies; 71+ messages in thread
From: Zi Yan @ 2025-08-01 12:28 UTC (permalink / raw)
To: Mika Penttilä
Cc: David Hildenbrand, Balbir Singh, linux-mm, linux-kernel,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
Francois Dugast, Ralph Campbell
On 1 Aug 2025, at 8:20, Mika Penttilä wrote:
> On 8/1/25 14:10, Zi Yan wrote:
>> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>>
>>> On 01.08.25 10:01, Balbir Singh wrote:
>>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>>
>>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>>
>>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>>>>>>
>>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>>
>>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>>
>>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>>
>>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>>> Thanks for the clarification.
>>>>>>>>>
>>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>>
>>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>>
>>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>> match) and so we need to split in the middle of migration. The entries
>>>>>>>> there are already unmapped and hence the special handling
>>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>> due to the issues mentioned in the patch. I expanded the folio split code
>>>>>>>> for device private into its own helper, which does not
>>>>>>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>>
>>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>>> folios prior to the split in the partial unmap path; once we split the PMD,
>>>>>>>> the folios diverge. I did not run into any particular race with the tests
>>>>>>>> either.
>>>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>>>
>>>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
>>>>>>> It is vulnerable to races via rmap. And, for instance, this does not look right without checking the return value:
>>>>>>>
>>>>>>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>
>>>>>> I can add checks to make sure that the call does succeed.
>>>>>>
>>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>>> possible to split the folio at fault time then?
>>>>>> So after the partial unmap, the folio ends up in a slightly strange situation: the folio is large,
>>>>>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>>>> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() returning false
>>>>>> (related to folio_large_mapcount). There is also additional complexity with ref counts and mapping.
>>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>>
>>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>>
>>>> Looking into this again at my end
>>>>
>>>>>>
>>>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>>>> instead?
>>>>>>>
>>>>>>>
>>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>>> split_huge_pmd_locked() path.
>>>>> Yes, that's very complicated.
>>>>>
>>>> Yes and I want to avoid going down that path.
>>>>
>>>>>> Deferred splits do not work for device private pages, due to the
>>>>>> migration requirements for fault handling.
>>>>> Can you elaborate on that?
>>>>>
>>>> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on the partially
>>>> mapped folio, the expectation is that, as part of fault handling during migration, the code in folio_migrate_mapping()
>>>> assumes that the folio sizes are the same (via the reference and mapcount checks).
>>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
>> Yes, that is exactly what migrate_pages() does. And if split fails, the migration
>> fails too. Device private folio probably should do the same thing, assuming
>> splitting device private folio would always succeed.
>
> hmm afaics the normal folio_split wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
> device private pages, that can't work..
It is fine to exclude device private folio to use RMP_USE_SHARED_ZEROPAGE like:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d..b97dfd3521a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3858,7 +3858,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
- if (!ret && is_anon)
+ if (!ret && is_anon && !folio_is_device_private(folio))
remap_flags = RMP_USE_SHARED_ZEROPAGE;
remap_page(folio, 1 << order, remap_flags);
Or it can be done in remove_migration_pte().
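For completeness, a sketch of that alternative, assuming the current structure where remove_migration_pte() applies the optimization through try_to_map_unused_to_zeropage() when RMP_USE_SHARED_ZEROPAGE is set (placement is approximate and untested):

	/* inside the page walk loop of remove_migration_pte() (paraphrased) */
	if (rmap_walk_arg->map_unused_to_zeropage &&
	    !folio_is_device_private(folio) &&
	    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
		continue;
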
Best Regards,
Yan, Zi
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 12:28 ` Zi Yan
@ 2025-08-02 1:17 ` Balbir Singh
2025-08-02 10:37 ` Balbir Singh
1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-02 1:17 UTC (permalink / raw)
To: Zi Yan, Mika Penttilä
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/1/25 22:28, Zi Yan wrote:
> On 1 Aug 2025, at 8:20, Mika Penttilä wrote:
>
>> On 8/1/25 14:10, Zi Yan wrote:
>>> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>>>
>>>> On 01.08.25 10:01, Balbir Singh wrote:
>>>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>>>
>>>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>> include/linux/huge_mm.h | 1 +
>>>>>>>>>>>>>>>>>>>>>>> include/linux/rmap.h | 2 +
>>>>>>>>>>>>>>>>>>>>>>> include/linux/swapops.h | 17 +++
>>>>>>>>>>>>>>>>>>>>>>> mm/huge_memory.c | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>>> mm/page_vma_mapped.c | 13 +-
>>>>>>>>>>>>>>>>>>>>>>> mm/pgtable-generic.c | 6 +
>>>>>>>>>>>>>>>>>>>>>>> mm/rmap.c | 22 +++-
>>>>>>>>>>>>>>>>>>>>>>> 7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>>> + struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>>>> + struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>>>> + int ret = 0;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>>>>>>>>>> + * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>>>> + * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>>>> + * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>>>> + * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>>>> + * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>>>> + * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>>>> + * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>>> + folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>>> perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>>> the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>>> folios (barring the CPU access fault handling)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>>>
>>>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>>>
>>>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>>>
>>>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>>>> Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>>>
>>>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>>>
>>>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>>> match) and so we need to split in the middle of migration. The entries
>>>>>>>>> there are already unmapped and hence the special handling
>>>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>>> due to the issues mentioned in the patch. I expanded the folio split code
>>>>>>>>> for device private into its own helper, which does not
>>>>>>>>> need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>>> original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>>>
>>>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>>>> folios prior to the split in the partial unmap path; once we split the PMD,
>>>>>>>>> the folios diverge. I did not run into any particular race with the tests
>>>>>>>>> either.
>>>>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>>>>
>>>>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
>>>>>>>> It is vulnerable to races via rmap. And, for instance, this does not look right without checking the return value:
>>>>>>>>
>>>>>>>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>
>>>>>>> I can add checks to make sure that the call does succeed.
>>>>>>>
>>>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>>>> possible to split the folio at fault time then?
>>>>>>> So after the partial unmap, the folio ends up in a slightly strange situation: the folio is large,
>>>>>>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>>>>> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() returning false
>>>>>>> (related to folio_large_mapcount). There is also additional complexity with ref counts and mapping.
>>>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>>>
>>>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>>>
>>>>> Looking into this again at my end
>>>>>
>>>>>>>
>>>>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>>>>> instead?
>>>>>>>>
>>>>>>>>
>>>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>>>> split_huge_pmd_locked() path.
>>>>>> Yes, that's very complicated.
>>>>>>
>>>>> Yes and I want to avoid going down that path.
>>>>>
>>>>>>> Deferred splits do not work for device private pages, due to the
>>>>>>> migration requirements for fault handling.
>>>>>> Can you elaborate on that?
>>>>>>
>>>>> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on the partially
>>>>> mapped folio, the expectation is that, as part of fault handling during migration, the code in folio_migrate_mapping()
>>>>> assumes that the folio sizes are the same (via the reference and mapcount checks).
>>>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
>>> Yes, that is exactly what migrate_pages() does. And if split fails, the migration
>>> fails too. Device private folio probably should do the same thing, assuming
>>> splitting device private folio would always succeed.
>>
>> hmm afaics the normal folio_split wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
>> device private pages, that can't work..
>
> It is fine to exclude device private folio to use RMP_USE_SHARED_ZEROPAGE like:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2b4ea5a2ce7d..b97dfd3521a9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3858,7 +3858,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> - if (!ret && is_anon)
> + if (!ret && is_anon && !folio_is_device_private(folio))
> remap_flags = RMP_USE_SHARED_ZEROPAGE;
> remap_page(folio, 1 << order, remap_flags);
>
> Or it can be done in remove_migration_pte().
I have the same set of changes, plus more, to see if the logic can be simplified and
well-known paths taken.
Balbir Singh
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-01 12:28 ` Zi Yan
2025-08-02 1:17 ` Balbir Singh
@ 2025-08-02 10:37 ` Balbir Singh
2025-08-02 12:13 ` Mika Penttilä
1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-02 10:37 UTC (permalink / raw)
To: Zi Yan, Mika Penttilä
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
FYI:
I have the following patch on top of my series that seems to make it work
without requiring the helper to split device private folios
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 1 -
lib/test_hmm.c | 11 +++++-
mm/huge_memory.c | 76 ++++-------------------------------------
mm/migrate_device.c | 51 +++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 72 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19e7e3b7c2b7..52d8b435950b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
vm_flags_t vm_flags);
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_device_private_folio(struct folio *folio);
int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order, bool unmapped);
int min_order_for_split(struct folio *folio);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 341ae2af44ec..444477785882 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
* the mirror but here we use it to hold the page for the simulated
* device memory and that page holds the pointer to the mirror.
*/
- rpage = vmf->page->zone_device_data;
+ rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
dmirror = rpage->zone_device_data;
/* FIXME demonstrate how we can adjust migrate range */
order = folio_order(page_folio(vmf->page));
nr = 1 << order;
+ /*
+ * When folios are partially mapped, we can't rely on the folio
+ * order of vmf->page as the folio might not be fully split yet
+ */
+ if (vmf->pte) {
+ order = 0;
+ nr = 1;
+ }
+
/*
* Consider a per-cpu cache of src and dst pfns, but with
* large number of cpus that might not scale well.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1fc1efa219c8..863393dec1f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc);
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
-static int __split_unmapped_folio(struct folio *folio, int new_order,
- struct page *split_at, struct xa_state *xas,
- struct address_space *mapping, bool uniform_split);
-
static bool split_underused_thp = true;
static atomic_t huge_zero_refcount;
@@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pmd_populate(mm, pmd, pgtable);
}
-/**
- * split_huge_device_private_folio - split a huge device private folio into
- * smaller pages (of order 0), currently used by migrate_device logic to
- * split folios for pages that are partially mapped
- *
- * @folio: the folio to split
- *
- * The caller has to hold the folio_lock and a reference via folio_get
- */
-int split_device_private_folio(struct folio *folio)
-{
- struct folio *end_folio = folio_next(folio);
- struct folio *new_folio;
- int ret = 0;
-
- /*
- * Split the folio now. In the case of device
- * private pages, this path is executed when
- * the pmd is split and since freeze is not true
- * it is likely the folio will be deferred_split.
- *
- * With device private pages, deferred splits of
- * folios should be handled here to prevent partial
- * unmaps from causing issues later on in migration
- * and fault handling flows.
- */
- folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
- ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
- VM_WARN_ON(ret);
- for (new_folio = folio_next(folio); new_folio != end_folio;
- new_folio = folio_next(new_folio)) {
- zone_device_private_split_cb(folio, new_folio);
- folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
- new_folio));
- }
-
- /*
- * Mark the end of the folio split for device private THP
- * split
- */
- zone_device_private_split_cb(folio, NULL);
- folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
- return ret;
-}
-
static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long haddr, bool freeze)
{
@@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
freeze = false;
if (!freeze) {
rmap_t rmap_flags = RMAP_NONE;
- unsigned long addr = haddr;
- struct folio *new_folio;
- struct folio *end_folio = folio_next(folio);
if (anon_exclusive)
rmap_flags |= RMAP_EXCLUSIVE;
- folio_lock(folio);
- folio_get(folio);
-
- split_device_private_folio(folio);
-
- for (new_folio = folio_next(folio);
- new_folio != end_folio;
- new_folio = folio_next(new_folio)) {
- addr += PAGE_SIZE;
- folio_unlock(new_folio);
- folio_add_anon_rmap_ptes(new_folio,
- &new_folio->page, 1,
- vma, addr, rmap_flags);
- }
- folio_unlock(folio);
- folio_add_anon_rmap_ptes(folio, &folio->page,
- 1, vma, haddr, rmap_flags);
+ folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ if (anon_exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+ folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+ vma, haddr, rmap_flags);
}
}
@@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
- if (!ret && is_anon)
+ if (!ret && is_anon && !folio_is_device_private(folio))
remap_flags = RMP_USE_SHARED_ZEROPAGE;
remap_page(folio, 1 << order, remap_flags);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 49962ea19109..4264c0290d08 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
* page table entry. Other special swap entries are not
* migratable, and we ignore regular swapped page.
*/
+ struct folio *folio;
+
entry = pte_to_swp_entry(pte);
if (!is_device_private_entry(entry))
goto next;
@@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pgmap->owner != migrate->pgmap_owner)
goto next;
+ folio = page_folio(page);
+ if (folio_test_large(folio)) {
+ struct folio *new_folio;
+ struct folio *new_fault_folio;
+
+ /*
+ * The reason for finding pmd present with a
+ * device private pte and a large folio for the
+ * pte is partial unmaps. Split the folio now
+ * for the migration to be handled correctly
+ */
+ pte_unmap_unlock(ptep, ptl);
+
+ folio_get(folio);
+ if (folio != fault_folio)
+ folio_lock(folio);
+ if (split_folio(folio)) {
+ if (folio != fault_folio)
+ folio_unlock(folio);
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ goto next;
+ }
+
+ /*
+ * After the split, get back the extra reference
+ * on the fault_page, this reference is checked during
+ * folio_migrate_mapping()
+ */
+ if (migrate->fault_page) {
+ new_fault_folio = page_folio(migrate->fault_page);
+ folio_get(new_fault_folio);
+ }
+
+ new_folio = page_folio(page);
+ pfn = page_to_pfn(page);
+
+ /*
+ * Ensure the lock is held on the correct
+ * folio after the split
+ */
+ if (folio != new_folio) {
+ folio_unlock(folio);
+ folio_lock(new_folio);
+ }
+ folio_put(folio);
+ addr = start;
+ goto again;
+ }
+
mpfn = migrate_pfn(page_to_pfn(page)) |
MIGRATE_PFN_MIGRATE;
if (is_writable_device_private_entry(entry))
--
2.50.1
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-02 10:37 ` Balbir Singh
@ 2025-08-02 12:13 ` Mika Penttilä
2025-08-04 22:46 ` Balbir Singh
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-02 12:13 UTC (permalink / raw)
To: Balbir Singh, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi,
On 8/2/25 13:37, Balbir Singh wrote:
> FYI:
>
> I have the following patch on top of my series that seems to make it work
> without requiring the helper to split device private folios
>
I think this looks much better!
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/huge_mm.h | 1 -
> lib/test_hmm.c | 11 +++++-
> mm/huge_memory.c | 76 ++++-------------------------------------
> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
> 4 files changed, 67 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 19e7e3b7c2b7..52d8b435950b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> vm_flags_t vm_flags);
>
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_device_private_folio(struct folio *folio);
> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> unsigned int new_order, bool unmapped);
> int min_order_for_split(struct folio *folio);
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 341ae2af44ec..444477785882 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> * the mirror but here we use it to hold the page for the simulated
> * device memory and that page holds the pointer to the mirror.
> */
> - rpage = vmf->page->zone_device_data;
> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
> dmirror = rpage->zone_device_data;
>
> /* FIXME demonstrate how we can adjust migrate range */
> order = folio_order(page_folio(vmf->page));
> nr = 1 << order;
>
> + /*
> + * When folios are partially mapped, we can't rely on the folio
> + * order of vmf->page as the folio might not be fully split yet
> + */
> + if (vmf->pte) {
> + order = 0;
> + nr = 1;
> + }
> +
> /*
> * Consider a per-cpu cache of src and dst pfns, but with
> * large number of cpus that might not scale well.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1fc1efa219c8..863393dec1f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
> struct shrink_control *sc);
> static unsigned long deferred_split_scan(struct shrinker *shrink,
> struct shrink_control *sc);
> -static int __split_unmapped_folio(struct folio *folio, int new_order,
> - struct page *split_at, struct xa_state *xas,
> - struct address_space *mapping, bool uniform_split);
> -
> static bool split_underused_thp = true;
>
> static atomic_t huge_zero_refcount;
> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> pmd_populate(mm, pmd, pgtable);
> }
>
> -/**
> - * split_huge_device_private_folio - split a huge device private folio into
> - * smaller pages (of order 0), currently used by migrate_device logic to
> - * split folios for pages that are partially mapped
> - *
> - * @folio: the folio to split
> - *
> - * The caller has to hold the folio_lock and a reference via folio_get
> - */
> -int split_device_private_folio(struct folio *folio)
> -{
> - struct folio *end_folio = folio_next(folio);
> - struct folio *new_folio;
> - int ret = 0;
> -
> - /*
> - * Split the folio now. In the case of device
> - * private pages, this path is executed when
> - * the pmd is split and since freeze is not true
> - * it is likely the folio will be deferred_split.
> - *
> - * With device private pages, deferred splits of
> - * folios should be handled here to prevent partial
> - * unmaps from causing issues later on in migration
> - * and fault handling flows.
> - */
> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
> - VM_WARN_ON(ret);
> - for (new_folio = folio_next(folio); new_folio != end_folio;
> - new_folio = folio_next(new_folio)) {
> - zone_device_private_split_cb(folio, new_folio);
> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
> - new_folio));
> - }
> -
> - /*
> - * Mark the end of the folio split for device private THP
> - * split
> - */
> - zone_device_private_split_cb(folio, NULL);
> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
> - return ret;
> -}
> -
> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long haddr, bool freeze)
> {
> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> freeze = false;
> if (!freeze) {
> rmap_t rmap_flags = RMAP_NONE;
> - unsigned long addr = haddr;
> - struct folio *new_folio;
> - struct folio *end_folio = folio_next(folio);
>
> if (anon_exclusive)
> rmap_flags |= RMAP_EXCLUSIVE;
>
> - folio_lock(folio);
> - folio_get(folio);
> -
> - split_device_private_folio(folio);
> -
> - for (new_folio = folio_next(folio);
> - new_folio != end_folio;
> - new_folio = folio_next(new_folio)) {
> - addr += PAGE_SIZE;
> - folio_unlock(new_folio);
> - folio_add_anon_rmap_ptes(new_folio,
> - &new_folio->page, 1,
> - vma, addr, rmap_flags);
> - }
> - folio_unlock(folio);
> - folio_add_anon_rmap_ptes(folio, &folio->page,
> - 1, vma, haddr, rmap_flags);
> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
> + if (anon_exclusive)
> + rmap_flags |= RMAP_EXCLUSIVE;
> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
> + vma, haddr, rmap_flags);
> }
> }
>
> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> - if (!ret && is_anon)
> + if (!ret && is_anon && !folio_is_device_private(folio))
> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>
> remap_page(folio, 1 << order, remap_flags);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 49962ea19109..4264c0290d08 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> * page table entry. Other special swap entries are not
> * migratable, and we ignore regular swapped page.
> */
> + struct folio *folio;
> +
> entry = pte_to_swp_entry(pte);
> if (!is_device_private_entry(entry))
> goto next;
> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> pgmap->owner != migrate->pgmap_owner)
> goto next;
>
> + folio = page_folio(page);
> + if (folio_test_large(folio)) {
> + struct folio *new_folio;
> + struct folio *new_fault_folio;
> +
> + /*
> + * The reason for finding pmd present with a
> + * device private pte and a large folio for the
> + * pte is partial unmaps. Split the folio now
> + * for the migration to be handled correctly
> + */
> + pte_unmap_unlock(ptep, ptl);
> +
> + folio_get(folio);
> + if (folio != fault_folio)
> + folio_lock(folio);
> + if (split_folio(folio)) {
> + if (folio != fault_folio)
> + folio_unlock(folio);
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + goto next;
> + }
> +
The nouveau migrate_to_ram handler also needs adjustment if the split happens.
> + /*
> + * After the split, get back the extra reference
> + * on the fault_page, this reference is checked during
> + * folio_migrate_mapping()
> + */
> + if (migrate->fault_page) {
> + new_fault_folio = page_folio(migrate->fault_page);
> + folio_get(new_fault_folio);
> + }
> +
> + new_folio = page_folio(page);
> + pfn = page_to_pfn(page);
> +
> + /*
> + * Ensure the lock is held on the correct
> + * folio after the split
> + */
> + if (folio != new_folio) {
> + folio_unlock(folio);
> + folio_lock(new_folio);
> + }
Maybe be careful not to unlock fault_page?
> + folio_put(folio);
> + addr = start;
> + goto again;
> + }
> +
> mpfn = migrate_pfn(page_to_pfn(page)) |
> MIGRATE_PFN_MIGRATE;
> if (is_writable_device_private_entry(entry))
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-02 12:13 ` Mika Penttilä
@ 2025-08-04 22:46 ` Balbir Singh
2025-08-04 23:26 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-04 22:46 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/2/25 22:13, Mika Penttilä wrote:
> Hi,
>
> On 8/2/25 13:37, Balbir Singh wrote:
>> FYI:
>>
>> I have the following patch on top of my series that seems to make it work
>> without requiring the helper to split device private folios
>>
> I think this looks much better!
>
Thanks!
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 1 -
>> lib/test_hmm.c | 11 +++++-
>> mm/huge_memory.c | 76 ++++-------------------------------------
>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 19e7e3b7c2b7..52d8b435950b 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>> vm_flags_t vm_flags);
>>
>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_device_private_folio(struct folio *folio);
>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> unsigned int new_order, bool unmapped);
>> int min_order_for_split(struct folio *folio);
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 341ae2af44ec..444477785882 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>> * the mirror but here we use it to hold the page for the simulated
>> * device memory and that page holds the pointer to the mirror.
>> */
>> - rpage = vmf->page->zone_device_data;
>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>> dmirror = rpage->zone_device_data;
>>
>> /* FIXME demonstrate how we can adjust migrate range */
>> order = folio_order(page_folio(vmf->page));
>> nr = 1 << order;
>>
>> + /*
>> + * When folios are partially mapped, we can't rely on the folio
>> + * order of vmf->page as the folio might not be fully split yet
>> + */
>> + if (vmf->pte) {
>> + order = 0;
>> + nr = 1;
>> + }
>> +
>> /*
>> * Consider a per-cpu cache of src and dst pfns, but with
>> * large number of cpus that might not scale well.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1fc1efa219c8..863393dec1f1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>> struct shrink_control *sc);
>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>> struct shrink_control *sc);
>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>> - struct page *split_at, struct xa_state *xas,
>> - struct address_space *mapping, bool uniform_split);
>> -
>> static bool split_underused_thp = true;
>>
>> static atomic_t huge_zero_refcount;
>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> pmd_populate(mm, pmd, pgtable);
>> }
>>
>> -/**
>> - * split_huge_device_private_folio - split a huge device private folio into
>> - * smaller pages (of order 0), currently used by migrate_device logic to
>> - * split folios for pages that are partially mapped
>> - *
>> - * @folio: the folio to split
>> - *
>> - * The caller has to hold the folio_lock and a reference via folio_get
>> - */
>> -int split_device_private_folio(struct folio *folio)
>> -{
>> - struct folio *end_folio = folio_next(folio);
>> - struct folio *new_folio;
>> - int ret = 0;
>> -
>> - /*
>> - * Split the folio now. In the case of device
>> - * private pages, this path is executed when
>> - * the pmd is split and since freeze is not true
>> - * it is likely the folio will be deferred_split.
>> - *
>> - * With device private pages, deferred splits of
>> - * folios should be handled here to prevent partial
>> - * unmaps from causing issues later on in migration
>> - * and fault handling flows.
>> - */
>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>> - VM_WARN_ON(ret);
>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>> - new_folio = folio_next(new_folio)) {
>> - zone_device_private_split_cb(folio, new_folio);
>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>> - new_folio));
>> - }
>> -
>> - /*
>> - * Mark the end of the folio split for device private THP
>> - * split
>> - */
>> - zone_device_private_split_cb(folio, NULL);
>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>> - return ret;
>> -}
>> -
>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long haddr, bool freeze)
>> {
>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> freeze = false;
>> if (!freeze) {
>> rmap_t rmap_flags = RMAP_NONE;
>> - unsigned long addr = haddr;
>> - struct folio *new_folio;
>> - struct folio *end_folio = folio_next(folio);
>>
>> if (anon_exclusive)
>> rmap_flags |= RMAP_EXCLUSIVE;
>>
>> - folio_lock(folio);
>> - folio_get(folio);
>> -
>> - split_device_private_folio(folio);
>> -
>> - for (new_folio = folio_next(folio);
>> - new_folio != end_folio;
>> - new_folio = folio_next(new_folio)) {
>> - addr += PAGE_SIZE;
>> - folio_unlock(new_folio);
>> - folio_add_anon_rmap_ptes(new_folio,
>> - &new_folio->page, 1,
>> - vma, addr, rmap_flags);
>> - }
>> - folio_unlock(folio);
>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>> - 1, vma, haddr, rmap_flags);
>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>> + if (anon_exclusive)
>> + rmap_flags |= RMAP_EXCLUSIVE;
>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>> + vma, haddr, rmap_flags);
>> }
>> }
>>
>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> if (nr_shmem_dropped)
>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>
>> - if (!ret && is_anon)
>> + if (!ret && is_anon && !folio_is_device_private(folio))
>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>
>> remap_page(folio, 1 << order, remap_flags);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 49962ea19109..4264c0290d08 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> * page table entry. Other special swap entries are not
>> * migratable, and we ignore regular swapped page.
>> */
>> + struct folio *folio;
>> +
>> entry = pte_to_swp_entry(pte);
>> if (!is_device_private_entry(entry))
>> goto next;
>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> pgmap->owner != migrate->pgmap_owner)
>> goto next;
>>
>> + folio = page_folio(page);
>> + if (folio_test_large(folio)) {
>> + struct folio *new_folio;
>> + struct folio *new_fault_folio;
>> +
>> + /*
>> + * The reason for finding pmd present with a
>> + * device private pte and a large folio for the
>> + * pte is partial unmaps. Split the folio now
>> + * for the migration to be handled correctly
>> + */
>> + pte_unmap_unlock(ptep, ptl);
>> +
>> + folio_get(folio);
>> + if (folio != fault_folio)
>> + folio_lock(folio);
>> + if (split_folio(folio)) {
>> + if (folio != fault_folio)
>> + folio_unlock(folio);
>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> + goto next;
>> + }
>> +
>
> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>
test_hmm needs adjustment because of the way the backup folios are setup.
>> + /*
>> + * After the split, get back the extra reference
>> + * on the fault_page, this reference is checked during
>> + * folio_migrate_mapping()
>> + */
>> + if (migrate->fault_page) {
>> + new_fault_folio = page_folio(migrate->fault_page);
>> + folio_get(new_fault_folio);
>> + }
>> +
>> + new_folio = page_folio(page);
>> + pfn = page_to_pfn(page);
>> +
>> + /*
>> + * Ensure the lock is held on the correct
>> + * folio after the split
>> + */
>> + if (folio != new_folio) {
>> + folio_unlock(folio);
>> + folio_lock(new_folio);
>> + }
>
> Maybe careful not to unlock fault_page ?
>
split_folio() will unlock everything but the original folio; the code then takes the lock
on the folio corresponding to the new folio
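To spell out the locking hand-off in the hunk above (an annotated restatement, not new code):

    folio_get(folio);
    if (folio != fault_folio)
            folio_lock(folio);      /* fault_folio is already locked by the fault path */
    if (split_folio(folio)) {
            /* split failed: drop the lock we took and skip this pte */
            if (folio != fault_folio)
                    folio_unlock(folio);
            ...
    }
    /* on success only the original head folio is still locked */
    new_folio = page_folio(page);   /* the folio now backing this pte */
    if (folio != new_folio) {
            folio_unlock(folio);
            folio_lock(new_folio);  /* move the lock to the folio we actually need */
    }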
>> + folio_put(folio);
>> + addr = start;
>> + goto again;
>> + }
>> +
>> mpfn = migrate_pfn(page_to_pfn(page)) |
>> MIGRATE_PFN_MIGRATE;
>> if (is_writable_device_private_entry(entry))
>
Balbir
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-04 22:46 ` Balbir Singh
@ 2025-08-04 23:26 ` Mika Penttilä
2025-08-05 4:10 ` Balbir Singh
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-04 23:26 UTC (permalink / raw)
To: Balbir Singh, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi,
On 8/5/25 01:46, Balbir Singh wrote:
> On 8/2/25 22:13, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/2/25 13:37, Balbir Singh wrote:
>>> FYI:
>>>
>>> I have the following patch on top of my series that seems to make it work
>>> without requiring the helper to split device private folios
>>>
>> I think this looks much better!
>>
> Thanks!
>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 1 -
>>> lib/test_hmm.c | 11 +++++-
>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>> vm_flags_t vm_flags);
>>>
>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_device_private_folio(struct folio *folio);
>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> unsigned int new_order, bool unmapped);
>>> int min_order_for_split(struct folio *folio);
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index 341ae2af44ec..444477785882 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>> * the mirror but here we use it to hold the page for the simulated
>>> * device memory and that page holds the pointer to the mirror.
>>> */
>>> - rpage = vmf->page->zone_device_data;
>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>> dmirror = rpage->zone_device_data;
>>>
>>> /* FIXME demonstrate how we can adjust migrate range */
>>> order = folio_order(page_folio(vmf->page));
>>> nr = 1 << order;
>>>
>>> + /*
>>> + * When folios are partially mapped, we can't rely on the folio
>>> + * order of vmf->page as the folio might not be fully split yet
>>> + */
>>> + if (vmf->pte) {
>>> + order = 0;
>>> + nr = 1;
>>> + }
>>> +
>>> /*
>>> * Consider a per-cpu cache of src and dst pfns, but with
>>> * large number of cpus that might not scale well.
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1fc1efa219c8..863393dec1f1 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>> struct shrink_control *sc);
>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>> struct shrink_control *sc);
>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> - struct page *split_at, struct xa_state *xas,
>>> - struct address_space *mapping, bool uniform_split);
>>> -
>>> static bool split_underused_thp = true;
>>>
>>> static atomic_t huge_zero_refcount;
>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>> pmd_populate(mm, pmd, pgtable);
>>> }
>>>
>>> -/**
>>> - * split_huge_device_private_folio - split a huge device private folio into
>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>> - * split folios for pages that are partially mapped
>>> - *
>>> - * @folio: the folio to split
>>> - *
>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>> - */
>>> -int split_device_private_folio(struct folio *folio)
>>> -{
>>> - struct folio *end_folio = folio_next(folio);
>>> - struct folio *new_folio;
>>> - int ret = 0;
>>> -
>>> - /*
>>> - * Split the folio now. In the case of device
>>> - * private pages, this path is executed when
>>> - * the pmd is split and since freeze is not true
>>> - * it is likely the folio will be deferred_split.
>>> - *
>>> - * With device private pages, deferred splits of
>>> - * folios should be handled here to prevent partial
>>> - * unmaps from causing issues later on in migration
>>> - * and fault handling flows.
>>> - */
>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>> - VM_WARN_ON(ret);
>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>> - new_folio = folio_next(new_folio)) {
>>> - zone_device_private_split_cb(folio, new_folio);
>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>> - new_folio));
>>> - }
>>> -
>>> - /*
>>> - * Mark the end of the folio split for device private THP
>>> - * split
>>> - */
>>> - zone_device_private_split_cb(folio, NULL);
>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>> - return ret;
>>> -}
>>> -
>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long haddr, bool freeze)
>>> {
>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> freeze = false;
>>> if (!freeze) {
>>> rmap_t rmap_flags = RMAP_NONE;
>>> - unsigned long addr = haddr;
>>> - struct folio *new_folio;
>>> - struct folio *end_folio = folio_next(folio);
>>>
>>> if (anon_exclusive)
>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>
>>> - folio_lock(folio);
>>> - folio_get(folio);
>>> -
>>> - split_device_private_folio(folio);
>>> -
>>> - for (new_folio = folio_next(folio);
>>> - new_folio != end_folio;
>>> - new_folio = folio_next(new_folio)) {
>>> - addr += PAGE_SIZE;
>>> - folio_unlock(new_folio);
>>> - folio_add_anon_rmap_ptes(new_folio,
>>> - &new_folio->page, 1,
>>> - vma, addr, rmap_flags);
>>> - }
>>> - folio_unlock(folio);
>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>> - 1, vma, haddr, rmap_flags);
>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>> + if (anon_exclusive)
>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>> + vma, haddr, rmap_flags);
>>> }
>>> }
>>>
>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> if (nr_shmem_dropped)
>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>
>>> - if (!ret && is_anon)
>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>
>>> remap_page(folio, 1 << order, remap_flags);
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 49962ea19109..4264c0290d08 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> * page table entry. Other special swap entries are not
>>> * migratable, and we ignore regular swapped page.
>>> */
>>> + struct folio *folio;
>>> +
>>> entry = pte_to_swp_entry(pte);
>>> if (!is_device_private_entry(entry))
>>> goto next;
>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> pgmap->owner != migrate->pgmap_owner)
>>> goto next;
>>>
>>> + folio = page_folio(page);
>>> + if (folio_test_large(folio)) {
>>> + struct folio *new_folio;
>>> + struct folio *new_fault_folio;
>>> +
>>> + /*
>>> + * The reason for finding pmd present with a
>>> + * device private pte and a large folio for the
>>> + * pte is partial unmaps. Split the folio now
>>> + * for the migration to be handled correctly
>>> + */
>>> + pte_unmap_unlock(ptep, ptl);
>>> +
>>> + folio_get(folio);
>>> + if (folio != fault_folio)
>>> + folio_lock(folio);
>>> + if (split_folio(folio)) {
>>> + if (folio != fault_folio)
>>> + folio_unlock(folio);
>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> + goto next;
>>> + }
>>> +
>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>
> test_hmm needs adjustment because of the way the backup folios are setup.
nouveau should check the folio order after the possible split happens.
>
>>> + /*
>>> + * After the split, get back the extra reference
>>> + * on the fault_page, this reference is checked during
>>> + * folio_migrate_mapping()
>>> + */
>>> + if (migrate->fault_page) {
>>> + new_fault_folio = page_folio(migrate->fault_page);
>>> + folio_get(new_fault_folio);
>>> + }
>>> +
>>> + new_folio = page_folio(page);
>>> + pfn = page_to_pfn(page);
>>> +
>>> + /*
>>> + * Ensure the lock is held on the correct
>>> + * folio after the split
>>> + */
>>> + if (folio != new_folio) {
>>> + folio_unlock(folio);
>>> + folio_lock(new_folio);
>>> + }
>> Maybe careful not to unlock fault_page ?
>>
> split_page will unlock everything but the original folio, the code takes the lock
> on the folio corresponding to the new folio
I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>
>>> + folio_put(folio);
>>> + addr = start;
>>> + goto again;
>>> + }
>>> +
>>> mpfn = migrate_pfn(page_to_pfn(page)) |
>>> MIGRATE_PFN_MIGRATE;
>>> if (is_writable_device_private_entry(entry))
> Balbir
>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-07-30 9:50 ` David Hildenbrand
@ 2025-08-04 23:43 ` Balbir Singh
2025-08-05 4:22 ` Balbir Singh
1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-04 23:43 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 7/30/25 19:50, David Hildenbrand wrote:
> On 30.07.25 11:21, Balbir Singh wrote:
>> Add routines to support allocation of large order zone device folios
>> and helper functions for zone device folios, to check if a folio is
>> device private and helpers for setting zone device data.
>>
>> When large folios are used, the existing page_free() callback in
>> pgmap is called when the folio is freed, this is true for both
>> PAGE_SIZE and higher order pages.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/memremap.h | 10 ++++++++-
>> mm/memremap.c | 48 +++++++++++++++++++++++++++++-----------
>> 2 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 4aa151914eab..a0723b35eeaa 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>> }
>> #ifdef CONFIG_ZONE_DEVICE
>> -void zone_device_page_init(struct page *page);
>> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> void memunmap_pages(struct dev_pagemap *pgmap);
>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>> unsigned long memremap_compat_align(void);
>> +
>> +static inline void zone_device_page_init(struct page *page)
>> +{
>> + struct folio *folio = page_folio(page);
>> +
>> + zone_device_folio_init(folio, 0);
>> +}
>> +
>> #else
>> static inline void *devm_memremap_pages(struct device *dev,
>> struct dev_pagemap *pgmap)
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index b0ce0d8254bd..3ca136e7455e 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>> void free_zone_device_folio(struct folio *folio)
>> {
>> struct dev_pagemap *pgmap = folio->pgmap;
>> + unsigned int nr = folio_nr_pages(folio);
>> + int i;
>
> "unsigned long" is to be future-proof.
Will change this for v3
>
> (folio_nr_pages() returns long and probably soon unsigned long)
>
> [ I'd probably all it "nr_pages" ]
Ack
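In code form, the agreed change is roughly:

    unsigned long nr_pages = folio_nr_pages(folio);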
>
>> if (WARN_ON_ONCE(!pgmap))
>> return;
>> mem_cgroup_uncharge(folio);
>> - /*
>> - * Note: we don't expect anonymous compound pages yet. Once supported
>> - * and we could PTE-map them similar to THP, we'd have to clear
>> - * PG_anon_exclusive on all tail pages.
>> - */
>> if (folio_test_anon(folio)) {
>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> - __ClearPageAnonExclusive(folio_page(folio, 0));
>> + for (i = 0; i < nr; i++)
>> + __ClearPageAnonExclusive(folio_page(folio, i));
>> + } else {
>> + VM_WARN_ON_ONCE(folio_test_large(folio));
>> }
>> /*
>> @@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
>> switch (pgmap->type) {
>> case MEMORY_DEVICE_PRIVATE:
>> + if (folio_test_large(folio)) {
>
> Could do "nr > 1" if we already have that value around.
>
Ack
>> + folio_unqueue_deferred_split(folio);
>
> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>
> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>
>> +
>> + percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
>> + }
>> + pgmap->ops->page_free(&folio->page);
>> + percpu_ref_put(&folio->pgmap->ref);
>
> Coold you simply do a
>
> percpu_ref_put_many(&folio->pgmap->ref, nr);
>
> here, or would that be problematic?
>
I can definitely try that
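A minimal sketch of that consolidation, using the nr_pages naming agreed above (illustrative, not the v3 code):

    case MEMORY_DEVICE_PRIVATE:
            if (nr_pages > 1)
                    folio_unqueue_deferred_split(folio);
            pgmap->ops->page_free(&folio->page);
            /* one put per page the folio covered, in a single call */
            percpu_ref_put_many(&folio->pgmap->ref, nr_pages);
            folio->page.mapping = NULL;
            break;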
>> + folio->page.mapping = NULL;
>> + break;
>> case MEMORY_DEVICE_COHERENT:
>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>> break;
>> - pgmap->ops->page_free(folio_page(folio, 0));
>> - put_dev_pagemap(pgmap);
>> + pgmap->ops->page_free(&folio->page);
>> + percpu_ref_put(&folio->pgmap->ref);
>> break;
>> case MEMORY_DEVICE_GENERIC:
>> @@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
>> }
>> }
>> -void zone_device_page_init(struct page *page)
>> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>> {
>> + struct page *page = folio_page(folio, 0);
>> +
>> + VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>> +
>> + /*
>> + * Only PMD level migration is supported for THP migration
>> + */
>
> Talking about something that does not exist yet (and is very specific) sounds a bit weird.
>
> Should this go into a different patch, or could we rephrase the comment to be a bit more generic?
>
> In this patch here, nothing would really object to "order" being intermediate.
>
> (also, this is a device_private limitation? shouldn't that check go somehwere where we can perform this device-private limitation check?)
>
I can remove the limitation and keep it generic
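A sketch of what a generic bound might look like, expressed in pages so it lines up with MAX_ORDER_NR_PAGES (an assumption, not the v3 code):

    VM_WARN_ON_ONCE((1UL << order) > MAX_ORDER_NR_PAGES);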
>> + WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
>> +
>> /*
>> * Drivers shouldn't be allocating pages after calling
>> * memunmap_pages().
>> */
>> - WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>> - set_page_count(page, 1);
>> + WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>> + folio_set_count(folio, 1);
>> lock_page(page);
>> +
>> + if (order > 1) {
>> + prep_compound_page(page, order);
>> + folio_set_large_rmappable(folio);
>> + }
>> }
>> -EXPORT_SYMBOL_GPL(zone_device_page_init);
>> +EXPORT_SYMBOL_GPL(zone_device_folio_init);
>
>
Thanks,
Balbir
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-04 23:26 ` Mika Penttilä
@ 2025-08-05 4:10 ` Balbir Singh
2025-08-05 4:24 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 4:10 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/5/25 09:26, Mika Penttilä wrote:
> Hi,
>
> On 8/5/25 01:46, Balbir Singh wrote:
>> On 8/2/25 22:13, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>> FYI:
>>>>
>>>> I have the following patch on top of my series that seems to make it work
>>>> without requiring the helper to split device private folios
>>>>
>>> I think this looks much better!
>>>
>> Thanks!
>>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>> include/linux/huge_mm.h | 1 -
>>>> lib/test_hmm.c | 11 +++++-
>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>> vm_flags_t vm_flags);
>>>>
>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> -int split_device_private_folio(struct folio *folio);
>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> unsigned int new_order, bool unmapped);
>>>> int min_order_for_split(struct folio *folio);
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 341ae2af44ec..444477785882 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>> * the mirror but here we use it to hold the page for the simulated
>>>> * device memory and that page holds the pointer to the mirror.
>>>> */
>>>> - rpage = vmf->page->zone_device_data;
>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>> dmirror = rpage->zone_device_data;
>>>>
>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>> order = folio_order(page_folio(vmf->page));
>>>> nr = 1 << order;
>>>>
>>>> + /*
>>>> + * When folios are partially mapped, we can't rely on the folio
>>>> + * order of vmf->page as the folio might not be fully split yet
>>>> + */
>>>> + if (vmf->pte) {
>>>> + order = 0;
>>>> + nr = 1;
>>>> + }
>>>> +
>>>> /*
>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>> * large number of cpus that might not scale well.
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>> struct shrink_control *sc);
>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>> struct shrink_control *sc);
>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>> - struct page *split_at, struct xa_state *xas,
>>>> - struct address_space *mapping, bool uniform_split);
>>>> -
>>>> static bool split_underused_thp = true;
>>>>
>>>> static atomic_t huge_zero_refcount;
>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>> pmd_populate(mm, pmd, pgtable);
>>>> }
>>>>
>>>> -/**
>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>> - * split folios for pages that are partially mapped
>>>> - *
>>>> - * @folio: the folio to split
>>>> - *
>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>> - */
>>>> -int split_device_private_folio(struct folio *folio)
>>>> -{
>>>> - struct folio *end_folio = folio_next(folio);
>>>> - struct folio *new_folio;
>>>> - int ret = 0;
>>>> -
>>>> - /*
>>>> - * Split the folio now. In the case of device
>>>> - * private pages, this path is executed when
>>>> - * the pmd is split and since freeze is not true
>>>> - * it is likely the folio will be deferred_split.
>>>> - *
>>>> - * With device private pages, deferred splits of
>>>> - * folios should be handled here to prevent partial
>>>> - * unmaps from causing issues later on in migration
>>>> - * and fault handling flows.
>>>> - */
>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>> - VM_WARN_ON(ret);
>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>> - new_folio = folio_next(new_folio)) {
>>>> - zone_device_private_split_cb(folio, new_folio);
>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>> - new_folio));
>>>> - }
>>>> -
>>>> - /*
>>>> - * Mark the end of the folio split for device private THP
>>>> - * split
>>>> - */
>>>> - zone_device_private_split_cb(folio, NULL);
>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>> - return ret;
>>>> -}
>>>> -
>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>> unsigned long haddr, bool freeze)
>>>> {
>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>> freeze = false;
>>>> if (!freeze) {
>>>> rmap_t rmap_flags = RMAP_NONE;
>>>> - unsigned long addr = haddr;
>>>> - struct folio *new_folio;
>>>> - struct folio *end_folio = folio_next(folio);
>>>>
>>>> if (anon_exclusive)
>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>
>>>> - folio_lock(folio);
>>>> - folio_get(folio);
>>>> -
>>>> - split_device_private_folio(folio);
>>>> -
>>>> - for (new_folio = folio_next(folio);
>>>> - new_folio != end_folio;
>>>> - new_folio = folio_next(new_folio)) {
>>>> - addr += PAGE_SIZE;
>>>> - folio_unlock(new_folio);
>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>> - &new_folio->page, 1,
>>>> - vma, addr, rmap_flags);
>>>> - }
>>>> - folio_unlock(folio);
>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>> - 1, vma, haddr, rmap_flags);
>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>> + if (anon_exclusive)
>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>> + vma, haddr, rmap_flags);
>>>> }
>>>> }
>>>>
>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> if (nr_shmem_dropped)
>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>
>>>> - if (!ret && is_anon)
>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>
>>>> remap_page(folio, 1 << order, remap_flags);
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 49962ea19109..4264c0290d08 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> * page table entry. Other special swap entries are not
>>>> * migratable, and we ignore regular swapped page.
>>>> */
>>>> + struct folio *folio;
>>>> +
>>>> entry = pte_to_swp_entry(pte);
>>>> if (!is_device_private_entry(entry))
>>>> goto next;
>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> pgmap->owner != migrate->pgmap_owner)
>>>> goto next;
>>>>
>>>> + folio = page_folio(page);
>>>> + if (folio_test_large(folio)) {
>>>> + struct folio *new_folio;
>>>> + struct folio *new_fault_folio;
>>>> +
>>>> + /*
>>>> + * The reason for finding pmd present with a
>>>> + * device private pte and a large folio for the
>>>> + * pte is partial unmaps. Split the folio now
>>>> + * for the migration to be handled correctly
>>>> + */
>>>> + pte_unmap_unlock(ptep, ptl);
>>>> +
>>>> + folio_get(folio);
>>>> + if (folio != fault_folio)
>>>> + folio_lock(folio);
>>>> + if (split_folio(folio)) {
>>>> + if (folio != fault_folio)
>>>> + folio_unlock(folio);
>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>> + goto next;
>>>> + }
>>>> +
>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>
>> test_hmm needs adjustment because of the way the backup folios are setup.
>
> nouveau should check the folio order after the possible split happens.
>
You mean the folio_split callback?
>>
>>>> + /*
>>>> + * After the split, get back the extra reference
>>>> + * on the fault_page, this reference is checked during
>>>> + * folio_migrate_mapping()
>>>> + */
>>>> + if (migrate->fault_page) {
>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>> + folio_get(new_fault_folio);
>>>> + }
>>>> +
>>>> + new_folio = page_folio(page);
>>>> + pfn = page_to_pfn(page);
>>>> +
>>>> + /*
>>>> + * Ensure the lock is held on the correct
>>>> + * folio after the split
>>>> + */
>>>> + if (folio != new_folio) {
>>>> + folio_unlock(folio);
>>>> + folio_lock(new_folio);
>>>> + }
>>> Maybe careful not to unlock fault_page ?
>>>
>> split_page will unlock everything but the original folio, the code takes the lock
>> on the folio corresponding to the new folio
>
> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>
Not sure I follow what you're trying to elaborate on here
Balbir
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-07-30 9:50 ` David Hildenbrand
2025-08-04 23:43 ` Balbir Singh
@ 2025-08-05 4:22 ` Balbir Singh
2025-08-05 10:57 ` David Hildenbrand
1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 4:22 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 7/30/25 19:50, David Hildenbrand wrote:
> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>
> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>
I realized I did not answer this
Queueing the folio for deferred split is the default action when partial unmaps take place.
Anything that goes through folio_remove_rmap_ptes() can cause the folio to be deferred split
if it gets partially unmapped.
We can optimize for this later if needed
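If that optimization is done later, a hypothetical guard at the top of deferred_split_folio() might be all that is needed (not part of this series):

    if (folio_is_device_private(folio))
            return;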
Balbir Singh
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-05 4:10 ` Balbir Singh
@ 2025-08-05 4:24 ` Mika Penttilä
2025-08-05 5:19 ` Mika Penttilä
2025-08-05 10:27 ` Balbir Singh
0 siblings, 2 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05 4:24 UTC (permalink / raw)
To: Balbir Singh, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
Hi,
On 8/5/25 07:10, Balbir Singh wrote:
> On 8/5/25 09:26, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/5/25 01:46, Balbir Singh wrote:
>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>> FYI:
>>>>>
>>>>> I have the following patch on top of my series that seems to make it work
>>>>> without requiring the helper to split device private folios
>>>>>
>>>> I think this looks much better!
>>>>
>>> Thanks!
>>>
>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>> ---
>>>>> include/linux/huge_mm.h | 1 -
>>>>> lib/test_hmm.c | 11 +++++-
>>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>> vm_flags_t vm_flags);
>>>>>
>>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>> -int split_device_private_folio(struct folio *folio);
>>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>> unsigned int new_order, bool unmapped);
>>>>> int min_order_for_split(struct folio *folio);
>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>> index 341ae2af44ec..444477785882 100644
>>>>> --- a/lib/test_hmm.c
>>>>> +++ b/lib/test_hmm.c
>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>> * the mirror but here we use it to hold the page for the simulated
>>>>> * device memory and that page holds the pointer to the mirror.
>>>>> */
>>>>> - rpage = vmf->page->zone_device_data;
>>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>> dmirror = rpage->zone_device_data;
>>>>>
>>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>>> order = folio_order(page_folio(vmf->page));
>>>>> nr = 1 << order;
>>>>>
>>>>> + /*
>>>>> + * When folios are partially mapped, we can't rely on the folio
>>>>> + * order of vmf->page as the folio might not be fully split yet
>>>>> + */
>>>>> + if (vmf->pte) {
>>>>> + order = 0;
>>>>> + nr = 1;
>>>>> + }
>>>>> +
>>>>> /*
>>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>>> * large number of cpus that might not scale well.
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>> struct shrink_control *sc);
>>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>> struct shrink_control *sc);
>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>> - struct page *split_at, struct xa_state *xas,
>>>>> - struct address_space *mapping, bool uniform_split);
>>>>> -
>>>>> static bool split_underused_thp = true;
>>>>>
>>>>> static atomic_t huge_zero_refcount;
>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>> pmd_populate(mm, pmd, pgtable);
>>>>> }
>>>>>
>>>>> -/**
>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>> - * split folios for pages that are partially mapped
>>>>> - *
>>>>> - * @folio: the folio to split
>>>>> - *
>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>> - */
>>>>> -int split_device_private_folio(struct folio *folio)
>>>>> -{
>>>>> - struct folio *end_folio = folio_next(folio);
>>>>> - struct folio *new_folio;
>>>>> - int ret = 0;
>>>>> -
>>>>> - /*
>>>>> - * Split the folio now. In the case of device
>>>>> - * private pages, this path is executed when
>>>>> - * the pmd is split and since freeze is not true
>>>>> - * it is likely the folio will be deferred_split.
>>>>> - *
>>>>> - * With device private pages, deferred splits of
>>>>> - * folios should be handled here to prevent partial
>>>>> - * unmaps from causing issues later on in migration
>>>>> - * and fault handling flows.
>>>>> - */
>>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>> - VM_WARN_ON(ret);
>>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>> - new_folio = folio_next(new_folio)) {
>>>>> - zone_device_private_split_cb(folio, new_folio);
>>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>> - new_folio));
>>>>> - }
>>>>> -
>>>>> - /*
>>>>> - * Mark the end of the folio split for device private THP
>>>>> - * split
>>>>> - */
>>>>> - zone_device_private_split_cb(folio, NULL);
>>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> - return ret;
>>>>> -}
>>>>> -
>>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>> unsigned long haddr, bool freeze)
>>>>> {
>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>> freeze = false;
>>>>> if (!freeze) {
>>>>> rmap_t rmap_flags = RMAP_NONE;
>>>>> - unsigned long addr = haddr;
>>>>> - struct folio *new_folio;
>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>
>>>>> if (anon_exclusive)
>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>>
>>>>> - folio_lock(folio);
>>>>> - folio_get(folio);
>>>>> -
>>>>> - split_device_private_folio(folio);
>>>>> -
>>>>> - for (new_folio = folio_next(folio);
>>>>> - new_folio != end_folio;
>>>>> - new_folio = folio_next(new_folio)) {
>>>>> - addr += PAGE_SIZE;
>>>>> - folio_unlock(new_folio);
>>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>>> - &new_folio->page, 1,
>>>>> - vma, addr, rmap_flags);
>>>>> - }
>>>>> - folio_unlock(folio);
>>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>> - 1, vma, haddr, rmap_flags);
>>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>> + if (anon_exclusive)
>>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>> + vma, haddr, rmap_flags);
>>>>> }
>>>>> }
>>>>>
>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> if (nr_shmem_dropped)
>>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>
>>>>> - if (!ret && is_anon)
>>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>
>>>>> remap_page(folio, 1 << order, remap_flags);
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 49962ea19109..4264c0290d08 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> * page table entry. Other special swap entries are not
>>>>> * migratable, and we ignore regular swapped page.
>>>>> */
>>>>> + struct folio *folio;
>>>>> +
>>>>> entry = pte_to_swp_entry(pte);
>>>>> if (!is_device_private_entry(entry))
>>>>> goto next;
>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> pgmap->owner != migrate->pgmap_owner)
>>>>> goto next;
>>>>>
>>>>> + folio = page_folio(page);
>>>>> + if (folio_test_large(folio)) {
>>>>> + struct folio *new_folio;
>>>>> + struct folio *new_fault_folio;
>>>>> +
>>>>> + /*
>>>>> + * The reason for finding pmd present with a
>>>>> + * device private pte and a large folio for the
>>>>> + * pte is partial unmaps. Split the folio now
>>>>> + * for the migration to be handled correctly
>>>>> + */
>>>>> + pte_unmap_unlock(ptep, ptl);
>>>>> +
>>>>> + folio_get(folio);
>>>>> + if (folio != fault_folio)
>>>>> + folio_lock(folio);
>>>>> + if (split_folio(folio)) {
>>>>> + if (folio != fault_folio)
>>>>> + folio_unlock(folio);
>>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>> + goto next;
>>>>> + }
>>>>> +
>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>
>>> test_hmm needs adjustment because of the way the backup folios are setup.
>> nouveau should check the folio order after the possible split happens.
>>
> You mean the folio_split callback?
no, nouveau_dmem_migrate_to_ram():
..
sfolio = page_folio(vmf->page);
order = folio_order(sfolio);
...
migrate_vma_setup()
..
if sfolio is split order still reflects the pre-split order
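i.e. something along these lines in nouveau_dmem_migrate_to_ram(), re-reading the order once setup has run (a sketch reusing the names above; &args stands for the driver's struct migrate_vma):

    sfolio = page_folio(vmf->page);
    order = folio_order(sfolio);
    ...
    migrate_vma_setup(&args);
    /* re-read: the source folio may have been split during collection */
    sfolio = page_folio(vmf->page);
    order = folio_order(sfolio);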
>
>>>>> + /*
>>>>> + * After the split, get back the extra reference
>>>>> + * on the fault_page, this reference is checked during
>>>>> + * folio_migrate_mapping()
>>>>> + */
>>>>> + if (migrate->fault_page) {
>>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>>> + folio_get(new_fault_folio);
>>>>> + }
>>>>> +
>>>>> + new_folio = page_folio(page);
>>>>> + pfn = page_to_pfn(page);
>>>>> +
>>>>> + /*
>>>>> + * Ensure the lock is held on the correct
>>>>> + * folio after the split
>>>>> + */
>>>>> + if (folio != new_folio) {
>>>>> + folio_unlock(folio);
>>>>> + folio_lock(new_folio);
>>>>> + }
>>>> Maybe careful not to unlock fault_page ?
>>>>
>>> split_page will unlock everything but the original folio, the code takes the lock
>>> on the folio corresponding to the new folio
>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>
> Not sure I follow what you're trying to elaborate on here
do_swap_page:
..
    if (trylock_page(vmf->page)) {
            ret = pgmap->ops->migrate_to_ram(vmf);
            /* <- vmf->page should be locked here even after split */
            unlock_page(vmf->page);
> Balbir
>
--Mika
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-05 4:24 ` Mika Penttilä
@ 2025-08-05 5:19 ` Mika Penttilä
2025-08-05 10:27 ` Balbir Singh
1 sibling, 0 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05 5:19 UTC (permalink / raw)
To: Balbir Singh, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/5/25 07:24, Mika Penttilä wrote:
> Hi,
>
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>> include/linux/huge_mm.h | 1 -
>>>>>> lib/test_hmm.c | 11 +++++-
>>>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>> vm_flags_t vm_flags);
>>>>>>
>>>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>> unsigned int new_order, bool unmapped);
>>>>>> int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>> * the mirror but here we use it to hold the page for the simulated
>>>>>> * device memory and that page holds the pointer to the mirror.
>>>>>> */
>>>>>> - rpage = vmf->page->zone_device_data;
>>>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>> dmirror = rpage->zone_device_data;
>>>>>>
>>>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>>>> order = folio_order(page_folio(vmf->page));
>>>>>> nr = 1 << order;
>>>>>>
>>>>>> + /*
>>>>>> + * When folios are partially mapped, we can't rely on the folio
>>>>>> + * order of vmf->page as the folio might not be fully split yet
>>>>>> + */
>>>>>> + if (vmf->pte) {
>>>>>> + order = 0;
>>>>>> + nr = 1;
>>>>>> + }
>>>>>> +
>>>>>> /*
>>>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>>>> * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>> struct shrink_control *sc);
>>>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>> struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> - struct page *split_at, struct xa_state *xas,
>>>>>> - struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>> static bool split_underused_thp = true;
>>>>>>
>>>>>> static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>> pmd_populate(mm, pmd, pgtable);
>>>>>> }
>>>>>>
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>> - struct folio *new_folio;
>>>>>> - int ret = 0;
>>>>>> -
>>>>>> - /*
>>>>>> - * Split the folio now. In the case of device
>>>>>> - * private pages, this path is executed when
>>>>>> - * the pmd is split and since freeze is not true
>>>>>> - * it is likely the folio will be deferred_split.
>>>>>> - *
>>>>>> - * With device private pages, deferred splits of
>>>>>> - * folios should be handled here to prevent partial
>>>>>> - * unmaps from causing issues later on in migration
>>>>>> - * and fault handling flows.
>>>>>> - */
>>>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> - VM_WARN_ON(ret);
>>>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>> - zone_device_private_split_cb(folio, new_folio);
>>>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> - new_folio));
>>>>>> - }
>>>>>> -
>>>>>> - /*
>>>>>> - * Mark the end of the folio split for device private THP
>>>>>> - * split
>>>>>> - */
>>>>>> - zone_device_private_split_cb(folio, NULL);
>>>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> - return ret;
>>>>>> -}
>>>>>> -
>>>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>> unsigned long haddr, bool freeze)
>>>>>> {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>> freeze = false;
>>>>>> if (!freeze) {
>>>>>> rmap_t rmap_flags = RMAP_NONE;
>>>>>> - unsigned long addr = haddr;
>>>>>> - struct folio *new_folio;
>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>
>>>>>> if (anon_exclusive)
>>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>
>>>>>> - folio_lock(folio);
>>>>>> - folio_get(folio);
>>>>>> -
>>>>>> - split_device_private_folio(folio);
>>>>>> -
>>>>>> - for (new_folio = folio_next(folio);
>>>>>> - new_folio != end_folio;
>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>> - addr += PAGE_SIZE;
>>>>>> - folio_unlock(new_folio);
>>>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>>>> - &new_folio->page, 1,
>>>>>> - vma, addr, rmap_flags);
>>>>>> - }
>>>>>> - folio_unlock(folio);
>>>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> - 1, vma, haddr, rmap_flags);
>>>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> + if (anon_exclusive)
>>>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> + vma, haddr, rmap_flags);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> if (nr_shmem_dropped)
>>>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>
>>>>>> - if (!ret && is_anon)
>>>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>
>>>>>> remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> * page table entry. Other special swap entries are not
>>>>>> * migratable, and we ignore regular swapped page.
>>>>>> */
>>>>>> + struct folio *folio;
>>>>>> +
>>>>>> entry = pte_to_swp_entry(pte);
>>>>>> if (!is_device_private_entry(entry))
>>>>>> goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> pgmap->owner != migrate->pgmap_owner)
>>>>>> goto next;
>>>>>>
>>>>>> + folio = page_folio(page);
>>>>>> + if (folio_test_large(folio)) {
>>>>>> + struct folio *new_folio;
>>>>>> + struct folio *new_fault_folio;
>>>>>> +
>>>>>> + /*
>>>>>> + * The reason for finding pmd present with a
>>>>>> + * device private pte and a large folio for the
>>>>>> + * pte is partial unmaps. Split the folio now
>>>>>> + * for the migration to be handled correctly
>>>>>> + */
>>>>>> + pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> + folio_get(folio);
>>>>>> + if (folio != fault_folio)
>>>>>> + folio_lock(folio);
>>>>>> + if (split_folio(folio)) {
>>>>>> + if (folio != fault_folio)
>>>>>> + folio_unlock(folio);
>>>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> + goto next;
>>>>>> + }
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
> no, nouveau_dmem_migrate_to_ram():
> ..
> sfolio = page_folio(vmf->page);
> order = folio_order(sfolio);
> ...
> migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
>
>>>>>> + /*
>>>>>> + * After the split, get back the extra reference
>>>>>> + * on the fault_page, this reference is checked during
>>>>>> + * folio_migrate_mapping()
>>>>>> + */
>>>>>> + if (migrate->fault_page) {
>>>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>>>> + folio_get(new_fault_folio);
>>>>>> + }
>>>>>> +
>>>>>> + new_folio = page_folio(page);
>>>>>> + pfn = page_to_pfn(page);
>>>>>> +
>>>>>> + /*
>>>>>> + * Ensure the lock is held on the correct
>>>>>> + * folio after the split
>>>>>> + */
>>>>>> + if (folio != new_folio) {
>>>>>> + folio_unlock(folio);
>>>>>> + folio_lock(new_folio);
>>>>>> + }
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
>
Actually fault_folio should be fine, but should we have something like:

    if (fault_folio) {
            if (folio != new_folio) {
                    folio_unlock(folio);
                    folio_lock(new_folio);
            }
    } else {
            folio_unlock(folio);
    }
>> Balbir
>>
> --Mika
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-05 4:24 ` Mika Penttilä
2025-08-05 5:19 ` Mika Penttilä
@ 2025-08-05 10:27 ` Balbir Singh
2025-08-05 10:35 ` Mika Penttilä
1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 10:27 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/5/25 14:24, Mika Penttilä wrote:
> Hi,
>
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>> include/linux/huge_mm.h | 1 -
>>>>>> lib/test_hmm.c | 11 +++++-
>>>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>> vm_flags_t vm_flags);
>>>>>>
>>>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>> unsigned int new_order, bool unmapped);
>>>>>> int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>> * the mirror but here we use it to hold the page for the simulated
>>>>>> * device memory and that page holds the pointer to the mirror.
>>>>>> */
>>>>>> - rpage = vmf->page->zone_device_data;
>>>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>> dmirror = rpage->zone_device_data;
>>>>>>
>>>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>>>> order = folio_order(page_folio(vmf->page));
>>>>>> nr = 1 << order;
>>>>>>
>>>>>> + /*
>>>>>> + * When folios are partially mapped, we can't rely on the folio
>>>>>> + * order of vmf->page as the folio might not be fully split yet
>>>>>> + */
>>>>>> + if (vmf->pte) {
>>>>>> + order = 0;
>>>>>> + nr = 1;
>>>>>> + }
>>>>>> +
>>>>>> /*
>>>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>>>> * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>> struct shrink_control *sc);
>>>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>> struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> - struct page *split_at, struct xa_state *xas,
>>>>>> - struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>> static bool split_underused_thp = true;
>>>>>>
>>>>>> static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>> pmd_populate(mm, pmd, pgtable);
>>>>>> }
>>>>>>
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>> - struct folio *new_folio;
>>>>>> - int ret = 0;
>>>>>> -
>>>>>> - /*
>>>>>> - * Split the folio now. In the case of device
>>>>>> - * private pages, this path is executed when
>>>>>> - * the pmd is split and since freeze is not true
>>>>>> - * it is likely the folio will be deferred_split.
>>>>>> - *
>>>>>> - * With device private pages, deferred splits of
>>>>>> - * folios should be handled here to prevent partial
>>>>>> - * unmaps from causing issues later on in migration
>>>>>> - * and fault handling flows.
>>>>>> - */
>>>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> - VM_WARN_ON(ret);
>>>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>> - zone_device_private_split_cb(folio, new_folio);
>>>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> - new_folio));
>>>>>> - }
>>>>>> -
>>>>>> - /*
>>>>>> - * Mark the end of the folio split for device private THP
>>>>>> - * split
>>>>>> - */
>>>>>> - zone_device_private_split_cb(folio, NULL);
>>>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> - return ret;
>>>>>> -}
>>>>>> -
>>>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>> unsigned long haddr, bool freeze)
>>>>>> {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>> freeze = false;
>>>>>> if (!freeze) {
>>>>>> rmap_t rmap_flags = RMAP_NONE;
>>>>>> - unsigned long addr = haddr;
>>>>>> - struct folio *new_folio;
>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>
>>>>>> if (anon_exclusive)
>>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>
>>>>>> - folio_lock(folio);
>>>>>> - folio_get(folio);
>>>>>> -
>>>>>> - split_device_private_folio(folio);
>>>>>> -
>>>>>> - for (new_folio = folio_next(folio);
>>>>>> - new_folio != end_folio;
>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>> - addr += PAGE_SIZE;
>>>>>> - folio_unlock(new_folio);
>>>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>>>> - &new_folio->page, 1,
>>>>>> - vma, addr, rmap_flags);
>>>>>> - }
>>>>>> - folio_unlock(folio);
>>>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> - 1, vma, haddr, rmap_flags);
>>>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> + if (anon_exclusive)
>>>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> + vma, haddr, rmap_flags);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> if (nr_shmem_dropped)
>>>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>
>>>>>> - if (!ret && is_anon)
>>>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>
>>>>>> remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> * page table entry. Other special swap entries are not
>>>>>> * migratable, and we ignore regular swapped page.
>>>>>> */
>>>>>> + struct folio *folio;
>>>>>> +
>>>>>> entry = pte_to_swp_entry(pte);
>>>>>> if (!is_device_private_entry(entry))
>>>>>> goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> pgmap->owner != migrate->pgmap_owner)
>>>>>> goto next;
>>>>>>
>>>>>> + folio = page_folio(page);
>>>>>> + if (folio_test_large(folio)) {
>>>>>> + struct folio *new_folio;
>>>>>> + struct folio *new_fault_folio;
>>>>>> +
>>>>>> + /*
>>>>>> + * The reason for finding pmd present with a
>>>>>> + * device private pte and a large folio for the
>>>>>> + * pte is partial unmaps. Split the folio now
>>>>>> + * for the migration to be handled correctly
>>>>>> + */
>>>>>> + pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> + folio_get(folio);
>>>>>> + if (folio != fault_folio)
>>>>>> + folio_lock(folio);
>>>>>> + if (split_folio(folio)) {
>>>>>> + if (folio != fault_folio)
>>>>>> + folio_unlock(folio);
>>>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> + goto next;
>>>>>> + }
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
>
> no, nouveau_dmem_migrate_to_ram():
> ..
> sfolio = page_folio(vmf->page);
> order = folio_order(sfolio);
> ...
> migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
>
Will fix, good catch!
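Roughly, the fix could look like this on the nouveau side (a sketch only;
the surrounding nouveau_dmem_migrate_to_ram() code and the local names
here are assumptions and may differ in the actual change):

        sfolio = page_folio(vmf->page);
        order = folio_order(sfolio);    /* can go stale if the folio is split */
        ...
        if (migrate_vma_setup(&args) < 0)
                return VM_FAULT_SIGBUS;

        /*
         * migrate_vma_collect_pmd() may have split sfolio for a partially
         * unmapped THP, so refresh the order from the post-split folio
         * before sizing the destination allocation.
         */
        order = folio_order(page_folio(vmf->page));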
>>
>>>>>> + /*
>>>>>> + * After the split, get back the extra reference
>>>>>> + * on the fault_page, this reference is checked during
>>>>>> + * folio_migrate_mapping()
>>>>>> + */
>>>>>> + if (migrate->fault_page) {
>>>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>>>> + folio_get(new_fault_folio);
>>>>>> + }
>>>>>> +
>>>>>> + new_folio = page_folio(page);
>>>>>> + pfn = page_to_pfn(page);
>>>>>> +
>>>>>> + /*
>>>>>> + * Ensure the lock is held on the correct
>>>>>> + * folio after the split
>>>>>> + */
>>>>>> + if (folio != new_folio) {
>>>>>> + folio_unlock(folio);
>>>>>> + folio_lock(new_folio);
>>>>>> + }
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
>
> do_swap_page:
> ..
> if (trylock_page(vmf->page)) {
> ret = pgmap->ops->migrate_to_ram(vmf);
> <- vmf->page should be locked here even after split
> unlock_page(vmf->page);
>
Yep, the split will unlock all tail folios, leaving just the head folio locked.
With this change, the lock we need to hold is the folio lock associated with the
fault_page pte entry, and we must not unlock it when the cause is a fault. The
code seems to do the right thing there, let me double check
Balbir
and the code does the right thing there.
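For reference, the contract that has to keep holding is the one in the
do_swap_page() snippet quoted above. Roughly (a simplified sketch, not the
full mm/memory.c code):

        if (trylock_page(vmf->page)) {
                ret = pgmap->ops->migrate_to_ram(vmf);
                /*
                 * vmf->page is expected to still be locked here, even if
                 * the folio it belongs to was split inside the handler.
                 */
                unlock_page(vmf->page);
        }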
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-05 10:27 ` Balbir Singh
@ 2025-08-05 10:35 ` Mika Penttilä
2025-08-05 10:36 ` Balbir Singh
0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05 10:35 UTC (permalink / raw)
To: Balbir Singh, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/5/25 13:27, Balbir Singh wrote:
> On 8/5/25 14:24, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/5/25 07:10, Balbir Singh wrote:
>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>> FYI:
>>>>>>>
>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>> without requiring the helper to split device private folios
>>>>>>>
>>>>>> I think this looks much better!
>>>>>>
>>>>> Thanks!
>>>>>
>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>> ---
>>>>>>> include/linux/huge_mm.h | 1 -
>>>>>>> lib/test_hmm.c | 11 +++++-
>>>>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>> vm_flags_t vm_flags);
>>>>>>>
>>>>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>> unsigned int new_order, bool unmapped);
>>>>>>> int min_order_for_split(struct folio *folio);
>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>> --- a/lib/test_hmm.c
>>>>>>> +++ b/lib/test_hmm.c
>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>> * the mirror but here we use it to hold the page for the simulated
>>>>>>> * device memory and that page holds the pointer to the mirror.
>>>>>>> */
>>>>>>> - rpage = vmf->page->zone_device_data;
>>>>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>> dmirror = rpage->zone_device_data;
>>>>>>>
>>>>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>>>>> order = folio_order(page_folio(vmf->page));
>>>>>>> nr = 1 << order;
>>>>>>>
>>>>>>> + /*
>>>>>>> + * When folios are partially mapped, we can't rely on the folio
>>>>>>> + * order of vmf->page as the folio might not be fully split yet
>>>>>>> + */
>>>>>>> + if (vmf->pte) {
>>>>>>> + order = 0;
>>>>>>> + nr = 1;
>>>>>>> + }
>>>>>>> +
>>>>>>> /*
>>>>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>> * large number of cpus that might not scale well.
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>> struct shrink_control *sc);
>>>>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>> struct shrink_control *sc);
>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> - struct page *split_at, struct xa_state *xas,
>>>>>>> - struct address_space *mapping, bool uniform_split);
>>>>>>> -
>>>>>>> static bool split_underused_thp = true;
>>>>>>>
>>>>>>> static atomic_t huge_zero_refcount;
>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>> pmd_populate(mm, pmd, pgtable);
>>>>>>> }
>>>>>>>
>>>>>>> -/**
>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>> - * split folios for pages that are partially mapped
>>>>>>> - *
>>>>>>> - * @folio: the folio to split
>>>>>>> - *
>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>> - */
>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>> -{
>>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>> - struct folio *new_folio;
>>>>>>> - int ret = 0;
>>>>>>> -
>>>>>>> - /*
>>>>>>> - * Split the folio now. In the case of device
>>>>>>> - * private pages, this path is executed when
>>>>>>> - * the pmd is split and since freeze is not true
>>>>>>> - * it is likely the folio will be deferred_split.
>>>>>>> - *
>>>>>>> - * With device private pages, deferred splits of
>>>>>>> - * folios should be handled here to prevent partial
>>>>>>> - * unmaps from causing issues later on in migration
>>>>>>> - * and fault handling flows.
>>>>>>> - */
>>>>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>> - VM_WARN_ON(ret);
>>>>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>>> - zone_device_private_split_cb(folio, new_folio);
>>>>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>> - new_folio));
>>>>>>> - }
>>>>>>> -
>>>>>>> - /*
>>>>>>> - * Mark the end of the folio split for device private THP
>>>>>>> - * split
>>>>>>> - */
>>>>>>> - zone_device_private_split_cb(folio, NULL);
>>>>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> - return ret;
>>>>>>> -}
>>>>>>> -
>>>>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>> unsigned long haddr, bool freeze)
>>>>>>> {
>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>> freeze = false;
>>>>>>> if (!freeze) {
>>>>>>> rmap_t rmap_flags = RMAP_NONE;
>>>>>>> - unsigned long addr = haddr;
>>>>>>> - struct folio *new_folio;
>>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>>
>>>>>>> if (anon_exclusive)
>>>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>
>>>>>>> - folio_lock(folio);
>>>>>>> - folio_get(folio);
>>>>>>> -
>>>>>>> - split_device_private_folio(folio);
>>>>>>> -
>>>>>>> - for (new_folio = folio_next(folio);
>>>>>>> - new_folio != end_folio;
>>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>>> - addr += PAGE_SIZE;
>>>>>>> - folio_unlock(new_folio);
>>>>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>>>>> - &new_folio->page, 1,
>>>>>>> - vma, addr, rmap_flags);
>>>>>>> - }
>>>>>>> - folio_unlock(folio);
>>>>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>> - 1, vma, haddr, rmap_flags);
>>>>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>> + if (anon_exclusive)
>>>>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>> + vma, haddr, rmap_flags);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> if (nr_shmem_dropped)
>>>>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>
>>>>>>> - if (!ret && is_anon)
>>>>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>
>>>>>>> remap_page(folio, 1 << order, remap_flags);
>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>> --- a/mm/migrate_device.c
>>>>>>> +++ b/mm/migrate_device.c
>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>> * page table entry. Other special swap entries are not
>>>>>>> * migratable, and we ignore regular swapped page.
>>>>>>> */
>>>>>>> + struct folio *folio;
>>>>>>> +
>>>>>>> entry = pte_to_swp_entry(pte);
>>>>>>> if (!is_device_private_entry(entry))
>>>>>>> goto next;
>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>> pgmap->owner != migrate->pgmap_owner)
>>>>>>> goto next;
>>>>>>>
>>>>>>> + folio = page_folio(page);
>>>>>>> + if (folio_test_large(folio)) {
>>>>>>> + struct folio *new_folio;
>>>>>>> + struct folio *new_fault_folio;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * The reason for finding pmd present with a
>>>>>>> + * device private pte and a large folio for the
>>>>>>> + * pte is partial unmaps. Split the folio now
>>>>>>> + * for the migration to be handled correctly
>>>>>>> + */
>>>>>>> + pte_unmap_unlock(ptep, ptl);
>>>>>>> +
>>>>>>> + folio_get(folio);
>>>>>>> + if (folio != fault_folio)
>>>>>>> + folio_lock(folio);
>>>>>>> + if (split_folio(folio)) {
>>>>>>> + if (folio != fault_folio)
>>>>>>> + folio_unlock(folio);
>>>>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>> + goto next;
>>>>>>> + }
>>>>>>> +
>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>
>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>> nouveau should check the folio order after the possible split happens.
>>>>
>>> You mean the folio_split callback?
>> no, nouveau_dmem_migrate_to_ram():
>> ..
>> sfolio = page_folio(vmf->page);
>> order = folio_order(sfolio);
>> ...
>> migrate_vma_setup()
>> ..
>> if sfolio is split order still reflects the pre-split order
>>
> Will fix, good catch!
>
>>>>>>> + /*
>>>>>>> + * After the split, get back the extra reference
>>>>>>> + * on the fault_page, this reference is checked during
>>>>>>> + * folio_migrate_mapping()
>>>>>>> + */
>>>>>>> + if (migrate->fault_page) {
>>>>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>>>>> + folio_get(new_fault_folio);
>>>>>>> + }
>>>>>>> +
>>>>>>> + new_folio = page_folio(page);
>>>>>>> + pfn = page_to_pfn(page);
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Ensure the lock is held on the correct
>>>>>>> + * folio after the split
>>>>>>> + */
>>>>>>> + if (folio != new_folio) {
>>>>>>> + folio_unlock(folio);
>>>>>>> + folio_lock(new_folio);
>>>>>>> + }
>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>
>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>> on the folio corresponding to the new folio
>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>
>>> Not sure I follow what you're trying to elaborate on here
>> do_swap_page:
>> ..
>> if (trylock_page(vmf->page)) {
>> ret = pgmap->ops->migrate_to_ram(vmf);
>> <- vmf->page should be locked here even after split
>> unlock_page(vmf->page);
>>
> Yep, the split will unlock all tail folios, leaving the just head folio locked
> and this the change, the lock we need to hold is the folio lock associated with
> fault_page, pte entry and not unlock when the cause is a fault. The code seems
> to do the right thing there, let me double check
Yes, the fault case is ok. But if the migration is not for a fault, we should not leave any page locked
>
> Balbir
> and the code does the right thing there.
>
--Mika
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-05 10:35 ` Mika Penttilä
@ 2025-08-05 10:36 ` Balbir Singh
2025-08-05 10:46 ` Mika Penttilä
0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 10:36 UTC (permalink / raw)
To: Mika Penttilä, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/5/25 20:35, Mika Penttilä wrote:
>
> On 8/5/25 13:27, Balbir Singh wrote:
>
>> On 8/5/25 14:24, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 07:10, Balbir Singh wrote:
>>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>>> FYI:
>>>>>>>>
>>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>>> without requiring the helper to split device private folios
>>>>>>>>
>>>>>>> I think this looks much better!
>>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>> ---
>>>>>>>> include/linux/huge_mm.h | 1 -
>>>>>>>> lib/test_hmm.c | 11 +++++-
>>>>>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>>>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>>>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>> vm_flags_t vm_flags);
>>>>>>>>
>>>>>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>> unsigned int new_order, bool unmapped);
>>>>>>>> int min_order_for_split(struct folio *folio);
>>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>>> --- a/lib/test_hmm.c
>>>>>>>> +++ b/lib/test_hmm.c
>>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>> * the mirror but here we use it to hold the page for the simulated
>>>>>>>> * device memory and that page holds the pointer to the mirror.
>>>>>>>> */
>>>>>>>> - rpage = vmf->page->zone_device_data;
>>>>>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>> dmirror = rpage->zone_device_data;
>>>>>>>>
>>>>>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>>>>>> order = folio_order(page_folio(vmf->page));
>>>>>>>> nr = 1 << order;
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * When folios are partially mapped, we can't rely on the folio
>>>>>>>> + * order of vmf->page as the folio might not be fully split yet
>>>>>>>> + */
>>>>>>>> + if (vmf->pte) {
>>>>>>>> + order = 0;
>>>>>>>> + nr = 1;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> /*
>>>>>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>> * large number of cpus that might not scale well.
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>> struct shrink_control *sc);
>>>>>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>> struct shrink_control *sc);
>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>> - struct page *split_at, struct xa_state *xas,
>>>>>>>> - struct address_space *mapping, bool uniform_split);
>>>>>>>> -
>>>>>>>> static bool split_underused_thp = true;
>>>>>>>>
>>>>>>>> static atomic_t huge_zero_refcount;
>>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>> pmd_populate(mm, pmd, pgtable);
>>>>>>>> }
>>>>>>>>
>>>>>>>> -/**
>>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>> - * split folios for pages that are partially mapped
>>>>>>>> - *
>>>>>>>> - * @folio: the folio to split
>>>>>>>> - *
>>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>> - */
>>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>>> -{
>>>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>>> - struct folio *new_folio;
>>>>>>>> - int ret = 0;
>>>>>>>> -
>>>>>>>> - /*
>>>>>>>> - * Split the folio now. In the case of device
>>>>>>>> - * private pages, this path is executed when
>>>>>>>> - * the pmd is split and since freeze is not true
>>>>>>>> - * it is likely the folio will be deferred_split.
>>>>>>>> - *
>>>>>>>> - * With device private pages, deferred splits of
>>>>>>>> - * folios should be handled here to prevent partial
>>>>>>>> - * unmaps from causing issues later on in migration
>>>>>>>> - * and fault handling flows.
>>>>>>>> - */
>>>>>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> - VM_WARN_ON(ret);
>>>>>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>>>> - zone_device_private_split_cb(folio, new_folio);
>>>>>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>>> - new_folio));
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> - /*
>>>>>>>> - * Mark the end of the folio split for device private THP
>>>>>>>> - * split
>>>>>>>> - */
>>>>>>>> - zone_device_private_split_cb(folio, NULL);
>>>>>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> - return ret;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>> unsigned long haddr, bool freeze)
>>>>>>>> {
>>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>> freeze = false;
>>>>>>>> if (!freeze) {
>>>>>>>> rmap_t rmap_flags = RMAP_NONE;
>>>>>>>> - unsigned long addr = haddr;
>>>>>>>> - struct folio *new_folio;
>>>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>>>
>>>>>>>> if (anon_exclusive)
>>>>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>
>>>>>>>> - folio_lock(folio);
>>>>>>>> - folio_get(folio);
>>>>>>>> -
>>>>>>>> - split_device_private_folio(folio);
>>>>>>>> -
>>>>>>>> - for (new_folio = folio_next(folio);
>>>>>>>> - new_folio != end_folio;
>>>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>>>> - addr += PAGE_SIZE;
>>>>>>>> - folio_unlock(new_folio);
>>>>>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>>>>>> - &new_folio->page, 1,
>>>>>>>> - vma, addr, rmap_flags);
>>>>>>>> - }
>>>>>>>> - folio_unlock(folio);
>>>>>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>>> - 1, vma, haddr, rmap_flags);
>>>>>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>>> + if (anon_exclusive)
>>>>>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>>> + vma, haddr, rmap_flags);
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> if (nr_shmem_dropped)
>>>>>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>>
>>>>>>>> - if (!ret && is_anon)
>>>>>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>>
>>>>>>>> remap_page(folio, 1 << order, remap_flags);
>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>> * page table entry. Other special swap entries are not
>>>>>>>> * migratable, and we ignore regular swapped page.
>>>>>>>> */
>>>>>>>> + struct folio *folio;
>>>>>>>> +
>>>>>>>> entry = pte_to_swp_entry(pte);
>>>>>>>> if (!is_device_private_entry(entry))
>>>>>>>> goto next;
>>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>> pgmap->owner != migrate->pgmap_owner)
>>>>>>>> goto next;
>>>>>>>>
>>>>>>>> + folio = page_folio(page);
>>>>>>>> + if (folio_test_large(folio)) {
>>>>>>>> + struct folio *new_folio;
>>>>>>>> + struct folio *new_fault_folio;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * The reason for finding pmd present with a
>>>>>>>> + * device private pte and a large folio for the
>>>>>>>> + * pte is partial unmaps. Split the folio now
>>>>>>>> + * for the migration to be handled correctly
>>>>>>>> + */
>>>>>>>> + pte_unmap_unlock(ptep, ptl);
>>>>>>>> +
>>>>>>>> + folio_get(folio);
>>>>>>>> + if (folio != fault_folio)
>>>>>>>> + folio_lock(folio);
>>>>>>>> + if (split_folio(folio)) {
>>>>>>>> + if (folio != fault_folio)
>>>>>>>> + folio_unlock(folio);
>>>>>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>>> + goto next;
>>>>>>>> + }
>>>>>>>> +
>>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>>
>>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>>> nouveau should check the folio order after the possible split happens.
>>>>>
>>>> You mean the folio_split callback?
>>> no, nouveau_dmem_migrate_to_ram():
>>> ..
>>> sfolio = page_folio(vmf->page);
>>> order = folio_order(sfolio);
>>> ...
>>> migrate_vma_setup()
>>> ..
>>> if sfolio is split order still reflects the pre-split order
>>>
>> Will fix, good catch!
>>
>>>>>>>> + /*
>>>>>>>> + * After the split, get back the extra reference
>>>>>>>> + * on the fault_page, this reference is checked during
>>>>>>>> + * folio_migrate_mapping()
>>>>>>>> + */
>>>>>>>> + if (migrate->fault_page) {
>>>>>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>>>>>> + folio_get(new_fault_folio);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + new_folio = page_folio(page);
>>>>>>>> + pfn = page_to_pfn(page);
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Ensure the lock is held on the correct
>>>>>>>> + * folio after the split
>>>>>>>> + */
>>>>>>>> + if (folio != new_folio) {
>>>>>>>> + folio_unlock(folio);
>>>>>>>> + folio_lock(new_folio);
>>>>>>>> + }
>>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>>
>>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>>> on the folio corresponding to the new folio
>>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>>
>>>> Not sure I follow what you're trying to elaborate on here
>>> do_swap_page:
>>> ..
>>> if (trylock_page(vmf->page)) {
>>> ret = pgmap->ops->migrate_to_ram(vmf);
>>> <- vmf->page should be locked here even after split
>>> unlock_page(vmf->page);
>>>
>> Yep, the split will unlock all tail folios, leaving the just head folio locked
>> and this the change, the lock we need to hold is the folio lock associated with
>> fault_page, pte entry and not unlock when the cause is a fault. The code seems
>> to do the right thing there, let me double check
>
> Yes the fault case is ok. But if migrate not for a fault, we should not leave any page locked
>
migrate_vma_finalize() handles this
Balbir
* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
2025-08-05 10:36 ` Balbir Singh
@ 2025-08-05 10:46 ` Mika Penttilä
0 siblings, 0 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05 10:46 UTC (permalink / raw)
To: Balbir Singh, Zi Yan
Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
Ralph Campbell
On 8/5/25 13:36, Balbir Singh wrote:
> On 8/5/25 20:35, Mika Penttilä wrote:
>> On 8/5/25 13:27, Balbir Singh wrote:
>>
>>> On 8/5/25 14:24, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/5/25 07:10, Balbir Singh wrote:
>>>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>>>> FYI:
>>>>>>>>>
>>>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>>>> without requiring the helper to split device private folios
>>>>>>>>>
>>>>>>>> I think this looks much better!
>>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>> ---
>>>>>>>>> include/linux/huge_mm.h | 1 -
>>>>>>>>> lib/test_hmm.c | 11 +++++-
>>>>>>>>> mm/huge_memory.c | 76 ++++-------------------------------------
>>>>>>>>> mm/migrate_device.c | 51 +++++++++++++++++++++++++++
>>>>>>>>> 4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>>> vm_flags_t vm_flags);
>>>>>>>>>
>>>>>>>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>>> int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>>> unsigned int new_order, bool unmapped);
>>>>>>>>> int min_order_for_split(struct folio *folio);
>>>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>>>> --- a/lib/test_hmm.c
>>>>>>>>> +++ b/lib/test_hmm.c
>>>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>>> * the mirror but here we use it to hold the page for the simulated
>>>>>>>>> * device memory and that page holds the pointer to the mirror.
>>>>>>>>> */
>>>>>>>>> - rpage = vmf->page->zone_device_data;
>>>>>>>>> + rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>>> dmirror = rpage->zone_device_data;
>>>>>>>>>
>>>>>>>>> /* FIXME demonstrate how we can adjust migrate range */
>>>>>>>>> order = folio_order(page_folio(vmf->page));
>>>>>>>>> nr = 1 << order;
>>>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * When folios are partially mapped, we can't rely on the folio
>>>>>>>>> + * order of vmf->page as the folio might not be fully split yet
>>>>>>>>> + */
>>>>>>>>> + if (vmf->pte) {
>>>>>>>>> + order = 0;
>>>>>>>>> + nr = 1;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> /*
>>>>>>>>> * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>>> * large number of cpus that might not scale well.
>>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>>> struct shrink_control *sc);
>>>>>>>>> static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>>> struct shrink_control *sc);
>>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>>> - struct page *split_at, struct xa_state *xas,
>>>>>>>>> - struct address_space *mapping, bool uniform_split);
>>>>>>>>> -
>>>>>>>>> static bool split_underused_thp = true;
>>>>>>>>>
>>>>>>>>> static atomic_t huge_zero_refcount;
>>>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>>> pmd_populate(mm, pmd, pgtable);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> -/**
>>>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> - * split folios for pages that are partially mapped
>>>>>>>>> - *
>>>>>>>>> - * @folio: the folio to split
>>>>>>>>> - *
>>>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> - */
>>>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>>>> -{
>>>>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>>>> - struct folio *new_folio;
>>>>>>>>> - int ret = 0;
>>>>>>>>> -
>>>>>>>>> - /*
>>>>>>>>> - * Split the folio now. In the case of device
>>>>>>>>> - * private pages, this path is executed when
>>>>>>>>> - * the pmd is split and since freeze is not true
>>>>>>>>> - * it is likely the folio will be deferred_split.
>>>>>>>>> - *
>>>>>>>>> - * With device private pages, deferred splits of
>>>>>>>>> - * folios should be handled here to prevent partial
>>>>>>>>> - * unmaps from causing issues later on in migration
>>>>>>>>> - * and fault handling flows.
>>>>>>>>> - */
>>>>>>>>> - folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> - ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>> - VM_WARN_ON(ret);
>>>>>>>>> - for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>>>>> - zone_device_private_split_cb(folio, new_folio);
>>>>>>>>> - folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>>>> - new_folio));
>>>>>>>>> - }
>>>>>>>>> -
>>>>>>>>> - /*
>>>>>>>>> - * Mark the end of the folio split for device private THP
>>>>>>>>> - * split
>>>>>>>>> - */
>>>>>>>>> - zone_device_private_split_cb(folio, NULL);
>>>>>>>>> - folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> - return ret;
>>>>>>>>> -}
>>>>>>>>> -
>>>>>>>>> static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>> unsigned long haddr, bool freeze)
>>>>>>>>> {
>>>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>> freeze = false;
>>>>>>>>> if (!freeze) {
>>>>>>>>> rmap_t rmap_flags = RMAP_NONE;
>>>>>>>>> - unsigned long addr = haddr;
>>>>>>>>> - struct folio *new_folio;
>>>>>>>>> - struct folio *end_folio = folio_next(folio);
>>>>>>>>>
>>>>>>>>> if (anon_exclusive)
>>>>>>>>> rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>>
>>>>>>>>> - folio_lock(folio);
>>>>>>>>> - folio_get(folio);
>>>>>>>>> -
>>>>>>>>> - split_device_private_folio(folio);
>>>>>>>>> -
>>>>>>>>> - for (new_folio = folio_next(folio);
>>>>>>>>> - new_folio != end_folio;
>>>>>>>>> - new_folio = folio_next(new_folio)) {
>>>>>>>>> - addr += PAGE_SIZE;
>>>>>>>>> - folio_unlock(new_folio);
>>>>>>>>> - folio_add_anon_rmap_ptes(new_folio,
>>>>>>>>> - &new_folio->page, 1,
>>>>>>>>> - vma, addr, rmap_flags);
>>>>>>>>> - }
>>>>>>>>> - folio_unlock(folio);
>>>>>>>>> - folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>>>> - 1, vma, haddr, rmap_flags);
>>>>>>>>> + folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>>>> + if (anon_exclusive)
>>>>>>>>> + rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>> + folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>>>> + vma, haddr, rmap_flags);
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> if (nr_shmem_dropped)
>>>>>>>>> shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>>>
>>>>>>>>> - if (!ret && is_anon)
>>>>>>>>> + if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>>> remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>>>
>>>>>>>>> remap_page(folio, 1 << order, remap_flags);
>>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>> * page table entry. Other special swap entries are not
>>>>>>>>> * migratable, and we ignore regular swapped page.
>>>>>>>>> */
>>>>>>>>> + struct folio *folio;
>>>>>>>>> +
>>>>>>>>> entry = pte_to_swp_entry(pte);
>>>>>>>>> if (!is_device_private_entry(entry))
>>>>>>>>> goto next;
>>>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>> pgmap->owner != migrate->pgmap_owner)
>>>>>>>>> goto next;
>>>>>>>>>
>>>>>>>>> + folio = page_folio(page);
>>>>>>>>> + if (folio_test_large(folio)) {
>>>>>>>>> + struct folio *new_folio;
>>>>>>>>> + struct folio *new_fault_folio;
>>>>>>>>> +
>>>>>>>>> + /*
>>>>>>>>> + * The reason for finding pmd present with a
>>>>>>>>> + * device private pte and a large folio for the
>>>>>>>>> + * pte is partial unmaps. Split the folio now
>>>>>>>>> + * for the migration to be handled correctly
>>>>>>>>> + */
>>>>>>>>> + pte_unmap_unlock(ptep, ptl);
>>>>>>>>> +
>>>>>>>>> + folio_get(folio);
>>>>>>>>> + if (folio != fault_folio)
>>>>>>>>> + folio_lock(folio);
>>>>>>>>> + if (split_folio(folio)) {
>>>>>>>>> + if (folio != fault_folio)
>>>>>>>>> + folio_unlock(folio);
>>>>>>>>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>>>> + goto next;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>>>
>>>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>>>> nouveau should check the folio order after the possible split happens.
>>>>>>
>>>>> You mean the folio_split callback?
>>>> no, nouveau_dmem_migrate_to_ram():
>>>> ..
>>>> sfolio = page_folio(vmf->page);
>>>> order = folio_order(sfolio);
>>>> ...
>>>> migrate_vma_setup()
>>>> ..
>>>> if sfolio is split order still reflects the pre-split order
>>>>
>>> Will fix, good catch!
>>>
>>>>>>>>> + /*
>>>>>>>>> + * After the split, get back the extra reference
>>>>>>>>> + * on the fault_page, this reference is checked during
>>>>>>>>> + * folio_migrate_mapping()
>>>>>>>>> + */
>>>>>>>>> + if (migrate->fault_page) {
>>>>>>>>> + new_fault_folio = page_folio(migrate->fault_page);
>>>>>>>>> + folio_get(new_fault_folio);
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + new_folio = page_folio(page);
>>>>>>>>> + pfn = page_to_pfn(page);
>>>>>>>>> +
>>>>>>>>> + /*
>>>>>>>>> + * Ensure the lock is held on the correct
>>>>>>>>> + * folio after the split
>>>>>>>>> + */
>>>>>>>>> + if (folio != new_folio) {
>>>>>>>>> + folio_unlock(folio);
>>>>>>>>> + folio_lock(new_folio);
>>>>>>>>> + }
>>>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>>>
>>>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>>>> on the folio corresponding to the new folio
>>>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>>>
>>>>> Not sure I follow what you're trying to elaborate on here
>>>> do_swap_page:
>>>> ..
>>>> if (trylock_page(vmf->page)) {
>>>> ret = pgmap->ops->migrate_to_ram(vmf);
>>>> <- vmf->page should be locked here even after split
>>>> unlock_page(vmf->page);
>>>>
>>> Yep, the split will unlock all tail folios, leaving the just head folio locked
>>> and this the change, the lock we need to hold is the folio lock associated with
>>> fault_page, pte entry and not unlock when the cause is a fault. The code seems
>>> to do the right thing there, let me double check
>> Yes the fault case is ok. But if migrate not for a fault, we should not leave any page locked
>>
> migrate_vma_finalize() handles this
But we are in migrate_vma_collect_pmd() after the split, and we then try to collect the pte, which locks the page again.
So it needs to be unlocked after the split.
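Something like this is what I mean, against the hunk above (a sketch with
the same variable names, not tested):

        /* after a successful split_folio(folio) ... */

        /*
         * split_folio() leaves the head folio locked.  Only the fault
         * folio has to stay locked for do_swap_page(); everything else
         * should be unlocked here, because the collect path below takes
         * the folio lock again when it actually collects this pte.
         */
        if (folio != fault_folio)
                folio_unlock(folio);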
>
> Balbir
>
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-08-05 4:22 ` Balbir Singh
@ 2025-08-05 10:57 ` David Hildenbrand
2025-08-05 11:01 ` Balbir Singh
0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-05 10:57 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 05.08.25 06:22, Balbir Singh wrote:
> On 7/30/25 19:50, David Hildenbrand wrote:
>
>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>
>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>
>
> I realized I did not answer this
>
> deferred_split() is the default action when partial unmaps take place. Anything that does
> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
> unmapped.
Right, but it's easy to exclude zone-device folios here. So the real
question is: do you want to deal with deferred splits or not?
If not, then just disable it right from the start.
--
Cheers,
David / dhildenb
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-08-05 10:57 ` David Hildenbrand
@ 2025-08-05 11:01 ` Balbir Singh
2025-08-05 12:58 ` David Hildenbrand
0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 11:01 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 8/5/25 20:57, David Hildenbrand wrote:
> On 05.08.25 06:22, Balbir Singh wrote:
>> On 7/30/25 19:50, David Hildenbrand wrote:
>>
>>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>>
>>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>>
>>
>> I realized I did not answer this
>>
>> deferred_split() is the default action when partial unmaps take place. Anything that does
>> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
>> unmapped.
>
> Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
>
> If not, then just disable it right from the start.
>
I agree, I was trying to avoid special-casing device private folios as much as possible, unless it is really needed
Balbir
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-08-05 11:01 ` Balbir Singh
@ 2025-08-05 12:58 ` David Hildenbrand
2025-08-05 21:15 ` Matthew Brost
0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-05 12:58 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Matthew Brost,
Francois Dugast
On 05.08.25 13:01, Balbir Singh wrote:
> On 8/5/25 20:57, David Hildenbrand wrote:
>> On 05.08.25 06:22, Balbir Singh wrote:
>>> On 7/30/25 19:50, David Hildenbrand wrote:
>>>
>>>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>>>
>>>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>>>
>>>
>>> I realized I did not answer this
>>>
>>> deferred_split() is the default action when partial unmaps take place. Anything that does
>>> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
>>> unmapped.
>>
>> Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
>>
>> If not, then just disable it right from the start.
>>
>
> I agree, I was trying to avoid special casing device private folios unless needed to the extent possible
By introducing a completely separate split logic :P
Jokes aside, we have plenty of zone_device special-casing already, no
harm in adding one more folio_is_zone_device() there.
Deferred splitting is weird enough already that you can call yourself
fortunate if you don't have to mess with it for zone-device folios.
Again, unless there is a benefit in having it.
--
Cheers,
David / dhildenb
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-08-05 12:58 ` David Hildenbrand
@ 2025-08-05 21:15 ` Matthew Brost
2025-08-06 12:19 ` Balbir Singh
0 siblings, 1 reply; 71+ messages in thread
From: Matthew Brost @ 2025-08-05 21:15 UTC (permalink / raw)
To: David Hildenbrand
Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Ralph Campbell,
Mika Penttilä, Francois Dugast
On Tue, Aug 05, 2025 at 02:58:42PM +0200, David Hildenbrand wrote:
> On 05.08.25 13:01, Balbir Singh wrote:
> > On 8/5/25 20:57, David Hildenbrand wrote:
> > > On 05.08.25 06:22, Balbir Singh wrote:
> > > > On 7/30/25 19:50, David Hildenbrand wrote:
> > > >
> > > > > I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
> > > > >
> > > > > My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
> > > > >
> > > >
> > > > I realized I did not answer this
> > > >
> > > > deferred_split() is the default action when partial unmaps take place. Anything that does
> > > > folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
> > > > unmapped.
> > >
> > > Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
> > >
> > > If not, then just disable it right from the start.
> > >
> >
> > I agree, I was trying to avoid special casing device private folios unless needed to the extent possible
>
> By introducing a completely separate split logic :P
>
> Jokes aside, we have plenty of zone_device special-casing already, no harm
> in adding one more folio_is_zone_device() there.
>
> Deferred splitting is all weird already that you can call yourself fortunate
> if you don't have to mess with that for zone-device folios.
>
> Again, unless there is a benefit in having it.
+1 on no deferred split for device folios.
Matt
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [v2 00/11] THP support for zone device page migration
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
` (11 preceding siblings ...)
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
@ 2025-08-05 21:34 ` Matthew Brost
12 siblings, 0 replies; 71+ messages in thread
From: Matthew Brost @ 2025-08-05 21:34 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
Ralph Campbell, Mika Penttilä, Francois Dugast
On Wed, Jul 30, 2025 at 07:21:28PM +1000, Balbir Singh wrote:
> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages. Larger order
> pages provide a speedup in throughput and latency.
>
> In my local testing (using lib/test_hmm) and a throughput test, the
> series shows a 350% improvement in data transfer throughput and a
> 500% improvement in latency
>
> These patches build on the earlier posts by Ralph Campbell [1]
>
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
>
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
>
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path. A new
> throughput test has been added as well.
>
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
>
> mTHP support:
>
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
>
> The future plan is to post enhancements to support mTHP with a rough
> design as follows:
>
> 1. Add the notion of allowable thp orders to the HMM based test driver
> 2. For non PMD based THP paths in migrate_device.c, check to see if
> a suitable order is found and supported by the driver
> 3. Iterate across orders to check the highest supported order for migration
> 4. Migrate and finalize
>
> The mTHP patches can be built on top of this series, the key design
> elements that need to be worked out are infrastructure and driver support
> for multiple ordered pages and their migration.
>
> HMM support for large folios:
>
> Francois Dugast posted patches support for HMM handling [4], the proposed
> changes can build on top of this series to provide support for HMM fault
> handling.
>
> References:
> [1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
> [2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
> [3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
> [4] https://lore.kernel.org/lkml/20250722193445.1588348-1-francois.dugast@intel.com/
>
> These patches are built on top of mm/mm-stable
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
>
> Changelog v2 [3] :
> - Several review comments from David Hildenbrand were addressed, Mika,
> Zi, Matthew also provided helpful review comments
> - In paths where it makes sense a new helper
> is_pmd_device_private_entry() is used
> - anon_exclusive handling of zone device private pages in
> split_huge_pmd_locked() has been fixed
> - Patches that introduced helpers have been folded into where they
> are used
> - Zone device handling in mm/huge_memory.c has benefited from the code
> and testing of Matthew Brost, he helped find bugs related to
> copy_huge_pmd() and partial unmapping of folios.
I see a ton of discussion on this series, particularly patch 2. It looks
like you have landed on a different solution for partial unmaps. I
wanted to pull this series in for testing, but if this is actively
being refactored, it is likely best to hold off until the next post, or to
test off a WIP branch if you have one.
Matt
> - Zone device THP PMD support via page_vma_mapped_walk() is restricted
> to try_to_migrate_one()
> - There is a new dedicated helper to split large zone device folios
>
> Changelog v1 [2]:
> - Support for handling fault_folio and using trylock in the fault path
> - A new test case has been added to measure the throughput improvement
> - General refactoring of code to keep up with the changes in mm
> - New split folio callback when the entire split is complete/done. The
> callback is used to know when the head order needs to be reset.
>
> Testing:
> - Testing was done with ZONE_DEVICE private pages on an x86 VM
>
> Balbir Singh (11):
> mm/zone_device: support large zone device private folios
> mm/thp: zone_device awareness in THP handling code
> mm/migrate_device: THP migration of zone device pages
> mm/memory/fault: add support for zone device THP fault handling
> lib/test_hmm: test cases and support for zone device private THP
> mm/memremap: add folio_split support
> mm/thp: add split during migration support
> lib/test_hmm: add test case for split pages
> selftests/mm/hmm-tests: new tests for zone device THP migration
> gpu/drm/nouveau: add THP migration support
> selftests/mm/hmm-tests: new throughput tests including THP
>
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
> drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
> drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
> include/linux/huge_mm.h | 19 +-
> include/linux/memremap.h | 51 ++-
> include/linux/migrate.h | 2 +
> include/linux/mm.h | 1 +
> include/linux/rmap.h | 2 +
> include/linux/swapops.h | 17 +
> lib/test_hmm.c | 432 ++++++++++++++----
> lib/test_hmm_uapi.h | 3 +
> mm/huge_memory.c | 358 ++++++++++++---
> mm/memory.c | 6 +-
> mm/memremap.c | 48 +-
> mm/migrate_device.c | 517 ++++++++++++++++++---
> mm/page_vma_mapped.c | 13 +-
> mm/pgtable-generic.c | 6 +
> mm/rmap.c | 22 +-
> tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
> 19 files changed, 2040 insertions(+), 319 deletions(-)
>
> --
> 2.50.1
>
* Re: [v2 01/11] mm/zone_device: support large zone device private folios
2025-08-05 21:15 ` Matthew Brost
@ 2025-08-06 12:19 ` Balbir Singh
0 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-06 12:19 UTC (permalink / raw)
To: Matthew Brost, David Hildenbrand
Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
Jane Chu, Alistair Popple, Donet Tom, Ralph Campbell,
Mika Penttilä, Francois Dugast
On 8/6/25 07:15, Matthew Brost wrote:
> On Tue, Aug 05, 2025 at 02:58:42PM +0200, David Hildenbrand wrote:
>> On 05.08.25 13:01, Balbir Singh wrote:
>>> On 8/5/25 20:57, David Hildenbrand wrote:
>>>> On 05.08.25 06:22, Balbir Singh wrote:
>>>>> On 7/30/25 19:50, David Hildenbrand wrote:
>>>>>
>>>>>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>>>>>
>>>>>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>>>>>
>>>>>
>>>>> I realized I did not answer this.
>>>>>
>>>>> Deferred splitting is the default action when a partial unmap takes place:
>>>>> anything that calls folio_remove_rmap_ptes() can cause the folio to be
>>>>> queued for a deferred split if it becomes partially unmapped.
>>>>
>>>> Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
>>>>
>>>> If not, then just disable it right from the start.
>>>>
>>>
>>> I agree. I was trying to avoid special-casing device private folios unless it is really needed.
>>
>> By introducing a completely separate split logic :P
>>
>> Jokes aside, we have plenty of zone_device special-casing already, no harm
>> in adding one more folio_is_zone_device() there.
>>
>> Deferred splitting is weird enough already that you can call yourself
>> fortunate if you don't have to mess with it for zone-device folios.
>>
>> Again, unless there is a benefit in having it.
>
> +1 on no deferred split for device folios.
>
>
I'll add a check in v3 so that we do not do deferred splits on zone device folios.
Balbir
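For reference, a minimal sketch of the check being discussed, assuming the
deferred_split_folio() entry point in mm/huge_memory.c keeps its current
shape; the exact placement and wording in v3 may differ:

/* mm/huge_memory.c -- illustrative sketch only, not the posted patch */
void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
        /*
         * Zone-device (e.g. device private) folios are never returned to
         * the buddy allocator, so splitting them under memory pressure
         * buys nothing; keep them off the deferred split queue entirely.
         */
        if (folio_is_zone_device(folio))
                return;

        /* ... existing queueing logic unchanged ... */
}

Bailing out at the top keeps device private folios off the per-node and
per-memcg split queues, so the shrinker never attempts to split them under
memory pressure.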
end of thread, other threads:[~2025-08-06 12:19 UTC | newest]
Thread overview: 71+ messages
2025-07-30 9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
2025-07-30 9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-30 9:50 ` David Hildenbrand
2025-08-04 23:43 ` Balbir Singh
2025-08-05 4:22 ` Balbir Singh
2025-08-05 10:57 ` David Hildenbrand
2025-08-05 11:01 ` Balbir Singh
2025-08-05 12:58 ` David Hildenbrand
2025-08-05 21:15 ` Matthew Brost
2025-08-06 12:19 ` Balbir Singh
2025-07-30 9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-30 11:16 ` Mika Penttilä
2025-07-30 11:27 ` Zi Yan
2025-07-30 11:30 ` Zi Yan
2025-07-30 11:42 ` Mika Penttilä
2025-07-30 12:08 ` Mika Penttilä
2025-07-30 12:25 ` Zi Yan
2025-07-30 12:49 ` Mika Penttilä
2025-07-30 15:10 ` Zi Yan
2025-07-30 15:40 ` Mika Penttilä
2025-07-30 15:58 ` Zi Yan
2025-07-30 16:29 ` Mika Penttilä
2025-07-31 7:15 ` David Hildenbrand
2025-07-31 8:39 ` Balbir Singh
2025-07-31 11:26 ` Zi Yan
2025-07-31 12:32 ` David Hildenbrand
2025-07-31 13:34 ` Zi Yan
2025-07-31 19:09 ` David Hildenbrand
2025-08-01 0:49 ` Balbir Singh
2025-08-01 1:09 ` Zi Yan
2025-08-01 7:01 ` David Hildenbrand
2025-08-01 1:16 ` Mika Penttilä
2025-08-01 4:44 ` Balbir Singh
2025-08-01 5:57 ` Balbir Singh
2025-08-01 6:01 ` Mika Penttilä
2025-08-01 7:04 ` David Hildenbrand
2025-08-01 8:01 ` Balbir Singh
2025-08-01 8:46 ` David Hildenbrand
2025-08-01 11:10 ` Zi Yan
2025-08-01 12:20 ` Mika Penttilä
2025-08-01 12:28 ` Zi Yan
2025-08-02 1:17 ` Balbir Singh
2025-08-02 10:37 ` Balbir Singh
2025-08-02 12:13 ` Mika Penttilä
2025-08-04 22:46 ` Balbir Singh
2025-08-04 23:26 ` Mika Penttilä
2025-08-05 4:10 ` Balbir Singh
2025-08-05 4:24 ` Mika Penttilä
2025-08-05 5:19 ` Mika Penttilä
2025-08-05 10:27 ` Balbir Singh
2025-08-05 10:35 ` Mika Penttilä
2025-08-05 10:36 ` Balbir Singh
2025-08-05 10:46 ` Mika Penttilä
2025-07-30 20:05 ` kernel test robot
2025-07-30 9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-31 16:19 ` kernel test robot
2025-07-30 9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
2025-07-30 9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-07-31 11:17 ` kernel test robot
2025-07-30 9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
2025-07-30 9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
2025-07-31 10:04 ` kernel test robot
2025-07-30 9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
2025-07-30 9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-07-30 9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
2025-07-30 9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
2025-07-30 23:18 ` Alistair Popple
2025-07-31 8:41 ` Balbir Singh
2025-07-31 8:56 ` David Hildenbrand
2025-08-05 21:34 ` Matthew Brost