From: Balbir Singh <balbirs@nvidia.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: damon@lists.linux.dev, dri-devel@lists.freedesktop.org,
"Balbir Singh" <balbirs@nvidia.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"David Hildenbrand" <david@redhat.com>, "Zi Yan" <ziy@nvidia.com>,
"Joshua Hahn" <joshua.hahnjy@gmail.com>,
"Rakie Kim" <rakie.kim@sk.com>,
"Byungchul Park" <byungchul@sk.com>,
"Gregory Price" <gourry@gourry.net>,
"Ying Huang" <ying.huang@linux.alibaba.com>,
"Alistair Popple" <apopple@nvidia.com>,
"Oscar Salvador" <osalvador@suse.de>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Baolin Wang" <baolin.wang@linux.alibaba.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
"Nico Pache" <npache@redhat.com>,
"Ryan Roberts" <ryan.roberts@arm.com>,
"Dev Jain" <dev.jain@arm.com>, "Barry Song" <baohua@kernel.org>,
"Lyude Paul" <lyude@redhat.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Ralph Campbell" <rcampbell@nvidia.com>,
"Mika Penttilä" <mpenttil@redhat.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Francois Dugast" <francois.dugast@intel.com>
Subject: [v4 00/15] mm: support device-private THP
Date: Wed, 3 Sep 2025 11:18:45 +1000
Message-ID: <20250903011900.3657435-1-balbirs@nvidia.com>
This patch series introduces support for Transparent Huge Page (THP)
migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory.
Background
The current zone device-private memory implementation only supports
PAGE_SIZE granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and GPU memory
This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed
In my local testing, using lib/test_hmm and a throughput test, the
series shows a 4x improvement in data transfer throughput and a
5x improvement in latency.
These patches build on the earlier posts by Ralph Campbell [1].
Two new flags are added to the migrate_vma API to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.
The series also adds zone device awareness to (m)THP pages along with
fault handling of large zone device private pages. The page vma walk and
rmap code are also zone device aware. Support has also been added for
folios that might need to be split in the middle of migration (when the
src and dst do not agree on MIGRATE_PFN_COMPOUND). This occurs when the
src side of the migration can migrate large pages, but the destination
has not been able to allocate large pages. The code also supports and
uses folio_split() when migrating THP pages; this is used when
MIGRATE_VMA_SELECT_COMPOUND is not passed as an argument to
migrate_vma_setup().
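As a rough illustration of the src/dst mismatch described above, here is
a minimal userspace sketch. It is not code from this series. The SKETCH_*
names and bit values are made up for the example; the real MIGRATE_PFN_*
definitions live in include/linux/migrate.h and may differ. The sketch
also assumes that a PMD-sized folio spans 512 base pages and that a
compound source entry sits in the first slot of its range.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only, NOT the kernel's MIGRATE_PFN_* layout. */
#define SKETCH_PFN_VALID    (1UL << 0)
#define SKETCH_PFN_COMPOUND (1UL << 1)

/* Assumed base pages per PMD-sized folio (4 KiB pages, 2 MiB THP). */
#define SKETCH_PMD_NR 512UL

/*
 * A split is needed when the source entry describes a large folio but
 * the destination entry does not, i.e. the destination side could not
 * provide a large page.
 */
static bool sketch_needs_split(unsigned long src, unsigned long dst)
{
	return (src & SKETCH_PFN_COMPOUND) && !(dst & SKETCH_PFN_COMPOUND);
}

/*
 * Walk a src/dst array pair and count the source folios that would
 * have to be split. A compound source entry covers SKETCH_PMD_NR
 * slots, so the walk skips over the rest of its range.
 */
static size_t sketch_count_splits(const unsigned long *src,
				  const unsigned long *dst, size_t npages)
{
	size_t i = 0, splits = 0;

	while (i < npages) {
		bool large = (src[i] & SKETCH_PFN_COMPOUND) != 0;

		if (sketch_needs_split(src[i], dst[i]))
			splits++;
		i += large ? SKETCH_PMD_NR : 1;
	}
	return splits;
}
```

With two PMD-sized source folios where only the second destination entry
carries the compound bit, sketch_count_splits() reports one split.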
The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.
The nouveau dmem code has been enhanced to use the new THP migration
capability.
mTHP support:
The patches hard code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various order sizes. With additional
refactoring of the code, support for different order sizes should be
possible.
The future plan is to post enhancements to support mTHP with a rough
design as follows:
1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize
The mTHP patches can be built on top of this series. The key design
elements that need to be worked out are infrastructure and driver support
for pages of multiple orders and their migration.
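Step 3 of the rough design above (iterating across orders to find the
highest supported one) could look something like the following userspace
sketch. This is only an assumption about one possible shape, not the
planned implementation: sketch_pick_order() and its parameters are
hypothetical, the allowed orders are modelled as a plain bitmask, and
max_order is assumed to be smaller than the bit width of unsigned long.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the highest order that
 *  - is set in the allowed_orders bitmask (bit N == order N allowed),
 *  - keeps the starting page index naturally aligned, and
 *  - fits within the nr_remaining pages left in the range.
 * Returns 0 (base page) when no larger order qualifies.
 */
static unsigned int sketch_pick_order(unsigned long allowed_orders,
				      unsigned long page_index,
				      unsigned long nr_remaining,
				      unsigned int max_order)
{
	for (unsigned int order = max_order; order > 0; order--) {
		unsigned long nr = 1UL << order;

		if (!(allowed_orders & (1UL << order)))
			continue;	/* driver does not support this order */
		if (page_index & (nr - 1))
			continue;	/* start is not aligned to this order */
		if (nr > nr_remaining)
			continue;	/* folio would run past the range */
		return order;
	}
	return 0;
}
```

For example, with orders 9, 4 and 2 allowed, a range starting at page
index 16 with 512 pages remaining falls back from order 9 (not aligned)
to order 4.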
HMM support for large folios:
Currently in mm-unstable [4]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
[3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
[4] https://lkml.kernel.org/r/20250902130713.1644661-1-francois.dugast@intel.com
[5] https://lore.kernel.org/lkml/20250730092139.3890844-1-balbirs@nvidia.com/
[6] https://lore.kernel.org/lkml/20250812024036.690064-1-balbirs@nvidia.com/
These patches are built on top of mm/mm-stable.
Changelog v4 [6] :
- Addressed review comments
- Split patch 2 into a smaller set of patches
- PVMW_THP_DEVICE_PRIVATE flag is no longer present
- damon/page_idle and other page_vma_mapped_walk paths are aware of
device-private folios
- No more flush for non-present entries in set_pmd_migration_entry
- Implemented a helper function for migrate_vma_split_folio() which
splits large folios if seen during a pte walk
- Removed the controversial change for folio_ref_freeze using
folio_expected_ref_count()
- Removed functions invoked from within VM_WARN_ON
- New test cases and fixes from Matthew Brost
- Fixed bugs reported by kernel test robot (Thanks!)
- Several fixes for THP support in nouveau driver
Changelog v3 [5] :
- Addressed review comments
- No more split_device_private_folio() helper
- Device private large folios do not end up on deferred scan lists
- Removed THP size order checks when initializing zone device folio
- Fixed bugs reported by kernel test robot (Thanks!)
Changelog v2 [3] :
- Several review comments from David Hildenbrand were addressed; Mika,
Zi and Matthew also provided helpful review comments
- In paths where it makes sense a new helper
is_pmd_device_private_entry() is used
- anon_exclusive handling of zone device private pages in
split_huge_pmd_locked() has been fixed
- Patches that introduced helpers have been folded into where they
are used
- Zone device handling in mm/huge_memory.c has benefited from the code
and testing of Matthew Brost, who helped find bugs related to
copy_huge_pmd() and partial unmapping of folios
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios
Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback, invoked when the entire split is complete.
The callback is used to know when the head order needs to be reset.
Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM
Balbir Singh (14):
  mm/zone_device: support large zone device private folios
  mm/huge_memory: add device-private THP support to PMD operations
  mm/rmap: extend rmap and migration support device-private entries
  mm/huge_memory: implement device-private THP splitting
  mm/migrate_device: handle partially mapped folios during collection
  mm/migrate_device: implement THP migration of zone device pages
  mm/memory/fault: add THP fault handling for zone device private pages
  lib/test_hmm: add zone device private THP test infrastructure
  mm/memremap: add driver callback support for folio splitting
  mm/migrate_device: add THP splitting during migration
  lib/test_hmm: add large page allocation failure testing
  selftests/mm/hmm-tests: new tests for zone device THP migration
  selftests/mm/hmm-tests: new throughput tests including THP
  gpu/drm/nouveau: enable THP support for GPU memory migration

Matthew Brost (1):
  selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 306 +++++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/huge_mm.h                |  18 +-
 include/linux/memremap.h               |  51 +-
 include/linux/migrate.h                |   2 +
 include/linux/mm.h                     |   1 +
 include/linux/swapops.h                |  27 +
 lib/test_hmm.c                         | 443 +++++++++---
 lib/test_hmm_uapi.h                    |   3 +
 mm/damon/ops-common.c                  |  20 +-
 mm/huge_memory.c                       | 288 ++++++--
 mm/memory.c                            |   6 +-
 mm/memremap.c                          |  38 +-
 mm/migrate_device.c                    | 614 +++++++++++++++--
 mm/page_idle.c                         |   5 +-
 mm/page_vma_mapped.c                   |  12 +-
 mm/pgtable-generic.c                   |   6 +
 mm/rmap.c                              |  25 +-
 tools/testing/selftests/mm/hmm-tests.c | 919 +++++++++++++++++++++++--
 20 files changed, 2399 insertions(+), 394 deletions(-)
--
2.50.1