Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Balbir Singh <balbirs@nvidia.com>
To: "Mika Penttilä" <mpenttil@redhat.com>, "Zi Yan" <ziy@nvidia.com>
Cc: "David Hildenbrand" <david@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Karol Herbst" <kherbst@redhat.com>,
	"Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>, "Barry Song" <baohua@kernel.org>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	"Ryan Roberts" <ryan.roberts@arm.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Peter Xu" <peterx@redhat.com>,
	"Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Jane Chu" <jane.chu@oracle.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Donet Tom" <donettom@linux.ibm.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Francois Dugast" <francois.dugast@intel.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>
Subject: Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Date: Tue, 5 Aug 2025 20:27:32 +1000	[thread overview]
Message-ID: <b3f97ffd-2e25-4df6-9da5-65264ee6916b@nvidia.com> (raw)
In-Reply-To: <087e40e6-3b3f-4a02-8270-7e6cfdb56a04@redhat.com>

On 8/5/25 14:24, Mika Penttilä wrote:
> Hi,
> 
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>  		vm_flags_t vm_flags);
>>>>>>  
>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>  	 */
>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>  
>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>  	nr = 1 << order;
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>> +	 */
>>>>>> +	if (vmf->pte) {
>>>>>> +		order = 0;
>>>>>> +		nr = 1;
>>>>>> +	}
>>>>>> +
>>>>>>  	/*
>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>  	 * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>  					  struct shrink_control *sc);
>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>  					 struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>>  static bool split_underused_thp = true;
>>>>>>  
>>>>>>  static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>  }
>>>>>>  
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>> -	struct folio *new_folio;
>>>>>> -	int ret = 0;
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Split the folio now. In the case of device
>>>>>> -	 * private pages, this path is executed when
>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>> -	 *
>>>>>> -	 * With device private pages, deferred splits of
>>>>>> -	 * folios should be handled here to prevent partial
>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>> -	 * and fault handling flows.
>>>>>> -	 */
>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> -	VM_WARN_ON(ret);
>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> -								new_folio));
>>>>>> -	}
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>> -	 * split
>>>>>> -	 */
>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	return ret;
>>>>>> -}
>>>>>> -
>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>  {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  				freeze = false;
>>>>>>  			if (!freeze) {
>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>> -				unsigned long addr = haddr;
>>>>>> -				struct folio *new_folio;
>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>  
>>>>>>  				if (anon_exclusive)
>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>  
>>>>>> -				folio_lock(folio);
>>>>>> -				folio_get(folio);
>>>>>> -
>>>>>> -				split_device_private_folio(folio);
>>>>>> -
>>>>>> -				for (new_folio = folio_next(folio);
>>>>>> -					new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -					addr += PAGE_SIZE;
>>>>>> -					folio_unlock(new_folio);
>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>> -						&new_folio->page, 1,
>>>>>> -						vma, addr, rmap_flags);
>>>>>> -				}
>>>>>> -				folio_unlock(folio);
>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> +				if (anon_exclusive)
>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>  			}
>>>>>>  		}
>>>>>>  
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  	if (nr_shmem_dropped)
>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>  
>>>>>> -	if (!ret && is_anon)
>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>  
>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>  			 */
>>>>>> +			struct folio *folio;
>>>>>> +
>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>  				goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>  				goto next;
>>>>>>  
>>>>>> +			folio = page_folio(page);
>>>>>> +			if (folio_test_large(folio)) {
>>>>>> +				struct folio *new_folio;
>>>>>> +				struct folio *new_fault_folio;
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The reason for finding pmd present with a
>>>>>> +				 * device private pte and a large folio for the
>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>> +				 * for the migration to be handled correctly
>>>>>> +				 */
>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> +				folio_get(folio);
>>>>>> +				if (folio != fault_folio)
>>>>>> +					folio_lock(folio);
>>>>>> +				if (split_folio(folio)) {
>>>>>> +					if (folio != fault_folio)
>>>>>> +						folio_unlock(folio);
>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> +					goto next;
>>>>>> +				}
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
> 
> no, nouveau_dmem_migrate_to_ram():
> ..
>         sfolio = page_folio(vmf->page);
>         order = folio_order(sfolio);
> ...
>         migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
> 

Will fix, good catch!

>>
>>>>>> +				/*
>>>>>> +				 * After the split, get back the extra reference
>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>> +				 * folio_migrate_mapping()
>>>>>> +				 */
>>>>>> +				if (migrate->fault_page) {
>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>> +					folio_get(new_fault_folio);
>>>>>> +				}
>>>>>> +
>>>>>> +				new_folio = page_folio(page);
>>>>>> +				pfn = page_to_pfn(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * Ensure the lock is held on the correct
>>>>>> +				 * folio after the split
>>>>>> +				 */
>>>>>> +				if (folio != new_folio) {
>>>>>> +					folio_unlock(folio);
>>>>>> +					folio_lock(new_folio);
>>>>>> +				}
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
> 
> do_swap_page:
> ..
>         if (trylock_page(vmf->page)) {
>                 ret = pgmap->ops->migrate_to_ram(vmf);
>                                <- vmf->page should be locked here even after split
>                 unlock_page(vmf->page);
> 

Yep, the split will unlock all tail folios, leaving the just head folio locked
and this the change, the lock we need to hold is the folio lock associated with
fault_page, pte entry and not unlock when the cause is a fault. The code seems
to do the right thing there, let me double check

Balbir
and the code does the right thing there.

next prev parent reply	other threads:[~2025-08-05 10:27 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-30  9:50   ` David Hildenbrand
2025-08-04 23:43     ` Balbir Singh
2025-08-05  4:22     ` Balbir Singh
2025-08-05 10:57       ` David Hildenbrand
2025-08-05 11:01         ` Balbir Singh
2025-08-05 12:58           ` David Hildenbrand
2025-08-05 21:15             ` Matthew Brost
2025-08-06 12:19               ` Balbir Singh
2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-30 11:16   ` Mika Penttilä
2025-07-30 11:27     ` Zi Yan
2025-07-30 11:30       ` Zi Yan
2025-07-30 11:42         ` Mika Penttilä
2025-07-30 12:08           ` Mika Penttilä
2025-07-30 12:25             ` Zi Yan
2025-07-30 12:49               ` Mika Penttilä
2025-07-30 15:10                 ` Zi Yan
2025-07-30 15:40                   ` Mika Penttilä
2025-07-30 15:58                     ` Zi Yan
2025-07-30 16:29                       ` Mika Penttilä
2025-07-31  7:15                         ` David Hildenbrand
2025-07-31  8:39                           ` Balbir Singh
2025-07-31 11:26                           ` Zi Yan
2025-07-31 12:32                             ` David Hildenbrand
2025-07-31 13:34                               ` Zi Yan
2025-07-31 19:09                                 ` David Hildenbrand
2025-08-01  0:49                             ` Balbir Singh
2025-08-01  1:09                               ` Zi Yan
2025-08-01  7:01                                 ` David Hildenbrand
2025-08-01  1:16                               ` Mika Penttilä
2025-08-01  4:44                                 ` Balbir Singh
2025-08-01  5:57                                   ` Balbir Singh
2025-08-01  6:01                                   ` Mika Penttilä
2025-08-01  7:04                                   ` David Hildenbrand
2025-08-01  8:01                                     ` Balbir Singh
2025-08-01  8:46                                       ` David Hildenbrand
2025-08-01 11:10                                         ` Zi Yan
2025-08-01 12:20                                           ` Mika Penttilä
2025-08-01 12:28                                             ` Zi Yan
2025-08-02  1:17                                               ` Balbir Singh
2025-08-02 10:37                                               ` Balbir Singh
2025-08-02 12:13                                                 ` Mika Penttilä
2025-08-04 22:46                                                   ` Balbir Singh
2025-08-04 23:26                                                     ` Mika Penttilä
2025-08-05  4:10                                                       ` Balbir Singh
2025-08-05  4:24                                                         ` Mika Penttilä
2025-08-05  5:19                                                           ` Mika Penttilä
2025-08-05 10:27                                                           ` Balbir Singh [this message]
2025-08-05 10:35                                                             ` Mika Penttilä
2025-08-05 10:36                                                               ` Balbir Singh
2025-08-05 10:46                                                                 ` Mika Penttilä
2025-07-30 20:05   ` kernel test robot
2025-07-30  9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-31 16:19   ` kernel test robot
2025-07-30  9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
2025-07-30  9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-07-31 11:17   ` kernel test robot
2025-07-30  9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
2025-07-30  9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
2025-07-31 10:04   ` kernel test robot
2025-07-30  9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
2025-07-30  9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-07-30  9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
2025-07-30  9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
2025-07-30 23:18   ` Alistair Popple
2025-07-31  8:41   ` Balbir Singh
2025-07-31  8:56     ` David Hildenbrand
2025-08-05 21:34 ` Matthew Brost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b3f97ffd-2e25-4df6-9da5-65264ee6916b@nvidia.com \
    --to=balbirs@nvidia.com \
    --cc=airlied@gmail.com \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=dakr@kernel.org \
    --cc=david@redhat.com \
    --cc=donettom@linux.ibm.com \
    --cc=francois.dugast@intel.com \
    --cc=jane.chu@oracle.com \
    --cc=jglisse@redhat.com \
    --cc=kherbst@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lyude@redhat.com \
    --cc=matthew.brost@intel.com \
    --cc=mpenttil@redhat.com \
    --cc=peterx@redhat.com \
    --cc=rcampbell@nvidia.com \
    --cc=ryan.roberts@arm.com \
    --cc=shuah@kernel.org \
    --cc=simona@ffwll.ch \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.