Re: [v1 resend 08/12] mm/thp: add split during migration support

From: Matthew Brost <matthew.brost@intel.com>
To: Zi Yan <ziy@nvidia.com>
Cc: "Balbir Singh" <balbirs@nvidia.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, "Karol Herbst" <kherbst@redhat.com>,
	"Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Barry Song" <baohua@kernel.org>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	"Ryan Roberts" <ryan.roberts@arm.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Peter Xu" <peterx@redhat.com>,
	"Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Jane Chu" <jane.chu@oracle.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Donet Tom" <donettom@linux.ibm.com>
Subject: Re: [v1 resend 08/12] mm/thp: add split during migration support
Date: Wed, 16 Jul 2025 09:24:45 -0700
Message-ID: <aHfSTdoi/M9ORrXE@lstrano-desk.jf.intel.com>
In-Reply-To: <1DD0079E-0AF6-49F5-9CB3-E440F36D2D9B@nvidia.com>

On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> 
> > On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >> On 7/6/25 11:34, Zi Yan wrote:
> >>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>
> >>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>
> >>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>
> >>>>>>> s/pages/folio
> >>>>>>>
> >>>>>>
> >>>>>> Thanks, will make the changes
> >>>>>>
> >>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>
> >>>>>>
> >>>>>> Ack, will change the name
> >>>>>>
> >>>>>>
> >>>>>>>>   *
> >>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>   */
> >>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>  {
> >>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>>>  		 * operations.
> >>>>>>>>  		 */
> >>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>>>> -		if (!anon_vma) {
> >>>>>>>> -			ret = -EBUSY;
> >>>>>>>> -			goto out;
> >>>>>>>> +		if (!isolated) {
> >>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>>>> +			if (!anon_vma) {
> >>>>>>>> +				ret = -EBUSY;
> >>>>>>>> +				goto out;
> >>>>>>>> +			}
> >>>>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>>>  		}
> >>>>>>>>  		end = -1;
> >>>>>>>>  		mapping = NULL;
> >>>>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>>>  	} else {
> >>>>>>>>  		unsigned int min_order;
> >>>>>>>>  		gfp_t gfp;
> >>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>  		goto out_unlock;
> >>>>>>>>  	}
> >>>>>>>>
> >>>>>>>> -	unmap_folio(folio);
> >>>>>>>> +	if (!isolated)
> >>>>>>>> +		unmap_folio(folio);
> >>>>>>>>
> >>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>  	local_irq_disable();
> >>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>
> >>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>> -				uniform_split);
> >>>>>>>> +				uniform_split, isolated);
> >>>>>>>>  	} else {
> >>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>  fail:
> >>>>>>>>  		if (mapping)
> >>>>>>>>  			xas_unlock(&xas);
> >>>>>>>>  		local_irq_enable();
> >>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>> +		if (!isolated)
> >>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>  		ret = -EAGAIN;
> >>>>>>>>  	}
> >>>>>>>
> >>>>>>> These "isolated" special cases do not look good. I wonder if there
> >>>>>>> is a way of letting the split code handle device private folios more gracefully.
> >>>>>>> It also causes confusion: why do "isolated/unmapped" folios
> >>>>>>> not need unmap_page(), remap_page(), or unlock?
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> There are two reasons for going down the current code path
> >>>>>
> >>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>
> >>>>
> >>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>> if calling the API with unmapped
> >>>
> >>> Before your patch, there is no use case of splitting unmapped folios.
> >>> Your patch only adds support for device private page split, not any unmapped
> >>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>
> >>
> >> There is a use for splitting unmapped folios (see below)
> >>
> >>>>
> >>>>> You should teach different parts of folio split code path to handle
> >>>>> device private folios properly. Details are below.
> >>>>>
> >>>>>>
> >>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>    the split routine to return with -EBUSY
> >>>>>
> >>>>> You do something below instead.
> >>>>>
> >>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>> 	ret = -EBUSY;
> >>>>> 	goto out;
> >>>>> } else if (anon_vma) {
> >>>>> 	anon_vma_lock_write(anon_vma);
> >>>>> }
> >>>>>
> >>>>
> >>>> folio_get_anon_vma() cannot be called for unmapped folios. In our case the page has
> >>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>> the check for device private folios?
> >>>
> >>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>> in if (!isolated) branch. In that case, just do
> >>>
> >>> if (folio_is_device_private(folio)) {
> >>> ...
> >>> } else if (is_anon) {
> >>> ...
> >>> } else {
> >>> ...
> >>> }
> >>>
> >>>>
> >>>>> People can know device private folio split needs a special handling.
> >>>>>
> >>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>> if a page cache folio is migrated to device private, kernel also
> >>>>> sees it as both device private and file-backed?
> >>>>>
> >>>>
> >>>> FYI: device private folios only work with anonymous private pages, hence
> >>>> the name device private.
> >>>
> >>> OK.
> >>>
> >>>>
> >>>>>
> >>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>>>    entries under consideration are already migration entries in this case.
> >>>>>>    This is wasteful and in some case unexpected.
> >>>>>
> >>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>> device private PMD mapping. Or if that is not preferred,
> >>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>> sees a device private folio.
> >>>>>
> >>>>> For remap_page(), you can simply return for device private folios
> >>>>> like it is currently doing for non anonymous folios.
> >>>>>
> >>>>
> >>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>> remap_folio(), because
> >>>>
> >>>> 1. We need to do a page table walk/rmap walk again
> >>>> 2. We'll need special handling of migration <-> migration entries
> >>>>    in the rmap handling (set/remove migration ptes)
> >>>> 3. In this context, the code is already in the middle of migration,
> >>>>    so trying to do that again does not make sense.
> >>>
> >>> Why doing split in the middle of migration? Existing split code
> >>> assumes to-be-split folios are mapped.
> >>>
> >>> What prevents doing split before migration?
> >>>
> >>
> >> The code does do a split prior to migration if THP selection fails
> >>
> >> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >> and the fallback part which calls split_folio()
> >>
> >> But the case under consideration is special since the device needs to allocate
> >> corresponding pfn's as well. The changelog mentions it:
> >>
> >> "The common case that arises is that after setup, during migrate
> >> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >> pages."
> >>
> >> I can expand on it, because migrate_vma() is a multi-phase operation
> >>
> >> 1. migrate_vma_setup()
> >> 2. migrate_vma_pages()
> >> 3. migrate_vma_finalize()
> >>
> >> It can so happen that when we get the destination pfn's allocated the destination
> >> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>
> >> The pages have been unmapped and collected in migrate_vma_setup()
> >>
> >> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >> tests the split and emulates a failure on the device side to allocate large pages
> >> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>
> >
> > Another use case I’ve seen is when a previously allocated high-order
> > folio, now in the free memory pool, is reallocated as a lower-order
> > page. For example, a 2MB fault allocates a folio, the memory is later
> 
> That is different. If the high-order folio is free, it should be split
> using split_page() from mm/page_alloc.c.
> 

Ah, ok. Let me see if that works - it would be easier.

> > freed, and then a 4KB fault reuses a page from that previously allocated
> > folio. This will be actually quite common in Xe / GPU SVM. In such
> > cases, the folio in an unmapped state needs to be split. I’d suggest a
> 
> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> __split_unmapped_folio() is not for it, unless you mean free folio
> differently.
> 

This is right, those fields should be clear.

Thanks for the tip.

Matt

> > migrate_device_* helper built on top of the core MM __split_folio
> > function added here.
> >
> 
> --
> Best Regards,
> Yan, Zi
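
The distinction drawn above (a free high-order block is split with
split_page(), not with the folio split path) can be illustrated with a small
userspace sketch. This is a toy model, not kernel code: struct toy_page and
toy_split_page() are hypothetical names invented for illustration, and the
model assumes only that splitting a free, non-compound order-N block gives
each of its 2^N pages an independent reference and order 0. No mapping, rmap,
or migration-entry state is involved, which is why unmap/remap handling does
not come into it.

```c
#include <assert.h>

/* Toy userspace model; all names here are hypothetical, not kernel APIs. */
struct toy_page {
	int refcount;        /* 1 on the head of a fresh block, 0 on the tails */
	unsigned int order;  /* meaningful on the head page only */
};

/*
 * Model of the idea behind split_page(): turn one order-N block, which
 * holds a single reference on its head page, into 2^N independent
 * order-0 pages that each hold their own reference.
 */
static void toy_split_page(struct toy_page *head, unsigned int order)
{
	unsigned long i, nr = 1UL << order;

	assert(head->refcount == 1);   /* caller owns the whole block */

	head->order = 0;
	for (i = 1; i < nr; i++) {
		head[i].refcount = 1;  /* tail pages become standalone pages */
		head[i].order = 0;
	}
}
```

A caller would set up an array of 1 << order toy pages with the head at
refcount 1, call toy_split_page(), and could then hand out or release each
page individually.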

