linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Mika Penttilä" <mpenttil@redhat.com>
To: Zi Yan <ziy@nvidia.com>
Cc: "Balbir Singh" <balbirs@nvidia.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Karol Herbst" <kherbst@redhat.com>,
	"Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Barry Song" <baohua@kernel.org>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	"Ryan Roberts" <ryan.roberts@arm.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Peter Xu" <peterx@redhat.com>,
	"Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Jane Chu" <jane.chu@oracle.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Donet Tom" <donettom@linux.ibm.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Francois Dugast" <francois.dugast@intel.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>
Subject: Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Date: Wed, 30 Jul 2025 18:40:18 +0300	[thread overview]
Message-ID: <2308291f-3afc-44b4-bfc9-c6cf0cdd6295@redhat.com> (raw)
In-Reply-To: <8E2CE1DF-4C37-4690-B968-AEA180FF44A1@nvidia.com>


On 7/30/25 18:10, Zi Yan wrote:
> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>
>> On 7/30/25 15:25, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>
>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>
>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>
>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>> entries.
>>>>>>>>>
>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>> folio.
>>>>>>>>>
>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>> ---
>>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> +/**
>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>> + *
>>>>>>>>> + * @folio: the folio to split
>>>>>>>>> + *
>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> + */
>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>> +{
>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>> +	int ret = 0;
>>>>>>>>> +
>>>>>>>>> +	/*
>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>> +	 *
>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>> +	 */
>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>> device side mapping.
>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>> 5) remap device private mapping.
>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>> The folio only has migration entries pointing to it. From CPU perspective,
>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>> folio by replacing existing page table entries with migration entries
>>> and after that the folio is regarded as “unmapped”.
>>>
>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
> Yes, but from CPU perspective, both device private entry and migration entry
> are invalid CPU page table entries, so the device private folio is “unmapped”
> at CPU side.

Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.

Also which might confuse is that v1 of the series had only 
  migrate_vma_split_pages()
which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
Now, 
  split_device_private_folio()
operates on mapcount != 0 folios.

>
>
>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
> I am not sure that is the right timing of splitting a folio. The device private
> folio can be kept without splitting at split_huge_pmd() time.

Yes this doesn't look quite right, and also
+	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

looks suspicious

Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
more the exact conditions, there might be a better fix.

>
> But from CPU perspective, a device private folio has no CPU mapping, no other
> CPU can access or manipulate the folio. It should be OK to split it.
>
>>> mapping, IIUC.
>>>
>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>> folio meta data for split.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yan, Zi
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>>
>>>> --Mika
>>> Best Regards,
>>> Yan, Zi
>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika




  reply	other threads:[~2025-07-30 15:40 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-30  9:50   ` David Hildenbrand
2025-08-04 23:43     ` Balbir Singh
2025-08-05  4:22     ` Balbir Singh
2025-08-05 10:57       ` David Hildenbrand
2025-08-05 11:01         ` Balbir Singh
2025-08-05 12:58           ` David Hildenbrand
2025-08-05 21:15             ` Matthew Brost
2025-08-06 12:19               ` Balbir Singh
2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-30 11:16   ` Mika Penttilä
2025-07-30 11:27     ` Zi Yan
2025-07-30 11:30       ` Zi Yan
2025-07-30 11:42         ` Mika Penttilä
2025-07-30 12:08           ` Mika Penttilä
2025-07-30 12:25             ` Zi Yan
2025-07-30 12:49               ` Mika Penttilä
2025-07-30 15:10                 ` Zi Yan
2025-07-30 15:40                   ` Mika Penttilä [this message]
2025-07-30 15:58                     ` Zi Yan
2025-07-30 16:29                       ` Mika Penttilä
2025-07-31  7:15                         ` David Hildenbrand
2025-07-31  8:39                           ` Balbir Singh
2025-07-31 11:26                           ` Zi Yan
2025-07-31 12:32                             ` David Hildenbrand
2025-07-31 13:34                               ` Zi Yan
2025-07-31 19:09                                 ` David Hildenbrand
2025-08-01  0:49                             ` Balbir Singh
2025-08-01  1:09                               ` Zi Yan
2025-08-01  7:01                                 ` David Hildenbrand
2025-08-01  1:16                               ` Mika Penttilä
2025-08-01  4:44                                 ` Balbir Singh
2025-08-01  5:57                                   ` Balbir Singh
2025-08-01  6:01                                   ` Mika Penttilä
2025-08-01  7:04                                   ` David Hildenbrand
2025-08-01  8:01                                     ` Balbir Singh
2025-08-01  8:46                                       ` David Hildenbrand
2025-08-01 11:10                                         ` Zi Yan
2025-08-01 12:20                                           ` Mika Penttilä
2025-08-01 12:28                                             ` Zi Yan
2025-08-02  1:17                                               ` Balbir Singh
2025-08-02 10:37                                               ` Balbir Singh
2025-08-02 12:13                                                 ` Mika Penttilä
2025-08-04 22:46                                                   ` Balbir Singh
2025-08-04 23:26                                                     ` Mika Penttilä
2025-08-05  4:10                                                       ` Balbir Singh
2025-08-05  4:24                                                         ` Mika Penttilä
2025-08-05  5:19                                                           ` Mika Penttilä
2025-08-05 10:27                                                           ` Balbir Singh
2025-08-05 10:35                                                             ` Mika Penttilä
2025-08-05 10:36                                                               ` Balbir Singh
2025-08-05 10:46                                                                 ` Mika Penttilä
2025-07-30 20:05   ` kernel test robot
2025-07-30  9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-31 16:19   ` kernel test robot
2025-07-30  9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
2025-07-30  9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-07-31 11:17   ` kernel test robot
2025-07-30  9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
2025-07-30  9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
2025-07-31 10:04   ` kernel test robot
2025-07-30  9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
2025-07-30  9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-07-30  9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
2025-07-30  9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
2025-07-30 23:18   ` Alistair Popple
2025-07-31  8:41   ` Balbir Singh
2025-07-31  8:56     ` David Hildenbrand
2025-08-05 21:34 ` Matthew Brost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2308291f-3afc-44b4-bfc9-c6cf0cdd6295@redhat.com \
    --to=mpenttil@redhat.com \
    --cc=airlied@gmail.com \
    --cc=apopple@nvidia.com \
    --cc=balbirs@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=dakr@kernel.org \
    --cc=david@redhat.com \
    --cc=donettom@linux.ibm.com \
    --cc=francois.dugast@intel.com \
    --cc=jane.chu@oracle.com \
    --cc=jglisse@redhat.com \
    --cc=kherbst@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lyude@redhat.com \
    --cc=matthew.brost@intel.com \
    --cc=peterx@redhat.com \
    --cc=rcampbell@nvidia.com \
    --cc=ryan.roberts@arm.com \
    --cc=shuah@kernel.org \
    --cc=simona@ffwll.ch \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).