Re: [RFC 01/11] mm/zone_device: support large zone device private folios

From: Balbir Singh <balbirs@nvidia.com>
To: David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
	"Karol Herbst" <kherbst@redhat.com>,
	"Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>, "Barry Song" <baohua@kernel.org>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	"Ryan Roberts" <ryan.roberts@arm.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Peter Xu" <peterx@redhat.com>, "Zi Yan" <ziy@nvidia.com>,
	"Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Jane Chu" <jane.chu@oracle.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Donet Tom" <donettom@linux.ibm.com>
Subject: Re: [RFC 01/11] mm/zone_device: support large zone device private folios
Date: Wed, 9 Jul 2025 15:25:26 +1000
Message-ID: <e4feba70-632f-4d97-a917-e53c19d09a32@nvidia.com>
In-Reply-To: <564babd9-fe33-4ca7-b63f-73f007fbfb51@redhat.com>

On 7/8/25 23:37, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
>> Add routines to support allocation of large order zone device folios
>> and helper functions for zone device folios, to check if a folio is
>> device private and helpers for setting zone device data.
>>
>> When large folios are used, the existing page_free() callback in
>> pgmap is called when the folio is freed, this is true for both
>> PAGE_SIZE and higher order pages.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/memremap.h | 22 +++++++++++++++++-
>>   mm/memremap.c            | 50 +++++++++++++++++++++++++++++-----------
>>   2 files changed, 58 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 4aa151914eab..11d586dd8ef1 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>       return is_device_private_page(&folio->page);
>>   }
>>
>> +static inline void *folio_zone_device_data(const struct folio *folio)
>> +{
>> +    VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
>> +    return folio->page.zone_device_data;
>> +}
> 
> Not used.
> 
>> +
>> +static inline void folio_set_zone_device_data(struct folio *folio, void *data)
>> +{
>> +    VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
>> +    folio->page.zone_device_data = data;
>> +}
>> +
> 
> Not used.
> 
> Move both into the patch where they are actually used.
> 

Ack

>>   static inline bool is_pci_p2pdma_page(const struct page *page)
>>   {
>>       return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
>> @@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>>   }
>>
>>   #ifdef CONFIG_ZONE_DEVICE
>> -void zone_device_page_init(struct page *page);
>> +void init_zone_device_folio(struct folio *folio, unsigned int order);
>>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>   void memunmap_pages(struct dev_pagemap *pgmap);
>>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>>   bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>
>>   unsigned long memremap_compat_align(void);
>> +
>> +static inline void zone_device_page_init(struct page *page)
>> +{
>> +    struct folio *folio = page_folio(page);
>> +
>> +    init_zone_device_folio(folio, 0);
>> +}
>> +
>>   #else
>>   static inline void *devm_memremap_pages(struct device *dev,
>>           struct dev_pagemap *pgmap)
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 2aebc1b192da..7d98d0a4c0cd 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -459,20 +459,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>   void free_zone_device_folio(struct folio *folio)
>>   {
>>       struct dev_pagemap *pgmap = folio->pgmap;
>> +    unsigned int nr = folio_nr_pages(folio);
>> +    int i;
>> +    bool anon = folio_test_anon(folio);
> 
> You can easily get rid of this (see below).
> 
>> +    struct page *page = folio_page(folio, 0);
> 
> Please inline folio_page(folio, 0) below instead.

Sure, is that preferred to taking a struct page ref?

> 
>>
>>       if (WARN_ON_ONCE(!pgmap))
>>           return;
>>
>>       mem_cgroup_uncharge(folio);
>>
>> -    /*
>> -     * Note: we don't expect anonymous compound pages yet. Once supported
>> -     * and we could PTE-map them similar to THP, we'd have to clear
>> -     * PG_anon_exclusive on all tail pages.
>> -     */
>> -    if (folio_test_anon(folio)) {
>> -        VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -        __ClearPageAnonExclusive(folio_page(folio, 0));
>> +    WARN_ON_ONCE(folio_test_large(folio) && !anon);
>> +
>> +    for (i = 0; i < nr; i++) {
>> +        if (anon)
>> +            __ClearPageAnonExclusive(folio_page(folio, i));
>>       }
> 
> if (folio_test_anon(folio)) {
>     for (i = 0; i < nr; i++)
>         __ClearPageAnonExclusive(folio_page(folio, i));
> } else {
>     VM_WARN_ON_ONCE(folio_test_large(folio));
> }
> 

Ack

>>
>>       /*
>> @@ -496,10 +497,19 @@ void free_zone_device_folio(struct folio *folio)
>>
>>       switch (pgmap->type) {
>>       case MEMORY_DEVICE_PRIVATE:
>> +        if (folio_test_large(folio)) {
>> +            folio_unqueue_deferred_split(folio);
> 
> Is deferred splitting even a thing for device-private?
> 
> Should we ever queue them for deferred splitting?
> 

Not really, but I wanted to do the right thing in the teardown path. I can remove these bits.

>> +
>> +            percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> 
> Looks like we want a helper put_dev_pagemap_refs(pgmap, nr) below instead.
> 
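
For illustration only, here is a userspace sketch of what such a helper might
look like. It is not kernel code: struct pgmap_model, the atomic counter, and
get_dev_pagemap_refs() are assumptions made so the example compiles on its own,
and put_dev_pagemap_refs() is only the name proposed above. In the kernel the
helper would presumably wrap percpu_ref_put_many() on pgmap->ref.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Userspace stand-in for struct dev_pagemap. The kernel keeps a percpu_ref
 * here; a plain atomic counter is an assumption made for this sketch.
 */
struct pgmap_model {
	atomic_long ref;
};

/* Hypothetical counterpart: take nr references, one per base page. */
static void get_dev_pagemap_refs(struct pgmap_model *pgmap, unsigned long nr)
{
	atomic_fetch_add(&pgmap->ref, (long)nr);
}

/*
 * Hypothetical helper named in the review: drop nr references in one call,
 * so the free path needs neither "nr - 1" nor a separate single put.
 */
static void put_dev_pagemap_refs(struct pgmap_model *pgmap, unsigned long nr)
{
	long old = atomic_fetch_sub(&pgmap->ref, (long)nr);

	/* Dropping more references than were taken is a bug. */
	assert(old >= (long)nr);
}
```

With a helper of this shape, the MEMORY_DEVICE_PRIVATE case could drop all
folio_nr_pages() references in a single call for both order-0 and large folios.
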
>> +        }
>> +        pgmap->ops->page_free(page);
>> +        put_dev_pagemap(pgmap);
>> +        page->mapping = NULL;
>> +        break;
>>       case MEMORY_DEVICE_COHERENT:
>>           if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>>               break;
>> -        pgmap->ops->page_free(folio_page(folio, 0));
>> +        pgmap->ops->page_free(page);
>>           put_dev_pagemap(pgmap);
>>           break;
>>
>> @@ -523,14 +533,28 @@ void free_zone_device_folio(struct folio *folio)
>>       }
>>   }
>>
>> -void zone_device_page_init(struct page *page)
>> +void init_zone_device_folio(struct folio *folio, unsigned int order)
>>   {
>> +    struct page *page = folio_page(folio, 0);
>> +
>> +    VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
> 
> VM_WARN_ON_ONCE() or anything else that is not *BUG, please.
> 

Ack

>> +
>> +    WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> 
> Why do we need that limitation?
> 

mTHP is not yet supported in the series. We could keep this routine more generic
and drop the checks, but I added them to prevent use of unsupported orders.
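
As a rough userspace sketch of the intent (not the kernel code), the check
accepts only order 0 or the PMD order, and the number of pgmap references
taken is one per base page, i.e. 1 << order. HPAGE_PMD_ORDER_MODEL and both
function names are hypothetical, and the value 9 assumes 4 KiB base pages with
2 MiB PMD mappings, as on x86-64.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB base pages and 2 MiB PMD mappings, as on x86-64. */
#define HPAGE_PMD_ORDER_MODEL 9u

/* Mirrors the WARN_ON_ONCE() condition: only order 0 or PMD order passes. */
static bool zone_device_order_supported(unsigned int order)
{
	return order == 0 || order == HPAGE_PMD_ORDER_MODEL;
}

/* Number of pgmap references taken for a folio: one per base page. */
static unsigned long zone_device_folio_nr_refs(unsigned int order)
{
	assert(zone_device_order_supported(order));
	return 1UL << order;
}
```

Any intermediate (mTHP) order would trip the check in this model, which is the
behaviour the warning is meant to flag until those orders are supported.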

>> +
>>       /*
>>        * Drivers shouldn't be allocating pages after calling
>>        * memunmap_pages().
>>        */
>> -    WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>> -    set_page_count(page, 1);
>> +    WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>> +    folio_set_count(folio, 1);
>>       lock_page(page);
>> +
>> +    /*
>> +     * Only PMD level migration is supported for THP migration
>> +     */
> 
> I don't understand how that comment interacts with the code below. This is basic large folio initialization.
> 
> Drop the comment, or move it above the HPAGE_PMD_ORDER check with a clear reason why that limitation exists.
>

Ack

>> +    if (order > 1) {
>> +        prep_compound_page(page, order);
>> +        folio_set_large_rmappable(folio);
>> +    }
>>   }
>> -EXPORT_SYMBOL_GPL(zone_device_page_init);
>> +EXPORT_SYMBOL_GPL(init_zone_device_folio);
> 
> 

Thanks for the review
Balbir
