From: Usama Arif <usama.arif@linux.dev>
To: "David Hildenbrand (Arm)" <david@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
lorenzo.stoakes@oracle.com, willy@infradead.org,
linux-mm@kvack.org
Cc: fvdl@google.com, hannes@cmpxchg.org, riel@surriel.com,
shakeel.butt@linux.dev, kas@kernel.org, baohua@kernel.org,
dev.jain@arm.com, baolin.wang@linux.alibaba.com,
npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
vbabka@suse.cz, lance.yang@linux.dev,
linux-kernel@vger.kernel.org, kernel-team@meta.com,
Madhavan Srinivasan <maddy@linux.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
Date: Wed, 11 Feb 2026 13:38:17 +0000
Message-ID: <735c0277-2155-47f5-a650-b4c8a0a2e6fc@linux.dev>
In-Reply-To: <13ab56cb-7fdb-4ee4-9170-f9f4fa4b6e37@kernel.org>

On 11/02/2026 13:25, David Hildenbrand (Arm) wrote:
> CCing ppc folks
>
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other use case, including allocating the page tables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and there only when radix page tables are not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we could completely remove the page table
>> deposit code, making this commit mostly a code cleanup patch.
>> Unfortunately, PowerPC has a hash MMU that stores hash slot information
>> in the deposited page table, so pre-deposit is necessary there. All
>> deposit/withdraw paths are guarded by arch_needs_pgtable_deposit(), so
>> PowerPC behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit() always evaluates to false at compile time
>> on non-PowerPC architectures, and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now it's unpleasant to have around :)
I spent some time researching this (I haven't worked with PowerPC before)
as I really wanted to get rid of all the pre-deposit code. I can't really see a
way without removing PMD THP support. I was going to CC the PowerPC maintainers
but I see that you already did!
>
> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 copied generic pgtable_trans_huge_deposit() hurts my belly.
>
>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>
Yes, that is what I found as well.
> So one obvious solution: remove PMD THP support for hash MMUs along with all this hacky deposit code.
>
I would be happy with that!
>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar checks need to be wrapped in a reasonable helper and likely this all needs to get cleaned up further.
Ack. The code will definitely look a lot cleaner and won't need many of these checks if we decide to remove
PMD THP support for hash MMU.
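
As a rough illustration of the helper idea, here is a minimal userspace sketch, not the patch itself. The helper name thp_pte_table_allocated_at_split() is hypothetical, struct vm_area_struct is reduced to a single field, and arch_needs_pgtable_deposit() is stubbed to return false, which is an assumption standing in for any configuration other than powerpc with the hash MMU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel structure; only the field this sketch needs. */
struct vm_area_struct {
	const void *vm_ops;	/* NULL for anonymous mappings */
};

/* Same test the kernel uses: an anonymous VMA has no vm_ops. */
static inline bool vma_is_anonymous(const struct vm_area_struct *vma)
{
	return !vma->vm_ops;
}

/*
 * Stub (assumption): false, as on architectures that do not need a
 * deposited PTE table. powerpc with the hash MMU would return true.
 */
static inline bool arch_needs_pgtable_deposit(void)
{
	return false;
}

/*
 * Hypothetical helper wrapping the open-coded check: true when no PTE
 * table is pre-deposited, so one has to be allocated at split time.
 */
static inline bool thp_pte_table_allocated_at_split(const struct vm_area_struct *vma)
{
	return vma_is_anonymous(vma) && !arch_needs_pgtable_deposit();
}
```

Callers would then test one predicate instead of repeating the two-part condition at every deposit/withdraw site.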
>
> The implementation of the generic pgtable_trans_huge_deposit and the radix handlers etc must be removed. If any code would trigger them it would be a bug.
>
> If we have to keep this around, pgtable_trans_huge_deposit() should likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there will not be generic support for it.
>
Ack.
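
For reference, the "200M per 100G" figure in the commit message can be reproduced with a small calculation. This is only a sketch: it assumes a 2MB PMD-sized THP and a 4KB page table page (the common 4K base page configuration), and the function and macro names are made up for illustration.

```c
#include <stdint.h>

/* Assumed sizes: 4KB PTE table page, 2MB PMD-mapped THP. */
#define PTE_TABLE_BYTES	(4ULL << 10)
#define PMD_THP_BYTES	(2ULL << 20)

/*
 * Bytes held in deposited PTE tables for thp_bytes worth of
 * PMD-mapped anonymous THPs: one table per THP.
 */
static inline uint64_t deposited_pte_table_bytes(uint64_t thp_bytes)
{
	return (thp_bytes / PMD_THP_BYTES) * PTE_TABLE_BYTES;
}
```

With 100G of THPs that is 51200 huge pages, each pinning one 4KB table, which gives 200M.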