From: Usama Arif <usama.arif@linux.dev>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
	"David Hildenbrand (Arm)" <david@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	lorenzo.stoakes@oracle.com, willy@infradead.org,
	linux-mm@kvack.org
Cc: fvdl@google.com, hannes@cmpxchg.org, riel@surriel.com,
	shakeel.butt@linux.dev, kas@kernel.org, baohua@kernel.org,
	dev.jain@arm.com, baolin.wang@linux.alibaba.com,
	npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
	vbabka@suse.cz, lance.yang@linux.dev,
	linux-kernel@vger.kernel.org, kernel-team@meta.com,
	Madhavan Srinivasan <maddy@linux.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
Date: Thu, 12 Feb 2026 15:25:33 +0000
Message-ID: <93908945-e0a8-429c-b119-eff63ebb2479@linux.dev>
In-Reply-To: <875x82ma6q.ritesh.list@gmail.com>



On 12/02/2026 12:13, Ritesh Harjani (IBM) wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
> 
>> CCing ppc folks
>>
> 
> Thanks David!
> 
>> On 2/11/26 13:49, Usama Arif wrote:
>>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>>> it pre-allocates a PTE page table and deposits it via
>>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>>> PMD split or zap. The rationale was that split must not fail—if the
>>> kernel decides to split a THP, it needs a PTE table to populate.
>>>
>>> However, every anon THP wastes 4KB (one page table page) that sits
>>> unused in the deposit list for the lifetime of the mapping. On systems
>>> with many THPs, this adds up to significant memory waste. The original
>>> rationale is also not an issue. It is ok for split to fail, and if the
>>> kernel can't find an order 0 allocation for split, there are much bigger
>>> problems. On large servers where you can easily have 100s of GBs of THPs,
>>> the memory usage for these tables is 200M per 100G. This memory could be
>>> used for any other use case, including allocating the page tables
>>> required during split.
>>>
>>> This patch removes the pre-deposit for anonymous pages on architectures
>>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>>> powerpc, and there only when the radix MMU is not enabled) and allocates
>>> the PTE table lazily—only when a split actually occurs. The split path
>>> is modified to accept a caller-provided page table.
>>>
>>> PowerPC exception:
>>>
>>> It would have been great if we could completely remove the pagetable
>>> deposit code, making this commit mostly a code cleanup patch.
>>> Unfortunately, PowerPC's hash MMU stores hash slot information in
>>> the deposited page table, so pre-deposit is necessary there. All deposit/
>>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>>> behavior is unchanged with this patch. On a better note,
>>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>>> on non PowerPC architectures and the pre-deposit code will not be
>>> compiled in.
>>
>> Is there a way to remove this? It's always been a confusing hack, now 
>> it's unpleasant to have around :)
>>
> 
> Hash MMU on PowerPC works fundamentally differently from other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires a few tricks to fit
> into Linux's multi-level SW page table model. ;)
> 
> 
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
> 
> On PowerPC, pgtable_t can be a pte fragment. 
> 
> typedef pte_t *pgtable_t;
> 
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit(). 
> 
> Doing a grep search, I think that's the same for sparc and s390 as well.
> 
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with 
>> all this hacky deposit code.
>>
> 
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them. 
> 
> 

Thanks for confirming! I will keep the pagetable deposit for powerpc
in the next revision.
I will rename pgtable_trans_huge_deposit() to
arch_pgtable_trans_huge_deposit() and move it to arch/powerpc. It will be
an empty function for all other architectures.
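
As a rough illustration of that plan, the shape could look like the sketch
below. This is a userspace mock and not the actual patch: struct mock_mm,
struct mock_pgtable, MOCK_ARCH_NEEDS_DEPOSIT and mock_map_anon_thp() are
hypothetical stand-ins, and the two-argument signature is simplified (the
existing kernel helper also takes a pmd_t pointer). Only the names
arch_needs_pgtable_deposit() and arch_pgtable_trans_huge_deposit() come
from this thread.

```c
/*
 * Userspace mock, NOT kernel code. The mock_* types, the MOCK_* switch and
 * the simplified two-argument signature are assumptions for illustration.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical build-time switch standing in for "powerpc with hash MMU". */
#ifndef MOCK_ARCH_NEEDS_DEPOSIT
#define MOCK_ARCH_NEEDS_DEPOSIT 0
#endif

struct mock_pgtable {
	struct mock_pgtable *next;
};

struct mock_mm {
	struct mock_pgtable *deposited;	/* head of the deposit list */
	size_t nr_deposited;
};

/* Constant-folds to false when the switch is 0, so guarded code drops out. */
static inline bool arch_needs_pgtable_deposit(void)
{
	return MOCK_ARCH_NEEDS_DEPOSIT != 0;
}

#if MOCK_ARCH_NEEDS_DEPOSIT
/* Arch-provided variant: push the table onto the per-mm deposit list. */
static inline void arch_pgtable_trans_huge_deposit(struct mock_mm *mm,
						   struct mock_pgtable *pgtable)
{
	pgtable->next = mm->deposited;
	mm->deposited = pgtable;
	mm->nr_deposited++;
}
#else
/* Everyone else: an empty stub, nothing is deposited. */
static inline void arch_pgtable_trans_huge_deposit(struct mock_mm *mm,
						   struct mock_pgtable *pgtable)
{
	(void)mm;
	(void)pgtable;
}
#endif

/* Mock map path: returns true if the table was consumed by a deposit. */
static inline bool mock_map_anon_thp(struct mock_mm *mm,
				     struct mock_pgtable *pgtable)
{
	size_t before;

	assert(mm && pgtable);
	before = mm->nr_deposited;
	arch_pgtable_trans_huge_deposit(mm, pgtable);
	return mm->nr_deposited != before;
}
```

With the switch left at 0, mock_map_anon_thp() reports that nothing was
deposited. Building with -DMOCK_ARCH_NEEDS_DEPOSIT=1 exercises the list
variant instead.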

>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
>> checks need to be wrapped in a reasonable helper and likely this all 
>> needs to get cleaned up further.
>>
>> The implementation of the generic pgtable_trans_huge_deposit and the 
>> radix handlers etc must be removed. If any code would trigger them it 
>> would be a bug.
>>
> 
> Sure, I think after this patch series, radix__pgtable_trans_huge_deposit()
> will mostly be dead code anyway. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).
> 
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.
> 
> 
>> If we have to keep this around, pgtable_trans_huge_deposit() should 
>> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
>> will not be generic support for it.
>>
> 
> Sure. That makes sense since PowerPC Hash MMU will still need this.
> 
> -ritesh


