All of lore.kernel.org
 help / color / mirror / Atom feed
From: Usama Arif <usama.arif@linux.dev>
To: Nico Pache <npache@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	david@kernel.org, lorenzo.stoakes@oracle.com,
	willy@infradead.org, linux-mm@kvack.org, fvdl@google.com,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	ryan.roberts@arm.com, Vlastimil Babka <vbabka@kernel.org>,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, maddy@linux.ibm.com, mpe@ellerman.id.au,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	gor@linux.ibm.com, agordeev@linux.ibm.com,
	borntraeger@linux.ibm.com, svens@linux.ibm.com,
	linux-s390@vger.kernel.org
Subject: Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
Date: Fri, 27 Feb 2026 11:13:55 +0000	[thread overview]
Message-ID: <1d3a4e8e-9ea0-42e7-b8e7-d92fb27f80f4@linux.dev> (raw)
In-Reply-To: <CAA1CXcAYt3OfW_uBTYZgr-dBhg99x=5pUs5uvqtpg+PNJ1KxGQ@mail.gmail.com>



On 26/02/2026 21:01, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> When the kernel creates a PMD-level THP mapping for anonymous pages, it
>> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
>> page table sits unused in a deposit list for the lifetime of the THP
>> mapping, only to be withdrawn when the PMD is split or zapped. Every
>> anonymous THP therefore wastes 4KB of memory unconditionally. On large
>> servers where hundreds of gigabytes of memory are mapped as THPs, this
>> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
>> could otherwise satisfy other allocations, including the very PTE page
>> table allocations needed when splits eventually occur.
>>
>> This series removes the pre-deposit and allocates the PTE page table
>> lazily — only when a PMD split actually happens. Since a large number
>> of THPs are never split (they are zapped wholesale when processes exit or
>> munmap the full range), the allocation is avoided entirely in the common
>> case.
>>
>> The pre-deposit pattern exists because split_huge_pmd was designed as an
>> operation that must never fail: if the kernel decides to split, it needs
>> a PTE page table, so one is deposited in advance. But "must never fail"
>> is an unnecessarily strong requirement. A PMD split is typically triggered
>> by a partial operation on a sub-PMD range — partial munmap, partial
>> mprotect, partial mremap and so on.
>> Most of these operations already have well-defined error handling for
>> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
>> fail and propagating the error through these existing paths is the natural
>> thing to do. Furthermore, split failing requires an order-0 allocation for
>> a page table to fail, which is extremely unlikely.
>>
>> Designing functions like split_huge_pmd as operations that cannot fail
>> has a subtle but real cost to code quality. It forces a pre-allocation
>> pattern - every THP creation path must deposit a page table, and every
>> split or zap path must withdraw one, creating a hidden coupling between
>> widely separated code paths.
>>
>> This also serves as a code cleanup. On every architecture except powerpc
>> with hash MMU, the deposit/withdraw machinery becomes dead code. The
>> series removes the generic implementations in pgtable-generic.c and the
>> s390/sparc overrides, replacing them with no-op stubs guarded by
>> arch_needs_pgtable_deposit(), which evaluates to false at compile time
>> on all non-powerpc architectures.
> 
> Hi Usama,
> 
> Thanks for tackling this, it seems like an interesting problem. Im
> trying to get more into reviewing, so bare with me I may have some
> stupid comments or questions. Where I can really help out is with
> testing. I will build this for all RH-supported architectures and run
> some automated test suites and performance metrics. I'll report back
> if I spot anything.
> 
> Cheers!
> -- Nico
> 

Thanks for the build and looking into reviewing this. All comments
and questions are welcome! I had only tested on x86, and I had a look
at the link you shared so its great to know that powerPC and s390 are fine.



  reply	other threads:[~2026-02-27 11:14 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
2026-03-02 21:20   ` Nico Pache
2026-03-04 11:55     ` Usama Arif
2026-03-05 16:55     ` Usama Arif
2026-03-09 15:09       ` Nico Pache
2026-03-09 21:34         ` Usama Arif
2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-02-26 16:32   ` kernel test robot
2026-02-27 12:11   ` Usama Arif
2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-02-26 13:56   ` kernel test robot
2026-02-26 14:22   ` Usama Arif
2026-02-26 15:10   ` kernel test robot
2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
2026-02-27 11:13   ` Usama Arif [this message]
2026-02-28  0:06     ` Nico Pache
2026-03-02 11:08       ` Usama Arif

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1d3a4e8e-9ea0-42e7-b8e7-d92fb27f80f4@linux.dev \
    --to=usama.arif@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=fvdl@google.com \
    --cc=gor@linux.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hca@linux.ibm.com \
    --cc=kas@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=maddy@linux.ibm.com \
    --cc=mpe@ellerman.id.au \
    --cc=npache@redhat.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=svens@linux.ibm.com \
    --cc=vbabka@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.