linux-mm.kvack.org archive mirror
From: Usama Arif <usamaarif642@gmail.com>
To: David Hildenbrand <david@redhat.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Dev Jain <dev.jain@arm.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nico Pache <npache@redhat.com>,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	ziy@nvidia.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
	corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, akpm@linux-foundation.org,
	baohua@kernel.org, willy@infradead.org, peterx@redhat.com,
	wangkefeng.wang@huawei.com, sunnanyong@huawei.com,
	vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com,
	yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com,
	aarcange@redhat.com, raquini@redhat.com,
	anshuman.khandual@arm.com, catalin.marinas@arm.com,
	tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com,
	jack@suse.cz, cl@gentwo.org, jglisse@google.com,
	surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org,
	rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org,
	hughd@google.com
Subject: Re: [PATCH v10 00/13] khugepaged: mTHP support
Date: Tue, 2 Sep 2025 11:34:13 +0100
Message-ID: <17075d6a-a209-4636-ae42-2f8944aea745@gmail.com>
In-Reply-To: <286e2cb3-6beb-4d21-b28a-2f99bb2f759b@redhat.com>



On 02/09/2025 10:03, David Hildenbrand wrote:
> On 02.09.25 04:28, Baolin Wang wrote:
>>
>>
>> On 2025/9/2 00:46, David Hildenbrand wrote:
>>> On 29.08.25 03:55, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/8/28 18:48, Dev Jain wrote:
>>>>>
>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
>>>>>> (Sorry for chiming in late)
>>>>>>
>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
>>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
>>>>>>>>> but not sure
>>>>>>>>> if we have to add that for now.
>>>>>>>>
>>>>>>>> Yeah not so sure about this, this is a 'just have to know' too, and
>>>>>>>> yes you
>>>>>>>> might add it to the docs, but people are going to be mightily
>>>>>>>> confused, esp if
>>>>>>>> it's a calculated value.
>>>>>>>>
>>>>>>>> I don't see any other way around having a separate tunable if we
>>>>>>>> don't just have
>>>>>>>> something VERY simple like on/off.
>>>>>>>
>>>>>>> Yeah, not advocating that we add support for other values than 0/511,
>>>>>>> really.
>>>>>>>
>>>>>>>>
>>>>>>>> Also the mentioned issue sounds like something that needs to be
>>>>>>>> fixed elsewhere
>>>>>>>> honestly in the algorithm used to figure out mTHP ranges (I may be
>>>>>>>> wrong - and
>>>>>>>> happy to stand corrected if this is somehow inherent, but really
>>>>>>>> feels that
>>>>>>>> way).
>>>>>>>
>>>>>>> I think the creep is unavoidable for certain values.
>>>>>>>
>>>>>>> If you have the first two pages of a PMD area populated, and you
>>>>>>> allow for at least half of the #PTEs to be none/zero, you'd first
>>>>>>> collapse an order-2 folio, then an order-3 ... until you reach PMD
>>>>>>> order.
>>>>>>>
>>>>>>> So for now we really should just support 0 / 511 to say "don't
>>>>>>> collapse if there are holes" vs. "always collapse if there is at
>>>>>>> least one pte used".
>>>>>>
>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
>>>>>> no mTHP collapses would ever occur anyway, unless you have 2MB
>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
>>>>>> highest enabled order would ever be collapsed."
>>>>> I didn't understand this statement. At 511, mTHP collapses will occur if
>>>>> khugepaged cannot get a PMD folio. Our goal is to collapse to the
>>>>> highest order folio.
>>>>
>>>> Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
>>>> mean is, as in the example I gave below, users may only want to allow a
>>>> large order collapse when the number of present PTEs reaches half of the
>>>> large folio, in order to avoid RSS bloat.
>>>
>>> How do these users control allocation at fault time where this parameter
>>> is completely ignored?
>>
>> Sorry, I did not get your point. Why does the 'max_ptes_none' need to
>> control allocation at fault time? Could you be more specific? Thanks.
> 
> The comment over khugepaged_max_ptes_none gives a hint:
> 
> /*
>  * default collapse hugepages if there is at least one pte mapped like
>  * it would have happened if the vma was large enough during page
>  * fault.
>  *
>  * Note that these are only respected if collapse was initiated by khugepaged.
>  */
> 
> In the common case (for anything that really cares about RSS bloat) you will just
> get a THP during page fault and consequently RSS bloat.
> 
> As raised in my other reply, the only documented reason to set max_ptes_none=0 seems
> to be when an application later (after once possibly getting a THP already during
> page faults) did some MADV_DONTNEED and wants to control the usage of THPs itself using
> MADV_COLLAPSE.
> 
> It's a questionable use case, that already got more problematic with mTHP and page
> table reclaim.
> 
> Let me explain:
> 
> Before mTHP, if someone would MADV_DONTNEED (resulting in
> a page table with at least one pte_none entry), there would have been no way we would
> get memory over-allocated afterwards with max_ptes_none=0.
> 
> (1) Page faults would spot "there is a page table" and just fall back to order-0 pages.
> (2) khugepaged was told to not collapse through max_ptes_none=0.
> 
> But now:
> 
> (A) With mTHP during page-faults, we can just end up over-allocating memory in such
>     an area again: page faults will simply spot a bunch of pte_nones around the fault area
>     and install an mTHP.
> 
> (B) With page table reclaim (when zapping all PTEs in a table at once), we will reclaim the
>     page table. The next page fault will just try installing a PMD THP again, because there is
>     no PTE table anymore.
> 
> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
> limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense.
> 
> 

For PMD-mapped THPs with the THP shrinker, this has changed. You can basically tame page faults: when you
encounter memory pressure, the shrinker kicks in if max_ptes_none is less than HPAGE_PMD_NR - 1 (i.e. 511
on x86), and will break down those hugepages and free up zero-filled memory. I have seen in our prod
workloads that memory usage and THP usage can spike (usually when the workload starts), but under memory
pressure the memory usage is lower than with max_ptes_none = 511, while still keeping the benefits
of THPs like fewer TLB misses.
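
A rough userspace model of that shrinker behaviour, as a Python sketch rather than the kernel code. The function names are hypothetical, and the exact comparison (a THP is split when its zero-filled page count exceeds max_ptes_none) is my assumption about the criterion; the only parts taken from the text above are that the shrinker is inactive at HPAGE_PMD_NR - 1 and that it frees zero-filled memory.

```python
HPAGE_PMD_NR = 512   # base pages per PMD-sized THP, assuming 4 KiB pages (x86-64)
PAGE_SIZE_KIB = 4    # assumed base page size in KiB


def shrinker_would_split(zero_filled_pages: int, max_ptes_none: int) -> bool:
    """Hypothetical model: would the THP shrinker split this THP under pressure?

    Assumption: the shrinker is disabled when max_ptes_none == HPAGE_PMD_NR - 1,
    and otherwise splits a THP whose zero-filled page count exceeds max_ptes_none.
    """
    if not 0 <= zero_filled_pages <= HPAGE_PMD_NR:
        raise ValueError("zero_filled_pages must be within one PMD-sized THP")
    if max_ptes_none >= HPAGE_PMD_NR - 1:
        return False  # shrinker never kicks in at 511
    return zero_filled_pages > max_ptes_none


def reclaimable_kib(zero_filled_pages: int, max_ptes_none: int) -> int:
    """Memory (KiB) the model frees by splitting and dropping zero-filled pages."""
    if shrinker_would_split(zero_filled_pages, max_ptes_none):
        return zero_filled_pages * PAGE_SIZE_KIB
    return 0


if __name__ == "__main__":
    for zero_pages, knob in [(400, 256), (400, 511), (100, 256)]:
        print(zero_pages, knob,
              shrinker_would_split(zero_pages, knob),
              reclaimable_kib(zero_pages, knob))
```

In this model a THP with 400 zero-filled pages is split at max_ptes_none = 256 but left alone at 511, which is the difference in memory usage under pressure described above.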

I do agree that the value of max_ptes_none is magical and different workloads can react very differently
to it. The relationship is definitely not linear, i.e. if I use max_ptes_none = 256, it does not mean
that the memory regression of using THP=always vs THP=madvise is halved.
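
One way to see why intermediate values behave so unintuitively is the "creep" David describes in the quoted text: with the first two pages of a PMD range populated and half of the PTEs allowed to be none, khugepaged would collapse to order 2, then order 3, and so on up to PMD order. Below is a small Python sketch of that effect. It is a toy model, not the series' code: the helper names are hypothetical, scaling max_ptes_none per order by a right shift is my assumption about how the limit is applied to smaller orders, order 2 as the smallest candidate is also an assumption, and only the aligned region at offset 0 of the PMD range is considered.

```python
HPAGE_PMD_ORDER = 9   # 512 PTEs per PMD, assuming 4 KiB pages (x86-64)
MIN_MTHP_ORDER = 2    # assumption: smallest order considered for collapse


def scaled_max_ptes_none(max_ptes_none: int, order: int) -> int:
    """Assumed per-order limit: scale the PMD-order knob down by a shift."""
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)


def collapse_pass(present: int, max_ptes_none: int):
    """One scan pass over a range whose first `present` PTEs are populated.

    Tries the highest order first and returns the order collapsed to,
    or None if no order satisfies its scaled limit.
    """
    for order in range(HPAGE_PMD_ORDER, MIN_MTHP_ORDER - 1, -1):
        size = 1 << order
        if present >= size:
            continue  # region of this order is already fully populated
        none = size - present
        if none <= scaled_max_ptes_none(max_ptes_none, order):
            return order
    return None


def simulate_creep(initial_present: int, max_ptes_none: int) -> list:
    """Repeat scan passes until nothing collapses; return the orders hit."""
    present = initial_present
    orders = []
    while True:
        order = collapse_pass(present, max_ptes_none)
        if order is None:
            return orders
        orders.append(order)
        present = 1 << order  # the collapsed folio populates the whole region


if __name__ == "__main__":
    for knob in (0, 255, 256, 511):
        print(knob, simulate_creep(2, knob))
```

Under these assumptions, max_ptes_none = 256 climbs through every order from 2 to 9 starting from just two populated pages, 255 and 0 never collapse, and 511 goes straight to PMD order, so a small change in the knob flips the outcome entirely.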



