From: Nico Pache <npache@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, akpm@linux-foundation.org,
corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, david@redhat.com,
baohua@kernel.org, ryan.roberts@arm.com, willy@infradead.org,
peterx@redhat.com, ziy@nvidia.com, wangkefeng.wang@huawei.com,
usamaarif642@gmail.com, sunnanyong@huawei.com,
vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com,
yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com,
aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com,
anshuman.khandual@arm.com, catalin.marinas@arm.com,
tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com,
jack@suse.cz, cl@gentwo.org, jglisse@google.com,
surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org,
rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org
Subject: Re: [PATCH v4 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support
Date: Wed, 23 Apr 2025 02:00:02 -0600
Message-ID: <CAA1CXcBsjAVdu4RWAYJC82Wm3o=OY_Z6iyEu0YNuiC5grG_z-Q@mail.gmail.com>
In-Reply-To: <e4e4aaae-92be-4cd2-9435-dccad99961bf@linux.alibaba.com>
On Wed, Apr 23, 2025 at 1:30 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/4/17 08:02, Nico Pache wrote:
> > Generalize the order of the __collapse_huge_page_* functions
> > to support future mTHP collapse.
> >
> > mTHP collapse can suffer from inconsistent behavior and memory
> > waste "creep". Disable swapin and shared support for mTHP collapse.
> >
> > No functional changes in this patch.
> >
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 46 ++++++++++++++++++++++++++++------------------
> > 1 file changed, 28 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 883e9a46359f..5e9272ab82da 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > unsigned long address,
> > pte_t *pte,
> > struct collapse_control *cc,
> > - struct list_head *compound_pagelist)
> > + struct list_head *compound_pagelist,
> > + u8 order)
> > {
> > struct page *page = NULL;
> > struct folio *folio = NULL;
> > pte_t *_pte;
> > int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> > bool writable = false;
> > + int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >
> > - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > + for (_pte = pte; _pte < pte + (1 << order);
> > _pte++, address += PAGE_SIZE) {
> > pte_t pteval = ptep_get(_pte);
> > if (pte_none(pteval) || (pte_present(pteval) &&
> > @@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > ++none_or_zero;
> > if (!userfaultfd_armed(vma) &&
> > (!cc->is_khugepaged ||
> > - none_or_zero <= khugepaged_max_ptes_none)) {
> > + none_or_zero <= scaled_none)) {
> > continue;
> > } else {
> > result = SCAN_EXCEED_NONE_PTE;
> > @@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > /* See hpage_collapse_scan_pmd(). */
> > if (folio_maybe_mapped_shared(folio)) {
> > ++shared;
> > - if (cc->is_khugepaged &&
> > - shared > khugepaged_max_ptes_shared) {
> > + if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > + shared > khugepaged_max_ptes_shared)) {
> > result = SCAN_EXCEED_SHARED_PTE;
> > count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> > goto out;
> > @@ -711,13 +713,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> > struct vm_area_struct *vma,
> > unsigned long address,
> > spinlock_t *ptl,
> > - struct list_head *compound_pagelist)
> > + struct list_head *compound_pagelist,
> > + u8 order)
> > {
> > struct folio *src, *tmp;
> > pte_t *_pte;
> > pte_t pteval;
> >
> > - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > + for (_pte = pte; _pte < pte + (1 << order);
> > _pte++, address += PAGE_SIZE) {
> > pteval = ptep_get(_pte);
> > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > @@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> > pmd_t *pmd,
> > pmd_t orig_pmd,
> > struct vm_area_struct *vma,
> > - struct list_head *compound_pagelist)
> > + struct list_head *compound_pagelist,
> > + u8 order)
> > {
> > spinlock_t *pmd_ptl;
> >
> > @@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> > * Release both raw and compound pages isolated
> > * in __collapse_huge_page_isolate.
> > */
> > - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> > + release_pte_pages(pte, pte + (1 << order), compound_pagelist);
> > }
> >
> > /*
> > @@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> > static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> > pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> > unsigned long address, spinlock_t *ptl,
> > - struct list_head *compound_pagelist)
> > + struct list_head *compound_pagelist, u8 order)
> > {
> > unsigned int i;
> > int result = SCAN_SUCCEED;
> > @@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> > /*
> > * Copying pages' contents is subject to memory poison at any iteration.
> > */
> > - for (i = 0; i < HPAGE_PMD_NR; i++) {
> > + for (i = 0; i < (1 << order); i++) {
> > pte_t pteval = ptep_get(pte + i);
> > struct page *page = folio_page(folio, i);
> > unsigned long src_addr = address + i * PAGE_SIZE;
> > @@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >
> > if (likely(result == SCAN_SUCCEED))
> > __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> > - compound_pagelist);
> > + compound_pagelist, order);
> > else
> > __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> > - compound_pagelist);
> > + compound_pagelist, order);
> >
> > return result;
> > }
> > @@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> > static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > struct vm_area_struct *vma,
> > unsigned long haddr, pmd_t *pmd,
> > - int referenced)
> > + int referenced, u8 order)
> > {
> > int swapped_in = 0;
> > vm_fault_t ret = 0;
> > - unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> > + unsigned long address, end = haddr + (PAGE_SIZE << order);
> > int result;
> > pte_t *pte = NULL;
> > spinlock_t *ptl;
> > @@ -1035,6 +1039,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > if (!is_swap_pte(vmf.orig_pte))
> > continue;
> >
> > + /* Don't swapin for mTHP collapse */
> > + if (order != HPAGE_PMD_ORDER) {
> > + result = SCAN_EXCEED_SWAP_PTE;
> > + goto out;
> > + }
>
> IMO, this check should be moved into hpage_collapse_scan_pmd(); that
> way, if we find swap ptes while scanning for mTHP collapse, we can
> return 'SCAN_EXCEED_SWAP_PTE' and abort the collapse earlier.
I don't think this is correct. We currently abort if the global
max_swap_ptes or max_shared_ptes limit is exceeded during the PMD
scan. However, if those checks pass (and we don't collapse at the PMD
level), we continue on to mTHP collapses. Then, during the isolate
function, we check for shared PTEs in that specific mTHP range and
abort if one is found. For swap we only know that some pages in the
PMD are unmapped, but not which ones, so we have to try to fault the
PTEs in; if one turns out to be a swap PTE and we are doing an mTHP
collapse, we abort that collapse attempt. So having swap/shared PTEs
in the PMD scan does not mean that ALL mTHP collapses will fail, only
that some will.
This may make more sense as you continue to review the series!
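To make the ordering concrete, here is a rough sketch of the flow I'm
describing (illustrative pseudo-C only, not the literal patch code;
for_each_mthp_range(), range_has_shared_pte(),
fault_in_range_hits_swap_pte() and collapse_range() are helpers made
up for this example):

    /* 1. PMD scan: the global limits gate the whole 2M region. */
    if (swap_ptes > khugepaged_max_ptes_swap)
            return SCAN_EXCEED_SWAP_PTE;    /* abort all collapses here */
    if (shared_ptes > khugepaged_max_ptes_shared)
            return SCAN_EXCEED_SHARED_PTE;

    /* 2. If PMD collapse is not possible, try each mTHP range. */
    for_each_mthp_range(range) {
            /* isolate: shared PTEs are counted per range */
            if (range_has_shared_pte(range))
                    continue;   /* SCAN_EXCEED_SHARED_PTE, this range only */

            /* swap PTEs are only discovered while faulting them in */
            if (fault_in_range_hits_swap_pte(range))
                    continue;   /* SCAN_EXCEED_SWAP_PTE, this range only */

            collapse_range(range);  /* other ranges still get attempted */
    }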
>
> The logic is the same as how you handle the shared ptes for mTHP.
>
> > vmf.pte = pte;
> > vmf.ptl = ptl;
> > ret = do_swap_page(&vmf);
> > @@ -1154,7 +1164,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * that case. Continuing to collapse causes inconsistency.
> > */
> > result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > - referenced);
> > + referenced, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> > }
> > @@ -1201,7 +1211,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > if (pte) {
> > result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > - &compound_pagelist);
> > + &compound_pagelist, HPAGE_PMD_ORDER);
> > spin_unlock(pte_ptl);
> > } else {
> > result = SCAN_PMD_NULL;
> > @@ -1231,7 +1241,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >
> > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > vma, address, pte_ptl,
> > - &compound_pagelist);
> > + &compound_pagelist, HPAGE_PMD_ORDER);
> > pte_unmap(pte);
> > if (unlikely(result != SCAN_SUCCEED))
> > goto out_up_write;
>
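As a side note on the first hunk above, the scaled none-PTE limit may
be easier to see with numbers. A worked example, assuming the x86-64
defaults of HPAGE_PMD_ORDER = 9 and khugepaged_max_ptes_none = 511:

    scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

    order 9 (PMD, 512 PTEs): 511 >> 0 = 511   /* unchanged */
    order 4 (64K,  16 PTEs): 511 >> 5 =  15   /* same ~511/512 ratio */
    order 2 (16K,   4 PTEs): 511 >> 7 =   3

so each mTHP order keeps roughly the same allowed fraction of
none/zero PTEs as the PMD case.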