Linux Trace Kernel


* Re: [GIT PULL] rv fixes for v7.1
From: Gabriele Monaco @ 2026-06-04 13:04 UTC
  To: Steven Rostedt
  Cc: Tomas Glozar, linux-kernel, linux-trace-kernel, unknownbbqrx,
	Wen Yang
In-Reply-To: <20260604085405.234d22eb@fedora>

On Thu, 2026-06-04 at 08:54 -0400, Steven Rostedt wrote:
> On Thu, 04 Jun 2026 14:42:02 +0200
> Gabriele Monaco <gmonaco@redhat.com> wrote:
> 
> > All this to say that, in my opinion unknownbbqrx
> > <dev@unknownbbqr.xyz>
> > is NOT an anonymous contribution, just a nickname that differs from
> > the legal name of this person (which we wouldn't validate anyway),
> > so I would say it complies with the rules.
> 
> It's a username on github and not a nickname. I did a search for
> "unknownbbqr" and it doesn't come up anywhere but Google tries to
> find similar matches which brings me to an OnlyFans account :-p
> 
> It *DOES NOT* qualify because there's no accountability for this. For
> people who have a nickname as their entire internet persona, sure,
> I'll take patches from them as there's an entity that exists behind
> it.
> But I'm not going to take some username on github as a persona. To
> me, that's still anonymous.

Alright, fair. In the link I sent, the signoff got changed to Ali Ahmet
MEMIS <dev@unknownbbqr.xyz>, but I believe we cannot use that unless
the user themselves adds it (and they seem unreachable).

I posted the re-authored patch in [1]. I'm not sure that's the proper
way, though (the patch is so simple that it is unmodified). But if you
give me a green light, I can send you a pull request with that patch
instead.

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/20260604120946.90302-2-gmonaco@redhat.com/

> 
> -- Steve



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 12:55 UTC
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <CAA1CXcDxZEmWtmGFiKDKSPSae8pN0at4vYV24FOs+t_GTGkZ6g@mail.gmail.com>

On Thu, Jun 04, 2026 at 06:45:58AM -0600, Nico Pache wrote:
> On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > > I will go review the thread about the cache maintenance separately and
> > > respond about that.
> > >
> > > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > > are attempting to collapse to, and offset indicates where in the PMD to
> > > > start the collapse attempt.
> > > >
> > > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > > the mTHP case this is not true, and we must keep the lock to prevent
> > > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > > a pmd_none while the PMD entry is currently unavailable due to being
> > > > temporarily removed during the collapse phase.
> > > >
> > > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > >
> > > The logic LGTM generally, some questions for understanding below, and of
> > > course as per above I want to review the Lance/David subthread.
> > >
> > > Thanks!
> > >
> > > > ---
> > > >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > > >  1 file changed, 55 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index fab35d318641..d64f42f66236 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> > > >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > > >   */
> > > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > -           int referenced, int unmapped, struct collapse_control *cc)
> > > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > > +           int referenced, int unmapped, struct collapse_control *cc,
> > > > +           unsigned int order)
> > > >  {
> > > > +   const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > > +   const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > > >     LIST_HEAD(compound_pagelist);
> > > >     pmd_t *pmd, _pmd;
> > > > -   pte_t *pte;
> > > > +   pte_t *pte = NULL;
> > >
> > > As mentioned elsewhere for some reason this was dropped in
> > > mm-unstable. Maybe a bad conflict resolution?
> > >
> > > >     pgtable_t pgtable;
> > > >     struct folio *folio;
> > > >     spinlock_t *pmd_ptl, *pte_ptl;
> > > >     enum scan_result result = SCAN_FAIL;
> > > >     struct vm_area_struct *vma;
> > > >     struct mmu_notifier_range range;
> > > > +   bool anon_vma_locked = false;
> > > >
> > > > -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > -
> > > > -   result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > > +   result = alloc_charge_folio(&folio, mm, cc, order);
> > > >     if (result != SCAN_SUCCEED)
> > > >             goto out_nolock;
> > > >
> > > >     mmap_read_lock(mm);
> > > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > -                                    HPAGE_PMD_ORDER);
> > > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > +                                    &vma, cc, order);
> > > >     if (result != SCAN_SUCCEED) {
> > > >             mmap_read_unlock(mm);
> > > >             goto out_nolock;
> > > >     }
> > > >
> > > > -   result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > +   result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > > >     if (result != SCAN_SUCCEED) {
> > > >             mmap_read_unlock(mm);
> > > >             goto out_nolock;
> > > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >              * released when it fails. So we jump out_nolock directly in
> > > >              * that case.  Continuing to collapse causes inconsistency.
> > > >              */
> > > > -           result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > > -                                                referenced, HPAGE_PMD_ORDER);
> > > > +           result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > > +                                                referenced, order);
> > > >             if (result != SCAN_SUCCEED)
> > > >                     goto out_nolock;
> > > >     }
> > > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >      * mmap_lock.
> > > >      */
> > > >     mmap_write_lock(mm);
> > > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > > -                                    HPAGE_PMD_ORDER);
> > > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > > +                                    &vma, cc, order);
> > > >     if (result != SCAN_SUCCEED)
> > > >             goto out_up_write;
> > > >     /* check if the pmd is still valid */
> > > >     vma_start_write(vma);
> >
> > Hmm actually I think we have another problem here.
> >
> > For PMD THP this is fine. Only a single VMA can span the range we need, and it
> > will span the entire PMD.
> >
> > But for mTHP we have an issue...
> >
> > See below...
> >
> > > > -   result = check_pmd_still_valid(mm, address, pmd);
> > > > +   result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > > >     if (result != SCAN_SUCCEED)
> > > >             goto out_up_write;
> > > >
> > > >     anon_vma_lock_write(vma->anon_vma);
> > > > +   anon_vma_locked = true;
> > >
> > > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > > change alters that, but Claude did suggest on the s390 bug that longer lock
> > > hold might be an issue.
> > >
> > > I wonder if we'll observe lock contention as a result?
> > >
> > > Correct me if I'm wrong and we're not holding longer than previously,
> > > however. Just appears that we do.
> > >
> > > >
> > > > -   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > -                           address + HPAGE_PMD_SIZE);
> > > > +   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > > +                           end_addr);
> > > >     mmu_notifier_invalidate_range_start(&range);
> > > >
> > > >     pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >      * Parallel GUP-fast is fine since GUP-fast will back off when
> > > >      * it detects PMD is changed.
> > > >      */
> > > > -   _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > > +   _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> >
> > ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> > but for mTHP we might have _another_ VMA that spans another part of the range
> > mapped by the same PMD entry.
> >
> > So we clear this, but we do not have a write lock on any other VMA, and so
> > racing VMA read locks can install a new PMD entry.
> >
> > > >     spin_unlock(pmd_ptl);
> >
> > Especially since you unlock this :)
> >
> > And...
> >
> > > >     mmu_notifier_invalidate_range_end(&range);
> > > >     tlb_remove_table_sync_one();
> > > >
> > > > -   pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > > +   pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > >     if (pte) {
> > > > -           result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > > -                                                 HPAGE_PMD_ORDER,
> > > > -                                                 &compound_pagelist);
> > > > +           result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > > +                                                 order, &compound_pagelist);
> > > >             spin_unlock(pte_ptl);
> > > >     } else {
> > > >             result = SCAN_NO_PTE_TABLE;
> > > >     }
> > > >
> > > >     if (unlikely(result != SCAN_SUCCEED)) {
> > > > -           if (pte)
> > > > -                   pte_unmap(pte);
> > >
> > > OK I seem to remember this is because we're holding the anon_vma lock
> > > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > > bit longer also as well as the anon_vma lock.
> > >
> > > I guess it's also because we need to hold anon_vma and pte lock because
> > > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > > THP did.
> > >
> > > (Rememberings going on here :)
> > >
> > > >             spin_lock(pmd_ptl);
> > > > -           BUG_ON(!pmd_none(*pmd));
> > > > +           WARN_ON_ONCE(!pmd_none(*pmd));
> >
> > ...this will get triggered.
> >
> > I don't know whether we can safely hold the PMD lock across everything here for
> > mTHP?
> >
> > Maybe the solution would have to be to scan through VMAs in the range of the PMD
> > and VMA write lock each of them?
>
> I believe we've spoken about this before, but because we always make

Maybe worth a comment then...? Ah how rewarding review is :)

This is something that somebody else might very well wonder about and
forget that it happens to be covered there.

Also:

/* Always check the PMD order to ensure its not shared by another VMA */

Is pretty lightweight there. Something about avoiding racing page faults
would be helpful.

> sure the VMA spans the full PMD we won't ever hit this issue. If we
> wanted to support mTHP collapse on regions smaller than a PMD, the
> locking gets tricky (hence the design choice to not do that for now).
>
> This is handled by the PMD order check in hugepage_vma_revalidate().

The existing code is atrocious, and sticking this on top has added to the
pile of assumptions and conventions and having to go check a bunch of
functions to 'just know' you're safe for X, Y, Z.

We really need to see some cleanup series coming after this and I'm going
to get pretty grumpy(ier) if we don't.

>
> /* Always check the PMD order to ensure its not shared by another VMA */
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>
> -- Nico
>
> >
> > That could cause some 'interesting' lock contention issues though? Then again,
> > we will be releasing the mmap write lock soon enough which will drop the VMA
> > write locks.
> >
> > > >             /*
> > > >              * We can only use set_pmd_at when establishing
> > > >              * hugepmds and never for establishing regular pmds that
> > > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >              */
> > > >             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > >             spin_unlock(pmd_ptl);
> > > > -           anon_vma_unlock_write(vma->anon_vma);
> > > >             goto out_up_write;
> > > >     }
> > > >
> > > >     /*
> > > > -    * All pages are isolated and locked so anon_vma rmap
> > > > -    * can't run anymore.
> > > > +    * For PMD collapse all pages are isolated and locked so anon_vma
> > > > +    * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > > +    * removed and not all pages are isolated and locked, so we must hold
> > >
> > > Right, because some PTE entries will be unaffected by the change.
> > >
> > > > +    * the lock to prevent neighboring folios from attempting to access
> > > > +    * this PMD until its reinstalled.
> > >
> > > OK. This is slightly annoying for my CoW context work as it means there's
> > > another case where we need to explicitly hold an anon_vma lock for
> > > correctness :)
> > >
> > > Anyway I will think about that separately, is what it is. And in fact
> > > motivates to want this merged earlier so I can work against it :)
> > >
> > >
> > > >      */
> > > > -   anon_vma_unlock_write(vma->anon_vma);
> > > > +   if (is_pmd_order(order)) {
> > > > +           anon_vma_unlock_write(vma->anon_vma);
> > > > +           anon_vma_locked = false;
> > > > +   }
> > > >
> > > >     result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > > -                                      vma, address, pte_ptl,
> > > > -                                      HPAGE_PMD_ORDER,
> > > > -                                      &compound_pagelist);
> > > > -   pte_unmap(pte);
> > > > +                                      vma, start_addr, pte_ptl,
> > > > +                                      order, &compound_pagelist);
> > > >     if (unlikely(result != SCAN_SUCCEED))
> > > >             goto out_up_write;
> > > >
> > > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > >      * write.
> > > >      */
> > > >     __folio_mark_uptodate(folio);
> > > > -   pgtable = pmd_pgtable(_pmd);
> > > > -
> > > >     spin_lock(pmd_ptl);
> > > > -   BUG_ON(!pmd_none(*pmd));
> > > > -   pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > -   map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > > +   WARN_ON_ONCE(!pmd_none(*pmd));
> > > > +   if (is_pmd_order(order)) {
> > > > +           pgtable = pmd_pgtable(_pmd);
> > > > +           pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > > +           map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > > +   } else {
> > > > +           /*
> > > > +            * set_ptes is called in map_anon_folio_pte_nopf with the
> > > > +            * pmd_ptl lock still held; this is safe as the PMD is expected
> > >
> > > PMD entry you mean?
> > >
> > > > +            * to be none. The pmd entry is then repopulated below.
> > > > +            */
> > > > +           map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > >
> > > So here we populate entries in the existing PTE _table_ to point at the new
> > > order>0 folio? With arm64 of course doing transparent contpte stuff?
> > >
> > > > +           smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > > +           pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >
> > > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > > before?
> > >
> > > > +   }
> > > >     spin_unlock(pmd_ptl);
> > > >
> > > >     folio = NULL;
> > > >
> > > >     result = SCAN_SUCCEED;
> > > >  out_up_write:
> > > > +   if (anon_vma_locked)
> > > > +           anon_vma_unlock_write(vma->anon_vma);
> > > > +   if (pte)
> > > > +           pte_unmap(pte);
> > > >     mmap_write_unlock(mm);
> > > >  out_nolock:
> > > >     if (folio)
> > > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > >             /* collapse_huge_page expects the lock to be dropped before calling */
> > > >             mmap_read_unlock(mm);
> > > >             result = collapse_huge_page(mm, start_addr, referenced,
> > > > -                                       unmapped, cc);
> > > > +                                       unmapped, cc, HPAGE_PMD_ORDER);
> > > >             /* collapse_huge_page will return with the mmap_lock released */
> > > >             *lock_dropped = true;
> > > >     }
> > > > --
> > > > 2.54.0
> > > >
> >
> > Thanks, Lorenzo
> >
>
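
For readers following the address handling in the patch quoted above, here
is a minimal user-space sketch of how pmd_addr and end_addr are derived
from start_addr and order. It is an illustration only, not the kernel
code: the TOY_* constants (4 KiB pages, PMD order 9), struct toy_range and
toy_collapse_range() are hypothetical names, and the alignment assertion
is an assumption of this sketch rather than something the patch states.

```c
#include <assert.h>

/* Assumed sizes for illustration only: 4 KiB pages, PMD order 9 (2 MiB). */
#define TOY_PAGE_SHIFT 12UL
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PMD_ORDER  9U
#define TOY_PMD_SIZE   (TOY_PAGE_SIZE << TOY_PMD_ORDER)
#define TOY_PMD_MASK   (~(TOY_PMD_SIZE - 1))

/* Hypothetical container for the three addresses the patch works with. */
struct toy_range {
	unsigned long pmd_addr;   /* start of the PMD-sized region */
	unsigned long start_addr; /* start of the range being collapsed */
	unsigned long end_addr;   /* one past the end of that range */
};

static struct toy_range toy_collapse_range(unsigned long start_addr,
					   unsigned int order)
{
	struct toy_range r;

	/* Sketch assumptions: order fits in a PMD, start is order-aligned. */
	assert(order <= TOY_PMD_ORDER);
	assert((start_addr & ((TOY_PAGE_SIZE << order) - 1)) == 0);

	/* Mirrors: pmd_addr = start_addr & HPAGE_PMD_MASK */
	r.pmd_addr = start_addr & TOY_PMD_MASK;
	r.start_addr = start_addr;
	/* Mirrors: end_addr = start_addr + (PAGE_SIZE << order) */
	r.end_addr = start_addr + (TOY_PAGE_SIZE << order);
	return r;
}
```

Under these assumed sizes, an order-4 collapse starting at 0x210000 covers
[0x210000, 0x220000) inside the PMD region that begins at 0x200000, while
an order-9 collapse starting at 0x200000 covers the whole PMD region.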


* Re: [GIT PULL] rv fixes for v7.1
From: Steven Rostedt @ 2026-06-04 12:54 UTC
  To: Gabriele Monaco
  Cc: Tomas Glozar, linux-kernel, linux-trace-kernel, unknownbbqrx,
	Wen Yang
In-Reply-To: <791528a4f69460ba1e9589361860bafeb1517237.camel@redhat.com>

On Thu, 04 Jun 2026 14:42:02 +0200
Gabriele Monaco <gmonaco@redhat.com> wrote:

> All this to say that, in my opinion unknownbbqrx <dev@unknownbbqr.xyz>
> is NOT an anonymous contribution, just a nickname that differs from the
> legal name of this person (which we wouldn't validate anyway), so I
> would say it complies with the rules.

It's a username on github and not a nickname. I did a search for
"unknownbbqr" and it doesn't come up anywhere but Google tries to find
similar matches which brings me to an OnlyFans account :-p

It *DOES NOT* qualify because there's no accountability for this. For
people who have a nickname as their entire internet persona, sure, I'll
take patches from them as there's an entity that exists behind it. But
I'm not going to take some username on github as a persona. To me,
that's still anonymous.

-- Steve


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 12:45 UTC
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <aiFw80oLty6F_-8m@lucifer>

On Thu, Jun 4, 2026 at 6:40 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> > I will go review the thread about the cache maintenance separately and
> > respond about that.
> >
> > On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates where in the PMD to
> > > start the collapse attempt.
> > >
> > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > the mTHP case this is not true, and we must keep the lock to prevent
> > > access/changes to the page tables. This can happen if the rmap walkers hit
> > > a pmd_none while the PMD entry is currently unavailable due to being
> > > temporarily removed during the collapse phase.
> > >
> > > Acked-by: Usama Arif <usama.arif@linux.dev>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> >
> > The logic LGTM generally, some questions for understanding below, and of
> > course as per above I want to review the Lance/David subthread.
> >
> > Thanks!
> >
> > > ---
> > >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> > >  1 file changed, 55 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index fab35d318641..d64f42f66236 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> > >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > >   */
> > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > -           int referenced, int unmapped, struct collapse_control *cc)
> > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > +           int referenced, int unmapped, struct collapse_control *cc,
> > > +           unsigned int order)
> > >  {
> > > +   const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > > +   const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> > >     LIST_HEAD(compound_pagelist);
> > >     pmd_t *pmd, _pmd;
> > > -   pte_t *pte;
> > > +   pte_t *pte = NULL;
> >
> > As mentioned elsewhere for some reason this was dropped in
> > mm-unstable. Maybe a bad conflict resolution?
> >
> > >     pgtable_t pgtable;
> > >     struct folio *folio;
> > >     spinlock_t *pmd_ptl, *pte_ptl;
> > >     enum scan_result result = SCAN_FAIL;
> > >     struct vm_area_struct *vma;
> > >     struct mmu_notifier_range range;
> > > +   bool anon_vma_locked = false;
> > >
> > > -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > -
> > > -   result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > +   result = alloc_charge_folio(&folio, mm, cc, order);
> > >     if (result != SCAN_SUCCEED)
> > >             goto out_nolock;
> > >
> > >     mmap_read_lock(mm);
> > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                    HPAGE_PMD_ORDER);
> > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > +                                    &vma, cc, order);
> > >     if (result != SCAN_SUCCEED) {
> > >             mmap_read_unlock(mm);
> > >             goto out_nolock;
> > >     }
> > >
> > > -   result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > +   result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> > >     if (result != SCAN_SUCCEED) {
> > >             mmap_read_unlock(mm);
> > >             goto out_nolock;
> > > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >              * released when it fails. So we jump out_nolock directly in
> > >              * that case.  Continuing to collapse causes inconsistency.
> > >              */
> > > -           result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > -                                                referenced, HPAGE_PMD_ORDER);
> > > +           result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > +                                                referenced, order);
> > >             if (result != SCAN_SUCCEED)
> > >                     goto out_nolock;
> > >     }
> > > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >      * mmap_lock.
> > >      */
> > >     mmap_write_lock(mm);
> > > -   result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                    HPAGE_PMD_ORDER);
> > > +   result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > > +                                    &vma, cc, order);
> > >     if (result != SCAN_SUCCEED)
> > >             goto out_up_write;
> > >     /* check if the pmd is still valid */
> > >     vma_start_write(vma);
>
> Hmm actually I think we have another problem here.
>
> For PMD THP this is fine. Only a single VMA can span the range we need, and it
> will span the entire PMD.
>
> But for mTHP we have an issue...
>
> See below...
>
> > > -   result = check_pmd_still_valid(mm, address, pmd);
> > > +   result = check_pmd_still_valid(mm, pmd_addr, pmd);
> > >     if (result != SCAN_SUCCEED)
> > >             goto out_up_write;
> > >
> > >     anon_vma_lock_write(vma->anon_vma);
> > > +   anon_vma_locked = true;
> >
> > I worry that we hold this lock a lot longer now? Maybe the algorithmic
> > change alters that, but Claude did suggest on the s390 bug that longer lock
> > hold might be an issue.
> >
> > I wonder if we'll observe lock contention as a result?
> >
> > Correct me if I'm wrong and we're not holding longer than previously,
> > however. Just appears that we do.
> >
> > >
> > > -   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > -                           address + HPAGE_PMD_SIZE);
> > > +   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > +                           end_addr);
> > >     mmu_notifier_invalidate_range_start(&range);
> > >
> > >     pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >      * Parallel GUP-fast is fine since GUP-fast will back off when
> > >      * it detects PMD is changed.
> > >      */
> > > -   _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > +   _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>
> ...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
> but for mTHP we might have _another_ VMA that spans another part of the range
> mapped by the same PMD entry.
>
> So we clear this, but we do not have a write lock on any other VMA, and so
> racing VMA read locks can install a new PMD entry.
>
> > >     spin_unlock(pmd_ptl);
>
> Especially since you unlock this :)
>
> And...
>
> > >     mmu_notifier_invalidate_range_end(&range);
> > >     tlb_remove_table_sync_one();
> > >
> > > -   pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > +   pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > >     if (pte) {
> > > -           result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > -                                                 HPAGE_PMD_ORDER,
> > > -                                                 &compound_pagelist);
> > > +           result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > +                                                 order, &compound_pagelist);
> > >             spin_unlock(pte_ptl);
> > >     } else {
> > >             result = SCAN_NO_PTE_TABLE;
> > >     }
> > >
> > >     if (unlikely(result != SCAN_SUCCEED)) {
> > > -           if (pte)
> > > -                   pte_unmap(pte);
> >
> > OK I seem to remember this is because we're holding the anon_vma lock
> > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > bit longer also as well as the anon_vma lock.
> >
> > I guess it's also because we need to hold anon_vma and pte lock because
> > we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> > THP did.
> >
> > (Rememberings going on here :)
> >
> > >             spin_lock(pmd_ptl);
> > > -           BUG_ON(!pmd_none(*pmd));
> > > +           WARN_ON_ONCE(!pmd_none(*pmd));
>
> ...this will get triggered.
>
> I don't know whether we can safely hold the PMD lock across everything here for
> mTHP?
>
> Maybe the solution would have to be to scan through VMAs in the range of the PMD
> and VMA write lock each of them?

I believe we've spoken about this before, but because we always make
sure the VMA spans the full PMD we won't ever hit this issue. If we
wanted to support mTHP collapse on regions smaller than a PMD, the
locking gets tricky (hence the design choice to not do that for now).

This is handled by the PMD order check in hugepage_vma_revalidate().

/* Always check the PMD order to ensure its not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
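
As a rough user-space illustration of the idea behind that check (not the
kernel implementation): the sketch below models a VMA as a plain
[vm_start, vm_end) range and tests whether the whole PMD-aligned region
containing an address lies inside it. struct toy_vma, vma_spans_full_pmd()
and the 2 MiB PMD size are assumptions made for the example; the real
thp_vma_suitable_order() may check more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed size for illustration only: 2 MiB PMD region. */
#define TOY_PMD_SHIFT 21UL
#define TOY_PMD_SIZE  (1UL << TOY_PMD_SHIFT)
#define TOY_PMD_MASK  (~(TOY_PMD_SIZE - 1))

/* Hypothetical stand-in for struct vm_area_struct: just an address range. */
struct toy_vma {
	unsigned long vm_start; /* inclusive */
	unsigned long vm_end;   /* exclusive */
};

/*
 * Return true if the PMD-sized, PMD-aligned region containing addr lies
 * entirely within vma. When this holds, no other VMA can map any part of
 * the same PMD region.
 */
static bool vma_spans_full_pmd(const struct toy_vma *vma, unsigned long addr)
{
	unsigned long pmd_addr = addr & TOY_PMD_MASK;

	assert(vma->vm_start < vma->vm_end);

	return pmd_addr >= vma->vm_start &&
	       pmd_addr + TOY_PMD_SIZE <= vma->vm_end;
}
```

With a toy VMA covering [0x200000, 0x600000), address 0x250000 passes.
With a VMA starting at 0x210000 the same address fails, because the first
0x10000 bytes of that PMD region fall outside the VMA and could belong to
another one.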

-- Nico

>
> That could cause some 'interesting' lock contention issues though? Then again,
> we will be releasing the mmap write lock soon enough which will drop the VMA
> write locks.
>
> > >             /*
> > >              * We can only use set_pmd_at when establishing
> > >              * hugepmds and never for establishing regular pmds that
> > > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >              */
> > >             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >             spin_unlock(pmd_ptl);
> > > -           anon_vma_unlock_write(vma->anon_vma);
> > >             goto out_up_write;
> > >     }
> > >
> > >     /*
> > > -    * All pages are isolated and locked so anon_vma rmap
> > > -    * can't run anymore.
> > > +    * For PMD collapse all pages are isolated and locked so anon_vma
> > > +    * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > > +    * removed and not all pages are isolated and locked, so we must hold
> >
> > Right, because some PTE entries will be unaffected by the change.
> >
> > > +    * the lock to prevent neighboring folios from attempting to access
> > > +    * this PMD until its reinstalled.
> >
> > OK. This is slightly annoying for my CoW context work as it means there's
> > another case where we need to explicitly hold an anon_vma lock for
> > correctness :)
> >
> > Anyway I will think about that separately, is what it is. And in fact
> > motivates to want this merged earlier so I can work against it :)
> >
> >
> > >      */
> > > -   anon_vma_unlock_write(vma->anon_vma);
> > > +   if (is_pmd_order(order)) {
> > > +           anon_vma_unlock_write(vma->anon_vma);
> > > +           anon_vma_locked = false;
> > > +   }
> > >
> > >     result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > -                                      vma, address, pte_ptl,
> > > -                                      HPAGE_PMD_ORDER,
> > > -                                      &compound_pagelist);
> > > -   pte_unmap(pte);
> > > +                                      vma, start_addr, pte_ptl,
> > > +                                      order, &compound_pagelist);
> > >     if (unlikely(result != SCAN_SUCCEED))
> > >             goto out_up_write;
> > >
> > > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >      * write.
> > >      */
> > >     __folio_mark_uptodate(folio);
> > > -   pgtable = pmd_pgtable(_pmd);
> > > -
> > >     spin_lock(pmd_ptl);
> > > -   BUG_ON(!pmd_none(*pmd));
> > > -   pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > -   map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > +   WARN_ON_ONCE(!pmd_none(*pmd));
> > > +   if (is_pmd_order(order)) {
> > > +           pgtable = pmd_pgtable(_pmd);
> > > +           pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > +           map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > > +   } else {
> > > +           /*
> > > +            * set_ptes is called in map_anon_folio_pte_nopf with the
> > > +            * pmd_ptl lock still held; this is safe as the PMD is expected
> >
> > PMD entry you mean?
> >
> > > +            * to be none. The pmd entry is then repopulated below.
> > > +            */
> > > +           map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> >
> > So here we populate entries in the existing PTE _table_ to point at the new
> > order>0 folio? With arm64 of course doing transparent contpte stuff?
> >
> > > +           smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > +           pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >
> > And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> > before?
> >
> > > +   }
> > >     spin_unlock(pmd_ptl);
> > >
> > >     folio = NULL;
> > >
> > >     result = SCAN_SUCCEED;
> > >  out_up_write:
> > > +   if (anon_vma_locked)
> > > +           anon_vma_unlock_write(vma->anon_vma);
> > > +   if (pte)
> > > +           pte_unmap(pte);
> > >     mmap_write_unlock(mm);
> > >  out_nolock:
> > >     if (folio)
> > > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >             /* collapse_huge_page expects the lock to be dropped before calling */
> > >             mmap_read_unlock(mm);
> > >             result = collapse_huge_page(mm, start_addr, referenced,
> > > -                                       unmapped, cc);
> > > +                                       unmapped, cc, HPAGE_PMD_ORDER);
> > >             /* collapse_huge_page will return with the mmap_lock released */
> > >             *lock_dropped = true;
> > >     }
> > > --
> > > 2.54.0
> > >
>
> Thanks, Lorenzo
>



* Re: [GIT PULL] rv fixes for v7.1
From: Steven Rostedt @ 2026-06-04 12:44 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Gabriele Monaco, linux-kernel, linux-trace-kernel, unknownbbqrx,
	Wen Yang
In-Reply-To: <CAP4=nvQEF7BtsMSoDaVO+5MazsRGypUnxB8w0Uq=2sCy=g2HGw@mail.gmail.com>

On Thu, 4 Jun 2026 13:32:46 +0200
Tomas Glozar <tglozar@redhat.com> wrote:

> Is this really still the case? Note that the document says:
> 
> "using a known identity (sorry, no anonymous contributions.)"
> 
> It really used to say "real name", but it was changed by Linus in 2023
> [1]. Note especially this section by Linus:
> 
> " It was 2006, and nobody reacted to the wording, the whole Facebook 'real
>    name' controversy was a decade in the future, and nobody even thought
>    about it.  And despite the language, we've always accepted nicknames and
>    that language was never meant to be any kind of exclusionary wording."
> 
> The wording sounds quite clear to me. And I'm certain that there were
> contributions under pseudonymous identity that have been accepted
> since then, most famously by Asahi Lina, a vtuber persona [2]. I don't
> really see a difference between that and "unknownbbqrx", other than
> the latter doesn't sound like a real name.

It specifically says "using a known identity (sorry, no anonymous contributions.)"

As you said, Asahi Lina is well known and a very "known identity".
"unknownbbqrx" is unknown and even states it in the name.

I will not personally accept such a submission, as the Signed-off-by is
a legal statement that states you have the right to submit that code
and take all responsibility for it.

-- Steve


* Re: [GIT PULL] rv fixes for v7.1
From: Gabriele Monaco @ 2026-06-04 12:42 UTC (permalink / raw)
  To: Tomas Glozar, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, unknownbbqrx, Wen Yang
In-Reply-To: <CAP4=nvQEF7BtsMSoDaVO+5MazsRGypUnxB8w0Uq=2sCy=g2HGw@mail.gmail.com>

On Thu, 2026-06-04 at 13:32 +0200, Tomas Glozar wrote:
> Hi Steven,
> 
> On Thu, Jun 4, 2026 at 1:19, Steven Rostedt <rostedt@goodmis.org> wrote:
> > Hi Gabriele,
> > 
> > What is this? All commits need to be authored by and signed off by
> > a real person with their official name.
> > 
> >  https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> > 
> > -- Steve
> > 
> 
> Is this really still the case? Note that the document says:
> 
> "using a known identity (sorry, no anonymous contributions.)"

Thanks Tomas for chipping in!

Just adding some information: the username unknownbbqr is in fact a
valid username for a GitHub account.

The user already sent a patch and updated it with a real name [1].

All this to say that, in my opinion unknownbbqrx <dev@unknownbbqr.xyz>
is NOT an anonymous contribution, just a nickname that differs from the
legal name of this person (which we wouldn't validate anyway), so I
would say it complies with the rules.

Thanks,
Gabriele

[1] -
https://lore.kernel.org/lkml/20260426150928.870914-1-srinivas.pandruvada@linux.intel.com/

> 
> It really used to say "real name", but it was changed by Linus in 2023
> [1]. Note especially this section by Linus:
> 
> " It was 2006, and nobody reacted to the wording, the whole Facebook 'real
>    name' controversy was a decade in the future, and nobody even thought
>    about it.  And despite the language, we've always accepted nicknames and
>    that language was never meant to be any kind of exclusionary wording."
> 
> The wording sounds quite clear to me. And I'm certain that there were
> contributions under pseudonymous identity that have been accepted
> since then, most famously by Asahi Lina, a vtuber persona [2]. I don't
> really see a difference between that and "unknownbbqrx", other than
> the latter doesn't sound like a real name.
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4563201f33a022fc0353033d9dfeb1606a88330
> [2]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e2f6b0ef8551bf3bd8255729d27e3ad9451e562
> 
> Tomas



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 12:39 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <aiFTSLb0kkTR7I9A@lucifer>

On Thu, Jun 04, 2026 at 12:38:30PM +0100, Lorenzo Stoakes wrote:
> I will go review the thread about the cache maintenance separately and
> respond about that.
>
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates where in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > access/changes to the page tables. This can happen if the rmap walkers hit
> > a pmd_none while the PMD entry is currently unavailable due to being
> > temporarily removed during the collapse phase.
> >
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> The logic LGTM generally, some questions for understanding below, and of
> course as per above I want to review the Lance/David subthread.
>
> Thanks!
>
> > ---
> >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> >  1 file changed, 55 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fab35d318641..d64f42f66236 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> >   */
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -		int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > +		int referenced, int unmapped, struct collapse_control *cc,
> > +		unsigned int order)
> >  {
> > +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> >  	LIST_HEAD(compound_pagelist);
> >  	pmd_t *pmd, _pmd;
> > -	pte_t *pte;
> > +	pte_t *pte = NULL;
>
> As mentioned elsewhere for some reason this was dropped in
> mm-unstable. Maybe a bad conflict resolution?
>
> >  	pgtable_t pgtable;
> >  	struct folio *folio;
> >  	spinlock_t *pmd_ptl, *pte_ptl;
> >  	enum scan_result result = SCAN_FAIL;
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> > +	bool anon_vma_locked = false;
> >
> > -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -
> > -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +	result = alloc_charge_folio(&folio, mm, cc, order);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_nolock;
> >
> >  	mmap_read_lock(mm);
> > -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -					 HPAGE_PMD_ORDER);
> > +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +					 &vma, cc, order);
> >  	if (result != SCAN_SUCCEED) {
> >  		mmap_read_unlock(mm);
> >  		goto out_nolock;
> >  	}
> >
> > -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> >  	if (result != SCAN_SUCCEED) {
> >  		mmap_read_unlock(mm);
> >  		goto out_nolock;
> > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  		 * released when it fails. So we jump out_nolock directly in
> >  		 * that case.  Continuing to collapse causes inconsistency.
> >  		 */
> > -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -						     referenced, HPAGE_PMD_ORDER);
> > +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > +						     referenced, order);
> >  		if (result != SCAN_SUCCEED)
> >  			goto out_nolock;
> >  	}
> > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  	 * mmap_lock.
> >  	 */
> >  	mmap_write_lock(mm);
> > -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -					 HPAGE_PMD_ORDER);
> > +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +					 &vma, cc, order);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_up_write;
> >  	/* check if the pmd is still valid */
> >  	vma_start_write(vma);

Hmm actually I think we have another problem here.

For PMD THP this is fine. Only a single VMA can span the range we need, and it
will span the entire PMD.

But for mTHP we have an issue...

See below...

> > -	result = check_pmd_still_valid(mm, address, pmd);
> > +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out_up_write;
> >
> >  	anon_vma_lock_write(vma->anon_vma);
> > +	anon_vma_locked = true;
>
> I worry that we hold this lock a lot longer now? Maybe the algorithmic
> change alters that, but Claude did suggest on the s390 bug that longer lock
> hold might be an issue.
>
> I wonder if we'll observe lock contention as a result?
>
> Correct me if I'm wrong and we're not holding longer than previously,
> however. Just appears that we do.
>
> >
> > -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -				address + HPAGE_PMD_SIZE);
> > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > +				end_addr);
> >  	mmu_notifier_invalidate_range_start(&range);
> >
> >  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  	 * Parallel GUP-fast is fine since GUP-fast will back off when
> >  	 * it detects PMD is changed.
> >  	 */
> > -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> > +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);

...So we exclude VMA locked faults faulting in a new PMD entry for PMD-sized THP
but for mTHP we might have _another_ VMA that spans another part of the range
mapped by the same PMD entry.

So we clear this, but we do not have a write lock on any other VMA, and so
racing VMA read locks can install a new PMD entry.

> >  	spin_unlock(pmd_ptl);

Especially since you unlock this :)

And...

> >  	mmu_notifier_invalidate_range_end(&range);
> >  	tlb_remove_table_sync_one();
> >
> > -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> >  	if (pte) {
> > -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -						      HPAGE_PMD_ORDER,
> > -						      &compound_pagelist);
> > +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > +						      order, &compound_pagelist);
> >  		spin_unlock(pte_ptl);
> >  	} else {
> >  		result = SCAN_NO_PTE_TABLE;
> >  	}
> >
> >  	if (unlikely(result != SCAN_SUCCEED)) {
> > -		if (pte)
> > -			pte_unmap(pte);
>
> > OK I seem to remember this is because we're holding the anon_vma lock
> > longer. That does imply that on e.g. x86-64 the RCU lock is being held a
> > bit longer as well as the anon_vma lock.
>
> I guess it's also because we need to hold anon_vma and pte lock because
> we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
> THP did.
>
> (Rememberings going on here :)
>
> >  		spin_lock(pmd_ptl);
> > -		BUG_ON(!pmd_none(*pmd));
> > +		WARN_ON_ONCE(!pmd_none(*pmd));

...this will get triggered.

I don't know whether we can safely hold the PMD lock across everything here for
mTHP?

Maybe the solution would have to be to scan through VMAs in the range of the PMD
and VMA write lock each of them?

That could cause some 'interesting' lock contention issues though? Then again,
we will be releasing the mmap write lock soon enough which will drop the VMA
write locks.
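
To make the range mismatch concrete, here is a small userspace sketch of the
address arithmetic from the patch (pmd_addr and end_addr). It is not kernel
code: the toy_* names are made up, and it assumes 4 KiB base pages and 2 MiB
PMDs. It only shows that a sub-PMD order covers part of the range mapped by
the PMD entry, which is what leaves room for another VMA under the same
entry:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry for illustration: 4 KiB pages, 2 MiB PMDs. */
#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)
#define TOY_PMD_SHIFT	21
#define TOY_PMD_SIZE	(1UL << TOY_PMD_SHIFT)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))
#define TOY_PMD_ORDER	(TOY_PMD_SHIFT - TOY_PAGE_SHIFT)

struct toy_range {
	unsigned long pmd_addr;		/* start of the PMD-mapped range */
	unsigned long start_addr;	/* start of the collapse range */
	unsigned long end_addr;		/* end of the collapse range (exclusive) */
};

/* Same arithmetic as the pmd_addr/end_addr locals in the quoted patch. */
static struct toy_range toy_collapse_range(unsigned long start_addr,
					   unsigned int order)
{
	struct toy_range r;

	assert(order <= TOY_PMD_ORDER);
	r.pmd_addr = start_addr & TOY_PMD_MASK;
	r.start_addr = start_addr;
	r.end_addr = start_addr + (TOY_PAGE_SIZE << order);
	return r;
}

/* True if some of the PMD-mapped range lies outside the collapse range. */
static bool toy_pmd_has_uncovered_part(struct toy_range r)
{
	return r.start_addr > r.pmd_addr ||
	       r.end_addr < r.pmd_addr + TOY_PMD_SIZE;
}
```

With these assumed constants, an order-4 collapse spans 64 KiB, so the rest of
the 2 MiB range is untouched by the collapse and could belong to a VMA we do
not hold a write lock on. Only the PMD-order case covers the whole range.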

> >  		/*
> >  		 * We can only use set_pmd_at when establishing
> >  		 * hugepmds and never for establishing regular pmds that
> > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  		 */
> >  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >  		spin_unlock(pmd_ptl);
> > -		anon_vma_unlock_write(vma->anon_vma);
> >  		goto out_up_write;
> >  	}
> >
> >  	/*
> > -	 * All pages are isolated and locked so anon_vma rmap
> > -	 * can't run anymore.
> > +	 * For PMD collapse all pages are isolated and locked so anon_vma
> > +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > +	 * removed and not all pages are isolated and locked, so we must hold
>
> Right because some PTE entries will be unaffected by the change.
>
> > +	 * the lock to prevent neighboring folios from attempting to access
> > +	 * this PMD until its reinstalled.
>
> OK. This is slightly annoying for my CoW context work as it means there's
> another case where we need to explicitly hold an anon_vma lock for
> correctness :)
>
> Anyway I will think about that separately, is what it is. And in fact
> motivates to want this merged earlier so I can work against it :)
>
>
> >  	 */
> > -	anon_vma_unlock_write(vma->anon_vma);
> > +	if (is_pmd_order(order)) {
> > +		anon_vma_unlock_write(vma->anon_vma);
> > +		anon_vma_locked = false;
> > +	}
> >
> >  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -					   vma, address, pte_ptl,
> > -					   HPAGE_PMD_ORDER,
> > -					   &compound_pagelist);
> > -	pte_unmap(pte);
> > +					   vma, start_addr, pte_ptl,
> > +					   order, &compound_pagelist);
> >  	if (unlikely(result != SCAN_SUCCEED))
> >  		goto out_up_write;
> >
> > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  	 * write.
> >  	 */
> >  	__folio_mark_uptodate(folio);
> > -	pgtable = pmd_pgtable(_pmd);
> > -
> >  	spin_lock(pmd_ptl);
> > -	BUG_ON(!pmd_none(*pmd));
> > -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > +	WARN_ON_ONCE(!pmd_none(*pmd));
> > +	if (is_pmd_order(order)) {
> > +		pgtable = pmd_pgtable(_pmd);
> > +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > +	} else {
> > +		/*
> > +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> > +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>
> PMD entry you mean?
>
> > +		 * to be none. The pmd entry is then repopulated below.
> > +		 */
> > +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
>
> So here we populate entries in the existing PTE _table_ to point at the new
> order>0 folio? With arm64 of course doing transparent contpte stuff?
>
> > +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> And then we reinstall the pre-existing PMD _entry_ from none -> what it was
> before?
>
> > +	}
> >  	spin_unlock(pmd_ptl);
> >
> >  	folio = NULL;
> >
> >  	result = SCAN_SUCCEED;
> >  out_up_write:
> > +	if (anon_vma_locked)
> > +		anon_vma_unlock_write(vma->anon_vma);
> > +	if (pte)
> > +		pte_unmap(pte);
> >  	mmap_write_unlock(mm);
> >  out_nolock:
> >  	if (folio)
> > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >  		/* collapse_huge_page expects the lock to be dropped before calling */
> >  		mmap_read_unlock(mm);
> >  		result = collapse_huge_page(mm, start_addr, referenced,
> > -					    unmapped, cc);
> > +					    unmapped, cc, HPAGE_PMD_ORDER);
> >  		/* collapse_huge_page will return with the mmap_lock released */
> >  		*lock_dropped = true;
> >  	}
> > --
> > 2.54.0
> >

Thanks, Lorenzo


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 12:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lance Yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, usama.arif
In-Reply-To: <f5d38f64-ab92-496d-afd3-29ccc17fec2b@kernel.org>

On Mon, Jun 01, 2026 at 08:54:24AM +0200, David Hildenbrand (Arm) wrote:
> On 6/1/26 05:28, Lance Yang wrote:
> >
> > On Sun, May 31, 2026 at 10:00:17PM +0200, David Hildenbrand (Arm) wrote:
> >> On 5/31/26 11:39, Lance Yang wrote:
> >>>
> >>>
> >>> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> >>>
> >>> At this point pmdp_collapse_flush() has cleared the PMD from the page
> >>> tables. The PTE table we are updating is only reachable through the saved
> >>> old PMD value, _pmd, until pmd_populate() below.
> >>>
> >>> map_anon_folio_pte_nopf() does set_ptes() and then calls
> >>> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> >>> that hook as:
> >>>
> >>> "
> >>> 	At the end of every page fault, this routine is invoked to tell
> >>> 	the architecture specific code that translations now exists
> >>> 	in the software page tables for address space "vma->vm_mm"
> >>> 	at virtual address "address" for "nr" consecutive pages.
> >>> "
> >>>
> >>> But that does not seem true here yet, since the PTE table is not
> >>> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> >>>
> >>> Should we avoid calling update_mmu_cache_range() until after the PTE
> >>> table is reinstalled with pmd_populate()?
> >>
> >> I recall that update_mmu_cache* users mostly care about updating folios flags,
> >> for the folio derived from the PTE ... or flushing caches for the user address.
> >>
> >> So intuitively I would say "the architecture code doesn't care that the PMD
> >> table will only be visible to HW shortly after". The important thing should be
> >> that it will definitely happen, and that nothing else is currently there or can be
> >> there?
> >
> > Ah, fair point.
> >
> > I was mostly worried about arch hooks that walk vma->vm_mm again, rather
> > than only using the pte pointer passed in. For example, mips does:
>
> Right, a re-walk would be the real problem.
>
> >
> >   update_mmu_cache_range()
> >     -> __update_tlb()
> >       -> pgd_offset(vma->vm_mm, address)
> >       -> pte_offset_map(...)
> >
> > and __update_tlb() has this assumption:
> >
> > 		/*
> > 		 * update_mmu_cache() is called between pte_offset_map_lock()
> > 		 * and pte_unmap_unlock(), so we can assume that ptep is not
> > 		 * NULL here: and what should be done below if it were NULL?
> > 		 */
> >
> > So if khugepaged happens to run with current->active_mm == vma->vm_mm
> > here, could __update_tlb() hit the none PMD, get NULL from

I really wish people would say Pxx _entry_ :) so confusing.

> > pte_offset_map(), and then dereference it?
>
> Likely yes -- that MIPS code is horrible. And the comment in MIPS code
> even spells that out. :(
>
> Do you know about other code like that, or is MIPS the only one doing a
> re-walk and crossing fingers?
>
> >
> > Just wanted to raise it since some arch code may still have assumptions
> > like this, and the always-enable-mTHP work is getting closer ...
>
> Right. I assume set_pte_at() couldn't trigger something similar (re-walk) in arch code,
> because we simply provide the ptep. update_mmu_cache_range() only consumes the pte.
>
> >
> > Probably very very very hard to hit, though :)
>
> Delaying update_mmu_cache_range() is nasty, as we'd have to make sure that
> nobody can interfere in the meantime ... and the PMD lock will not be sufficient.
>
> Maybe we could reinstall the page table with the cleared (none) entries while
> still holding the PTL?

You mean the cleared PTE entries that are to be updated with the collapsed
larger folio?

>
> Thinking out loud:

After staring at this long enough, this does seem like a viable solution yes.

I hate how subtle this is.

>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ba298d420b7..e39b750b1e6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1413,13 +1413,17 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>                 map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>         } else {
>                 /*
> -                * set_ptes is called in map_anon_folio_pte_nopf with the
> -                * pmd_ptl lock still held; this is safe as the PMD is expected
> -                * to be none. The pmd entry is then repopulated below.
> +                * Re-insert the page table with the cleared entries, but
> +                * hold the PTL, such that no one can mess with the re-installed
> +                * page table until we updated the temporarily-cleared entries
> +                * through map_anon_folio_pte_nopf().
>                  */

You may say nit, but, I think we should be clearly stating the problem here. Yes
we want to hold the PTL to stop anybody else messing with it yet, but we're
really doing this because of:

map_anon_folio_pte_nopf
-> update_mmu_cache_range
-> rewalk
-> try to look up an entry that's not yet actually installed
-> bang

Right?

So maybe something like:

	Re-insert the PMD entry pointing to the PTE page table with cleared
	entries first, because map_anon_folio_pte_nopf() invokes
	update_mmu_cache_range() which may cause a rewalk of the page tables and
	blow up if the supplied PTE entry belongs to a PTE table that is not yet
	present there.

	We hold the PTE PTL to avoid anything else messing with this until we're
	ready.
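
A toy userspace model of that ordering problem, in case it helps. This is not
kernel code and not how any architecture implements update_mmu_cache_range();
every toy_* name is hypothetical. It models a two-level table, a hook that
re-walks from the top instead of using the PTE pointer it was given, and the
two orderings discussed (locking is elided entirely):

```c
#include <stddef.h>

#define TOY_PTRS 4

struct toy_pte_table { unsigned long pte[TOY_PTRS]; };
struct toy_mm { struct toy_pte_table *pmd[TOY_PTRS]; };

/* Walk from the top; NULL when the PMD entry is none. */
static unsigned long *toy_walk(struct toy_mm *mm, unsigned int pmd_idx,
			       unsigned int pte_idx)
{
	struct toy_pte_table *table = mm->pmd[pmd_idx];

	return table ? &table->pte[pte_idx] : NULL;
}

/* Stand-in for a re-walking arch hook: 1 if the re-walk found the entry. */
static int toy_update_mmu_cache(struct toy_mm *mm, unsigned int pmd_idx,
				unsigned int pte_idx)
{
	return toy_walk(mm, pmd_idx, pte_idx) != NULL;
}

/* Ordering in the patch: set the PTE first, reinstall the PMD entry after. */
static int toy_collapse_pte_first(struct toy_mm *mm, unsigned int pmd_idx)
{
	struct toy_pte_table *saved = mm->pmd[pmd_idx];
	int rewalk_ok;

	mm->pmd[pmd_idx] = NULL;	/* PMD entry cleared */
	saved->pte[0] = 0x1000;		/* PTE set via the saved table */
	rewalk_ok = toy_update_mmu_cache(mm, pmd_idx, 0);
	mm->pmd[pmd_idx] = saved;	/* PMD entry reinstalled too late */
	return rewalk_ok;
}

/* Suggested ordering: reinstall the PMD entry first, then set the PTE. */
static int toy_collapse_pmd_first(struct toy_mm *mm, unsigned int pmd_idx)
{
	struct toy_pte_table *saved = mm->pmd[pmd_idx];
	int rewalk_ok;

	mm->pmd[pmd_idx] = NULL;	/* PMD entry cleared */
	mm->pmd[pmd_idx] = saved;	/* reinstalled before touching PTEs */
	saved->pte[0] = 0x1000;
	rewalk_ok = toy_update_mmu_cache(mm, pmd_idx, 0);
	return rewalk_ok;
}
```

In the model the first ordering makes the re-walk come back empty, while the
second lets it find the entry. What the model cannot show is the part that
actually needs care in the real code: who else can reach the table once the
PMD entry is back, which is what holding the PTL is for.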


> -               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> -               smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */

(I guess better to comment on the smp_wmb() stuff in the other message about
this.)

> +               if (pte_ptl != pmd_ptl)
> +                       spin_lock(pte_ptl);

(Obviously should be spin_lock_nested() as David says later)

It seems a bit weird to me that we acquire the PTE lock:

	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);

Clear out the mTHP entries we're going to remove:

	if (pte) {
		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
						      order, &compound_pagelist);

Then unlock the PTE:

		spin_unlock(pte_ptl);

Before again reacquiring here, especially given this is an unreachable PTE
table.

But then again not doing that would require us to add some error handling logic
to unlock again so it's probably not vital.

>                 pmd_populate(mm, pmd, pmd_pgtable(_pmd));

So we're protecting against concurrent rmap and fault handlers with the PTL such
that installing this is safe right?

Are we good against GUP fast? I guess a race will be fine with that, or will it?
I suppose before it would have skipped the range entirely because of the missing
PMD entry anyway.

(in any case we also hold anon_vma write lock too.)

> +               map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +               if (pte_ptl != pmd_ptl)
> +                       spin_unlock(pte_ptl);
>         }
>         spin_unlock(pmd_ptl);
>
>
>
> --
> Cheers,
>
> David

Thanks, Lorenzo


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-04 12:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aiFSZfRlFPd7qlIw@parvat>

On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> 
> My concern is that __GFP_PRIVATE is too wide, I wonder if we'll have a
> need to support N_MEMORY_PRIVATE may not be all homogeneous memory nodes.
> Very similar to how not all ZONE_DEVICE memory is homogenous.
>

Can you be more precise about your definition of homogeneous here?

Are you saying not all memory on a private node will be homogeneous?
   While possible, I would argue that you should not do this and
   should instead prefer to use multiple nodes - 1 per memory class.

Are you saying not all private nodes will be homogenous?
   I don't see the issue with this.

> > 
> > Agreed, but also one which can be deferred and played with since it's
> > all kernel-internal.  None of this should have UAPI implications, and we
> > need to accept that we're going to get it wrong on the first try.
> > 
> 
> Agreed that we might get the design wrong, until we fix it up. I feel
> that __GFP_PRIVATE should be an evolution of the design to that point.
>

Possibly.  If we can't guarantee isolation without __GFP_PRIVATE, then
we probably can't merge the baseline without it.

> > Because pagecache pages are associated with potentially many VMAs.
> > 
> > The fault can be a soft fault or a hard fault.  On soft fault - the page
> > was already present, and will simply fault into VMA without being
> > migrated.
> > 
> 
> Let's split this into two:
> 
> 1. unmapped page cache is never impacted by mempolicy and should not
>    end up on private memory nodes
> 2. For shared pages, mempolicy would be hard, but it would need to
>    be on a set of nodes backed by private memory, depending on mbind()
>    policy
>
... snip ...
> 
> I'd need to think more about this. For now, my basic requirement would
> be that unmapped page cache should not come from/to private nodes.
> 

This does not fully describe the problem.

A file can be opened and cached as unmapped page cache, and then mapped
at a later time - at which point the mapped copy would share the filemap
page cache page.

Worse, because it's file-backed, you can have the memory faulted onto
your remote node - reclaimed - and then faulted back in via the process
accessing the file via unmapped operations (read/write), at which point
you've had a silent migration occur.

Basically consider

Process A:
   fd = open("myfile", ..., RO);
   read(fd, ...);  /* mm/filemap.c fills page cache */

Process B:
   fd = open("myfile", ...);
   mem = mmap(fd, ...);
   mbind(mem, ..., private_node);
   for page in mem:
       int tmp = mem[page]; /* fault into vma */

The result of Process A running first is Process B thinks it has faulted
the memory onto private_node, but in reality it's taking soft faults and
just getting the filemap folio mapped in.
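
A rough userspace sketch of the first half of that sequence, for anyone who
wants to poke at it. It leaves out mbind() and NUMA placement entirely and
only checks, via mincore(), whether pages of a file that was accessed through
read()/write() are already reported resident when the file is then mmap()ed,
before the mapping has been touched. The function name and the temp file path
are made up for illustration:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns how many pages mincore() reports resident right after mmap(),
 * before any access through the mapping, or -1 on error.
 */
static int toy_resident_after_read(size_t npages)
{
	char path[] = "/tmp/pagecache-demo-XXXXXX";
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)psz;
	unsigned char *vec = NULL;
	char *buf = NULL;
	void *mem = MAP_FAILED;
	int resident = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* file goes away once fd is closed */

	buf = malloc(len);
	vec = malloc(npages);
	if (!buf || !vec)
		goto out;
	memset(buf, 0xab, len);

	/* "Process A": unmapped access, which goes through the page cache. */
	if (write(fd, buf, len) != (ssize_t)len)
		goto out;
	if (pread(fd, buf, len, 0) != (ssize_t)len)
		goto out;

	/* "Process B": map the same file, but do not touch it yet. */
	mem = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		goto out;
	if (mincore(mem, len, vec) != 0)
		goto out;

	resident = 0;
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;	/* low bit set means resident */
out:
	if (mem != MAP_FAILED)
		munmap(mem, len);
	free(vec);
	free(buf);
	close(fd);
	return resident;
}
```

If the pages come back as resident, the faults that follow in the mapping
would be soft faults onto the existing page cache folios, wherever those
happen to live. The exact count depends on the filesystem and on reclaim, so
treat the result as an observation rather than a guarantee.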

If you wanted mbind() support from the start, we would have to limit
applicability to anon memory only.

Shared anon memory is different, as there is a radix tree that deals
with a shared mempolicy state.

> 
> I am open to this, I was coming from the blueprint approach of:
> - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
>   what features to change or make specific to the implementation
>

N_MEMORY essentially states:
	"This is normal memory touch it however you like"

N_MEMORY_PRIVATE (_MANAGED, w/e) says
	"This is NOT normal memory, there are special rules here"

So, no, let's not mimic N_MEMORY.  This is a "closed by default" design,
while N_MEMORY is an "open by default" design.  This design choice is
explicit to make reasoning about these nodes feasible.

> > This is informed by a single use case / device.
> > 
> > There are users / devices that don't want any UAPI for their memory,
> > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > reclaim, etc).
> > 
> 
> But then, why do they need NUMA nodes? Do we have a list of use cases?
>

So far I have collected:

- Network accelerators carrying their own memory for message buffers
- GPUs with semi-general-purpose working memory across coherent links
- Exceptionally slow distributed memory that you do not want fallback
  allocations to (so you want to deliberately tier what lands there)
- Compressed memory (just another form of accelerator really) which
  has *special access rules* (i.e. writes need to be controlled)

In most if not all of these cases, the right abstraction to reason about
where memory *should come from* IS a NUMA node.

- the network stack can be taught to check if the target device has a
  node with memory and prefer that node over local memory

- accelerators can be given private nodes to manage memory using
  core mm/ components, without worrying that general kernel operation
  will put unrelated memory on those nodes or do things like migrate
  your pages out from under you (unless your driver/service requested
  that).

The tiering application should be somewhat obvious / trivial.

> > 
> > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > operations access private nodes removed from fallback lists are reached
> > via something like the possible / online nodemask.
> > 
> > I remember, maybe a year ago, there were per-node allocations happening
> > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > I'm trying to re-collect that data now.
> > 
> 
> Thanks, I look forward to the next set of patches. Let me know if I
> can help test what's on the list or if you want me to wait for the next
> round
>

Really I want to get the minimized set out the door so we can start
breaking this up by feature (reclaim, mempolicy, etc), because trying to
reason about it as a whole is infeasible - and I cannot be the single
arbiter of every use case (I simply do not have sufficient context).

I'm reworking it all as we speak.

~Gregory

^ permalink raw reply

* [PATCH] tools/rv: Ensure monitor name and desc are NUL-terminated
From: Gabriele Monaco @ 2026-06-04 12:09 UTC (permalink / raw)
  To: Steven Rostedt, Gabriele Monaco, linux-trace-kernel, linux-kernel
  Cc: unknownbbqrx

ikm_fill_monitor_definition() copies monitor name and description with
strncpy(), but does not guarantee NUL termination when source strings are
equal to or longer than the destination buffers.

Clamp copies to sizeof(dst) - 1 and explicitly append '\0' for both fields
to keep them safe for later string operations.

Suggested-by: unknownbbqrx <dev@unknownbbqr.xyz>
Fixes: 6d60f89691fc9 ("tools/rv: Add in-kernel monitor interface")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

---

Patch was initially sent as [1], the original author's email address
doesn't seem to exist any longer and the author didn't provide a valid
name.

Reimplementing the fix and changing attribution.

[1] - https://lore.kernel.org/r/dc9ea036-de62-4e1f-be63-8e14d675bcca@smtp-relay.sendinblue.com

---
 tools/verification/rv/src/in_kernel.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/verification/rv/src/in_kernel.c b/tools/verification/rv/src/in_kernel.c
index 4bb746ea6..d32453824 100644
--- a/tools/verification/rv/src/in_kernel.c
+++ b/tools/verification/rv/src/in_kernel.c
@@ -215,10 +215,11 @@ static int ikm_fill_monitor_definition(char *name, struct monitor *ikm, char *co
 		return -1;
 	}
 
-	strncpy(ikm->name, nested_name, MAX_DA_NAME_LEN);
+	strncpy(ikm->name, nested_name, sizeof(ikm->name) - 1);
+	ikm->name[sizeof(ikm->name) - 1] = '\0';
 	ikm->enabled = enabled;
-	strncpy(ikm->desc, desc, MAX_DESCRIPTION);
-
+	strncpy(ikm->desc, desc, sizeof(ikm->desc) - 1);
+	ikm->desc[sizeof(ikm->desc) - 1] = '\0';
 	free(desc);
 
 	return 0;

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.54.0
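
The truncation behaviour this patch relies on can be tried in isolation. The
following is a minimal standalone sketch, not the rv tool's code: struct
demo_monitor, DEMO_NAME_LEN and demo_set_name() are hypothetical names, and
the buffer is deliberately tiny so the truncation is easy to see.

```c
#include <string.h>

#define DEMO_NAME_LEN 8	/* hypothetical size, not the real MAX_DA_NAME_LEN */

struct demo_monitor {
	char name[DEMO_NAME_LEN];
};

/*
 * Copy at most sizeof(name) - 1 bytes and always terminate. strncpy() on its
 * own leaves the buffer unterminated when src is at least as long as the
 * destination.
 */
void demo_set_name(struct demo_monitor *m, const char *src)
{
	strncpy(m->name, src, sizeof(m->name) - 1);
	m->name[sizeof(m->name) - 1] = '\0';
}
```

Using sizeof() on the array member rather than a separate length macro keeps
the bound tied to the actual buffer, so the copy stays safe if the array size
ever changes.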


^ permalink raw reply related

* [PATCH 3/3] tracing/user_events: Replace a seq_printf() call by seq_puts() in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:12 UTC (permalink / raw)
  To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt
  Cc: LKML, kernel-janitors
In-Reply-To: <596c2f16-a12c-4e24-8a8c-1243dce354ec@web.de>

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 13:45:48 +0200

A single string should be put into a sequence within a loop.
Thus use the corresponding function “seq_puts” for one selected call.

The source code was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 kernel/trace/trace_events_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 1a0569110bfd..57a78de4dc98 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2781,7 +2781,7 @@ static int user_seq_show(struct seq_file *m, void *p)
 	hash_for_each(group->register_table, i, user, node) {
 		status = user->status;
 
-		seq_printf(m, "%s", EVENT_TP_NAME(user));
+		seq_puts(m, EVENT_TP_NAME(user));
 
 		if (status != 0) {
 			seq_puts(m, " # Used by");
-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/3] tracing/user_events: Use seq_putc() in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:10 UTC (permalink / raw)
  To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt
  Cc: LKML, kernel-janitors
In-Reply-To: <596c2f16-a12c-4e24-8a8c-1243dce354ec@web.de>

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 13:40:38 +0200

A line break should be put into a sequence within a loop.
Thus use the corresponding function “seq_putc” for one selected call.

The source code was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 kernel/trace/trace_events_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index a089ac30e407..1a0569110bfd 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2794,7 +2794,7 @@ static int user_seq_show(struct seq_file *m, void *p)
 			busy++;
 		}
 
-		seq_puts(m, "\n");
+		seq_putc(m, '\n');
 		active++;
 	}
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 1/3] tracing/user_events: Simplify data output in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:07 UTC (permalink / raw)
  To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt
  Cc: LKML, kernel-janitors
In-Reply-To: <596c2f16-a12c-4e24-8a8c-1243dce354ec@web.de>

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 13:33:43 +0200

Move the specification for a line break from a seq_puts() call
to a seq_printf() call.

The source code was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 kernel/trace/trace_events_user.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index c4ba484f7b38..a089ac30e407 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2800,8 +2800,7 @@ static int user_seq_show(struct seq_file *m, void *p)
 
 	mutex_unlock(&group->reg_mutex);
 
-	seq_puts(m, "\n");
-	seq_printf(m, "Active: %d\n", active);
+	seq_printf(m, "\nActive: %d\n", active);
 	seq_printf(m, "Busy: %d\n", busy);
 
 	return 0;
-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/3] tracing/user_events: More efficient data output in user_seq_show()
From: Markus Elfring @ 2026-06-04 12:05 UTC (permalink / raw)
  To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 4 Jun 2026 14:00:56 +0200

A few update suggestions were taken into account
from static source code analysis.

Markus Elfring (3):
  Simplify data output
  Use seq_putc()
  Replace a seq_printf() call by seq_puts()

 kernel/trace/trace_events_user.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

-- 
2.54.0


^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 11:38 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <20260522150009.121603-7-npache@redhat.com>

I will go review the thread about the cache maintenance separately and
respond about that.

On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates were in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>

The logic LGTM generally, some questions for understanding below, and of
course as per above I want to review the Lance/David subthread.

Thanks!

> ---
>  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>   * while allocating a THP, as that could trigger direct reclaim/compaction.
>   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>   */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -		int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned int order)
>  {
> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL;

As mentioned elsewhere for some reason this was dropped in
mm-unstable. Maybe a bad conflict resolution?

>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	enum scan_result result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
>
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
>  	}
>
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> +						     referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;

I worry that we hold this lock a lot longer now? Maybe the algorithmic
change alters that, but Claude did suggest on the s390 bug that longer lock
hold might be an issue.

I wonder if we'll observe lock contention as a result?

Correct me if I'm wrong and we're not holding longer than previously,
however. Just appears that we do.

>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);

OK I seem to remember this is because we're holding the anon_vma lock
longer. That does imply that on e.g. x86-64 the RCU lock is being held a
bit longer, as well as the anon_vma lock.

I guess it's also because we need to hold anon_vma and pte lock because
we're fiddling around at PTE level for mTHP not just PMD level as 'classic'
THP did.

(Rememberings going on here :)

>  		spin_lock(pmd_ptl);
> -		BUG_ON(!pmd_none(*pmd));
> +		WARN_ON_ONCE(!pmd_none(*pmd));
>  		/*
>  		 * We can only use set_pmd_at when establishing
>  		 * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> +	 * removed and not all pages are isolated and locked, so we must hold

Right, because some PTE entries will be unaffected by the change.

> +	 * the lock to prevent neighboring folios from attempting to access
> +	 * this PMD until its reinstalled.

OK. This is slightly annoying for my CoW context work as it means there's
another case where we need to explicitly hold an anon_vma lock for
correctness :)

Anyway I will think about that separately, is what it is. And in fact
motivates to want this merged earlier so I can work against it :)


>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (is_pmd_order(order)) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, start_addr, pte_ptl,
> +					   order, &compound_pagelist);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
>  	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +	WARN_ON_ONCE(!pmd_none(*pmd));
> +	if (is_pmd_order(order)) {
> +		pgtable = pmd_pgtable(_pmd);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> +	} else {
> +		/*
> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> +		 * pmd_ptl lock still held; this is safe as the PMD is expected

PMD entry you mean?

> +		 * to be none. The pmd entry is then repopulated below.
> +		 */
> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);

So here we populate entries in the existing PTE _table_ to point at the new
order>0 folio? With arm64 of course doing transparent contpte stuff?

> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));

And then we reinstall the pre-existing PMD _entry_ from none -> what it was
before?

> +	}
>  	spin_unlock(pmd_ptl);
>
>  	folio = NULL;
>
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
>  	if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, HPAGE_PMD_ORDER);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
>  	}
> --
> 2.54.0
>
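
For reference, the address arithmetic the patch introduces at the top of
collapse_huge_page() can be checked in userspace. This is a sketch under
stated assumptions: 4 KiB base pages and a 2 MiB PMD (order 9), as on x86-64
with default settings. The DEMO_* macros and demo_collapse_range() are
hypothetical and only mirror the pmd_addr/end_addr expressions from the diff.

```c
/* Assumed geometry: 4 KiB pages, PMD covers 512 pages (2 MiB). */
#define DEMO_PAGE_SIZE	(1UL << 12)
#define DEMO_PMD_ORDER	9
#define DEMO_PMD_SIZE	(DEMO_PAGE_SIZE << DEMO_PMD_ORDER)
#define DEMO_PMD_MASK	(~(DEMO_PMD_SIZE - 1))

struct demo_range {
	unsigned long pmd_addr;		/* start of the PMD containing start_addr */
	unsigned long start_addr;	/* start of the region being collapsed */
	unsigned long end_addr;		/* one past the end of that region */
};

/* Mirrors: pmd_addr = start_addr & HPAGE_PMD_MASK;
 *          end_addr = start_addr + (PAGE_SIZE << order); */
struct demo_range demo_collapse_range(unsigned long start_addr,
				      unsigned int order)
{
	struct demo_range r = {
		.pmd_addr = start_addr & DEMO_PMD_MASK,
		.start_addr = start_addr,
		.end_addr = start_addr + (DEMO_PAGE_SIZE << order),
	};
	return r;
}
```

For a PMD-order collapse, start_addr equals pmd_addr and the range spans the
whole PMD; for a smaller order the range is a sub-window of it, which is why
the PMD-level helpers in the diff take pmd_addr while the PTE-level ones take
start_addr.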

^ permalink raw reply

* Re: [GIT PULL] rv fixes for v7.1
From: Tomas Glozar @ 2026-06-04 11:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gabriele Monaco, linux-kernel, linux-trace-kernel, unknownbbqrx,
	Wen Yang
In-Reply-To: <20260603191627.5cb4ef6c@fedora>

Hi Steven,

On Thu, Jun 4, 2026 at 1:19, Steven Rostedt <rostedt@goodmis.org> wrote:
> Hi Gabriele,
>
> What is this? All commits need to be authored by and signed off by from
> a real person with their official name.
>
>  https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
>
> -- Steve
>

Is this really still the case? Note that the document says:

"using a known identity (sorry, no anonymous contributions.)"

It really used to say "real name", but it was changed by Linus in 2023
[1]. Note especially this section by Linus:

" It was 2006, and nobody reacted to the wording, the whole Facebook 'real
   name' controversy was a decade in the future, and nobody even thought
   about it.  And despite the language, we've always accepted nicknames and
   that language was never meant to be any kind of exclusionary wording."

The wording sounds quite clear to me. And I'm certain that there were
contributions under pseudonymous identity that have been accepted
since then, most famously by Asahi Lina, a vtuber persona [2]. I don't
really see a difference between that and "unknownbbqrx", other than
the latter doesn't sound like a real name.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4563201f33a022fc0353033d9dfeb1606a88330
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e2f6b0ef8551bf3bd8255729d27e3ad9451e562

Tomas

^ permalink raw reply

* Re: [PATCH 1/1] tools/rv: ensure monitor name and desc are NUL-terminated
From: Gabriele Monaco @ 2026-06-04 11:08 UTC (permalink / raw)
  To: unknownbbqrx; +Cc: rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <dc9ea036-de62-4e1f-be63-8e14d675bcca@smtp-relay.sendinblue.com>

On Thu, 2026-04-23 at 17:19 +0300, unknownbbqrx wrote:
> 
> ikm_fill_monitor_definition() copies monitor name and description
> with strncpy(), but does not guarantee NUL termination when source
> strings are equal to or longer than the destination buffers.
> 
> Clamp copies to sizeof(dst) - 1 and explicitly append '\0' for both
> fields to keep them safe for later string operations.
> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>

Contributions need to be attributed to real people using official
names. I'm going to re-send this patch with me as author and a
Suggested-by: unknownbbqrx <dev@unknownbbqr.xyz>, unless you answer
with an appropriate attribution (i.e. your real name) [1].

Thanks,
Gabriele

[1] -
https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin

> ---
>  tools/verification/rv/src/in_kernel.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/verification/rv/src/in_kernel.c
> b/tools/verification/rv/src/in_kernel.c
> index 4bb746ea6..d32453824 100644
> --- a/tools/verification/rv/src/in_kernel.c
> +++ b/tools/verification/rv/src/in_kernel.c
> @@ -215,10 +215,11 @@ static int ikm_fill_monitor_definition(char
> *name, struct monitor *ikm, char *co
>  		return -1;
>  	}
>  
> -	strncpy(ikm->name, nested_name, MAX_DA_NAME_LEN);
> +	strncpy(ikm->name, nested_name, sizeof(ikm->name) - 1);
> +	ikm->name[sizeof(ikm->name) - 1] = '\0';
>  	ikm->enabled = enabled;
> -	strncpy(ikm->desc, desc, MAX_DESCRIPTION);
> -
> +	strncpy(ikm->desc, desc, sizeof(ikm->desc) - 1);
> +	ikm->desc[sizeof(ikm->desc) - 1] = '\0';
>  	free(desc);
>  
>  	return 0;


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-04 10:35 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aiE5DZC8Io4SNI3H@gourry-fedora-PF4VCD3F>

On Thu, Jun 04, 2026 at 09:36:29AM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 11:43:14AM +1000, Balbir Singh wrote:
> > On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> > > 
> > > Here is how the page allocator fallback lists and nodemasks interact:
> > > 
> > >    Fallbacks A:  A B 
> > >    Fallbacks B:  B A
> > >    Fallbacks C:  C A B   (Private)
> > >    Fallbacks D:  D B A   (Private)
> > > 
> > 
> > Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
> > The assumption is that we have ATS translation enabled? Assuming A and
> > B are N_MEMORY here or am I misreading your illustration?
> >
> 
> If we don't have __GFP_PRIVATE, then probably not.  This is a holdover
> from the current __GFP_PRIVATE branch so that if the preferred_nid=
> value is a private node (which is a hint, but not a hard control),
> there's a way for that allocation to land *somewhere*.
> 
> __GFP_PRIVATE would say "Only allow access to private nodes if this
> flag is provided - otherwise treat that as unreachable and fall back".
> 
> (__GFP_PRIVATE | __GFP_THISNODE) then does exactly what you expect (only
> allocate from specifically this private node and don't fall back).
> 
> This has the added benefit of not causing OOM on allocation failure.
> 
> Some would consider such a request a bug (i.e. that caller has a bad
> mask), but I find the premise of that statement to be flawed if only
> because we do not have good controls over what ends up in a nodemask due
> to the existence of things like possible_nodes.
>

My concern is that __GFP_PRIVATE is too wide. I wonder if we'll need to
support the case where N_MEMORY_PRIVATE nodes are not all homogeneous
memory nodes, very similar to how not all ZONE_DEVICE memory is homogeneous.

> 
> > > If we wanted to change this behavior, realistically we'd be looking for
> > > a way to add specific nodes to certain fallback lists - rather than
> > > modify the nodemask interaction in some way.
> > 
> > Yes, that is what we did with CDM, control the fallback for
> > N_MEMORY_PRIVATE, but there is a design decision to be made here.
> >
> 
> Agreed, but also one which can be deferred and played with since it's
> all kernel-internal.  None of this should have UAPI implications, and we
> need to accept that we're going to get it wrong on the first try.
> 

Agreed that we might get the design wrong, until we fix it up. I feel
that __GFP_PRIVATE should be an evolution of the design to that point.

> > > 2) full mempolicy support doesn't really make sense
> > > 
> > >    task mempolicy PROBABLY should never really touch private nodes,
> > >    while VMA policy certainly can.  Assuming we're able to support
> > >    multi-private-node masks, none of the non-bind mempolicies even
> > >    make sense for most private nodes (interleave? weighted interleave?)
> > > 
> > 
> > Yes, mostly, but is that baked into the design? If so, why?
> >
> 
> "Baked in" in this case would mean:
> 
>   set_mempolicy(..., private_node) -> -EINVAL
>   mbind(..., private_node)         -> Success
> 
> With appropriate documentation.
> 
> This can be changed later if a reasonable design was agreed upon.
> 
> > > 4) File VMA interactions don't entirely make sense with mbind
> > > 
> > >    In theory you might want:
> > > 
> > >    fd = open("somefile", ...);
> > >    mem = mmap(fd, ...);
> > >    mbind(mem, ..., private_node);
> > >    for page in mem:
> > >       mem[page_off] /* fault file into private memory */
> > > 
> > >    In reality: This does not work the way you want.
> > 
> > Why not? Just curious about what you found?
> > 
> 
> Because pagecache pages are associated with potentially many VMAs.
> 
> The fault can be a soft fault or a hard fault.  On soft fault - the page
> was already present, and will simply fault into VMA without being
> migrated.
> 

Let's split this into two:

1. unmapped page cache is never impacted by mempolicy and should not
   end up on private memory nodes
2. For shared pages, mempolicy would be hard, but it would need to
   be on a set of nodes backed by private memory, depending on mbind()
   policy

> You can imagine the following
> 
> Process A:
>     fd = open("somefile", ...);
>     mem = mmap(fd, ...);
>     mbind(mem, ..., private_node_A);
>     for page in mem:
>        mem[page_off] /* fault file into private memory */
> 
> Process B:
>     fd = open("somefile", ...);
>     mem = mmap(fd, ...);
>     mbind(mem, ..., private_node_B);
>     for page in mem:
>        mem[page_off] /* fault file into private memory */
> 
> If process A runs first, and assuming VMA mempolicy is respected for
> file backed allocation (note: it's not, see below) - then the second
> process will think the memory now lives on node B when it's already
> living on node A (pages are not migrated on fault).
> 
> filemap page cache means file-backed pages are global resources.
> 
> Re file-backed VMAs - see filemap_alloc_folio_noprof in mm/filemap.c
> 
> struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> {
>         int n;
>         struct folio *folio;
> 
>         if (cpuset_do_page_mem_spread()) {
>                 unsigned int cpuset_mems_cookie;
>                 do {
>                         cpuset_mems_cookie = read_mems_allowed_begin();
>                         n = cpuset_mem_spread_node();
>                         folio = __folio_alloc_node_noprof(gfp, order, n);
>                 } while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));
> 
>                 return folio;
>         }
>         return folio_alloc_noprof(gfp, order);
> }
> 
> We'd have to hang a mempolicy off of the file and use fctl or something
> like this if we want a file to have a node preference.

I'd need to think more about this. For now, my basic requirement would
be that unmapped page cache should not come from/to private nodes.

> 
> > > 
> > >    I went digging and we need a few mild extensions to allow
> > >    migration on mbind to work for pagecache pages, and the fault
> > >    path does not necessarily respect the vma mempolicy always.
> > > 
> > >    You also start getting into the question of "what happens when
> > >    the node is out of memory and you don't have reclaim support?".
> > 
> > Yes, we should discuss reclaim support, I think we should allow for
> > reclaim. It allows you to overcommit private memory the way we can
> > with regular memory.
> > 
> 
> Reclaim support is feasible, but again - crawl, walk, run.
> 
> If we get the base private node infrastructure in place, we can break
> things like mempolicy and reclaim support into different work streams
> to enable support for these features.
> 
> Different private node users will be interested in different
> combinations of mm/ service support.
> 
> For example:  compressed memory as a swap backend DOES NOT want explicit
> reclaim support - it will need to manage its own shrinker.  This comes
> from requirements associated with that specific use case (which I do not
> want to get into here).
> 
> That is why this series introduced the concept of NP_OPS_* - so that the
> owner (driver) of a private node (such as a CXL-enabled accelerator
> driver) can tell mm/ what services it should enable for that node.

I am open to this, I was coming from the blueprint approach of:
- Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
  what features to change or make specific to the implementation

> 
> > > 
> > > For all these reasons, I think the be mbind/mempolicy support with
> > > private nodes needs to be brought in with follow up work - not
> > > introduced as part of the baseline set.
> > > 
> > 
> > I am not opposed to the follow up work, but I feel mbind() should
> > be the fundamental work and user space API.
> >
> 
> This is informed by a single use case / device.
> 
> There are users / devices that don't want any UAPI for their memory,
> but simply wish to re-utilize some subsection of mm/ (page_alloc,
> reclaim, etc).
> 

But then, why do they need NUMA nodes? Do we have a list of use cases?

> > > 
> > > I am arguing for #1 - the community has argued for #2 and "fixing
> > > existing nodemask users".  I think we can ship #2 and pivot to #1 if we
> > > find fixing existing users is infeasible or too much of a maintenance
> > > burden.
> > 
> > Again happy to discuss this, I'd like to make sure we agree on the
> > design. I am wondering if there is any experimental data to choose
> > between 1 and 2.
> > 
> 
> I am trying to test whether, lacking __GFP_PRIVATE, private nodes
> removed from fallback lists are still reached by any normal runtime
> operations via something like the possible / online nodemask.
> 
> I remember, maybe a year ago, there were per-node allocations happening
> during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> I'm trying to re-collect that data now.
> 

Thanks, I look forward to the next set of patches. Let me know if I
can help test what's on the list or if you want me to wait for the next
round

Balbir


^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-06-04 10:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <aiFRVb4zhrRw-yw5@lucifer>

On Thu, Jun 4, 2026 at 4:22 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates where in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > access/changes to the page tables. This can happen if the rmap walkers hit
> > a pmd_none while the PMD entry is currently unavailable due to being
> > temporarily removed during the collapse phase.
> >
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> >  1 file changed, 55 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fab35d318641..d64f42f66236 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >   * while allocating a THP, as that could trigger direct reclaim/compaction.
> >   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> >   */
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -             int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > +             int referenced, int unmapped, struct collapse_control *cc,
> > +             unsigned int order)
> >  {
> > +     const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> > +     const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte = NULL;
>
> Hmm, this part of the patch wasn't taken, and now we have uninitialised state
> being dereferenced (see [0])

Good catch, I was just looking at your report and wondering what
happened there. Hopefully, with the v19 we apply this correctly :)

-- Nico

>
> [0]:https://lore.kernel.org/all/aiFO1RlpZ7Ki44y1@lucifer/
>
> Did a review comment here somehow cause this to be changed in the patch?
>
> Andrew - was there an error in applying the patch somehow?
>
> Thanks, Lorenzo
>
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       enum scan_result result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     bool anon_vma_locked = false;
> >
> > -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +                                      &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> >       }
> >
> > -     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +     result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                                                  referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > +                                                  referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> > @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> > +                                      &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> >       vma_start_write(vma);
> > -     result = check_pmd_still_valid(mm, address, pmd);
> > +     result = check_pmd_still_valid(mm, pmd_addr, pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >
> >       anon_vma_lock_write(vma->anon_vma);
> > +     anon_vma_locked = true;
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > +                             end_addr);
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * Parallel GUP-fast is fine since GUP-fast will back off when
> >        * it detects PMD is changed.
> >        */
> > -     _pmd = pmdp_collapse_flush(vma, address, pmd);
> > +     _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> >       spin_unlock(pmd_ptl);
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                   HPAGE_PMD_ORDER,
> > -                                                   &compound_pagelist);
> > +             result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > +                                                   order, &compound_pagelist);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_NO_PTE_TABLE;
> >       }
> >
> >       if (unlikely(result != SCAN_SUCCEED)) {
> > -             if (pte)
> > -                     pte_unmap(pte);
> >               spin_lock(pmd_ptl);
> > -             BUG_ON(!pmd_none(*pmd));
> > +             WARN_ON_ONCE(!pmd_none(*pmd));
> >               /*
> >                * We can only use set_pmd_at when establishing
> >                * hugepmds and never for establishing regular pmds that
> > @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >                */
> >               pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >               spin_unlock(pmd_ptl);
> > -             anon_vma_unlock_write(vma->anon_vma);
> >               goto out_up_write;
> >       }
> >
> >       /*
> > -      * All pages are isolated and locked so anon_vma rmap
> > -      * can't run anymore.
> > +      * For PMD collapse all pages are isolated and locked so anon_vma
> > +      * rmap can't run anymore. For mTHP collapse the PMD entry has been
> > +      * removed and not all pages are isolated and locked, so we must hold
> > +      * the lock to prevent neighboring folios from attempting to access
> > +      * this PMD until its reinstalled.
> >        */
> > -     anon_vma_unlock_write(vma->anon_vma);
> > +     if (is_pmd_order(order)) {
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +             anon_vma_locked = false;
> > +     }
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        HPAGE_PMD_ORDER,
> > -                                        &compound_pagelist);
> > -     pte_unmap(pte);
> > +                                        vma, start_addr, pte_ptl,
> > +                                        order, &compound_pagelist);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> >
> > @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> >       spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > +     WARN_ON_ONCE(!pmd_none(*pmd));
> > +     if (is_pmd_order(order)) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> > +     } else {
> > +             /*
> > +              * set_ptes is called in map_anon_folio_pte_nopf with the
> > +              * pmd_ptl lock still held; this is safe as the PMD is expected
> > +              * to be none. The pmd entry is then repopulated below.
> > +              */
> > +             map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +     }
> >       spin_unlock(pmd_ptl);
> >
> >       folio = NULL;
> >
> >       result = SCAN_SUCCEED;
> >  out_up_write:
> > +     if (anon_vma_locked)
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +     if (pte)
> > +             pte_unmap(pte);
> >       mmap_write_unlock(mm);
> >  out_nolock:
> >       if (folio)
> > @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >               result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc);
> > +                                         unmapped, cc, HPAGE_PMD_ORDER);
> >               /* collapse_huge_page will return with the mmap_lock released */
> >               *lock_dropped = true;
> >       }
> > --
> > 2.54.0
> >
>


^ permalink raw reply

* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-06-04 10:21 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <20260522150009.121603-7-npache@redhat.com>

On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>   * while allocating a THP, as that could trigger direct reclaim/compaction.
>   * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>   */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -		int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned int order)
>  {
> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL;

Hmm, this part of the patch wasn't taken, and now we have uninitialised state
being dereferenced (see [0])

[0]:https://lore.kernel.org/all/aiFO1RlpZ7Ki44y1@lucifer/

Did a review comment here somehow cause this to be changed in the patch?

Andrew - was there an error in applying the patch somehow?

Thanks, Lorenzo

>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	enum scan_result result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
>
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
>  	}
>
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> +						     referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> +					 &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;
>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> +				end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_NO_PTE_TABLE;
>  	}
>
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);
>  		spin_lock(pmd_ptl);
> -		BUG_ON(!pmd_none(*pmd));
> +		WARN_ON_ONCE(!pmd_none(*pmd));
>  		/*
>  		 * We can only use set_pmd_at when establishing
>  		 * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
> +	 * removed and not all pages are isolated and locked, so we must hold
> +	 * the lock to prevent neighboring folios from attempting to access
> +	 * this PMD until its reinstalled.
>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (is_pmd_order(order)) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, start_addr, pte_ptl,
> +					   order, &compound_pagelist);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
>  	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +	WARN_ON_ONCE(!pmd_none(*pmd));
> +	if (is_pmd_order(order)) {
> +		pgtable = pmd_pgtable(_pmd);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> +	} else {
> +		/*
> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
> +		 * to be none. The pmd entry is then repopulated below.
> +		 */
> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +	}
>  	spin_unlock(pmd_ptl);
>
>  	folio = NULL;
>
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
>  	if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, HPAGE_PMD_ORDER);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
>  	}
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-04 10:10 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260522150009.121603-1-npache@redhat.com>

Not sure if already addressed for v19 but I just tried building a kernel in
mm-unstable and saw this (clang 22.1.6):

mm/khugepaged.c:1357:6: warning: variable 'pte' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
 1357 |         if (result != SCAN_SUCCEED)
      |             ^~~~~~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1458:6: note: uninitialized use occurs here
 1458 |         if (pte)
      |             ^~~
mm/khugepaged.c:1357:2: note: remove the 'if' if its condition is always false
 1357 |         if (result != SCAN_SUCCEED)
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
 1358 |                 goto out_up_write;
      |                 ~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1352:6: warning: variable 'pte' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
 1352 |         if (result != SCAN_SUCCEED)
      |             ^~~~~~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1458:6: note: uninitialized use occurs here
 1458 |         if (pte)
      |             ^~~
mm/khugepaged.c:1352:2: note: remove the 'if' if its condition is always false
 1352 |         if (result != SCAN_SUCCEED)
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
 1353 |                 goto out_up_write;
      |                 ~~~~~~~~~~~~~~~~~
mm/khugepaged.c:1296:12: note: initialize the variable 'pte' to silence this warning
 1296 |         pte_t *pte;
      |                   ^
      |                    = NULL
2 warnings generated.


Is this already addressed in v19/review here? If not could you please address it.

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Peter Zijlstra @ 2026-06-04  9:34 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Tengda Wu, Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel
In-Reply-To: <20260601084001.9566b443746447ec2bb1a9fb@kernel.org>

On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:

> Peter, is it OK to drop @rq from task_on_cpu()? 

Sure.

> Then we can use it from rethook.

Well, it is in sched/sched.h, which is an internal header, and no you
cannot use that header in rethook.

But let's step back first, what is the actual problem here, why are we
looking at ->on_cpu at all?

^ permalink raw reply

* Re: [PATCH v3] tracing: fix CFI violation in probestub test
From: Steven Rostedt @ 2026-06-04  9:10 UTC (permalink / raw)
  To: Eva Kurchatova
  Cc: mhiramat, linux-trace-kernel, linux-kernel, mathieu.desnoyers,
	peterz, jpoimboe, samitolvanen
In-Reply-To: <20260603153147.573589-1-eva.kurchatova@virtuozzo.com>

On Wed,  3 Jun 2026 18:31:42 +0300
Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:

> When multiple callbacks are registered on the same tracepoint,
> callbacks will be indirectly called via traceiter helper.
> 
> Pointers to __probestub_* callbacks reside in __tracepoints section,
> which is excluded from ENDBR checks in objtool, causing objtool to
> assume those functions are never indirectly called.
> 
> Registering multiple callbacks using sched_wakeup test will result
> in #CP exception due to missing ENDBR in __probestub_sched_wakeup
> on a CFI-enabled machine.
> 
> Fix this by adding CFI_NOSEAL annotation to probestub declaration.

I took this but rewrote the change log as it is still incorrect. It has
nothing to do with multiple callbacks attached to the probe. It has to
do with how tprobes works.

I updated with this:

From: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
Subject: [PATCH] tracing: Fix CFI violation in probestub being called by
 tprobes

The probestub is a function to allow tprobes to hook to a tracepoint to
gain access to its parameters. The function itself is only referenced by
the tracepoint structure which lives in the __tracepoint section. objtool
explicitly ignores that section and when processing functions in the
kernel, if it detects one that has no references it will seal it to have
its ENDBR stripped on boot up.

This means when a tprobe is attached to the sched_wakeup tracepoint, when it
is triggered it will call __probestub_sched_wakeup and due to the missing
ENDBR on a CFI-enabled machine it will take a #CP exception.

Fix this by adding CFI_NOSEAL annotation to probestub declaration.

Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260603153147.573589-1-eva.kurchatova@virtuozzo.com
Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
[ Updated change log ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/tracepoint.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 763eea4d80d8..2d2b9f8cdda4 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -20,6 +20,7 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/static_call.h>
+#include <linux/cfi.h>
 
 struct module;
 struct tracepoint;
@@ -389,6 +390,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	void __probestub_##_name(void *__data, proto)			\
 	{								\
 	}								\
+	/*								\
+	 * Annotate the probestub 'CFI_NOSEAL' to stop objtool from	\
+	 * requesting the kernel remove the ENDBR, because the only	\
+	 * references to the function are in the __tracepoint section,	\
+	 * that objtool doesn't scan.					\
+	 */								\
+	CFI_NOSEAL(__probestub_##_name);				\
 	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);	\
 	DEFINE_RUST_DO_TRACE(_name, TP_PROTO(proto), TP_ARGS(args))
 
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v3] tracing: fix CFI violation in probestub test
From: Steven Rostedt @ 2026-06-04  8:52 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Eva Kurchatova, linux-trace-kernel, linux-kernel,
	mathieu.desnoyers, peterz, jpoimboe, samitolvanen
In-Reply-To: <20260604084716.99e2be26a21c71d6e4ed8270@kernel.org>

On Thu, 4 Jun 2026 08:47:16 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks.

> 
> Peter, will you pick this fix because it fixes objtool change?

No, it's a tracepoint change. objtool only affected it.

I'll take it as I just finished testing.

-- Steve

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-04  8:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aiDVMgu0viTIml8H@parvat>

On Thu, Jun 04, 2026 at 11:43:14AM +1000, Balbir Singh wrote:
> On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> > 
> > Here is how the page allocator fallback lists and nodemasks interact:
> > 
> >    Fallbacks A:  A B 
> >    Fallbacks B:  B A
> >    Fallbacks C:  C A B   (Private)
> >    Fallbacks D:  D B A   (Private)
> > 
> 
> Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
> The assumption is that we have ATS translation enabled? Assuming A and
> B are N_MEMORY here or am I misreading your illustration?
>

If we don't have __GFP_PRIVATE, then probably not.  This is a holdover
from the current __GFP_PRIVATE branch so that if the preferred_nid=
value is a private node (which is a hint, but not a hard control),
there's a way for that allocation to land *somewhere*.

__GFP_PRIVATE would say "Only allow access to private nodes if this
flag is provided - otherwise treat that as unreachable and fall back".

(__GFP_PRIVATE | __GFP_THISNODE) then does exactly what you expect (only
allocate from specifically this private node and don't fall back).

This has the added benefit of not causing OOM on allocation failure.

Some would consider such a request a bug (i.e. that caller has a bad
mask), but I find the premise of that statement to be flawed if only
because we do not have good controls over what ends up in a nodemask due
to the existence of things like possible_nodes.

> > If we wanted to change this behavior, realistically we'd be looking for
> > a way to add specific nodes to certain fallback lists - rather than
> > modify the nodemask interaction in some way.
> 
> Yes, that is what we did with CDM, control the fallback for
> N_MEMORY_PRIVATE, but there is a design decision to be made here.
>

Agreed, but also one which can be deferred and played with since it's
all kernel-internal.  None of this should have UAPI implications, and we
need to accept that we're going to get it wrong on the first try.

> > 2) full mempolicy support doesn't really make sense
> > 
> >    task mempolicy PROBABLY should never really touch private nodes,
> >    while VMA policy certainly can.  Assuming we're able to support
> >    multi-private-node masks, none of the non-bind mempolicies even
> >    make sense for most private nodes (interleave? weighted interleave?)
> > 
> 
> Yes, mostly, but is that baked into the design? If so, why?
>

"Baked in" in this case would mean:

  set_mempolicy(..., private_node) -> -EINVAL
  mbind(..., private_node)         -> Success

With appropriate documentation.

This can be changed later if a reasonable design is agreed upon.
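
As a rough sketch of what that check could look like, assuming the
only input is whether the target node is private and whether the
request is a task policy or a VMA policy (the enum and function names
here are hypothetical, not existing mempolicy code):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two policy scopes discussed above. */
enum model_policy_scope {
	MODEL_TASK_POLICY, /* set_mempolicy()-style, applies to the task */
	MODEL_VMA_POLICY,  /* mbind()-style, applies to one mapping */
};

/*
 * Returns 0 if the node may be used in a policy of this scope, or
 * -EINVAL if a task-wide policy tries to name a private node.
 */
static int model_check_policy_node(enum model_policy_scope scope,
				   bool node_is_private)
{
	if (node_is_private && scope == MODEL_TASK_POLICY)
		return -EINVAL;
	return 0;
}
```

Keeping the rule in one predicate like this is what would make it
cheap to relax later.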

> > 4) File VMA interactions don't entirely make sense with mbind
> > 
> >    In theory you might want:
> > 
> >    fd = open("somefile", ...);
> >    mem = mmap(fd, ...);
> >    mbind(mem, ..., private_node);
> >    for page in mem:
> >       mem[page_off] /* fault file into private memory */
> > 
> >    In reality: This does not work the way you want.
> 
> Why not? Just curious about what you found?
> 

Because pagecache pages are associated with potentially many VMAs.

The fault can be a soft fault or a hard fault.  On a soft fault, the page
was already present and will simply be mapped into the VMA without being
migrated.

You can imagine the following:

Process A:
    fd = open("somefile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node_A);
    for page in mem:
       mem[page_off] /* fault file into private memory */

Process B:
    fd = open("somefile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node_B);
    for page in mem:
       mem[page_off] /* fault file into private memory */

If process A runs first, and assuming VMA mempolicy is respected for
file backed allocation (note: it's not, see below) - then the second
process will think the memory now lives on node B when it's already
living on node A (pages are not migrated on fault).

filemap page cache means file-backed pages are global resources.
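
The two-process scenario can be modelled in a few lines of userspace C.
This is only an illustration: the struct and function names are
hypothetical, and it assumes the VMA policy is honored on the hard
fault, which (as noted) is not what the kernel does today.

```c
#include <stddef.h>

#define MODEL_NO_NODE    (-1)
#define MODEL_FILE_PAGES 4

/*
 * One slot per file page, recording which node holds the cached copy.
 * There is one such structure per file, shared by every mapper.
 */
struct model_pagecache {
	int page_node[MODEL_FILE_PAGES];
};

static void model_pagecache_init(struct model_pagecache *pc)
{
	for (size_t i = 0; i < MODEL_FILE_PAGES; i++)
		pc->page_node[i] = MODEL_NO_NODE;
}

/*
 * Fault page idx through a mapping whose policy asks for policy_node.
 * Returns the node the page actually lives on after the fault.
 */
static int model_fault(struct model_pagecache *pc, size_t idx,
		       int policy_node)
{
	/* Hard fault: not cached yet, so allocate where the policy says. */
	if (pc->page_node[idx] == MODEL_NO_NODE)
		pc->page_node[idx] = policy_node;

	/* Soft fault: already cached, mapped in place, never migrated. */
	return pc->page_node[idx];
}
```

If process A faults page 0 with node 2 as its policy and process B then
faults the same page asking for node 3, B still gets the copy on node 2.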

Re file-backed VMAs - see filemap_alloc_folio_noprof in mm/filemap.c

struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
        int n;
        struct folio *folio;

        if (cpuset_do_page_mem_spread()) {
                unsigned int cpuset_mems_cookie;
                do {
                        cpuset_mems_cookie = read_mems_allowed_begin();
                        n = cpuset_mem_spread_node();
                        folio = __folio_alloc_node_noprof(gfp, order, n);
                } while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));

                return folio;
        }
        return folio_alloc_noprof(gfp, order);
}

We'd have to hang a mempolicy off of the file and use fcntl() or
something like it if we want a file to have a node preference.

> > 
> >    I went digging and we need a few mild extensions to allow
> >    migration on mbind to work for pagecache pages, and the fault
> >    path does not necessarily respect the vma mempolicy always.
> > 
> >    You also start getting into the question of "what happens when
> >    the node is out of memory and you don't have reclaim support?".
> 
> Yes, we should discuss reclaim support, I think we should allow for
> reclaim. It allows you to overcommit private memory the way we can
> with regular memory.
> 

Reclaim support is feasible, but again - crawl, walk, run.

If we get the base private node infrastructure in place, we can break
things like mempolicy and reclaim support into different work streams
to enable support for these features.

Different private node users will be interested in different
combinations of mm/ service support.

For example:  compressed memory as a swap backend DOES NOT want explicit
reclaim support - it will need to manage its own shrinker.  This comes
from requirements associated with that specific use case (which I do not
want to get into here).

That is why this series introduced the concept of NP_OPS_* - so that the
owner (driver) of a private node (such as a CXL-enabled accelerator
driver) can tell mm/ what services it should enable for that node.
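
Purely as an illustration of the opt-in idea, a per-node service mask
could be modelled like this. The flag names, the struct and the helper
below are hypothetical and are not the definitions from the series;
only the NP_OPS_ prefix comes from the text above.

```c
#include <stdbool.h>

/* Hypothetical service bits; the real NP_OPS_* set may differ. */
#define MODEL_NP_OPS_MEMPOLICY (1u << 0)
#define MODEL_NP_OPS_RECLAIM   (1u << 1)
#define MODEL_NP_OPS_MIGRATION (1u << 2)

/* Hypothetical per-node record filled in by the owning driver. */
struct model_private_node {
	int nid;
	unsigned int ops; /* services mm/ is allowed to run on this node */
};

/* True only if the owner opted in to every service bit in op. */
static bool model_node_wants(const struct model_private_node *node,
			     unsigned int op)
{
	return (node->ops & op) == op;
}
```

A compressed-memory swap backend would then leave the reclaim bit
clear and run its own shrinker, while an accelerator driver might set
the mempolicy and migration bits.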

> > 
> > For all these reasons, I think the mbind/mempolicy support with
> > private nodes needs to be brought in with follow up work - not
> > introduced as part of the baseline set.
> > 
> 
> I am not opposed to the follow up work, but I feel mbind() should
> be the fundamental work and user space API.
>

This is informed by a single use case / device.

There are users / devices that don't want any UAPI for their memory,
but simply wish to re-utilize some subsection of mm/ (page_alloc,
reclaim, etc).

> > 
> > I am arguing for #1 - the community has argued for #2 and "fixing
> > existing nodemask users".  I think we can ship #2 and pivot to #1 if we
> > find fixing existing users is infeasible or too much of a maintenance
> > burden.
> 
> Again happy to discuss this, I'd like to make sure we agree on the
> design. I am wondering if there is any experimental data to choose
> between 1 and 2.
> 

I am trying to test whether, lacking __GFP_PRIVATE, private nodes that
have been removed from fallback lists are still reached by any normal
runtime operations via something like the possible / online nodemask.

I remember, maybe a year ago, there were per-node allocations happening
during hotplug and that's why I originally proposed __GFP_PRIVATE, but
I'm trying to re-collect that data now.

~Gregory

^ permalink raw reply
