Linux-mm Archive on lore.kernel.org
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Karim Manaouil @ 2026-05-31 23:29 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore,
	djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	ritesh.list, stable, vbabka, willy
In-Reply-To: <20260527162412.19922-1-dipiets@amazon.it>

On Wed, May 27, 2026 at 04:24:10PM +0000, Salvatore Dipietro wrote:
> 
> Thanks Ritesh and Matthew for the continued feedback and guidance on this thread.
> I'd like to summarize where we stand and ask for your input on the best path forward.
> 
> Summary of approaches tested:
> We've now benchmarked all proposed variations (pgbench simple-update, 1024 clients, 
> 96-vCPU arm64, huge_pages=off, PREEMPT_NONE applied [1]):
> 
> | Patch                          | Change Location       | Avg TPS    | % vs Baseline |
> |--------------------------------|-----------------------|-----------:|:-------------:|
> | Baseline (no patch)            | —                     | 101,979.75 |       —       |
> | v1 (original, iomap caller)    | fs/iomap/buffered-io.c| 141,194.20 |    +38.45%    |
> | Ritesh's suggestion            | mm/filemap.c          | 139,200.61 |    +36.50%    |
> | Matthew's suggestion           | mm/filemap.c          | 143,863.82 |    +41.07%    |
> | kcompactd background           | mm/page_alloc.c       | 134,278.47 |    +31.67%    |
> 
> 
> All approaches recover significant throughput. The kcompactd approach (background 
> compaction and returning nopage for costly orders with __GFP_NORETRY) aligns with the
> architectural direction Dave and Christoph proposed, keeping compaction out of the direct 
> reclaim path, and lives entirely in the page allocator. 
> 
> Based on the discussion, I see two possible directions and would appreciate your guidance:
> 
> 1. Page allocator fix (mm/page_alloc.c): The kcompactd background approach addresses 
> Matthew's concern that filemap.c shouldn't know about PAGE_ALLOC_COSTLY_ORDER, and aligns 
> with Dave's vision of removing compaction from the direct reclaim path.
> 
> 2. filemap fix (mm/filemap.c): Both Ritesh's and Matthew's suggestions are minimal, 
> backportable, and preserve lightweight reclaim for non-costly orders. 
> Ritesh's variant differentiates between costly and non-costly orders, while Matthew's 
> is simpler and performs best.

I am not very familiar with THPs in the page cache, but for anonymous
memory, we have /sys/kernel/mm/transparent_hugepage/defrag which
decides what to do in the event of a THP allocation failure, whether to
enter a synchronous compaction or wake up kcompactd.

Check vma_thp_gfp_mask(). Maybe you should adopt something similar called
file_thp_gfp_mask().
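
As a rough illustration of the policy the defrag knob expresses, here is a userspace sketch. It does not use the kernel's gfp flags; the enum names and the function `thp_fail_action()` are hypothetical, and the `madvised` argument is an assumption about how a file-backed variant might distinguish opted-in mappings. The mapping follows the documented meaning of the `always`, `defer`, `defer+madvise`, `madvise` and `never` settings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the defrag settings (not kernel types). */
enum defrag_mode {
    DEFRAG_ALWAYS,
    DEFRAG_DEFER,
    DEFRAG_DEFER_MADVISE,
    DEFRAG_MADVISE,
    DEFRAG_NEVER,
};

/* What happens when a huge page is not immediately available. */
enum thp_fail_action {
    ACT_FALLBACK,        /* give up at once and use small pages */
    ACT_WAKE_KCOMPACTD,  /* fall back now, compact in the background */
    ACT_DIRECT_COMPACT,  /* stall the caller in direct reclaim/compaction */
};

enum thp_fail_action thp_fail_action(enum defrag_mode mode, bool madvised)
{
    switch (mode) {
    case DEFRAG_ALWAYS:
        return ACT_DIRECT_COMPACT;
    case DEFRAG_DEFER:
        return ACT_WAKE_KCOMPACTD;
    case DEFRAG_DEFER_MADVISE:
        return madvised ? ACT_DIRECT_COMPACT : ACT_WAKE_KCOMPACTD;
    case DEFRAG_MADVISE:
        return madvised ? ACT_DIRECT_COMPACT : ACT_FALLBACK;
    case DEFRAG_NEVER:
        return ACT_FALLBACK;
    }
    assert(!"unknown defrag mode");
    return ACT_FALLBACK;
}
```

A file_thp_gfp_mask() along these lines would let the administrator choose between the synchronous and the kcompactd behaviour instead of hard-coding one of them in the allocator or in filemap.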

The problem with fallback is that your application is never going to get
a THP and eventually TLB pressure might actually end up slowing you
down in the long run.

Also, compaction is only really attempted when it makes sense, that is,
when enough free memory is available to actually perform the compaction
and have a chance of creating a large enough huge page. So compaction is
never performed under acute memory pressure, which means your
system actually has enough free pages, but somehow the compaction is
slow and inefficient.
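
The suitability check can be modelled roughly as follows. This is a simplified userspace sketch, assuming the gate is "order-0 free pages must exceed the watermark plus a gap of twice the requested allocation size"; the names `compact_gap_model()` and `compaction_worth_trying()` are hypothetical, and the real check also accounts for lowmem reserves and CMA pages.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Headroom wanted on top of the watermark: room for the block being
 * assembled plus the free pages used as migration targets.
 */
unsigned long compact_gap_model(unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long) - 1);
    return 2UL << order;
}

/* True if there is enough free order-0 memory for compaction to make progress. */
bool compaction_worth_trying(unsigned long free_pages,
                             unsigned long watermark,
                             unsigned int order)
{
    return free_pages > watermark + compact_gap_model(order);
}
```

For an order-9 request the gap in this model is 1024 pages, so a zone sitting right at its watermark would skip compaction entirely rather than stall in it.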

I am just thinking out loud here and trying to address the root cause. The
real problem here is fragmentation due to unmovable pages, probably
page tables in your case. We should work more on reducing pageblock
type mixing. Also, page tables can actually be made movable so that
compaction can treat them as movable pages.


> 
> Would either of these directions be acceptable for a v3, or would you prefer a different approach?
> 
> I'm happy to test any additional variations or direction to move this forward
> 
> Salvatore
> 
> 
> [1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
> 

-- 
~karim



* Re: [PATCH v2 1/3] mm/swap: colocate page-cluster sysctl with swap readahead
From: Barry Song @ 2026-05-31 22:06 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Youngjun Park, Qi Zheng, Shakeel Butt, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Hugh Dickins,
	Baolin Wang, linux-mm, linux-kernel
In-Reply-To: <20260531-ch-swap-series-plus-folio-lru-cleanup-v2-1-1b9a4ac255b4@gmail.com>

On Sun, May 31, 2026 at 5:50 PM Jianyue Wu <wujianyue000@gmail.com> wrote:
>
> page_cluster and the vm.page-cluster sysctl are only used by swap-in
> readahead in swap_state.c. Move them out of swap.c together with
> swap_readahead_setup(), and make page_cluster static to that file.
>
> Rename swap_setup() while moving it as well. The helper is internal to
> MM and now only sets up swap readahead defaults and its sysctl hook, so
> the more specific name matches its reduced scope.
>
> Signed-off-by: Jianyue Wu <wujianyue000@gmail.com>
> ---
>  include/linux/swap.h |  2 +-
>  mm/swap.c            | 36 ------------------------------------
>  mm/swap.h            |  2 --
>  mm/swap_state.c      | 37 +++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c          |  2 +-
>  5 files changed, 39 insertions(+), 40 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 636d94108166..c36f72877e8b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -345,7 +345,7 @@ extern void lru_add_drain_cpu_zone(struct zone *zone);
>  extern void lru_add_drain_all(void);
>  void folio_deactivate(struct folio *folio);
>  void folio_mark_lazyfree(struct folio *folio);
> -extern void swap_setup(void);
> +extern void swap_readahead_setup(void);

Can we move this to mm/swap.h?



* Re: [PATCH] zram: fix partial I/O config check
From: Barry Song @ 2026-05-31 21:38 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Chris Li,
	Kairui Song, Andrew Morton, Baoquan He, linux-kernel, linux-block,
	linux-mm, Christoph Hellwig
In-Reply-To: <20260531-zram-fix-partial-io-config-check-on-akpm-v1-1-eb085d98faea@gmail.com>

On Sun, May 31, 2026 at 8:35 PM Jianyue Wu <wujianyue000@gmail.com> wrote:
>
> IS_ENABLED() expects a CONFIG_* symbol. Use the real Kconfig symbol so
> this warning reflects whether synchronous partial I/O is built in.
>
> Signed-off-by: Jianyue Wu <wujianyue000@gmail.com>
> ---
> zram: fix partial I/O config check
> ---
>  drivers/block/zram/zram_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 6e1330ce4bc1..72f89fd5572e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1510,7 +1510,7 @@ static int read_from_bdev(struct zram *zram, struct page *page, u32 index,
>  {
>         atomic64_inc(&zram->stats.bd_reads);
>         if (!parent) {
> -               if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
> +               if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_ZRAM_PARTIAL_IO)))

However, I don't see ZRAM_PARTIAL_IO defined as a Kconfig option.

#if PAGE_SIZE != 4096
static inline bool is_partial_io(struct bio_vec *bvec)
{
        return bvec->bv_len != PAGE_SIZE;
}
#define ZRAM_PARTIAL_IO         1
#else
static inline bool is_partial_io(struct bio_vec *bvec)
{
        return false;
}
#endif
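
For illustration, a userspace sketch of why the distinction matters. The helper macros below are simplified copies of the IS_ENABLED() machinery (an approximation of include/linux/kconfig.h, not a verbatim copy), and `ZRAM_PARTIAL_IO` is defined by hand to model the PAGE_SIZE != 4096 case. The two wrapper functions are hypothetical names.

```c
#include <assert.h>

/* Simplified copies of the kernel's IS_ENABLED() helpers (assumption). */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x) ___is_defined(x)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define __or(x, y) ___or(x, y)
#define ___or(x, y) ____or(__ARG_PLACEHOLDER_##x, y)
#define ____or(arg1_or_junk, y) __take_second_arg(arg1_or_junk 1, y)
#define IS_BUILTIN(option) __is_defined(option)
#define IS_MODULE(option) __is_defined(option##_MODULE)
#define IS_ENABLED(option) __or(IS_BUILTIN(option), IS_MODULE(option))

/* Models the driver-local define from the PAGE_SIZE != 4096 branch. */
#define ZRAM_PARTIAL_IO 1

/* Any macro defined to 1 is "enabled", Kconfig symbol or not. */
static_assert(IS_ENABLED(ZRAM_PARTIAL_IO) == 1, "local define is seen");

int local_define_enabled(void)
{
    return IS_ENABLED(ZRAM_PARTIAL_IO);
}

/* No such symbol is defined anywhere here, so this evaluates to 0. */
int kconfig_symbol_enabled(void)
{
    return IS_ENABLED(CONFIG_ZRAM_PARTIAL_IO);
}
```

Under these definitions the existing check already tracks the driver-local define, while the CONFIG_-prefixed form would always evaluate to 0 unless a matching Kconfig option were added.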


>                         return -EIO;
>                 return read_from_bdev_sync(zram, page, index, blk_idx);
>         }
>
> ---
> base-commit: 404fb4f38e8f38469dfff4df0205c9d18eeb1f57
> change-id: 20260531-zram-fix-partial-io-config-check-on-akpm-c62b972416f8
>
> Best regards,
> --
> Jianyue Wu <wujianyue000@gmail.com>
>



* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31 21:21 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.

Yes. But I propose an alternative solution to this problem.

Brauner said in discussion for your patchset:
"So I'm not very likely to pick this up as is".
So, I decided to submit another solution.

Pedro, I'm not trying to insult you.

Other kernel developers will decide which of these two solutions they like more.

Many people in discussion of your patchset said how they
dislike splice/vmsplice, and especially vmsplice.
Hellwig said "vmsplice is the worst".
Brauner, Hellwig, Horn said that they dislike vmsplice.
They said that vmsplice in its current form should not
be used, and that it is broken.

Despite all these problems nobody managed to fix
vmsplice in all these years.
So I propose just to effectively remove it.

You may think that I just saw a recent discussion and decided
to jump in. No. splice/vmsplice has been a topic of interest of mine
for many years. You can verify this by searching "f:Askar splice"
on lore.kernel.org . I simply decided that, given the
recent vulnerabilities, now is the perfect time to solve
all these vmsplice problems once and for all.

I explained my position here:
https://lore.kernel.org/all/20260523204100.553125-1-safinaskar@gmail.com/ .
Nobody answered, so I just posted this patchset.

If my patchset is applied, then I will try to deal
with splice-pagecache-to-pipe somehow,
probably by removing it, too. :) I decided to deal
with vmsplice first because it seems to be
the easier problem.

> > vmsplice(fd, vec, vlen, vmsplice_flags) will
> > be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> > readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> > writable pipe.
>
> This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
> There are users.

Yes, there are. But my solution is compatible. vmsplice is simply a performance
optimization. vmsplice will work just as before, but slower.
And, most importantly, the vmsplice design problems will be gone
(nobody has managed to fix them in all these years anyway).
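
To make the copy-based calls concrete, here is a small userspace sketch of preadv2()/pwritev2() on a pipe with offset -1 and flags 0. It only demonstrates the two syscalls the proposal would route vmsplice() through; it does not call vmsplice() itself, and `pipe_roundtrip()` is a hypothetical helper name. It assumes Linux with a glibc that exposes preadv2()/pwritev2().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two buffers into a fresh pipe with pwritev2(), then read them
 * back with preadv2(). Offset -1 means "use the current position",
 * which is the only offset that makes sense for a pipe.
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t pipe_roundtrip(char *out, size_t out_len)
{
    int fds[2];
    char a[] = "hello, ";
    char b[] = "pipe";
    struct iovec wr[2] = {
        { .iov_base = a, .iov_len = strlen(a) },
        { .iov_base = b, .iov_len = strlen(b) },
    };
    struct iovec rd[1] = { { .iov_base = out, .iov_len = out_len } };
    ssize_t n = -1;

    assert(out != NULL);
    if (pipe(fds) != 0)
        return -1;

    /* Copies the user buffers into the pipe; no pages are shared. */
    if (pwritev2(fds[1], wr, 2, -1, 0) == (ssize_t)(strlen(a) + strlen(b)))
        n = preadv2(fds[0], rd, 1, -1, 0);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Because the data is copied, the caller may reuse or modify its buffers as soon as pwritev2() returns, which is the property the current page-sharing vmsplice() cannot offer.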

-- 
Askar Safin



* Re: [PATCH mm-unstable v18 10/14] mm/khugepaged: introduce collapse_allowable_orders helper function
From: David Hildenbrand (Arm) @ 2026-05-31 20:18 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-11-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:
> Add collapse_allowable_orders() to generalize THP order eligibility. The
> function determines which THP orders are permitted based on collapse
> context (khugepaged vs madv_collapse).
> 
> This consolidates collapse configuration logic and provides a clean
> interface for future mTHP collapse support where the orders may be
> different.

It would have been good to describe here that, for now, it only ever returns
PMDs, and that it will be extended next.

Logically, this patch belongs to #12, not #11 ... so seeing it before #11 was a bit

... and there, it is clear that we don't even want to know the orders?

So can we just call this function

"collapse_possible" and make it return a boolean?

> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4534025bc81d..64ceebc9d8a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -552,12 +552,21 @@ void __khugepaged_enter(struct mm_struct *mm)
>  		wake_up_interruptible(&khugepaged_wait);
>  }
>  
> +/* Check what orders are allowed based on the vma and collapse type */
> +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
> +		vm_flags_t vm_flags, enum tva_type tva_flags)
> +{
> +	unsigned long orders = BIT(HPAGE_PMD_ORDER);
> +
> +	return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (collapse_allowable_orders(vma, vm_flags, TVA_KHUGEPAGED))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -2680,7 +2689,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>  			cc->progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) {
>  			cc->progress++;
>  			continue;
>  		}
> @@ -2989,7 +2998,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	BUG_ON(vma->vm_start > start);
>  	BUG_ON(vma->vm_end < end);
>  
> -	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> +	if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_FORCED_COLLAPSE))
>  		return -EINVAL;
>  
>  	cc = kmalloc_obj(*cc);

Having a simple

static bool collapse_possible(...)
{
	return collapse_allowable_orders(...)
}

Would make the above slightly more readable.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH mm-unstable v18 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics
From: David Hildenbrand (Arm) @ 2026-05-31 20:09 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-9-npache@redhat.com>

On 5/22/26 17:00, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> 
> - collapse_exceed_swap_pte: Counts when mTHP collapse fails due to
> 	encountering a swap PTE.
> 
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
>   	exceeding the none PTE threshold for the given order
> 
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to
> 	encountering a shared PTE.
> 
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
> 
> As we currently do not support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
> 
> We will add support for mTHP collapse for anonymous pages next; let's also
> track when this happens at the PMD level within the per-mTHP stats.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 21 +++++++++++++++++++--
>  4 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..80a4d0bed70b 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,20 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>  
> +collapse_exceed_none_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_none threshold.
> +
> +collapse_exceed_swap_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one swap PTE.
> +
> +collapse_exceed_shared_pte
> +       The number of collapse attempts that failed due to exceeding the
> +       max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP range
> +       contains at least one shared PTE.
> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index ba7ae6808544..48496f09909b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 345c54133c83..5c128cdec810 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);

[...]

>  		/* See collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> +			/*
> +			 * TODO: Support shared pages without leading to further
> +			 * mTHP collapses. Currently bringing in new pages via
> +			 * shared may cause a future higher order collapse on a
> +			 * rescan of the same range.
> +			 */

This comment actually belongs in an earlier patch, no?

>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				if (is_pmd_order(order))
> +					count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1138,6 +1148,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (!is_pmd_order(order)) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> @@ -1433,6 +1444,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out_unmap;
>  			}
>  			continue;
> @@ -1441,6 +1454,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++unmapped > max_ptes_swap) {
>  				result = SCAN_EXCEED_SWAP_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>  				goto out_unmap;
>  			}
>  			/*
> @@ -1498,6 +1513,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			if (++shared > max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(HPAGE_PMD_ORDER,
> +						MTHP_STAT_COLLAPSE_EXCEED_SHARED);

Can be done as a later cleanup, but having a single function that obtains an
order and knows which stats to update would be cleaner (and a good preparation
for shmem mTHP collapse support).
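
A minimal userspace sketch of what such a helper could look like, using plain arrays as stand-ins for the vm events and the per-order mTHP stats. All names here (`count_collapse_exceed()`, `enum exceed_reason`, the `*_model` arrays, `MODEL_PMD_ORDER`) are hypothetical and not taken from the series.

```c
#include <assert.h>

#define MODEL_PMD_ORDER 9 /* assumption: 4 KiB base pages, 2 MiB PMD */

enum exceed_reason {
    EXCEED_NONE,
    EXCEED_SWAP,
    EXCEED_SHARED,
    EXCEED_REASONS,
};

/* Stand-ins for the global THP_SCAN_EXCEED_* events and the mTHP stats. */
unsigned long vm_events_model[EXCEED_REASONS];
unsigned long mthp_stats_model[MODEL_PMD_ORDER + 1][EXCEED_REASONS];

/*
 * One place that knows the rule: the legacy vm event is bumped only
 * for PMD-order collapse, the per-order stat is bumped for every order.
 */
void count_collapse_exceed(unsigned int order, enum exceed_reason reason)
{
    assert(order <= MODEL_PMD_ORDER);
    assert(reason < EXCEED_REASONS);

    if (order == MODEL_PMD_ORDER)
        vm_events_model[reason]++;
    mthp_stats_model[order][reason]++;
}
```

With a helper like this, each call site would pass only the order and the failure reason, and the PMD-versus-mTHP distinction would live in a single function.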

Nothing jumped out at me, so

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [RFC PATCH v2 3/9] mm/zswap: support fully zswap-backed large folio loads
From: Fujunjie @ 2026-05-31 20:03 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=Or6forBoArv1b=MZuhOuF+MTuLLZWPKgUmkBVaoBoYSQ@mail.gmail.com>



On 5/30/2026 2:25 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> zswap currently refuses large swapcache folios. That is correct for mixed
>> backend ranges, but it also prevents the common swapin path from loading a
>> range that is still fully backed by zswap.
>>
>> Teach zswap_load() to fill a locked large swapcache folio by decompressing
>> each base-page entry into the matching folio offset, then flushing the
>> folio once. A missing entry after zswap data has been seen is reported as
>> -EAGAIN so the caller can drop the speculative large folio and retry
>> order-0.
>>
>> The large load keeps the zswap entries in place. It is a clean speculative
>> fill: until the swap slots are freed, zswap remains the backing copy if
>> reclaim drops the large folio before PTEs are installed.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  mm/zswap.c | 105 ++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 87 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index da5297f7bd69..94ba112a2982 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -15,6 +15,8 @@
>>
>>  #include <linux/module.h>
>>  #include <linux/cpu.h>
>> +#include <linux/mm.h>
>> +#include <linux/huge_mm.h>
>>  #include <linux/highmem.h>
>>  #include <linux/slab.h>
>>  #include <linux/spinlock.h>
>> @@ -934,7 +936,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>>         return comp_ret == 0 && alloc_ret == 0;
>>  }
>>
>> -static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio,
>> +                            unsigned int page_idx, bool flush_dcache)
>>  {
>>         struct zswap_pool *pool = entry->pool;
>>         struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
>> @@ -952,14 +955,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>>
>>                 WARN_ON_ONCE(input->length != PAGE_SIZE);
>>
>> -               dst = kmap_local_folio(folio, 0);
>> +               dst = kmap_local_folio(folio, page_idx * PAGE_SIZE);
>>                 memcpy_from_sglist(dst, input, 0, PAGE_SIZE);
>>                 dlen = PAGE_SIZE;
>>                 kunmap_local(dst);
>> -               flush_dcache_folio(folio);
>> +               if (flush_dcache)
>> +                       flush_dcache_folio(folio);
>>         } else {
>>                 sg_init_table(&output, 1);
>> -               sg_set_folio(&output, folio, PAGE_SIZE, 0);
>> +               sg_set_folio(&output, folio, PAGE_SIZE, page_idx * PAGE_SIZE);
>>                 acomp_request_set_params(acomp_ctx->req, input, &output,
>>                                          entry->length, PAGE_SIZE);
>>                 ret = crypto_acomp_decompress(acomp_ctx->req);
>> @@ -1042,7 +1046,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>                 goto out;
>>         }
>>
>> -       if (!zswap_decompress(entry, folio)) {
>> +       if (!zswap_decompress(entry, folio, 0, true)) {
>>                 ret = -EIO;
>>                 goto out;
>>         }
>> @@ -1615,10 +1619,9 @@ enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>>   *  NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
>>   *  will SIGBUS).
>>   *
>> - *  -EINVAL: if the swapped out content was in zswap, but the page belongs
>> - *  to a large folio, which is not supported by zswap. The folio is unlocked,
>> - *  but NOT marked up-to-date, so that an IO error is emitted (e.g.
>> - *  do_swap_page() will SIGBUS).
>> + *  -EAGAIN: if the swapped out content belongs to a large folio, but the
>> + *  range is mixed or raced with writeback. The folio remains locked so the
>> + *  caller can drop the large swapcache folio and retry order-0.
>>   *
>>   *  -ENOENT: if the swapped out content was not in zswap. The folio remains
>>   *  locked on return.
>> @@ -1626,9 +1629,12 @@ enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>>  int zswap_load(struct folio *folio)
>>  {
>>         swp_entry_t swp = folio->swap;
>> +       unsigned int nr_pages = folio_nr_pages(folio);
>> +       unsigned int type = swp_type(swp);
>>         pgoff_t offset = swp_offset(swp);
>> -       struct xarray *tree = swap_zswap_tree(swp);
>> +       struct xarray *tree;
>>         struct zswap_entry *entry;
>> +       unsigned int i;
>>
>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>> @@ -1636,21 +1642,84 @@ int zswap_load(struct folio *folio)
>>         if (zswap_never_enabled())
>>                 return -ENOENT;
>>
>> -       /*
>> -        * Large folios should not be swapped in while zswap is being used, as
>> -        * they are not properly handled. Zswap does not properly load large
>> -        * folios, and a large folio may only be partially in zswap.
>> -        */
>> -       if (WARN_ON_ONCE(folio_test_large(folio))) {
>> +       if (folio_test_large(folio)) {
>> +               struct obj_cgroup *first_objcg = NULL;
>> +               bool same_objcg = true;
>> +               bool saw_zswap = false;
>> +               bool saw_non_zswap = false;
>> +
>> +               /*
>> +                * The locked large swapcache folio now covers the range and
>> +                * conflicts with zswap writeback's order-0 swapcache allocation.
>> +                * If the range is mixed or an entry disappears, retry order-0.
>> +                */
>> +               for (i = 0; i < nr_pages; i++) {
>> +                       tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +                       entry = xa_load(tree, offset + i);
>> +                       if (!entry) {
>> +                               if (saw_zswap)
>> +                                       return -EAGAIN;
>> +                               saw_non_zswap = true;
>> +                               continue;
>> +                       }
> 
> Can we use xas_load API here instead of traversing down the tree again
> and again?

I'll rework it to use xas_load(), while handling zswap tree boundaries correctly.

> 
>> +                       if (saw_non_zswap)
>> +                               return -EAGAIN;
>> +
>> +                       if (!saw_zswap)
>> +                               first_objcg = entry->objcg;
>> +                       else if (entry->objcg != first_objcg)
>> +                               same_objcg = false;
> 
> Can we get different objcg at this point?

The objcg pointers can be different in principle, for example if
the range is assembled from entries that came from different per-node objcgs
of the same memcg.

But for this accounting path, count_objcg_events() ultimately charges the
event to obj_cgroup_memcg(entry->objcg). Since the large swapcache allocation
has already checked compatible swap ownership for the range, the final memcg
accounting target should be the same even if the objcg pointers differ.

I will simplify this in v3 and avoid the extra objcg equality pass.

> 
>> +                       saw_zswap = true;
>> +               }
>> +               if (!saw_zswap)
>> +                       return -ENOENT;
>> +
>> +               for (i = 0; i < nr_pages; i++) {
>> +                       tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +                       entry = xa_load(tree, offset + i);
>> +                       if (!entry)
>> +                               return -EAGAIN;
>> +
>> +                       if (!zswap_decompress(entry, folio, i, false)) {
>> +                               folio_unlock(folio);
>> +                               return -EIO;
>> +                       }
>> +               }
>> +
>> +               flush_dcache_folio(folio);
>> +               /*
>> +                * Keep zswap entries until swap slots are freed. This is a clean
>> +                * speculative fill; zswap remains the backing copy if reclaim
>> +                * drops the large folio before PTEs are installed.
>> +                */
>> +               folio_mark_uptodate(folio);
>> +               count_vm_events(ZSWPIN, nr_pages);
>> +               count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
>> +
>> +               if (same_objcg) {
>> +                       if (first_objcg)
>> +                               count_objcg_events(first_objcg, ZSWPIN, nr_pages);
>> +               } else {
>> +                       for (i = 0; i < nr_pages; i++) {
>> +                               tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +                               entry = xa_load(tree, offset + i);
>> +                               if (WARN_ON_ONCE(!entry))
>> +                                       continue;
>> +                               if (entry->objcg)
>> +                                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> 
> xas_load() here too?

Yes, same issue here. 

> 
> 
>> +                       }
>> +               }
>> +
>>                 folio_unlock(folio);
>> -               return -EINVAL;
>> +               return 0;
>>         }
> 
>>
>> +       tree = swap_zswap_tree(swp);
>>         entry = xa_load(tree, offset);
>>         if (!entry)
>>                 return -ENOENT;
>>
>> -       if (!zswap_decompress(entry, folio)) {
>> +       if (!zswap_decompress(entry, folio, 0, true)) {
>>                 folio_unlock(folio);
>>                 return -EIO;
>>         }
> 
> I wonder how much of these two paths (order 0 and larger order) can be
> unified...

I think more of this can be unified than this version does.

I split the paths this way because I treated the large-folio load as a
speculative fill and kept the zswap entries as the backing copy. But with
your point that an installed large swapcache folio should block zswap
writeback from turning the range mixed, I should revisit that completion rule
instead of baking it into a separate path.

For the v3 version I will try to collapse the common load path. If the large-folio
case still needs different entry lifetime rules, I will make that distinction
explicit.

> 
>> --
>> 2.34.1
>>





* Re: [PATCH mm-unstable v18 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts
From: David Hildenbrand (Arm) @ 2026-05-31 20:02 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260531073102.20318-1-lance.yang@linux.dev>

On 5/31/26 09:31, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:07AM -0600, Nico Pache wrote:
>> There are cases where, if an attempted collapse fails, all subsequent
>> orders are guaranteed to also fail. Avoid these collapse attempts by
>> bailing out early.
>>
>> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 24 +++++++++++++++++++++++-
>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d3d7db8be26c..15b7298bc225 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1535,9 +1535,31 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
>> 			collapse_address = address + offset * PAGE_SIZE;
>> 			ret = collapse_huge_page(mm, collapse_address, referenced,
>> 						 unmapped, cc, order);
>> -			if (ret == SCAN_SUCCEED) {
>> +
>> +			switch (ret) {
>> +			/* Cases where we continue to next collapse candidate */
>> +			case SCAN_SUCCEED:
>> 				collapsed += nr_ptes;
>> +				fallthrough;
>> +			case SCAN_PTE_MAPPED_HUGEPAGE:
>> 				continue;
>> +			/* Cases where lower orders might still succeed */
>> +			case SCAN_LACK_REFERENCED_PAGE:
>> +			case SCAN_EXCEED_NONE_PTE:
>> +			case SCAN_EXCEED_SWAP_PTE:
>> +			case SCAN_EXCEED_SHARED_PTE:
>> +			case SCAN_PAGE_LOCK:
>> +			case SCAN_PAGE_COUNT:
>> +			case SCAN_PAGE_NULL:
>> +			case SCAN_DEL_PAGE_LRU:
>> +			case SCAN_PTE_NON_PRESENT:
>> +			case SCAN_PTE_UFFD_WP:
>> +			case SCAN_ALLOC_HUGE_PAGE_FAIL:
> 
> Nit: shouldn't SCAN_CGROUP_CHARGE_FAIL go with SCAN_ALLOC_HUGE_PAGE_FAIL
> here?
> 
> If charging the current order fails, a smaller order might still fit :)

I think the reasoning here was that if we are already that close to our memory
limit, we should just give up instead of trying to squeeze it in ... :)
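
A minimal userspace sketch of the three-way decision being discussed, assuming
that results not listed in the quoted hunk lead to bailing out (the hunk is cut
off above, so that default is an assumption). All MODEL_* names are
hypothetical stand-ins for the kernel's scan_result values, and only a few of
them are modelled:

```c
/* Userspace model only; MODEL_* names are hypothetical, not kernel symbols. */
enum model_scan_result {
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_PTE_MAPPED_HUGEPAGE,
	MODEL_SCAN_EXCEED_NONE_PTE,
	MODEL_SCAN_ALLOC_HUGE_PAGE_FAIL,
	MODEL_SCAN_CGROUP_CHARGE_FAIL,
	MODEL_SCAN_FAIL,
};

enum model_action {
	MODEL_NEXT_CANDIDATE,	/* move on to the next collapse candidate */
	MODEL_TRY_LOWER_ORDER,	/* a smaller order might still succeed */
	MODEL_GIVE_UP,		/* assumed: stop trying further orders */
};

static enum model_action model_classify(enum model_scan_result ret)
{
	switch (ret) {
	case MODEL_SCAN_SUCCEED:
	case MODEL_SCAN_PTE_MAPPED_HUGEPAGE:
		return MODEL_NEXT_CANDIDATE;
	case MODEL_SCAN_EXCEED_NONE_PTE:
	case MODEL_SCAN_ALLOC_HUGE_PAGE_FAIL:
		return MODEL_TRY_LOWER_ORDER;
	case MODEL_SCAN_CGROUP_CHARGE_FAIL:
		/* Close to the memory limit: give up rather than squeeze in. */
	default:
		return MODEL_GIVE_UP;
	}
}
```

Moving MODEL_SCAN_CGROUP_CHARGE_FAIL into the middle group would model Lance's
suggestion instead.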

-- 
Cheers,

David



* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: David Hildenbrand (Arm) @ 2026-05-31 20:00 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260531093942.19644-1-lance.yang@linux.dev>

On 5/31/26 11:39, Lance Yang wrote:
> 
> On Fri, May 22, 2026 at 09:00:01AM -0600, Nico Pache wrote:
>> Pass an order and offset to collapse_huge_page to support collapsing anon
>> memory to arbitrary orders within a PMD. order indicates what mTHP size we
>> are attempting to collapse to, and offset indicates where in the PMD to
>> start the collapse attempt.
>>
>> For non-PMD collapse we must leave the anon VMA write locked until after
>> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>> the mTHP case this is not true, and we must keep the lock to prevent
>> access/changes to the page tables. This can happen if the rmap walkers hit
>> a pmd_none while the PMD entry is currently unavailable due to being
>> temporarily removed during the collapse phase.
>>
>> Acked-by: Usama Arif <usama.arif@linux.dev>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
>> 1 file changed, 55 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fab35d318641..d64f42f66236 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>>  * while allocating a THP, as that could trigger direct reclaim/compaction.
>>  * Note that the VMA must be rechecked after grabbing the mmap_lock again.
>>  */
>> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -		int referenced, int unmapped, struct collapse_control *cc)
>> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
>> +		int referenced, int unmapped, struct collapse_control *cc,
>> +		unsigned int order)
>> {
>> +	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
>> +	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
>> 	LIST_HEAD(compound_pagelist);
>> 	pmd_t *pmd, _pmd;
>> -	pte_t *pte;
>> +	pte_t *pte = NULL;
>> 	pgtable_t pgtable;
>> 	struct folio *folio;
>> 	spinlock_t *pmd_ptl, *pte_ptl;
>> 	enum scan_result result = SCAN_FAIL;
>> 	struct vm_area_struct *vma;
>> 	struct mmu_notifier_range range;
>> +	bool anon_vma_locked = false;
>>
>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>> +	result = alloc_charge_folio(&folio, mm, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_nolock;
>>
>> 	mmap_read_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> 	}
>>
>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
>> 	if (result != SCAN_SUCCEED) {
>> 		mmap_read_unlock(mm);
>> 		goto out_nolock;
>> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 * released when it fails. So we jump out_nolock directly in
>> 		 * that case.  Continuing to collapse causes inconsistency.
>> 		 */
>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -						     referenced, HPAGE_PMD_ORDER);
>> +		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
>> +						     referenced, order);
>> 		if (result != SCAN_SUCCEED)
>> 			goto out_nolock;
>> 	}
>> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * mmap_lock.
>> 	 */
>> 	mmap_write_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
>> -					 HPAGE_PMD_ORDER);
>> +	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
>> +					 &vma, cc, order);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>> 	/* check if the pmd is still valid */
>> 	vma_start_write(vma);
>> -	result = check_pmd_still_valid(mm, address, pmd);
>> +	result = check_pmd_still_valid(mm, pmd_addr, pmd);
>> 	if (result != SCAN_SUCCEED)
>> 		goto out_up_write;
>>
>> 	anon_vma_lock_write(vma->anon_vma);
>> +	anon_vma_locked = true;
>>
>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> -				address + HPAGE_PMD_SIZE);
>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
>> +				end_addr);
>> 	mmu_notifier_invalidate_range_start(&range);
>>
>> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * Parallel GUP-fast is fine since GUP-fast will back off when
>> 	 * it detects PMD is changed.
>> 	 */
>> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
>> +	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
>> 	spin_unlock(pmd_ptl);
>> 	mmu_notifier_invalidate_range_end(&range);
>> 	tlb_remove_table_sync_one();
>>
>> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> +	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
>> 	if (pte) {
>> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -						      HPAGE_PMD_ORDER,
>> -						      &compound_pagelist);
>> +		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
>> +						      order, &compound_pagelist);
>> 		spin_unlock(pte_ptl);
>> 	} else {
>> 		result = SCAN_NO_PTE_TABLE;
>> 	}
>>
>> 	if (unlikely(result != SCAN_SUCCEED)) {
>> -		if (pte)
>> -			pte_unmap(pte);
>> 		spin_lock(pmd_ptl);
>> -		BUG_ON(!pmd_none(*pmd));
>> +		WARN_ON_ONCE(!pmd_none(*pmd));
>> 		/*
>> 		 * We can only use set_pmd_at when establishing
>> 		 * hugepmds and never for establishing regular pmds that
>> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 		 */
>> 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> 		spin_unlock(pmd_ptl);
>> -		anon_vma_unlock_write(vma->anon_vma);
>> 		goto out_up_write;
>> 	}
>>
>> 	/*
>> -	 * All pages are isolated and locked so anon_vma rmap
>> -	 * can't run anymore.
>> +	 * For PMD collapse all pages are isolated and locked so anon_vma
>> +	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
>> +	 * removed and not all pages are isolated and locked, so we must hold
>> +	 * the lock to prevent neighboring folios from attempting to access
>> +	 * this PMD until its reinstalled.
>> 	 */
>> -	anon_vma_unlock_write(vma->anon_vma);
>> +	if (is_pmd_order(order)) {
>> +		anon_vma_unlock_write(vma->anon_vma);
>> +		anon_vma_locked = false;
>> +	}
>>
>> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> -					   vma, address, pte_ptl,
>> -					   HPAGE_PMD_ORDER,
>> -					   &compound_pagelist);
>> -	pte_unmap(pte);
>> +					   vma, start_addr, pte_ptl,
>> +					   order, &compound_pagelist);
>> 	if (unlikely(result != SCAN_SUCCEED))
>> 		goto out_up_write;
>>
>> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> 	 * write.
>> 	 */
>> 	__folio_mark_uptodate(folio);
>> -	pgtable = pmd_pgtable(_pmd);
>> -
>> 	spin_lock(pmd_ptl);
>> -	BUG_ON(!pmd_none(*pmd));
>> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
>> +	WARN_ON_ONCE(!pmd_none(*pmd));
>> +	if (is_pmd_order(order)) {
>> +		pgtable = pmd_pgtable(_pmd);
>> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
>> +	} else {
>> +		/*
>> +		 * set_ptes is called in map_anon_folio_pte_nopf with the
>> +		 * pmd_ptl lock still held; this is safe as the PMD is expected
>> +		 * to be none. The pmd entry is then repopulated below.
>> +		 */
>> +		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> 
> Emm ... is it safe to use map_anon_folio_pte_nopf() here?
> 
> At this point pmdp_collapse_flush() has cleared the PMD from the page
> tables. The PTE table we are updating is only reachable through the saved
> old PMD value, _pmd, until pmd_populate() below.
> 
> map_anon_folio_pte_nopf() does set_ptes() and then calls
> update_mmu_cache_range(). Documentation/core-api/cachetlb.rst describes
> that hook as:
> 
> "
> 	At the end of every page fault, this routine is invoked to tell
> 	the architecture specific code that translations now exists
> 	in the software page tables for address space "vma->vm_mm"
> 	at virtual address "address" for "nr" consecutive pages.
> "
> 
> But that does not seem true here yet, since the PTE table is not
> reachable from vma->vm_mm when update_mmu_cache_range() is called.
> 
> Should we avoid calling update_mmu_cache_range() until after the PTE
> table is reinstalled with pmd_populate()?

I recall that update_mmu_cache* users mostly care about updating folio flags
for the folio derived from the PTE ... or flushing caches for the user address.

So intuitively I would say "the architecture code doesn't care that the PMD
table will only be visible to HW shortly after". The important thing should be
that it will definitely happen, and that nothing else is currently there or can
be there?
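
For reference, the quoted hunk works with three addresses: pmd_addr for the
PMD-level operations, and start_addr/end_addr for the PTE range that set_ptes()
and update_mmu_cache_range() act on. A small userspace sketch of that
arithmetic, assuming 4 KiB base pages and a 2 MiB PMD (both are assumptions;
the MODEL_* constants are hypothetical stand-ins for PAGE_SIZE and
HPAGE_PMD_MASK):

```c
#include <stdint.h>

/* Assumed geometry: 4 KiB pages, order-9 (2 MiB) PMD mappings. */
#define MODEL_PAGE_SIZE		UINT64_C(4096)
#define MODEL_HPAGE_PMD_ORDER	9
#define MODEL_HPAGE_PMD_SIZE	(MODEL_PAGE_SIZE << MODEL_HPAGE_PMD_ORDER)
#define MODEL_HPAGE_PMD_MASK	(~(MODEL_HPAGE_PMD_SIZE - 1))

struct model_collapse_range {
	uint64_t pmd_addr;	/* start of the PMD-sized region */
	uint64_t start_addr;	/* first address covered by the new folio */
	uint64_t end_addr;	/* one past the last covered address */
};

/* Mirrors the two const initialisers at the top of collapse_huge_page(). */
static struct model_collapse_range
model_collapse_range(uint64_t start_addr, unsigned int order)
{
	struct model_collapse_range r;

	r.pmd_addr = start_addr & MODEL_HPAGE_PMD_MASK;
	r.start_addr = start_addr;
	r.end_addr = start_addr + (MODEL_PAGE_SIZE << order);
	return r;
}
```

Under these assumptions an order-4 collapse starting 16 pages into a PMD covers
64 KiB, while the PMD-level helpers still operate on the 2 MiB-aligned base.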



-- 
Cheers,

David



* Re: [PATCH v2 0/2] mm/hmm: A fix and a selftest
From: David Hildenbrand (Arm) @ 2026-05-31 19:49 UTC (permalink / raw)
  To: Andrew Morton, Dev Jain
  Cc: liam, ljs, jgg, leon, shuah, vbabka, jannh, pfalcato, balbirs,
	linux-mm, linux-kernel, linux-fsdevel, rppt, surenb, mhocko,
	linux-kselftest, usama.arif, ryan.roberts, anshuman.khandual
In-Reply-To: <20260531122458.864e9deac7425e1b71c14f3a@linux-foundation.org>

On 5/31/26 21:24, Andrew Morton wrote:
> On Sat, 30 May 2026 08:54:10 +0000 Dev Jain <dev.jain@arm.com> wrote:
> 
>> Patch 1 fixes a stale warning present from the time when only migration
>> softleaf entries were supported at the PMD level.
>>
>> Patch 2 adds some code into hmm-tests.c which exercises the pagemap path
>> for PMD device-private entries.
> 
> David, I'll retain your ack on patch 1.

Thanks!

-- 
Cheers,

David



* Re: Process (was Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP) collapse support
From: David Hildenbrand (Arm) @ 2026-05-31 19:49 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Lorenzo Stoakes
  Cc: Nico Pache, akpm, linux-kernel, linux-mm, liam, mhocko, rppt,
	vbabka, willy
In-Reply-To: <fa9095cc-69a1-468d-a1fe-718386398b10@kernel.org>

> 
>> and have people
>> base work against that?
> 
> This I'm not so sure how it would work. Assuming we have submaintainers with
> their trees and branches, the final "stable branch" is merged from those.
> But it's not a good base for work targeting the same merge window, as that
> work would likely go to one of those submaintainer trees. But then it can't
> be based on the result of merge of all submaintainer trees. That could only
> work for patches targetting the next cycle (after the stable branch becomes
> part of rc1).
> 
> So either patches can be based on rc1 and applied as topic branches in a
> submaintainer tree and then merged, or if they really depend on something
> already in a submaintainer tree, then based on the respective topic branch
> that's part of it.

Right, most patches can be sent against the "stable branch", but cherry-picked
onto a submaintainer's branch / topic tree.


> 
>> This would be 'source of truth' and what we eventually send to Linus.
> 
> Yes.
> 
>> In that world, the maintainers perform conflict resolution, but with git rerere
>> we need only do this once.
> 
> I think the conflicts would arise from merging the submaintainers' branches
> to the mm-next tree, and if they get updated and the merges are recreated
> (like linux-next works) git rerere avoids resolving the same conflicts again.
> 
> Hm like Andrew said, this needs a diagram indeed :)

It's one of the first things we'll discuss in the upcoming meetings ... I want
to talk to some other folks (in particular, TIP and KVM) to understand how they
are handling that.

Hopefully I'll find some time once my inbox has calmed down a bit ... jeez, 1500
mails in one week, if my eyes didn't betray me.

-- 
Cheers,

David



* Re: [PATCH] MAINTAINERS: add vm.rst to memory management core
From: David Hildenbrand (Arm) @ 2026-05-31 19:31 UTC (permalink / raw)
  To: Brian Masney, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel
In-Reply-To: <20260528-mm-vm-rst-maintainers-file-v1-1-306631c0a610@redhat.com>

On 5/28/26 15:56, Brian Masney wrote:
> The vm.rst file is currently not listed in the MAINTAINERS file, so
> let's go ahead and add it to the MM core subsystem so that the maintainers
> are CCed when changes to the documentation are proposed.
> 
> Signed-off-by: Brian Masney <bmasney@redhat.com>
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c99f0650b334..d8dcc4f25ccf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16923,6 +16923,7 @@ L:	linux-mm@kvack.org
>  S:	Maintained
>  W:	http://www.linux-mm.org
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F:	Documentation/admin-guide/sysctl/vm.rst
>  F:	include/linux/folio_batch.h
>  F:	include/linux/gfp.h
>  F:	include/linux/gfp_types.h

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH] arm64: mm: call pagetable dtor when freeing hot-removed page tables
From: David Hildenbrand (Arm) @ 2026-05-31 19:29 UTC (permalink / raw)
  To: Kevin Brodsky, Vishal Moola, Catalin Marinas
  Cc: Andrew Morton, Alistair Popple, linux-arm-kernel, linux-kernel,
	linux-mm, will
In-Reply-To: <db11078b-a283-4717-bed2-da56e77c7132@arm.com>

On 5/27/26 09:34, Kevin Brodsky wrote:
> On 26/05/2026 14:31, David Hildenbrand (Arm) wrote:
>> On 5/26/26 13:54, Kevin Brodsky wrote:
>>> Agreed, I think this is the right thing to do, something like:
>>>
>>> if (folio_test_pgtable(page_folio(page)))
>>> 	pagetable_dtor_free(page_ptdesc(page));
>>> else
>>> 	free_hotplug_page_range(page, PAGE_SIZE, NULL);
>> That code pattern is wrong.
>>
>> folio_test_pgtable() shouldn't exist.
>>
>> In the future, something is either a pgtable or a folio, not both.
>>
>> So check the type against the page, not the folio.
> 
> In other words use PageTable(page) instead? Interestingly I can see a
> few calls to folio_test_pgtable() across the kernel but none to
> PageTable(), maybe just an antipattern then? The ctor/dtor also use
> __folio_{set,clear}_pgtable().

We started running in the wrong direction, and now have to undo all the wrong
things we already did :)

In the current design Willy has in mind, doing a page_folio() will later return
NULL if it's not actually a folio.
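
A toy userspace model of why the type check belongs on the page, assuming a
world where a page is either a folio or a page table but never both. Every
model_* name is hypothetical; in the kernel the page-level check Kevin mentions
is PageTable(page):

```c
#include <stddef.h>

/* Hypothetical model: each page has exactly one kind. */
enum model_page_kind { MODEL_KIND_FOLIO, MODEL_KIND_PGTABLE };

struct model_folio { int placeholder; };

struct model_page {
	enum model_page_kind kind;
	struct model_folio folio;	/* only meaningful for MODEL_KIND_FOLIO */
};

/* Models the future behaviour described above: NULL if not a folio. */
static struct model_folio *model_page_folio(struct model_page *page)
{
	return page->kind == MODEL_KIND_FOLIO ? &page->folio : NULL;
}

/* The type is read from the page itself, never via a folio pointer. */
static int model_page_is_pgtable(const struct model_page *page)
{
	return page->kind == MODEL_KIND_PGTABLE;
}

enum model_free_path { MODEL_FREE_PGTABLE_DTOR, MODEL_FREE_PAGE_RANGE };

static enum model_free_path model_pick_free_path(const struct model_page *page)
{
	return model_page_is_pgtable(page) ? MODEL_FREE_PGTABLE_DTOR
					   : MODEL_FREE_PAGE_RANGE;
}
```

In this model a folio_test_pgtable(page_folio(page))-style check would end up
dereferencing NULL for exactly the pages it is meant to detect.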

-- 
Cheers,

David



* Re: [PATCH v2 0/2] mm/hmm: A fix and a selftest
From: Andrew Morton @ 2026-05-31 19:24 UTC (permalink / raw)
  To: Dev Jain
  Cc: liam, ljs, jgg, leon, david, shuah, vbabka, jannh, pfalcato,
	balbirs, linux-mm, linux-kernel, linux-fsdevel, rppt, surenb,
	mhocko, linux-kselftest, usama.arif, ryan.roberts,
	anshuman.khandual
In-Reply-To: <20260530085413.1270139-1-dev.jain@arm.com>

On Sat, 30 May 2026 08:54:10 +0000 Dev Jain <dev.jain@arm.com> wrote:

> Patch 1 fixes a stale warning present from the time when only migration
> softleaf entries were supported at the PMD level.
> 
> Patch 2 adds some code into hmm-tests.c which exercises the pagemap path
> for PMD device-private entries.

David, I'll retain your ack on patch 1.

Sashiko has a concern about the selftest:
	https://sashiko.dev/#/patchset/20260530085413.1270139-1-dev.jain@arm.com



* Re: [RFC PATCH] MAINTAINERS: add ABI documents for mm
From: David Hildenbrand (Arm) @ 2026-05-31 19:19 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, Lorenzo Stoakes, Michal Hocko,
	Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka, linux-kernel,
	linux-mm
In-Reply-To: <20260530011852.90663-1-sj@kernel.org>

On 5/30/26 03:18, SeongJae Park wrote:
> A few mm subsystem entries in MAINTAINERS are missing their ABI
> documents.  Add those.

Maybe spell out "testing ABI" in here + subject, because I was concerned there
for a second :)

> 
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> 1. I found no better place for mm, mm-cma, mm-memory-tiers, and mm-numa,
>    so put those on the 'MM - MISC' section.  This is an RFC mainly
>    because I want to know if it concerns someone.
> 
>  MAINTAINERS | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e119182a8911d..a31f6f207afd8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16854,6 +16854,7 @@ L:	linux-mm@kvack.org
>  S:	Maintained
>  W:	http://www.linux-mm.org
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F:	Documentation/ABI/testing/sysfs-kernel-mm-ksm

Makes sense.

>  F:	Documentation/admin-guide/mm/ksm.rst
>  F:	Documentation/mm/ksm.rst
>  F:	include/linux/ksm.h
> @@ -16876,6 +16877,8 @@ L:	linux-mm@kvack.org
>  S:	Maintained
>  W:	http://www.linux-mm.org
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F:	Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> +F:	Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

Makes sense.

>  F:	include/linux/mempolicy.h
>  F:	include/uapi/linux/mempolicy.h
>  F:	include/linux/migrate.h
> @@ -16918,6 +16921,10 @@ L:	linux-mm@kvack.org
>  S:	Maintained
>  W:	http://www.linux-mm.org
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F:	Documentation/ABI/testing/sysfs-kernel-mm

Maybe that one should go to CORE? But MISC works for me as well.

> +F:	Documentation/ABI/testing/sysfs-kernel-mm-cma

Given that CMA is here, sounds about right.

> +F:	Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers

Given that memory tiering stuff is in here, makes sense I guess.

> +F:	Documentation/ABI/testing/sysfs-kernel-mm-numa

That's all about page demotion for vmscan ... but again, also belongs to memory
tiering. So makes sense.

>  F:	Documentation/admin-guide/mm/
>  F:	Documentation/mm/
>  F:	include/linux/cma.h
> @@ -17041,6 +17048,7 @@ R:	Barry Song <baohua@kernel.org>
>  R:	Youngjun Park <youngjun.park@lge.com>
>  L:	linux-mm@kvack.org
>  S:	Maintained
> +F:	Documentation/ABI/testing/sysfs-kernel-mm-swap
>  F:	Documentation/mm/swap-table.rst
>  F:	include/linux/swap.h
>  F:	include/linux/swapfile.h
> @@ -17068,6 +17076,7 @@ L:	linux-mm@kvack.org
>  S:	Maintained
>  W:	http://www.linux-mm.org
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F:	Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage
>  F:	Documentation/admin-guide/mm/transhuge.rst
>  F:	include/linux/huge_mm.h
>  F:	include/linux/khugepaged.h

All LGTM.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH] fs/proc/task_mmu: do not warn on seeing non-migration pmd entry
From: David Hildenbrand (Arm) @ 2026-05-31 19:11 UTC (permalink / raw)
  To: Dev Jain, akpm, liam, ljs, jgg, leon, shuah
  Cc: vbabka, jannh, pfalcato, rppt, surenb, mhocko, balbirs, linux-mm,
	linux-kernel, linux-fsdevel, linux-kselftest, ryan.roberts,
	anshuman.khandual, stable
In-Reply-To: <20260529111704.1078346-1-dev.jain@arm.com>

On 5/29/26 13:17, Dev Jain wrote:
> pagemap_pmd_range_thp() warns if a non-present PMD is not a migration
> entry. This became false once device-private entries at the PMD level were
> added.
> 
> One can hit the warning by patching hmm-tests.c with the following:
> 
> diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
> index e1c8a679a4cf3..7f0a3384f3c5f 100644
> --- a/tools/testing/selftests/mm/hmm-tests.c
> +++ b/tools/testing/selftests/mm/hmm-tests.c
> @@ -209,6 +209,37 @@ static int hmm_dmirror_cmd(int fd,
>  	return 0;
>  }
> 
> +static int hmm_read_self_pagemap(void *addr, unsigned long npages,
> +				 unsigned long page_size)
> +{
> +	const size_t entry_size = sizeof(uint64_t);
> +	const off_t offset = ((uintptr_t)addr / page_size) * entry_size;
> +	uint64_t *entries;
> +	ssize_t nread;
> +	int fd;
> +
> +	entries = malloc(npages * entry_size);
> +	if (!entries)
> +		return -ENOMEM;
> +
> +	fd = open("/proc/self/pagemap", O_RDONLY);
> +	if (fd < 0) {
> +		free(entries);
> +		return -errno;
> +	}
> +
> +	nread = pread(fd, entries, npages * entry_size, offset);
> +	close(fd);
> +	free(entries);
> +
> +	if (nread < 0)
> +		return -errno;
> +	if ((size_t)nread != npages * entry_size)
> +		return -EIO;
> +
> +	return 0;
> +}
> +
>  static void hmm_buffer_free(struct hmm_buffer *buffer)
>  {
>  	if (buffer == NULL)
> @@ -2314,6 +2345,10 @@ TEST_F(hmm, migrate_anon_huge_fault)
>  	ASSERT_EQ(ret, 0);
>  	ASSERT_EQ(buffer->cpages, npages);
> 
> +	/* Exercise pagemap on a PMD device-private entry. */
> +	ret = hmm_read_self_pagemap(buffer->ptr, npages, self->page_size);
> +	ASSERT_EQ(ret, 0);
> +
>  	/* Check what the device read. */
>  	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
>  		ASSERT_EQ(ptr[i], i);
> 
> 


> Therefore, remove the stale migration-only assertion.
> 
> Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Applies on mm-unstable (404fb4f38e8f).
> 
>  fs/proc/task_mmu.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 1e3a15bf46f4e..58938e62154d9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -2129,7 +2129,6 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
>  			flags |= PM_SOFT_DIRTY;
>  		if (pmd_swp_uffd_wp(pmd))
>  			flags |= PM_UFFD_WP;
> -		VM_WARN_ON_ONCE(!pmd_is_migration_entry(pmd));
>  		page = softleaf_to_page(entry);
>  	}
>  

The whole thp_migration_supported() guard is a bit shaky, right?

I guess device-private entries currently imply thp_migration_supported(), but
that thp_migration_supported() check is really questionable and should likely
just go away (else if -> else).

Staring at pte_to_pagemap_entry(), likely we'd also want

if (softleaf_has_pfn(entry))
	page = softleaf_to_page(entry);

to prepare for PMD swap entries.


Anyhow, both are unrelated to this fix (can you send patches to clean them up?)
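
For anyone reproducing this from userspace, a short sketch of how the 64-bit
values read by the quoted hmm_read_self_pagemap() helper could be located and
decoded. The bit positions follow the layout in
Documentation/admin-guide/mm/pagemap.rst; the helper names are hypothetical and
the macros are local userspace copies, not kernel headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace copies of the documented pagemap entry bits. */
#define PM_PFN_MASK	((UINT64_C(1) << 55) - 1)	/* bits 0-54 */
#define PM_SOFT_DIRTY	(UINT64_C(1) << 55)
#define PM_UFFD_WP	(UINT64_C(1) << 57)
#define PM_SWAP		(UINT64_C(1) << 62)
#define PM_PRESENT	(UINT64_C(1) << 63)

/* Byte offset into /proc/<pid>/pagemap: one 64-bit entry per virtual page. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * sizeof(uint64_t);
}

static bool pagemap_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}

/* Set for non-present entries that the kernel reports as swap-like. */
static bool pagemap_swapped(uint64_t entry)
{
	return (entry & PM_SWAP) != 0;
}

static bool pagemap_uffd_wp(uint64_t entry)
{
	return (entry & PM_UFFD_WP) != 0;
}

/* The PFN field is only meaningful when the page is present. */
static uint64_t pagemap_pfn(uint64_t entry)
{
	return pagemap_present(entry) ? (entry & PM_PFN_MASK) : 0;
}
```

The quoted selftest only checks that the read succeeds; decoding along these
lines would additionally show how the PMD device-private range is reported.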

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-05-31 19:01 UTC (permalink / raw)
  To: Pedro Falcato, Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, Miklos Szeredi, patches
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

On 5/31/26 10:54, Pedro Falcato wrote:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
>> This patchset is for VFS.
>>
>> Recently we got a lot of vulnerabilities in splice/vmsplice.
>>
>> Also vmsplice already was source of vulnerabilities in the past:
>> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
>>
>> Also vmsplice is problematic for other reasons. Here is what other
>> developers say:
>>
>> Linus Torvalds in 2023:
>>> So I'd personally be perfectly ok with just making vmsplice() be
>>> exactly the same as write, and turn all of vmsplice() into just "it's
>>> a read() if the pipe is open for read, and a write if it's open for
>>> writing".
>> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
>>
>> Christoph Hellwig in May 2026:
>>> vmsplice is the worst, as it is one of the few remaining places that
>>> can incorrectly dirty file backed pages without telling the file system
>>> and cause the other problems fixed by a FOLL_PIN conversion, but it is
>>> the only one where we do not have any idea yet how we could convert it
>>> to FOLL_PIN due to the unbounded pin time.
>> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
>>
>> See recent discussion here:
>> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> 
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Riiiiiight.... I don't think I have to NAK this, do I?

Jup. I'll just ignore this patch set here.

-- 
Cheers,

David



* Re: [PATCH 2/8] bpf: Recover arena kernel faults with scratch page
From: David Hildenbrand (Arm) @ 2026-05-31 18:58 UTC (permalink / raw)
  To: Tejun Heo, Alexei Starovoitov
  Cc: David Vernet, Andrea Righi, Changwoo Min, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
	Peter Zijlstra, Catalin Marinas, Will Deacon, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Andrew Morton,
	Mike Rapoport, Emil Tsalapatis, sched-ext, bpf, x86,
	linux-arm-kernel, linux-mm, linux-kernel
In-Reply-To: <8cc56c7a4aa29628b7d17d85be7eadb9@kernel.org>

On 5/31/26 19:47, Tejun Heo wrote:
> Hello,
> 
> I posted the check removal [1], and Sashiko's review flagged a
> break-before-make problem with it [2] that I think is real.

Yeah, and as I raised previously, this is very questionable locking design :)

Either everybody works with atomics or nobody.

> 
> The scratch page is a present PAGE_KERNEL mapping, so having
> apply_range_set_cb() overwrite it via set_pte_at() during
> bpf_arena_alloc_pages() is a valid->valid PFN change. I'm not familiar with
> arm at all. David, my understanding is that's a break-before-make violation
> on arm64, and that on any arch the stale TLB entry keeps resolving to the
> shared scratch page until it's flushed, so a later access can hit scratch
> instead of the new page. Is that what you were worried about?
> 
> So instead of just dropping the check, the install should route through an
> invalid entry rather than overwrite in place:
> 
> 	while (!ptep_try_set(pte, mk_pte(page, PAGE_KERNEL))) {
> 		old = ptep_get(pte);
> 		if (pte_none(old))
> 			continue;
> 		if (WARN_ON_ONCE(pte_page(old) != arena->scratch_page))
> 			return -EBUSY;
> 		ptep_get_and_clear(&init_mm, addr, pte);
> 		broke_scratch = true;
> 	}

We have to handle architectures where ptep_try_set() is not implemented (as I
tried with my variant).
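
To make the control flow of the quoted loop easier to follow, here is a
userspace model of it using C11 atomics. It is only a sketch: model_try_set()
stands in for ptep_try_set() as a compare-and-swap from "none", the PTE is a
plain integer, and the TLB flush that real break-before-make needs between the
clear and the new install is not modelled at all. All model_* names are
hypothetical:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PTE_NONE	((uintptr_t)0)

struct model_install_result {
	int err;		/* 0 or -EBUSY */
	bool broke_scratch;	/* true if a scratch mapping was cleared first */
};

/* Stand-in for ptep_try_set(): succeeds only if the slot is empty. */
static bool model_try_set(atomic_uintptr_t *pte, uintptr_t newval)
{
	uintptr_t expected = MODEL_PTE_NONE;

	return atomic_compare_exchange_strong(pte, &expected, newval);
}

static struct model_install_result
model_install(atomic_uintptr_t *pte, uintptr_t newval, uintptr_t scratch)
{
	struct model_install_result res = { 0, false };

	while (!model_try_set(pte, newval)) {
		uintptr_t old = atomic_load(pte);

		if (old == MODEL_PTE_NONE)
			continue;	/* slot was emptied meanwhile, retry */
		if (old != scratch) {
			res.err = -EBUSY;	/* unexpected mapping present */
			return res;
		}
		/* "Break": go through an invalid entry, as in the quoted loop. */
		atomic_exchange(pte, MODEL_PTE_NONE);
		res.broke_scratch = true;
	}
	return res;
}
```

An architecture without a native ptep_try_set() would need some other way to
provide the compare-and-swap step that this model takes for granted.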

-- 
Cheers,

David



* Re: [PATCH 09/12] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT
From: Mike Rapoport @ 2026-05-31 18:51 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, Alexander Graf, Muchun Song, Oscar Salvador,
	David Hildenbrand, Andrew Morton, Jason Miu, kexec, linux-mm,
	linux-kernel
In-Reply-To: <2vxzse7j7ai9.fsf@kernel.org>

On Fri, May 22, 2026 at 05:02:38PM +0200, Pratyush Yadav wrote:
> On Fri, May 22 2026, Pasha Tatashin wrote:
> 
> > On 05-11 18:46, Pratyush Yadav wrote:
> >> On Mon, May 11 2026, Mike Rapoport wrote:
> >> 
> >> > On Wed, Apr 29, 2026 at 03:39:11PM +0200, Pratyush Yadav wrote:
> >> >> From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
> >> >> 
> >> >> In the upcoming commits, the KHO will learn how to discover free blocks
> >> >> of memory by walking the KHO radix tree. It will then mark those regions
> >> >> as scratch to allow memory allocation in case scratch runs low.
> >> >> 
> >> >> To differentiate the extended scratch areas from the main scratch areas,
> >> >> introduce MEMBLOCK_KHO_SCRATCH_EXT. Use it when choosing memblock flags
> >> >> for allocations during scratch-only. Teach should_skip_region() to check
> >> >> for both flags before deciding if the region should be skipped.
> >> >
> >> > Why there's a need to differentiate SCRATCH and SCRATCH_EXT?
> >> > SCRATCH (I still hate the name) means "memory memblock can safely use for
> >
> >  +1000
> >
> > I also strongly dislike this name and mentioned it in another thread
> > earlier today.
> >
> > If we ever decide to s/scratch/something-else/ globally, that should be a
> > separate cleanup effort. However, since we are introducing a brand new flag
> > here, we can discuss a better name for the _ext portion to avoid overloading
> > the "scratch" concept.
> >
> >> > the allocations". Initially this memory comes from the reservations in the
> >> > first kernel, but if the second kernel can find more memory to extend it,
> >> > why that additional memory should be treated differently? 
> >> 
> >> Two reasons:
> >> 
> >> 1. We mark SCRATCH as MIGRATE_CMA. We don't want to do that for
> >>    SCRATCH_EXT since this memory can be used for non-movable
> >>    allocations.
> >> 
> >> 2. Gigantic (1G) huge pages can not be allocated from scratch. They can
> >>    be preserved memory and thus should not be allocated from SCRATCH.
> >>    See patch 12 that does allocations for gigantic huge pages only from
> >>    SCRATCH_EXT.
> >> 
> >> I will add this in the commit message for the next version.
> >> 
> >> Naming is hard, so if you have any better names I'm all ears :-)
> >
> > IMO, this scratch_ext is not "scratch" in the traditional KHO sense at all.
> > The traditional KHO scratch is what is passed from kernel to kernel and is
> > guaranteed to contain zero preserved memory. This new memory is not passed
> > from kernel to kernel and can contain preserved memory at runtime. It's
> > essentially just memory that we identify as currently unpreserved and release
> > early to the system.
> >
> > If we want to keep the naming aligned with the existing codebase for now:
> > MEMBLOCK_KHO_SCRATCH      -> original scratch
> > MEMBLOCK_KHO_UNPRESERVED  -> for the new memory (instead of SCRATCH_EXT)
> 
> UNPRESERVED sounds good to me. I will use that for the next revision
> unless Mike objects.
 
Can we make it shorter? ;-)

UNPRESERVED makes sense, although I'd love to completely remove the KHO_ notion
and make the name reflect how it's used by memblock. I was toying with
PREFERRED instead of SCRATCH, but it didn't feel right enough.
With two of them that surely won't work :)

> > Alternatively, if we do want to tackle the global rename of "scratch" later:
> > MEMBLOCK_KHO_BOOTSTRAP    -> for the original scratch
> > MEMBLOCK_KHO_UNPRESERVED  -> for this new dynamic memory
> 
> Or perhaps BOOTMEM? I suppose either of the two are somewhat better than
> scratch.

Well, if we have BOOTMEM_HVO, we can have BOOTMEM_KHO as well :)

> Anyway, can we please do the SCRATCH rename as a separate series? I

Sure. We can continue bikeshedding in parallel.

> would like this series to not get muddled in the naming discussion. I
> will use UNPRESERVED for the new concept in v2 though.

That might warrant v3 even if everything else is perfect :)
 
> -- 
> Regards,
> Pratyush Yadav

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [RFC PATCH 0/2] mm: swap: allow per-device skipping of zero-filled page check
From: Kairui Song @ 2026-05-31 18:48 UTC (permalink / raw)
  To: Youngjun Park
  Cc: linux-mm, akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua
In-Reply-To: <20260518073455.2495934-1-youngjun.park@lge.com>

On Mon, May 18, 2026 at 04:34:53PM +0800, Youngjun Park wrote:
> Currently, the swap layer checks whether a page is entirely zero-filled
> before writing it out to the swap device. However, some swap backends,
> such as zram and our custom swap device, already perform their own
> same-filled page checking internally. This results in redundant CPU operations 
> checking same page pattern.
> 
> This patchset introduces a new swapon flag, SWAP_FLAG_SKIP_ZERO_CHECK,
> to eliminate this redundancy. I introduce this as a per-device flag
> rather than a global setting because traditional swap devices still
> benefit from the swap layer's zero page check to avoid unnecessary I/O.
> By using this flag, userspace can selectively disable the zero check
> only for specific backends.
> 
> Furthermore, on certain architectures where the zero map is managed via
> a separate bitmap, skipping this check allows bypassing
> the bitmap allocation entirely (saving memory).
> 
> This modification is based on the previous discussion with Nhat Pham [1].
> Additionally, this patchset is built on top of Kairui Song's recent
> patchset regarding swap table and zeromap modifications [2].
> 
> Tested simply with zram on QEMU to verify zero-filled page handling.
> 
> References:
> [1] https://lore.kernel.org/linux-mm/acQvNRLpHwnHt7i+@yjaykim-PowerEdge-T330/
> [2] https://lore.kernel.org/linux-mm/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com/T/#t
> 
> Youngjun Park (2):
>   mm: swap: add SWAP_FLAG_SKIP_ZERO_CHECK to skip zero-filled page check
>   mm: swap: do not allocate zero_bitmap if zero check is skipped
> 
>  include/linux/swap.h |  4 +++-
>  mm/page_io.c         |  7 ++++++-
>  mm/swap.h            | 12 ++++++++++++
>  mm/swapfile.c        | 14 ++++++++++----
>  4 files changed, 31 insertions(+), 6 deletions(-)
> 
> -- 
> 2.34.1
> 

Hi YoungJun,

I think this idea might be useful. Some workloads with very few zero
folios don't benefit from zero folio detection, and for them it's more
of a burden than a gain.

For example, one test result with MySQL:

pswpin      129323368
pswpout     131460192
swpin_zero       4641
swpout_zero    248210

Less than 0.3% of the pages are zero, and almost none of the zero pages
are swapped back in.
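
For reference, the ratio can be recomputed from the four counters above.
A minimal sketch in C, assuming swpout_zero pages are not also counted in
pswpout (if they are, the result barely changes); the helper name is made
up for illustration:

```c
#include <assert.h>

/*
 * Share of zero-filled pages among all pages swapped in one direction,
 * in percent. @zero is the swpout_zero/swpin_zero counter, @nonzero the
 * matching pswpout/pswpin counter (assumed disjoint from @zero).
 */
double zero_share_percent(unsigned long long zero, unsigned long long nonzero)
{
	unsigned long long total = zero + nonzero;

	assert(total > 0);	/* at least one page must have been counted */
	return 100.0 * (double)zero / (double)total;
}
```

With the numbers above, zero_share_percent(248210, 131460192) comes out at
roughly 0.19, and only about 1.9% of the zero pages swapped out (4641 of
248210) came back in.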

I think zero page detection is actually better combined with
compression, e.g. ZRAM and ZSWAP, which always have to touch the
content of the page anyway. But some devices may be able to accept
the folio as it is, and the CPU may not want or need to read the content
again before passing the folio to the device. Skipping the check there
may save some CPU time, I think.

How to expose this zero check as an interface is debatable, though.

So I think the users might not be limited to ZRAM.


^ permalink raw reply

* Re: [PATCH 12/12] mm/hugetlb: make bootmem allocation work with KHO
From: Mike Rapoport @ 2026-05-31 18:40 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, Alexander Graf, Muchun Song, Oscar Salvador,
	David Hildenbrand, Andrew Morton, Jason Miu, kexec, linux-mm,
	linux-kernel
In-Reply-To: <2vxzo6i37bs6.fsf@kernel.org>

On Mon, May 25, 2026 at 05:24:09PM +0200, Pratyush Yadav wrote:
> On Sun, May 17 2026, Mike Rapoport wrote:
> > On Wed, Apr 29, 2026 at 03:39:14PM +0200, Pratyush Yadav wrote:
> >> From: "Pratyush Yadav (Google)" <pratyush@kernel.org>
> 
> So, in summary, I would like to pursue option 1 and try to make it more
> appetizing. But I would like to at least know if you hate the "extended
> scratch" (ignore the name) as a concept or only the code it results in.

Let's retry this one :)

I looked more closely, and it seems that mixing SCRATCH and SCRATCH_EXT
should be a lesser headache than going with option 4.

Tracking the changes in gigantic pages in hugetlb also does not seem like
something we'd like to pursue, especially considering that memory from freed
or demoted gigantic pages could be reserved.

If we add a dedicated memblock_something to allocate gigantic pages, we
can reduce branching in alloc_bootmem() to

	if (cma)
		do_cma()
	else
		do_memblock()

For hugetlb_cma we might want to teach CMA to create pre-allocated areas
and then it could reuse the same memblock API. This seems useful even
regardless of KHO.
 
-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [PATCH 2/8] bpf: Recover arena kernel faults with scratch page
From: Tejun Heo @ 2026-05-31 17:47 UTC (permalink / raw)
  To: Alexei Starovoitov, David Hildenbrand
  Cc: David Vernet, Andrea Righi, Changwoo Min, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
	Peter Zijlstra, Catalin Marinas, Will Deacon, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Andrew Morton,
	Mike Rapoport, Emil Tsalapatis, sched-ext, bpf, x86,
	linux-arm-kernel, linux-mm, linux-kernel
In-Reply-To: <ahnedQ33ZH8ogjbC@slm.duckdns.org>

Hello,

I posted the check removal [1], and Sashiko's review flagged a
break-before-make problem with it [2] that I think is real.

The scratch page is a present PAGE_KERNEL mapping, so having
apply_range_set_cb() overwrite it via set_pte_at() during
bpf_arena_alloc_pages() is a valid->valid PFN change. I'm not familiar with
arm at all. David, my understanding is that's a break-before-make violation
on arm64, and that on any arch the stale TLB entry keeps resolving to the
shared scratch page until it's flushed, so a later access can hit scratch
instead of the new page. Is that what you were worried about?

So instead of just dropping the check, the install should route through an
invalid entry rather than overwrite in place:

	while (!ptep_try_set(pte, mk_pte(page, PAGE_KERNEL))) {
		old = ptep_get(pte);
		if (pte_none(old))
			continue;
		if (WARN_ON_ONCE(pte_page(old) != arena->scratch_page))
			return -EBUSY;
		ptep_get_and_clear(&init_mm, addr, pte);
		broke_scratch = true;
	}

ptep_try_set() only fills a none slot, so the slot goes scratch->none->page
and never valid->valid, and the loop copes with a concurrent fault
re-scratching it. This also closes the set_pte_at()-vs-ptep_try_set() race
I raised earlier, since both sides are now cmpxchg. A broken scratch entry
was live, so the caller flush_tlb_kernel_range()s those pages when
broke_scratch is set, like arena_free_pages() already does after clearing.
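
To illustrate only the state machine (scratch->none->page, never
valid->valid), here is a user-space model built on C11 atomics. It is not
kernel code: the model_* names and the PTE encodings are made up, the
cmpxchg stands in for ptep_try_set(), and the TLB flush is reduced to the
broke_scratch flag the caller would act on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encodings for the model; real PTEs look nothing like this. */
#define MODEL_PTE_NONE    ((uintptr_t)0)
#define MODEL_PTE_SCRATCH ((uintptr_t)1)

typedef _Atomic uintptr_t model_pte_t;

/* Stand-in for ptep_try_set(): fills the slot only if it is currently none. */
bool model_try_set(model_pte_t *pte, uintptr_t val)
{
	uintptr_t expected = MODEL_PTE_NONE;

	return atomic_compare_exchange_strong(pte, &expected, val);
}

/*
 * Install @page, breaking a scratch entry first if one is present.
 * Returns 0 on success, -1 if the slot holds a non-scratch entry.
 * *broke_scratch tells the caller whether a flush would be needed.
 */
int model_install(model_pte_t *pte, uintptr_t page, bool *broke_scratch)
{
	assert(page != MODEL_PTE_NONE && page != MODEL_PTE_SCRATCH);
	*broke_scratch = false;

	while (!model_try_set(pte, page)) {
		uintptr_t old = atomic_load(pte);

		if (old == MODEL_PTE_NONE)
			continue;	/* slot was cleared meanwhile, retry the set */
		if (old != MODEL_PTE_SCRATCH)
			return -1;	/* a real page is mapped, leave it alone */
		/* break step: scratch -> none, mirrors ptep_get_and_clear() */
		atomic_exchange(pte, MODEL_PTE_NONE);
		*broke_scratch = true;
	}
	return 0;
}
```

Starting from a scratch entry, model_install() clears it, sets
broke_scratch and then wins the cmpxchg on the empty slot. A slot that
already holds a real page is left untouched and the call fails.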

[1] https://lore.kernel.org/r/20260531165852.555930-1-tj@kernel.org
[2] https://lore.kernel.org/r/20260531170854.31EA51F00893@smtp.kernel.org

Thanks.
--
tejun


^ permalink raw reply

* Re: [PATCH v2] mm/gup: honour FOLL_PIN in NOMMU __get_user_pages_locked()
From: David Hildenbrand (Arm) @ 2026-05-31 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, linux-mm, linux-kernel, Jason Gunthorpe,
	John Hubbard, Peter Xu
In-Reply-To: <20260526173550.927ba4ba517e9af5a6d2c6bf@linux-foundation.org>

On 5/27/26 02:35, Andrew Morton wrote:
> On Fri, 24 Apr 2026 16:19:32 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:
> 
>> On 4/24/26 15:38, Andrew Morton wrote:
>>>
>>>
>>> Battle of the bots?
>>> 	https://sashiko.dev/#/patchset/2026042303-vendor-outright-b9d2@gregkh
>>
>> It references the
>>
>> 	if (pages && !(flags & FOLL_PIN))
>> 		flags |= FOLL_GET;
>>
>> I'm not sure if there is actual NOMMU code that triggers it. For example,
>> uprobes uses that pattern, but I suspect that that's not a thing on NOMMU.
>>
>>
>> Probably best to just squash:
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index ad9ded39609c..44bd28cf6e00 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1988,6 +1988,9 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
>>         if (!nr_pages)
>>                 return 0;
>>
>> +       if (pages && !(foll_flags & FOLL_PIN))
>> +               foll_flags |= FOLL_GET;
>> +
>>         /*
>>          * The internal caller expects GUP to manage the lock internally and the
>>          * lock must be released when this returns.
> 
> Nothing happened.
> 
> Should we proceed with this as-is?

No, let's squash what I provided. A resend would probably be best. If Greg is
too busy, I can give it a try tomorrow.

-- 
Cheers,

David


^ permalink raw reply

* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Mike Rapoport @ 2026-05-31 17:10 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <20260528004204.1484584-1-jloeser@linux.microsoft.com>

Hi Jork,

Only had time to skim through the patches.
I have a couple of high level questions for now.

On Wed, May 27, 2026 at 05:41:42PM -0700, Jork Loeser wrote:
> When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV
> root partition driver deposits pages to the hypervisor and creates
> partitions for guest VMs. Prior patches enabled kexec for L1VH, but
> only when no partitions had been created and no memory had been donated.
> 
> This series lifts that limitation. It uses KHO (Kexec Handover) to:
> 
>  - Track all pages deposited to the hypervisor in a KHO radix tree
>    and preserve them across kexec so the new kernel knows which pages
>    are owned by the hypervisor.
> 
>  - Freeze running partitions before kexec, record their IDs in the
>    KHO FDT, and vacuum (tear down + reclaim memory) stale partitions
>    after kexec.
> 
>  - In case of a crash, exclude hypervisor-owned pages from crash
>    dump collection by passing the radix tree root PA via Hyper-V
>    crash MSR P2 to the crash kernel.
> 
> Dependency on Pratyush's KHO series
> ===================================
> 
> Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series
> "kho: make boot time huge page allocation work nicely with KHO" [1],
> which is still under discussion. This series uses functionality from
> those patches -- specifically the meta-data page enumeration via table
> callbacks and the restructured radix tree API. It also extends the
> KHO radix tree with:
> 
>  - A freeze mechanism to lock the tree before serializing for kexec
>    (patch 13).

There was a lot of effort to make KHO stateless and drop the requirement
for finalization/freeze.

Why is it necessary to add a freeze mechanism to kho_radix_tree?
If it's a hard requirement of mshv, maybe the freeze part should be handled
there?
 
>  - A crash-kernel-safe variant that memremaps radix nodes for use
>    outside the direct map (patch 14).
> 
> Patch overview
> ==============
> 
> Patches 1-12:  KHO radix tree and memblock changes (from [1])
> Patch 13:      Radix tree freeze and del_key() error reporting

del_key() error reporting sounds like something we'd want to avoid.
del_key() is called on the "freeing" path and during error handling, so it
would be hard, if possible at all, to deal with errors from del_key().

> Patch 14:      Crash-kernel-safe radix tree presence check
> Patch 15:      Page tracker using KHO radix tree for deposited pages
> Patch 16:      Debugfs interface for page tracker
> Patches 17-18: Crash MSR reshuffling + crash dump page exclusion
> Patch 19:      Export kexec_in_progress for modules

Isn't there another way to differentiate kexec reboot?

> Patch 20:      Freeze and vacuum partitions across kexec
> 
> Feedback
> ========
> 
> This is an RFC. I am looking for feedback on the overall approach as
> well as the KHO changes (patches 13-14).
> 
> [1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush@kernel.org/
> 
> Based-on: linux-next/master (next-20260527)

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* tracepoints expose s_dev of kernel-internal superblocks -- no generic resolution interface
From: Yiyang Chen @ 2026-05-31 17:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-trace-devel
  Cc: Christian Brauner, Steven Rostedt, Andrew Morton, Matthew Wilcox

Hi all,

While tracing page cache activity via mm_filemap_add_to_page_cache, I
noticed s_dev values that do not appear in /proc/*/mountinfo:

  mm_filemap_add_to_page_cache: dev 0:18 ino dea89 pfn=0x13ba00 ofs=0 order=9

Using a kernel module to enumerate all active superblocks, I confirmed
this is the hugetlbfs internal mount created by hugetlbfs_init() ->
kern_mount().  The actual s_dev value is dynamically allocated via
get_anon_bdev() and varies across systems (0:18 on my test machine).
Because kern_mount() attaches the mount to MNT_NS_INTERNAL, it is
invisible to all mount namespaces.

Each internal filesystem requires its own trick to discover the
s_dev -> fs_type mapping:

  - shmem:      create a memfd, call fstatfs() -> TMPFS_MAGIC
  - bdev:       open a block device, call fstatfs()
  - hugetlbfs:  memfd_create(MFD_HUGETLB) creates a file on the internal
                mount itself, so fstat() gives s_dev and fstatfs() returns
                HUGETLBFS_MAGIC

None of these are a general, authoritative interface -- each requires
filesystem-specific knowledge of which object to create and which
syscall to call.

The question is: should there be a single stable interface that exposes
the s_dev -> fs_type mapping for all active superblocks, including
internal ones? Here are some options:

  - Add fs_type to the affected tracepoints.
  - Provide a generic interface (BPF iterator, /proc, /sys) that
    exposes s_dev + fs_type for all active superblocks including
    internal ones.


Steps to reproduce:

s_dev is dynamically allocated via get_anon_bdev(), so the values vary
across systems.

--- test_hugetlb_sdev.c ---
#include <sys/mman.h>
#include <string.h>
int main(void) {
    void *p = mmap(NULL, 2<<20, PROT_READ|PROT_WRITE,
                   MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) return 1;
    memset(p, 0, 2<<20);
    munmap(p, 2<<20);
    return 0;
}

  gcc -o test_hugetlb_sdev test_hugetlb_sdev.c
  # precondition:
  cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages   # must be > 0

--- trace script (run with sudo) ---
  TP=/sys/kernel/tracing
  DEVS=$(awk '{print $3}' /proc/self/mountinfo | sort -u)

  echo 1 > $TP/events/filemap/mm_filemap_add_to_page_cache/enable
  echo 1 > $TP/events/hugetlbfs/hugetlbfs_alloc_inode/enable
  echo > $TP/trace

  ./test_hugetlb_sdev
  sleep 1

  echo "=== s_dev from tracepoint, NOT in mountinfo ==="
  # filemap uses "dev X:Y", hugetlbfs uses "dev X,Y"
  grep -v '^#' $TP/trace | grep -oP 'dev[= ]\K\d+[:,]\d+' | tr ',' ':' | sort -u \
      | while read d; do echo "$DEVS" | grep -qx "$d" || echo "  $d"; done

  echo 0 > $TP/events/filemap/enable
  echo 0 > $TP/events/hugetlbfs/enable
  echo > $TP/trace


You will see the hugetlbfs s_dev, which is not present in any mountinfo.


Thanks.


^ permalink raw reply

