* [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
@ 2026-03-27 1:42 Zi Yan
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
` (11 more replies)
0 siblings, 12 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
Hi all,
This patchset removes the READ_ONLY_THP_FOR_FS Kconfig option and enables
creating read-only THPs by default for FSes with large folio support (the
supported orders need to include PMD_ORDER).
The changes are:
1. collapse_file() in mm/khugepaged.c, instead of checking
CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
of the file's struct address_space is at least PMD_ORDER.
2. file_thp_enabled() also checks mapping_max_folio_order() instead.
3. truncate_inode_partial_folio() calls folio_split() directly instead
of the removed try_folio_split_to_order(), since large folios can
only show up on a FS with large folio support.
4. nr_thps is removed from struct address_space, since it is no longer
needed: read-only THPs no longer have to be dropped from a FS without
large folio support when the fd becomes writable. The related
filemap_nr_thps*() helpers are removed too.
5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
6. Comments in various places are updated.
Changelog
===
From RFC[1]:
1. instead of removing the READ_ONLY_THP_FOR_FS functionality entirely,
turn it on by default for all FSes with large folio support whose
supported orders include PMD_ORDER.
Suggestions and comments are welcome.
Link: https://lore.kernel.org/all/20260323190644.1714379-1-ziy@nvidia.com/ [1]
Zi Yan (10):
mm: remove READ_ONLY_THP_FOR_FS Kconfig option
mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
mm: fs: remove filemap_nr_thps*() functions and their users
fs: remove nr_thps from struct address_space
mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
mm/truncate: use folio_split() in truncate_inode_partial_folio()
fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in
guard-regions
fs/btrfs/defrag.c | 3 --
fs/inode.c | 3 --
fs/open.c | 27 ----------------
include/linux/fs.h | 5 ---
include/linux/huge_mm.h | 25 ++-------------
include/linux/pagemap.h | 29 -----------------
mm/Kconfig | 11 -------
mm/filemap.c | 1 -
mm/huge_memory.c | 29 ++---------------
mm/khugepaged.c | 36 +++++-----------------
mm/truncate.c | 8 ++---
tools/testing/selftests/mm/guard-regions.c | 9 +++---
tools/testing/selftests/mm/khugepaged.c | 4 +--
13 files changed, 23 insertions(+), 167 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 76+ messages in thread
* [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 11:45 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:33 ` David Hildenbrand (Arm)
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
` (10 subsequent siblings)
11 siblings, 2 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
With the Kconfig option removed, no one can enable READ_ONLY_THP_FOR_FS,
so the related code can be removed in the coming commits.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/Kconfig | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index bd283958d675..408fc7b82233 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -937,17 +937,6 @@ config THP_SWAP
For selection by architectures with reasonable THP sizes.
-config READ_ONLY_THP_FOR_FS
- bool "Read-only THP for filesystems (EXPERIMENTAL)"
- depends on TRANSPARENT_HUGEPAGE
-
- help
- Allow khugepaged to put read-only file-backed pages in THP.
-
- This is marked experimental because it is a new feature. Write
- support of file THPs will be developed in the next few release
- cycles.
-
config NO_PAGE_MAPCOUNT
bool "No per-page mapcount (EXPERIMENTAL)"
help
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 7:29 ` Lance Yang
` (3 more replies)
2026-03-27 1:42 ` [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
` (9 subsequent siblings)
11 siblings, 4 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
collapse_file() requires a FS supporting large folios of at least
PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that condition.
shmem with the huge option turned on also sets a large folio order on the
mapping, so the check applies to shmem as well.
While at it, replace the VM_BUG_ON()s with failure return values.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/khugepaged.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d06d84219e1b..45b12ffb1550 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
int nr_none = 0;
bool is_shmem = shmem_file(file);
- VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
- VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
+ /* "huge" shmem sets mapping folio order and passes the check below */
+ if (mapping_max_folio_order(mapping) < PMD_ORDER)
+ return SCAN_FAIL;
+ if (start & (HPAGE_PMD_NR - 1))
+ return SCAN_ADDRESS_RANGE;
result = alloc_charge_folio(&new_folio, mm, cc);
if (result != SCAN_SUCCEED)
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 9:32 ` Lance Yang
2026-03-27 12:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
` (8 subsequent siblings)
11 siblings, 2 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
These helpers were used by READ_ONLY_THP_FOR_FS to handle writes to FSes
without large folio support, so that read-only THPs created on such FSes
were dropped before the fd became writable. Now read-only PMD THPs only
appear on a FS with large folio support whose supported orders include
PMD_ORDER.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
fs/open.c | 27 ---------------------------
include/linux/pagemap.h | 29 -----------------------------
mm/filemap.c | 1 -
mm/huge_memory.c | 1 -
mm/khugepaged.c | 29 ++---------------------------
5 files changed, 2 insertions(+), 85 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 91f1139591ab..cef382d9d8b8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -970,33 +970,6 @@ static int do_dentry_open(struct file *f,
if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
return -EINVAL;
- /*
- * XXX: Huge page cache doesn't support writing yet. Drop all page
- * cache for this file before processing writes.
- */
- if (f->f_mode & FMODE_WRITE) {
- /*
- * Depends on full fence from get_write_access() to synchronize
- * against collapse_file() regarding i_writecount and nr_thps
- * updates. Ensures subsequent insertion of THPs into the page
- * cache will fail.
- */
- if (filemap_nr_thps(inode->i_mapping)) {
- struct address_space *mapping = inode->i_mapping;
-
- filemap_invalidate_lock(inode->i_mapping);
- /*
- * unmap_mapping_range just need to be called once
- * here, because the private pages is not need to be
- * unmapped mapping (e.g. data segment of dynamic
- * shared libraries here).
- */
- unmap_mapping_range(mapping, 0, 0, 0);
- truncate_inode_pages(mapping, 0);
- filemap_invalidate_unlock(inode->i_mapping);
- }
- }
-
return 0;
cleanup_all:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f886..dad3f8846cdc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -530,35 +530,6 @@ static inline size_t mapping_max_folio_size(const struct address_space *mapping)
return PAGE_SIZE << mapping_max_folio_order(mapping);
}
-static inline int filemap_nr_thps(const struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- return atomic_read(&mapping->nr_thps);
-#else
- return 0;
-#endif
-}
-
-static inline void filemap_nr_thps_inc(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- if (!mapping_large_folio_support(mapping))
- atomic_inc(&mapping->nr_thps);
-#else
- WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
-#endif
-}
-
-static inline void filemap_nr_thps_dec(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- if (!mapping_large_folio_support(mapping))
- atomic_dec(&mapping->nr_thps);
-#else
- WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
-#endif
-}
-
struct address_space *folio_mapping(const struct folio *folio);
/**
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b933a1da9bd..4248e7cdecf3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -189,7 +189,6 @@ static void filemap_unaccount_folio(struct address_space *mapping,
lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
} else if (folio_test_pmd_mappable(folio)) {
lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
- filemap_nr_thps_dec(mapping);
}
if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
mod_node_page_state(folio_pgdat(folio),
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b2a6060b3c20..c7873dbdc470 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3833,7 +3833,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
} else {
lruvec_stat_mod_folio(folio,
NR_FILE_THPS, -nr);
- filemap_nr_thps_dec(mapping);
}
}
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 45b12ffb1550..8004ab8de6d2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2104,20 +2104,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
goto xa_unlocked;
}
- if (!is_shmem) {
- filemap_nr_thps_inc(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure i_writecount is up to date and the update to nr_thps
- * is visible. Ensures the page cache will be truncated if the
- * file is opened writable.
- */
- smp_mb();
- if (inode_is_open_for_write(mapping->host)) {
- result = SCAN_FAIL;
- filemap_nr_thps_dec(mapping);
- }
- }
+ if (!is_shmem && inode_is_open_for_write(mapping->host))
+ result = SCAN_FAIL;
xa_locked:
xas_unlock_irq(&xas);
@@ -2296,19 +2284,6 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
folio_putback_lru(folio);
folio_put(folio);
}
- /*
- * Undo the updates of filemap_nr_thps_inc for non-SHMEM
- * file only. This undo is not needed unless failure is
- * due to SCAN_COPY_MC.
- */
- if (!is_shmem && result == SCAN_COPY_MC) {
- filemap_nr_thps_dec(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure the update to nr_thps is visible.
- */
- smp_mb();
- }
new_folio->mapping = NULL;
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 04/10] fs: remove nr_thps from struct address_space
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (2 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 12:29 ` Lorenzo Stoakes (Oracle)
` (2 more replies)
2026-03-27 1:42 ` [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
` (7 subsequent siblings)
11 siblings, 3 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
Since the filemap_nr_thps*() helpers are removed, the related field,
address_space->nr_thps, is no longer needed. Remove it.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
fs/inode.c | 3 ---
include/linux/fs.h | 5 -----
2 files changed, 8 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index cc12b68e021b..16ab0a345419 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -280,9 +280,6 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
mapping->flags = 0;
mapping->wb_err = 0;
atomic_set(&mapping->i_mmap_writable, 0);
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- atomic_set(&mapping->nr_thps, 0);
-#endif
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
mapping->i_private_data = NULL;
mapping->writeback_index = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0bdccfa70b44..35875696fb4c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -455,7 +455,6 @@ extern const struct address_space_operations empty_aops;
* memory mappings.
* @gfp_mask: Memory allocation flags to use for allocating pages.
* @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
- * @nr_thps: Number of THPs in the pagecache (non-shmem only).
* @i_mmap: Tree of private and shared mappings.
* @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
* @nrpages: Number of page entries, protected by the i_pages lock.
@@ -473,10 +472,6 @@ struct address_space {
struct rw_semaphore invalidate_lock;
gfp_t gfp_mask;
atomic_t i_mmap_writable;
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- /* number of thp, only for non-shmem files */
- atomic_t nr_thps;
-#endif
struct rb_root_cached i_mmap;
unsigned long nrpages;
pgoff_t writeback_index;
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (3 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 12:42 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
` (6 subsequent siblings)
11 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
Replace it with a check on the max folio order of the file's address space
mapping, making sure PMD_ORDER is supported.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c7873dbdc470..1da1467328a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
{
struct inode *inode;
- if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
- return false;
-
if (!vma->vm_file)
return false;
@@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
if (IS_ANON_FILE(inode))
return false;
+ if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
+ return false;
+
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (4 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 12:50 ` Lorenzo Stoakes (Oracle)
2026-03-30 9:15 ` Lance Yang
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
` (5 subsequent siblings)
11 siblings, 2 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
Without READ_ONLY_THP_FOR_FS, large file-backed folios can no longer be
created on a FS without large folio support. The check is no longer
needed.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 22 ----------------------
1 file changed, 22 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1da1467328a3..30eddcbf86f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3732,28 +3732,6 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
/* order-1 is not supported for anonymous THP. */
if (new_order == 1)
return -EINVAL;
- } else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
- if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
- !mapping_large_folio_support(folio->mapping)) {
- /*
- * We can always split a folio down to a single page
- * (new_order == 0) uniformly.
- *
- * For any other scenario
- * a) uniform split targeting a large folio
- * (new_order > 0)
- * b) any non-uniform split
- * we must confirm that the file system supports large
- * folios.
- *
- * Note that we might still have THPs in such
- * mappings, which is created from khugepaged when
- * CONFIG_READ_ONLY_THP_FOR_FS is enabled. But in that
- * case, the mapping does not actually support large
- * folios properly.
- */
- return -EINVAL;
- }
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (5 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 3:33 ` Lance Yang
` (3 more replies)
2026-03-27 1:42 ` [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
` (4 subsequent siblings)
11 siblings, 4 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
After READ_ONLY_THP_FOR_FS is removed, a FS either supports large folios
or it does not. folio_split() can be used on a FS with large folio support
without worrying about encountering a THP on a FS without large folio
support.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 25 ++-----------------------
mm/truncate.c | 8 ++++----
2 files changed, 6 insertions(+), 27 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1258fa37e85b..171de8138e98 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -389,27 +389,6 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
return split_huge_page_to_list_to_order(page, NULL, new_order);
}
-/**
- * try_folio_split_to_order() - try to split a @folio at @page to @new_order
- * using non uniform split.
- * @folio: folio to be split
- * @page: split to @new_order at the given page
- * @new_order: the target split order
- *
- * Try to split a @folio at @page using non uniform split to @new_order, if
- * non uniform split is not supported, fall back to uniform split. After-split
- * folios are put back to LRU list. Use min_order_for_split() to get the lower
- * bound of @new_order.
- *
- * Return: 0 - split is successful, otherwise split failed.
- */
-static inline int try_folio_split_to_order(struct folio *folio,
- struct page *page, unsigned int new_order)
-{
- if (folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM))
- return split_huge_page_to_order(&folio->page, new_order);
- return folio_split(folio, new_order, page, NULL);
-}
static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list_to_order(page, NULL, 0);
@@ -641,8 +620,8 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
return -EINVAL;
}
-static inline int try_folio_split_to_order(struct folio *folio,
- struct page *page, unsigned int new_order)
+static inline int folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
{
VM_WARN_ON_ONCE_FOLIO(1, folio);
return -EINVAL;
diff --git a/mm/truncate.c b/mm/truncate.c
index 2931d66c16d0..6973b05ec4b8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -177,7 +177,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
return 0;
}
-static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
+static int folio_split_or_unmap(struct folio *folio, struct page *split_at,
unsigned long min_order)
{
enum ttu_flags ttu_flags =
@@ -186,7 +186,7 @@ static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
TTU_IGNORE_MLOCK;
int ret;
- ret = try_folio_split_to_order(folio, split_at, min_order);
+ ret = folio_split(folio, min_order, split_at, NULL);
/*
* If the split fails, unmap the folio, so it will be refaulted
@@ -252,7 +252,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
min_order = mapping_min_folio_order(folio->mapping);
split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
- if (!try_folio_split_or_unmap(folio, split_at, min_order)) {
+ if (!folio_split_or_unmap(folio, split_at, min_order)) {
/*
* try to split at offset + length to make sure folios within
* the range can be dropped, especially to avoid memory waste
@@ -279,7 +279,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
/* make sure folio2 is large and does not change its mapping */
if (folio_test_large(folio2) &&
folio2->mapping == folio->mapping)
- try_folio_split_or_unmap(folio2, split_at2, min_order);
+ folio_split_or_unmap(folio2, split_at2, min_order);
folio_unlock(folio2);
out:
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (6 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
` (3 subsequent siblings)
11 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
READ_ONLY_THP_FOR_FS is no longer present, so remove the comment
referring to it.
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/defrag.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 7e2db5d3a4d4..a8d49d9ca981 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -860,9 +860,6 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
return folio;
/*
- * Since we can defragment files opened read-only, we can encounter
- * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
- *
* The IO for such large folios is not fully tested, thus return
* an error to reject such folios unless it's an experimental build.
*
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (7 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions Zi Yan
` (2 subsequent siblings)
11 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
Change the requirement to a file system with large folio support whose
supported orders include PMD_ORDER.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
tools/testing/selftests/mm/khugepaged.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 3fe7ef04ac62..bdcdd31beb1e 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -1086,8 +1086,8 @@ static void usage(void)
fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
- fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
- fprintf(stderr, "\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
+ fprintf(stderr, "\n\t\"file,all\" mem_type requires a file system\n");
+ fprintf(stderr, "\twith large folio support (order >= PMD order)\n");
fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
fprintf(stderr, "\tmounted with huge=advise option for khugepaged tests to work\n");
fprintf(stderr, "\n\tSupported Options:\n");
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (8 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
@ 2026-03-27 1:42 ` Zi Yan
2026-03-27 13:06 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:46 ` [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig David Hildenbrand (Arm)
2026-04-05 17:38 ` Nico Pache
11 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 1:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
Any file system with large folio support whose supported orders include
PMD_ORDER can be used.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
tools/testing/selftests/mm/guard-regions.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 48e8b1539be3..13e77e48b6ef 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -2205,7 +2205,7 @@ TEST_F(guard_regions, collapse)
/*
* We must close and re-open local-file backed as read-only for
- * CONFIG_READ_ONLY_THP_FOR_FS to work.
+ * MADV_COLLAPSE to work.
*/
if (variant->backing == LOCAL_FILE_BACKED) {
ASSERT_EQ(close(self->fd), 0);
@@ -2237,9 +2237,10 @@ TEST_F(guard_regions, collapse)
/*
* Now collapse the entire region. This should fail in all cases.
*
- * The madvise() call will also fail if CONFIG_READ_ONLY_THP_FOR_FS is
- * not set for the local file case, but we can't differentiate whether
- * this occurred or if the collapse was rightly rejected.
+ * The madvise() call will also fail if the file system does not support
+	 * large folios or the supported orders do not include PMD_ORDER for the
+ * local file case, but we can't differentiate whether this occurred or
+ * if the collapse was rightly rejected.
*/
EXPECT_NE(madvise(ptr, size, MADV_COLLAPSE), 0);
--
2.43.0
^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
@ 2026-03-27 3:33 ` Lance Yang
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
` (2 subsequent siblings)
3 siblings, 0 replies; 76+ messages in thread
From: Lance Yang @ 2026-03-27 3:33 UTC (permalink / raw)
To: Zi Yan
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Baolin Wang, Matthew Wilcox (Oracle), Liam R. Howlett, Nico Pache,
Song Liu, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 2026/3/27 09:42, Zi Yan wrote:
> After READ_ONLY_THP_FOR_FS is removed, FS either supports large folio or
> not. folio_split() can be used on a FS with large folio support without
> worrying about getting a THP on a FS without large folio support.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/huge_mm.h | 25 ++-----------------------
> mm/truncate.c | 8 ++++----
> 2 files changed, 6 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1258fa37e85b..171de8138e98 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -389,27 +389,6 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
> return split_huge_page_to_list_to_order(page, NULL, new_order);
> }
>
> -/**
> - * try_folio_split_to_order() - try to split a @folio at @page to @new_order
> - * using non uniform split.
> - * @folio: folio to be split
> - * @page: split to @new_order at the given page
> - * @new_order: the target split order
> - *
> - * Try to split a @folio at @page using non uniform split to @new_order, if
> - * non uniform split is not supported, fall back to uniform split. After-split
> - * folios are put back to LRU list. Use min_order_for_split() to get the lower
> - * bound of @new_order.
> - *
> - * Return: 0 - split is successful, otherwise split failed.
> - */
> -static inline int try_folio_split_to_order(struct folio *folio,
> - struct page *page, unsigned int new_order)
> -{
> - if (folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM))
> - return split_huge_page_to_order(&folio->page, new_order);
> - return folio_split(folio, new_order, page, NULL);
> -}
> static inline int split_huge_page(struct page *page)
> {
> return split_huge_page_to_list_to_order(page, NULL, 0);
> @@ -641,8 +620,8 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
> return -EINVAL;
> }
>
> -static inline int try_folio_split_to_order(struct folio *folio,
> - struct page *page, unsigned int new_order)
> +static inline int folio_split(struct folio *folio, unsigned int new_order,
> + struct page *page, struct list_head *list);
Ouch, that ';' wasn't supposed to be there, right?
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
@ 2026-03-27 7:29 ` Lance Yang
2026-03-27 7:35 ` Lance Yang
2026-03-27 9:44 ` Baolin Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 76+ messages in thread
From: Lance Yang @ 2026-03-27 7:29 UTC (permalink / raw)
To: ziy
Cc: willy, songliubraving, clm, dsterba, viro, brauner, jack, akpm,
david, ljs, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, shuah,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote:
>collapse_file() requires FSes supporting large folio with at least
>PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>huge option turned on also sets large folio order on mapping, so the check
>also applies to shmem.
>
>While at it, replace VM_BUG_ON with returning failure values.
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
>---
> mm/khugepaged.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index d06d84219e1b..45b12ffb1550 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> int nr_none = 0;
> bool is_shmem = shmem_file(file);
>
>- VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>- VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>+ /* "huge" shmem sets mapping folio order and passes the check below */
>+ if (mapping_max_folio_order(mapping) < PMD_ORDER)
>+ return SCAN_FAIL;
Yep, for shmem inodes, if the mount has huge= enabled, inode creation
marks the mapping are large-folio capable:
/* Don't consider 'deny' for emergencies and 'force' for testing */
if (sbinfo->huge)
mapping_set_large_folios(inode->i_mapping);
LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 7:29 ` Lance Yang
@ 2026-03-27 7:35 ` Lance Yang
0 siblings, 0 replies; 76+ messages in thread
From: Lance Yang @ 2026-03-27 7:35 UTC (permalink / raw)
To: ziy
Cc: willy, songliubraving, clm, dsterba, viro, brauner, jack, akpm,
david, ljs, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko, shuah,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 2026/3/27 15:29, Lance Yang wrote:
>
> On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote:
>> collapse_file() requires FSes supporting large folio with at least
>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>> huge option turned on also sets large folio order on mapping, so the check
>> also applies to shmem.
>>
>> While at it, replace VM_BUG_ON with returning failure values.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> mm/khugepaged.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d06d84219e1b..45b12ffb1550 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>> int nr_none = 0;
>> bool is_shmem = shmem_file(file);
>>
>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>> + /* "huge" shmem sets mapping folio order and passes the check below */
>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>> + return SCAN_FAIL;
>
> Yep, for shmem inodes, if the mount has huge= enabled, inode creation
> marks the mapping are large-folio capable:
Oops, s/are/as/
>
> /* Don't consider 'deny' for emergencies and 'force' for testing */
> if (sbinfo->huge)
> mapping_set_large_folios(inode->i_mapping);
>
> LGTM!
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 1:42 ` [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
@ 2026-03-27 9:32 ` Lance Yang
2026-03-27 12:23 ` Lorenzo Stoakes (Oracle)
1 sibling, 0 replies; 76+ messages in thread
From: Lance Yang @ 2026-03-27 9:32 UTC (permalink / raw)
To: Zi Yan
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Baolin Wang, Song Liu, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest,
Matthew Wilcox (Oracle)
On 2026/3/27 09:42, Zi Yan wrote:
> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
> large folio support, so that read-only THPs created in these FSes are not
> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
> THPs only appear in a FS with large folio support and the supported orders
> include PMD_ORDRE.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
LGTM, feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
2026-03-27 7:29 ` Lance Yang
@ 2026-03-27 9:44 ` Baolin Wang
2026-03-27 12:02 ` Lorenzo Stoakes (Oracle)
2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:37 ` David Hildenbrand (Arm)
3 siblings, 1 reply; 76+ messages in thread
From: Baolin Wang @ 2026-03-27 9:44 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 9:42 AM, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> huge option turned on also sets large folio order on mapping, so the check
> also applies to shmem.
>
> While at it, replace VM_BUG_ON with returning failure values.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/khugepaged.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d06d84219e1b..45b12ffb1550 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> int nr_none = 0;
> bool is_shmem = shmem_file(file);
>
> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> + /* "huge" shmem sets mapping folio order and passes the check below */
> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> + return SCAN_FAIL;
This is not true for anonymous shmem, since its large order allocation
logic is similar to anonymous memory. That means it will not call
mapping_set_large_folios() for anonymous shmem.
So I think the check should be:
if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
return SCAN_FAIL;
* Re: [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
@ 2026-03-27 11:45 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:33 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 11:45 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:46PM -0400, Zi Yan wrote:
> No one will be able to use it, so the related code can be removed in the
> coming commits.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
Seems a reasonable ordering, so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> mm/Kconfig | 11 -----------
> 1 file changed, 11 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index bd283958d675..408fc7b82233 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -937,17 +937,6 @@ config THP_SWAP
>
> For selection by architectures with reasonable THP sizes.
>
> -config READ_ONLY_THP_FOR_FS
> - bool "Read-only THP for filesystems (EXPERIMENTAL)"
> - depends on TRANSPARENT_HUGEPAGE
> -
> - help
> - Allow khugepaged to put read-only file-backed pages in THP.
> -
> - This is marked experimental because it is a new feature. Write
> - support of file THPs will be developed in the next few release
> - cycles.
> -
> config NO_PAGE_MAPCOUNT
> bool "No per-page mapcount (EXPERIMENTAL)"
> help
> --
> 2.43.0
>
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 9:44 ` Baolin Wang
@ 2026-03-27 12:02 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:45 ` Baolin Wang
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 12:02 UTC (permalink / raw)
To: Baolin Wang
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, David Hildenbrand, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>
>
> On 3/27/26 9:42 AM, Zi Yan wrote:
> > collapse_file() requires FSes supporting large folio with at least
> > PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> > huge option turned on also sets large folio order on mapping, so the check
> > also applies to shmem.
> >
> > While at it, replace VM_BUG_ON with returning failure values.
> >
> > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > ---
> > mm/khugepaged.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d06d84219e1b..45b12ffb1550 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > int nr_none = 0;
> > bool is_shmem = shmem_file(file);
> > - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> > + /* "huge" shmem sets mapping folio order and passes the check below */
> > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > + return SCAN_FAIL;
>
> This is not true for anonymous shmem, since its large order allocation logic
> is similar to anonymous memory. That means it will not call
> mapping_set_large_folios() for anonymous shmem.
>
> So I think the check should be:
>
> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> return SCAN_FAIL;
Hmm but in shmem_init() we have:
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
else
shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
/*
* Default to setting PMD-sized THP to inherit the global setting and
* disable all other multi-size THPs.
*/
if (!shmem_orders_configured)
huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
#endif
And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
shmem_enabled_store() updates that if necessary.
So we're still fine right?
__shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
__shmem_get_inode() which has:
if (sbinfo->huge)
mapping_set_large_folios(inode->i_mapping);
Shared for both anon shmem and tmpfs-style shmem.
So I think it's fine as-is.
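(For anyone following along: the gate under discussion reduces to a simple
order comparison. Below is a rough userspace model of it; the names mirror
the kernel's, but the constants are illustrative assumptions — PMD_ORDER is
9 on x86-64 with 4 KiB pages, and the max pagecache order is assumed equal
to it here — not the kernel's actual definitions.)

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for kernel constants (x86-64: a PMD covers
 * 2 MiB = 512 pages, i.e. order 9). Assumed values, not kernel headers. */
#define PMD_ORDER           9
#define MAX_PAGECACHE_ORDER 9  /* assumed max order once large folios are on */

struct mapping_model {
	unsigned int max_folio_order; /* 0: only order-0 (single-page) folios */
};

/* Models mapping_set_large_folios(): mark the mapping large-folio capable,
 * as __shmem_get_inode() does when sbinfo->huge is set. */
static void set_large_folios(struct mapping_model *m)
{
	m->max_folio_order = MAX_PAGECACHE_ORDER;
}

/* Models the collapse_file() gate from the patch:
 * the mapping must support at least PMD_ORDER folios. */
static bool collapse_allowed(const struct mapping_model *m)
{
	return m->max_folio_order >= PMD_ORDER;
}
```

A mapping set up the huge-shmem way passes the gate; a default order-0-only
mapping does not, which is the behavior the check is meant to capture.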
Cheers, Lorenzo
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
2026-03-27 7:29 ` Lance Yang
2026-03-27 9:44 ` Baolin Wang
@ 2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:15 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:46 ` Zi Yan
2026-03-27 13:37 ` David Hildenbrand (Arm)
3 siblings, 2 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 12:07 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> huge option turned on also sets large folio order on mapping, so the check
> also applies to shmem.
>
> While at it, replace VM_BUG_ON with returning failure values.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/khugepaged.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d06d84219e1b..45b12ffb1550 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> int nr_none = 0;
> bool is_shmem = shmem_file(file);
>
> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> + /* "huge" shmem sets mapping folio order and passes the check below */
I think this isn't quite clear and could be improved to e.g.:
/*
* Either anon shmem supports huge pages as set by shmem_enabled sysfs,
* or a shmem file system mounted with the "huge" option.
*/
> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> + return SCAN_FAIL;
As per rest of thread, this looks correct.
> + if (start & (HPAGE_PMD_NR - 1))
> + return SCAN_ADDRESS_RANGE;
Hmm, we're kinda making this - presumably buggy situation - into a valid input
that just fails the scan.
Maybe just make it a VM_WARN_ON_ONCE()? Or if we want to avoid propagating the
bug that'd cause it any further:
if (start & (HPAGE_PMD_NR - 1)) {
VM_WARN_ON_ONCE(true);
return SCAN_ADDRESS_RANGE;
}
Or similar.
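(As an aside, the alignment test itself is plain power-of-two arithmetic on a
page index; a userspace sketch, with the HPAGE_PMD_NR value assumed for
x86-64 with 4 KiB pages rather than taken from kernel headers:)

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512UL /* pages per PMD-sized THP on x86-64 (assumed) */

/* start is a page index; collapsing to a PMD-sized folio requires it to be
 * aligned to HPAGE_PMD_NR. Because HPAGE_PMD_NR is a power of two, the low
 * bits of start are exactly the offset within a PMD-sized block, so the
 * mask picks out any misalignment. */
static bool pmd_aligned(unsigned long start)
{
	return (start & (HPAGE_PMD_NR - 1)) == 0;
}
```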
>
> result = alloc_charge_folio(&new_folio, mm, cc);
> if (result != SCAN_SUCCEED)
> --
> 2.43.0
>
Cheers, Lorenzo
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 1:42 ` [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
2026-03-27 9:32 ` Lance Yang
@ 2026-03-27 12:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:58 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 12:23 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
> large folio support, so that read-only THPs created in these FSes are not
> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
> THPs only appear in a FS with large folio support and the supported orders
> include PMD_ORDRE.
Typo: PMD_ORDRE -> PMD_ORDER
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
This looks obviously-correct since this stuff wouldn't have been invoked for
large folio file systems before + they already had to handle it separately, and
this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
suggests you didn't miss anything), so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> fs/open.c | 27 ---------------------------
> include/linux/pagemap.h | 29 -----------------------------
> mm/filemap.c | 1 -
> mm/huge_memory.c | 1 -
> mm/khugepaged.c | 29 ++---------------------------
> 5 files changed, 2 insertions(+), 85 deletions(-)
>
> diff --git a/fs/open.c b/fs/open.c
> index 91f1139591ab..cef382d9d8b8 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -970,33 +970,6 @@ static int do_dentry_open(struct file *f,
> if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
> return -EINVAL;
>
> - /*
> - * XXX: Huge page cache doesn't support writing yet. Drop all page
> - * cache for this file before processing writes.
> - */
> - if (f->f_mode & FMODE_WRITE) {
> - /*
> - * Depends on full fence from get_write_access() to synchronize
> - * against collapse_file() regarding i_writecount and nr_thps
> - * updates. Ensures subsequent insertion of THPs into the page
> - * cache will fail.
> - */
> - if (filemap_nr_thps(inode->i_mapping)) {
> - struct address_space *mapping = inode->i_mapping;
> -
> - filemap_invalidate_lock(inode->i_mapping);
> - /*
> - * unmap_mapping_range just need to be called once
> - * here, because the private pages is not need to be
> - * unmapped mapping (e.g. data segment of dynamic
> - * shared libraries here).
> - */
> - unmap_mapping_range(mapping, 0, 0, 0);
> - truncate_inode_pages(mapping, 0);
> - filemap_invalidate_unlock(inode->i_mapping);
> - }
> - }
> -
> return 0;
>
> cleanup_all:
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ec442af3f886..dad3f8846cdc 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -530,35 +530,6 @@ static inline size_t mapping_max_folio_size(const struct address_space *mapping)
> return PAGE_SIZE << mapping_max_folio_order(mapping);
> }
>
> -static inline int filemap_nr_thps(const struct address_space *mapping)
> -{
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - return atomic_read(&mapping->nr_thps);
> -#else
> - return 0;
> -#endif
> -}
> -
> -static inline void filemap_nr_thps_inc(struct address_space *mapping)
> -{
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - if (!mapping_large_folio_support(mapping))
> - atomic_inc(&mapping->nr_thps);
> -#else
> - WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
> -#endif
> -}
> -
> -static inline void filemap_nr_thps_dec(struct address_space *mapping)
> -{
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - if (!mapping_large_folio_support(mapping))
> - atomic_dec(&mapping->nr_thps);
> -#else
> - WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
> -#endif
> -}
> -
> struct address_space *folio_mapping(const struct folio *folio);
>
> /**
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2b933a1da9bd..4248e7cdecf3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -189,7 +189,6 @@ static void filemap_unaccount_folio(struct address_space *mapping,
> lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
> } else if (folio_test_pmd_mappable(folio)) {
> lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
> - filemap_nr_thps_dec(mapping);
> }
> if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
> mod_node_page_state(folio_pgdat(folio),
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b2a6060b3c20..c7873dbdc470 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3833,7 +3833,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> } else {
> lruvec_stat_mod_folio(folio,
> NR_FILE_THPS, -nr);
> - filemap_nr_thps_dec(mapping);
> }
> }
> }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 45b12ffb1550..8004ab8de6d2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2104,20 +2104,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> goto xa_unlocked;
> }
>
> - if (!is_shmem) {
> - filemap_nr_thps_inc(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure i_writecount is up to date and the update to nr_thps
> - * is visible. Ensures the page cache will be truncated if the
> - * file is opened writable.
> - */
> - smp_mb();
> - if (inode_is_open_for_write(mapping->host)) {
> - result = SCAN_FAIL;
> - filemap_nr_thps_dec(mapping);
> - }
> - }
> + if (!is_shmem && inode_is_open_for_write(mapping->host))
> + result = SCAN_FAIL;
>
> xa_locked:
> xas_unlock_irq(&xas);
> @@ -2296,19 +2284,6 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> folio_putback_lru(folio);
> folio_put(folio);
> }
> - /*
> - * Undo the updates of filemap_nr_thps_inc for non-SHMEM
> - * file only. This undo is not needed unless failure is
> - * due to SCAN_COPY_MC.
> - */
> - if (!is_shmem && result == SCAN_COPY_MC) {
> - filemap_nr_thps_dec(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure the update to nr_thps is visible.
> - */
> - smp_mb();
> - }
>
> new_folio->mapping = NULL;
>
> --
> 2.43.0
>
Cheers, Lorenzo
* Re: [PATCH v1 04/10] fs: remove nr_thps from struct address_space
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
@ 2026-03-27 12:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:00 ` David Hildenbrand (Arm)
2026-03-30 3:06 ` Lance Yang
2 siblings, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 12:29 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:49PM -0400, Zi Yan wrote:
> filemap_nr_thps*() are removed, the related field, address_space->nr_thps,
> is no longer needed. Remove it.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
I wonder if we shouldn't squash this into previous actually, but it's fine
either way, so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> fs/inode.c | 3 ---
> include/linux/fs.h | 5 -----
> 2 files changed, 8 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index cc12b68e021b..16ab0a345419 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -280,9 +280,6 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> mapping->flags = 0;
> mapping->wb_err = 0;
> atomic_set(&mapping->i_mmap_writable, 0);
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - atomic_set(&mapping->nr_thps, 0);
> -#endif
> mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> mapping->i_private_data = NULL;
> mapping->writeback_index = 0;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0bdccfa70b44..35875696fb4c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -455,7 +455,6 @@ extern const struct address_space_operations empty_aops;
> * memory mappings.
> * @gfp_mask: Memory allocation flags to use for allocating pages.
> * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> - * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> * @i_mmap: Tree of private and shared mappings.
> * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> * @nrpages: Number of page entries, protected by the i_pages lock.
> @@ -473,10 +472,6 @@ struct address_space {
> struct rw_semaphore invalidate_lock;
> gfp_t gfp_mask;
> atomic_t i_mmap_writable;
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - /* number of thp, only for non-shmem files */
> - atomic_t nr_thps;
> -#endif
> struct rb_root_cached i_mmap;
> unsigned long nrpages;
> pgoff_t writeback_index;
> --
> 2.43.0
>
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 1:42 ` [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
@ 2026-03-27 12:42 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:12 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 12:42 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:50PM -0400, Zi Yan wrote:
> Replace it with a check on the max folio order of the file's address space
> mapping, making sure PMD_ORDER is supported.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/huge_memory.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c7873dbdc470..1da1467328a3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> {
> struct inode *inode;
>
> - if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
> - return false;
> -
> if (!vma->vm_file)
> return false;
>
> @@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> if (IS_ANON_FILE(inode))
> return false;
>
> + if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
> + return false;
> +
At this point I think this should be a separate function quite honestly and
share it with 2/10's use, and then you can put the comment in here re: anon
shmem etc.
Though that won't apply here of course as shmem_allowable_huge_orders() would
have been invoked :)
But no harm in refactoring it anyway, and the repetitive < PMD_ORDER stuff is
unfortunate.
Buuut having said that is this right actually?
Because we have:
if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
return orders;
Above it, and now you're enabling huge folio file systems to do non-page fault
THP and that's err... isn't that quite a big change?
So yeah probably no to this patch as is :) we should just drop
file_thp_enabled()?
> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
> --
> 2.43.0
>
* Re: [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
2026-03-27 1:42 ` [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
@ 2026-03-27 12:50 ` Lorenzo Stoakes (Oracle)
2026-03-30 9:15 ` Lance Yang
1 sibling, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 12:50 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:51PM -0400, Zi Yan wrote:
> Without READ_ONLY_THP_FOR_FS, large file-backed folios cannot be created by
> a FS without large folio support. The check is no longer needed.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
Seems legitimate, so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> mm/huge_memory.c | 22 ----------------------
> 1 file changed, 22 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1da1467328a3..30eddcbf86f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3732,28 +3732,6 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
> /* order-1 is not supported for anonymous THP. */
> if (new_order == 1)
> return -EINVAL;
> - } else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
> - if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> - !mapping_large_folio_support(folio->mapping)) {
> - /*
> - * We can always split a folio down to a single page
> - * (new_order == 0) uniformly.
> - *
> - * For any other scenario
> - * a) uniform split targeting a large folio
> - * (new_order > 0)
> - * b) any non-uniform split
> - * we must confirm that the file system supports large
> - * folios.
> - *
> - * Note that we might still have THPs in such
> - * mappings, which is created from khugepaged when
> - * CONFIG_READ_ONLY_THP_FOR_FS is enabled. But in that
> - * case, the mapping does not actually support large
> - * folios properly.
> - */
> - return -EINVAL;
> - }
> }
>
> /*
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
2026-03-27 3:33 ` Lance Yang
@ 2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:35 ` Zi Yan
2026-03-28 9:54 ` kernel test robot
2026-03-28 9:54 ` kernel test robot
3 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 13:05 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:52PM -0400, Zi Yan wrote:
> After READ_ONLY_THP_FOR_FS is removed, a FS either supports large folios or
> it does not. folio_split() can be used on a FS with large folio support
> without worrying about getting a THP on a FS without large folio support.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/huge_mm.h | 25 ++-----------------------
> mm/truncate.c | 8 ++++----
> 2 files changed, 6 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1258fa37e85b..171de8138e98 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -389,27 +389,6 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
> return split_huge_page_to_list_to_order(page, NULL, new_order);
> }
>
> -/**
> - * try_folio_split_to_order() - try to split a @folio at @page to @new_order
> - * using non uniform split.
> - * @folio: folio to be split
> - * @page: split to @new_order at the given page
> - * @new_order: the target split order
> - *
> - * Try to split a @folio at @page using non uniform split to @new_order, if
> - * non uniform split is not supported, fall back to uniform split. After-split
> - * folios are put back to LRU list. Use min_order_for_split() to get the lower
> - * bound of @new_order.
> - *
> - * Return: 0 - split is successful, otherwise split failed.
> - */
> -static inline int try_folio_split_to_order(struct folio *folio,
> - struct page *page, unsigned int new_order)
> -{
> - if (folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM))
> - return split_huge_page_to_order(&folio->page, new_order);
> - return folio_split(folio, new_order, page, NULL);
> -}
> static inline int split_huge_page(struct page *page)
> {
> return split_huge_page_to_list_to_order(page, NULL, 0);
> @@ -641,8 +620,8 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
> return -EINVAL;
> }
Hmm, there's nothing in the comment or obviously jumping out at me that explains
why this is R/O thp file-backed only?
This seems like an arbitrary helper that just figures out whether it can split
using the non-uniform approach.
I think you need to explain more in the commit message why this was R/O thp
file-backed only, maybe mention some commits that added it etc., I had a quick
glance and even that didn't indicate why.
I look at folio_check_splittable() for instance and see:
...
} else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
!mapping_large_folio_support(folio->mapping)) {
...
return -EINVAL;
}
}
...
if ((split_type == SPLIT_TYPE_NON_UNIFORM || new_order) && folio_test_swapcache(folio)) {
return -EINVAL;
}
if (is_huge_zero_folio(folio))
return -EINVAL;
if (folio_test_writeback(folio))
return -EBUSY;
return 0;
}
None of which suggest that you couldn't have non-uniform splits for other
cases? This at least needs some more explanation/justification in the
commit msg.
>
> -static inline int try_folio_split_to_order(struct folio *folio,
> - struct page *page, unsigned int new_order)
> +static inline int folio_split(struct folio *folio, unsigned int new_order,
> + struct page *page, struct list_head *list);
Yeah as Lance pointed out that ; probably shouldn't be there :)
> {
> VM_WARN_ON_ONCE_FOLIO(1, folio);
> return -EINVAL;
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2931d66c16d0..6973b05ec4b8 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -177,7 +177,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
> return 0;
> }
>
> -static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
> +static int folio_split_or_unmap(struct folio *folio, struct page *split_at,
> unsigned long min_order)
I'm not sure the removal of 'try_' is warranted in general in this patch,
as it seems like it's not guaranteed any of these will succeed? Or am I
wrong?
> {
> enum ttu_flags ttu_flags =
> @@ -186,7 +186,7 @@ static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
> TTU_IGNORE_MLOCK;
> int ret;
>
> - ret = try_folio_split_to_order(folio, split_at, min_order);
> + ret = folio_split(folio, min_order, split_at, NULL);
>
> /*
> * If the split fails, unmap the folio, so it will be refaulted
> @@ -252,7 +252,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
>
> min_order = mapping_min_folio_order(folio->mapping);
> split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
> - if (!try_folio_split_or_unmap(folio, split_at, min_order)) {
> + if (!folio_split_or_unmap(folio, split_at, min_order)) {
> /*
> * try to split at offset + length to make sure folios within
> * the range can be dropped, especially to avoid memory waste
> @@ -279,7 +279,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
> /* make sure folio2 is large and does not change its mapping */
> if (folio_test_large(folio2) &&
> folio2->mapping == folio->mapping)
> - try_folio_split_or_unmap(folio2, split_at2, min_order);
> + folio_split_or_unmap(folio2, split_at2, min_order);
>
> folio_unlock(folio2);
> out:
> --
> 2.43.0
>
Cheers, Lorenzo
* Re: [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
2026-03-27 1:42 ` [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
@ 2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 13:05 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:53PM -0400, Zi Yan wrote:
> READ_ONLY_THP_FOR_FS is no longer present, so remove the related comment.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Acked-by: David Sterba <dsterba@suse.com>
LGTM so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> fs/btrfs/defrag.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 7e2db5d3a4d4..a8d49d9ca981 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -860,9 +860,6 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
> return folio;
>
> /*
> - * Since we can defragment files opened read-only, we can encounter
> - * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
> - *
> * The IO for such large folios is not fully tested, thus return
> * an error to reject such folios unless it's an experimental build.
> *
> --
> 2.43.0
>
* Re: [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
2026-03-27 1:42 ` [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
@ 2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 13:05 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:54PM -0400, Zi Yan wrote:
> Change the requirement to a file system with large folio support whose
> supported orders include PMD_ORDER.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> tools/testing/selftests/mm/khugepaged.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 3fe7ef04ac62..bdcdd31beb1e 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -1086,8 +1086,8 @@ static void usage(void)
> fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
> fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
> fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
> - fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
> - fprintf(stderr, "\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
> + fprintf(stderr, "\n\t\"file,all\" mem_type requires a file system\n");
> + fprintf(stderr, "\twith large folio support (order >= PMD order)\n");
> fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
> fprintf(stderr, "\tmounted with huge=advise option for khugepaged tests to work\n");
> fprintf(stderr, "\n\tSupported Options:\n");
> --
> 2.43.0
>
* Re: [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions
2026-03-27 1:42 ` [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions Zi Yan
@ 2026-03-27 13:06 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 13:06 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:55PM -0400, Zi Yan wrote:
> Any file system with large folio support whose supported orders include
> PMD_ORDER can be used.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
Thanks :) Wondered if you'd fix these up :) So:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cheers, Lorenzo
> ---
> tools/testing/selftests/mm/guard-regions.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
> index 48e8b1539be3..13e77e48b6ef 100644
> --- a/tools/testing/selftests/mm/guard-regions.c
> +++ b/tools/testing/selftests/mm/guard-regions.c
> @@ -2205,7 +2205,7 @@ TEST_F(guard_regions, collapse)
>
> /*
> * We must close and re-open local-file backed as read-only for
> - * CONFIG_READ_ONLY_THP_FOR_FS to work.
> + * MADV_COLLAPSE to work.
> */
> if (variant->backing == LOCAL_FILE_BACKED) {
> ASSERT_EQ(close(self->fd), 0);
> @@ -2237,9 +2237,10 @@ TEST_F(guard_regions, collapse)
> /*
> * Now collapse the entire region. This should fail in all cases.
> *
> - * The madvise() call will also fail if CONFIG_READ_ONLY_THP_FOR_FS is
> - * not set for the local file case, but we can't differentiate whether
> - * this occurred or if the collapse was rightly rejected.
> + * The madvise() call will also fail if the file system does not support
> + * large folio or the supported orders do not include PMD_ORDER for the
> + * local file case, but we can't differentiate whether this occurred or
> + * if the collapse was rightly rejected.
> */
> EXPECT_NE(madvise(ptr, size, MADV_COLLAPSE), 0);
>
> --
> 2.43.0
>
* Re: [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
2026-03-27 11:45 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 13:33 ` David Hildenbrand (Arm)
2026-03-27 14:39 ` Zi Yan
1 sibling, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-27 13:33 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 02:42, Zi Yan wrote:
> No one will be able to use it, so the related code can be removed in the
> coming commits.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/Kconfig | 11 -----------
> 1 file changed, 11 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index bd283958d675..408fc7b82233 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -937,17 +937,6 @@ config THP_SWAP
>
> For selection by architectures with reasonable THP sizes.
>
> -config READ_ONLY_THP_FOR_FS
> - bool "Read-only THP for filesystems (EXPERIMENTAL)"
> - depends on TRANSPARENT_HUGEPAGE
> -
> - help
> - Allow khugepaged to put read-only file-backed pages in THP.
> -
> - This is marked experimental because it is a new feature. Write
> - support of file THPs will be developed in the next few release
> - cycles.
> -
> config NO_PAGE_MAPCOUNT
> bool "No per-page mapcount (EXPERIMENTAL)"
> help
Isn't that usually what we do at the very end when we converted all the
code?
--
Cheers,
David
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
` (2 preceding siblings ...)
2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 13:37 ` David Hildenbrand (Arm)
2026-03-27 14:43 ` Zi Yan
3 siblings, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-27 13:37 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 02:42, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> huge option turned on also sets large folio order on mapping, so the check
> also applies to shmem.
>
> While at it, replace VM_BUG_ON with returning failure values.
Why not VM_WARN_ON_ONCE() ?
These are conditions that must be checked earlier, no?
--
Cheers,
David
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 12:02 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 13:45 ` Baolin Wang
2026-03-27 14:12 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 76+ messages in thread
From: Baolin Wang @ 2026-03-27 13:45 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, David Hildenbrand, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>> collapse_file() requires FSes supporting large folio with at least
>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>> huge option turned on also sets large folio order on mapping, so the check
>>> also applies to shmem.
>>>
>>> While at it, replace VM_BUG_ON with returning failure values.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>> mm/khugepaged.c | 7 +++++--
>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d06d84219e1b..45b12ffb1550 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> int nr_none = 0;
>>> bool is_shmem = shmem_file(file);
>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>> + return SCAN_FAIL;
>>
>> This is not true for anonymous shmem, since its large order allocation logic
>> is similar to anonymous memory. That means it will not call
>> mapping_set_large_folios() for anonymous shmem.
>>
>> So I think the check should be:
>>
>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>> return SCAN_FAIL;
>
> Hmm but in shmem_init() we have:
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> else
> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>
> /*
> * Default to setting PMD-sized THP to inherit the global setting and
> * disable all other multi-size THPs.
> */
> if (!shmem_orders_configured)
> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> #endif
>
> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
> shmem_enabled_store() updates that if necessary.
>
> So we're still fine right?
>
> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
> __shmem_get_inode() which has:
>
> if (sbinfo->huge)
> mapping_set_large_folios(inode->i_mapping);
>
> Shared for both anon shmem and tmpfs-style shmem.
>
> So I think it's fine as-is.
I'm afraid not. Sorry, I should have been clearer.
First, anonymous shmem large order allocation is dynamically controlled
via the global interface
(/sys/kernel/mm/transparent_hugepage/shmem_enabled) and the mTHP
interfaces
(/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
This means that during anonymous shmem initialization, these interfaces
might be set to 'never'. so it will not call mapping_set_large_folios()
because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
Even if shmem large order allocation is subsequently enabled via the
interfaces, __shmem_file_setup -> mapping_set_large_folios() is not
called again.
Anonymous shmem behaves similarly to anonymous pages: it is controlled
by the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders()
to check for allowed large orders, rather than relying on
mapping_max_folio_order().
The mapping_max_folio_order() is intended to control large page
allocation only for tmpfs mounts. Therefore, I find the current code
confusing and think it needs to be fixed:
/* Don't consider 'deny' for emergencies and 'force' for testing */
if (sb != shm_mnt->mnt_sb && sbinfo->huge)
mapping_set_large_folios(inode->i_mapping);
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (9 preceding siblings ...)
2026-03-27 1:42 ` [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions Zi Yan
@ 2026-03-27 13:46 ` David Hildenbrand (Arm)
2026-03-27 14:26 ` Zi Yan
2026-03-27 14:27 ` Lorenzo Stoakes (Oracle)
2026-04-05 17:38 ` Nico Pache
11 siblings, 2 replies; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-27 13:46 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 02:42, Zi Yan wrote:
> Hi all,
>
> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
> read-only THPs for FSes with large folio support (the supported orders
> need to include PMD_ORDER) by default.
>
> The changes are:
> 1. collapse_file() from mm/khugepaged.c, instead of checking
> CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
> of struct address_space of the file is at least PMD_ORDER.
> 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
> 3. truncate_inode_partial_folio() calls folio_split() directly instead
> of the removed try_folio_split_to_order(), since large folios can
> only show up on a FS with large folio support.
> 4. nr_thps is removed from struct address_space, since it is no longer
> needed to drop all read-only THPs from a FS without large folio
> support when the fd becomes writable. Its related filemap_nr_thps*()
> are removed too.
> 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
> 6. Updated comments in various places.
>
> Changelog
> ===
> From RFC[1]:
> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
> on by default for all FSes with large folio support and the supported
> orders includes PMD_ORDER.
>
> Suggestions and comments are welcome.
Hi! :)
The patch set might be better structured by
1) Teaching code paths to not only respect READ_ONLY_THP_FOR_FS but also
filesystems with large folios. At that point, READ_ONLY_THP_FOR_FS would
have no effect.
2) Removing READ_ONLY_THP_FOR_FS along with all the old cruft that is no
longer required
MADV_COLLAPSE will keep working the whole time.
--
Cheers,
David
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 12:23 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 13:58 ` David Hildenbrand (Arm)
2026-03-27 14:23 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-27 13:58 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle), Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 13:23, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
>> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
>> large folio support, so that read-only THPs created in these FSes are not
>> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
>> THPs only appear in a FS with large folio support and the supported orders
>> include PMD_ORDRE.
>
> Typo: PMD_ORDRE -> PMD_ORDER
>
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> This looks obviously-correct since this stuff wouldn't have been invoked for
> large folio file systems before + they already had to handle it separately, and
> this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
> suggests you didn't miss anything), so:
There could now be a race between collapsing and the file getting opened
r/w.
Are we sure that all code can really deal with that?
IOW, "they already had to handle it separately" -- is that true?
khugepaged would have never collapse in writable files, so I wonder if
all code paths are prepared for that.
--
Cheers,
David
* Re: [PATCH v1 04/10] fs: remove nr_thps from struct address_space
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
2026-03-27 12:29 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 14:00 ` David Hildenbrand (Arm)
2026-03-30 3:06 ` Lance Yang
2 siblings, 0 replies; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-27 14:00 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 02:42, Zi Yan wrote:
> filemap_nr_thps*() are removed, the related field, address_space->nr_thps,
> is no longer needed. Remove it.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 13:45 ` Baolin Wang
@ 2026-03-27 14:12 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:26 ` Baolin Wang
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 14:12 UTC (permalink / raw)
To: Baolin Wang
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, David Hildenbrand, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>
>
> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/27/26 9:42 AM, Zi Yan wrote:
> > > > collapse_file() requires FSes supporting large folio with at least
> > > > PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> > > > huge option turned on also sets large folio order on mapping, so the check
> > > > also applies to shmem.
> > > >
> > > > While at it, replace VM_BUG_ON with returning failure values.
> > > >
> > > > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > > > ---
> > > > mm/khugepaged.c | 7 +++++--
> > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index d06d84219e1b..45b12ffb1550 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > int nr_none = 0;
> > > > bool is_shmem = shmem_file(file);
> > > > - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > > - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> > > > + /* "huge" shmem sets mapping folio order and passes the check below */
> > > > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > > > + return SCAN_FAIL;
> > >
> > > This is not true for anonymous shmem, since its large order allocation logic
> > > is similar to anonymous memory. That means it will not call
> > > mapping_set_large_folios() for anonymous shmem.
> > >
> > > So I think the check should be:
> > >
> > > if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> > > return SCAN_FAIL;
> >
> > Hmm but in shmem_init() we have:
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
> > SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> > else
> > shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> >
> > /*
> > * Default to setting PMD-sized THP to inherit the global setting and
> > * disable all other multi-size THPs.
> > */
> > if (!shmem_orders_configured)
> > huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> > #endif
> >
> > And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
> > shmem_enabled_store() updates that if necessary.
> >
> > So we're still fine right?
> >
> > __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
> > __shmem_get_inode() which has:
> >
> > if (sbinfo->huge)
> > mapping_set_large_folios(inode->i_mapping);
> >
> > Shared for both anon shmem and tmpfs-style shmem.
> >
> > So I think it's fine as-is.
>
> I'm afraid not. Sorry, I should have been clearer.
>
> First, anonymous shmem large order allocation is dynamically controlled via
> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
> the mTHP interfaces
> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>
> This means that during anonymous shmem initialization, these interfaces
> might be set to 'never'. so it will not call mapping_set_large_folios()
> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>
> Even if shmem large order allocation is subsequently enabled via the
> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
> again.
I see your point, oh this is all a bit of a mess...
It feels like entirely the wrong abstraction anyway, since at best you're
getting a global 'is enabled'.
I guess what happened before was we'd never call into this with ! r/o thp for fs
&& ! is_shmem.
But now we are allowing it, and should STILL be gating on !is_shmem, so yeah your
suggestion is correct I think actually.
I do hate:
if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
As a bit of code though. It's horrible.
Let's abstract that...
It'd be nice if we could find a way to clean things up in the lead up to changes
in series like this instead of sticking with the mess, but I guess since it
mostly removes stuff that's ok for now.
>
> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
> check for allowed large orders, rather than relying on
> mapping_max_folio_order().
>
> The mapping_max_folio_order() is intended to control large page allocation
> only for tmpfs mounts. Therefore, I find the current code confusing and
> think it needs to be fixed:
>
> /* Don't consider 'deny' for emergencies and 'force' for testing */
> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
> mapping_set_large_folios(inode->i_mapping);
Cheers, Lorenzo
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 14:15 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:46 ` Zi Yan
1 sibling, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 14:15 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 12:07:22PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > + return SCAN_FAIL;
>
> As per rest of thread, this looks correct.
Actually, no :)
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 13:58 ` David Hildenbrand (Arm)
@ 2026-03-27 14:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:05 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 14:23 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
> On 3/27/26 13:23, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
> >> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
> >> large folio support, so that read-only THPs created in these FSes are not
> >> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
> >> THPs only appear in a FS with large folio support and the supported orders
> >> include PMD_ORDRE.
> >
> > Typo: PMD_ORDRE -> PMD_ORDER
> >
> >>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >
> > This looks obviously-correct since this stuff wouldn't have been invoked for
> > large folio file systems before + they already had to handle it separately, and
> > this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
> > suggests you didn't miss anything), so:
>
> There could now be a race between collapsing and the file getting opened
> r/w.
>
> Are we sure that all code can really deal with that?
>
> IOW, "they already had to handle it separately" -- is that true?
> khugepaged would have never collapsed in writable files, so I wonder if
> all code paths are prepared for that.
OK I guess I overlooked a part of this code... :) see below.
This is fine and would be a no-op anyway
- if (f->f_mode & FMODE_WRITE) {
- /*
- * Depends on full fence from get_write_access() to synchronize
- * against collapse_file() regarding i_writecount and nr_thps
- * updates. Ensures subsequent insertion of THPs into the page
- * cache will fail.
- */
- if (filemap_nr_thps(inode->i_mapping)) {
But this:
- if (!is_shmem) {
- filemap_nr_thps_inc(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure i_writecount is up to date and the update to nr_thps
- * is visible. Ensures the page cache will be truncated if the
- * file is opened writable.
- */
- smp_mb();
We can drop the barrier
- if (inode_is_open_for_write(mapping->host)) {
- result = SCAN_FAIL;
But this is a functional change!
Yup missed this.
- filemap_nr_thps_dec(mapping);
- }
- }
For below:
- /*
- * Undo the updates of filemap_nr_thps_inc for non-SHMEM
- * file only. This undo is not needed unless failure is
- * due to SCAN_COPY_MC.
- */
- if (!is_shmem && result == SCAN_COPY_MC) {
- filemap_nr_thps_dec(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure the update to nr_thps is visible.
- */
- smp_mb();
- }
Here it is probably fine to remove, if the barrier is _only_ for nr_thps.
>
> --
> Cheers,
>
> David
Sorry Zi, R-b tag withdrawn... :( I missed that 1 functional change there.
Cheers, Lorenzo
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-03-27 13:46 ` [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig David Hildenbrand (Arm)
@ 2026-03-27 14:26 ` Zi Yan
2026-03-27 14:27 ` Lorenzo Stoakes (Oracle)
1 sibling, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 14:26 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 9:46, David Hildenbrand (Arm) wrote:
> On 3/27/26 02:42, Zi Yan wrote:
>> Hi all,
>>
>> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
>> read-only THPs for FSes with large folio support (the supported orders
>> need to include PMD_ORDER) by default.
>>
>> The changes are:
>> 1. collapse_file() from mm/khugepaged.c, instead of checking
>> CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
>> of struct address_space of the file is at least PMD_ORDER.
>> 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
>> 3. truncate_inode_partial_folio() calls folio_split() directly instead
>> of the removed try_folio_split_to_order(), since large folios can
>> only show up on a FS with large folio support.
>> 4. nr_thps is removed from struct address_space, since it is no longer
>> needed to drop all read-only THPs from a FS without large folio
>> support when the fd becomes writable. Its related filemap_nr_thps*()
>> are removed too.
>> 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
>> 6. Updated comments in various places.
>>
>> Changelog
>> ===
>> From RFC[1]:
>> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
>> on by default for all FSes with large folio support and the supported
>> orders includes PMD_ORDER.
>>
>> Suggestions and comments are welcome.
>
> Hi! :)
>
> The patch set might be better structured by
>
> 1) Teaching code paths to not only respect READ_ONLY_THP_FOR_FS but also
> filesystems with large folios. At that point, READ_ONLY_THP_FOR_FS would
> have no effect.
>
> 2) Removing READ_ONLY_THP_FOR_FS along with all the old cruft that is no
> longer required
>
> MADV_COLLAPSE will keep working the whole time.
OK. I will give this a try.
Best Regards,
Yan, Zi
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 14:12 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 14:26 ` Baolin Wang
2026-03-27 14:31 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 76+ messages in thread
From: Baolin Wang @ 2026-03-27 14:26 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, David Hildenbrand, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>> also applies to shmem.
>>>>>
>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> ---
>>>>> mm/khugepaged.c | 7 +++++--
>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>> int nr_none = 0;
>>>>> bool is_shmem = shmem_file(file);
>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>> + return SCAN_FAIL;
>>>>
>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>> is similar to anonymous memory. That means it will not call
>>>> mapping_set_large_folios() for anonymous shmem.
>>>>
>>>> So I think the check should be:
>>>>
>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>> return SCAN_FAIL;
>>>
>>> Hmm but in shmem_init() we have:
>>>
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>> else
>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>
>>> /*
>>> * Default to setting PMD-sized THP to inherit the global setting and
>>> * disable all other multi-size THPs.
>>> */
>>> if (!shmem_orders_configured)
>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>> #endif
>>>
>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>> shmem_enabled_store() updates that if necessary.
>>>
>>> So we're still fine right?
>>>
>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>> __shmem_get_inode() which has:
>>>
>>> if (sbinfo->huge)
>>> mapping_set_large_folios(inode->i_mapping);
>>>
>>> Shared for both anon shmem and tmpfs-style shmem.
>>>
>>> So I think it's fine as-is.
>>
>> I'm afraid not. Sorry, I should have been clearer.
>>
>> First, anonymous shmem large order allocation is dynamically controlled via
>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>> the mTHP interfaces
>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>
>> This means that during anonymous shmem initialization, these interfaces
>> might be set to 'never'. so it will not call mapping_set_large_folios()
>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>
>> Even if shmem large order allocation is subsequently enabled via the
>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>> again.
>
> I see your point, oh this is all a bit of a mess...
>
> It feels like entirely the wrong abstraction anyway, since at best you're
> getting a global 'is enabled'.
>
> I guess what happened before was we'd never call into this with ! r/o thp for fs
> && ! is_shmem.
Right.
> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
> suggestion is correct I think actually.
>
> I do hate:
>
> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>
> As a bit of code though. It's horrible.
Indeed.
> Let's abstract that...
>
> It'd be nice if we could find a way to clean things up in the lead up to changes
> in series like this instead of sticking with the mess, but I guess since it
> mostly removes stuff that's ok for now.
I think this check can be removed from this patch.
During khugepaged's scan, thp_vma_allowable_order() is called to
check whether the VMA is allowed to collapse into a PMD.
Specifically, within the call chain thp_vma_allowable_order() ->
__thp_vma_allowable_orders(), shmem is checked via
shmem_allowable_huge_orders(), while other FSes are checked via
file_thp_enabled().
For those other filesystems, Patch 5 has already added the following
check, which I think is sufficient to filter out those FSes that do not
support large folios:
if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
return false;
>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>> check for allowed large orders, rather than relying on
>> mapping_max_folio_order().
>>
>> The mapping_max_folio_order() is intended to control large page allocation
>> only for tmpfs mounts. Therefore, I find the current code confusing and
>> think it needs to be fixed:
>>
>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>> mapping_set_large_folios(inode->i_mapping);
>
> Cheers, Lorenzo
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-03-27 13:46 ` [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig David Hildenbrand (Arm)
2026-03-27 14:26 ` Zi Yan
@ 2026-03-27 14:27 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:30 ` Zi Yan
1 sibling, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 14:27 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 02:46:43PM +0100, David Hildenbrand (Arm) wrote:
> On 3/27/26 02:42, Zi Yan wrote:
> > Hi all,
> >
> > This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
> > read-only THPs for FSes with large folio support (the supported orders
> > need to include PMD_ORDER) by default.
> >
> > The changes are:
> > 1. collapse_file() from mm/khugepaged.c, instead of checking
> > CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
> > of struct address_space of the file is at least PMD_ORDER.
> > 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
> > 3. truncate_inode_partial_folio() calls folio_split() directly instead
> > of the removed try_folio_split_to_order(), since large folios can
> > only show up on a FS with large folio support.
> > 4. nr_thps is removed from struct address_space, since it is no longer
> > needed to drop all read-only THPs from a FS without large folio
> > support when the fd becomes writable. Its related filemap_nr_thps*()
> > are removed too.
> > 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
> > 6. Updated comments in various places.
> >
> > Changelog
> > ===
> > From RFC[1]:
> > 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
> > on by default for all FSes with large folio support and the supported
> > orders includes PMD_ORDER.
> >
> > Suggestions and comments are welcome.
>
> Hi! :)
>
> The patch set might be better structured by
>
> 1) Teaching code paths to not only respect READ_ONLY_THP_FOR_FS but also
> filesystems with large folios. At that point, READ_ONLY_THP_FOR_FS would
> have no effect.
And also please do some cleaning up of the mess we have in the code base if at
all possible :) I feel like we're constantly building on sand with this, and
should treat every major change as a chance to do this.
Or otherwise we constantly keep leaving this mess around to deal with...
>
> 2) Removing READ_ONLY_THP_FOR_FS along with all the old cruft that is no
> longer required
>
> MADV_COLLAPSE will keep working the whole time.
Obviously everything should keep working throughout any version of this series.
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-03-27 14:27 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 14:30 ` Zi Yan
0 siblings, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 14:30 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: David Hildenbrand (Arm), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 10:27, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 02:46:43PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/27/26 02:42, Zi Yan wrote:
>>> Hi all,
>>>
>>> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
>>> read-only THPs for FSes with large folio support (the supported orders
>>> need to include PMD_ORDER) by default.
>>>
>>> The changes are:
>>> 1. collapse_file() from mm/khugepaged.c, instead of checking
>>> CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
>>> of struct address_space of the file is at least PMD_ORDER.
>>> 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
>>> 3. truncate_inode_partial_folio() calls folio_split() directly instead
>>> of the removed try_folio_split_to_order(), since large folios can
>>> only show up on a FS with large folio support.
>>> 4. nr_thps is removed from struct address_space, since it is no longer
>>> needed to drop all read-only THPs from a FS without large folio
>>> support when the fd becomes writable. Its related filemap_nr_thps*()
>>> are removed too.
>>> 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
>>> 6. Updated comments in various places.
>>>
>>> Changelog
>>> ===
>>> From RFC[1]:
>>> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
>>> on by default for all FSes with large folio support and the supported
>>> orders includes PMD_ORDER.
>>>
>>> Suggestions and comments are welcome.
>>
>> Hi! :)
>>
>> The patch set might be better structured by
>>
>> 1) Teaching code paths to not only respect READ_ONLY_THP_FOR_FS but also
>> filesystems with large folios. At that point, READ_ONLY_THP_FOR_FS would
>> have no effect.
>
> And also please do some cleaning up of the mess we have in the code base if at
> all possible :) I feel like we're constantly building on sand with this, and
> should treat every major change as a chance to do this.
>
> Or otherwise we constantly keep leaving this mess around to deal with...
Got it. Let me read through the feedback on individual patches and come up with
a plan.
>
>>
>> 2) Removing READ_ONLY_THP_FOR_FS along with all the old cruft that is no
>> longer required
>>
>> MADV_COLLAPSE will keep working the whole time.
>
> Obviously everything should keep working throughout any version of this series.
>
Ack.
Best Regards,
Yan, Zi
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 14:26 ` Baolin Wang
@ 2026-03-27 14:31 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:00 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 14:31 UTC (permalink / raw)
To: Baolin Wang
Cc: Zi Yan, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, David Hildenbrand, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>
>
> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
> > > > On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
> > > > >
> > > > >
> > > > > On 3/27/26 9:42 AM, Zi Yan wrote:
> > > > > > collapse_file() requires FSes supporting large folio with at least
> > > > > > PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> > > > > > huge option turned on also sets large folio order on mapping, so the check
> > > > > > also applies to shmem.
> > > > > >
> > > > > > While at it, replace VM_BUG_ON with returning failure values.
> > > > > >
> > > > > > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > > > > > ---
> > > > > > mm/khugepaged.c | 7 +++++--
> > > > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index d06d84219e1b..45b12ffb1550 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > > int nr_none = 0;
> > > > > > bool is_shmem = shmem_file(file);
> > > > > > - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > > > > - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> > > > > > + /* "huge" shmem sets mapping folio order and passes the check below */
> > > > > > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > > > > > + return SCAN_FAIL;
> > > > >
> > > > > This is not true for anonymous shmem, since its large order allocation logic
> > > > > is similar to anonymous memory. That means it will not call
> > > > > mapping_set_large_folios() for anonymous shmem.
> > > > >
> > > > > So I think the check should be:
> > > > >
> > > > > if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> > > > > return SCAN_FAIL;
> > > >
> > > > Hmm but in shmem_init() we have:
> > > >
> > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
> > > > SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> > > > else
> > > > shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> > > >
> > > > /*
> > > > * Default to setting PMD-sized THP to inherit the global setting and
> > > > * disable all other multi-size THPs.
> > > > */
> > > > if (!shmem_orders_configured)
> > > > huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> > > > #endif
> > > >
> > > > And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
> > > > shmem_enabled_store() updates that if necessary.
> > > >
> > > > So we're still fine right?
> > > >
> > > > __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
> > > > __shmem_get_inode() which has:
> > > >
> > > > if (sbinfo->huge)
> > > > mapping_set_large_folios(inode->i_mapping);
> > > >
> > > > Shared for both anon shmem and tmpfs-style shmem.
> > > >
> > > > So I think it's fine as-is.
> > >
> > > I'm afraid not. Sorry, I should have been clearer.
> > >
> > > First, anonymous shmem large order allocation is dynamically controlled via
> > > the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
> > > the mTHP interfaces
> > > (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
> > >
> > > This means that during anonymous shmem initialization, these interfaces
> > > might be set to 'never'. so it will not call mapping_set_large_folios()
> > > because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
> > >
> > > Even if shmem large order allocation is subsequently enabled via the
> > > interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
> > > again.
> >
> > I see your point, oh this is all a bit of a mess...
> >
> > It feels like entirely the wrong abstraction anyway, since at best you're
> > getting a global 'is enabled'.
> >
> > I guess what happened before was we'd never call into this with ! r/o thp for fs
> > && ! is_shmem.
>
> Right.
>
> > But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
> > suggestion is correct I think actually.
> >
> > I do hate:
> >
> > if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> >
> > As a bit of code though. It's horrible.
>
> Indeed.
>
> > Let's abstract that...
> >
> > It'd be nice if we could find a way to clean things up in the lead up to changes
> > in series like this instead of sticking with the mess, but I guess since it
> > mostly removes stuff that's ok for now.
>
> I think this check can be removed from this patch.
>
> During the khugepaged's scan, it will call thp_vma_allowable_order() to
> check if the VMA is allowed to collapse into a PMD.
>
> Specifically, within the call chain thp_vma_allowable_order() ->
> __thp_vma_allowable_orders(), shmem is checked via
> shmem_allowable_huge_orders(), while other FSes are checked via
> file_thp_enabled().
It sucks not to have an assert. Maybe in that case make it a
VM_WARN_ON_ONCE().
I hate that you're left tracing things back like that...
>
> For those other filesystems, Patch 5 has already added the following check,
> which I think is sufficient to filter out those FSes that do not support
> large folios:
>
> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
> return false;
2 < 5 though - patch 2 can't rely on a check that only lands in patch 5; we
won't tolerate bisection hazards.
>
>
> > > Anonymous shmem behaves similarly to anonymous pages: it is controlled by
> > > the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
> > > check for allowed large orders, rather than relying on
> > > mapping_max_folio_order().
> > >
> > > The mapping_max_folio_order() is intended to control large page allocation
> > > only for tmpfs mounts. Therefore, I find the current code confusing and
> > > think it needs to be fixed:
> > >
> > > /* Don't consider 'deny' for emergencies and 'force' for testing */
> > > if (sb != shm_mnt->mnt_sb && sbinfo->huge)
> > > mapping_set_large_folios(inode->i_mapping);
> >
> > Cheers, Lorenzo
>
Cheers, Lorenzo
* Re: [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option
2026-03-27 13:33 ` David Hildenbrand (Arm)
@ 2026-03-27 14:39 ` Zi Yan
0 siblings, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 14:39 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 9:33, David Hildenbrand (Arm) wrote:
> On 3/27/26 02:42, Zi Yan wrote:
>> No one will be able to use it, so the related code can be removed in the
>> coming commits.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> mm/Kconfig | 11 -----------
>> 1 file changed, 11 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index bd283958d675..408fc7b82233 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -937,17 +937,6 @@ config THP_SWAP
>>
>> For selection by architectures with reasonable THP sizes.
>>
>> -config READ_ONLY_THP_FOR_FS
>> - bool "Read-only THP for filesystems (EXPERIMENTAL)"
>> - depends on TRANSPARENT_HUGEPAGE
>> -
>> - help
>> - Allow khugepaged to put read-only file-backed pages in THP.
>> -
>> - This is marked experimental because it is a new feature. Write
>> - support of file THPs will be developed in the next few release
>> - cycles.
>> -
>> config NO_PAGE_MAPCOUNT
>> bool "No per-page mapcount (EXPERIMENTAL)"
>> help
>
> Isn't that usually what we do at the very end when we converted all the
> code?
The rationale is that after removing the Kconfig option, the related code is always
disabled and the following patches can remove it piece by piece. The approach
you are hinting at might be to 1) remove all users of READ_ONLY_THP_FOR_FS,
making collapse_file() reject FSes without large folio support, 2) remove
other READ_ONLY_THP_FOR_FS related code. It might still cause confusion
since READ_ONLY_THP_FOR_FS is still present while its functionality is gone.
But as you pointed out in the cover letter thread, MADV_COLLAPSE needs to
work throughout the patchset, so I will move this patch to a later stage,
when MADV_COLLAPSE works on FSes with large folio support.
WDYT?
Best Regards,
Yan, Zi
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 13:37 ` David Hildenbrand (Arm)
@ 2026-03-27 14:43 ` Zi Yan
0 siblings, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 14:43 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 9:37, David Hildenbrand (Arm) wrote:
> On 3/27/26 02:42, Zi Yan wrote:
>> collapse_file() requires FSes supporting large folio with at least
>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>> huge option turned on also sets large folio order on mapping, so the check
>> also applies to shmem.
>>
>> While at it, replace VM_BUG_ON with returning failure values.
>
> Why not VM_WARN_ON_ONCE() ?
>
> These are conditions that must be checked earlier, no?
For start & (HPAGE_PMD_NR - 1), yes. I can convert it to VM_WARN_ON_ONCE().
For mapping_max_folio_order(mapping) < PMD_ORDER, I probably should
move it to collapse_scan_file() to avoid wasting scanning time
if the file does not support large folios. Then, I can turn it
into a VM_WARN_ON_ONCE().
Best Regards,
Yan, Zi
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:15 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 14:46 ` Zi Yan
1 sibling, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 14:46 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 8:07, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote:
>> collapse_file() requires FSes supporting large folio with at least
>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>> huge option turned on also sets large folio order on mapping, so the check
>> also applies to shmem.
>>
>> While at it, replace VM_BUG_ON with returning failure values.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
>
>
>> ---
>> mm/khugepaged.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d06d84219e1b..45b12ffb1550 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>> int nr_none = 0;
>> bool is_shmem = shmem_file(file);
>>
>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>> + /* "huge" shmem sets mapping folio order and passes the check below */
>
> I think this isn't quite clear and could be improved to e.g.:
>
> /*
> * Either anon shmem supports huge pages as set by shmem_enabled sysfs,
> * or a shmem file system mounted with the "huge" option.
> */
>
>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>> + return SCAN_FAIL;
>
> As per rest of thread, this looks correct.
Will respond to that thread.
>
>> + if (start & (HPAGE_PMD_NR - 1))
>> + return SCAN_ADDRESS_RANGE;
>
> Hmm, we're kinda making this - presumably buggy situation - into a valid input
> that just fails the scan.
>
> Maybe just make it a VM_WARN_ON_ONCE()? Or if we want to avoid propagating the
> bug that'd cause it any further:
>
> if (start & (HPAGE_PMD_NR - 1)) {
> VM_WARN_ON_ONCE(true);
> return SCAN_ADDRESS_RANGE;
> }
>
> Or similar.
As I responded to David, will change it to VM_WARN_ON_ONCE().
>
>>
>> result = alloc_charge_folio(&new_folio, mm, cc);
>> if (result != SCAN_SUCCEED)
>> --
>> 2.43.0
>>
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 14:31 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 15:00 ` Zi Yan
2026-03-27 16:22 ` Lance Yang
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 15:00 UTC (permalink / raw)
To: Baolin Wang, Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>>>> also applies to shmem.
>>>>>>>
>>>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>>>
>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>> ---
>>>>>>> mm/khugepaged.c | 7 +++++--
>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>>>> --- a/mm/khugepaged.c
>>>>>>> +++ b/mm/khugepaged.c
>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>>> int nr_none = 0;
>>>>>>> bool is_shmem = shmem_file(file);
>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>> + return SCAN_FAIL;
>>>>>>
>>>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>>>> is similar to anonymous memory. That means it will not call
>>>>>> mapping_set_large_folios() for anonymous shmem.
>>>>>>
>>>>>> So I think the check should be:
>>>>>>
>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>> return SCAN_FAIL;
>>>>>
>>>>> Hmm but in shmem_init() we have:
>>>>>
>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>>>> else
>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>>>
>>>>> /*
>>>>> * Default to setting PMD-sized THP to inherit the global setting and
>>>>> * disable all other multi-size THPs.
>>>>> */
>>>>> if (!shmem_orders_configured)
>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>>>> #endif
>>>>>
>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>>>> shmem_enabled_store() updates that if necessary.
>>>>>
>>>>> So we're still fine right?
>>>>>
>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>>>> __shmem_get_inode() which has:
>>>>>
>>>>> if (sbinfo->huge)
>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>
>>>>> Shared for both anon shmem and tmpfs-style shmem.
>>>>>
>>>>> So I think it's fine as-is.
>>>>
>>>> I'm afraid not. Sorry, I should have been clearer.
>>>>
>>>> First, anonymous shmem large order allocation is dynamically controlled via
>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>>>> the mTHP interfaces
>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>>>
>>>> This means that during anonymous shmem initialization, these interfaces
>>>> might be set to 'never'. so it will not call mapping_set_large_folios()
>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>>>
>>>> Even if shmem large order allocation is subsequently enabled via the
>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>>>> again.
>>>
>>> I see your point, oh this is all a bit of a mess...
>>>
>>> It feels like entirely the wrong abstraction anyway, since at best you're
>>> getting a global 'is enabled'.
>>>
>>> I guess what happened before was we'd never call into this with ! r/o thp for fs
>>> && ! is_shmem.
>>
>> Right.
>>
>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
>>> suggestion is correct I think actualyl.
>>>
>>> I do hate:
>>>
>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>
>>> As a bit of code though. It's horrible.
>>
>> Indeed.
>>
>>> Let's abstract that...
>>>
>>> It'd be nice if we could find a way to clean things up in the lead up to changes
>>> in series like this instead of sticking with the mess, but I guess since it
>>> mostly removes stuff that's ok for now.
>>
>> I think this check can be removed from this patch.
>>
>> During the khugepaged's scan, it will call thp_vma_allowable_order() to
>> check if the VMA is allowed to collapse into a PMD.
>>
>> Specifically, within the call chain thp_vma_allowable_order() ->
>> __thp_vma_allowable_orders(), shmem is checked via
>> shmem_allowable_huge_orders(), while other FSes are checked via
>> file_thp_enabled().
But for the madvise(MADV_COLLAPSE) case, IIRC, it ignores the shmem huge
config and can perform the collapse anyway. This means that without the
!is_shmem part the check will break madvise(MADV_COLLAPSE). Let me know
if I got it wrong, since I was in that TVA_FORCED_COLLAPSE email thread
but do not remember everything there.
>
> It sucks not to have an assert. Maybe in that case make it a
> VM_WARN_ON_ONCE().
Will do that as I replied to David already.
>
> I hate that you're left tracing things back like that...
>
>>
>> For those other filesystems, Patch 5 has already added the following check,
>> which I think is sufficient to filter out those FSes that do not support
>> large folios:
>>
>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>> return false;
>
> 2 < 5, we won't tolerate bisection hazards.
>
>>
>>
>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>>>> check for allowed large orders, rather than relying on
>>>> mapping_max_folio_order().
>>>>
>>>> The mapping_max_folio_order() is intended to control large page allocation
>>>> only for tmpfs mounts. Therefore, I find the current code confusing and
>>>> think it needs to be fixed:
>>>>
>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>>>> mapping_set_large_folios(inode->i_mapping);
>>>
Hi Baolin,
Do you want to send a fix for this?
Also I wonder how I can distinguish between anonymous shmem code and tmpfs code.
I thought they were the same thing except for their different user interfaces,
but it seems that I was wrong.
Best Regards,
Yan, Zi
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 14:23 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 15:05 ` Zi Yan
2026-04-01 14:35 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 15:05 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: David Hildenbrand (Arm), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/27/26 13:23, Lorenzo Stoakes (Oracle) wrote:
>>> On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
>>>> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
>>>> large folio support, so that read-only THPs created in these FSes are not
>>>> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
>>>> THPs only appear in a FS with large folio support and the supported orders
>>>> include PMD_ORDRE.
>>>
>>> Typo: PMD_ORDRE -> PMD_ORDER
>>>
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>
>>> This looks obviously-correct since this stuff wouldn't have been invoked for
>>> large folio file systems before + they already had to handle it separately, and
>>> this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
>>> suggests you didn't miss anything), so:
>>
>> There could now be a race between collapsing and the file getting opened
>> r/w.
>>
>> Are we sure that all code can really deal with that?
>>
>> IOW, "they already had to handle it separately" -- is that true?
>> khugepaged would have never collapse in writable files, so I wonder if
>> all code paths are prepared for that.
>
> OK I guess I overlooked a part of this code... :) see below.
>
> This is fine and would be a no-op anyway
>
> - if (f->f_mode & FMODE_WRITE) {
> - /*
> - * Depends on full fence from get_write_access() to synchronize
> - * against collapse_file() regarding i_writecount and nr_thps
> - * updates. Ensures subsequent insertion of THPs into the page
> - * cache will fail.
> - */
> - if (filemap_nr_thps(inode->i_mapping)) {
>
> But this:
>
> - if (!is_shmem) {
> - filemap_nr_thps_inc(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure i_writecount is up to date and the update to nr_thps
> - * is visible. Ensures the page cache will be truncated if the
> - * file is opened writable.
> - */
> - smp_mb();
>
> We can drop barrier
>
> - if (inode_is_open_for_write(mapping->host)) {
> - result = SCAN_FAIL;
>
> But this is a functional change!
>
> Yup missed this.
But I added
+ if (!is_shmem && inode_is_open_for_write(mapping->host))
+ result = SCAN_FAIL;
That keeps the original bail-out, right?
>
> - filemap_nr_thps_dec(mapping);
> - }
> - }
>
> For below:
>
> - /*
> - * Undo the updates of filemap_nr_thps_inc for non-SHMEM
> - * file only. This undo is not needed unless failure is
> - * due to SCAN_COPY_MC.
> - */
> - if (!is_shmem && result == SCAN_COPY_MC) {
> - filemap_nr_thps_dec(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure the update to nr_thps is visible.
> - */
> - smp_mb();
> - }
>
> Here is probably fine to remove if barrier _only_ for nr_thps.
>
>>
>> --
>> Cheers,
>>
>> David
>
> Sorry Zi, R-b tag withdrawn... :( I missed that 1 functional change there.
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 12:42 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 15:12 ` Zi Yan
2026-03-27 15:29 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 15:12 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 8:42, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:42:50PM -0400, Zi Yan wrote:
>> Replace it with a check on the max folio order of the file's address space
>> mapping, making sure PMD_ORDER is supported.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> mm/huge_memory.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c7873dbdc470..1da1467328a3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>> {
>> struct inode *inode;
>>
>> - if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
>> - return false;
>> -
>> if (!vma->vm_file)
>> return false;
>>
>> @@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>> if (IS_ANON_FILE(inode))
>> return false;
>>
>> + if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>> + return false;
>> +
>
> At this point I think this should be a separate function quite honestly and
> share it with 2/10's use, and then you can put the comment in here re: anon
> shmem etc.
>
> Though that won't apply here of course as shmem_allowable_huge_orders() would
> have been invoked :)
>
> But no harm in refactoring it anyway, and the repetitive < PMD_ORDER stuff is
> unfortunate.
>
> Buuut having said that is this right actually?
>
> Because we have:
>
> if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
> return orders;
>
> Above it, and now you're enabling huge folio file systems to do non-page fault
> THP and that's err... isn't that quite a big change?
That is what READ_ONLY_THP_FOR_FS does, creating THPs after page faults, right?
This patchset changes the condition from all FSes to FSes with large folio
support.
Will add a helper, mapping_support_pmd_folio(), for
mapping_max_folio_order(inode->i_mapping) < PMD_ORDER.
>
> So yeah probably no to this patch as is :) we should just drop
> file_thp_enabled()?
>
>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>> }
>>
>> --
>> 2.43.0
>>
Best Regards,
Yan, Zi
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 15:12 ` Zi Yan
@ 2026-03-27 15:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:43 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 15:29 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 11:12:46AM -0400, Zi Yan wrote:
> On 27 Mar 2026, at 8:42, Lorenzo Stoakes (Oracle) wrote:
>
> > On Thu, Mar 26, 2026 at 09:42:50PM -0400, Zi Yan wrote:
> >> Replace it with a check on the max folio order of the file's address space
> >> mapping, making sure PMD_ORDER is supported.
> >>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >> ---
> >> mm/huge_memory.c | 6 +++---
> >> 1 file changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index c7873dbdc470..1da1467328a3 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >> {
> >> struct inode *inode;
> >>
> >> - if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
> >> - return false;
> >> -
> >> if (!vma->vm_file)
> >> return false;
> >>
> >> @@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >> if (IS_ANON_FILE(inode))
> >> return false;
> >>
> >> + if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
> >> + return false;
> >> +
> >
> > At this point I think this should be a separate function quite honestly and
> > share it with 2/10's use, and then you can put the comment in here re: anon
> > shmem etc.
> >
> > Though that won't apply here of course as shmem_allowable_huge_orders() would
> > have been invoked :)
> >
> > But no harm in refactoring it anyway, and the repetitive < PMD_ORDER stuff is
> > unfortunate.
> >
> > Buuut having said that is this right actually?
> >
> > Because we have:
> >
> > if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
> > return orders;
> >
> > Above it, and now you're enabling huge folio file systems to do non-page fault
> > THP and that's err... isn't that quite a big change?
>
> That is what READ_ONLY_THP_FOR_FS does, creating THPs after page faults, right?
> This patchset changes the condition from all FSes to FSes with large folio
> support.
No, READ_ONLY_THP_FOR_FS operates differently.
It is explicitly allowed _only_ for MADV_COLLAPSE and only if the file is
mounted read-only.
So due to:
if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
return orders;
if (((!in_pf || smaps)) && file_thp_enabled(vma))
return orders;
                     | PF | MADV_COLLAPSE | khugepaged |
                     |----|---------------|------------|
large folio fs       |  ✓ |       x       |     x      |
READ_ONLY_THP_FOR_FS |  x |       ✓       |     ✓      |
After this change:
                     | PF | MADV_COLLAPSE | khugepaged |
                     |----|---------------|------------|
large folio fs       |  ✓ |       ✓       |     ?      |
(I hope we're not enabling khugepaged for large folio fs - which shouldn't
be necessary anyway as we try to give them folios on page fault and they
use thp-friendly get_unmapped_area etc. :)
We shouldn't be doing this.
It should remain:
                     | PF | MADV_COLLAPSE | khugepaged |
                     |----|---------------|------------|
large folio fs       |  ✓ |       x       |     x      |
If we're going to remove it, we should first _just remove it_, not
simultaneously increase the scope of what all the MADV_COLLAPSE code is
doing without any confidence in any of it working properly.
And it makes the whole series misleading - you're actually _enabling_ a
feature not (only) _removing_ one.
So let's focus as David suggested on one thing at a time, incrementally.
And let's please try and sort some of this confusing mess out in the code
if at all possible...
>
> Will add a helper, mapping_support_pmd_folio(), for
> mapping_max_folio_order(inode->i_mapping) < PMD_ORDER.
>
> >
> > So yeah probably no to this patch as is :) we should just drop
> > file_thp_enabled()?
>
>
>
> >
> >> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> >> }
> >>
> >> --
> >> 2.43.0
> >>
>
>
> Best Regards,
> Yan, Zi
Cheers, Lorenzo
* Re: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 15:35 ` Zi Yan
0 siblings, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 15:35 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 9:05, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:42:52PM -0400, Zi Yan wrote:
>> After READ_ONLY_THP_FOR_FS is removed, FS either supports large folio or
>> not. folio_split() can be used on a FS with large folio support without
>> worrying about getting a THP on a FS without large folio support.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 25 ++-----------------------
>> mm/truncate.c | 8 ++++----
>> 2 files changed, 6 insertions(+), 27 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 1258fa37e85b..171de8138e98 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -389,27 +389,6 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
>> return split_huge_page_to_list_to_order(page, NULL, new_order);
>> }
>>
>> -/**
>> - * try_folio_split_to_order() - try to split a @folio at @page to @new_order
>> - * using non uniform split.
>> - * @folio: folio to be split
>> - * @page: split to @new_order at the given page
>> - * @new_order: the target split order
>> - *
>> - * Try to split a @folio at @page using non uniform split to @new_order, if
>> - * non uniform split is not supported, fall back to uniform split. After-split
>> - * folios are put back to LRU list. Use min_order_for_split() to get the lower
>> - * bound of @new_order.
>> - *
>> - * Return: 0 - split is successful, otherwise split failed.
>> - */
>> -static inline int try_folio_split_to_order(struct folio *folio,
>> - struct page *page, unsigned int new_order)
>> -{
>> - if (folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM))
>> - return split_huge_page_to_order(&folio->page, new_order);
>> - return folio_split(folio, new_order, page, NULL);
>> -}
>> static inline int split_huge_page(struct page *page)
>> {
>> return split_huge_page_to_list_to_order(page, NULL, 0);
>> @@ -641,8 +620,8 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
>> return -EINVAL;
>> }
>
> Hmm there's nothing in the comment or obvious jumping out at me to explain why
> this is R/O thp file-backed only?
>
> This seems like an arbitrary helper that just figures out whether it can split
> using the non-uniform approach.
>
> I think you need to explain more in the commit message why this was R/O thp
> file-backed only, maybe mention some commits that added it etc., I had a quick
> glance and even that didn't indicate why.
>
> I look at folio_check_splittable() for instance and see:
>
> ...
>
> } else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> !mapping_large_folio_support(folio->mapping)) {
> ...
> return -EINVAL;
> }
> }
>
> ...
>
> if ((split_type == SPLIT_TYPE_NON_UNIFORM || new_order) && folio_test_swapcache(folio)) {
> return -EINVAL;
> }
>
> if (is_huge_zero_folio(folio))
> return -EINVAL;
>
> if (folio_test_writeback(folio))
> return -EBUSY;
>
> return 0;
> }
>
> None of which suggest that you couldn't have non-uniform splits for other
> cases? This at least needs some more explanation/justification in the
> commit msg.
Sure.
When READ_ONLY_THP_FOR_FS was present, a PMD-sized pagecache folio could appear
on a FS without large folio support after khugepaged or madvise(MADV_COLLAPSE)
created it. During truncate_inode_partial_folio(), such a folio had to be split
all the way to order-0, since the FS could not handle the various intermediate
orders a non-uniform split produces. try_folio_split_to_order() handled this by
calling folio_check_splittable(..., SPLIT_TYPE_NON_UNIFORM) to detect a folio
created by READ_ONLY_THP_FOR_FS on a FS without large folio support, falling
back to a uniform split in that case. Now that READ_ONLY_THP_FOR_FS is removed,
all large pagecache folios are created on FSes with large folio support, so the
helper is no longer needed and every large pagecache folio can be split non
uniformly.
>
>>
>> -static inline int try_folio_split_to_order(struct folio *folio,
>> - struct page *page, unsigned int new_order)
>> +static inline int folio_split(struct folio *folio, unsigned int new_order,
>> + struct page *page, struct list_head *list);
>
> Yeah as Lance pointed out that ; probably shouldn't be there :)
I was trying to fix a folio_split() signature mismatch locally and did a simple
copy-paste from above. Will fix it.
>
>> {
>> VM_WARN_ON_ONCE_FOLIO(1, folio);
>> return -EINVAL;
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 2931d66c16d0..6973b05ec4b8 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -177,7 +177,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
>> return 0;
>> }
>>
>> -static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
>> +static int folio_split_or_unmap(struct folio *folio, struct page *split_at,
>> unsigned long min_order)
>
> I'm not sure the removal of 'try_' is warranted in general in this patch,
> as it seems like it's not guaranteed any of these will succeed? Or am I
> wrong?
I added explanation above.
To summarize, without READ_ONLY_THP_FOR_FS, large pagecache folios can only
appear on FSes supporting large folios, so they can all be split non uniformly.
Trying a non-uniform split and then falling back to a uniform split is no
longer needed. If a non-uniform split fails, a uniform split would fail too,
barring race conditions like an elevated folio refcount.
BTW, sashiko asked if this breaks large shmem swapcache folio split[1].
The answer is no, since large shmem swapcache folio split is not supported yet.
[1] https://sashiko.dev/#/patchset/20260327014255.2058916-1-ziy%40nvidia.com?patch=11647
>
>> {
>> enum ttu_flags ttu_flags =
>> @@ -186,7 +186,7 @@ static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
>> TTU_IGNORE_MLOCK;
>> int ret;
>>
>> - ret = try_folio_split_to_order(folio, split_at, min_order);
>> + ret = folio_split(folio, min_order, split_at, NULL);
>>
>> /*
>> * If the split fails, unmap the folio, so it will be refaulted
>> @@ -252,7 +252,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
>>
>> min_order = mapping_min_folio_order(folio->mapping);
>> split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
>> - if (!try_folio_split_or_unmap(folio, split_at, min_order)) {
>> + if (!folio_split_or_unmap(folio, split_at, min_order)) {
>> /*
>> * try to split at offset + length to make sure folios within
>> * the range can be dropped, especially to avoid memory waste
>> @@ -279,7 +279,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
>> /* make sure folio2 is large and does not change its mapping */
>> if (folio_test_large(folio2) &&
>> folio2->mapping == folio->mapping)
>> - try_folio_split_or_unmap(folio2, split_at2, min_order);
>> + folio_split_or_unmap(folio2, split_at2, min_order);
>>
>> folio_unlock(folio2);
sashiko asked whether the folio containing split_at2 can be split by a
parallel thread, which would make splitting folio2 at split_at2
problematic[1]. This is handled in __folio_split(), which has a
folio != page_folio(split_at) check.
[1] https://sashiko.dev/#/patchset/20260327014255.2058916-1-ziy%40nvidia.com?patch=11647
Best Regards,
Yan, Zi
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 15:29 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 15:43 ` Zi Yan
2026-03-27 16:08 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 15:43 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 11:29, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 11:12:46AM -0400, Zi Yan wrote:
>> On 27 Mar 2026, at 8:42, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Thu, Mar 26, 2026 at 09:42:50PM -0400, Zi Yan wrote:
>>>> Replace it with a check on the max folio order of the file's address space
>>>> mapping, making sure PMD_ORDER is supported.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>> mm/huge_memory.c | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c7873dbdc470..1da1467328a3 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>> {
>>>> struct inode *inode;
>>>>
>>>> - if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
>>>> - return false;
>>>> -
>>>> if (!vma->vm_file)
>>>> return false;
>>>>
>>>> @@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>> if (IS_ANON_FILE(inode))
>>>> return false;
>>>>
>>>> + if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>>>> + return false;
>>>> +
>>>
>>> At this point I think this should be a separate function quite honestly and
>>> share it with 2/10's use, and then you can put the comment in here re: anon
>>> shmem etc.
>>>
>>> Though that won't apply here of course as shmem_allowable_huge_orders() would
>>> have been invoked :)
>>>
>>> But no harm in refactoring it anyway, and the repetitive < PMD_ORDER stuff is
>>> unfortunate.
>>>
>>> Buuut having said that is this right actually?
>>>
>>> Because we have:
>>>
>>> if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
>>> return orders;
>>>
>>> Above it, and now you're enabling huge folio file systems to do non-page fault
>>> THP and that's err... isn't that quite a big change?
>>
>> That is what READ_ONLY_THP_FOR_FS does, creating THPs after page faults, right?
>> This patchset changes the condition from all FSes to FSes with large folio
>> support.
>
> No, READ_ONLY_THP_FOR_FS operates differently.
>
> It explicitly _only_ is allowed for MADV_COLLAPSE and only if the file is
> mounted read-only.
>
> So due to:
>
> if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
> return orders;
>
> if (((!in_pf || smaps)) && file_thp_enabled(vma))
> return orders;
>
>                      | PF | MADV_COLLAPSE | khugepaged |
>                      |----|---------------|------------|
> large folio fs       |  ✓ |       x       |     x      |
> READ_ONLY_THP_FOR_FS |  x |       ✓       |     ✓      |
>
> After this change:
>
>                      | PF | MADV_COLLAPSE | khugepaged |
>                      |----|---------------|------------|
> large folio fs       |  ✓ |       ✓       |     ?      |
>
> (I hope we're not enabling khugepaged for large folio fs - which shouldn't
> be necessary anyway as we try to give them folios on page fault and they
> use thp-friendly get_unmapped_area etc. :)
>
> We shouldn't be doing this.
>
> It should remain:
>
>                      | PF | MADV_COLLAPSE | khugepaged |
>                      |----|---------------|------------|
> large folio fs       |  ✓ |       x       |     x      |
>
> If we're going to remove it, we should first _just remove it_, not
> simultaneously increase the scope of what all the MADV_COLLAPSE code is
> doing without any confidence in any of it working properly.
>
> And it makes the whole series misleading - you're actually _enabling_ a
> feature not (only) _removing_ one.
That is what my RFC patch does, but David and willy told me to do this[1]. :)
IIUC, with READ_ONLY_THP_FOR_FS, FSes with large folio support will
get THPs via MADV_COLLAPSE or khugepaged. So removing the code like I
did in the RFC would cause regressions.
I guess I need to rename the series to avoid confusion. How about:
"Remove read-only THP support for FSes without large folio support"?
[1] https://lore.kernel.org/all/7382046f-7c58-4a3e-ab34-b2704355b7d5@kernel.org/
>
> So let's focus as David suggested on one thing at a time, incrementally.
>
> And let's please try and sort some of this confusing mess out in the code
> if at all possible...
>
>>
>> Will add a helper, mapping_support_pmd_folio(), for
>> mapping_max_folio_order(inode->i_mapping) < PMD_ORDER.
>>
>>>
>>> So yeah probably no to this patch as is :) we should just drop
>>> file_thp_enabled()?
>>
>>
>>
>>>
>>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>>>> }
>>>>
>>>> --
>>>> 2.43.0
>>>>
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 15:43 ` Zi Yan
@ 2026-03-27 16:08 ` Lorenzo Stoakes (Oracle)
2026-03-27 16:12 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 16:08 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 11:43:57AM -0400, Zi Yan wrote:
> On 27 Mar 2026, at 11:29, Lorenzo Stoakes (Oracle) wrote:
>
> > On Fri, Mar 27, 2026 at 11:12:46AM -0400, Zi Yan wrote:
> >> On 27 Mar 2026, at 8:42, Lorenzo Stoakes (Oracle) wrote:
> >>
> >>> On Thu, Mar 26, 2026 at 09:42:50PM -0400, Zi Yan wrote:
> >>>> Replace it with a check on the max folio order of the file's address space
> >>>> mapping, making sure PMD_ORDER is supported.
> >>>>
> >>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>> ---
> >>>> mm/huge_memory.c | 6 +++---
> >>>> 1 file changed, 3 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index c7873dbdc470..1da1467328a3 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >>>> {
> >>>> struct inode *inode;
> >>>>
> >>>> - if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
> >>>> - return false;
> >>>> -
> >>>> if (!vma->vm_file)
> >>>> return false;
> >>>>
> >>>> @@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >>>> if (IS_ANON_FILE(inode))
> >>>> return false;
> >>>>
> >>>> + if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
> >>>> + return false;
> >>>> +
> >>>
> >>> At this point I think this should be a separate function quite honestly and
> >>> share it with 2/10's use, and then you can put the comment in here re: anon
> >>> shmem etc.
> >>>
> >>> Though that won't apply here of course as shmem_allowable_huge_orders() would
> >>> have been invoked :)
> >>>
> >>> But no harm in refactoring it anyway, and the repetitive < PMD_ORDER stuff is
> >>> unfortunate.
> >>>
> >>> Buuut having said that is this right actually?
> >>>
> >>> Because we have:
> >>>
> >>> if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
> >>> return orders;
> >>>
> >>> Above it, and now you're enabling huge folio file systems to do non-page fault
> >>> THP and that's err... isn't that quite a big change?
> >>
> >> That is what READ_ONLY_THP_FOR_FS does, creating THPs after page faults, right?
> >> This patchset changes the condition from all FSes to FSes with large folio
> >> support.
> >
> > No, READ_ONLY_THP_FOR_FS operates differently.
> >
> > It is explicitly _only_ allowed for MADV_COLLAPSE and only if the file is
> > mounted read-only.
> >
> > So due to:
> >
> > if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
> > return orders;
> >
> > if (((!in_pf || smaps)) && file_thp_enabled(vma))
> > return orders;
> >
> > | PF | MADV_COLLAPSE | khugepaged |
> > |-----------|---------------|------------|
> > large folio fs | ✓ | x | x |
> > READ_ONLY_THP_FOR_FS | x | ✓ | ✓ |
> >
> > After this change:
> >
> > | PF | MADV_COLLAPSE | khugepaged |
> > |-----------|---------------|------------|
> > large folio fs | ✓ | ✓ | ? |
> >
> > (I hope we're not enabling khugepaged for large folio fs - which shouldn't
> > be necessary anyway as we try to give them folios on page fault and they
> > use thp-friendly get_unmapped_area etc. :)
> >
> > We shouldn't be doing this.
> >
> > It should remain:
> >
> > | PF | MADV_COLLAPSE | khugepaged |
> > |-----------|---------------|------------|
> > large folio fs | ✓ | x | x |
> >
> > If we're going to remove it, we should first _just remove it_, not
> > simultaneously increase the scope of what all the MADV_COLLAPSE code is
> > doing without any confidence in any of it working properly.
> >
> > And it makes the whole series misleading - you're actually _enabling_ a
> > feature not (only) _removing_ one.
>
> That is what my RFC patch does, but David and willy told me to do this. :)
> IIUC, with READ_ONLY_THP_FOR_FS, FSes with large folio support will
> get THP via MADV_COLLAPSE or khugepaged. So removing the code like I
> did in RFC would cause regressions.
OK I think we're dealing with a union of the two states here.
READ_ONLY_THP_FOR_FS is separate from large folio support, as checked by
file_thp_enabled():
static inline bool file_thp_enabled(struct vm_area_struct *vma)
{
struct inode *inode;
if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
return false;
if (!vma->vm_file)
return false;
inode = file_inode(vma->vm_file);
if (IS_ANON_FILE(inode))
return false;
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
So actually:
| PF | MADV_COLLAPSE | khugepaged |
|-----------|---------------|------------|
large folio fs | ✓ | x | x |
READ_ONLY_THP_FOR_FS | x | ✓ | ✓ |
both! | ✓ | ✓ | ✓ |
(Where it's implied it's a read-only mapping, obviously, for the latter two
cases.)
Now without READ_ONLY_THP_FOR_FS you're going to:
| PF | MADV_COLLAPSE | khugepaged |
|-----------|---------------|------------|
large folio fs | ✓ | x | x |
large folio + r/o | ✓ | ✓ | ✓ |
And intentionally leaving behind the 'not large folio fs, r/o' case because
those file systems need to implement large folio support.
I guess we'll regress those users but we don't care?
I do think all this needs to be spelled out in the commit message though as it's
subtle.
Turns out this PitA config option is going to kick and scream a bit first before
it goes...
>
> I guess I need to rename the series to avoid confusion. How about:
>
> Remove read-only THP support for FSes without large folio support.
Yup that'd be better :)
Cheers, Lorenzo
>
> [1] https://lore.kernel.org/all/7382046f-7c58-4a3e-ab34-b2704355b7d5@kernel.org/
>
> >
> > So let's focus as David suggested on one thing at a time, incrementally.
> >
> > And let's please try and sort some of this confusing mess out in the code
> > if at all possible...
> >
> >>
> >> Will add a helper, mapping_support_pmd_folio(), for
> >> mapping_max_folio_order(inode->i_mapping) < PMD_ORDER.
> >>
> >>>
> >>> So yeah probably no to this patch as is :) we should just drop
> >>> file_thp_enabled()?
> >>
> >>
> >>
> >>>
> >>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> >>>> }
> >>>>
> >>>> --
> >>>> 2.43.0
> >>>>
> >>
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> > Cheers, Lorenzo
>
>
> Best Regards,
> Yan, Zi
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 16:08 ` Lorenzo Stoakes (Oracle)
@ 2026-03-27 16:12 ` Zi Yan
2026-03-27 16:14 ` Lorenzo Stoakes (Oracle)
2026-03-29 4:07 ` WANG Rui
0 siblings, 2 replies; 76+ messages in thread
From: Zi Yan @ 2026-03-27 16:12 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 12:08, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 11:43:57AM -0400, Zi Yan wrote:
>> On 27 Mar 2026, at 11:29, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Fri, Mar 27, 2026 at 11:12:46AM -0400, Zi Yan wrote:
>>>> On 27 Mar 2026, at 8:42, Lorenzo Stoakes (Oracle) wrote:
>>>>
>>>>> On Thu, Mar 26, 2026 at 09:42:50PM -0400, Zi Yan wrote:
>>>>>> Replace it with a check on the max folio order of the file's address space
>>>>>> mapping, making sure PMD_ORDER is supported.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>> ---
>>>>>> mm/huge_memory.c | 6 +++---
>>>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index c7873dbdc470..1da1467328a3 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -89,9 +89,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>>>> {
>>>>>> struct inode *inode;
>>>>>>
>>>>>> - if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
>>>>>> - return false;
>>>>>> -
>>>>>> if (!vma->vm_file)
>>>>>> return false;
>>>>>>
>>>>>> @@ -100,6 +97,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>>>> if (IS_ANON_FILE(inode))
>>>>>> return false;
>>>>>>
>>>>>> + if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>>>>>> + return false;
>>>>>> +
>>>>>
>>>>> At this point I think this should be a separate function quite honestly and
>>>>> share it with 2/10's use, and then you can put the comment in here re: anon
>>>>> shmem etc.
>>>>>
>>>>> Though that won't apply here of course as shmem_allowable_huge_orders() would
>>>>> have been invoked :)
>>>>>
>>>>> But no harm in refactoring it anyway, and the repetitive < PMD_ORDER stuff is
>>>>> unfortunate.
>>>>>
>>>>> Buuut having said that is this right actually?
>>>>>
>>>>> Because we have:
>>>>>
>>>>> if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
>>>>> return orders;
>>>>>
>>>>> Above it, and now you're enabling huge folio file systems to do non-page fault
>>>>> THP and that's err... isn't that quite a big change?
>>>>
>>>> That is what READ_ONLY_THP_FOR_FS does, creating THPs after page faults, right?
>>>> This patchset changes the condition from all FSes to FSes with large folio
>>>> support.
>>>
>>> No, READ_ONLY_THP_FOR_FS operates differently.
>>>
>>> It is explicitly _only_ allowed for MADV_COLLAPSE and only if the file is
>>> mounted read-only.
>>>
>>> So due to:
>>>
>>> if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
>>> return orders;
>>>
>>> if (((!in_pf || smaps)) && file_thp_enabled(vma))
>>> return orders;
>>>
>>> | PF | MADV_COLLAPSE | khugepaged |
>>> |-----------|---------------|------------|
>>> large folio fs | ✓ | x | x |
>>> READ_ONLY_THP_FOR_FS | x | ✓ | ✓ |
>>>
>>> After this change:
>>>
>>> | PF | MADV_COLLAPSE | khugepaged |
>>> |-----------|---------------|------------|
>>> large folio fs | ✓ | ✓ | ? |
>>>
>>> (I hope we're not enabling khugepaged for large folio fs - which shouldn't
>>> be necessary anyway as we try to give them folios on page fault and they
>>> use thp-friendly get_unmapped_area etc. :)
>>>
>>> We shouldn't be doing this.
>>>
>>> It should remain:
>>>
>>> | PF | MADV_COLLAPSE | khugepaged |
>>> |-----------|---------------|------------|
>>> large folio fs | ✓ | x | x |
>>>
>>> If we're going to remove it, we should first _just remove it_, not
>>> simultaneously increase the scope of what all the MADV_COLLAPSE code is
>>> doing without any confidence in any of it working properly.
>>>
>>> And it makes the whole series misleading - you're actually _enabling_ a
>>> feature not (only) _removing_ one.
>>
>> That is what my RFC patch does, but David and willy told me to do this. :)
>> IIUC, with READ_ONLY_THP_FOR_FS, FSes with large folio support will
>> get THP via MADV_COLLAPSE or khugepaged. So removing the code like I
>> did in RFC would cause regressions.
>
> OK I think we're dealing with a union of the two states here.
>
> READ_ONLY_THP_FOR_FS is separate from large folio support, as checked by
> file_thp_enabled():
>
> static inline bool file_thp_enabled(struct vm_area_struct *vma)
> {
> struct inode *inode;
>
> if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
> return false;
>
> if (!vma->vm_file)
> return false;
>
> inode = file_inode(vma->vm_file);
>
> if (IS_ANON_FILE(inode))
> return false;
>
> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
> So actually:
>
> | PF | MADV_COLLAPSE | khugepaged |
> |-----------|---------------|------------|
> large folio fs | ✓ | x | x |
> READ_ONLY_THP_FOR_FS | x | ✓ | ✓ |
> both! | ✓ | ✓ | ✓ |
>
> (Where it's implied it's a read-only mapping, obviously, for the latter two
> cases.)
>
> Now without READ_ONLY_THP_FOR_FS you're going to:
>
> | PF | MADV_COLLAPSE | khugepaged |
> |-----------|---------------|------------|
> large folio fs | ✓ | x | x |
> large folio + r/o | ✓ | ✓ | ✓ |
>
> And intentionally leaving behind the 'not large folio fs, r/o' case because
> those file systems need to implement large folio support.
>
> I guess we'll regress those users but we don't care?
Yes. This also motivates FSes without large folio support to add large folio
support instead of relying on the READ_ONLY_THP_FOR_FS hack.
>
> I do think all this needs to be spelled out in the commit message though as it's
> subtle.
>
> Turns out this PitA config option is going to kick and scream a bit first before
> it goes...
Sure. I will shamelessly steal your tables. Thank you for the contribution. ;)
>
>>
>> I guess I need to rename the series to avoid confusion. How about:
>>
>> Remove read-only THP support for FSes without large folio support.
>
> Yup that'd be better :)
>
> Cheers, Lorenzo
>
>>
>> [1] https://lore.kernel.org/all/7382046f-7c58-4a3e-ab34-b2704355b7d5@kernel.org/
>>
>>>
>>> So let's focus as David suggested on one thing at a time, incrementally.
>>>
>>> And let's please try and sort some of this confusing mess out in the code
>>> if at all possible...
>>>
>>>>
>>>> Will add a helper, mapping_support_pmd_folio(), for
>>>> mapping_max_folio_order(inode->i_mapping) < PMD_ORDER.
>>>>
>>>>>
>>>>> So yeah probably no to this patch as is :) we should just drop
>>>>> file_thp_enabled()?
>>>>
>>>>
>>>>
>>>>>
>>>>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>>>>>> }
>>>>>>
>>>>>> --
>>>>>> 2.43.0
>>>>>>
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>> Cheers, Lorenzo
>>
>>
>> Best Regards,
>> Yan, Zi
Best Regards,
Yan, Zi
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 16:12 ` Zi Yan
@ 2026-03-27 16:14 ` Lorenzo Stoakes (Oracle)
2026-03-29 4:07 ` WANG Rui
1 sibling, 0 replies; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-27 16:14 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 12:12:04PM -0400, Zi Yan wrote:
> On 27 Mar 2026, at 12:08, Lorenzo Stoakes (Oracle) wrote:
> > So actually:
> >
> > | PF | MADV_COLLAPSE | khugepaged |
> > |-----------|---------------|------------|
> > large folio fs | ✓ | x | x |
> > READ_ONLY_THP_FOR_FS | x | ✓ | ✓ |
> > both! | ✓ | ✓ | ✓ |
> >
> > (Where it's impllied it's a read-only mapping obviously for the later two
> > cases.)
> >
> > Now without READ_ONLY_THP_FOR_FS you're going to:
> >
> > | PF | MADV_COLLAPSE | khugepaged |
> > |-----------|---------------|------------|
> > large folio fs | ✓ | x | x |
> > large folio + r/o | ✓ | ✓ | ✓ |
> >
> > And intentionally leaving behind the 'not large folio fs, r/o' case because
> > those file systems need to implement large folio support.
> >
> > I guess we'll regress those users but we don't care?
>
> Yes. This also motivates FSes without large folio support to add large folio
> support instead of relying on the READ_ONLY_THP_FOR_FS hack.
Ack that's something I can back :)
>
> >
> > I do think all this needs to be spelled out in the commit message though as it's
> > subtle.
> >
> > Turns out this PitA config option is going to kick and scream a bit first before
> > it goes...
>
> Sure. I will shamelessly steal your tables. Thank you for the contribution. ;)
>
Haha good I love to spread ASCII art :)
Cheers, Lorenzo
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 15:00 ` Zi Yan
@ 2026-03-27 16:22 ` Lance Yang
2026-03-27 16:30 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lance Yang @ 2026-03-27 16:22 UTC (permalink / raw)
To: ziy
Cc: baolin.wang, ljs, willy, songliubraving, clm, dsterba, viro,
brauner, jack, akpm, david, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, shuah,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote:
>On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote:
>
>> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
>>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>>>>> also applies to shmem.
>>>>>>>>
>>>>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>>>>
>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>>> ---
>>>>>>>> mm/khugepaged.c | 7 +++++--
>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>>>>> --- a/mm/khugepaged.c
>>>>>>>> +++ b/mm/khugepaged.c
>>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>>>> int nr_none = 0;
>>>>>>>> bool is_shmem = shmem_file(file);
>>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>>> + return SCAN_FAIL;
>>>>>>>
>>>>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>>>>> is similar to anonymous memory. That means it will not call
>>>>>>> mapping_set_large_folios() for anonymous shmem.
>>>>>>>
>>>>>>> So I think the check should be:
>>>>>>>
>>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>> return SCAN_FAIL;
>>>>>>
>>>>>> Hmm but in shmem_init() we have:
>>>>>>
>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>>>>> else
>>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>>>>
>>>>>> /*
>>>>>> * Default to setting PMD-sized THP to inherit the global setting and
>>>>>> * disable all other multi-size THPs.
>>>>>> */
>>>>>> if (!shmem_orders_configured)
>>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>>>>> #endif
>>>>>>
>>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>>>>> shmem_enabled_store() updates that if necessary.
>>>>>>
>>>>>> So we're still fine right?
>>>>>>
>>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>>>>> __shmem_get_inode() which has:
>>>>>>
>>>>>> if (sbinfo->huge)
>>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>>
>>>>>> Shared for both anon shmem and tmpfs-style shmem.
>>>>>>
>>>>>> So I think it's fine as-is.
>>>>>
>>>>> I'm afraid not. Sorry, I should have been clearer.
>>>>>
>>>>> First, anonymous shmem large order allocation is dynamically controlled via
>>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>>>>> the mTHP interfaces
>>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>>>>
>>>>> This means that during anonymous shmem initialization, these interfaces
>>>>> might be set to 'never'. so it will not call mapping_set_large_folios()
>>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>>>>
>>>>> Even if shmem large order allocation is subsequently enabled via the
>>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>>>>> again.
>>>>
>>>> I see your point, oh this is all a bit of a mess...
>>>>
>>>> It feels like entirely the wrong abstraction anyway, since at best you're
>>>> getting a global 'is enabled'.
>>>>
>>>> I guess what happened before was we'd never call into this with ! r/o thp for fs
>>>> && ! is_shmem.
>>>
>>> Right.
>>>
>>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
>>>>> suggestion is correct I think actually.
>>>>
>>>> I do hate:
>>>>
>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>
>>>> As a bit of code though. It's horrible.
>>>
>>> Indeed.
>>>
>>>> Let's abstract that...
>>>>
>>>> It'd be nice if we could find a way to clean things up in the lead up to changes
>>>> in series like this instead of sticking with the mess, but I guess since it
>>>> mostly removes stuff that's ok for now.
>>>
>>> I think this check can be removed from this patch.
>>>
>>> During the khugepaged's scan, it will call thp_vma_allowable_order() to
>>> check if the VMA is allowed to collapse into a PMD.
>>>
>>> Specifically, within the call chain thp_vma_allowable_order() ->
>>> __thp_vma_allowable_orders(), shmem is checked via
>>> shmem_allowable_huge_orders(), while other FSes are checked via
>>> file_thp_enabled().
>
>But for the madvise(MADV_COLLAPSE) case, IIRC, it ignores the shmem huge config
>and can perform collapse anyway. This means without !is_shmem the check
>will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since
Right. That will break MADV_COLLAPSE, IIUC.
For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the
TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by
whether the inode mapping got mapping_set_large_folios() at creation
time.
Using mmap(MAP_SHARED | MAP_ANONYMOUS):
- create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never
- collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always
With the !is_shmem guard, collapse succeeds. Without it, the same setup
fails with -EINVAL.
Thanks,
Lance
>I was in that TVA_FORCED_COLLAPSE email thread but do not remember
>everything there.
>
>
>>
>> It sucks not to have an assert. Maybe in that case make it a
>> VM_WARN_ON_ONCE().
>
>Will do that as I replied to David already.
>
>>
>> I hate that you're left tracing things back like that...
>>
>>>
>>> For those other filesystems, Patch 5 has already added the following check,
>>> which I think is sufficient to filter out those FSes that do not support
>>> large folios:
>>>
>>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>>> return false;
>>
>> 2 < 5, we won't tolerate bisection hazards.
>>
>>>
>>>
>>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>>>>> check for allowed large orders, rather than relying on
>>>>> mapping_max_folio_order().
>>>>>
>>>>> The mapping_max_folio_order() is intended to control large page allocation
>>>>> only for tmpfs mounts. Therefore, I find the current code confusing and
>>>>> think it needs to be fixed:
>>>>>
>>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>
>
>Hi Baolin,
>
>Do you want to send a fix for this?
>
>Also I wonder how I can distinguish between anonymous shmem code and tmpfs code.
>I thought they were the same thing except that they have different user interfaces,
>but it seems that I was wrong.
>
>
>Best Regards,
>Yan, Zi
>
>
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 16:22 ` Lance Yang
@ 2026-03-27 16:30 ` Zi Yan
2026-03-28 2:29 ` Baolin Wang
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-27 16:30 UTC (permalink / raw)
To: Lance Yang
Cc: baolin.wang, ljs, willy, songliubraving, clm, dsterba, viro,
brauner, jack, akpm, david, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko, shuah,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 27 Mar 2026, at 12:22, Lance Yang wrote:
> On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote:
>> On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>>>>>> also applies to shmem.
>>>>>>>>>
>>>>>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>>>> ---
>>>>>>>>> mm/khugepaged.c | 7 +++++--
>>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>>>>>> --- a/mm/khugepaged.c
>>>>>>>>> +++ b/mm/khugepaged.c
>>>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>>>>> int nr_none = 0;
>>>>>>>>> bool is_shmem = shmem_file(file);
>>>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>>>> + return SCAN_FAIL;
>>>>>>>>
>>>>>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>>>>>> is similar to anonymous memory. That means it will not call
>>>>>>>> mapping_set_large_folios() for anonymous shmem.
>>>>>>>>
>>>>>>>> So I think the check should be:
>>>>>>>>
>>>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>>> return SCAN_FAIL;
>>>>>>>
>>>>>>> Hmm but in shmem_init() we have:
>>>>>>>
>>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>>>>>> else
>>>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>>>>>
>>>>>>> /*
>>>>>>> * Default to setting PMD-sized THP to inherit the global setting and
>>>>>>> * disable all other multi-size THPs.
>>>>>>> */
>>>>>>> if (!shmem_orders_configured)
>>>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>>>>>> #endif
>>>>>>>
>>>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>>>>>> shmem_enabled_store() updates that if necessary.
>>>>>>>
>>>>>>> So we're still fine right?
>>>>>>>
>>>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>>>>>> __shmem_get_inode() which has:
>>>>>>>
>>>>>>> if (sbinfo->huge)
>>>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>>>
>>>>>>> Shared for both anon shmem and tmpfs-style shmem.
>>>>>>>
>>>>>>> So I think it's fine as-is.
>>>>>>
>>>>>> I'm afraid not. Sorry, I should have been clearer.
>>>>>>
>>>>>> First, anonymous shmem large order allocation is dynamically controlled via
>>>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>>>>>> the mTHP interfaces
>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>>>>>
>>>>>> This means that during anonymous shmem initialization, these interfaces
>>>>>> might be set to 'never'. so it will not call mapping_set_large_folios()
>>>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>>>>>
>>>>>> Even if shmem large order allocation is subsequently enabled via the
>>>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>>>>>> again.
>>>>>
>>>>> I see your point, oh this is all a bit of a mess...
>>>>>
>>>>> It feels like entirely the wrong abstraction anyway, since at best you're
>>>>> getting a global 'is enabled'.
>>>>>
>>>>> I guess what happened before was we'd never call into this with ! r/o thp for fs
>>>>> && ! is_shmem.
>>>>
>>>> Right.
>>>>
>>>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
>>>>> suggestion is correct I think actually.
>>>>>
>>>>> I do hate:
>>>>>
>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>
>>>>> As a bit of code though. It's horrible.
>>>>
>>>> Indeed.
>>>>
>>>>> Let's abstract that...
>>>>>
>>>>> It'd be nice if we could find a way to clean things up in the lead up to changes
>>>>> in series like this instead of sticking with the mess, but I guess since it
>>>>> mostly removes stuff that's ok for now.
>>>>
>>>> I think this check can be removed from this patch.
>>>>
>>>> During the khugepaged's scan, it will call thp_vma_allowable_order() to
>>>> check if the VMA is allowed to collapse into a PMD.
>>>>
>>>> Specifically, within the call chain thp_vma_allowable_order() ->
>>>> __thp_vma_allowable_orders(), shmem is checked via
>>>> shmem_allowable_huge_orders(), while other FSes are checked via
>>>> file_thp_enabled().
>>
>> But for the madvise(MADV_COLLAPSE) case, IIRC, it ignores the shmem huge config
>> and can perform collapse anyway. This means without !is_shmem the check
>> will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since
>
> Right. That will break MADV_COLLAPSE, IIUC.
>
> For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the
> TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by
> whether the inode mapping got mapping_set_large_folios() at creation
> time.
>
> Using mmap(MAP_SHARED | MAP_ANONYMOUS):
> - create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never
> - collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always
>
> With the !is_shmem guard, collapse succeeds. Without it, the same setup
> fails with -EINVAL.
Thank you for the confirmation. I will fix it.
>
> Thanks,
> Lance
>
>> I was in that TVA_FORCED_COLLAPSE email thread but do not remember
>> everything there.
>>
>>
>>>
>>> It sucks not to have an assert. Maybe in that case make it a
>>> VM_WARN_ON_ONCE().
>>
>> Will do that as I replied to David already.
>>
>>>
>>> I hate that you're left tracing things back like that...
>>>
>>>>
>>>> For those other filesystems, Patch 5 has already added the following check,
>>>> which I think is sufficient to filter out those FSes that do not support
>>>> large folios:
>>>>
>>>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>>>> return false;
>>>
>>> 2 < 5, we won't tolerate bisection hazards.
>>>
>>>>
>>>>
>>>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>>>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>>>>>> check for allowed large orders, rather than relying on
>>>>>> mapping_max_folio_order().
>>>>>>
>>>>>> The mapping_max_folio_order() is intended to control large page allocation
>>>>>> only for tmpfs mounts. Therefore, I find the current code confusing and
>>>>>> think it needs to be fixed:
>>>>>>
>>>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>>>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>
>>
>> Hi Baolin,
>>
>> Do you want to send a fix for this?
>>
>> Also I wonder how I can distinguish between anonymous shmem code and tmpfs code.
>> I thought they are the same thing except that they have different user interface,
>> but it seems that I was wrong.
>>
>>
>> Best Regards,
>> Yan, Zi
>>
>>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
2026-03-27 16:30 ` Zi Yan
@ 2026-03-28 2:29 ` Baolin Wang
0 siblings, 0 replies; 76+ messages in thread
From: Baolin Wang @ 2026-03-28 2:29 UTC (permalink / raw)
To: Zi Yan, Lance Yang
Cc: ljs, willy, songliubraving, clm, dsterba, viro, brauner, jack,
akpm, david, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
vbabka, rppt, surenb, mhocko, shuah, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On 3/28/26 12:30 AM, Zi Yan wrote:
> On 27 Mar 2026, at 12:22, Lance Yang wrote:
>
>> On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote:
>>> On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote:
>>>
>>>> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>>>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>>>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>>>>>>> also applies to shmem.
>>>>>>>>>>
>>>>>>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>>>>> ---
>>>>>>>>>> mm/khugepaged.c | 7 +++++--
>>>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>>>>>>> --- a/mm/khugepaged.c
>>>>>>>>>> +++ b/mm/khugepaged.c
>>>>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>>>>>> int nr_none = 0;
>>>>>>>>>> bool is_shmem = shmem_file(file);
>>>>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>>>>> + return SCAN_FAIL;
>>>>>>>>>
>>>>>>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>>>>>>> is similar to anonymous memory. That means it will not call
>>>>>>>>> mapping_set_large_folios() for anonymous shmem.
>>>>>>>>>
>>>>>>>>> So I think the check should be:
>>>>>>>>>
>>>>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>>>> return SCAN_FAIL;
>>>>>>>>
>>>>>>>> Hmm but in shmem_init() we have:
>>>>>>>>
>>>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>>>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>>>>>>> else
>>>>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>>>>>>
>>>>>>>> /*
>>>>>>>> * Default to setting PMD-sized THP to inherit the global setting and
>>>>>>>> * disable all other multi-size THPs.
>>>>>>>> */
>>>>>>>> if (!shmem_orders_configured)
>>>>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>>>>>>> #endif
>>>>>>>>
>>>>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>>>>>>> shmem_enabled_store() updates that if necessary.
>>>>>>>>
>>>>>>>> So we're still fine right?
>>>>>>>>
>>>>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>>>>>>> __shmem_get_inode() which has:
>>>>>>>>
>>>>>>>> if (sbinfo->huge)
>>>>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>>>>
>>>>>>>> Shared for both anon shmem and tmpfs-style shmem.
>>>>>>>>
>>>>>>>> So I think it's fine as-is.
>>>>>>>
>>>>>>> I'm afraid not. Sorry, I should have been clearer.
>>>>>>>
>>>>>>> First, anonymous shmem large order allocation is dynamically controlled via
>>>>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>>>>>>> the mTHP interfaces
>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>>>>>>
>>>>>>> This means that during anonymous shmem initialization, these interfaces
>>>>>>> might be set to 'never'. so it will not call mapping_set_large_folios()
>>>>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>>>>>>
>>>>>>> Even if shmem large order allocation is subsequently enabled via the
>>>>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>>>>>>> again.
>>>>>>
>>>>>> I see your point, oh this is all a bit of a mess...
>>>>>>
>>>>>> It feels like entirely the wrong abstraction anyway, since at best you're
>>>>>> getting a global 'is enabled'.
>>>>>>
>>>>>> I guess what happened before was we'd never call into this with ! r/o thp for fs
>>>>>> && ! is_shmem.
>>>>>
>>>>> Right.
>>>>>
>>>>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
>>>>>> suggestion is correct I think actually.
>>>>>>
>>>>>> I do hate:
>>>>>>
>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>
>>>>>> As a bit of code though. It's horrible.
>>>>>
>>>>> Indeed.
>>>>>
>>>>>> Let's abstract that...
>>>>>>
>>>>>> It'd be nice if we could find a way to clean things up in the lead up to changes
>>>>>> in series like this instead of sticking with the mess, but I guess since it
>>>>>> mostly removes stuff that's ok for now.
>>>>>
>>>>> I think this check can be removed from this patch.
>>>>>
>>>>> During the khugepaged's scan, it will call thp_vma_allowable_order() to
>>>>> check if the VMA is allowed to collapse into a PMD.
>>>>>
>>>>> Specifically, within the call chain thp_vma_allowable_order() ->
>>>>> __thp_vma_allowable_orders(), shmem is checked via
>>>>> shmem_allowable_huge_orders(), while other FSes are checked via
>>>>> file_thp_enabled().
>>>
>>> But for madvise(MADV_COLLAPSE) case, IIRC, it ignores shmem huge config
>>> and can perform collapse anyway. This means without !is_shmem the check
>>> will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since
>>
>> Right. That will break MADV_COLLAPSE, IIUC.
>>
>> For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the
>> TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by
>> whether the inode mapping got mapping_set_large_folios() at creation
>> time.
>>
>> Using mmap(MAP_SHARED | MAP_ANONYMOUS):
>> - create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never
>> - collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always
>>
>> With the !is_shmem guard, collapse succeeds. Without it, the same setup
>> fails with -EINVAL.
Right. So my suggestion is that the check should be:

    if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)

or just keep a single VM_WARN_ON_ONCE() here, because I hope
thp_vma_allowable_order() will filter out those FSes that do not
support large folios.
>>> I was in that TVA_FORCED_COLLAPSE email thread but do not remember
>>> everything there.
>>>
>>>
>>>>
>>>> It sucks not to have an assert. Maybe in that case make it a
>>>> VM_WARN_ON_ONCE().
>>>
>>> Will do that as I replied to David already.
>>>
>>>>
>>>> I hate that you're left tracing things back like that...
>>>>
>>>>>
>>>>> For those other filesystems, Patch 5 has already added the following check,
>>>>> which I think is sufficient to filter out those FSes that do not support
>>>>> large folios:
>>>>>
>>>>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>>>>> return false;
>>>>
>>>> 2 < 5, we won't tolerate bisection hazards.
>>>>
>>>>>
>>>>>
>>>>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>>>>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>>>>>>> check for allowed large orders, rather than relying on
>>>>>>> mapping_max_folio_order().
>>>>>>>
>>>>>>> The mapping_max_folio_order() is intended to control large page allocation
>>>>>>> only for tmpfs mounts. Therefore, I find the current code confusing and
>>>>>>> think it needs to be fixed:
>>>>>>>
>>>>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>>>>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>>>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>>
>>>
>>> Hi Baolin,
>>>
>>> Do you want to send a fix for this?
>>>
>>> Also I wonder how I can distinguish between anonymous shmem code and tmpfs code.
>>> I thought they are the same thing except that they have different user interface,
>>> but it seems that I was wrong.
Sure. I can send a patch to make the code clearer.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
2026-03-27 3:33 ` Lance Yang
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
@ 2026-03-28 9:54 ` kernel test robot
2026-03-28 9:54 ` kernel test robot
3 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2026-03-28 9:54 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: oe-kbuild-all, Chris Mason, David Sterba, Alexander Viro,
Christian Brauner, Jan Kara, Andrew Morton,
Linux Memory Management List, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-kselftest
Hi Zi,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on brauner-vfs/vfs.all kdave/for-next linus/master v7.0-rc5 next-20260327]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-remove-READ_ONLY_THP_FOR_FS-Kconfig-option/20260327-142622
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260327014255.2058916-8-ziy%40nvidia.com
patch subject: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260328/202603281704.ILlsxaUM-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260328/202603281704.ILlsxaUM-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603281704.ILlsxaUM-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from include/linux/mm.h:1746,
from include/linux/pid_namespace.h:7,
from include/linux/ptrace.h:10,
from arch/nios2/kernel/asm-offsets.c:10:
>> include/linux/huge_mm.h:625:1: error: expected identifier or '(' before '{' token
625 | {
| ^
>> include/linux/huge_mm.h:623:19: warning: 'folio_split' declared 'static' but never defined [-Wunused-function]
623 | static inline int folio_split(struct folio *folio, unsigned int new_order,
| ^~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:184: arch/nios2/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1337: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:248: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:248: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +625 include/linux/huge_mm.h
e220917fa50774 Luis Chamberlain 2024-08-22 622
9ee18d22957981 Zi Yan 2026-03-26 @623 static inline int folio_split(struct folio *folio, unsigned int new_order,
9ee18d22957981 Zi Yan 2026-03-26 624 struct page *page, struct list_head *list);
7460b470a131f9 Zi Yan 2025-03-07 @625 {
a488ba3124c82d Pankaj Raghav 2025-09-05 626 VM_WARN_ON_ONCE_FOLIO(1, folio);
a488ba3124c82d Pankaj Raghav 2025-09-05 627 return -EINVAL;
7460b470a131f9 Zi Yan 2025-03-07 628 }
7460b470a131f9 Zi Yan 2025-03-07 629
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
` (2 preceding siblings ...)
2026-03-28 9:54 ` kernel test robot
@ 2026-03-28 9:54 ` kernel test robot
3 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2026-03-28 9:54 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
Cc: llvm, oe-kbuild-all, Chris Mason, David Sterba, Alexander Viro,
Christian Brauner, Jan Kara, Andrew Morton,
Linux Memory Management List, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-kselftest
Hi Zi,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on brauner-vfs/vfs.all kdave/for-next linus/master v7.0-rc5 next-20260327]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-remove-READ_ONLY_THP_FOR_FS-Kconfig-option/20260327-142622
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260327014255.2058916-8-ziy%40nvidia.com
patch subject: [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio()
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260328/202603281736.bi9GWnsF-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260328/202603281736.bi9GWnsF-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603281736.bi9GWnsF-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/x86/kernel/asm-offsets.c:14:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
In file included from include/linux/memcontrol.h:21:
In file included from include/linux/mm.h:1746:
>> include/linux/huge_mm.h:625:1: error: expected identifier or '('
625 | {
| ^
1 error generated.
make[3]: *** [scripts/Makefile.build:184: arch/x86/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1337: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:248: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:248: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +625 include/linux/huge_mm.h
e220917fa50774f Luis Chamberlain 2024-08-22 622
9ee18d22957981f Zi Yan 2026-03-26 623 static inline int folio_split(struct folio *folio, unsigned int new_order,
9ee18d22957981f Zi Yan 2026-03-26 624 struct page *page, struct list_head *list);
7460b470a131f98 Zi Yan 2025-03-07 @625 {
a488ba3124c82d7 Pankaj Raghav 2025-09-05 626 VM_WARN_ON_ONCE_FOLIO(1, folio);
a488ba3124c82d7 Pankaj Raghav 2025-09-05 627 return -EINVAL;
7460b470a131f98 Zi Yan 2025-03-07 628 }
7460b470a131f98 Zi Yan 2025-03-07 629
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-27 16:12 ` Zi Yan
2026-03-27 16:14 ` Lorenzo Stoakes (Oracle)
@ 2026-03-29 4:07 ` WANG Rui
2026-03-30 11:17 ` Lorenzo Stoakes (Oracle)
1 sibling, 1 reply; 76+ messages in thread
From: WANG Rui @ 2026-03-29 4:07 UTC (permalink / raw)
To: ziy
Cc: Liam.Howlett, akpm, baohua, baolin.wang, brauner, clm, david,
dev.jain, dsterba, jack, lance.yang, linux-btrfs, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, ljs, mhocko, npache,
rppt, ryan.roberts, shuah, songliubraving, surenb, vbabka, viro,
willy, WANG Rui
Hi Zi,
>> Now without READ_ONLY_THP_FOR_FS you're going to:
>>
>> |                   | PF | MADV_COLLAPSE | khugepaged |
>> |-------------------|----|---------------|------------|
>> | large folio fs    | ✓  | x             | x          |
>> | large folio + r/o | ✓  | ✓             | ✓          |
>>
>> And intentionally leaving behind the 'not large folio fs, r/o' case because
>> those file systems need to implement large folio support.
>>
>> I guess we'll regress those users but we don't care?
>
> Yes. This also motivates FSes without large folio support to add large folio
> support instead of relying on READ_ONLY_THP_FOR_FS hack.
Interesting, thanks for making this feature unconditional.
From my experiments, this is going to be a performance regression.
Before this patch, even when the filesystem (e.g. btrfs without experimental)
didn't support large folios, READ_ONLY_THP_FOR_FS still allowed read-only
file-backed code segments to be collapsed into huge page mappings via khugepaged.
After this patch, FilePmdMapped will always be 0 unless the filesystem supports
large folios up to PMD order, and it doesn't look like that support will arrive
anytime soon [1].
Is there a reason we can't keep this hack while continuing to push filesystems
toward proper large folio support?
I'm currently working on making the ELF loader more THP-friendly by adjusting
the virtual address alignment of read-only code segments [2]. The data shows a
noticeable drop in iTLB misses, especially for programs whose text size is just
slightly larger than PMD_SIZE. That size profile is actually quite common for
real-world binaries when using 2M huge pages. This optimization relies on
READ_ONLY_THP_FOR_FS. If the availability of huge page mappings for code segments
ends up depending on filesystem support, it will be much harder to take advantage
of this in practice. [3]
[1] https://lore.kernel.org/linux-fsdevel/ab2IIwKzmK9qwIlZ@casper.infradead.org/
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
[3] https://lore.kernel.org/linux-fsdevel/20260320160519.80962-1-r@hev.cc/
Thanks,
Rui
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 04/10] fs: remove nr_thps from struct address_space
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
2026-03-27 12:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:00 ` David Hildenbrand (Arm)
@ 2026-03-30 3:06 ` Lance Yang
2 siblings, 0 replies; 76+ messages in thread
From: Lance Yang @ 2026-03-30 3:06 UTC (permalink / raw)
To: ziy
Cc: willy, songliubraving, clm, dsterba, viro, brauner, jack, akpm,
david, ljs, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, shuah,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:49PM -0400, Zi Yan wrote:
>filemap_nr_thps*() are removed, the related field, address_space->nr_thps,
>is no longer needed. Remove it.
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
>---
LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
2026-03-27 1:42 ` [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
2026-03-27 12:50 ` Lorenzo Stoakes (Oracle)
@ 2026-03-30 9:15 ` Lance Yang
1 sibling, 0 replies; 76+ messages in thread
From: Lance Yang @ 2026-03-30 9:15 UTC (permalink / raw)
To: ziy
Cc: willy, songliubraving, clm, dsterba, viro, brauner, jack, akpm,
david, ljs, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, shuah,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Mar 26, 2026 at 09:42:51PM -0400, Zi Yan wrote:
>Without READ_ONLY_THP_FOR_FS, large file-backed folios cannot be created by
>a FS without large folio support. The check is no longer needed.
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
>---
> mm/huge_memory.c | 22 ----------------------
> 1 file changed, 22 deletions(-)
>
>diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>index 1da1467328a3..30eddcbf86f1 100644
>--- a/mm/huge_memory.c
>+++ b/mm/huge_memory.c
>@@ -3732,28 +3732,6 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
> /* order-1 is not supported for anonymous THP. */
> if (new_order == 1)
> return -EINVAL;
While you're at it, could we also collapse this block above into a
single condition:
/* order-1 is not supported for anonymous THP. */
if (folio_test_anon(folio) && new_order == 1)
return -EINVAL;
Just saying. LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
>- } else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
>- if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>- !mapping_large_folio_support(folio->mapping)) {
>- /*
>- * We can always split a folio down to a single page
>- * (new_order == 0) uniformly.
>- *
>- * For any other scenario
>- * a) uniform split targeting a large folio
>- * (new_order > 0)
>- * b) any non-uniform split
>- * we must confirm that the file system supports large
>- * folios.
>- *
>- * Note that we might still have THPs in such
>- * mappings, which is created from khugepaged when
>- * CONFIG_READ_ONLY_THP_FOR_FS is enabled. But in that
>- * case, the mapping does not actually support large
>- * folios properly.
>- */
>- return -EINVAL;
>- }
> }
>
> /*
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-29 4:07 ` WANG Rui
@ 2026-03-30 11:17 ` Lorenzo Stoakes (Oracle)
2026-03-30 14:35 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-30 11:17 UTC (permalink / raw)
To: WANG Rui
Cc: ziy, Liam.Howlett, akpm, baohua, baolin.wang, brauner, clm, david,
dev.jain, dsterba, jack, lance.yang, linux-btrfs, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, mhocko, npache, rppt,
ryan.roberts, shuah, songliubraving, surenb, vbabka, viro, willy
On Sun, Mar 29, 2026 at 12:07:41PM +0800, WANG Rui wrote:
> Hi Zi,
>
> >> Now without READ_ONLY_THP_FOR_FS you're going to:
> >>
> >> |                   | PF | MADV_COLLAPSE | khugepaged |
> >> |-------------------|----|---------------|------------|
> >> | large folio fs    | ✓  | x             | x          |
> >> | large folio + r/o | ✓  | ✓             | ✓          |
> >>
> >> And intentionally leaving behind the 'not large folio fs, r/o' case because
> >> those file systems need to implement large folio support.
> >>
> >> I guess we'll regress those users but we don't care?
> >
> > Yes. This also motivates FSes without large folio support to add large folio
> > support instead of relying on READ_ONLY_THP_FOR_FS hack.
>
> Interesting, thanks for making this feature unconditional.
>
> From my experiments, this is going to be a performance regression.
>
> Before this patch, even when the filesystem (e.g. btrfs without experimental)
> didn't support large folios, READ_ONLY_THP_FOR_FS still allowed read-only
> file-backed code segments to be collapsed into huge page mappings via khugepaged.
>
> After this patch, FilePmdMapped will always be 0 unless the filesystem supports
> large folios up to PMD order, and it doesn't look like that support will arrive
> anytime soon [1].
I think Matthew was being a little sarcastic there ;) but I suppose it's
hinting at the fact they need to get a move on.
>
> Is there a reason we can't keep this hack while continuing to push filesystems
> toward proper large folio support?
IMO - It's time for us to stop allowing filesystems to fail to implement what
mm requires of them, while still providing a hack to improve performance.
Really this hack shouldn't have been there in the first place, but it was a
'putting on notice' that filesystems need to support large folios, which
has been made amply clear to them for some time.
So yes there will be regressions for filesystems which _still_ do not
implement this, I'd suggest you focus on trying to convince them to do so
(or send patches :)
>
> I'm currently working on making the ELF loader more THP-friendly by adjusting
> the virtual address alignment of read-only code segments [2]. The data shows a
> noticeable drop in iTLB misses, especially for programs whose text size is just
> slightly larger than PMD_SIZE. That size profile is actually quite common for
> real-world binaries when using 2M huge pages. This optimization relies on
> READ_ONLY_THP_FOR_FS. If the availability of huge page mappings for code segments
> ends up depending on filesystem support, it will be much harder to take advantage
> of this in practice. [3]
Yeah, again IMO - sorry, but tough.
This is something filesystems need to implement, if they fail to do so,
that's on them.
>
> [1] https://lore.kernel.org/linux-fsdevel/ab2IIwKzmK9qwIlZ@casper.infradead.org/
> [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
> [3] https://lore.kernel.org/linux-fsdevel/20260320160519.80962-1-r@hev.cc/
>
> Thanks,
> Rui
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-30 11:17 ` Lorenzo Stoakes (Oracle)
@ 2026-03-30 14:35 ` Zi Yan
2026-03-30 16:09 ` WANG Rui
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-03-30 14:35 UTC (permalink / raw)
To: WANG Rui, Lorenzo Stoakes (Oracle)
Cc: Liam.Howlett, akpm, baohua, baolin.wang, brauner, clm, david,
dev.jain, dsterba, jack, lance.yang, linux-btrfs, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, mhocko, npache, rppt,
ryan.roberts, shuah, songliubraving, surenb, vbabka, viro, willy
On 30 Mar 2026, at 7:17, Lorenzo Stoakes (Oracle) wrote:
> On Sun, Mar 29, 2026 at 12:07:41PM +0800, WANG Rui wrote:
>> Hi Zi,
>>
>>>> Now without READ_ONLY_THP_FOR_FS you're going to:
>>>>
>>>> |                   | PF | MADV_COLLAPSE | khugepaged |
>>>> |-------------------|----|---------------|------------|
>>>> | large folio fs    | ✓  | x             | x          |
>>>> | large folio + r/o | ✓  | ✓             | ✓          |
>>>>
>>>> And intentionally leaving behind the 'not large folio fs, r/o' case because
>>>> those file systems need to implement large folio support.
>>>>
>>>> I guess we'll regress those users but we don't care?
>>>
>>> Yes. This also motivates FSes without large folio support to add large folio
>>> support instead of relying on READ_ONLY_THP_FOR_FS hack.
>>
>> Interesting, thanks for making this feature unconditional.
>>
>> From my experiments, this is going to be a performance regression.
>>
>> Before this patch, even when the filesystem (e.g. btrfs without experimental)
>> didn't support large folios, READ_ONLY_THP_FOR_FS still allowed read-only
>> file-backed code segments to be collapsed into huge page mappings via khugepaged.
>>
>> After this patch, FilePmdMapped will always be 0 unless the filesystem supports
>> large folios up to PMD order, and it doesn't look like that support will arrive
>> anytime soon [1].
>
> I think Matthew was being a little sarcastic there ;) but I suppose it's
> hinting at the fact they need to get a move on.
>
>>
>> Is there a reason we can't keep this hack while continuing to push filesystems
>> toward proper large folio support?
>
> IMO - It's time for us to stop allowing filesystems to fail to implement what
> mm requires of them, while still providing a hack to improve performance.
>
> Really this hack shouldn't have been there in the first place, but it was a
> 'putting on notice' that filesystems need to support large folios, which
> has been made amply clear to them for some time.
>
> So yes there will be regressions for filesystems which _still_ do not
> implement this, I'd suggest you focus on trying to convince them to do so
> (or send patches :)
>
Thanks to Lorenzo for clarifying the intention of this patchset.
Hi Rui,
READ_ONLY_THP_FOR_FS has been an experimental feature since 2019, which means
the feature can go away at any time.
In addition, Matthew gave a heads-up about its removal [1] several months ago.
We have not heard any objection since.
It seems that you care about btrfs with large folio support. Have you
talked to the btrfs people about the timeline for moving large folio
support out of the experimental state?
[1] https://lore.kernel.org/all/aTJg9vOijOGVTnVt@casper.infradead.org/
>>
>> I'm currently working on making the ELF loader more THP-friendly by adjusting
>> the virtual address alignment of read-only code segments [2]. The data shows a
>> noticeable drop in iTLB misses, especially for programs whose text size is just
>> slightly larger than PMD_SIZE. That size profile is actually quite common for
>> real-world binaries when using 2M huge pages. This optimization relies on
>> READ_ONLY_THP_FOR_FS. If the availability of huge page mappings for code segments
>> ends up depending on filesystem support, it will be much harder to take advantage
>> of this in practice. [3]
>
> Yeah, again IMO - sorry, but tough.
>
> This is something filesystems need to implement, if they fail to do so,
> that's on them.
>
>>
>> [1] https://lore.kernel.org/linux-fsdevel/ab2IIwKzmK9qwIlZ@casper.infradead.org/
>> [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
>> [3] https://lore.kernel.org/linux-fsdevel/20260320160519.80962-1-r@hev.cc/
>>
>> Thanks,
>> Rui
>
> Cheers, Lorenzo
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-30 14:35 ` Zi Yan
@ 2026-03-30 16:09 ` WANG Rui
2026-03-30 16:19 ` Matthew Wilcox
0 siblings, 1 reply; 76+ messages in thread
From: WANG Rui @ 2026-03-30 16:09 UTC (permalink / raw)
To: ziy, ljs
Cc: Liam.Howlett, akpm, baohua, baolin.wang, brauner, clm, david,
dev.jain, dsterba, jack, lance.yang, linux-btrfs, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, mhocko, npache, r, rppt,
ryan.roberts, shuah, songliubraving, surenb, vbabka, viro, willy
Hi Lorenzo and Zi,
>>> Is there a reason we can't keep this hack while continuing to push filesystems
>>> toward proper large folio support?
>>
>> IMO - It's time for us to stop allowing filesystems to fail to implement what
>> mm requires of them, while still providing a hack to improve performance.
>>
>> Really this hack shouldn't have been there in the first place, but it was a
>> 'putting on notice' that filesystems need to support large folios, which
>> has been made amply clear to them for some time.
>>
>> So yes there will be regressions for filesystems which _still_ do not
>> implement this, I'd suggest you focus on trying to convince them to do so
>> (or send patches :)
>>
>
> Thank Lorenzo for clarifying the intention of this patchset.
>
> Hi Rui,
>
> READ_ONLY_THP_FOR_FS has been an experimental feature since 2019, which means
> the feature can go away at any time.
>
> In addition, Matthew gave a heads-up about its removal [1] several months ago.
> We have not heard any objection since.
>
> It seems that you care about btrfs with large folio support. Have you
> talked to the btrfs people about the timeline for moving large folio
> support out of the experimental state?
>
>
> [1] https://lore.kernel.org/all/aTJg9vOijOGVTnVt@casper.infradead.org/
Thanks for the clarification.
I fully agree with the long-term direction here. Ideally this should be
handled by filesystems, and mm has already done a lot of work to make
that possible.
However, in practice it does not look like simply enabling an
experimental feature is sufficient today. I did a quick check of
mapping_max_folio_size() across a few common filesystems, and only XFS
consistently reaches PMD order under both 4K and 16K base pages.
Even ext4 falls short under 16K.
PAGE_SIZE = 4K, PMD_SIZE = 2M

Filesystem                    mapping_max_folio_size   PMD order
------------------------------------------------------------------
ext4                          2M                       yes
btrfs (without experimental)  4K                       no
btrfs (with experimental)     256K                     no
xfs                           2M                       yes

PAGE_SIZE = 16K, PMD_SIZE = 32M

Filesystem                    mapping_max_folio_size   PMD order
------------------------------------------------------------------
ext4                          8M                       no
btrfs (without experimental)  16K                      no
btrfs (with experimental)     256K                     no
xfs                           32M                      yes
Given the diversity of filesystems in use, each one requires dedicated
engineering effort to implement and validate large folio support, and
that assumes both sufficient resources and prioritization on the
filesystem side. Even after support lands, coverage across different
base page sizes and configurations may take additional time to mature.
What I am really concerned about is the transition period: if filesystem
support is not yet broadly ready, while we have already removed the
fallback path, we may end up in a situation where PMD-sized mappings
become effectively unavailable on many systems for some time.
This is not about the long-term direction, but about the timing and
practical readiness.
Thanks,
Rui
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-30 16:09 ` WANG Rui
@ 2026-03-30 16:19 ` Matthew Wilcox
2026-04-01 14:38 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 76+ messages in thread
From: Matthew Wilcox @ 2026-03-30 16:19 UTC (permalink / raw)
To: WANG Rui
Cc: ziy, ljs, Liam.Howlett, akpm, baohua, baolin.wang, brauner, clm,
david, dev.jain, dsterba, jack, lance.yang, linux-btrfs,
linux-fsdevel, linux-kernel, linux-kselftest, linux-mm, mhocko,
npache, rppt, ryan.roberts, shuah, songliubraving, surenb, vbabka,
viro
On Tue, Mar 31, 2026 at 12:09:42AM +0800, WANG Rui wrote:
> Given the diversity of filesystems in use, each one requires dedicated
> engineering effort to implement and validate large folio support, and
> that assumes both sufficient resources and prioritization on the
> filesystem side. Even after support lands, coverage across different
> base page sizes and configurations may take additional time to mature.
>
> What I am really concerned about is the transition period: if filesystem
> support is not yet broadly ready, while we have already removed the
> fallback path, we may end up in a situation where PMD-sized mappings
> become effectively unavailable on many systems for some time.
>
> This is not about the long-term direction, but about the timing and
> practical readiness.
If we leave this fallback in place, we'll never get filesystems to move
forward. It's time to rip off this bandaid; they've got eight months
before the next stable kernel. I've talked to them about it for years
LSFMM 2022: https://lwn.net/Articles/893512/
LSFMM 2023: https://lwn.net/Articles/931794/
LSFMM 2024: https://lwn.net/Articles/973565/
LSFMM 2025: https://lwn.net/Articles/1015320/
(and earlier, but I think I've made my point)
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-03-27 15:05 ` Zi Yan
@ 2026-04-01 14:35 ` David Hildenbrand (Arm)
2026-04-01 15:32 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-01 14:35 UTC (permalink / raw)
To: Zi Yan, Lorenzo Stoakes (Oracle)
Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
On 3/27/26 16:05, Zi Yan wrote:
> On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
>
>> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> There could now be a race between collapsing and the file getting opened
>>> r/w.
>>>
>>> Are we sure that all code can really deal with that?
>>>
>>> IOW, "they already had to handle it separately" -- is that true?
>>> khugepaged would never have collapsed in writable files, so I wonder if
>>> all code paths are prepared for that.
>>
>> OK I guess I overlooked a part of this code... :) see below.
>>
>> This is fine and would be a no-op anyway
>>
>> - if (f->f_mode & FMODE_WRITE) {
>> - /*
>> - * Depends on full fence from get_write_access() to synchronize
>> - * against collapse_file() regarding i_writecount and nr_thps
>> - * updates. Ensures subsequent insertion of THPs into the page
>> - * cache will fail.
>> - */
>> - if (filemap_nr_thps(inode->i_mapping)) {
>>
>> But this:
>>
>> - if (!is_shmem) {
>> - filemap_nr_thps_inc(mapping);
>> - /*
>> - * Paired with the fence in do_dentry_open() -> get_write_access()
>> - * to ensure i_writecount is up to date and the update to nr_thps
>> - * is visible. Ensures the page cache will be truncated if the
>> - * file is opened writable.
>> - */
>> - smp_mb();
>>
>> We can drop barrier
>>
>> - if (inode_is_open_for_write(mapping->host)) {
>> - result = SCAN_FAIL;
>>
>> But this is a functional change!
>>
>> Yup missed this.
>
> But I added
>
> + if (!is_shmem && inode_is_open_for_write(mapping->host))
> + result = SCAN_FAIL;
>
> That keeps the original bail out, right?
Independent of that, are we sure that the possible race we allow is ok?
--
Cheers,
David
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-03-30 16:19 ` Matthew Wilcox
@ 2026-04-01 14:38 ` David Hildenbrand (Arm)
2026-04-01 14:53 ` Darrick J. Wong
0 siblings, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-01 14:38 UTC (permalink / raw)
To: Matthew Wilcox, WANG Rui
Cc: ziy, ljs, Liam.Howlett, akpm, baohua, baolin.wang, brauner, clm,
dev.jain, dsterba, jack, lance.yang, linux-btrfs, linux-fsdevel,
linux-kernel, linux-kselftest, linux-mm, mhocko, npache, rppt,
ryan.roberts, shuah, songliubraving, surenb, vbabka, viro
On 3/30/26 18:19, Matthew Wilcox wrote:
> On Tue, Mar 31, 2026 at 12:09:42AM +0800, WANG Rui wrote:
>> Given the diversity of filesystems in use, each one requires dedicated
>> engineering effort to implement and validate large folio support, and
>> that assumes both sufficient resources and prioritization on the
>> filesystem side. Even after support lands, coverage across different
>> base page sizes and configurations may take additional time to mature.
>>
>> What I am really concerned about is the transition period: if filesystem
>> support is not yet broadly ready, while we have already removed the
>> fallback path, we may end up in a situation where PMD-sized mappings
>> become effectively unavailable on many systems for some time.
>>
>> This is not about the long-term direction, but about the timing and
>> practical readiness.
>
> If we leave this fallback in place, we'll never get filesystems to move
> forward. It's time to rip off this bandaid; they've got eight months
> before the next stable kernel.
I guess if we don't force them to work on it, this will never happen.
They shouldn't be holding the THP hacks we want to remove hostage.
--
Cheers,
David
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
2026-04-01 14:38 ` David Hildenbrand (Arm)
@ 2026-04-01 14:53 ` Darrick J. Wong
0 siblings, 0 replies; 76+ messages in thread
From: Darrick J. Wong @ 2026-04-01 14:53 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Matthew Wilcox, WANG Rui, ziy, ljs, Liam.Howlett, akpm, baohua,
baolin.wang, brauner, clm, dev.jain, dsterba, jack, lance.yang,
linux-btrfs, linux-fsdevel, linux-kernel, linux-kselftest,
linux-mm, mhocko, npache, rppt, ryan.roberts, shuah,
songliubraving, surenb, vbabka, viro
On Wed, Apr 01, 2026 at 04:38:21PM +0200, David Hildenbrand (Arm) wrote:
> On 3/30/26 18:19, Matthew Wilcox wrote:
> > On Tue, Mar 31, 2026 at 12:09:42AM +0800, WANG Rui wrote:
> >> Given the diversity of filesystems in use, each one requires dedicated
> >> engineering effort to implement and validate large folio support, and
> >> that assumes both sufficient resources and prioritization on the
> >> filesystem side. Even after support lands, coverage across different
> >> base page sizes and configurations may take additional time to mature.
> >>
> >> What I am really concerned about is the transition period: if filesystem
> >> support is not yet broadly ready, while we have already removed the
> >> fallback path, we may end up in a situation where PMD-sized mappings
> >> become effectively unavailable on many systems for some time.
> >>
> >> This is not about the long-term direction, but about the timing and
> >> practical readiness.
> >
> > If we leave this fallback in place, we'll never get filesystems to move
> > forward. It's time to rip off this bandaid; they've got eight months
> > before the next stable kernel.
>
> I guess if we don't force them to work on it I guess this will never
> happen. They shouldn't be holding our THP hacks we want to remove hostage.
+1. There are too many filesystems for the ever-shrinking number of
filesystem maintainers, so the work won't get done without leverage.
Leverage, as in "hey why did my fault counts go up?"
--D
> --
> Cheers,
>
> David
>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-04-01 14:35 ` David Hildenbrand (Arm)
@ 2026-04-01 15:32 ` Zi Yan
2026-04-01 19:15 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-04-01 15:32 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lorenzo Stoakes (Oracle), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
> On 3/27/26 16:05, Zi Yan wrote:
>> On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> There could now be a race between collapsing and the file getting opened
>>>> r/w.
>>>>
>>>> Are we sure that all code can really deal with that?
>>>>
>>>> IOW, "they already had to handle it separately" -- is that true?
>>>> khugepaged would never have collapsed in writable files, so I wonder if
>>>> all code paths are prepared for that.
>>>
>>> OK I guess I overlooked a part of this code... :) see below.
>>>
>>> This is fine and would be a no-op anyway
>>>
>>> - if (f->f_mode & FMODE_WRITE) {
>>> - /*
>>> - * Depends on full fence from get_write_access() to synchronize
>>> - * against collapse_file() regarding i_writecount and nr_thps
>>> - * updates. Ensures subsequent insertion of THPs into the page
>>> - * cache will fail.
>>> - */
>>> - if (filemap_nr_thps(inode->i_mapping)) {
>>>
>>> But this:
>>>
>>> - if (!is_shmem) {
>>> - filemap_nr_thps_inc(mapping);
>>> - /*
>>> - * Paired with the fence in do_dentry_open() -> get_write_access()
>>> - * to ensure i_writecount is up to date and the update to nr_thps
>>> - * is visible. Ensures the page cache will be truncated if the
>>> - * file is opened writable.
>>> - */
>>> - smp_mb();
>>>
>>> We can drop barrier
>>>
>>> - if (inode_is_open_for_write(mapping->host)) {
>>> - result = SCAN_FAIL;
>>>
>>> But this is a functional change!
>>>
>>> Yup missed this.
>>
>> But I added
>>
>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>> + result = SCAN_FAIL;
>>
>> That keeps the original bail out, right?
>
> Independent of that, are we sure that the possible race we allow is ok?
Let me think.
do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
inode->i_writecount atomically, which makes inode_is_open_for_write()
return true. Then do_dentry_open() also truncates all pages if
filemap_nr_thps() is not zero. This pairs with khugepaged, which first
calls filemap_nr_thps_inc() and then checks inode_is_open_for_write(),
to prevent opening an fd for write while a read-only THP exists.
After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
on FSes with large folio support (to be precise, THP support). If an fd
is opened for write before the inode_is_open_for_write() check, khugepaged
will stop, which is fine. But if an fd is opened for write after the
inode_is_open_for_write() check, khugepaged will try to collapse a read-only
THP while the fd can be written at the same time.
I notice that an fd write requires locking the to-be-written folio first
(I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
f_ops->write() has the same locking requirement), and khugepaged has already
locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
fd is opened for write after the inode_is_open_for_write() check, its write
will wait for the khugepaged collapse to finish and then see the new THP.
Since the FS supports THP, writing to the new THP should be fine.
Let me know if my analysis above makes sense. If yes, I will add it
to the commit message and add a succinct comment about it before
inode_is_open_for_write().
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-04-01 15:32 ` Zi Yan
@ 2026-04-01 19:15 ` David Hildenbrand (Arm)
2026-04-01 20:33 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-01 19:15 UTC (permalink / raw)
To: Zi Yan
Cc: Lorenzo Stoakes (Oracle), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 4/1/26 17:32, Zi Yan wrote:
> On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
>
>> On 3/27/26 16:05, Zi Yan wrote:
>>>
>>>
>>> But I added
>>>
>>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>>> + result = SCAN_FAIL;
>>>
>>> That keeps the original bail out, right?
>>
>> Independent of that, are we sure that the possible race we allow is ok?
>
> Let me think.
>
> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
> inode->i_writecount atomically and it turns inode_is_open_for_write()
> to true. Then, do_dentry_open() also truncates all pages
> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
> a fd with write when there is a read-only THP.
>
> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
> on FSes with large folio support (to be precise THP support). If a fd
> is opened for write before inode_is_open_for_write() check, khugepaged
> will stop. It is fine. But if a fd is opened for write after
> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
> THP and the fd can be written at the same time.
Exactly, that's the race I mean.
>
> I notice that fd write requires locking the to-be-written folio first
> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
> f_ops->write() has the same locking requirement) and khugepaged has already
> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
> fd is opened for write after inode_is_open_for_write() check, its write
> will wait for khugepaged collapse and see a new THP. Since the FS
> supports THP, writing to the new THP should be fine.
>
> Let me know if my analysis above makes sense. If yes, I will add it
> to the commit message and add a succinct comment about it before
> inode_is_open_for_write().
khugepaged code is the only code that replaces folios in the pagecache
by other folios. So my main concern is if that is problematic on
concurrent write access.
You argue that the folio lock is sufficient. That's certainly true for
individual folios, but I am more concerned about the replacement part.
I don't have anything concrete, primarily just pointing out that this is
a change that might unlock some code paths that could not have been
triggered before.
--
Cheers,
David
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-04-01 19:15 ` David Hildenbrand (Arm)
@ 2026-04-01 20:33 ` Zi Yan
2026-04-02 14:35 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-04-01 20:33 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lorenzo Stoakes (Oracle), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
> On 4/1/26 17:32, Zi Yan wrote:
>> On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
>>
>>> On 3/27/26 16:05, Zi Yan wrote:
>>>>
>>>>
>>>> But I added
>>>>
>>>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>>>> + result = SCAN_FAIL;
>>>>
>>>> That keeps the original bail out, right?
>>>
>>> Independent of that, are we sure that the possible race we allow is ok?
>>
>> Let me think.
>>
>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>> to true. Then, do_dentry_open() also truncates all pages
>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>> a fd with write when there is a read-only THP.
>>
>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>> on FSes with large folio support (to be precise THP support). If a fd
>> is opened for write before inode_is_open_for_write() check, khugepaged
>> will stop. It is fine. But if a fd is opened for write after
>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>> THP and the fd can be written at the same time.
>
> Exactly, that's the race I mean.
>
>>
>> I notice that fd write requires locking the to-be-written folio first
>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>> f_ops->write() has the same locking requirement) and khugepaged has already
>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>> fd is opened for write after inode_is_open_for_write() check, its write
>> will wait for khugepaged collapse and see a new THP. Since the FS
>> supports THP, writing to the new THP should be fine.
>>
>> Let me know if my analysis above makes sense. If yes, I will add it
>> to the commit message and add a succinct comment about it before
>> inode_is_open_for_write().
>
> khugepaged code is the only code that replaces folios in the pagecache
> by other folios. So my main concern is if that is problematic on
> concurrent write access.
folio_split() does it too, although it replaces a large folio with
a bunch of after-split folios. It is kind of the reverse of
collapse_file().
>
> You argue that the folio lock is sufficient. That's certainly true for
> individual folios, but I am more concerned about the replacement part.
For the replacement part, both old and new folios are locked during
the process. A parallel writer uses filemap_get_entry() to get the folio
from the mapping, but all callers check folio->mapping after acquiring the
folio lock, except mincore_page(), which is only a reader. A writer can see
either the old folio or the new folio during the process, but:
1. if it sees the old one, it waits on the old folio lock. After
it acquires the lock, it sees old_folio->mapping is NULL, no longer
matches the original mapping. The writer will try again.
2. if it sees the new one, it waits on the new folio lock. After
it acquires the lock, it sees new_folio->mapping matches the
original mapping and proceeds to its writes.
3. if khugepaged needs to do a rollback, the old folio will stay
the same and the writer will see the old one after it gets the old
folio lock.
>
> I don't have anything concrete, primarily just pointing out that this is
> a change that might unlock some code paths that could not have been
> triggered before.
Yes, the concern makes sense.
BTW, Claude is trying to convince me that even inode_is_open_for_write()
is unnecessary, since 1) the folio_test_dirty() check before it has
made sure the folio is clean, and 2) try_to_unmap() and the locked folio
prevent further writes.
But then we find a hole between folio_test_dirty() and
try_to_unmap(), where a write via a writable mmap PTE can dirty the folio
after folio_test_dirty() but before try_to_unmap(). To remove that hole,
the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))”
check needs to be moved after try_to_unmap(). With that, all to-be-collapsed
folios will be clean, unmapped, and locked: unmapped means writes via mmap
need to fault and take the folio lock, and locked means writes via both
mmap and write() need to wait until the folio is unlocked.
Let me know if my reasoning makes sense. It is definitely worth the time
and effort to ensure this patchset does not introduce any unexpected race
condition or issue.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-04-01 20:33 ` Zi Yan
@ 2026-04-02 14:35 ` David Hildenbrand (Arm)
2026-04-02 14:38 ` Zi Yan
0 siblings, 1 reply; 76+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02 14:35 UTC (permalink / raw)
To: Zi Yan
Cc: Lorenzo Stoakes (Oracle), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 4/1/26 22:33, Zi Yan wrote:
> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
>
>> On 4/1/26 17:32, Zi Yan wrote:
>>>
>>>
>>> Let me think.
>>>
>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>>> to true. Then, do_dentry_open() also truncates all pages
>>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>>> a fd with write when there is a read-only THP.
>>>
>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>>> on FSes with large folio support (to be precise THP support). If a fd
>>> is opened for write before inode_is_open_for_write() check, khugepaged
>>> will stop. It is fine. But if a fd is opened for write after
>>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>>> THP and the fd can be written at the same time.
>>
>> Exactly, that's the race I mean.
>>
>>>
>>> I notice that fd write requires locking the to-be-written folio first
>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>>> f_ops->write() has the same locking requirement) and khugepaged has already
>>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>>> fd is opened for write after inode_is_open_for_write() check, its write
>>> will wait for khugepaged collapse and see a new THP. Since the FS
>>> supports THP, writing to the new THP should be fine.
>>>
>>> Let me know if my analysis above makes sense. If yes, I will add it
>>> to the commit message and add a succinct comment about it before
>>> inode_is_open_for_write().
>>
>> khugepaged code is the only code that replaces folios in the pagecache
>> by other folios. So my main concern is if that is problematic on
>> concurrent write access.
>
> folio_split() does it too, although it replaces a large folio with
> a bunch of after-split folios. It is a kinda reverse process of
> collapse_file().
Right. You won't start looking at a small folio and suddenly there is
something larger.
>
>
>>
>> You argue that the folio lock is sufficient. That's certainly true for
>> individual folios, but I am more concerned about the replacement part.
>
> For the replacement part, both old and new folios are locked during
> the process. A parallel writer uses filemap_get_entry() to get the folio
> from mapping, but all of them check folio->mapping after acquiring the
> folio lock, except mincore_page() which is a reader. A writer can see
> either old folio or new folio during the process, but
>
> 1. if it sees the old one, it waits on the old folio lock. After
> it acquires the lock, it sees old_folio->mapping is NULL, no longer
> matches the original mapping. The writer will try again.
>
> 2. if it sees the new one, it waits on the new folio lock. After
> it acquires the lock, it sees new_folio->mapping matches the
> original mapping and proceeds to its writes.
>
> 3. if khugepaged needs to do a rollback, the old folio will stay
> the same and the writer will see the old one after it gets the old
> folio lock.
I am primarily wondering about what would happen if someone traverses
the pagecache and found+processed 3 small folios, and suddenly there is a
large folio that covers the 3 small folios processed before.
I suspect that is fine, because the code likely had to deal with
concurrent truncation+population if relevant locks are dropped already.
Just raising it.
>
>>
>> I don't have anything concrete, primarily just pointing out that this is
>> a change that might unlock some code paths that could not have been
>> triggered before.
>
> Yes, the concern makes sense.
>
> BTW, Claude is trying to convince me that even inode_is_open_for_write()
> is unnecessary since 1) folio_test_dirty() before it has
> made sure the folio is clean, 2) try_to_unmap() and the locked folio prevent
> further writes.
>
> But then we find a hole between folio_test_dirty() and
> try_to_unmap() where a write via a writable mmap PTE can dirty the folio
> after folio_test_dirty() but before try_to_unmap(). To remove that hole,
> the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))”
> needs to be moved after try_to_unmap(). With that, all to-be-collapsed
> folios will be clean, unmapped, and locked, where unmapped means
> writes via mmap need to fault and take the folio lock, locked means
> writes via mmap and write() need to wait until the folio is unlocked.
>
> Let me know if my reasoning makes sense. It is definitely worth the time
> and effort to ensure this patchset does not introduce any unexpected race
> condition or issue.
Makes sense.
Please clearly spell out that there is a slight change now, where we
might be collapsing after the file has been opened for write. Then you
can document that the folio locks should be protecting us from that.
Implying that collapsing in writable files could likely be "easily" done in
the future.
--
Cheers,
David
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
2026-04-02 14:35 ` David Hildenbrand (Arm)
@ 2026-04-02 14:38 ` Zi Yan
0 siblings, 0 replies; 76+ messages in thread
From: Zi Yan @ 2026-04-02 14:38 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lorenzo Stoakes (Oracle), Matthew Wilcox (Oracle), Song Liu,
Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
Jan Kara, Andrew Morton, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 2 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
> On 4/1/26 22:33, Zi Yan wrote:
>> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
>>
>>> On 4/1/26 17:32, Zi Yan wrote:
>>>>
>>>>
>>>> Let me think.
>>>>
>>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>>>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>>>> to true. Then, do_dentry_open() also truncates all pages
>>>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>>>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>>>> a fd with write when there is a read-only THP.
>>>>
>>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>>>> on FSes with large folio support (to be precise THP support). If a fd
>>>> is opened for write before inode_is_open_for_write() check, khugepaged
>>>> will stop. It is fine. But if a fd is opened for write after
>>>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>>>> THP and the fd can be written at the same time.
>>>
>>> Exactly, that's the race I mean.
>>>
>>>>
>>>> I notice that fd write requires locking the to-be-written folio first
>>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>>>> f_ops->write() has the same locking requirement) and khugepaged has already
>>>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>>>> fd is opened for write after inode_is_open_for_write() check, its write
>>>> will wait for khugepaged collapse and see a new THP. Since the FS
>>>> supports THP, writing to the new THP should be fine.
>>>>
>>>> Let me know if my analysis above makes sense. If yes, I will add it
>>>> to the commit message and add a succinct comment about it before
>>>> inode_is_open_for_write().
>>>
>>> khugepaged code is the only code that replaces folios in the pagecache
>>> by other folios. So my main concern is if that is problematic on
>>> concurrent write access.
>>
>> folio_split() does it too, although it replaces a large folio with
>> a bunch of after-split folios. It is a kinda reverse process of
>> collapse_file().
>
> Right. You won't start looking at a small folio and suddenly there is
> something larger.
>
>>
>>
>>>
>>> You argue that the folio lock is sufficient. That's certainly true for
>>> individual folios, but I am more concerned about the replacement part.
>>
>> For the replacement part, both the old and new folios are locked during
>> the process. A parallel writer uses filemap_get_entry() to get the folio
>> from the mapping, and all callers check folio->mapping after acquiring the
>> folio lock, except mincore_page(), which is only a reader. A writer can see
>> either the old folio or the new folio during the process:
>>
>> 1. if it sees the old one, it waits on the old folio lock. After
>> it acquires the lock, it sees old_folio->mapping is NULL, no longer
>> matches the original mapping. The writer will try again.
>>
>> 2. if it sees the new one, it waits on the new folio lock. After
>> it acquires the lock, it sees new_folio->mapping matches the
>> original mapping and proceeds to its writes.
>>
>> 3. if khugepaged needs to do a rollback, the old folio will stay
>> the same and the writer will see the old one after it gets the old
>> folio lock.
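The three cases above boil down to a lock-and-recheck retry loop on the writer side. A minimal userspace sketch of that protocol (hypothetical names and a toy Folio class, not the kernel's filemap code):

```python
import threading

class Folio:
    """Toy stand-in for a pagecache folio: a lock plus a mapping pointer."""
    def __init__(self, mapping):
        self.lock = threading.Lock()
        self.mapping = mapping  # set to None when detached from the pagecache

def lock_folio_in_mapping(mapping, lookup):
    # Writer-side retry loop: lookup() is a racy pagecache lookup (think
    # filemap_get_entry()); only after taking the folio lock can we trust
    # folio.mapping. If the folio was replaced meanwhile, retry.
    while True:
        folio = lookup()
        folio.lock.acquire()          # may block while khugepaged holds it
        if folio.mapping is mapping:  # case 2: still (or newly) in the mapping
            return folio
        folio.lock.release()          # case 1: stale detached folio, try again
```

Simulating a collapse where the old folio has been detached (mapping cleared) and a new THP inserted, the writer ends up locking the new folio, matching case 2.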
>
> I am primarily wondering about what would happen if someone traverses
> the pagecache and finds+processes 3 small folios, and suddenly there is a
> large folio that covers the 3 small folios processed before.
>
> I suspect that is fine, because the code likely already has to deal with
> concurrent truncation+population once the relevant locks are dropped.
>
> Just raising it.
>
>>
>>>
>>> I don't have anything concrete, primarily just pointing out that this is
>>> a change that might unlock some code paths that could not have been
>>> triggered before.
>>
>> Yes, the concern makes sense.
>>
>> BTW, Claude is trying to convince me that even inode_is_open_for_write()
>> is unnecessary, since 1) the folio_test_dirty() check before it has
>> made sure the folio is clean, and 2) try_to_unmap() plus the locked folio
>> prevent further writes.
>>
>> But then we found a hole between folio_test_dirty() and
>> try_to_unmap(), where a write via a writable mmap PTE can dirty the folio
>> after folio_test_dirty() but before try_to_unmap(). To close that hole,
>> the "if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))"
>> check needs to be moved after try_to_unmap(). With that, all to-be-collapsed
>> folios will be clean, unmapped, and locked, where unmapped means that
>> writes via mmap need to fault and take the folio lock, and locked means that
>> writes via mmap and write() need to wait until the folio is unlocked.
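The ordering argument above can be condensed into a toy userspace sketch (hypothetical names; the real check lives in collapse_file() in mm/khugepaged.c):

```python
class Folio:
    """Toy folio: tracks dirtiness and whether writable PTEs still map it."""
    def __init__(self):
        self.dirty = False
        self.writable_ptes = True

def try_to_unmap(folio):
    # After unmapping, no writable PTE remains, so a racing mmap write can
    # no longer dirty the folio behind our back; it must fault and then
    # wait on the folio lock instead.
    folio.writable_ptes = False

def clean_enough_to_collapse(folio):
    try_to_unmap(folio)
    # Checking dirty only AFTER try_to_unmap() closes the window in which
    # a writable PTE could dirty the folio between the check and the unmap.
    return not folio.dirty
```

With the check in this order, a folio that passes is guaranteed clean and unmapped at the moment of collapse; the pre-patch order (check first, unmap later) cannot make that guarantee.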
>>
>> Let me know if my reasoning makes sense. It is definitely worth the time
>> and effort to ensure this patchset does not introduce any unexpected race
>> condition or issue.
>
> Makes sense.
>
> Please clearly spell out that there is a slight change now, where we
> might be collapsing after the file has been opened for write. Then you
> can document that the folio locks should be protecting us from that.
>
> This implies that collapsing in writable files could likely be "easily" done
> in the future.
Definitely. Thank you for all the inputs. :)
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
` (10 preceding siblings ...)
2026-03-27 13:46 ` [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig David Hildenbrand (Arm)
@ 2026-04-05 17:38 ` Nico Pache
2026-04-06 1:59 ` Zi Yan
11 siblings, 1 reply; 76+ messages in thread
From: Nico Pache @ 2026-04-05 17:38 UTC (permalink / raw)
To: Zi Yan, Matthew Wilcox (Oracle), David Hildenbrand
Cc: Song Liu, Chris Mason, David Sterba, Alexander Viro,
Christian Brauner, Jan Kara, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Thu, Mar 26, 2026 at 7:43 PM Zi Yan <ziy@nvidia.com> wrote:
>
> Hi all,
>
> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
> read-only THPs for FSes with large folio support (the supported orders
> need to include PMD_ORDER) by default.
Hi Zi,
Thank you for tackling this :) I'll try to review the next version as
I'm a little behind on this thread.
Should we guard collapsing READ_ONLY_THPs with a sysctl? My fear is
workloads that convert READ_ONLY THPs into writable pages (assuming
this is common/possible; my understanding of FS is rather low),
leading to storms of thp splitting. Do you think this is a real
concern? I guess this is also true of read-only-->writable fs-THPs
even without khugepaged, correct?
Cheers,
-- Nico
>
> The changes are:
> 1. collapse_file() from mm/khugepaged.c, instead of checking
> CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
> of struct address_space of the file is at least PMD_ORDER.
> 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
> 3. truncate_inode_partial_folio() calls folio_split() directly instead
> of the removed try_folio_split_to_order(), since large folios can
> only show up on a FS with large folio support.
> 4. nr_thps is removed from struct address_space, since it is no longer
> needed to drop all read-only THPs from a FS without large folio
> support when the fd becomes writable. Its related filemap_nr_thps*()
> are removed too.
> 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
> 6. Updated comments in various places.
>
> Changelog
> ===
> From RFC[1]:
> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
> on by default for all FSes with large folio support and the supported
> orders includes PMD_ORDER.
>
> Suggestions and comments are welcome.
>
> Link: https://lore.kernel.org/all/20260323190644.1714379-1-ziy@nvidia.com/ [1]
>
> Zi Yan (10):
> mm: remove READ_ONLY_THP_FOR_FS Kconfig option
> mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
> mm: fs: remove filemap_nr_thps*() functions and their users
> fs: remove nr_thps from struct address_space
> mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
> mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
> mm/truncate: use folio_split() in truncate_inode_partial_folio()
> fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
> selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
> selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in
> guard-regions
>
> fs/btrfs/defrag.c | 3 --
> fs/inode.c | 3 --
> fs/open.c | 27 ----------------
> include/linux/fs.h | 5 ---
> include/linux/huge_mm.h | 25 ++-------------
> include/linux/pagemap.h | 29 -----------------
> mm/Kconfig | 11 -------
> mm/filemap.c | 1 -
> mm/huge_memory.c | 29 ++---------------
> mm/khugepaged.c | 36 +++++-----------------
> mm/truncate.c | 8 ++---
> tools/testing/selftests/mm/guard-regions.c | 9 +++---
> tools/testing/selftests/mm/khugepaged.c | 4 +--
> 13 files changed, 23 insertions(+), 167 deletions(-)
>
> --
> 2.43.0
>
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-04-05 17:38 ` Nico Pache
@ 2026-04-06 1:59 ` Zi Yan
2026-04-06 16:17 ` Nico Pache
0 siblings, 1 reply; 76+ messages in thread
From: Zi Yan @ 2026-04-06 1:59 UTC (permalink / raw)
To: Nico Pache
Cc: Matthew Wilcox (Oracle), David Hildenbrand, Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On 5 Apr 2026, at 13:38, Nico Pache wrote:
> On Thu, Mar 26, 2026 at 7:43 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> Hi all,
>>
>> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
>> read-only THPs for FSes with large folio support (the supported orders
>> need to include PMD_ORDER) by default.
>
> Hi Zi,
>
> Thank you for tackling this :) Ill try to review the next version as
> I'm a little behind on this thread.
Sure. Thanks.
>
> Should we guard collapsing READ_ONLY_THPs with a sysctl? My fear is
> workloads that convert READ_ONLY THPs into writable pages (assuming
> this is common/possible; my understanding of FS is rather low),
> leading to storms of thp splitting. Do you think this is a real
> concern? I guess this is also true of read-only-->writable fs-THPS
> even without khugepaged, correct?
Why would a read-only THP need to be split when it becomes writable?
After this patchset, a read-only THP can only be created on a FS that
supports large folios (to be precise PMD THP). That means any write
to that read-only THP would just change it to a writable THP.
Let me know if I miss anything.
>
> Cheers,
> -- Nico
>
>>
>> The changes are:
>> 1. collapse_file() from mm/khugepaged.c, instead of checking
>> CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
>> of struct address_space of the file is at least PMD_ORDER.
>> 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
>> 3. truncate_inode_partial_folio() calls folio_split() directly instead
>> of the removed try_folio_split_to_order(), since large folios can
>> only show up on a FS with large folio support.
>> 4. nr_thps is removed from struct address_space, since it is no longer
>> needed to drop all read-only THPs from a FS without large folio
>> support when the fd becomes writable. Its related filemap_nr_thps*()
>> are removed too.
>> 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
>> 6. Updated comments in various places.
>>
>> Changelog
>> ===
>> From RFC[1]:
>> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
>> on by default for all FSes with large folio support and the supported
>> orders includes PMD_ORDER.
>>
>> Suggestions and comments are welcome.
>>
>> Link: https://lore.kernel.org/all/20260323190644.1714379-1-ziy@nvidia.com/ [1]
>>
>> Zi Yan (10):
>> mm: remove READ_ONLY_THP_FOR_FS Kconfig option
>> mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
>> mm: fs: remove filemap_nr_thps*() functions and their users
>> fs: remove nr_thps from struct address_space
>> mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
>> mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
>> mm/truncate: use folio_split() in truncate_inode_partial_folio()
>> fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
>> selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
>> selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in
>> guard-regions
>>
>> fs/btrfs/defrag.c | 3 --
>> fs/inode.c | 3 --
>> fs/open.c | 27 ----------------
>> include/linux/fs.h | 5 ---
>> include/linux/huge_mm.h | 25 ++-------------
>> include/linux/pagemap.h | 29 -----------------
>> mm/Kconfig | 11 -------
>> mm/filemap.c | 1 -
>> mm/huge_memory.c | 29 ++---------------
>> mm/khugepaged.c | 36 +++++-----------------
>> mm/truncate.c | 8 ++---
>> tools/testing/selftests/mm/guard-regions.c | 9 +++---
>> tools/testing/selftests/mm/khugepaged.c | 4 +--
>> 13 files changed, 23 insertions(+), 167 deletions(-)
>>
>> --
>> 2.43.0
>>
--
Best Regards,
Yan, Zi
* Re: [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig
2026-04-06 1:59 ` Zi Yan
@ 2026-04-06 16:17 ` Nico Pache
0 siblings, 0 replies; 76+ messages in thread
From: Nico Pache @ 2026-04-06 16:17 UTC (permalink / raw)
To: Zi Yan
Cc: Matthew Wilcox (Oracle), David Hildenbrand, Song Liu, Chris Mason,
David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
Andrew Morton, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Sun, Apr 5, 2026 at 7:59 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 5 Apr 2026, at 13:38, Nico Pache wrote:
>
> > On Thu, Mar 26, 2026 at 7:43 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> Hi all,
> >>
> >> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
> >> read-only THPs for FSes with large folio support (the supported orders
> >> need to include PMD_ORDER) by default.
> >
> > Hi Zi,
> >
> > Thank you for tackling this :) Ill try to review the next version as
> > I'm a little behind on this thread.
>
> Sure. Thanks.
>
> >
> > Should we guard collapsing READ_ONLY_THPs with a sysctl? My fear is
> > workloads that convert READ_ONLY THPs into writable pages (assuming
> > this is common/possible; my understanding of FS is rather low),
> > leading to storms of thp splitting. Do you think this is a real
> > concern? I guess this is also true of read-only-->writable fs-THPS
> > even without khugepaged, correct?
>
> Why would a read-only THP need to be split when it becomes writable?
> After this patchset, a read-only THP can only be created on a FS that
> supports large folios (to be precise PMD THP). That means any write
> to that read-only THP would just change it to a writable THP.
Ah, okay. I was misremembering some stuff.
The concern I spotted earlier when investigating read-only THPs for
khugepaged was this:
For frequent yet short-lived writes on read-only pages (e.g., package
updates, log updates), wouldn't we get destructive cycles of cache
invalidations and refault storms?
Imagine such pages are shared (libraries, execs, etc.) across many processes.
When these files are marked for writing, we must invalidate all of
their mappings, destroying their page tables and page cache entries. Now all
processes must refault these mappings.
The part I didn't understand (thanks, Claude) is that this truncation
path in do_dentry_open() is only taken for mappings/filesystems that do
not support large folios, as only those filesystems track
mapping->nr_thps. Furthermore, with FSes that natively support large
folios, khugepaged does not need to re-collapse these pages, as even
if that were the case, they would be refaulted as THPs.
TLDR: My concern is not a real concern.
Cheers,
-- Nico
>
> Let me know if I miss anything.
>
> >
> > Cheers,
> > -- Nico
> >
> >>
> >> The changes are:
> >> 1. collapse_file() from mm/khugepaged.c, instead of checking
> >> CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
> >> of struct address_space of the file is at least PMD_ORDER.
> >> 2. file_thp_enabled() also checks mapping_max_folio_order() instead.
> >> 3. truncate_inode_partial_folio() calls folio_split() directly instead
> >> of the removed try_folio_split_to_order(), since large folios can
> >> only show up on a FS with large folio support.
> >> 4. nr_thps is removed from struct address_space, since it is no longer
> >> needed to drop all read-only THPs from a FS without large folio
> >> support when the fd becomes writable. Its related filemap_nr_thps*()
> >> are removed too.
> >> 5. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
> >> 6. Updated comments in various places.
> >>
> >> Changelog
> >> ===
> >> From RFC[1]:
> >> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
> >> on by default for all FSes with large folio support and the supported
> >> orders includes PMD_ORDER.
> >>
> >> Suggestions and comments are welcome.
> >>
> >> Link: https://lore.kernel.org/all/20260323190644.1714379-1-ziy@nvidia.com/ [1]
> >>
> >> Zi Yan (10):
> >> mm: remove READ_ONLY_THP_FOR_FS Kconfig option
> >> mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
> >> mm: fs: remove filemap_nr_thps*() functions and their users
> >> fs: remove nr_thps from struct address_space
> >> mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
> >> mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
> >> mm/truncate: use folio_split() in truncate_inode_partial_folio()
> >> fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
> >> selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
> >> selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in
> >> guard-regions
> >>
> >> fs/btrfs/defrag.c | 3 --
> >> fs/inode.c | 3 --
> >> fs/open.c | 27 ----------------
> >> include/linux/fs.h | 5 ---
> >> include/linux/huge_mm.h | 25 ++-------------
> >> include/linux/pagemap.h | 29 -----------------
> >> mm/Kconfig | 11 -------
> >> mm/filemap.c | 1 -
> >> mm/huge_memory.c | 29 ++---------------
> >> mm/khugepaged.c | 36 +++++-----------------
> >> mm/truncate.c | 8 ++---
> >> tools/testing/selftests/mm/guard-regions.c | 9 +++---
> >> tools/testing/selftests/mm/khugepaged.c | 4 +--
> >> 13 files changed, 23 insertions(+), 167 deletions(-)
> >>
> >> --
> >> 2.43.0
> >>
>
>
> --
> Best Regards,
> Yan, Zi
>
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
2026-03-27 11:45 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:33 ` David Hildenbrand (Arm)
2026-03-27 14:39 ` Zi Yan
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
2026-03-27 7:29 ` Lance Yang
2026-03-27 7:35 ` Lance Yang
2026-03-27 9:44 ` Baolin Wang
2026-03-27 12:02 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:45 ` Baolin Wang
2026-03-27 14:12 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:26 ` Baolin Wang
2026-03-27 14:31 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:00 ` Zi Yan
2026-03-27 16:22 ` Lance Yang
2026-03-27 16:30 ` Zi Yan
2026-03-28 2:29 ` Baolin Wang
2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:15 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:46 ` Zi Yan
2026-03-27 13:37 ` David Hildenbrand (Arm)
2026-03-27 14:43 ` Zi Yan
2026-03-27 1:42 ` [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
2026-03-27 9:32 ` Lance Yang
2026-03-27 12:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:58 ` David Hildenbrand (Arm)
2026-03-27 14:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:05 ` Zi Yan
2026-04-01 14:35 ` David Hildenbrand (Arm)
2026-04-01 15:32 ` Zi Yan
2026-04-01 19:15 ` David Hildenbrand (Arm)
2026-04-01 20:33 ` Zi Yan
2026-04-02 14:35 ` David Hildenbrand (Arm)
2026-04-02 14:38 ` Zi Yan
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
2026-03-27 12:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:00 ` David Hildenbrand (Arm)
2026-03-30 3:06 ` Lance Yang
2026-03-27 1:42 ` [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
2026-03-27 12:42 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:12 ` Zi Yan
2026-03-27 15:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:43 ` Zi Yan
2026-03-27 16:08 ` Lorenzo Stoakes (Oracle)
2026-03-27 16:12 ` Zi Yan
2026-03-27 16:14 ` Lorenzo Stoakes (Oracle)
2026-03-29 4:07 ` WANG Rui
2026-03-30 11:17 ` Lorenzo Stoakes (Oracle)
2026-03-30 14:35 ` Zi Yan
2026-03-30 16:09 ` WANG Rui
2026-03-30 16:19 ` Matthew Wilcox
2026-04-01 14:38 ` David Hildenbrand (Arm)
2026-04-01 14:53 ` Darrick J. Wong
2026-03-27 1:42 ` [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
2026-03-27 12:50 ` Lorenzo Stoakes (Oracle)
2026-03-30 9:15 ` Lance Yang
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
2026-03-27 3:33 ` Lance Yang
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:35 ` Zi Yan
2026-03-28 9:54 ` kernel test robot
2026-03-28 9:54 ` kernel test robot
2026-03-27 1:42 ` [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions Zi Yan
2026-03-27 13:06 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:46 ` [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig David Hildenbrand (Arm)
2026-03-27 14:26 ` Zi Yan
2026-03-27 14:27 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:30 ` Zi Yan
2026-04-05 17:38 ` Nico Pache
2026-04-06 1:59 ` Zi Yan
2026-04-06 16:17 ` Nico Pache