public inbox for linux-kernel@vger.kernel.org
* [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files
@ 2026-04-29 15:29 Zi Yan
  2026-04-29 15:29 ` [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

I will be AFK for most of May, so my responses might be delayed.

Hi all,

This patchset removes the READ_ONLY_THP_FOR_FS Kconfig option and enables
creating file-backed THPs by default for FSes with large folio support
(the supported orders need to include PMD_ORDER), including for writable
files.
It is on top of mm-new.


Before the patchset, the status of read-only THP creation was as follows:

                            |    PF     | MADV_COLLAPSE | khugepaged |
                            |-----------|---------------|------------|
 large folio FSes only      |     ✓     |       x       |      x     |
 READ_ONLY_THP_FOR_FS only  |     x     |       ✓       |      ✓     |
 both                       |     ✓     |       ✓       |      ✓     |

where READ_ONLY_THP_FOR_FS implies FSes without large folio support.


Now without READ_ONLY_THP_FOR_FS:

                                  |    PF     | MADV_COLLAPSE | khugepaged |
                                  |-----------|---------------|------------|
 large folio FSes (read-only fd)  |     ✓     |       ✓       |      ✓     |
 large folio FSes (read-write fd) |     ✓     |       ✓       |      ✓*    |
 no large folio FSes              |     x     |       x       |      x     |

* khugepaged only collapses clean folios from writable files. Userspace
  must flush dirty folios explicitly before khugepaged can collapse them.
  MADV_COLLAPSE handles the flush automatically via its writeback-and-retry
  path. Collapsing writable MAP_PRIVATE pagecache folios is still not
  supported, since PMD THP CoW only faults in at PTE level to avoid long
  CoW latency, and file_backed_vma_is_retractable() prevents it.

This means FSes without large folio support need to add it (with
supported orders including PMD_ORDER) in order to leverage file THP
creation.

To prevent breaking file THP support for large folio FSes,
1. patches 1-4 enable the support, so that without READ_ONLY_THP_FOR_FS,
   file THP still works for large folio FSes,
2. patch 5 removes the READ_ONLY_THP_FOR_FS Kconfig option,
3. patches 6-12 remove code related to READ_ONLY_THP_FOR_FS,
4. patches 13-14 enable clean pagecache folio collapse for writable files.


NOTE: collapsing writable MAP_PRIVATE pagecache folios is not supported,
since:
1. PMD THP CoW only faults in at PTE level to avoid long CoW latency,
2. the first check in file_backed_vma_is_retractable(), which exists
   because of 1, prevents it.


Overview
===

1. collapse_file() checks to-be-collapsed folios for dirtiness after they
   are locked and unmapped, to make sure no new writes happen. Previously,
   mapping->nr_thps and inode->i_writecount were used to truncate
   read-only THPs before an fd became writable.

2. hugepage_enabled() is true for the anon, shmem, and file-backed cases
   if the global khugepaged control is on; otherwise, khugepaged for the
   file-backed case is turned off, and anon and shmem depend on their
   per-size control knobs.

3. collapse_file() from mm/khugepaged.c, instead of checking
   CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
   of struct address_space of the file is at least PMD_ORDER.

4. file_thp_enabled() checks mapping_max_folio_order() instead of
   CONFIG_READ_ONLY_THP_FOR_FS and no longer checks if the file is opened
   read-only. The dirty folio check after try_to_unmap() (Change 1)
   handles writable files correctly.

5. truncate_inode_partial_folio() calls folio_split() directly instead
   of the removed try_folio_split_to_order(), since large folios can
   only show up on a FS with large folio support.

6. nr_thps is removed from struct address_space, since it is no longer
   needed to drop all read-only THPs from a FS without large folio
   support when the fd becomes writable. Its related filemap_nr_thps*()
   are removed too.

7. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.

8. collapse_file() only calls filemap_flush() for read-only files.
   Blindly flushing dirty folios from writable files would cause
   undesirable system-wide writeback; userspace is expected to flush
   explicitly, or use MADV_COLLAPSE which handles it via its retry path.

9. Updated comments and selftests in various places.


Changelog
===
From V4[5]:
1. fixed Patch 1's compilation error in the !CONFIG_TRANSPARENT_HUGEPAGE
   case.

2. changed Patch 3 to no longer enable collapse for read-write fds but
   only allow read-only fds.

3. added two new patches to enable clean pagecache folio collapse for
   writable files:
   - Patch 13: remove inode_is_open_for_write() from file_thp_enabled()
     so that khugepaged and MADV_COLLAPSE can process writable files.
     filemap_flush() in collapse_file() is now conditionalized on the file
     being read-only, to avoid repeatedly writing back dirty folios from
     writable files.
   - Patch 14: add read_write_file_read_ops and read_write_file_write_ops
     to the khugepaged selftest to cover the new writable-file collapse paths.

From V3[4]:
1. added a TODO comment in patch 1 noting that the is_shmem exception in
   the VM_WARN_ON_ONCE() check can be removed once shmem always calls
   mapping_set_large_folios() on its mapping. Used VM_WARN_ON_ONCE() in
   mapping_pmd_thp_support() instead.

2. fixed the dirty folio bail-out path in patch 2: add xas_unlock_irq()
   and folio_putback_lru() before the goto, which were missing and would
   have left the XA lock held and the LRU isolation ref leaked.

3. renamed hugepage_pmd_enabled() to hugepage_enabled() to reflect it
   controls khugepaged for all transparent hugepage types.

4. reverted the comment in hugepage_enabled() in patch 4 to the original;
   only removed the phrase "when configured in," which referred to
   CONFIG_READ_ONLY_THP_FOR_FS.

5. fixed commit message in patch 6: the dirty folio check is added after
   try_to_unmap() in collapse_file(), not after try_to_unmap_flush().

From V2[3]:
1. removed unnecessary check in collapse_scan_file().

2. removed inode_is_open_for_write() check in file_thp_enabled().

3. changed hugepage_enabled() to return true if the khugepaged global
   control is on instead of false. Cleaned up the anon and shmem code in
   the function.

4. moved folio dirtiness check after try_to_unmap() but before
   try_to_unmap_flush(), since that is sufficient to prevent new writes.

5. reordered patch 4 and 5, so that khugepaged behavior does not change
   after READ_ONLY_THP_FOR_FS is removed.

6. added read-write file test in khugepaged selftest.

7. removed the read-only file restriction from guard-region selftest.

From V1[2]:
1. removed inode_is_open_for_write() check in collapse_file(), since the
   added folio dirtiness check after try_to_unmap_flush() should be
   sufficient to prevent writes to candidate folios.

2. removed READ_ONLY_THP_FOR_FS check in hugepage_enabled(); please
   see Patch 5 and item 2 in the overview for more details.

3. moved the patch removing READ_ONLY_THP_FOR_FS Kconfig after enabling
   khugepaged and MADV_COLLAPSE to create read-only THPs.

4. added mapping_pmd_thp_support() helper function.

5. used VM_WARN_ON_ONCE() in collapse_file() for the mapping eligibility
   check and address alignment check instead of if + return error code.
   Always allow shmem, since MADV_COLLAPSE ignores shmem huge config.

6. added mapping eligibility check in collapse_scan_file().

7. removed the trailing ; for folio_split() in the
   !CONFIG_TRANSPARENT_HUGEPAGE case.

8. simplified code in folio_check_splittable() after removing
   READ_ONLY_THP_FOR_FS code.

9. clarified that read-only THP works for FSes with PMD THP support by
   default.

From RFC[1]:
1. instead of removing the READ_ONLY_THP_FOR_FS functionality entirely,
   turn it on by default for all FSes with large folio support whose
   supported orders include PMD_ORDER.

Suggestions and comments are welcome.

Link: https://lore.kernel.org/all/20260323190644.1714379-1-ziy@nvidia.com/ [1]
Link: https://lore.kernel.org/all/20260327014255.2058916-1-ziy@nvidia.com/ [2]
Link: https://lore.kernel.org/all/20260413192030.3275825-1-ziy@nvidia.com/ [3]
Link: https://lore.kernel.org/all/20260418024429.4055056-1-ziy@nvidia.com/ [4]
Link: https://lore.kernel.org/all/20260424024915.28758-1-ziy@nvidia.com/ [5]

Zi Yan (14):
  mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
  mm/khugepaged: add folio dirty check after try_to_unmap()
  mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
  mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled()
  mm: remove READ_ONLY_THP_FOR_FS Kconfig option
  mm: fs: remove filemap_nr_thps*() functions and their users
  fs: remove nr_thps from struct address_space
  mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
  mm/truncate: use folio_split() in truncate_inode_partial_folio()
  fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
  selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
  selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions
  mm/khugepaged: enable clean pagecache folio collapse for writable
    files
  selftests/mm: add writable-file collapse tests for khugepaged

 fs/btrfs/defrag.c                          |   3 -
 fs/inode.c                                 |   3 -
 fs/open.c                                  |  27 ---
 include/linux/fs.h                         |   5 -
 include/linux/huge_mm.h                    |  25 +--
 include/linux/pagemap.h                    |  49 +++---
 include/linux/shmem_fs.h                   |   2 +-
 mm/Kconfig                                 |  11 --
 mm/filemap.c                               |   1 -
 mm/huge_memory.c                           |  39 +----
 mm/khugepaged.c                            | 101 ++++++-----
 mm/truncate.c                              |   8 +-
 tools/testing/selftests/mm/guard-regions.c |  18 +-
 tools/testing/selftests/mm/khugepaged.c    | 190 ++++++++++++++++-----
 tools/testing/selftests/mm/run_vmtests.sh  |  12 +-
 15 files changed, 258 insertions(+), 236 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
@ 2026-04-29 15:29 ` Zi Yan
  2026-04-30 14:37   ` Zi Yan
  2026-05-04  3:48   ` Nico Pache
  2026-04-29 15:29 ` [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap() Zi Yan
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

collapse_file() requires FSes that support large folios of at least
PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that requirement.
MADV_COLLAPSE ignores shmem huge config, so exclude shmem from the check.

While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE.

Add a helper function, mapping_pmd_folio_support(), for checking whether an
FS supports large folios of at least PMD_ORDER.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/pagemap.h | 26 ++++++++++++++++++++++++++
 mm/khugepaged.c         | 10 ++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1f50991b43e3b..1fed3414fe9b8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -513,6 +513,32 @@ static inline bool mapping_large_folio_support(const struct address_space *mappi
 	return mapping_max_folio_order(mapping) > 0;
 }
 
+/**
+ * mapping_pmd_folio_support() - Check if a mapping supports PMD-sized folios
+ * @mapping: The address_space
+ *
+ * Some files support large folios, but not ones as large as PMD order.
+ * This check must be performed before attempting to create a PMD-sized
+ * pagecache folio on a filesystem.
+ *
+ * Return: true - PMD-sized folio is supported, false - PMD-sized folio is not
+ * supported.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool mapping_pmd_folio_support(const struct address_space *mapping)
+{
+	/* AS_FOLIO_ORDER is only reasonable for pagecache folios */
+	VM_WARN_ON_ONCE((unsigned long)mapping & FOLIO_MAPPING_ANON);
+
+	return mapping_max_folio_order(mapping) >= PMD_ORDER;
+}
+#else
+static inline bool mapping_pmd_folio_support(const struct address_space *mapping)
+{
+	return false;
+}
+#endif
+
 /* Return the maximum folio size for this pagecache mapping, in bytes. */
 static inline size_t mapping_max_folio_size(const struct address_space *mapping)
 {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e112525c4aa9c..6808f2b48d864 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2235,8 +2235,14 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	int nr_none = 0;
 	bool is_shmem = shmem_file(file);
 
-	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
-	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
+	/*
+	 * MADV_COLLAPSE ignores shmem huge config, so do not check shmem
+	 *
+	 * TODO: once shmem always calls mapping_set_large_folios() on its
+	 * mapping, the shmem check can be removed.
+	 */
+	VM_WARN_ON_ONCE(!is_shmem && !mapping_pmd_folio_support(mapping));
+	VM_WARN_ON_ONCE(start & (HPAGE_PMD_NR - 1));
 
 	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
-- 
2.53.0



* [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap()
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
  2026-04-29 15:29 ` [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
@ 2026-04-29 15:29 ` Zi Yan
  2026-04-30 15:11   ` Zi Yan
                     ` (2 more replies)
  2026-04-29 15:29 ` [PATCH v5 03/14] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
                   ` (12 subsequent siblings)
  14 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

This check ensures the correctness of read-only PMD folio collapse once it
is enabled for all FSes supporting PMD pagecache folios, replacing
READ_ONLY_THP_FOR_FS.

READ_ONLY_THP_FOR_FS only supports read-only fds and uses mapping->nr_thps
and inode->i_writecount to prevent any write to read-only to-be-collapsed
folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed, and the
aforementioned mechanism will go away with it. To ensure khugepaged
functions as expected after the changes, skip the collapse if any folio is
dirty after try_to_unmap(), since a dirty folio at that point means the
supposedly read-only folio can still receive writes between try_to_unmap()
and try_to_unmap_flush() via cached TLB entries, and khugepaged does not
support writable pagecache folio collapse yet.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/khugepaged.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6808f2b48d864..71209a72195ab 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2327,8 +2327,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 				}
 			} else if (folio_test_dirty(folio)) {
 				/*
-				 * khugepaged only works on read-only fd,
-				 * so this page is dirty because it hasn't
+				 * This page is dirty because it hasn't
 				 * been flushed since first write. There
 				 * won't be new dirty pages.
 				 *
@@ -2386,8 +2385,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		if (!is_shmem && (folio_test_dirty(folio) ||
 				  folio_test_writeback(folio))) {
 			/*
-			 * khugepaged only works on read-only fd, so this
-			 * folio is dirty because it hasn't been flushed
+			 * khugepaged only works on clean file-backed folios,
+			 * so this folio is dirty because it hasn't been flushed
 			 * since first write.
 			 */
 			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
@@ -2431,6 +2430,27 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 			goto out_unlock;
 		}
 
+		/*
+		 * At this point, the folio is locked and unmapped. If the PTE
+		 * was dirty, try_to_unmap() has transferred the dirty bit to
+		 * the folio and we must not collapse it into a clean
+		 * file-backed folio.
+		 *
+		 * If the folio is clean here, no one can write it until we
+		 * drop the folio lock. A write through a stale TLB entry came
+		 * from a clean PTE and must fault because the PTE has been
+		 * cleared; the fault path has to take the folio lock before
+		 * installing a writable mapping. Buffered write paths also
+		 * have to take the folio lock before modifying file contents
+		 * without a mapping, typically via write_begin_get_folio().
+		 */
+		if (!is_shmem && folio_test_dirty(folio)) {
+			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
+			xas_unlock_irq(&xas);
+			folio_putback_lru(folio);
+			goto out_unlock;
+		}
+
 		/*
 		 * Accumulate the folios that are being collapsed.
 		 */
-- 
2.53.0



* [PATCH v5 03/14] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
  2026-04-29 15:29 ` [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
  2026-04-29 15:29 ` [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap() Zi Yan
@ 2026-04-29 15:29 ` Zi Yan
  2026-05-04  3:57   ` Nico Pache
  2026-04-29 15:29 ` [PATCH v5 04/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled() Zi Yan
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Replace it with a check on the max folio order of the file's address_space
mapping, making sure PMD-sized folios are supported. Keep the inode
open-for-write check: even though collapse_file() now makes sure all
to-be-collapsed folios are clean and FSes can handle the created PMD file
THP properly, filemap_flush() could still perform undesirable writeback.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f3fcb4dd1ef8..3b324f03e9283 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -86,9 +86,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 {
 	struct inode *inode;
 
-	if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
-		return false;
-
 	if (!vma->vm_file)
 		return false;
 
@@ -97,6 +94,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	if (IS_ANON_FILE(inode))
 		return false;
 
+	if (!mapping_pmd_folio_support(vma->vm_file->f_mapping))
+		return false;
+
 	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
 
-- 
2.53.0



* [PATCH v5 04/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled()
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (2 preceding siblings ...)
  2026-04-29 15:29 ` [PATCH v5 03/14] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
@ 2026-04-29 15:29 ` Zi Yan
  2026-05-04  4:00   ` Nico Pache
  2026-04-29 15:35 ` [PATCH v5 05/14] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:29 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Remove the READ_ONLY_THP_FOR_FS gate; khugepaged for file-backed PMD-sized
hugepages is now enabled by the global transparent hugepage control.
khugepaged can still be enabled by the per-size controls for anon and shmem
when the global control is off.

Add a shmem_hpage_pmd_enabled() stub for !CONFIG_SHMEM to remove the
IS_ENABLED(CONFIG_SHMEM) check in hugepage_enabled().

Clean up hugepage_enabled() by moving anon code to anon_hpage_enabled().

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 include/linux/shmem_fs.h |  2 +-
 mm/khugepaged.c          | 26 ++++++++++++++++----------
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 93a0ba872ebe0..acb8dd961b45c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -127,7 +127,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 int shmem_unuse(unsigned int type);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
 unsigned long shmem_allowable_huge_orders(struct inode *inode,
 				struct vm_area_struct *vma, pgoff_t index,
 				loff_t write_end, bool shmem_huge_force);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 71209a72195ab..d6971ada8f199 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -524,26 +524,32 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
+static inline bool anon_hpage_enabled(void)
+{
+	if (READ_ONCE(huge_anon_orders_always))
+		return true;
+	if (READ_ONCE(huge_anon_orders_madvise))
+		return true;
+	if (READ_ONCE(huge_anon_orders_inherit) &&
+	    hugepage_global_enabled())
+		return true;
+	return false;
+}
+
 static bool hugepage_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
-	 * hugepages, when configured in, are determined by the global control.
+	 * hugepages are determined by the global control.
 	 * Anon hugepages are determined by its per-size mTHP control.
 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
 	 */
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
-	    hugepage_global_enabled())
-		return true;
-	if (READ_ONCE(huge_anon_orders_always))
+	if (hugepage_global_enabled())
 		return true;
-	if (READ_ONCE(huge_anon_orders_madvise))
-		return true;
-	if (READ_ONCE(huge_anon_orders_inherit) &&
-	    hugepage_global_enabled())
+	if (anon_hpage_enabled())
 		return true;
-	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
+	if (shmem_hpage_pmd_enabled())
 		return true;
 	return false;
 }
-- 
2.53.0



* [PATCH v5 05/14] mm: remove READ_ONLY_THP_FOR_FS Kconfig option
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (3 preceding siblings ...)
  2026-04-29 15:29 ` [PATCH v5 04/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled() Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-05-04  4:02   ` Nico Pache
  2026-04-29 15:35 ` [PATCH v5 06/14] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

After removing the READ_ONLY_THP_FOR_FS check in file_thp_enabled(),
khugepaged and MADV_COLLAPSE can run on FSes with PMD THP pagecache
support even without READ_ONLY_THP_FOR_FS enabled. Remove the Kconfig
option first so that no one can use READ_ONLY_THP_FOR_FS while upcoming
commits remove mapping->nr_thps, which its safeguard mechanism relies on.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/Kconfig | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e221fa1dc54d0..27dc5b0139ba6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -936,17 +936,6 @@ config THP_SWAP
 
 	  For selection by architectures with reasonable THP sizes.
 
-config READ_ONLY_THP_FOR_FS
-	bool "Read-only THP for filesystems (EXPERIMENTAL)"
-	depends on TRANSPARENT_HUGEPAGE
-
-	help
-	  Allow khugepaged to put read-only file-backed pages in THP.
-
-	  This is marked experimental because it is a new feature. Write
-	  support of file THPs will be developed in the next few release
-	  cycles.
-
 config NO_PAGE_MAPCOUNT
 	bool "No per-page mapcount (EXPERIMENTAL)"
 	help
-- 
2.53.0



* [PATCH v5 06/14] mm: fs: remove filemap_nr_thps*() functions and their users
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (4 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 05/14] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-29 15:35 ` [PATCH v5 07/14] fs: remove nr_thps from struct address_space Zi Yan
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

These are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
large folio support, so that read-only THPs created on these FSes are not
seen by the FSes when the underlying fd becomes writable. Now read-only PMD
THPs only appear on FSes with large folio support whose supported orders
include PMD_ORDER.

READ_ONLY_THP_FOR_FS used mapping->nr_thps, inode->i_writecount, and
smp_mb() to prevent writes to a read-only THP and to prevent collapsing
folios from writable files into a THP. In collapse_file(),
mapping->nr_thps is increased, then smp_mb() is executed, and the collapse
is stopped if inode->i_writecount > 0; meanwhile, do_dentry_open() first
increases inode->i_writecount, then executes a full memory fence, and
truncates all read-only THPs if mapping->nr_thps > 0.

Now this mechanism can be removed along with the READ_ONLY_THP_FOR_FS code,
since a dirty folio check has been added after try_to_unmap() in
collapse_file() to prevent dirty folios from being collapsed as clean.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 fs/open.c               | 27 ---------------------------
 include/linux/pagemap.h | 29 -----------------------------
 mm/filemap.c            |  1 -
 mm/huge_memory.c        |  1 -
 mm/khugepaged.c         | 28 ----------------------------
 5 files changed, 86 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 681d405bc61eb..c321b80027f13 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -968,33 +968,6 @@ static int do_dentry_open(struct file *f,
 	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
 		return -EINVAL;
 
-	/*
-	 * XXX: Huge page cache doesn't support writing yet. Drop all page
-	 * cache for this file before processing writes.
-	 */
-	if (f->f_mode & FMODE_WRITE) {
-		/*
-		 * Depends on full fence from get_write_access() to synchronize
-		 * against collapse_file() regarding i_writecount and nr_thps
-		 * updates. Ensures subsequent insertion of THPs into the page
-		 * cache will fail.
-		 */
-		if (filemap_nr_thps(inode->i_mapping)) {
-			struct address_space *mapping = inode->i_mapping;
-
-			filemap_invalidate_lock(inode->i_mapping);
-			/*
-			 * unmap_mapping_range just need to be called once
-			 * here, because the private pages is not need to be
-			 * unmapped mapping (e.g. data segment of dynamic
-			 * shared libraries here).
-			 */
-			unmap_mapping_range(mapping, 0, 0, 0);
-			truncate_inode_pages(mapping, 0);
-			filemap_invalidate_unlock(inode->i_mapping);
-		}
-	}
-
 	return 0;
 
 cleanup_all:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1fed3414fe9b8..c6a4ecd3d6ed1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -545,35 +545,6 @@ static inline size_t mapping_max_folio_size(const struct address_space *mapping)
 	return PAGE_SIZE << mapping_max_folio_order(mapping);
 }
 
-static inline int filemap_nr_thps(const struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	return atomic_read(&mapping->nr_thps);
-#else
-	return 0;
-#endif
-}
-
-static inline void filemap_nr_thps_inc(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	if (!mapping_large_folio_support(mapping))
-		atomic_inc(&mapping->nr_thps);
-#else
-	WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
-#endif
-}
-
-static inline void filemap_nr_thps_dec(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	if (!mapping_large_folio_support(mapping))
-		atomic_dec(&mapping->nr_thps);
-#else
-	WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
-#endif
-}
-
 struct address_space *folio_mapping(const struct folio *folio);
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index ab34cab2416a4..9a5e23fa6a238 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -189,7 +189,6 @@ static void filemap_unaccount_folio(struct address_space *mapping,
 			lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
 	} else if (folio_test_pmd_mappable(folio)) {
 		lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
-		filemap_nr_thps_dec(mapping);
 	}
 	if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
 		mod_node_page_state(folio_pgdat(folio),
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3b324f03e9283..884e8b5811569 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3951,7 +3951,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 				} else {
 					lruvec_stat_mod_folio(folio,
 							NR_FILE_THPS, -nr);
-					filemap_nr_thps_dec(mapping);
 				}
 			}
 		}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d6971ada8f199..1ee15b48962a3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2469,21 +2469,6 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		goto xa_unlocked;
 	}
 
-	if (!is_shmem) {
-		filemap_nr_thps_inc(mapping);
-		/*
-		 * Paired with the fence in do_dentry_open() -> get_write_access()
-		 * to ensure i_writecount is up to date and the update to nr_thps
-		 * is visible. Ensures the page cache will be truncated if the
-		 * file is opened writable.
-		 */
-		smp_mb();
-		if (inode_is_open_for_write(mapping->host)) {
-			result = SCAN_FAIL;
-			filemap_nr_thps_dec(mapping);
-		}
-	}
-
 xa_locked:
 	xas_unlock_irq(&xas);
 xa_unlocked:
@@ -2661,19 +2646,6 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		folio_putback_lru(folio);
 		folio_put(folio);
 	}
-	/*
-	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM
-	 * file only. This undo is not needed unless failure is
-	 * due to SCAN_COPY_MC.
-	 */
-	if (!is_shmem && result == SCAN_COPY_MC) {
-		filemap_nr_thps_dec(mapping);
-		/*
-		 * Paired with the fence in do_dentry_open() -> get_write_access()
-		 * to ensure the update to nr_thps is visible.
-		 */
-		smp_mb();
-	}
 
 	new_folio->mapping = NULL;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 07/14] fs: remove nr_thps from struct address_space
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (5 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 06/14] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-05-04  4:11   ` Nico Pache
  2026-04-29 15:35 ` [PATCH v5 08/14] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Now that the filemap_nr_thps*() helpers are removed, the related field,
address_space->nr_thps, is no longer needed. Remove it. This shrinks
struct address_space by 8 bytes on 64-bit systems, which may increase
the number of inodes we can cache.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 fs/inode.c         | 3 ---
 include/linux/fs.h | 5 -----
 2 files changed, 8 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6a3cbc7dcd28c..d8a6d6266c3c3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -279,9 +279,6 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
 	mapping->flags = 0;
 	mapping->wb_err = 0;
 	atomic_set(&mapping->i_mmap_writable, 0);
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	atomic_set(&mapping->nr_thps, 0);
-#endif
 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
 	mapping->writeback_index = 0;
 	init_rwsem(&mapping->invalidate_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfbb..bb9cc4f7207c1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -460,7 +460,6 @@ struct mapping_metadata_bhs {
  *   memory mappings.
  * @gfp_mask: Memory allocation flags to use for allocating pages.
  * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
- * @nr_thps: Number of THPs in the pagecache (non-shmem only).
  * @i_mmap: Tree of private and shared mappings.
  * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
  * @nrpages: Number of page entries, protected by the i_pages lock.
@@ -476,10 +475,6 @@ struct address_space {
 	struct rw_semaphore	invalidate_lock;
 	gfp_t			gfp_mask;
 	atomic_t		i_mmap_writable;
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	/* number of thp, only for non-shmem files */
-	atomic_t		nr_thps;
-#endif
 	struct rb_root_cached	i_mmap;
 	unsigned long		nrpages;
 	pgoff_t			writeback_index;
-- 
2.53.0



* [PATCH v5 08/14] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (6 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 07/14] fs: remove nr_thps from struct address_space Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-29 15:35 ` [PATCH v5 09/14] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Without READ_ONLY_THP_FOR_FS, large file-backed folios cannot be created
on a FS without large folio support, so the check is no longer needed.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/huge_memory.c | 30 +++---------------------------
 1 file changed, 3 insertions(+), 27 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 884e8b5811569..9b3abb98a7e51 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3846,33 +3846,9 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
 	if (!folio->mapping && !folio_test_anon(folio))
 		return -EBUSY;
 
-	if (folio_test_anon(folio)) {
-		/* order-1 is not supported for anonymous THP. */
-		if (new_order == 1)
-			return -EINVAL;
-	} else if (split_type == SPLIT_TYPE_NON_UNIFORM || new_order) {
-		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
-		    !mapping_large_folio_support(folio->mapping)) {
-			/*
-			 * We can always split a folio down to a single page
-			 * (new_order == 0) uniformly.
-			 *
-			 * For any other scenario
-			 *   a) uniform split targeting a large folio
-			 *      (new_order > 0)
-			 *   b) any non-uniform split
-			 * we must confirm that the file system supports large
-			 * folios.
-			 *
-			 * Note that we might still have THPs in such
-			 * mappings, which is created from khugepaged when
-			 * CONFIG_READ_ONLY_THP_FOR_FS is enabled. But in that
-			 * case, the mapping does not actually support large
-			 * folios properly.
-			 */
-			return -EINVAL;
-		}
-	}
+	/* order-1 is not supported for anonymous THP. */
+	if (folio_test_anon(folio) && new_order == 1)
+		return -EINVAL;
 
 	/*
 	 * swapcache folio could only be split to order 0
-- 
2.53.0



* [PATCH v5 09/14] mm/truncate: use folio_split() in truncate_inode_partial_folio()
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (7 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 08/14] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-30 15:12   ` Zi Yan
  2026-04-29 15:35 ` [PATCH v5 10/14] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

After READ_ONLY_THP_FOR_FS is removed, a FS either supports large folios
or it does not. folio_split() can be used on a FS with large folio
support without worrying about encountering a THP on a FS without large
folio support.

When READ_ONLY_THP_FOR_FS was present, a PMD-sized pagecache folio could
appear in a FS without large folio support after khugepaged or
madvise(MADV_COLLAPSE) created it. During
truncate_inode_partial_folio(), such a folio had to be split, and since
the FS did not support large folios, it had to be split uniformly to
order-0 folios and could not be split non-uniformly into folios of
various orders. try_folio_split_to_order() was added to handle this
situation by checking folio_check_splittable(...,
SPLIT_TYPE_NON_UNIFORM) to detect whether the large folio was created by
READ_ONLY_THP_FOR_FS on a FS without large folio support. Now that
READ_ONLY_THP_FOR_FS is removed, all large pagecache folios are created
on FSes supporting large folios, so this function is no longer needed
and all large pagecache folios can be split non-uniformly.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 25 ++-----------------------
 mm/truncate.c           |  8 ++++----
 2 files changed, 6 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 48496f09909be..127f9e1e7604c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -394,27 +394,6 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
 	return split_huge_page_to_list_to_order(page, NULL, new_order);
 }
 
-/**
- * try_folio_split_to_order() - try to split a @folio at @page to @new_order
- * using non uniform split.
- * @folio: folio to be split
- * @page: split to @new_order at the given page
- * @new_order: the target split order
- *
- * Try to split a @folio at @page using non uniform split to @new_order, if
- * non uniform split is not supported, fall back to uniform split. After-split
- * folios are put back to LRU list. Use min_order_for_split() to get the lower
- * bound of @new_order.
- *
- * Return: 0 - split is successful, otherwise split failed.
- */
-static inline int try_folio_split_to_order(struct folio *folio,
-		struct page *page, unsigned int new_order)
-{
-	if (folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM))
-		return split_huge_page_to_order(&folio->page, new_order);
-	return folio_split(folio, new_order, page, NULL);
-}
 static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list_to_order(page, NULL, 0);
@@ -647,8 +626,8 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
 	return -EINVAL;
 }
 
-static inline int try_folio_split_to_order(struct folio *folio,
-		struct page *page, unsigned int new_order)
+static inline int folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
 {
 	VM_WARN_ON_ONCE_FOLIO(1, folio);
 	return -EINVAL;
diff --git a/mm/truncate.c b/mm/truncate.c
index 12cc89f89afcf..b58ba940be474 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -177,7 +177,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 	return 0;
 }
 
-static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
+static int folio_split_or_unmap(struct folio *folio, struct page *split_at,
 				    unsigned long min_order)
 {
 	enum ttu_flags ttu_flags =
@@ -186,7 +186,7 @@ static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
 		TTU_IGNORE_MLOCK;
 	int ret;
 
-	ret = try_folio_split_to_order(folio, split_at, min_order);
+	ret = folio_split(folio, min_order, split_at, NULL);
 
 	/*
 	 * If the split fails, unmap the folio, so it will be refaulted
@@ -252,7 +252,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 
 	min_order = mapping_min_folio_order(folio->mapping);
 	split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
-	if (!try_folio_split_or_unmap(folio, split_at, min_order)) {
+	if (!folio_split_or_unmap(folio, split_at, min_order)) {
 		/*
 		 * try to split at offset + length to make sure folios within
 		 * the range can be dropped, especially to avoid memory waste
@@ -279,7 +279,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		/* make sure folio2 is large and does not change its mapping */
 		if (folio_test_large(folio2) &&
 		    folio2->mapping == folio->mapping)
-			try_folio_split_or_unmap(folio2, split_at2, min_order);
+			folio_split_or_unmap(folio2, split_at2, min_order);
 
 		folio_unlock(folio2);
 out:
-- 
2.53.0



* [PATCH v5 10/14] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (8 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 09/14] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-29 15:35 ` [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

READ_ONLY_THP_FOR_FS is no longer present, so remove the related comment.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 fs/btrfs/defrag.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 7e2db5d3a4d4c..a8d49d9ca981c 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -860,9 +860,6 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 		return folio;
 
 	/*
-	 * Since we can defragment files opened read-only, we can encounter
-	 * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
-	 *
 	 * The IO for such large folios is not fully tested, thus return
 	 * an error to reject such folios unless it's an experimental build.
 	 *
-- 
2.53.0



* [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (9 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 10/14] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-30 15:16   ` Zi Yan
                     ` (2 more replies)
  2026-04-29 15:35 ` [PATCH v5 12/14] selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions Zi Yan
                   ` (3 subsequent siblings)
  14 siblings, 3 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Change the requirement to a file system with large folio support whose
supported orders include PMD_ORDER.

Also add tests that open a file with read-write permission and populate
folios with writes. Reuse the XFS image from split_huge_page_test.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 tools/testing/selftests/mm/khugepaged.c   | 131 +++++++++++++++-------
 tools/testing/selftests/mm/run_vmtests.sh |  12 +-
 2 files changed, 102 insertions(+), 41 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index a6bb9d50363d2..80b913185c643 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -49,7 +49,8 @@ struct mem_ops {
 	const char *name;
 };
 
-static struct mem_ops *file_ops;
+static struct mem_ops *read_only_file_ops;
+static struct mem_ops *read_write_file_ops;
 static struct mem_ops *anon_ops;
 static struct mem_ops *shmem_ops;
 
@@ -112,7 +113,8 @@ static void restore_settings(int sig)
 static void save_settings(void)
 {
 	printf("Save THP and khugepaged settings...");
-	if (file_ops && finfo.type == VMA_FILE)
+	if ((read_only_file_ops || read_write_file_ops) &&
+	    finfo.type == VMA_FILE)
 		thp_set_read_ahead_path(finfo.dev_queue_read_ahead_path);
 	thp_save_settings();
 
@@ -364,11 +366,14 @@ static bool anon_check_huge(void *addr, int nr_hpages)
 	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
 }
 
-static void *file_setup_area(int nr_hpages)
+static void *file_setup_area_common(int nr_hpages, bool read_only)
 {
 	int fd;
 	void *p;
 	unsigned long size;
+	int open_opt = read_only ? O_RDONLY : O_RDWR;
+	int mmap_prot = read_only ? PROT_READ : (PROT_READ | PROT_WRITE);
+	int mmap_opt = read_only ? MAP_PRIVATE : MAP_SHARED;
 
 	unlink(finfo.path);  /* Cleanup from previous failed tests */
 	printf("Creating %s for collapse%s...", finfo.path,
@@ -399,14 +404,15 @@ static void *file_setup_area(int nr_hpages)
 	munmap(p, size);
 	success("OK");
 
-	printf("Opening %s read only for collapse...", finfo.path);
-	finfo.fd = open(finfo.path, O_RDONLY, 777);
+	printf("Opening %s %s for collapse...", finfo.path,
+	       read_only ? "read only" : "read-write");
+	finfo.fd = open(finfo.path, open_opt, 777);
 	if (finfo.fd < 0) {
 		perror("open()");
 		exit(EXIT_FAILURE);
 	}
-	p = mmap(BASE_ADDR, size, PROT_READ,
-		 MAP_PRIVATE, finfo.fd, 0);
+	p = mmap(BASE_ADDR, size, mmap_prot,
+		 mmap_opt, finfo.fd, 0);
 	if (p == MAP_FAILED || p != BASE_ADDR) {
 		perror("mmap()");
 		exit(EXIT_FAILURE);
@@ -418,6 +424,16 @@ static void *file_setup_area(int nr_hpages)
 	return p;
 }
 
+static void *file_setup_read_only_area(int nr_hpages)
+{
+	return file_setup_area_common(nr_hpages, /* read_only= */ true);
+}
+
+static void *file_setup_read_write_area(int nr_hpages)
+{
+	return file_setup_area_common(nr_hpages, /* read_only= */ false);
+}
+
 static void file_cleanup_area(void *p, unsigned long size)
 {
 	munmap(p, size);
@@ -425,14 +441,25 @@ static void file_cleanup_area(void *p, unsigned long size)
 	unlink(finfo.path);
 }
 
-static void file_fault(void *p, unsigned long start, unsigned long end)
+static void file_fault_common(void *p, unsigned long start, unsigned long end,
+		int madv_ops)
 {
-	if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) {
+	if (madvise(((char *)p) + start, end - start, madv_ops)) {
 		perror("madvise(MADV_POPULATE_READ");
 		exit(EXIT_FAILURE);
 	}
 }
 
+static void file_fault_read(void *p, unsigned long start, unsigned long end)
+{
+	file_fault_common(p, start, end, MADV_POPULATE_READ);
+}
+
+static void file_fault_write(void *p, unsigned long start, unsigned long end)
+{
+	file_fault_common(p, start, end, MADV_POPULATE_WRITE);
+}
+
 static bool file_check_huge(void *addr, int nr_hpages)
 {
 	switch (finfo.type) {
@@ -488,10 +515,18 @@ static struct mem_ops __anon_ops = {
 	.name = "anon",
 };
 
-static struct mem_ops __file_ops = {
-	.setup_area = &file_setup_area,
+static struct mem_ops __read_only_file_ops = {
+	.setup_area = &file_setup_read_only_area,
 	.cleanup_area = &file_cleanup_area,
-	.fault = &file_fault,
+	.fault = &file_fault_read,
+	.check_huge = &file_check_huge,
+	.name = "file",
+};
+
+static struct mem_ops __read_write_file_ops = {
+	.setup_area = &file_setup_read_write_area,
+	.cleanup_area = &file_cleanup_area,
+	.fault = &file_fault_write,
 	.check_huge = &file_check_huge,
 	.name = "file",
 };
@@ -504,6 +539,18 @@ static struct mem_ops __shmem_ops = {
 	.name = "shmem",
 };
 
+static bool is_tmpfs(struct mem_ops *ops)
+{
+	return (ops == &__read_only_file_ops ||
+		ops == &__read_write_file_ops) &&
+	       finfo.type == VMA_SHMEM;
+}
+
+static bool is_anon(struct mem_ops *ops)
+{
+	return ops == &__anon_ops;
+}
+
 static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 			       struct mem_ops *ops, bool expect)
 {
@@ -512,6 +559,10 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 
 	printf("%s...", msg);
 
+	/* read&write file collapse always fails */
+	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
+		expect = false;
+
 	/*
 	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
 	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
@@ -578,6 +629,10 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
 static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
 				struct mem_ops *ops, bool expect)
 {
+	/* read&write file collapse always fails */
+	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
+		expect = false;
+
 	if (wait_for_scan(msg, p, nr_hpages, ops)) {
 		if (expect)
 			fail("Timeout");
@@ -612,16 +667,6 @@ static struct collapse_context __madvise_context = {
 	.name = "madvise",
 };
 
-static bool is_tmpfs(struct mem_ops *ops)
-{
-	return ops == &__file_ops && finfo.type == VMA_SHMEM;
-}
-
-static bool is_anon(struct mem_ops *ops)
-{
-	return ops == &__anon_ops;
-}
-
 static void alloc_at_fault(void)
 {
 	struct thp_settings settings = *thp_current_settings();
@@ -1097,8 +1142,8 @@ static void usage(void)
 	fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
 	fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
 	fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
-	fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
-	fprintf(stderr,	"\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
+	fprintf(stderr, "\n\t\"file,all\" mem_type requires a file system\n");
+	fprintf(stderr,	"\twith large folio support (order >= PMD order)\n");
 	fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
 	fprintf(stderr,	"\tmounted with huge=advise option for khugepaged tests to work\n");
 	fprintf(stderr,	"\n\tSupported Options:\n");
@@ -1154,20 +1199,22 @@ static void parse_test_type(int argc, char **argv)
 		usage();
 
 	if (!strcmp(buf, "all")) {
-		file_ops =  &__file_ops;
+		read_only_file_ops =  &__read_only_file_ops;
+		read_write_file_ops =  &__read_write_file_ops;
 		anon_ops = &__anon_ops;
 		shmem_ops = &__shmem_ops;
 	} else if (!strcmp(buf, "anon")) {
 		anon_ops = &__anon_ops;
 	} else if (!strcmp(buf, "file")) {
-		file_ops =  &__file_ops;
+		read_only_file_ops =  &__read_only_file_ops;
+		read_write_file_ops =  &__read_write_file_ops;
 	} else if (!strcmp(buf, "shmem")) {
 		shmem_ops = &__shmem_ops;
 	} else {
 		usage();
 	}
 
-	if (!file_ops)
+	if (!read_only_file_ops && !read_write_file_ops)
 		return;
 
 	if (argc != 2)
@@ -1239,37 +1286,43 @@ int main(int argc, char **argv)
 	} while (0)
 
 	TEST(collapse_full, khugepaged_context, anon_ops);
-	TEST(collapse_full, khugepaged_context, file_ops);
+	TEST(collapse_full, khugepaged_context, read_only_file_ops);
+	TEST(collapse_full, khugepaged_context, read_write_file_ops);
 	TEST(collapse_full, khugepaged_context, shmem_ops);
 	TEST(collapse_full, madvise_context, anon_ops);
-	TEST(collapse_full, madvise_context, file_ops);
+	TEST(collapse_full, madvise_context, read_only_file_ops);
+	TEST(collapse_full, madvise_context, read_write_file_ops);
 	TEST(collapse_full, madvise_context, shmem_ops);
 
 	TEST(collapse_empty, khugepaged_context, anon_ops);
 	TEST(collapse_empty, madvise_context, anon_ops);
 
 	TEST(collapse_single_pte_entry, khugepaged_context, anon_ops);
-	TEST(collapse_single_pte_entry, khugepaged_context, file_ops);
+	TEST(collapse_single_pte_entry, khugepaged_context, read_only_file_ops);
+	TEST(collapse_single_pte_entry, khugepaged_context, read_write_file_ops);
 	TEST(collapse_single_pte_entry, khugepaged_context, shmem_ops);
 	TEST(collapse_single_pte_entry, madvise_context, anon_ops);
-	TEST(collapse_single_pte_entry, madvise_context, file_ops);
+	TEST(collapse_single_pte_entry, madvise_context, read_only_file_ops);
+	TEST(collapse_single_pte_entry, madvise_context, read_write_file_ops);
 	TEST(collapse_single_pte_entry, madvise_context, shmem_ops);
 
 	TEST(collapse_max_ptes_none, khugepaged_context, anon_ops);
-	TEST(collapse_max_ptes_none, khugepaged_context, file_ops);
+	TEST(collapse_max_ptes_none, khugepaged_context, read_only_file_ops);
+	TEST(collapse_max_ptes_none, khugepaged_context, read_write_file_ops);
 	TEST(collapse_max_ptes_none, madvise_context, anon_ops);
-	TEST(collapse_max_ptes_none, madvise_context, file_ops);
+	TEST(collapse_max_ptes_none, madvise_context, read_only_file_ops);
+	TEST(collapse_max_ptes_none, madvise_context, read_write_file_ops);
 
 	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
-	TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops);
+	TEST(collapse_single_pte_entry_compound, khugepaged_context, read_only_file_ops);
 	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
-	TEST(collapse_single_pte_entry_compound, madvise_context, file_ops);
+	TEST(collapse_single_pte_entry_compound, madvise_context, read_only_file_ops);
 
 	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
-	TEST(collapse_full_of_compound, khugepaged_context, file_ops);
+	TEST(collapse_full_of_compound, khugepaged_context, read_only_file_ops);
 	TEST(collapse_full_of_compound, khugepaged_context, shmem_ops);
 	TEST(collapse_full_of_compound, madvise_context, anon_ops);
-	TEST(collapse_full_of_compound, madvise_context, file_ops);
+	TEST(collapse_full_of_compound, madvise_context, read_only_file_ops);
 	TEST(collapse_full_of_compound, madvise_context, shmem_ops);
 
 	TEST(collapse_compound_extreme, khugepaged_context, anon_ops);
@@ -1291,10 +1344,10 @@ int main(int argc, char **argv)
 	TEST(collapse_max_ptes_shared, madvise_context, anon_ops);
 
 	TEST(madvise_collapse_existing_thps, madvise_context, anon_ops);
-	TEST(madvise_collapse_existing_thps, madvise_context, file_ops);
+	TEST(madvise_collapse_existing_thps, madvise_context, read_only_file_ops);
 	TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops);
 
-	TEST(madvise_retracted_page_tables, madvise_context, file_ops);
+	TEST(madvise_retracted_page_tables, madvise_context, read_only_file_ops);
 	TEST(madvise_retracted_page_tables, madvise_context, shmem_ops);
 
 	restore_settings(0);
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 3b61677fe9840..854c5c3e3a6ae 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -490,8 +490,6 @@ CATEGORY="thp" run_test ./khugepaged all:shmem
 
 CATEGORY="thp" run_test ./khugepaged -s 4 all:shmem
 
-CATEGORY="thp" run_test ./transhuge-stress -d 20
-
 # Try to create XFS if not provided
 if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
     if [ "${HAVE_HUGEPAGES}" = "1" ]; then
@@ -508,6 +506,14 @@ if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
     fi
 fi
 
+if [ -n "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
+CATEGORY="thp" run_test ./khugepaged all:file ${SPLIT_HUGE_PAGE_TEST_XFS_PATH}
+else
+	count_total=$(( count_total + 1 ))
+	count_skip=$(( count_skip + 1 ))
+	echo "[SKIP] ./khugepaged all:file" | tap_prefix
+fi
+
 CATEGORY="thp" run_test ./split_huge_page_test ${SPLIT_HUGE_PAGE_TEST_XFS_PATH}
 
 if [ -n "${MOUNTED_XFS}" ]; then
@@ -516,6 +522,8 @@ if [ -n "${MOUNTED_XFS}" ]; then
     rm -f ${XFS_IMG}
 fi
 
+CATEGORY="thp" run_test ./transhuge-stress -d 20
+
 CATEGORY="thp" run_test ./folio_split_race_test
 
 CATEGORY="migration" run_test ./migration
-- 
2.53.0



* [PATCH v5 12/14] selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (10 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-29 15:35 ` [PATCH v5 13/14] mm/khugepaged: enable clean pagecache folio collapse for writable files Zi Yan
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Any file system with large folio support whose supported orders include
PMD_ORDER can be used. There is no need to open the file read-only.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
---
 tools/testing/selftests/mm/guard-regions.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 48e8b1539be3a..1176398919537 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -2203,17 +2203,6 @@ TEST_F(guard_regions, collapse)
 	if (variant->backing != ANON_BACKED)
 		ASSERT_EQ(ftruncate(self->fd, size), 0);
 
-	/*
-	 * We must close and re-open local-file backed as read-only for
-	 * CONFIG_READ_ONLY_THP_FOR_FS to work.
-	 */
-	if (variant->backing == LOCAL_FILE_BACKED) {
-		ASSERT_EQ(close(self->fd), 0);
-
-		self->fd = open(self->path, O_RDONLY);
-		ASSERT_GE(self->fd, 0);
-	}
-
 	ptr = mmap_(self, variant, NULL, size, PROT_READ, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);
 
@@ -2237,9 +2226,10 @@ TEST_F(guard_regions, collapse)
 	/*
 	 * Now collapse the entire region. This should fail in all cases.
 	 *
-	 * The madvise() call will also fail if CONFIG_READ_ONLY_THP_FOR_FS is
-	 * not set for the local file case, but we can't differentiate whether
-	 * this occurred or if the collapse was rightly rejected.
+	 * The madvise() call will also fail if the file system does not support
+	 * large folio or the supported orders do not include PMD_ORDER for the
+	 * local file case, but we can't differentiate whether this occurred or
+	 * if the collapse was rightly rejected.
 	 */
 	EXPECT_NE(madvise(ptr, size, MADV_COLLAPSE), 0);
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 13/14] mm/khugepaged: enable clean pagecache folio collapse for writable files
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (11 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 12/14] selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-30 15:18   ` Zi Yan
  2026-04-29 15:35 ` [PATCH v5 14/14] selftests/mm: add writable-file collapse tests for khugepaged Zi Yan
  2026-04-29 16:13 ` [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Andrew Morton
  14 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

collapse_file() is capable of collapsing pagecache folios from writable
files to PMD folios. Now enable clean pagecache folio collapse in addition
to read-only pagecache folio collapse by removing the
inode_is_open_for_write() check from file_thp_enabled() and only performing
filemap_flush() if the file is read-only.

This means userspace needs to explicitly flush the content of pagecache
folios before khugepaged can collapse them, or use madvise(MADV_COLLAPSE),
which performs the flush during its retry. The reason is that blindly
enabling collapse of dirty pagecache folios from writable files would make
khugepaged flush these folios all the time, and causing system-level
pagecache writeback is undesirable.

To properly support dirty pagecache folio collapse, filemap_flush() needs
to be avoided. Potentially, merging the associated buffers instead of
dropping them with filemap_release_folio() might be needed.

NOTE: this breaks the khugepaged selftests for writable file pagecache
collapse, which are expected to fail all the time. The next commit fixes them.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 2 +-
 mm/khugepaged.c  | 9 ++++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9b3abb98a7e51..e1e9d59db6e70 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -97,7 +97,7 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	if (!mapping_pmd_folio_support(vma->vm_file->f_mapping))
 		return false;
 
-	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
+	return S_ISREG(inode->i_mode);
 }
 
 /* If returns true, we are unable to access the VMA's folios. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ee15b48962a3..fb7ff643973cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2345,7 +2345,14 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 				 * forcing writeback in loop.
 				 */
 				xas_unlock_irq(&xas);
-				filemap_flush(mapping);
+				/*
+				 * Only flush for read-only files. Writable
+				 * files can have their folios dirty at any
+				 * time; blindly flushing them would cause
+				 * undesirable system-wide writeback.
+				 */
+				if (!inode_is_open_for_write(mapping->host))
+					filemap_flush(mapping);
 				result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
 				goto xa_unlocked;
 			} else if (folio_test_writeback(folio)) {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 14/14] selftests/mm: add writable-file collapse tests for khugepaged
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (12 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 13/14] mm/khugepaged: enable clean pagecache folio collapse for writable files Zi Yan
@ 2026-04-29 15:35 ` Zi Yan
  2026-04-29 16:13 ` [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Andrew Morton
  14 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-29 15:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

collapse_file() now supports collapsing clean pagecache folios from
writable files, so add corresponding tests.

Note that madvise(MADV_COLLAPSE) works for dirty pagecache folios from
writable files, because collapse_single_pmd() triggers a synchronous
writeback when the first attempt of collapse_file() fails. That writeback
makes the dirty folios clean, and the retry of collapse_file() succeeds.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 tools/testing/selftests/mm/khugepaged.c | 113 ++++++++++++++++++------
 1 file changed, 86 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 80b913185c643..e73aab5149bdf 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -41,6 +41,12 @@ enum vma_type {
 	VMA_SHMEM,
 };
 
+enum file_setup_ops {
+	FILE_SETUP_READ_ONLY_FS,
+	FILE_SETUP_READ_WRITE_FS_READ_DATA,
+	FILE_SETUP_READ_WRITE_FS_WRITE_DATA,
+};
+
 struct mem_ops {
 	void *(*setup_area)(int nr_hpages);
 	void (*cleanup_area)(void *p, unsigned long size);
@@ -50,7 +56,8 @@ struct mem_ops {
 };
 
 static struct mem_ops *read_only_file_ops;
-static struct mem_ops *read_write_file_ops;
+static struct mem_ops *read_write_file_read_ops;
+static struct mem_ops *read_write_file_write_ops;
 static struct mem_ops *anon_ops;
 static struct mem_ops *shmem_ops;
 
@@ -113,7 +120,8 @@ static void restore_settings(int sig)
 static void save_settings(void)
 {
 	printf("Save THP and khugepaged settings...");
-	if ((read_only_file_ops || read_write_file_ops) &&
+	if ((read_only_file_ops || read_write_file_read_ops ||
+	     read_write_file_write_ops) &&
 	    finfo.type == VMA_FILE)
 		thp_set_read_ahead_path(finfo.dev_queue_read_ahead_path);
 	thp_save_settings();
@@ -366,14 +374,14 @@ static bool anon_check_huge(void *addr, int nr_hpages)
 	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
 }
 
-static void *file_setup_area_common(int nr_hpages, bool read_only)
+static void *file_setup_area_common(int nr_hpages, enum file_setup_ops setup)
 {
 	int fd;
 	void *p;
 	unsigned long size;
-	int open_opt = read_only ? O_RDONLY : O_RDWR;
-	int mmap_prot = read_only ? PROT_READ : (PROT_READ | PROT_WRITE);
-	int mmap_opt = read_only ? MAP_PRIVATE : MAP_SHARED;
+	int open_opt = setup == FILE_SETUP_READ_ONLY_FS ? O_RDONLY : O_RDWR;
+	int mmap_prot = setup == FILE_SETUP_READ_ONLY_FS ? PROT_READ : (PROT_READ | PROT_WRITE);
+	int mmap_opt = setup == FILE_SETUP_READ_ONLY_FS ? MAP_PRIVATE : MAP_SHARED;
 
 	unlink(finfo.path);  /* Cleanup from previous failed tests */
 	printf("Creating %s for collapse%s...", finfo.path,
@@ -405,7 +413,10 @@ static void *file_setup_area_common(int nr_hpages, bool read_only)
 	success("OK");
 
 	printf("Opening %s %s for collapse...", finfo.path,
-	       read_only ? "read only" : "read-write");
+	       setup == FILE_SETUP_READ_ONLY_FS ? "read only" :
+	       setup == FILE_SETUP_READ_WRITE_FS_READ_DATA ?
+						  "read-write (read)" :
+						  "read-write (write)");
 	finfo.fd = open(finfo.path, open_opt, 777);
 	if (finfo.fd < 0) {
 		perror("open()");
@@ -426,12 +437,17 @@ static void *file_setup_area_common(int nr_hpages, bool read_only)
 
 static void *file_setup_read_only_area(int nr_hpages)
 {
-	return file_setup_area_common(nr_hpages, /* read_only= */ true);
+	return file_setup_area_common(nr_hpages, FILE_SETUP_READ_ONLY_FS);
+}
+
+static void *file_setup_read_write_fs_read_area(int nr_hpages)
+{
+	return file_setup_area_common(nr_hpages, FILE_SETUP_READ_WRITE_FS_READ_DATA);
 }
 
-static void *file_setup_read_write_area(int nr_hpages)
+static void *file_setup_read_write_fs_write_area(int nr_hpages)
 {
-	return file_setup_area_common(nr_hpages, /* read_only= */ false);
+	return file_setup_area_common(nr_hpages, FILE_SETUP_READ_WRITE_FS_WRITE_DATA);
 }
 
 static void file_cleanup_area(void *p, unsigned long size)
@@ -455,6 +471,17 @@ static void file_fault_read(void *p, unsigned long start, unsigned long end)
 	file_fault_common(p, start, end, MADV_POPULATE_READ);
 }
 
+static void file_fault_read_and_flush(void *p, unsigned long start, unsigned long end)
+{
+	file_fault_common(p, start, end, MADV_POPULATE_READ);
+
+	/*
+	 * Make the folio clean, since dirty folios from a read-write
+	 * file are rejected and not flushed.
+	 */
+	msync((char *)p + start, end - start, MS_SYNC);
+}
+
 static void file_fault_write(void *p, unsigned long start, unsigned long end)
 {
 	file_fault_common(p, start, end, MADV_POPULATE_WRITE);
@@ -523,8 +550,16 @@ static struct mem_ops __read_only_file_ops = {
 	.name = "file",
 };
 
-static struct mem_ops __read_write_file_ops = {
-	.setup_area = &file_setup_read_write_area,
+static struct mem_ops __read_write_file_read_ops = {
+	.setup_area = &file_setup_read_write_fs_read_area,
+	.cleanup_area = &file_cleanup_area,
+	.fault = &file_fault_read_and_flush,
+	.check_huge = &file_check_huge,
+	.name = "file",
+};
+
+static struct mem_ops __read_write_file_write_ops = {
+	.setup_area = &file_setup_read_write_fs_write_area,
 	.cleanup_area = &file_cleanup_area,
 	.fault = &file_fault_write,
 	.check_huge = &file_check_huge,
@@ -542,7 +577,8 @@ static struct mem_ops __shmem_ops = {
 static bool is_tmpfs(struct mem_ops *ops)
 {
 	return (ops == &__read_only_file_ops ||
-		ops == &__read_write_file_ops) &&
+		ops == &__read_write_file_read_ops ||
+		ops == &__read_write_file_write_ops) &&
 	       finfo.type == VMA_SHMEM;
 }
 
@@ -559,9 +595,11 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 
 	printf("%s...", msg);
 
-	/* read&write file collapse always fail */
-	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
-		expect = false;
+	/*
+	 * Read-write file collapse succeeds for MADV_COLLAPSE because dirty
+	 * folios are written back after the first collapse attempt fails
+	 * and the collapse is then retried.
+	 */
 
 	/*
 	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
@@ -629,8 +667,11 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
 static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
 				struct mem_ops *ops, bool expect)
 {
-	/* read&write file collapse always fail */
-	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
+	/*
+	 * Read-write file collapse fails since khugepaged does not flush
+	 * the target dirty folios.
+	 */
+	if (!is_tmpfs(ops) && ops == &__read_write_file_write_ops)
 		expect = false;
 
 	if (wait_for_scan(msg, p, nr_hpages, ops)) {
@@ -753,6 +794,9 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - fault_nr_pages) * page_size);
 
 	if (c->enforce_pte_scan_limits) {
+		ops->cleanup_area(p, hpage_pmd_size);
+		p = ops->setup_area(1);
+
 		ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
 		c->collapse("Collapse with max_ptes_none PTEs empty", p, 1, ops,
 			    true);
@@ -1200,21 +1244,24 @@ static void parse_test_type(int argc, char **argv)
 
 	if (!strcmp(buf, "all")) {
 		read_only_file_ops =  &__read_only_file_ops;
-		read_write_file_ops =  &__read_write_file_ops;
+		read_write_file_read_ops =  &__read_write_file_read_ops;
+		read_write_file_write_ops =  &__read_write_file_write_ops;
 		anon_ops = &__anon_ops;
 		shmem_ops = &__shmem_ops;
 	} else if (!strcmp(buf, "anon")) {
 		anon_ops = &__anon_ops;
 	} else if (!strcmp(buf, "file")) {
 		read_only_file_ops =  &__read_only_file_ops;
-		read_write_file_ops =  &__read_write_file_ops;
+		read_write_file_read_ops =  &__read_write_file_read_ops;
+		read_write_file_write_ops =  &__read_write_file_write_ops;
 	} else if (!strcmp(buf, "shmem")) {
 		shmem_ops = &__shmem_ops;
 	} else {
 		usage();
 	}
 
-	if (!read_only_file_ops && !read_write_file_ops)
+	if (!read_only_file_ops && !read_write_file_read_ops &&
+	    !read_write_file_write_ops)
 		return;
 
 	if (argc != 2)
@@ -1287,11 +1334,13 @@ int main(int argc, char **argv)
 
 	TEST(collapse_full, khugepaged_context, anon_ops);
 	TEST(collapse_full, khugepaged_context, read_only_file_ops);
-	TEST(collapse_full, khugepaged_context, read_write_file_ops);
+	TEST(collapse_full, khugepaged_context, read_write_file_read_ops);
+	TEST(collapse_full, khugepaged_context, read_write_file_write_ops);
 	TEST(collapse_full, khugepaged_context, shmem_ops);
 	TEST(collapse_full, madvise_context, anon_ops);
 	TEST(collapse_full, madvise_context, read_only_file_ops);
-	TEST(collapse_full, madvise_context, read_write_file_ops);
+	TEST(collapse_full, madvise_context, read_write_file_read_ops);
+	TEST(collapse_full, madvise_context, read_write_file_write_ops);
 	TEST(collapse_full, madvise_context, shmem_ops);
 
 	TEST(collapse_empty, khugepaged_context, anon_ops);
@@ -1299,30 +1348,38 @@ int main(int argc, char **argv)
 
 	TEST(collapse_single_pte_entry, khugepaged_context, anon_ops);
 	TEST(collapse_single_pte_entry, khugepaged_context, read_only_file_ops);
-	TEST(collapse_single_pte_entry, khugepaged_context, read_write_file_ops);
+	TEST(collapse_single_pte_entry, khugepaged_context, read_write_file_read_ops);
+	TEST(collapse_single_pte_entry, khugepaged_context, read_write_file_write_ops);
 	TEST(collapse_single_pte_entry, khugepaged_context, shmem_ops);
 	TEST(collapse_single_pte_entry, madvise_context, anon_ops);
 	TEST(collapse_single_pte_entry, madvise_context, read_only_file_ops);
-	TEST(collapse_single_pte_entry, madvise_context, read_write_file_ops);
+	TEST(collapse_single_pte_entry, madvise_context, read_write_file_read_ops);
+	TEST(collapse_single_pte_entry, madvise_context, read_write_file_write_ops);
 	TEST(collapse_single_pte_entry, madvise_context, shmem_ops);
 
 	TEST(collapse_max_ptes_none, khugepaged_context, anon_ops);
 	TEST(collapse_max_ptes_none, khugepaged_context, read_only_file_ops);
-	TEST(collapse_max_ptes_none, khugepaged_context, read_write_file_ops);
+	TEST(collapse_max_ptes_none, khugepaged_context, read_write_file_read_ops);
+	TEST(collapse_max_ptes_none, khugepaged_context, read_write_file_write_ops);
 	TEST(collapse_max_ptes_none, madvise_context, anon_ops);
 	TEST(collapse_max_ptes_none, madvise_context, read_only_file_ops);
-	TEST(collapse_max_ptes_none, madvise_context, read_write_file_ops);
+	TEST(collapse_max_ptes_none, madvise_context, read_write_file_read_ops);
+	TEST(collapse_max_ptes_none, madvise_context, read_write_file_write_ops);
 
 	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
 	TEST(collapse_single_pte_entry_compound, khugepaged_context, read_only_file_ops);
+	TEST(collapse_single_pte_entry_compound, khugepaged_context, read_write_file_read_ops);
 	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
 	TEST(collapse_single_pte_entry_compound, madvise_context, read_only_file_ops);
+	TEST(collapse_single_pte_entry_compound, madvise_context, read_write_file_read_ops);
 
 	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
 	TEST(collapse_full_of_compound, khugepaged_context, read_only_file_ops);
+	TEST(collapse_full_of_compound, khugepaged_context, read_write_file_read_ops);
 	TEST(collapse_full_of_compound, khugepaged_context, shmem_ops);
 	TEST(collapse_full_of_compound, madvise_context, anon_ops);
 	TEST(collapse_full_of_compound, madvise_context, read_only_file_ops);
+	TEST(collapse_full_of_compound, madvise_context, read_write_file_read_ops);
 	TEST(collapse_full_of_compound, madvise_context, shmem_ops);
 
 	TEST(collapse_compound_extreme, khugepaged_context, anon_ops);
@@ -1345,9 +1402,11 @@ int main(int argc, char **argv)
 
 	TEST(madvise_collapse_existing_thps, madvise_context, anon_ops);
 	TEST(madvise_collapse_existing_thps, madvise_context, read_only_file_ops);
+	TEST(madvise_collapse_existing_thps, madvise_context, read_write_file_read_ops);
 	TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops);
 
 	TEST(madvise_retracted_page_tables, madvise_context, read_only_file_ops);
+	TEST(madvise_retracted_page_tables, madvise_context, read_write_file_read_ops);
 	TEST(madvise_retracted_page_tables, madvise_context, shmem_ops);
 
 	restore_settings(0);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files
  2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
                   ` (13 preceding siblings ...)
  2026-04-29 15:35 ` [PATCH v5 14/14] selftests/mm: add writable-file collapse tests for khugepaged Zi Yan
@ 2026-04-29 16:13 ` Andrew Morton
  14 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2026-04-29 16:13 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
	David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Wed, 29 Apr 2026 11:29:10 -0400 Zi Yan <ziy@nvidia.com> wrote:

> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
> file-backed THPs for FSes with large folio support (the supported orders
> need to include PMD_ORDER) by default, including for writable files.

Thanks, I queued this up.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
  2026-04-29 15:29 ` [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
@ 2026-04-30 14:37   ` Zi Yan
  2026-04-30 15:04     ` Andrew Morton
  2026-05-04  3:48   ` Nico Pache
  1 sibling, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-30 14:37 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 29 Apr 2026, at 11:29, Zi Yan wrote:

> collapse_file() requires FSes that support large folios of at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that.
> MADV_COLLAPSE ignores the shmem huge config, so exclude the check for shmem.
>
> While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE.
>
> Add a helper function mapping_pmd_folio_support() for FSes supporting large
> folio with at least PMD_ORDER.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  include/linux/pagemap.h | 26 ++++++++++++++++++++++++++
>  mm/khugepaged.c         | 10 ++++++++--
>  2 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 1f50991b43e3b..1fed3414fe9b8 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -513,6 +513,32 @@ static inline bool mapping_large_folio_support(const struct address_space *mappi
>  	return mapping_max_folio_order(mapping) > 0;
>  }
>
> +/**
> + * mapping_pmd_folio_support() - Check if a mapping supports PMD-sized folios
> + * @mapping: The address_space
> + *
> + * Some files support large folios but not ones as large as PMD order.
> + * Before attempting to create a PMD-sized pagecache folio on a filesystem,
> + * this check needs to be performed first.
> + *
> + * Return: true - PMD-sized folio is supported, false - PMD-sized folio is not
> + * supported.
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline bool mapping_pmd_folio_support(const struct address_space *mapping)
> +{
> +	/* AS_FOLIO_ORDER is only reasonable for pagecache folios */
> +	VM_WARN_ON_ONCE((unsigned long)mapping & FOLIO_MAPPING_ANON);
> +
> +	return mapping_max_folio_order(mapping) >= PMD_ORDER;
> +}
> +#else
> +static inline bool mapping_pmd_folio_support(const struct address_space *mapping)
> +{
> +	return false;
> +}
> +#endif
> +
>  /* Return the maximum folio size for this pagecache mapping, in bytes. */
>  static inline size_t mapping_max_folio_size(const struct address_space *mapping)
>  {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e112525c4aa9c..6808f2b48d864 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2235,8 +2235,14 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	int nr_none = 0;
>  	bool is_shmem = shmem_file(file);
>
> -	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> -	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> +	/*
> +	 * MADV_COLLAPSE ignores shmem huge config, so do not check shmem
> +	 *
> +	 * TODO: once shmem always calls mapping_set_large_folios() on its
> +	 * mapping, the shmem check can be removed.
> +	 */
> +	VM_WARN_ON_ONCE(!is_shmem && !mapping_pmd_folio_support(mapping));

Sashiko asked:
Is it possible for userspace to intentionally trigger this warning via
madvise(MADV_COLLAPSE) on an unsupported read-only file? Although in
a later commit, modified file_thp_enabled() will prevent this.

Answer:
It is possible when CONFIG_READ_ONLY_THP_FOR_FS is enabled, but it is
unlikely a kernel will be shipped at this exact commit.


> +	VM_WARN_ON_ONCE(start & (HPAGE_PMD_NR - 1));
>
>  	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
> -- 
> 2.53.0


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
  2026-04-30 14:37   ` Zi Yan
@ 2026-04-30 15:04     ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2026-04-30 15:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Matthew Wilcox (Oracle), Song Liu, Chris Mason,
	David Sterba, Alexander Viro, Christian Brauner, Jan Kara,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Thu, 30 Apr 2026 10:37:52 -0400 Zi Yan <ziy@nvidia.com> wrote:

>  +	 * MADV_COLLAPSE ignores shmem huge config, so do not check shmem
> > +	 *
> > +	 * TODO: once shmem always calls mapping_set_large_folios() on its
> > +	 * mapping, the shmem check can be removed.
> > +	 */
> > +	VM_WARN_ON_ONCE(!is_shmem && !mapping_pmd_folio_support(mapping));
> 
> sashiko asked:
> Is it possible for userspace to intentionally trigger this warning via
> madvise(MADV_COLLAPSE) on an unsupported read-only file? Although in
> a later commit, modified file_thp_enabled() will prevent this.
> 
> Answer:
> It is possible when CONFIG_READ_ONLY_THP_FOR_FS is enabled, but it is
> unlikely a kernel will be shipped at this exact commit.

Yeah, let's not worry about minor bisection holes.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap()
  2026-04-29 15:29 ` [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap() Zi Yan
@ 2026-04-30 15:11   ` Zi Yan
  2026-05-04  3:53   ` Nico Pache
  2026-05-06  5:23   ` Lance Yang
  2 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-30 15:11 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 29 Apr 2026, at 11:29, Zi Yan wrote:

> This check ensures the correctness of read-only PMD folio collapse
> after it is enabled for all FSes supporting PMD pagecache folios and
> replaces READ_ONLY_THP_FOR_FS.
>
> READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
> and inode->i_writecount to prevent any write to read-only to-be-collapsed
> folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
> aforementioned mechanism will go away too. To ensure khugepaged functions
> as expected after the changes, skip if any folio is dirty after
> try_to_unmap(), since a dirty folio at that point means this read-only
> folio can get writes between try_to_unmap() and try_to_unmap_flush() via
> cached TLB entries and khugepaged does not support writable pagecache folio
> collapse yet.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>  mm/khugepaged.c | 28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6808f2b48d864..71209a72195ab 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2327,8 +2327,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  				}
>  			} else if (folio_test_dirty(folio)) {
>  				/*
> -				 * khugepaged only works on read-only fd,
> -				 * so this page is dirty because it hasn't
> +				 * This page is dirty because it hasn't
>  				 * been flushed since first write. There
>  				 * won't be new dirty pages.
>  				 *
> @@ -2386,8 +2385,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  		if (!is_shmem && (folio_test_dirty(folio) ||
>  				  folio_test_writeback(folio))) {
>  			/*
> -			 * khugepaged only works on read-only fd, so this
> -			 * folio is dirty because it hasn't been flushed
> +			 * khugepaged only works on clean file-backed folios,
> +			 * so this folio is dirty because it hasn't been flushed
>  			 * since first write.
>  			 */
>  			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
> @@ -2431,6 +2430,27 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  			goto out_unlock;
>  		}
>
> +		/*
> +		 * At this point, the folio is locked and unmapped. If the PTE
> +		 * was dirty, try_to_unmap() has transferred the dirty bit to
> +		 * the folio and we must not collapse it into a clean
> +		 * file-backed folio.
> +		 *
> +		 * If the folio is clean here, no one can write it until we
> +		 * drop the folio lock. A write through a stale TLB entry came
> +		 * from a clean PTE and must fault because the PTE has been
> +		 * cleared; the fault path has to take the folio lock before
> +		 * installing a writable mapping. Buffered write paths also
> +		 * have to take the folio lock before modifying file contents
> +		 * without a mapping, typically via write_begin_get_folio().
> +		 */
> +		if (!is_shmem && folio_test_dirty(folio)) {
> +			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
> +			xas_unlock_irq(&xas);
> +			folio_putback_lru(folio);
> +			goto out_unlock;

Sashiko asked:

Could a concurrent operation, such as truncate(), lock the folio, remove it
from the page cache, and drop the final reference while we are jumping to
xa_unlocked?
If the page is freed back to the buddy allocator before try_to_unmap_flush()
completes, could this leave a stale TLB entry pointing to the freed page,
potentially allowing memory corruption if the page is reallocated?

Answer:

The folio still holds pagecache and LRU references before
try_to_unmap_flush(), so the truncate-and-free sequence cannot complete in
that small window.

> +		}
> +
>  		/*
>  		 * Accumulate the folios that are being collapsed.
>  		 */
> -- 
> 2.53.0


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 09/14] mm/truncate: use folio_split() in truncate_inode_partial_folio()
  2026-04-29 15:35 ` [PATCH v5 09/14] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
@ 2026-04-30 15:12   ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-30 15:12 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 29 Apr 2026, at 11:35, Zi Yan wrote:

> After READ_ONLY_THP_FOR_FS is removed, FS either supports large folio or
> not. folio_split() can be used on a FS with large folio support without
> worrying about getting a THP on a FS without large folio support.
>
> When READ_ONLY_THP_FOR_FS was present, a PMD large pagecache folio could
> appear in a FS without large folio support after khugepaged or
> madvise(MADV_COLLAPSE) created it. During truncate_inode_partial_folio(),
> such a PMD large pagecache folio is split; if the FS does not support
> large folios, it needs to be split to order-0 folios and cannot be split
> non-uniformly into folios of various orders. try_folio_split_to_order() was
> added to handle this situation by checking folio_check_splittable(...,
> SPLIT_TYPE_NON_UNIFORM) to detect whether the large folio was created via
> READ_ONLY_THP_FOR_FS on a FS without large folio support. Now that
> READ_ONLY_THP_FOR_FS is removed, all large pagecache folios are created
> on FSes supporting large folios, so this function is no longer needed and
> all large pagecache folios can be split non-uniformly.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/huge_mm.h | 25 ++-----------------------
>  mm/truncate.c           |  8 ++++----
>  2 files changed, 6 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 48496f09909be..127f9e1e7604c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -394,27 +394,6 @@ static inline int split_huge_page_to_order(struct page *page, unsigned int new_o
>  	return split_huge_page_to_list_to_order(page, NULL, new_order);
>  }
>
> -/**
> - * try_folio_split_to_order() - try to split a @folio at @page to @new_order
> - * using non uniform split.
> - * @folio: folio to be split
> - * @page: split to @new_order at the given page
> - * @new_order: the target split order
> - *
> - * Try to split a @folio at @page using non uniform split to @new_order, if
> - * non uniform split is not supported, fall back to uniform split. After-split
> - * folios are put back to LRU list. Use min_order_for_split() to get the lower
> - * bound of @new_order.
> - *
> - * Return: 0 - split is successful, otherwise split failed.
> - */
> -static inline int try_folio_split_to_order(struct folio *folio,
> -		struct page *page, unsigned int new_order)
> -{
> -	if (folio_check_splittable(folio, new_order, SPLIT_TYPE_NON_UNIFORM))
> -		return split_huge_page_to_order(&folio->page, new_order);
> -	return folio_split(folio, new_order, page, NULL);
> -}
>  static inline int split_huge_page(struct page *page)
>  {
>  	return split_huge_page_to_list_to_order(page, NULL, 0);
> @@ -647,8 +626,8 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
>  	return -EINVAL;
>  }
>
> -static inline int try_folio_split_to_order(struct folio *folio,
> -		struct page *page, unsigned int new_order)
> +static inline int folio_split(struct folio *folio, unsigned int new_order,
> +		struct page *page, struct list_head *list)
>  {
>  	VM_WARN_ON_ONCE_FOLIO(1, folio);
>  	return -EINVAL;
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 12cc89f89afcf..b58ba940be474 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -177,7 +177,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
>  	return 0;
>  }
>
> -static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
> +static int folio_split_or_unmap(struct folio *folio, struct page *split_at,
>  				    unsigned long min_order)
>  {
>  	enum ttu_flags ttu_flags =
> @@ -186,7 +186,7 @@ static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
>  		TTU_IGNORE_MLOCK;
>  	int ret;
>
> -	ret = try_folio_split_to_order(folio, split_at, min_order);
> +	ret = folio_split(folio, min_order, split_at, NULL);

Sashiko:

If split fails, the truncated pages within the large folio are zeroed
but never removed from the page cache or swap cache, leaving them behind
until the file is fully deleted.

Answer:
That is expected.

>
>  	/*
>  	 * If the split fails, unmap the folio, so it will be refaulted
> @@ -252,7 +252,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
>
>  	min_order = mapping_min_folio_order(folio->mapping);
>  	split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
> -	if (!try_folio_split_or_unmap(folio, split_at, min_order)) {
> +	if (!folio_split_or_unmap(folio, split_at, min_order)) {
>  		/*
>  		 * try to split at offset + length to make sure folios within
>  		 * the range can be dropped, especially to avoid memory waste
> @@ -279,7 +279,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
>  		/* make sure folio2 is large and does not change its mapping */
>  		if (folio_test_large(folio2) &&
>  		    folio2->mapping == folio->mapping)
> -			try_folio_split_or_unmap(folio2, split_at2, min_order);
> +			folio_split_or_unmap(folio2, split_at2, min_order);
>
>  		folio_unlock(folio2);
>  out:
> -- 
> 2.53.0


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
  2026-04-29 15:35 ` [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
@ 2026-04-30 15:16   ` Zi Yan
  2026-04-30 15:27     ` Zi Yan
  2026-05-04  4:23   ` Nico Pache
  2026-05-04 10:11   ` Nico Pache
  2 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2026-04-30 15:16 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 29 Apr 2026, at 11:35, Zi Yan wrote:

> Change the requirement to a filesystem with large folio support whose
> supported orders include PMD_ORDER.
>
> Also add tests that open a file with read-write permission and populate
> folios with writes. Reuse the XFS image from split_huge_page_test.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  tools/testing/selftests/mm/khugepaged.c   | 131 +++++++++++++++-------
>  tools/testing/selftests/mm/run_vmtests.sh |  12 +-
>  2 files changed, 102 insertions(+), 41 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index a6bb9d50363d2..80b913185c643 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -49,7 +49,8 @@ struct mem_ops {
>  	const char *name;
>  };
>
> -static struct mem_ops *file_ops;
> +static struct mem_ops *read_only_file_ops;
> +static struct mem_ops *read_write_file_ops;
>  static struct mem_ops *anon_ops;
>  static struct mem_ops *shmem_ops;
>
> @@ -112,7 +113,8 @@ static void restore_settings(int sig)
>  static void save_settings(void)
>  {
>  	printf("Save THP and khugepaged settings...");
> -	if (file_ops && finfo.type == VMA_FILE)
> +	if ((read_only_file_ops || read_write_file_ops) &&
> +	    finfo.type == VMA_FILE)
>  		thp_set_read_ahead_path(finfo.dev_queue_read_ahead_path);
>  	thp_save_settings();
>
> @@ -364,11 +366,14 @@ static bool anon_check_huge(void *addr, int nr_hpages)
>  	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
>  }
>
> -static void *file_setup_area(int nr_hpages)
> +static void *file_setup_area_common(int nr_hpages, bool read_only)
>  {
>  	int fd;
>  	void *p;
>  	unsigned long size;
> +	int open_opt = read_only ? O_RDONLY : O_RDWR;
> +	int mmap_prot = read_only ? PROT_READ : (PROT_READ | PROT_WRITE);
> +	int mmap_opt = read_only ? MAP_PRIVATE : MAP_SHARED;
>
>  	unlink(finfo.path);  /* Cleanup from previous failed tests */
>  	printf("Creating %s for collapse%s...", finfo.path,
> @@ -399,14 +404,15 @@ static void *file_setup_area(int nr_hpages)
>  	munmap(p, size);
>  	success("OK");
>
> -	printf("Opening %s read only for collapse...", finfo.path);
> -	finfo.fd = open(finfo.path, O_RDONLY, 777);
> +	printf("Opening %s %s for collapse...", finfo.path,
> +	       read_only ? "read only" : "read-write");
> +	finfo.fd = open(finfo.path, open_opt, 777);
>  	if (finfo.fd < 0) {
>  		perror("open()");
>  		exit(EXIT_FAILURE);
>  	}
> -	p = mmap(BASE_ADDR, size, PROT_READ,
> -		 MAP_PRIVATE, finfo.fd, 0);
> +	p = mmap(BASE_ADDR, size, mmap_prot,
> +		 mmap_opt, finfo.fd, 0);
>  	if (p == MAP_FAILED || p != BASE_ADDR) {
>  		perror("mmap()");
>  		exit(EXIT_FAILURE);
> @@ -418,6 +424,16 @@ static void *file_setup_area(int nr_hpages)
>  	return p;
>  }
>
> +static void *file_setup_read_only_area(int nr_hpages)
> +{
> +	return file_setup_area_common(nr_hpages, /* read_only= */ true);
> +}
> +
> +static void *file_setup_read_write_area(int nr_hpages)
> +{
> +	return file_setup_area_common(nr_hpages, /* read_only= */ false);
> +}
> +
>  static void file_cleanup_area(void *p, unsigned long size)
>  {
>  	munmap(p, size);
> @@ -425,14 +441,25 @@ static void file_cleanup_area(void *p, unsigned long size)
>  	unlink(finfo.path);
>  }
>
> -static void file_fault(void *p, unsigned long start, unsigned long end)
> +static void file_fault_common(void *p, unsigned long start, unsigned long end,
> +		int madv_ops)
>  {
> -	if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) {
> +	if (madvise(((char *)p) + start, end - start, madv_ops)) {
>  		perror("madvise(MADV_POPULATE_READ");

Sashiko:
Since madv_ops can now be either MADV_POPULATE_READ or MADV_POPULATE_WRITE,
will this hardcoded error message be misleading if the write fault path
fails?

Answer:
Will send a fixup.

>  		exit(EXIT_FAILURE);
>  	}
>  }
>
> +static void file_fault_read(void *p, unsigned long start, unsigned long end)
> +{
> +	file_fault_common(p, start, end, MADV_POPULATE_READ);
> +}
> +
> +static void file_fault_write(void *p, unsigned long start, unsigned long end)
> +{
> +	file_fault_common(p, start, end, MADV_POPULATE_WRITE);
> +}
> +
>  static bool file_check_huge(void *addr, int nr_hpages)
>  {
>  	switch (finfo.type) {
> @@ -488,10 +515,18 @@ static struct mem_ops __anon_ops = {
>  	.name = "anon",
>  };
>
> -static struct mem_ops __file_ops = {
> -	.setup_area = &file_setup_area,
> +static struct mem_ops __read_only_file_ops = {
> +	.setup_area = &file_setup_read_only_area,
>  	.cleanup_area = &file_cleanup_area,
> -	.fault = &file_fault,
> +	.fault = &file_fault_read,
> +	.check_huge = &file_check_huge,
> +	.name = "file",
> +};
> +
> +static struct mem_ops __read_write_file_ops = {
> +	.setup_area = &file_setup_read_write_area,
> +	.cleanup_area = &file_cleanup_area,
> +	.fault = &file_fault_write,
>  	.check_huge = &file_check_huge,
>  	.name = "file",

Sashiko:
Both __read_only_file_ops and __read_write_file_ops use "file" for their
name fields. Since the TEST() macro uses this name to format the test
execution logs, won't this cause both configurations to produce identical log
output?

Answer:
The file_setup_area_common() changes handle it: the setup message printed
for each run now says whether the file was opened read-only or read-write.


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 13/14] mm/khugepaged: enable clean pagecache folio collapse for writable files
  2026-04-29 15:35 ` [PATCH v5 13/14] mm/khugepaged: enable clean pagecache folio collapse for writable files Zi Yan
@ 2026-04-30 15:18   ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-30 15:18 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 29 Apr 2026, at 11:35, Zi Yan wrote:

> collapse_file() is capable of collapsing pagecache folios from writable
> files into PMD folios. Now enable clean pagecache folio collapse in
> addition to read-only pagecache folio collapse by removing the
> inode_is_open_for_write() check from file_thp_enabled() and performing
> filemap_flush() only if the file is read-only.
>
> This means userspace needs to explicitly flush the content of pagecache
> folios before khugepaged can collapse them, or use
> madvise(MADV_COLLAPSE), which does the flush in its retry path. The
> reason is that blindly enabling collapse of dirty pagecache folios from
> writable files would make khugepaged flush these folios all the time,
> and causing system-level pagecache flushes is undesirable.
>
> To properly support dirty pagecache folio collapse, filemap_flush()
> needs to be avoided. Potentially, merging the associated buffers instead
> of dropping them with filemap_release_folio() might be needed.
>
> NOTE: this breaks the khugepaged selftests for writable file pagecache
> collapse, which now fail unconditionally. The next commit fixes them.

Sashiko:

Is it acceptable to intentionally break the selftests in this commit? Each
commit should be self-contained and not knowingly introduce test regressions,
as this breaks bisectability.

Answer:

I am fine with squashing patch 14 into this one, but it is unlikely anyone
will build a kernel at exactly this commit.


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
  2026-04-30 15:16   ` Zi Yan
@ 2026-04-30 15:27     ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2026-04-30 15:27 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 30 Apr 2026, at 11:16, Zi Yan wrote:

> On 29 Apr 2026, at 11:35, Zi Yan wrote:
>
>> Change the requirement to a filesystem with large folio support whose
>> supported orders include PMD_ORDER.
>>
>> Also add tests that open a file with read-write permission and populate
>> folios with writes. Reuse the XFS image from split_huge_page_test.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  tools/testing/selftests/mm/khugepaged.c   | 131 +++++++++++++++-------
>>  tools/testing/selftests/mm/run_vmtests.sh |  12 +-
>>  2 files changed, 102 insertions(+), 41 deletions(-)
>>

<snip>

>> -static void file_fault(void *p, unsigned long start, unsigned long end)
>> +static void file_fault_common(void *p, unsigned long start, unsigned long end,
>> +		int madv_ops)
>>  {
>> -	if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) {
>> +	if (madvise(((char *)p) + start, end - start, madv_ops)) {
>>  		perror("madvise(MADV_POPULATE_READ");
>
> Sashiko:
> Since madv_ops can now be either MADV_POPULATE_READ or MADV_POPULATE_WRITE,
> will this hardcoded error message be misleading if the write fault path
> fails?
>
> Answer:
> Will send a fixup.


This is the fixup:
From 76e301cf5198f33d07492e224ec627b94902b4b6 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Thu, 30 Apr 2026 11:22:30 -0400
Subject: [PATCH] selftests/mm: khugepaged perror fixup.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 tools/testing/selftests/mm/khugepaged.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 80b913185c643..97b8fcc490c76 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -445,7 +445,10 @@ static void file_fault_common(void *p, unsigned long start, unsigned long end,
 		int madv_ops)
 {
 	if (madvise(((char *)p) + start, end - start, madv_ops)) {
-		perror("madvise(MADV_POPULATE_READ");
+		if (madv_ops == MADV_POPULATE_READ)
+			perror("madvise(MADV_POPULATE_READ");
+		else if (madv_ops == MADV_POPULATE_WRITE)
+			perror("madvise(MADV_POPULATE_WRITE");
 		exit(EXIT_FAILURE);
 	}
 }
-- 
2.53.0



Best Regards,
Yan, Zi

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
  2026-04-29 15:29 ` [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
  2026-04-30 14:37   ` Zi Yan
@ 2026-05-04  3:48   ` Nico Pache
  1 sibling, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  3:48 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:29 AM, Zi Yan wrote:
> collapse_file() requires filesystems that support large folios of at
> least PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that
> requirement. MADV_COLLAPSE ignores the shmem huge config, so exclude
> shmem from the check.
> 
> While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE.
> 
> Add a helper function mapping_pmd_folio_support() for filesystems
> supporting large folios of at least PMD_ORDER.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   include/linux/pagemap.h | 26 ++++++++++++++++++++++++++
>   mm/khugepaged.c         | 10 ++++++++--
>   2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 1f50991b43e3b..1fed3414fe9b8 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -513,6 +513,32 @@ static inline bool mapping_large_folio_support(const struct address_space *mappi
>   	return mapping_max_folio_order(mapping) > 0;
>   }
>   
> +/**
> + * mapping_pmd_folio_support() - Check if a mapping supports PMD-sized folios
> + * @mapping: The address_space
> + *
> + * Some filesystems support large folios, but not ones as large as PMD order.
> + * Before attempting to create a PMD-sized pagecache folio on a filesystem,
> + * perform this check first.
> + *
> + * Return: true if PMD-sized folios are supported, false if they are not.
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline bool mapping_pmd_folio_support(const struct address_space *mapping)
> +{
> +	/* AS_FOLIO_ORDER is only reasonable for pagecache folios */
> +	VM_WARN_ON_ONCE((unsigned long)mapping & FOLIO_MAPPING_ANON);
> +
> +	return mapping_max_folio_order(mapping) >= PMD_ORDER;

Probably a stupid question, but I don't know FS that well.

Here we are checking that the max allowed folio order is greater than
(or equal) to PMD_ORDER, yet the function asks whether PMD specifically
is supported. In the future, could we have some FS that does not support
PMD order but does support larger orders (e.g. PUD)?

Other than that. LGTM

Reviewed-by: Nico Pache <npache@redhat.com>

> +}
> +#else
> +static inline bool mapping_pmd_folio_support(const struct address_space *mapping)
> +{
> +	return false;
> +}
> +#endif
> +
>   /* Return the maximum folio size for this pagecache mapping, in bytes. */
>   static inline size_t mapping_max_folio_size(const struct address_space *mapping)
>   {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e112525c4aa9c..6808f2b48d864 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2235,8 +2235,14 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>   	int nr_none = 0;
>   	bool is_shmem = shmem_file(file);
>   
> -	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> -	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> +	/*
> +	 * MADV_COLLAPSE ignores shmem huge config, so do not check shmem
> +	 *
> +	 * TODO: once shmem always calls mapping_set_large_folios() on its
> +	 * mapping, the shmem check can be removed.
> +	 */
> +	VM_WARN_ON_ONCE(!is_shmem && !mapping_pmd_folio_support(mapping));
> +	VM_WARN_ON_ONCE(start & (HPAGE_PMD_NR - 1));
>   
>   	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap()
  2026-04-29 15:29 ` [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap() Zi Yan
  2026-04-30 15:11   ` Zi Yan
@ 2026-05-04  3:53   ` Nico Pache
  2026-05-06  5:23   ` Lance Yang
  2 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  3:53 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:29 AM, Zi Yan wrote:
> This check ensures the correctness of read-only PMD folio collapse
> after it is enabled for all FSes supporting PMD pagecache folios and
> replaces READ_ONLY_THP_FOR_FS.
> 
> READ_ONLY_THP_FOR_FS only supports read-only fds and uses mapping->nr_thps
> and inode->i_writecount to prevent any write to read-only to-be-collapsed
> folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
> aforementioned mechanism will go away with it. To ensure khugepaged
> functions as expected after the changes, skip any folio that is dirty
> after try_to_unmap(): a dirty folio at that point means this read-only
> folio can receive writes between try_to_unmap() and try_to_unmap_flush()
> via cached TLB entries, and khugepaged does not support writable
> pagecache folio collapse yet.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

LGTM

Reviewed-by: Nico Pache <npache@redhat.com>

> ---
>   mm/khugepaged.c | 28 ++++++++++++++++++++++++----
>   1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6808f2b48d864..71209a72195ab 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2327,8 +2327,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>   				}
>   			} else if (folio_test_dirty(folio)) {
>   				/*
> -				 * khugepaged only works on read-only fd,
> -				 * so this page is dirty because it hasn't
> +				 * This page is dirty because it hasn't
>   				 * been flushed since first write. There
>   				 * won't be new dirty pages.
>   				 *
> @@ -2386,8 +2385,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>   		if (!is_shmem && (folio_test_dirty(folio) ||
>   				  folio_test_writeback(folio))) {
>   			/*
> -			 * khugepaged only works on read-only fd, so this
> -			 * folio is dirty because it hasn't been flushed
> +			 * khugepaged only works on clean file-backed folios,
> +			 * so this folio is dirty because it hasn't been flushed
>   			 * since first write.
>   			 */
>   			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
> @@ -2431,6 +2430,27 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>   			goto out_unlock;
>   		}
>   
> +		/*
> +		 * At this point, the folio is locked and unmapped. If the PTE
> +		 * was dirty, try_to_unmap() has transferred the dirty bit to
> +		 * the folio and we must not collapse it into a clean
> +		 * file-backed folio.
> +		 *
> +		 * If the folio is clean here, no one can write it until we
> +		 * drop the folio lock. A write through a stale TLB entry came
> +		 * from a clean PTE and must fault because the PTE has been
> +		 * cleared; the fault path has to take the folio lock before
> +		 * installing a writable mapping. Buffered write paths also
> +		 * have to take the folio lock before modifying file contents
> +		 * without a mapping, typically via write_begin_get_folio().
> +		 */
> +		if (!is_shmem && folio_test_dirty(folio)) {
> +			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
> +			xas_unlock_irq(&xas);
> +			folio_putback_lru(folio);
> +			goto out_unlock;
> +		}
> +
>   		/*
>   		 * Accumulate the folios that are being collapsed.
>   		 */


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 03/14] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
  2026-04-29 15:29 ` [PATCH v5 03/14] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
@ 2026-05-04  3:57   ` Nico Pache
  0 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  3:57 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:29 AM, Zi Yan wrote:
> Replace it with a check on the max folio order of the file's address space
> mapping, making sure PMD-sized folios are supported. Keep the inode
> open-for-write check: even though collapse_file() now makes sure all
> to-be-collapsed folios are clean and the created PMD file THP can be
> handled properly by filesystems, filemap_flush() could still perform
> undesirable writeback.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/huge_memory.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f3fcb4dd1ef8..3b324f03e9283 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -86,9 +86,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>   {
>   	struct inode *inode;
>   
> -	if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
> -		return false;
> -
>   	if (!vma->vm_file)
>   		return false;
>   
> @@ -97,6 +94,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>   	if (IS_ANON_FILE(inode))
>   		return false;
>   
> +	if (!mapping_pmd_folio_support(vma->vm_file->f_mapping))

Other than my question in the previous patch about >= PMD_ORDER vs == 
PMD_ORDER.

LGTM

Reviewed-by: Nico Pache <npache@redhat.com>

> +		return false;
> +
>   	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>   }
>   


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 04/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled()
  2026-04-29 15:29 ` [PATCH v5 04/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled() Zi Yan
@ 2026-05-04  4:00   ` Nico Pache
  0 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  4:00 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:29 AM, Zi Yan wrote:
> Remove the READ_ONLY_THP_FOR_FS gate; khugepaged for file-backed
> PMD-sized hugepages is now enabled by the global transparent hugepage
> control. khugepaged can still be enabled by the per-size controls for
> anon and shmem when the global control is off.
> 
> Add a shmem_hpage_pmd_enabled() stub for !CONFIG_SHMEM to remove the
> IS_ENABLED(CONFIG_SHMEM) check in hugepage_enabled().
> 
> Clean up hugepage_enabled() by moving the anon checks to
> anon_hpage_enabled().

Thank you for cleaning that up!

Reviewed-by: Nico Pache <npache@redhat.com>

> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>   include/linux/shmem_fs.h |  2 +-
>   mm/khugepaged.c          | 26 ++++++++++++++++----------
>   2 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 93a0ba872ebe0..acb8dd961b45c 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -127,7 +127,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   void shmem_truncate_range(struct inode *inode, loff_t start, uoff_t end);
>   int shmem_unuse(unsigned int type);
>   
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
>   unsigned long shmem_allowable_huge_orders(struct inode *inode,
>   				struct vm_area_struct *vma, pgoff_t index,
>   				loff_t write_end, bool shmem_huge_force);
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 71209a72195ab..d6971ada8f199 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -524,26 +524,32 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>   		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>   }
>   
> +static inline bool anon_hpage_enabled(void)
> +{
> +	if (READ_ONCE(huge_anon_orders_always))
> +		return true;
> +	if (READ_ONCE(huge_anon_orders_madvise))
> +		return true;
> +	if (READ_ONCE(huge_anon_orders_inherit) &&
> +	    hugepage_global_enabled())
> +		return true;
> +	return false;
> +}
> +
>   static bool hugepage_enabled(void)
>   {
>   	/*
>   	 * We cover the anon, shmem and the file-backed case here; file-backed
> -	 * hugepages, when configured in, are determined by the global control.
> +	 * hugepages are determined by the global control.
>   	 * Anon hugepages are determined by its per-size mTHP control.
>   	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
>   	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
>   	 */
> -	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> -	    hugepage_global_enabled())
> -		return true;
> -	if (READ_ONCE(huge_anon_orders_always))
> +	if (hugepage_global_enabled())
>   		return true;
> -	if (READ_ONCE(huge_anon_orders_madvise))
> -		return true;
> -	if (READ_ONCE(huge_anon_orders_inherit) &&
> -	    hugepage_global_enabled())
> +	if (anon_hpage_enabled())
>   		return true;
> -	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
> +	if (shmem_hpage_pmd_enabled())
>   		return true;
>   	return false;
>   }


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 05/14] mm: remove READ_ONLY_THP_FOR_FS Kconfig option
  2026-04-29 15:35 ` [PATCH v5 05/14] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
@ 2026-05-04  4:02   ` Nico Pache
  0 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  4:02 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:35 AM, Zi Yan wrote:
> After removing the READ_ONLY_THP_FOR_FS check in file_thp_enabled(),
> khugepaged and MADV_COLLAPSE can run on filesystems with PMD THP
> pagecache support even without READ_ONLY_THP_FOR_FS enabled. Remove the
> Kconfig option first so that no one can use READ_ONLY_THP_FOR_FS while
> upcoming commits remove mapping->nr_thps, which its safeguard mechanism
> relies on.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/Kconfig | 11 -----------
>   1 file changed, 11 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e221fa1dc54d0..27dc5b0139ba6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -936,17 +936,6 @@ config THP_SWAP
>   
>   	  For selection by architectures with reasonable THP sizes.
>   
> -config READ_ONLY_THP_FOR_FS

Yay its gone! Thanks for working on this :)

Reviewed-by: Nico Pache <npache@redhat.com>

> -	bool "Read-only THP for filesystems (EXPERIMENTAL)"
> -	depends on TRANSPARENT_HUGEPAGE
> -
> -	help
> -	  Allow khugepaged to put read-only file-backed pages in THP.
> -
> -	  This is marked experimental because it is a new feature. Write
> -	  support of file THPs will be developed in the next few release
> -	  cycles.
> -
>   config NO_PAGE_MAPCOUNT
>   	bool "No per-page mapcount (EXPERIMENTAL)"
>   	help


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 07/14] fs: remove nr_thps from struct address_space
  2026-04-29 15:35 ` [PATCH v5 07/14] fs: remove nr_thps from struct address_space Zi Yan
@ 2026-05-04  4:11   ` Nico Pache
  0 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  4:11 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:35 AM, Zi Yan wrote:
> Now that the filemap_nr_thps*() helpers have been removed, the related
> field, address_space->nr_thps, is no longer needed. Remove it. This
> shrinks struct address_space by 8 bytes on 64-bit systems, which may
> increase the number of inodes we can cache.

We've seen performance impacts in the past from changing the alignment 
of certain structs, and this is a rather critical one. I'll keep an eye 
out for any performance differences noted by our PerfQE team around 
this feature.

LGTM!

Reviewed-by: Nico Pache <npache@redhat.com>

> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   fs/inode.c         | 3 ---
>   include/linux/fs.h | 5 -----
>   2 files changed, 8 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 6a3cbc7dcd28c..d8a6d6266c3c3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -279,9 +279,6 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
>   	mapping->flags = 0;
>   	mapping->wb_err = 0;
>   	atomic_set(&mapping->i_mmap_writable, 0);
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> -	atomic_set(&mapping->nr_thps, 0);
> -#endif
>   	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>   	mapping->writeback_index = 0;
>   	init_rwsem(&mapping->invalidate_lock);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfbb..bb9cc4f7207c1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -460,7 +460,6 @@ struct mapping_metadata_bhs {
>    *   memory mappings.
>    * @gfp_mask: Memory allocation flags to use for allocating pages.
>    * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> - * @nr_thps: Number of THPs in the pagecache (non-shmem only).
>    * @i_mmap: Tree of private and shared mappings.
>    * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
>    * @nrpages: Number of page entries, protected by the i_pages lock.
> @@ -476,10 +475,6 @@ struct address_space {
>   	struct rw_semaphore	invalidate_lock;
>   	gfp_t			gfp_mask;
>   	atomic_t		i_mmap_writable;
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> -	/* number of thp, only for non-shmem files */
> -	atomic_t		nr_thps;
> -#endif
>   	struct rb_root_cached	i_mmap;
>   	unsigned long		nrpages;
>   	pgoff_t			writeback_index;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
  2026-04-29 15:35 ` [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
  2026-04-30 15:16   ` Zi Yan
@ 2026-05-04  4:23   ` Nico Pache
  2026-05-04 10:11   ` Nico Pache
  2 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04  4:23 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:35 AM, Zi Yan wrote:
> Change the requirement to a file system with large folio support whose
> supported orders include PMD_ORDER.
> 
> Also add tests that open a file with read-write permission and populate
> folios with writes. Reuse the XFS image from split_huge_page_test.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   tools/testing/selftests/mm/khugepaged.c   | 131 +++++++++++++++-------
>   tools/testing/selftests/mm/run_vmtests.sh |  12 +-
>   2 files changed, 102 insertions(+), 41 deletions(-)
> 
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index a6bb9d50363d2..80b913185c643 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -49,7 +49,8 @@ struct mem_ops {
>   	const char *name;
>   };
>   
> -static struct mem_ops *file_ops;
> +static struct mem_ops *read_only_file_ops;
> +static struct mem_ops *read_write_file_ops;
>   static struct mem_ops *anon_ops;
>   static struct mem_ops *shmem_ops;
>   
> @@ -112,7 +113,8 @@ static void restore_settings(int sig)
>   static void save_settings(void)
>   {
>   	printf("Save THP and khugepaged settings...");
> -	if (file_ops && finfo.type == VMA_FILE)
> +	if ((read_only_file_ops || read_write_file_ops) &&
> +	    finfo.type == VMA_FILE)
>   		thp_set_read_ahead_path(finfo.dev_queue_read_ahead_path);
>   	thp_save_settings();
>   
> @@ -364,11 +366,14 @@ static bool anon_check_huge(void *addr, int nr_hpages)
>   	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
>   }
>   
> -static void *file_setup_area(int nr_hpages)
> +static void *file_setup_area_common(int nr_hpages, bool read_only)
>   {
>   	int fd;
>   	void *p;
>   	unsigned long size;
> +	int open_opt = read_only ? O_RDONLY : O_RDWR;
> +	int mmap_prot = read_only ? PROT_READ : (PROT_READ | PROT_WRITE);
> +	int mmap_opt = read_only ? MAP_PRIVATE : MAP_SHARED;
>   
>   	unlink(finfo.path);  /* Cleanup from previous failed tests */
>   	printf("Creating %s for collapse%s...", finfo.path,
> @@ -399,14 +404,15 @@ static void *file_setup_area(int nr_hpages)
>   	munmap(p, size);
>   	success("OK");
>   
> -	printf("Opening %s read only for collapse...", finfo.path);
> -	finfo.fd = open(finfo.path, O_RDONLY, 777);
> +	printf("Opening %s %s for collapse...", finfo.path,
> +	       read_only ? "read only" : "read-write");
> +	finfo.fd = open(finfo.path, open_opt, 777);
>   	if (finfo.fd < 0) {
>   		perror("open()");
>   		exit(EXIT_FAILURE);
>   	}
> -	p = mmap(BASE_ADDR, size, PROT_READ,
> -		 MAP_PRIVATE, finfo.fd, 0);
> +	p = mmap(BASE_ADDR, size, mmap_prot,
> +		 mmap_opt, finfo.fd, 0);
>   	if (p == MAP_FAILED || p != BASE_ADDR) {
>   		perror("mmap()");
>   		exit(EXIT_FAILURE);
> @@ -418,6 +424,16 @@ static void *file_setup_area(int nr_hpages)
>   	return p;
>   }
>   
> +static void *file_setup_read_only_area(int nr_hpages)
> +{
> +	return file_setup_area_common(nr_hpages, /* read_only= */ true);
> +}
> +
> +static void *file_setup_read_write_area(int nr_hpages)
> +{
> +	return file_setup_area_common(nr_hpages, /* read_only= */ false);
> +}
> +
>   static void file_cleanup_area(void *p, unsigned long size)
>   {
>   	munmap(p, size);
> @@ -425,14 +441,25 @@ static void file_cleanup_area(void *p, unsigned long size)
>   	unlink(finfo.path);
>   }
>   
> -static void file_fault(void *p, unsigned long start, unsigned long end)
> +static void file_fault_common(void *p, unsigned long start, unsigned long end,
> +		int madv_ops)
>   {
> -	if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) {
> +	if (madvise(((char *)p) + start, end - start, madv_ops)) {
>   		perror("madvise(MADV_POPULATE_READ");
>   		exit(EXIT_FAILURE);
>   	}
>   }
>   
> +static void file_fault_read(void *p, unsigned long start, unsigned long end)
> +{
> +	file_fault_common(p, start, end, MADV_POPULATE_READ);
> +}
> +
> +static void file_fault_write(void *p, unsigned long start, unsigned long end)
> +{
> +	file_fault_common(p, start, end, MADV_POPULATE_WRITE);
> +}
> +
>   static bool file_check_huge(void *addr, int nr_hpages)
>   {
>   	switch (finfo.type) {
> @@ -488,10 +515,18 @@ static struct mem_ops __anon_ops = {
>   	.name = "anon",
>   };
>   
> -static struct mem_ops __file_ops = {
> -	.setup_area = &file_setup_area,
> +static struct mem_ops __read_only_file_ops = {
> +	.setup_area = &file_setup_read_only_area,
>   	.cleanup_area = &file_cleanup_area,
> -	.fault = &file_fault,
> +	.fault = &file_fault_read,
> +	.check_huge = &file_check_huge,
> +	.name = "file",
> +};
> +
> +static struct mem_ops __read_write_file_ops = {
> +	.setup_area = &file_setup_read_write_area,
> +	.cleanup_area = &file_cleanup_area,
> +	.fault = &file_fault_write,
>   	.check_huge = &file_check_huge,
>   	.name = "file",
>   };
> @@ -504,6 +539,18 @@ static struct mem_ops __shmem_ops = {
>   	.name = "shmem",
>   };
>   
> +static bool is_tmpfs(struct mem_ops *ops)
> +{
> +	return (ops == &__read_only_file_ops ||
> +		ops == &__read_write_file_ops) &&
> +	       finfo.type == VMA_SHMEM;
> +}
> +
> +static bool is_anon(struct mem_ops *ops)
> +{
> +	return ops == &__anon_ops;
> +}
> +
>   static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>   			       struct mem_ops *ops, bool expect)
>   {
> @@ -512,6 +559,10 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>   
>   	printf("%s...", msg);
>   
> +	/* read&write file collapse always fail */

Just to confirm: you are adding the write part here so that, before 
patches 13 & 14 land, the expected behavior is failure, whereas with 
patches 13 and 14 applied we expect this to be supported. Correct?

Thanks for working on this :)

I plan on testing the selftests changes at some point this week (if I 
find some downtime during LSFMM), and finishing my review here.

Cheers,
-- Nico


> +	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
> +		expect = false;
> +
>   	/*
>   	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
>   	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
> @@ -578,6 +629,10 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
>   static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
>   				struct mem_ops *ops, bool expect)
>   {
> +	/* read&write file collapse always fail */
> +	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
> +		expect = false;
> +
>   	if (wait_for_scan(msg, p, nr_hpages, ops)) {
>   		if (expect)
>   			fail("Timeout");
> @@ -612,16 +667,6 @@ static struct collapse_context __madvise_context = {
>   	.name = "madvise",
>   };
>   
> -static bool is_tmpfs(struct mem_ops *ops)
> -{
> -	return ops == &__file_ops && finfo.type == VMA_SHMEM;
> -}
> -
> -static bool is_anon(struct mem_ops *ops)
> -{
> -	return ops == &__anon_ops;
> -}
> -
>   static void alloc_at_fault(void)
>   {
>   	struct thp_settings settings = *thp_current_settings();
> @@ -1097,8 +1142,8 @@ static void usage(void)
>   	fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
>   	fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
>   	fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
> -	fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
> -	fprintf(stderr,	"\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
> +	fprintf(stderr, "\n\t\"file,all\" mem_type requires a file system\n");
> +	fprintf(stderr,	"\twith large folio support (order >= PMD order)\n");
>   	fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
>   	fprintf(stderr,	"\tmounted with huge=advise option for khugepaged tests to work\n");
>   	fprintf(stderr,	"\n\tSupported Options:\n");
> @@ -1154,20 +1199,22 @@ static void parse_test_type(int argc, char **argv)
>   		usage();
>   
>   	if (!strcmp(buf, "all")) {
> -		file_ops =  &__file_ops;
> +		read_only_file_ops =  &__read_only_file_ops;
> +		read_write_file_ops =  &__read_write_file_ops;
>   		anon_ops = &__anon_ops;
>   		shmem_ops = &__shmem_ops;
>   	} else if (!strcmp(buf, "anon")) {
>   		anon_ops = &__anon_ops;
>   	} else if (!strcmp(buf, "file")) {
> -		file_ops =  &__file_ops;
> +		read_only_file_ops =  &__read_only_file_ops;
> +		read_write_file_ops =  &__read_write_file_ops;
>   	} else if (!strcmp(buf, "shmem")) {
>   		shmem_ops = &__shmem_ops;
>   	} else {
>   		usage();
>   	}
>   
> -	if (!file_ops)
> +	if (!read_only_file_ops && !read_write_file_ops)
>   		return;
>   
>   	if (argc != 2)
> @@ -1239,37 +1286,43 @@ int main(int argc, char **argv)
>   	} while (0)
>   
>   	TEST(collapse_full, khugepaged_context, anon_ops);
> -	TEST(collapse_full, khugepaged_context, file_ops);
> +	TEST(collapse_full, khugepaged_context, read_only_file_ops);
> +	TEST(collapse_full, khugepaged_context, read_write_file_ops);
>   	TEST(collapse_full, khugepaged_context, shmem_ops);
>   	TEST(collapse_full, madvise_context, anon_ops);
> -	TEST(collapse_full, madvise_context, file_ops);
> +	TEST(collapse_full, madvise_context, read_only_file_ops);
> +	TEST(collapse_full, madvise_context, read_write_file_ops);
>   	TEST(collapse_full, madvise_context, shmem_ops);
>   
>   	TEST(collapse_empty, khugepaged_context, anon_ops);
>   	TEST(collapse_empty, madvise_context, anon_ops);
>   
>   	TEST(collapse_single_pte_entry, khugepaged_context, anon_ops);
> -	TEST(collapse_single_pte_entry, khugepaged_context, file_ops);
> +	TEST(collapse_single_pte_entry, khugepaged_context, read_only_file_ops);
> +	TEST(collapse_single_pte_entry, khugepaged_context, read_write_file_ops);
>   	TEST(collapse_single_pte_entry, khugepaged_context, shmem_ops);
>   	TEST(collapse_single_pte_entry, madvise_context, anon_ops);
> -	TEST(collapse_single_pte_entry, madvise_context, file_ops);
> +	TEST(collapse_single_pte_entry, madvise_context, read_only_file_ops);
> +	TEST(collapse_single_pte_entry, madvise_context, read_write_file_ops);
>   	TEST(collapse_single_pte_entry, madvise_context, shmem_ops);
>   
>   	TEST(collapse_max_ptes_none, khugepaged_context, anon_ops);
> -	TEST(collapse_max_ptes_none, khugepaged_context, file_ops);
> +	TEST(collapse_max_ptes_none, khugepaged_context, read_only_file_ops);
> +	TEST(collapse_max_ptes_none, khugepaged_context, read_write_file_ops);
>   	TEST(collapse_max_ptes_none, madvise_context, anon_ops);
> -	TEST(collapse_max_ptes_none, madvise_context, file_ops);
> +	TEST(collapse_max_ptes_none, madvise_context, read_only_file_ops);
> +	TEST(collapse_max_ptes_none, madvise_context, read_write_file_ops);
>   
>   	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
> -	TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops);
> +	TEST(collapse_single_pte_entry_compound, khugepaged_context, read_only_file_ops);
>   	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
> -	TEST(collapse_single_pte_entry_compound, madvise_context, file_ops);
> +	TEST(collapse_single_pte_entry_compound, madvise_context, read_only_file_ops);
>   
>   	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
> -	TEST(collapse_full_of_compound, khugepaged_context, file_ops);
> +	TEST(collapse_full_of_compound, khugepaged_context, read_only_file_ops);
>   	TEST(collapse_full_of_compound, khugepaged_context, shmem_ops);
>   	TEST(collapse_full_of_compound, madvise_context, anon_ops);
> -	TEST(collapse_full_of_compound, madvise_context, file_ops);
> +	TEST(collapse_full_of_compound, madvise_context, read_only_file_ops);
>   	TEST(collapse_full_of_compound, madvise_context, shmem_ops);
>   
>   	TEST(collapse_compound_extreme, khugepaged_context, anon_ops);
> @@ -1291,10 +1344,10 @@ int main(int argc, char **argv)
>   	TEST(collapse_max_ptes_shared, madvise_context, anon_ops);
>   
>   	TEST(madvise_collapse_existing_thps, madvise_context, anon_ops);
> -	TEST(madvise_collapse_existing_thps, madvise_context, file_ops);
> +	TEST(madvise_collapse_existing_thps, madvise_context, read_only_file_ops);
>   	TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops);
>   
> -	TEST(madvise_retracted_page_tables, madvise_context, file_ops);
> +	TEST(madvise_retracted_page_tables, madvise_context, read_only_file_ops);
>   	TEST(madvise_retracted_page_tables, madvise_context, shmem_ops);
>   
>   	restore_settings(0);
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index 3b61677fe9840..854c5c3e3a6ae 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -490,8 +490,6 @@ CATEGORY="thp" run_test ./khugepaged all:shmem
>   
>   CATEGORY="thp" run_test ./khugepaged -s 4 all:shmem
>   
> -CATEGORY="thp" run_test ./transhuge-stress -d 20
> -
>   # Try to create XFS if not provided
>   if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
>       if [ "${HAVE_HUGEPAGES}" = "1" ]; then
> @@ -508,6 +506,14 @@ if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
>       fi
>   fi
>   
> +if [ -n "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
> +CATEGORY="thp" run_test ./khugepaged all:file ${SPLIT_HUGE_PAGE_TEST_XFS_PATH}
> +else
> +	count_total=$(( count_total + 1 ))
> +	count_skip=$(( count_skip + 1 ))
> +	echo "[SKIP] ./khugepaged all:file" | tap_prefix
> +fi
> +
>   CATEGORY="thp" run_test ./split_huge_page_test ${SPLIT_HUGE_PAGE_TEST_XFS_PATH}
>   
>   if [ -n "${MOUNTED_XFS}" ]; then
> @@ -516,6 +522,8 @@ if [ -n "${MOUNTED_XFS}" ]; then
>       rm -f ${XFS_IMG}
>   fi
>   
> +CATEGORY="thp" run_test ./transhuge-stress -d 20
> +
>   CATEGORY="thp" run_test ./folio_split_race_test
>   
>   CATEGORY="migration" run_test ./migration


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
  2026-04-29 15:35 ` [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
  2026-04-30 15:16   ` Zi Yan
  2026-05-04  4:23   ` Nico Pache
@ 2026-05-04 10:11   ` Nico Pache
  2 siblings, 0 replies; 32+ messages in thread
From: Nico Pache @ 2026-05-04 10:11 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, David Hildenbrand, Matthew Wilcox (Oracle),
	Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest



On 4/29/26 9:35 AM, Zi Yan wrote:
> Change the requirement to a file system with large folio support whose
> supported orders include PMD_ORDER.
> 
> Also add tests that open a file with read-write permission and populate
> folios with writes. Reuse the XFS image from split_huge_page_test.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   tools/testing/selftests/mm/khugepaged.c   | 131 +++++++++++++++-------
>   tools/testing/selftests/mm/run_vmtests.sh |  12 +-
>   2 files changed, 102 insertions(+), 41 deletions(-)
> 
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index a6bb9d50363d2..80b913185c643 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -49,7 +49,8 @@ struct mem_ops {
>   	const char *name;
>   };
>   
> -static struct mem_ops *file_ops;
> +static struct mem_ops *read_only_file_ops;
> +static struct mem_ops *read_write_file_ops;
>   static struct mem_ops *anon_ops;
>   static struct mem_ops *shmem_ops;
>   
> @@ -112,7 +113,8 @@ static void restore_settings(int sig)
>   static void save_settings(void)
>   {
>   	printf("Save THP and khugepaged settings...");
> -	if (file_ops && finfo.type == VMA_FILE)
> +	if ((read_only_file_ops || read_write_file_ops) &&
> +	    finfo.type == VMA_FILE)
>   		thp_set_read_ahead_path(finfo.dev_queue_read_ahead_path);
>   	thp_save_settings();
>   
> @@ -364,11 +366,14 @@ static bool anon_check_huge(void *addr, int nr_hpages)
>   	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
>   }
>   
> -static void *file_setup_area(int nr_hpages)
> +static void *file_setup_area_common(int nr_hpages, bool read_only)
>   {
>   	int fd;
>   	void *p;
>   	unsigned long size;
> +	int open_opt = read_only ? O_RDONLY : O_RDWR;
> +	int mmap_prot = read_only ? PROT_READ : (PROT_READ | PROT_WRITE);
> +	int mmap_opt = read_only ? MAP_PRIVATE : MAP_SHARED;
>   
>   	unlink(finfo.path);  /* Cleanup from previous failed tests */
>   	printf("Creating %s for collapse%s...", finfo.path,
> @@ -399,14 +404,15 @@ static void *file_setup_area(int nr_hpages)
>   	munmap(p, size);
>   	success("OK");
>   
> -	printf("Opening %s read only for collapse...", finfo.path);
> -	finfo.fd = open(finfo.path, O_RDONLY, 777);
> +	printf("Opening %s %s for collapse...", finfo.path,
> +	       read_only ? "read only" : "read-write");
> +	finfo.fd = open(finfo.path, open_opt, 777);
>   	if (finfo.fd < 0) {
>   		perror("open()");
>   		exit(EXIT_FAILURE);
>   	}
> -	p = mmap(BASE_ADDR, size, PROT_READ,
> -		 MAP_PRIVATE, finfo.fd, 0);
> +	p = mmap(BASE_ADDR, size, mmap_prot,
> +		 mmap_opt, finfo.fd, 0);
>   	if (p == MAP_FAILED || p != BASE_ADDR) {
>   		perror("mmap()");
>   		exit(EXIT_FAILURE);
> @@ -418,6 +424,16 @@ static void *file_setup_area(int nr_hpages)
>   	return p;
>   }
>   
> +static void *file_setup_read_only_area(int nr_hpages)
> +{
> +	return file_setup_area_common(nr_hpages, /* read_only= */ true);
> +}
> +
> +static void *file_setup_read_write_area(int nr_hpages)
> +{
> +	return file_setup_area_common(nr_hpages, /* read_only= */ false);
> +}
> +
>   static void file_cleanup_area(void *p, unsigned long size)
>   {
>   	munmap(p, size);
> @@ -425,14 +441,25 @@ static void file_cleanup_area(void *p, unsigned long size)
>   	unlink(finfo.path);
>   }
>   
> -static void file_fault(void *p, unsigned long start, unsigned long end)
> +static void file_fault_common(void *p, unsigned long start, unsigned long end,
> +		int madv_ops)
>   {
> -	if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) {
> +	if (madvise(((char *)p) + start, end - start, madv_ops)) {
>   		perror("madvise(MADV_POPULATE_READ");
>   		exit(EXIT_FAILURE);
>   	}
>   }
>   
> +static void file_fault_read(void *p, unsigned long start, unsigned long end)
> +{
> +	file_fault_common(p, start, end, MADV_POPULATE_READ);
> +}
> +
> +static void file_fault_write(void *p, unsigned long start, unsigned long end)
> +{
> +	file_fault_common(p, start, end, MADV_POPULATE_WRITE);
> +}
> +
>   static bool file_check_huge(void *addr, int nr_hpages)
>   {
>   	switch (finfo.type) {
> @@ -488,10 +515,18 @@ static struct mem_ops __anon_ops = {
>   	.name = "anon",
>   };
>   
> -static struct mem_ops __file_ops = {
> -	.setup_area = &file_setup_area,
> +static struct mem_ops __read_only_file_ops = {
> +	.setup_area = &file_setup_read_only_area,
>   	.cleanup_area = &file_cleanup_area,
> -	.fault = &file_fault,
> +	.fault = &file_fault_read,
> +	.check_huge = &file_check_huge,
> +	.name = "file",
> +};
> +
> +static struct mem_ops __read_write_file_ops = {
> +	.setup_area = &file_setup_read_write_area,
> +	.cleanup_area = &file_cleanup_area,
> +	.fault = &file_fault_write,
>   	.check_huge = &file_check_huge,
>   	.name = "file",
>   };
> @@ -504,6 +539,18 @@ static struct mem_ops __shmem_ops = {
>   	.name = "shmem",
>   };
>   
> +static bool is_tmpfs(struct mem_ops *ops)
> +{
> +	return (ops == &__read_only_file_ops ||
> +		ops == &__read_write_file_ops) &&
> +	       finfo.type == VMA_SHMEM;
> +}
> +
> +static bool is_anon(struct mem_ops *ops)
> +{
> +	return ops == &__anon_ops;
> +}
> +
>   static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>   			       struct mem_ops *ops, bool expect)
>   {
> @@ -512,6 +559,10 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>   
>   	printf("%s...", msg);
>   
> +	/* read&write file collapse always fail */
> +	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
> +		expect = false;
> +
>   	/*
>   	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
>   	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
> @@ -578,6 +629,10 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
>   static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
>   				struct mem_ops *ops, bool expect)
>   {
> +	/* read&write file collapse always fail */
> +	if (!is_tmpfs(ops) && ops == &__read_write_file_ops)
> +		expect = false;
> +
>   	if (wait_for_scan(msg, p, nr_hpages, ops)) {
>   		if (expect)
>   			fail("Timeout");
> @@ -612,16 +667,6 @@ static struct collapse_context __madvise_context = {
>   	.name = "madvise",
>   };
>   
> -static bool is_tmpfs(struct mem_ops *ops)
> -{
> -	return ops == &__file_ops && finfo.type == VMA_SHMEM;
> -}
> -
> -static bool is_anon(struct mem_ops *ops)
> -{
> -	return ops == &__anon_ops;
> -}
> -
>   static void alloc_at_fault(void)
>   {
>   	struct thp_settings settings = *thp_current_settings();
> @@ -1097,8 +1142,8 @@ static void usage(void)
>   	fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
>   	fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
>   	fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
> -	fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
> -	fprintf(stderr,	"\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
> +	fprintf(stderr, "\n\t\"file,all\" mem_type requires a file system\n");
> +	fprintf(stderr,	"\twith large folio support (order >= PMD order)\n");
>   	fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
>   	fprintf(stderr,	"\tmounted with huge=advise option for khugepaged tests to work\n");
>   	fprintf(stderr,	"\n\tSupported Options:\n");
> @@ -1154,20 +1199,22 @@ static void parse_test_type(int argc, char **argv)
>   		usage();
>   
>   	if (!strcmp(buf, "all")) {
> -		file_ops =  &__file_ops;
> +		read_only_file_ops =  &__read_only_file_ops;
> +		read_write_file_ops =  &__read_write_file_ops;
>   		anon_ops = &__anon_ops;
>   		shmem_ops = &__shmem_ops;
>   	} else if (!strcmp(buf, "anon")) {
>   		anon_ops = &__anon_ops;
>   	} else if (!strcmp(buf, "file")) {
> -		file_ops =  &__file_ops;
> +		read_only_file_ops =  &__read_only_file_ops;
> +		read_write_file_ops =  &__read_write_file_ops;
>   	} else if (!strcmp(buf, "shmem")) {
>   		shmem_ops = &__shmem_ops;
>   	} else {
>   		usage();
>   	}
>   
> -	if (!file_ops)
> +	if (!read_only_file_ops && !read_write_file_ops)
>   		return;
>   
>   	if (argc != 2)
> @@ -1239,37 +1286,43 @@ int main(int argc, char **argv)
>   	} while (0)
>   
>   	TEST(collapse_full, khugepaged_context, anon_ops);
> -	TEST(collapse_full, khugepaged_context, file_ops);
> +	TEST(collapse_full, khugepaged_context, read_only_file_ops);
> +	TEST(collapse_full, khugepaged_context, read_write_file_ops);
>   	TEST(collapse_full, khugepaged_context, shmem_ops);
>   	TEST(collapse_full, madvise_context, anon_ops);
> -	TEST(collapse_full, madvise_context, file_ops);
> +	TEST(collapse_full, madvise_context, read_only_file_ops);
> +	TEST(collapse_full, madvise_context, read_write_file_ops);
>   	TEST(collapse_full, madvise_context, shmem_ops);
>   
>   	TEST(collapse_empty, khugepaged_context, anon_ops);
>   	TEST(collapse_empty, madvise_context, anon_ops);
>   
>   	TEST(collapse_single_pte_entry, khugepaged_context, anon_ops);
> -	TEST(collapse_single_pte_entry, khugepaged_context, file_ops);
> +	TEST(collapse_single_pte_entry, khugepaged_context, read_only_file_ops);
> +	TEST(collapse_single_pte_entry, khugepaged_context, read_write_file_ops);
>   	TEST(collapse_single_pte_entry, khugepaged_context, shmem_ops);
>   	TEST(collapse_single_pte_entry, madvise_context, anon_ops);
> -	TEST(collapse_single_pte_entry, madvise_context, file_ops);
> +	TEST(collapse_single_pte_entry, madvise_context, read_only_file_ops);
> +	TEST(collapse_single_pte_entry, madvise_context, read_write_file_ops);
>   	TEST(collapse_single_pte_entry, madvise_context, shmem_ops);
>   
>   	TEST(collapse_max_ptes_none, khugepaged_context, anon_ops);
> -	TEST(collapse_max_ptes_none, khugepaged_context, file_ops);
> +	TEST(collapse_max_ptes_none, khugepaged_context, read_only_file_ops);
> +	TEST(collapse_max_ptes_none, khugepaged_context, read_write_file_ops);
>   	TEST(collapse_max_ptes_none, madvise_context, anon_ops);
> -	TEST(collapse_max_ptes_none, madvise_context, file_ops);
> +	TEST(collapse_max_ptes_none, madvise_context, read_only_file_ops);
> +	TEST(collapse_max_ptes_none, madvise_context, read_write_file_ops);
>   
>   	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
> -	TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops);
> +	TEST(collapse_single_pte_entry_compound, khugepaged_context, read_only_file_ops);
>   	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
> -	TEST(collapse_single_pte_entry_compound, madvise_context, file_ops);
> +	TEST(collapse_single_pte_entry_compound, madvise_context, read_only_file_ops);
>   
>   	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
> -	TEST(collapse_full_of_compound, khugepaged_context, file_ops);
> +	TEST(collapse_full_of_compound, khugepaged_context, read_only_file_ops);
>   	TEST(collapse_full_of_compound, khugepaged_context, shmem_ops);
>   	TEST(collapse_full_of_compound, madvise_context, anon_ops);
> -	TEST(collapse_full_of_compound, madvise_context, file_ops);
> +	TEST(collapse_full_of_compound, madvise_context, read_only_file_ops);
>   	TEST(collapse_full_of_compound, madvise_context, shmem_ops);
>   
>   	TEST(collapse_compound_extreme, khugepaged_context, anon_ops);
> @@ -1291,10 +1344,10 @@ int main(int argc, char **argv)
>   	TEST(collapse_max_ptes_shared, madvise_context, anon_ops);
>   
>   	TEST(madvise_collapse_existing_thps, madvise_context, anon_ops);
> -	TEST(madvise_collapse_existing_thps, madvise_context, file_ops);
> +	TEST(madvise_collapse_existing_thps, madvise_context, read_only_file_ops);
>   	TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops);
>   
> -	TEST(madvise_retracted_page_tables, madvise_context, file_ops);
> +	TEST(madvise_retracted_page_tables, madvise_context, read_only_file_ops);
>   	TEST(madvise_retracted_page_tables, madvise_context, shmem_ops);
>   
>   	restore_settings(0);
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index 3b61677fe9840..854c5c3e3a6ae 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -490,8 +490,6 @@ CATEGORY="thp" run_test ./khugepaged all:shmem
>   
>   CATEGORY="thp" run_test ./khugepaged -s 4 all:shmem
>   
> -CATEGORY="thp" run_test ./transhuge-stress -d 20
> -
>   # Try to create XFS if not provided
>   if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
>       if [ "${HAVE_HUGEPAGES}" = "1" ]; then
> @@ -508,6 +506,14 @@ if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
>       fi
>   fi
>   
> +if [ -n "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
> +CATEGORY="thp" run_test ./khugepaged all:file ${SPLIT_HUGE_PAGE_TEST_XFS_PATH}
> +else
> +	count_total=$(( count_total + 1 ))
> +	count_skip=$(( count_skip + 1 ))
> +	echo "[SKIP] ./khugepaged all:file" | tap_prefix

This causes selftest runs to always litter the output with [SKIP] when 
run via the wrapper

make -C tools/testing/selftests TARGETS=mm run_tests

> +fi
> +
>   CATEGORY="thp" run_test ./split_huge_page_test ${SPLIT_HUGE_PAGE_TEST_XFS_PATH}
>   
>   if [ -n "${MOUNTED_XFS}" ]; then
> @@ -516,6 +522,8 @@ if [ -n "${MOUNTED_XFS}" ]; then
>       rm -f ${XFS_IMG}
>   fi
>   
> +CATEGORY="thp" run_test ./transhuge-stress -d 20
> +
>   CATEGORY="thp" run_test ./folio_split_race_test
>   
>   CATEGORY="migration" run_test ./migration


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap()
  2026-04-29 15:29 ` [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap() Zi Yan
  2026-04-30 15:11   ` Zi Yan
  2026-05-04  3:53   ` Nico Pache
@ 2026-05-06  5:23   ` Lance Yang
  2 siblings, 0 replies; 32+ messages in thread
From: Lance Yang @ 2026-05-06  5:23 UTC (permalink / raw)
  To: ziy
  Cc: akpm, david, willy, songliubraving, clm, dsterba, viro, brauner,
	jack, ljs, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, shuah,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest


On Wed, Apr 29, 2026 at 11:29:12AM -0400, Zi Yan wrote:
>This check ensures the correctness of read-only PMD folio collapse
>after it is enabled for all FSes supporting PMD pagecache folios and
>replaces READ_ONLY_THP_FOR_FS.
>
>READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
>and inode->i_writecount to prevent any write to read-only to-be-collapsed
>folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
>aforementioned mechanism will go away too. To ensure khugepaged functions
>as expected after the changes, skip if any folio is dirty after
>try_to_unmap(), since a dirty folio at that point means this read-only
>folio can get writes between try_to_unmap() and try_to_unmap_flush() via
>cached TLB entries and khugepaged does not support writable pagecache folio
>collapse yet.
>
>Signed-off-by: Zi Yan <ziy@nvidia.com>
>Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>---
> mm/khugepaged.c | 28 ++++++++++++++++++++++++----
> 1 file changed, 24 insertions(+), 4 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 6808f2b48d864..71209a72195ab 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -2327,8 +2327,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> 				}
> 			} else if (folio_test_dirty(folio)) {
> 				/*
>-				 * khugepaged only works on read-only fd,
>-				 * so this page is dirty because it hasn't
>+				 * This page is dirty because it hasn't
> 				 * been flushed since first write. There
> 				 * won't be new dirty pages.
> 				 *
>@@ -2386,8 +2385,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> 		if (!is_shmem && (folio_test_dirty(folio) ||
> 				  folio_test_writeback(folio))) {
> 			/*
>-			 * khugepaged only works on read-only fd, so this
>-			 * folio is dirty because it hasn't been flushed
>+			 * khugepaged only works on clean file-backed folios,
>+			 * so this folio is dirty because it hasn't been flushed
> 			 * since first write.
> 			 */
> 			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>@@ -2431,6 +2430,27 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> 			goto out_unlock;
> 		}
> 
>+		/*
>+		 * At this point, the folio is locked and unmapped. If the PTE
>+		 * was dirty, try_to_unmap() has transferred the dirty bit to
>+		 * the folio and we must not collapse it into a clean
>+		 * file-backed folio.
>+		 *
>+		 * If the folio is clean here, no one can write it until we
>+		 * drop the folio lock. A write through a stale TLB entry came
>+		 * from a clean PTE and must fault because the PTE has been
>+		 * cleared; the fault path has to take the folio lock before

Yeah, try_to_unmap_one() already documents the arch guarantee required
here: once the PTE is cleared, a write through a clean cached TLB entry
must be written through and trap.

			/*
			 * We clear the PTE but do not flush so potentially
			 * a remote CPU could still be writing to the folio.
			 * If the entry was previously clean then the
			 * architecture must guarantee that a clear->dirty
			 * transition on a cached TLB entry is written through
			 * and traps if the PTE is unmapped.
			 */

Lesson learned :)

>+		 * installing a writable mapping. Buffered write paths also
>+		 * have to take the folio lock before modifying file contents
>+		 * without a mapping, typically via write_begin_get_folio().
>+		 */
>+		if (!is_shmem && folio_test_dirty(folio)) {
>+			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>+			xas_unlock_irq(&xas);
>+			folio_putback_lru(folio);
>+			goto out_unlock;
>+		}

LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>


end of thread, other threads:[~2026-05-06  5:24 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-29 15:29 [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Zi Yan
2026-04-29 15:29 ` [PATCH v5 01/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
2026-04-30 14:37   ` Zi Yan
2026-04-30 15:04     ` Andrew Morton
2026-05-04  3:48   ` Nico Pache
2026-04-29 15:29 ` [PATCH v5 02/14] mm/khugepaged: add folio dirty check after try_to_unmap() Zi Yan
2026-04-30 15:11   ` Zi Yan
2026-05-04  3:53   ` Nico Pache
2026-05-06  5:23   ` Lance Yang
2026-04-29 15:29 ` [PATCH v5 03/14] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
2026-05-04  3:57   ` Nico Pache
2026-04-29 15:29 ` [PATCH v5 04/14] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled() Zi Yan
2026-05-04  4:00   ` Nico Pache
2026-04-29 15:35 ` [PATCH v5 05/14] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
2026-05-04  4:02   ` Nico Pache
2026-04-29 15:35 ` [PATCH v5 06/14] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
2026-04-29 15:35 ` [PATCH v5 07/14] fs: remove nr_thps from struct address_space Zi Yan
2026-05-04  4:11   ` Nico Pache
2026-04-29 15:35 ` [PATCH v5 08/14] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
2026-04-29 15:35 ` [PATCH v5 09/14] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
2026-04-30 15:12   ` Zi Yan
2026-04-29 15:35 ` [PATCH v5 10/14] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
2026-04-29 15:35 ` [PATCH v5 11/14] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
2026-04-30 15:16   ` Zi Yan
2026-04-30 15:27     ` Zi Yan
2026-05-04  4:23   ` Nico Pache
2026-05-04 10:11   ` Nico Pache
2026-04-29 15:35 ` [PATCH v5 12/14] selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions Zi Yan
2026-04-29 15:35 ` [PATCH v5 13/14] mm/khugepaged: enable clean pagecache folio collapse for writable files Zi Yan
2026-04-30 15:18   ` Zi Yan
2026-04-29 15:35 ` [PATCH v5 14/14] selftests/mm: add writable-file collapse tests for khugepaged Zi Yan
2026-04-29 16:13 ` [PATCH v5 00/14] Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox