linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v13 00/10] enable bs > ps in XFS
@ 2024-08-22 13:50 Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: Pankaj Raghav <p.raghav@samsung.com>

This is the 13th version of the series that enables experimental support
for block size > page size (Large Block Size) in XFS. Please consider this
for inclusion in 6.12.

The context and motivation can be seen in the cover letter of the RFC v1 [0].
We also recorded a talk about this effort at LPC [1] for anyone who would
like more context.

Thanks to David Howells, the page cache changes have also been tested on
top of AFS[2] with mapping_min_order set.

A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.

I also ran the split_huge_page_test selftest on an XFS filesystem to check
that huge page splits in min order chunks are done correctly.

No regressions were found with these patches added on top.

Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes. To
compare against existing support, an ARM VM with a 64k base page size
(without our patches) was used as a reference to distinguish failures
actually caused by LBS support on a 4k base page size system.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k
block size against pmem and NVMe, with buffered IO and Direct IO, comparing
a vanilla kernel against one with these patches applied, and detected no
regressions.

We ran sysbench on postgres and mysql for several hours on LBS XFS
without any issues.

We also wrote an eBPF tool called blkalgn [5] to check that IO sent to the
device is aligned and at least a filesystem block in length.

For those who want this in a git tree we have this up on a kdevops
large-block-minorder-for-next-v13 tag [6].

[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lore.kernel.org/linux-mm/3792765.1724196264@warthog.procyon.org.uk/
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung tasks
and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t

Changes since v12:
- Fixed the issue of masking the wrong bits in PATCH 1.
- Collected Tested-by from David Howells.

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Luis Chamberlain (1):
  mm: split a folio in minimum folio order chunks

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in
    folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/buffered-io.c        |   4 +-
 fs/iomap/direct-io.c          |  45 +++++++++++--
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |   8 ++-
 fs/xfs/xfs_super.c            |  28 +++++---
 include/linux/huge_mm.h       |  14 ++--
 include/linux/pagemap.h       | 123 ++++++++++++++++++++++++++++++----
 mm/filemap.c                  |  36 ++++++----
 mm/huge_memory.c              |  60 +++++++++++++++--
 mm/readahead.c                |  83 +++++++++++++++++------
 14 files changed, 347 insertions(+), 85 deletions(-)


base-commit: eb8c5ca373cbb018a84eb4db25c863302c9b6314
-- 
2.44.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-23 13:09   ` Daniel Gomez
  2024-08-22 13:50 ` [PATCH v13 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, David Howells

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We need filesystems to be able to communicate acceptable folio sizes
to the pagecache for a variety of uses (e.g. large block sizes).
Support a range of folio sizes between order-0 and order-31.
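
As a rough standalone sketch (not part of this patch), the layout boils down
to two 5-bit fields, starting at bit 16 of mapping->flags, holding the minimum
and maximum folio order. The constants below are illustrative stand-ins for
the AS_FOLIO_ORDER_* macros in the diff:

#include <assert.h>
#include <stdio.h>

/*
 * Illustrative stand-ins for the AS_FOLIO_ORDER_* macros: 5 bits each
 * for the minimum and maximum folio order, starting at bit 16.
 */
#define FOLIO_ORDER_BITS	5
#define FOLIO_ORDER_MIN_SHIFT	16
#define FOLIO_ORDER_MAX_SHIFT	(FOLIO_ORDER_MIN_SHIFT + FOLIO_ORDER_BITS)
#define FOLIO_ORDER_FIELD_MASK	((1u << FOLIO_ORDER_BITS) - 1)

static unsigned long set_order_range(unsigned long flags, unsigned int min,
				     unsigned int max)
{
	flags &= ~((unsigned long)FOLIO_ORDER_FIELD_MASK << FOLIO_ORDER_MIN_SHIFT);
	flags &= ~((unsigned long)FOLIO_ORDER_FIELD_MASK << FOLIO_ORDER_MAX_SHIFT);
	return flags | ((unsigned long)min << FOLIO_ORDER_MIN_SHIFT) |
		       ((unsigned long)max << FOLIO_ORDER_MAX_SHIFT);
}

int main(void)
{
	/* e.g. a filesystem that wants 16k-64k folios on 4k pages */
	unsigned long flags = set_order_range(0, 2, 4);
	unsigned int min = (flags >> FOLIO_ORDER_MIN_SHIFT) & FOLIO_ORDER_FIELD_MASK;
	unsigned int max = (flags >> FOLIO_ORDER_MAX_SHIFT) & FOLIO_ORDER_FIELD_MASK;

	assert(min == 2 && max == 4);
	printf("min order %u, max order %u\n", min, max);
	return 0;
}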

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: David Howells <dhowells@redhat.com>
---
 include/linux/pagemap.h | 90 ++++++++++++++++++++++++++++++++++-------
 mm/filemap.c            |  6 +--
 mm/readahead.c          |  4 +-
 3 files changed, 80 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422bd..c60025bb584c5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -204,14 +204,21 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_INACCESSIBLE,	/* Do not attempt direct R/W access to the mapping,
-				   including to move the mapping */
+	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	/* Bits 16-25 are used for FOLIO_ORDER */
+	AS_FOLIO_ORDER_BITS = 5,
+	AS_FOLIO_ORDER_MIN = 16,
+	AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
 };
 
+#define AS_FOLIO_ORDER_BITS_MASK ((1u << AS_FOLIO_ORDER_BITS) - 1)
+#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_BITS_MASK << AS_FOLIO_ORDER_MIN)
+#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_BITS_MASK << AS_FOLIO_ORDER_MAX)
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -367,9 +374,51 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
 #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
 
+/*
+ * mapping_set_folio_order_range() - Set the orders supported by a file.
+ * @mapping: The address space of the file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file.  This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+						 unsigned int min,
+						 unsigned int max)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return;
+
+	if (min > MAX_PAGECACHE_ORDER)
+		min = MAX_PAGECACHE_ORDER;
+
+	if (max > MAX_PAGECACHE_ORDER)
+		max = MAX_PAGECACHE_ORDER;
+
+	if (max < min)
+		max = min;
+
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+		(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address space of the file.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -380,7 +429,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int
+mapping_max_folio_order(const struct address_space *mapping)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 0;
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int
+mapping_min_folio_order(const struct address_space *mapping)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 0;
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
 /*
@@ -389,20 +454,17 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	/* AS_LARGE_FOLIO_SUPPORT is only reasonable for pagecache folios */
+	/* AS_FOLIO_ORDER is only reasonable for pagecache folios */
 	VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
 			"Anonymous mapping always supports large folio");
 
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return mapping_max_folio_order(mapping) > 0;
 }
 
 /* Return the maximum folio size for this pagecache mapping, in bytes. */
-static inline size_t mapping_max_folio_size(struct address_space *mapping)
+static inline size_t mapping_max_folio_size(const struct address_space *mapping)
 {
-	if (mapping_large_folio_support(mapping))
-		return PAGE_SIZE << MAX_PAGECACHE_ORDER;
-	return PAGE_SIZE;
+	return PAGE_SIZE << mapping_max_folio_order(mapping);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index d87c858465962..5c148144d5548 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1932,10 +1932,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		if (!mapping_large_folio_support(mapping))
-			order = 0;
-		if (order > MAX_PAGECACHE_ORDER)
-			order = MAX_PAGECACHE_ORDER;
+		if (order > mapping_max_folio_order(mapping))
+			order = mapping_max_folio_order(mapping);
 		/* If we're not aligned, allocate a smaller folio */
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
diff --git a/mm/readahead.c b/mm/readahead.c
index e83fe1c6e5acd..e0cf3bfd2b2b3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -449,10 +449,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < MAX_PAGECACHE_ORDER)
+	if (new_order < mapping_max_folio_order(mapping))
 		new_order += 2;
 
-	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
+	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 
 	/* See comment in page_cache_ra_unbounded() */
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 02/10] filemap: allocate mapping_min_order folios in the page cache
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, David Howells

From: Pankaj Raghav <p.raghav@samsung.com>

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would fall
back to an order-0 folio if the higher order memory allocation failed.

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
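
For illustration only, a minimal userspace sketch of that alignment rule,
mirroring the mapping_align_index() helper added below (the min order used
here is an assumed example):

#include <assert.h>

/*
 * A folio's index must be a multiple of its size in pages, so new folios
 * are added at an index rounded down to the minimum folio size.
 */
static unsigned long align_index(unsigned long index, unsigned int min_order)
{
	unsigned long nrpages = 1UL << min_order;

	return index & ~(nrpages - 1);
}

int main(void)
{
	/* min order 2 (16k folios on 4k pages): index 7 belongs to the
	 * folio starting at index 4. */
	assert(align_index(7, 2) == 4);
	assert(align_index(8, 2) == 8);
	return 0;
}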

Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: David Howells <dhowells@redhat.com>
---
 include/linux/pagemap.h | 20 ++++++++++++++++++++
 mm/filemap.c            | 24 ++++++++++++++++--------
 2 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c60025bb584c5..4cc170949e9c0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -448,6 +448,26 @@ mapping_min_folio_order(const struct address_space *mapping)
 	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
+static inline unsigned long
+mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_index() - Align index for this mapping.
+ * @mapping: The address_space.
+ *
+ * The index of a folio must be naturally aligned.  If you are adding a
+ * new folio to the page cache and need to know what index to give it,
+ * call this function.
+ */
+static inline pgoff_t mapping_align_index(struct address_space *mapping,
+					  pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
+}
+
 /*
  * Large folio support currently depends on THP.  These dependencies are
  * being worked on but are not yet fixed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c148144d5548..9a047c6d03e4e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -858,6 +858,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
@@ -1918,8 +1920,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1942,7 +1946,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order > 0)
+			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
 			folio = filemap_alloc_folio(alloc_gfp, order);
 			if (!folio)
@@ -1957,7 +1961,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2443,13 +2447,15 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -2467,6 +2473,7 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2527,8 +2534,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3748,9 +3754,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
+		index = mapping_align_index(mapping, index);
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 03/10] readahead: allocate folios with mapping_min_order in readahead
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, David Howells

From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios)
if no folio was found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.

page_cache_ra_order() tries to allocate folios with a higher order if the
index aligns with that order. Modify it so that the order does not go
below the mapping_min_order requirement of the page cache. This function
will do the right thing even if the new_order passed is less than the
mapping_min_order. When adding new folios to the page cache we must also
ensure the index used is aligned to the mapping_min_order, as the page
cache requires the index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of
the readahead, so this function can assume ractl->_index is aligned to
min_order.
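
As an illustrative sketch (not part of the patch), this is roughly how the
readahead loop steps in min_nrpages units and why the readahead mark must be
rounded up to a folio boundary; all values below are assumed examples
(min order 2, i.e. 16k folios on 4k pages):

#include <stdio.h>

#define round_up(x, y)	((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	unsigned long index = 0, nr_to_read = 32, lookahead_size = 8;
	unsigned long min_nrpages = 4;
	/* Mirror of the mark computation: the readahead flag must land
	 * on a folio boundary. */
	unsigned long mark = round_up(index + nr_to_read - lookahead_size,
				      min_nrpages) - index;

	for (unsigned long i = 0; i < nr_to_read; i += min_nrpages)
		printf("allocate folio at index %lu%s\n", index + i,
		       i == mark ? " (set readahead flag)" : "");
	return 0;
}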

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Co-developed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: David Howells <dhowells@redhat.com>
---
 mm/readahead.c | 79 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index e0cf3bfd2b2b3..3dc6c7a128dd3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long ra_folio_index, index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long mark, i = 0;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,10 +224,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	nr_to_read += readahead_index(ractl) - index;
+	ractl->_index = index;
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 		int ret;
 
@@ -240,12 +255,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += min_nrpages;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 
@@ -255,14 +271,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			if (ret == -ENOMEM)
 				break;
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += min_nrpages;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
+		i += min_nrpages;
 	}
 
 	/*
@@ -438,13 +455,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t start = readahead_index(ractl);
 	pgoff_t index = start;
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -454,10 +477,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+	new_order = max(new_order, min_order);
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -465,7 +497,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -703,8 +735,15 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -714,9 +753,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -726,7 +767,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -741,9 +782,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -753,10 +796,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-29 10:51   ` Sven Schnelle
                     ` (2 more replies)
  2024-08-22 13:50 ` [PATCH v13 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, David Howells

From: Luis Chamberlain <mcgrof@kernel.org>

split_folio() and split_folio_to_list() assume order 0. To support min
order for non-anonymous folios, we must expand these to check the folio
mapping order and use that.

Set new_order to be at least minimum folio order if it is set in
split_huge_page_to_list() so that we can maintain minimum folio order
requirement in the page cache.

Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.
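
A simplified, hedged sketch of the resulting policy (the min order value is
an assumed example; the real decision is made in split_folio_to_list() in
the diff below):

#include <stdio.h>
#include <stdbool.h>

/*
 * Anonymous folios may still be split to order 0, while file-backed
 * folios are only split down to the mapping's minimum folio order.
 */
static unsigned int split_target_order(bool anon, unsigned int mapping_min_order)
{
	return anon ? 0 : mapping_min_order;
}

int main(void)
{
	printf("anon folio:     split to order %u\n", split_target_order(true, 2));
	printf("LBS file folio: split to order %u\n", split_target_order(false, 2));
	return 0;
}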

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: David Howells <dhowells@redhat.com>
---
 include/linux/huge_mm.h | 14 +++++++---
 mm/huge_memory.c        | 60 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c32058cacfec..70424d55da088 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
+#define split_folio(f) split_folio_to_list(f, NULL)
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_huge_page_to_list_to_order(page, NULL, 0);
+	return split_folio(page_folio(page));
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
@@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
 	return split_folio_to_list_to_order(folio, NULL, new_order);
 }
 
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf8e34f62976f..06384b85a3a20 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
  * released, or if some unexpected race happened (e.g., anon VMA disappeared,
  * truncation).
  *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
@@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
+		unsigned int min_order;
 		gfp_t gfp;
 
 		mapping = folio->mapping;
@@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		min_order = mapping_min_folio_order(folio->mapping);
+		if (new_order < min_order) {
+			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+				     min_order);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
@@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	unsigned int min_order = 0;
+
+	if (folio_test_anon(folio))
+		goto out;
+
+	if (!folio->mapping) {
+		if (folio_test_pmd_mappable(folio))
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+		return -EBUSY;
+	}
+
+	min_order = mapping_min_folio_order(folio->mapping);
+out:
+	return split_huge_page_to_list_to_order(&folio->page, list,
+							min_order);
+}
+
 void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
@@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct folio_walk fw;
 		struct folio *folio;
+		struct address_space *mapping;
+		unsigned int target_order = new_order;
 
 		if (!vma)
 			break;
@@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!is_transparent_hugepage(folio))
 			goto next;
 
-		if (new_order >= folio_order(folio))
+		if (!folio_test_anon(folio)) {
+			mapping = folio->mapping;
+			target_order = max(new_order,
+					   mapping_min_folio_order(mapping));
+		}
+
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		total++;
@@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		folio_get(folio);
 		folio_walk_end(&fw, vma);
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!folio_test_anon(folio) && folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
+
 		folio_unlock(folio);
 		folio_put(folio);
 
@@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	pgoff_t index;
 	int nr_pages = 1;
 	unsigned long total = 0, split = 0;
+	unsigned int min_order;
+	unsigned int target_order;
 
 	file = getname_kernel(file_path);
 	if (IS_ERR(file))
@@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		 file_path, off_start, off_end);
 
 	mapping = candidate->f_mapping;
+	min_order = mapping_min_folio_order(mapping);
+	target_order = max(new_order, min_order);
 
 	for (index = off_start; index < off_end; index += nr_pages) {
 		struct folio *folio = filemap_get_folio(mapping, index);
@@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		total++;
 		nr_pages = folio_nr_pages(folio);
 
-		if (new_order >= folio_order(folio))
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, David Howells

From: Pankaj Raghav <p.raghav@samsung.com>

Usually the page cache does not extend beyond the size of the inode;
therefore, no PTEs are created for folios that extend beyond that size.

But with LBS support, we might extend the page cache beyond the size of
the inode as we need to guarantee folios of a minimum order. While doing
a read, do_fault_around() can create PTEs for pages that lie beyond the
EOF, leading to an incorrect error return when accessing a page beyond
the mapped file.

Cap the PTE range to be created for the page cache up to the end of
file (EOF) in filemap_map_pages() so that the returned error codes are
consistent with POSIX [1] for LBS configurations.

generic/749 has been created to trigger this edge case. This also fixes
generic/749 for tmpfs with huge=always on systems with 4k base page size.

[1](from mmap(2))  SIGBUS
    Attempted access to a page of the buffer that lies beyond the end
    of the mapped file.  For an explanation of the treatment  of  the
    bytes  in  the  page that corresponds to the end of a mapped file
    that is not a multiple of the page size, see NOTES.
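
For illustration, a small standalone sketch of the capping arithmetic with
assumed example values (4k pages, a 10000-byte file cached in an order-2
folio):

#include <assert.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	/* The folio spans page indexes 0-3 but the file only covers 0-2,
	 * so fault-around must not map PTEs past index 2. */
	unsigned long page_size = 4096, i_size = 10000;
	unsigned long end_pgoff = 3;	/* last index covered by the folio */
	unsigned long file_end = DIV_ROUND_UP(i_size, page_size) - 1;

	if (end_pgoff > file_end)
		end_pgoff = file_end;

	assert(file_end == 2 && end_pgoff == 2);
	return 0;
}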

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: David Howells <dhowells@redhat.com>
---
 mm/filemap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9a047c6d03e4e..eab1f12e7b840 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3607,7 +3607,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	pgoff_t last_pgoff = start_pgoff;
+	pgoff_t file_end, last_pgoff = start_pgoff;
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
@@ -3633,6 +3633,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (end_pgoff > file_end)
+		end_pgoff = file_end;
+
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
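
As a standalone sketch of the head/tail padding arithmetic that feeds
iomap_dio_zero(), with assumed example values (16k filesystem block, 4k
direct write at offset 20k):

#include <assert.h>

int main(void)
{
	unsigned long long fs_block_size = 16384;
	unsigned long long pos = 20480;		/* write offset */
	unsigned long long len = 4096;		/* write length */

	unsigned long long head_pad = pos & (fs_block_size - 1);
	unsigned long long tail_pad = (pos + len) & (fs_block_size - 1);

	assert(head_pad == 4096);			/* zero before the write */
	assert(fs_block_size - tail_pad == 8192);	/* zero after the write  */
	return 0;
}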

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c |  4 ++--
 fs/iomap/direct-io.c   | 45 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b4ca3811a242..cdab801e9d635 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -2007,10 +2007,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
-static int __init iomap_init(void)
+static int __init iomap_buffered_init(void)
 {
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
 }
-fs_initcall(iomap_init);
+fs_initcall(iomap_buffered_init);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46e..c02b266bba525 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -11,6 +11,7 @@
 #include <linux/iomap.h>
 #include <linux/backing-dev.h>
 #include <linux/uio.h>
+#include <linux/set_memory.h>
 #include <linux/task_io_accounting_ops.h>
 #include "trace.h"
 
@@ -27,6 +28,13 @@
 #define IOMAP_DIO_WRITE		(1U << 30)
 #define IOMAP_DIO_DIRTY		(1U << 31)
 
+/*
+ * Used for sub block zeroing in iomap_dio_zero()
+ */
+#define IOMAP_ZERO_PAGE_SIZE (SZ_64K)
+#define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE))
+static struct page *zero_page;
+
 struct iomap_dio {
 	struct kiocb		*iocb;
 	const struct iomap_dio_ops *dops;
@@ -232,13 +240,20 @@ void iomap_dio_bio_end_io(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
 
-static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
+static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
+	if (!len)
+		return 0;
+	/*
+	 * Max block size supported is 64k
+	 */
+	if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE))
+		return -EINVAL;
+
 	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
@@ -246,8 +261,9 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	__bio_add_page(bio, zero_page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
+	return 0;
 }
 
 /*
@@ -356,8 +372,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	if (need_zeroout) {
 		/* zero out from the start of the block to the write offset */
 		pad = pos & (fs_block_size - 1);
-		if (pad)
-			iomap_dio_zero(iter, dio, pos - pad, pad);
+
+		ret = iomap_dio_zero(iter, dio, pos - pad, pad);
+		if (ret)
+			goto out;
 	}
 
 	/*
@@ -431,7 +449,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		/* zero out from the end of the write to the end of the block */
 		pad = pos & (fs_block_size - 1);
 		if (pad)
-			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+			ret = iomap_dio_zero(iter, dio, pos,
+					     fs_block_size - pad);
 	}
 out:
 	/* Undo iter limitation to current extent */
@@ -753,3 +772,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+	zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+				IOMAP_ZERO_PAGE_ORDER);
+
+	if (!zero_page)
+		return -ENOMEM;
+
+	set_memory_ro((unsigned long)page_address(zero_page),
+		      1U << IOMAP_ZERO_PAGE_ORDER);
+	return 0;
+}
+fs_initcall(iomap_dio_init);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 07/10] xfs: use kvmalloc for xattr buffers
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:

static inline
struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
>>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...

Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().

This is not an issue that production systems will see as filesystems
with block size > page size cannot be mounted by the kernel; Pankaj is
developing this functionality right now.
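
A minimal userspace sketch of why the quoted warning fires for these
buffers, assuming a 64k filesystem block on a 4k-page system (get_order_of()
is an illustrative stand-in for the kernel's get_order()):

#include <assert.h>

/*
 * The quoted check rejects __GFP_NOFAIL allocations with order > 1, and
 * a 64k xattr buffer on a 4k-page system is an order-4 allocation.
 */
static unsigned int get_order_of(unsigned long size, unsigned long page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}

int main(void)
{
	assert(get_order_of(65536, 4096) == 4);	/* 64k block: trips the warning */
	assert(get_order_of(4096, 4096) == 0);	/* order <= 1: no warning */
	return 0;
}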

Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b9e98950eb3d8..09f4cb061a6e0 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1138,10 +1138,7 @@ xfs_attr3_leaf_to_shortform(
 
 	trace_xfs_attr_leaf_to_sf(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
-	if (!tmpbuffer)
-		return -ENOMEM;
-
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 
 	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1205,7 +1202,7 @@ xfs_attr3_leaf_to_shortform(
 	error = 0;
 
 out:
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 	return error;
 }
 
@@ -1613,7 +1610,7 @@ xfs_attr3_leaf_compact(
 
 	trace_xfs_attr_leaf_compact(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 	memset(bp->b_addr, 0, args->geo->blksize);
 	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1651,7 +1648,7 @@ xfs_attr3_leaf_compact(
 	 */
 	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
 
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 }
 
 /*
@@ -2330,7 +2327,7 @@ xfs_attr3_leaf_unbalance(
 		struct xfs_attr_leafblock *tmp_leaf;
 		struct xfs_attr3_icleaf_hdr tmphdr;
 
-		tmp_leaf = kzalloc(state->args->geo->blksize,
+		tmp_leaf = kvzalloc(state->args->geo->blksize,
 				GFP_KERNEL | __GFP_NOFAIL);
 
 		/*
@@ -2371,7 +2368,7 @@ xfs_attr3_leaf_unbalance(
 		}
 		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
 		savehdr = tmphdr; /* struct copy */
-		kfree(tmp_leaf);
+		kvfree(tmp_leaf);
 	}
 
 	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 08/10] xfs: expose block size in stat
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
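
For illustration, a userspace sketch of how a test program can pick up the
new value; the mount path used here is hypothetical:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * st_blksize is the unit that block-granular operations (e.g.
 * fallocate() insert range) should be aligned to.
 */
int main(void)
{
	struct stat st;

	if (stat("/mnt/xfs-lbs/testfile", &st) == 0)
		printf("preferred IO / alignment unit: %ld bytes\n",
		       (long)st.st_blksize);
	return 0;
}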

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a1c4a350a6dbf..2b8dbe8bf1381 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -567,7 +567,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-08-22 13:50 ` [PATCH v13 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  2024-08-22 21:23 ` [PATCH v13 00/10] enable bs > ps in XFS Luis Chamberlain
  10 siblings, 0 replies; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that page cache count can be calculated
correctly for LBS.
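
A standalone sketch of the overflow-safe form, with an open-coded guard
standing in for the kernel's check_shl_overflow(); values are assumed
examples:

#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Compute the byte count first and reject overflow, instead of shifting
 * by (PAGE_SHIFT - blocklog), which is wrong once blocklog > PAGE_SHIFT.
 */
static int validate_fsb_count(uint64_t nblocks, unsigned int blocklog,
			      unsigned int page_shift)
{
	uint64_t max_bytes;

	if (blocklog >= 64 || nblocks > (UINT64_MAX >> blocklog))
		return -1;		/* nblocks << blocklog would overflow */
	max_bytes = nblocks << blocklog;

	/* Limited by ULONG_MAX of the page cache index. */
	if ((max_bytes >> page_shift) > ULONG_MAX)
		return -1;
	return 0;
}

int main(void)
{
	/* 1M blocks of 64k (blocklog 16) on a 4k-page (shift 12) system. */
	assert(validate_fsb_count(1ULL << 20, 16, 12) == 0);
	return 0;
}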

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 09eef1721ef4f..3949f720b5354 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,11 +132,16 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
+	uint64_t		max_bytes;
+
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
+
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v13 10/10] xfs: enable block size larger than page size support
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-08-22 13:50 ` Pankaj Raghav (Samsung)
  2024-09-03 12:29   ` [PATCH v13 10/10] xfs: enable block size larger than page size support Daniel Gomez
  2024-08-22 21:23 ` [PATCH v13 00/10] enable bs > ps in XFS Luis Chamberlain
  10 siblings, 1 reply; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-22 13:50 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to enforce a minimum order when
allocating a folio, which is a prerequisite for adding support for block
size > page size.
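
As a hedged sketch, the wiring in this patch boils down to the following
order calculation (example values only):

#include <assert.h>

/*
 * When the filesystem block size exceeds the page size, the page cache
 * minimum folio order is sb_blocklog - PAGE_SHIFT, otherwise 0.
 */
static unsigned int min_folio_order(unsigned int blocklog,
				    unsigned int page_shift)
{
	return blocklog > page_shift ? blocklog - page_shift : 0;
}

int main(void)
{
	assert(min_folio_order(16, 12) == 4);	/* 64k blocks, 4k pages  */
	assert(min_folio_order(12, 12) == 0);	/* 4k blocks,  4k pages  */
	assert(min_folio_order(12, 16) == 0);	/* 4k blocks,  64k pages */
	return 0;
}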

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 28 ++++++++++++++++++++--------
 include/linux/pagemap.h    | 13 +++++++++++++
 6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 0af5b7a33d055..1921b689888b8 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3033,6 +3033,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 2f7413afbf46c..33b84a3a83ff6 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -224,6 +224,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e7..0fcf235e50235 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,7 +88,8 @@ xfs_inode_alloc(
 
 	/* VFS doesn't initialise i_mode! */
 	VFS_I(ip)->i_mode = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -325,7 +326,8 @@ xfs_reinit_inode(
 	inode->i_uid = uid;
 	inode->i_gid = gid;
 	inode->i_state = state;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3949f720b5354..c6933440f8066 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -134,7 +134,6 @@ xfs_sb_validate_fsb_count(
 {
 	uint64_t		max_bytes;
 
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 210481b03fdb4..8cd76a01b543f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,28 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
+		size_t max_folio_size = mapping_max_folio_size_supported();
+
+		if (!xfs_has_crc(mp)) {
+			xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
 				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		if (mp->m_sb.sb_blocksize > max_folio_size) {
+			xfs_warn(mp,
+"block size (%u bytes) not supported; Only block size (%ld) or less is supported",
+				mp->m_sb.sb_blocksize, max_folio_size);
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4cc170949e9c0..55b254d951da7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -374,6 +374,19 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
 #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
 
+/*
+ * mapping_max_folio_size_supported() - Check the max folio size supported
+ *
+ * The filesystem should call this function at mount time if there is a
+ * requirement on the folio mapping size in the page cache.
+ */
+static inline size_t mapping_max_folio_size_supported(void)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
+	return PAGE_SIZE;
+}
+
 /*
  * mapping_set_folio_order_range() - Set the orders supported by a file.
  * @mapping: The address space of the file.
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 00/10] enable bs > ps in XFS
  2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-08-22 13:50 ` [PATCH v13 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-08-22 21:23 ` Luis Chamberlain
  2024-08-23 12:36   ` Christian Brauner
  10 siblings, 1 reply; 31+ messages in thread
From: Luis Chamberlain @ 2024-08-22 21:23 UTC (permalink / raw)
  To: Christian Brauner, Andrew Morton, Chandan Babu R, Darrick J. Wong
  Cc: Pankaj Raghav (Samsung), linux-fsdevel, hare, gost.dev, linux-xfs,
	hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, ryan.roberts, mcgrof

On Thu, Aug 22, 2024 at 03:50:08PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> This is the 13th version of the series that enables block size > page size
> (Large Block Size) experimental support in XFS. Please consider this for
> the inclusion in 6.12.

Christian, Andrew,

we believe this is ready for integration, and at the last XFS BoF we
were wondering what tree this should go through. I see fs-next is
actually just a branch on linux-next with the merge of a few select
trees [0], but this touches mm, so its not clear what tree would be be
most appropriate to consider.

Please let us know what you think; it would be great to get this into
fs-next somehow to get more exposure and testing.

[0] https://lore.kernel.org/all/20240528091629.3b8de7e0@canb.auug.org.au/

  Luis

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 00/10] enable bs > ps in XFS
  2024-08-22 21:23 ` [PATCH v13 00/10] enable bs > ps in XFS Luis Chamberlain
@ 2024-08-23 12:36   ` Christian Brauner
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Brauner @ 2024-08-23 12:36 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Luis Chamberlain
  Cc: Christian Brauner, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, ryan.roberts, akpm

On Thu, 22 Aug 2024 15:50:08 +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> This is the 13th version of the series that enables block size > page size
> (Large Block Size) experimental support in XFS. Please consider this for
> the inclusion in 6.12.
> 
> The context and motivation can be seen in cover letter of the RFC v1 [0].
> We also recorded a talk about this effort at LPC [1], if someone would
> like more context on this effort.
> 
> [...]

I've rebased this onto v6.11-rc1 and did a test compile for each commit
and ran xfstests for xfs. Looks good so far. Should show up in fs-next
tomorrow.

---

Applied to the vfs.blocksize branch of the vfs/vfs.git tree.
Patches in the vfs.blocksize branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.blocksize

[01/10] fs: Allow fine-grained control of folio sizes
        https://git.kernel.org/vfs/vfs/c/e8201b314c01
[02/10] filemap: allocate mapping_min_order folios in the page cache
        https://git.kernel.org/vfs/vfs/c/c104d25f8c49
[03/10] readahead: allocate folios with mapping_min_order in readahead
        https://git.kernel.org/vfs/vfs/c/7949d4e70aef
[04/10] mm: split a folio in minimum folio order chunks
        https://git.kernel.org/vfs/vfs/c/fd031210c9ce
[05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
        https://git.kernel.org/vfs/vfs/c/e9f3b433acd0
[06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
        https://git.kernel.org/vfs/vfs/c/d940b3b7b76b
[07/10] xfs: use kvmalloc for xattr buffers
        https://git.kernel.org/vfs/vfs/c/13c9f3c68405
[08/10] xfs: expose block size in stat
        https://git.kernel.org/vfs/vfs/c/4e70eedd93ae
[09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
        https://git.kernel.org/vfs/vfs/c/f8b794f50725
[10/10] xfs: enable block size larger than page size support
        https://git.kernel.org/vfs/vfs/c/0ab3ca31b012

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes
  2024-08-22 13:50 ` [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-08-23 13:09   ` Daniel Gomez
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2024-08-23 13:09 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells

On Thu, Aug 22, 2024 at 03:50:09PM +0200, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We need filesystems to be able to communicate acceptable folio sizes
> to the pagecache for a variety of uses (e.g. large block sizes).
> Support a range of folio sizes between order-0 and order-31.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Tested-by: David Howells <dhowells@redhat.com>
> ---
>  include/linux/pagemap.h | 90 ++++++++++++++++++++++++++++++++++-------
>  mm/filemap.c            |  6 +--
>  mm/readahead.c          |  4 +-
>  3 files changed, 80 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index d9c7edb6422bd..c60025bb584c5 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -204,14 +204,21 @@ enum mapping_flags {
>  	AS_EXITING	= 4, 	/* final truncate in progress */
>  	/* writeback related tags are not used */
>  	AS_NO_WRITEBACK_TAGS = 5,
> -	AS_LARGE_FOLIO_SUPPORT = 6,
> -	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
> -	AS_STABLE_WRITES,	/* must wait for writeback before modifying
> +	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
> +	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
>  				   folio contents */
> -	AS_INACCESSIBLE,	/* Do not attempt direct R/W access to the mapping,
> -				   including to move the mapping */
> +	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> +	/* Bits 16-25 are used for FOLIO_ORDER */
> +	AS_FOLIO_ORDER_BITS = 5,
> +	AS_FOLIO_ORDER_MIN = 16,
> +	AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
>  };
>  
> +#define AS_FOLIO_ORDER_BITS_MASK ((1u << AS_FOLIO_ORDER_BITS) - 1)
> +#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_BITS_MASK << AS_FOLIO_ORDER_MIN)
> +#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_BITS_MASK << AS_FOLIO_ORDER_MAX)
> +#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
> +
>  /**
>   * mapping_set_error - record a writeback error in the address_space
>   * @mapping: the mapping in which an error should be set
> @@ -367,9 +374,51 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
>  #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
>  
> +/*
> + * mapping_set_folio_order_range() - Set the orders supported by a file.
> + * @mapping: The address space of the file.
> + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
> + *
> + * The filesystem should call this function in its inode constructor to
> + * indicate which base size (min) and maximum size (max) of folio the VFS
> + * can use to cache the contents of the file.  This should only be used
> + * if the filesystem needs special handling of folio sizes (ie there is
> + * something the core cannot know).
> + * Do not tune it based on, eg, i_size.
> + *
> + * Context: This should not be called while the inode is active as it
> + * is non-atomic.
> + */
> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> +						 unsigned int min,
> +						 unsigned int max)
> +{
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return;
> +
> +	if (min > MAX_PAGECACHE_ORDER)
> +		min = MAX_PAGECACHE_ORDER;
> +
> +	if (max > MAX_PAGECACHE_ORDER)
> +		max = MAX_PAGECACHE_ORDER;
> +
> +	if (max < min)
> +		max = min;
> +
> +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> +		(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);

This is my test case to be sure we don't overwrite flags:

Before:
AS_FOLIO_ORDER_MASK=0x1f
AS_FOLIO_ORDER_MIN_MASK=0x1f0000
AS_FOLIO_ORDER_MAX_MASK=0x3e00000

v13:
AS_FOLIO_ORDER_BITS_MASK=0x1f
AS_FOLIO_ORDER_MASK=0x3ff0000
AS_FOLIO_ORDER_MIN_MASK=0x1f0000
AS_FOLIO_ORDER_MAX_MASK=0x3e00000
Test result for min=0, expected: min=0
Test result for min=0, expected: min=0
Test result for min=5, expected: min=5
Test result for min=10, expected: min=10
Test result for min=31, expected: min=31
Test result for min=31, expected: min=31
Test result for min=12, expected: min=12
Test result for min=20, expected: min=20
Test result for min=7, expected: min=7
Test result for min=14, expected: min=14

The test checks the flag handling in mapping_set_folio_order_range() and
iterates over a list of test cases (a standalone sketch of the same check
follows below):
	flags = 0xffffffff
	_flags = flags & ~(AS_FOLIO_ORDER_MASK)
	newflags = _flags | (min_val << AS_FOLIO_ORDER_MIN) | (max_val << AS_FOLIO_ORDER_MAX)
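
A minimal standalone sketch of the same check (userspace C; the
set_order_range() helper below is a stand-in for the raw flag update in
mapping_set_folio_order_range(), not the kernel code itself, so the
MAX_PAGECACHE_ORDER clamping is intentionally left out):

#include <assert.h>
#include <stdio.h>

/* Bit layout as defined in the v13 patch quoted above */
#define AS_FOLIO_ORDER_BITS	5
#define AS_FOLIO_ORDER_MIN	16
#define AS_FOLIO_ORDER_MAX	(AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS)

#define AS_FOLIO_ORDER_BITS_MASK ((1u << AS_FOLIO_ORDER_BITS) - 1)
#define AS_FOLIO_ORDER_MIN_MASK	(AS_FOLIO_ORDER_BITS_MASK << AS_FOLIO_ORDER_MIN)
#define AS_FOLIO_ORDER_MAX_MASK	(AS_FOLIO_ORDER_BITS_MASK << AS_FOLIO_ORDER_MAX)
#define AS_FOLIO_ORDER_MASK	(AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)

/* Mirrors the flag update in mapping_set_folio_order_range() */
static unsigned long set_order_range(unsigned long flags,
				     unsigned int min, unsigned int max)
{
	return (flags & ~(unsigned long)AS_FOLIO_ORDER_MASK) |
	       ((unsigned long)min << AS_FOLIO_ORDER_MIN) |
	       ((unsigned long)max << AS_FOLIO_ORDER_MAX);
}

int main(void)
{
	unsigned int cases[] = { 0, 5, 10, 31, 12, 20, 7, 14 };

	for (unsigned int i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
		unsigned long flags = set_order_range(0xffffffff, cases[i], 31);
		unsigned int min = (flags & AS_FOLIO_ORDER_MIN_MASK) >>
				   AS_FOLIO_ORDER_MIN;

		/* bits outside the two order fields must stay untouched */
		assert((flags & ~(unsigned long)AS_FOLIO_ORDER_MASK) ==
		       (0xffffffffUL & ~(unsigned long)AS_FOLIO_ORDER_MASK));
		printf("Test result for min=%u, expected: min=%u\n",
		       min, cases[i]);
	}
	return 0;
}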


Reviewed-by: Daniel Gomez <da.gomez@samsung.com>

> +}
> +
> +static inline void mapping_set_folio_min_order(struct address_space *mapping,
> +					       unsigned int min)
> +{
> +	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
> +}
> +
>  /**
>   * mapping_set_large_folios() - Indicate the file supports large folios.
> - * @mapping: The file.
> + * @mapping: The address space of the file.
>   *
>   * The filesystem should call this function in its inode constructor to
>   * indicate that the VFS can use large folios to cache the contents of
> @@ -380,7 +429,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>   */
>  static inline void mapping_set_large_folios(struct address_space *mapping)
>  {
> -	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
> +}
> +
> +static inline unsigned int
> +mapping_max_folio_order(const struct address_space *mapping)
> +{
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return 0;
> +	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
> +}
> +
> +static inline unsigned int
> +mapping_min_folio_order(const struct address_space *mapping)
> +{
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return 0;
> +	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
>  }
>  
>  /*
> @@ -389,20 +454,17 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
>   */
>  static inline bool mapping_large_folio_support(struct address_space *mapping)
>  {
> -	/* AS_LARGE_FOLIO_SUPPORT is only reasonable for pagecache folios */
> +	/* AS_FOLIO_ORDER is only reasonable for pagecache folios */
>  	VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
>  			"Anonymous mapping always supports large folio");
>  
> -	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	return mapping_max_folio_order(mapping) > 0;
>  }
>  
>  /* Return the maximum folio size for this pagecache mapping, in bytes. */
> -static inline size_t mapping_max_folio_size(struct address_space *mapping)
> +static inline size_t mapping_max_folio_size(const struct address_space *mapping)
>  {
> -	if (mapping_large_folio_support(mapping))
> -		return PAGE_SIZE << MAX_PAGECACHE_ORDER;
> -	return PAGE_SIZE;
> +	return PAGE_SIZE << mapping_max_folio_order(mapping);
>  }
>  
>  static inline int filemap_nr_thps(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d87c858465962..5c148144d5548 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1932,10 +1932,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
>  			fgp_flags |= FGP_LOCK;
>  
> -		if (!mapping_large_folio_support(mapping))
> -			order = 0;
> -		if (order > MAX_PAGECACHE_ORDER)
> -			order = MAX_PAGECACHE_ORDER;
> +		if (order > mapping_max_folio_order(mapping))
> +			order = mapping_max_folio_order(mapping);
>  		/* If we're not aligned, allocate a smaller folio */
>  		if (index & ((1UL << order) - 1))
>  			order = __ffs(index);
> diff --git a/mm/readahead.c b/mm/readahead.c
> index e83fe1c6e5acd..e0cf3bfd2b2b3 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -449,10 +449,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  
>  	limit = min(limit, index + ra->size - 1);
>  
> -	if (new_order < MAX_PAGECACHE_ORDER)
> +	if (new_order < mapping_max_folio_order(mapping))
>  		new_order += 2;
>  
> -	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> +	new_order = min(mapping_max_folio_order(mapping), new_order);
>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  
>  	/* See comment in page_cache_ra_unbounded() */
> -- 
> 2.44.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-22 13:50 ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-08-29 10:51   ` Sven Schnelle
  2024-08-29 18:46     ` Luis Chamberlain
  2024-08-29 22:11   ` Matthew Wilcox
  2024-09-06  6:52   ` Lai, Yi
  2 siblings, 1 reply; 31+ messages in thread
From: Sven Schnelle @ 2024-08-29 10:51 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells, linux-s390

Hi,

"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:

> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: David Howells <dhowells@redhat.com>

This causes the following warning on s390 with linux-next starting from
next-20240827:

[  112.690518] BUG: Bad page map in process ksm01  pte:a5801317 pmd:99054000
[  112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801
[  112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff)
[  112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000
[  112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000
[  112.690548] page dumped because: bad pte
[  112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102
[  112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
[  112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441
[  112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[  112.690568] Call Trace:
[  112.690571]  [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0
[  112.690579]  [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0
[  112.690584]  [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870
[  112.690598]  [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0
[  112.690601]  [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0
[  112.690603]  [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0
[  112.690605]  [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0
[  112.690607]  [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0
[  112.690609]  [<000003ffe0405724>] exit_mmap+0xc4/0x3e0
[  112.690613]  [<000003ffe0154aa2>] mmput+0x72/0x170
[  112.690616]  [<000003ffe015e2c6>] exit_mm+0xd6/0x150
[  112.690618]  [<000003ffe015e52c>] do_exit+0x1ec/0x490
[  112.690620]  [<000003ffe015e9a4>] do_group_exit+0x44/0xc0
[  112.690621]  [<000003ffe016f000>] get_signal+0x7f0/0x800
[  112.690624]  [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320
[  112.690628]  [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170
[  112.690632]  [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0
[  112.690634]  [<000003ffe0ec303c>] system_call+0x74/0x98
[  112.690638] Disabling lock debugging due to kernel taint

To reproduce, running the ksm01 testsuite from ltp seems to be
enough. The splat is always triggered immediately. The output from ksm01
is:

tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f
tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x
tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
mem.c:422: TINFO: wait for all children to stop.
mem.c:388: TINFO: child 0 stops.
mem.c:388: TINFO: child 1 stops.
mem.c:388: TINFO: child 2 stops.
mem.c:495: TINFO: KSM merging...
mem.c:434: TINFO: resume all children.
mem.c:422: TINFO: wait for all children to stop.
mem.c:344: TINFO: child 0 continues...
mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
mem.c:344: TINFO: child 1 continues...
mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
mem.c:344: TINFO: child 2 continues...
mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
mem.c:400: TINFO: child 1 stops.
mem.c:400: TINFO: child 2 stops.
mem.c:400: TINFO: child 0 stops.
Test timeouted, sending SIGKILL!
tst_test.c:1700: TINFO: Killed the leftover descendant processes
tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1708: TBROK: Test killed! (timeout?)

Thanks
Sven

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 10:51   ` Sven Schnelle
@ 2024-08-29 18:46     ` Luis Chamberlain
  2024-08-29 19:55       ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: Luis Chamberlain @ 2024-08-29 18:46 UTC (permalink / raw)
  To: Sven Schnelle
  Cc: Pankaj Raghav (Samsung), brauner, akpm, chandan.babu,
	linux-fsdevel, djwong, hare, gost.dev, linux-xfs, hch, david,
	Zi Yan, yang, linux-kernel, linux-mm, willy, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

On Thu, Aug 29, 2024 at 12:51:25PM +0200, Sven Schnelle wrote:
> Hi,
> 
> "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:
> 
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > split_folio() and split_folio_to_list() assume order 0, to support
> > minorder for non-anonymous folios, we must expand these to check the
> > folio mapping order and use that.
> >
> > Set new_order to be at least minimum folio order if it is set in
> > split_huge_page_to_list() so that we can maintain minimum folio order
> > requirement in the page cache.
> >
> > Update the debugfs write files used for testing to ensure the order
> > is respected as well. We simply enforce the min order when a file
> > mapping is used.
> >
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Tested-by: David Howells <dhowells@redhat.com>
> 
> This causes the following warning on s390 with linux-next starting from
> next-20240827:
> 
> [  112.690518] BUG: Bad page map in process ksm01  pte:a5801317 pmd:99054000
> [  112.690531] page: refcount:0 mapcount:-1 mapping:0000000000000000 index:0x3ff86102 pfn:0xa5801
> [  112.690536] flags: 0x3ffff00000000004(referenced|node=0|zone=1|lastcpupid=0x1ffff)
> [  112.690543] raw: 3ffff00000000004 0000001d47439e30 0000001d47439e30 0000000000000000
> [  112.690546] raw: 000000003ff86102 0000000000000000 fffffffe00000000 0000000000000000
> [  112.690548] page dumped because: bad pte
> [  112.690549] addr:000003ff86102000 vm_flags:88100073 anon_vma:000000008c8e46e8 mapping:0000000000000000 index:3ff86102
> [  112.690553] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> [  112.690561] CPU: 1 UID: 0 PID: 604 Comm: ksm01 Not tainted 6.11.0-rc5-next-20240827-dirty #1441
> [  112.690565] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
> [  112.690568] Call Trace:
> [  112.690571]  [<000003ffe0eb77fe>] dump_stack_lvl+0x76/0xa0
> [  112.690579]  [<000003ffe03f4a90>] print_bad_pte+0x280/0x2d0
> [  112.690584]  [<000003ffe03f7654>] zap_present_ptes.isra.0+0x5c4/0x870
> [  112.690598]  [<000003ffe03f7a46>] zap_pte_range+0x146/0x3d0
> [  112.690601]  [<000003ffe03f7f1c>] zap_p4d_range+0x24c/0x4b0
> [  112.690603]  [<000003ffe03f84ea>] unmap_page_range+0xea/0x2c0
> [  112.690605]  [<000003ffe03f8754>] unmap_single_vma.isra.0+0x94/0xf0
> [  112.690607]  [<000003ffe03f8866>] unmap_vmas+0xb6/0x1a0
> [  112.690609]  [<000003ffe0405724>] exit_mmap+0xc4/0x3e0
> [  112.690613]  [<000003ffe0154aa2>] mmput+0x72/0x170
> [  112.690616]  [<000003ffe015e2c6>] exit_mm+0xd6/0x150
> [  112.690618]  [<000003ffe015e52c>] do_exit+0x1ec/0x490
> [  112.690620]  [<000003ffe015e9a4>] do_group_exit+0x44/0xc0
> [  112.690621]  [<000003ffe016f000>] get_signal+0x7f0/0x800
> [  112.690624]  [<000003ffe0108614>] arch_do_signal_or_restart+0x74/0x320
> [  112.690628]  [<000003ffe020c876>] syscall_exit_to_user_mode_work+0xe6/0x170
> [  112.690632]  [<000003ffe0eb7c04>] __do_syscall+0xd4/0x1c0
> [  112.690634]  [<000003ffe0ec303c>] system_call+0x74/0x98
> [  112.690638] Disabling lock debugging due to kernel taint
> 
> To reproduce, running the ksm01 testsuite from ltp seems to be
> enough. The splat is always triggered immediately. The output from ksm01
> is:
> 
> tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
> tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6f
> tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #1440 SMP Thu Aug 29 12:13:28 CEST 2024 s390x
> tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
> mem.c:422: TINFO: wait for all children to stop.
> mem.c:388: TINFO: child 0 stops.
> mem.c:388: TINFO: child 1 stops.
> mem.c:388: TINFO: child 2 stops.
> mem.c:495: TINFO: KSM merging...
> mem.c:434: TINFO: resume all children.
> mem.c:422: TINFO: wait for all children to stop.
> mem.c:344: TINFO: child 0 continues...
> mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
> mem.c:344: TINFO: child 1 continues...
> mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
> mem.c:344: TINFO: child 2 continues...
> mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
> mem.c:400: TINFO: child 1 stops.
> mem.c:400: TINFO: child 2 stops.
> mem.c:400: TINFO: child 0 stops.
> Test timeouted, sending SIGKILL!
> tst_test.c:1700: TINFO: Killed the leftover descendant processes
> tst_test.c:1706: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> tst_test.c:1708: TBROK: Test killed! (timeout?)

Thanks for the report and the test reproducer; I was able to reproduce this
easily on x86_64 with ltp/testcases/kernel/mem/ksm/ksm01 as well:

ltp/testcases/kernel/mem/ksm/ksm01
tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
tst_test.c:1809: TINFO: LTP version: 20240524-208-g6c3293c6fc20
tst_test.c:1813: TINFO: Tested kernel: 6.11.0-rc5-next-20240827 #56 SMP
PREEMPT_DYNAMIC Tue Aug 27 08:10:26 UTC 2024 x86_64
tst_test.c:1652: TINFO: Timeout per run is 0h 00m 30s
mem.c:422: TINFO: wait for all children to stop.
mem.c:388: TINFO: child 0 stops.
mem.c:388: TINFO: child 1 stops.
mem.c:388: TINFO: child 2 stops.
mem.c:495: TINFO: KSM merging...
mem.c:434: TINFO: resume all children.
mem.c:422: TINFO: wait for all children to stop.
mem.c:344: TINFO: child 0 continues...
mem.c:344: TINFO: child 2 continues...
mem.c:344: TINFO: child 1 continues...
mem.c:347: TINFO: child 1 allocates 128 MB filled with 'a'
mem.c:347: TINFO: child 0 allocates 128 MB filled with 'c'
mem.c:347: TINFO: child 2 allocates 128 MB filled with 'a'
mem.c:400: TINFO: child 1 stops.
mem.c:400: TINFO: child 0 stops.
mem.c:400: TINFO: child 2 stops.


Test timeouted, sending SIGKILL!
tst_test.c:1700: TINFO: Killed the leftover descendant processes
tst_test.c:1706: TINFO: If you are running on slow machine, try
exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1708: TBROK: Test killed! (timeout?)

Summary:
passed   0
failed   0
broken   1
skipped  0
warnings 0


With vm debugging however I get more information about the issue:

Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Modules linked in: xfs nvme_fabrics 9p kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd pcspkr 9pnet_virtio joydev virtio_balloon virtio_console evdev button serio_raw drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover nvme psmouse crc32_pclmul crc32c_intel nvme_core virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ---[ end trace 0000000000000000 ]---
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554

Looking at the KSM code, ksm_scan_thread+0x175b lands in the mm/ksm.c routine
cmp_and_merge_page(), in the split case:

                } else if (split) {                                             
                        /*                                                      
                         * We are here if we tried to merge two pages and       
                         * failed because they both belonged to the same        
                         * compound page. We will split the page now, but no    
                         * merging will take place.                             
                         * We do not want to add the cost of a full lock; if    
                         * the page is locked, it is better to skip it and      
                         * perhaps try again later.                             
                         */                                                     
                        if (!trylock_page(page))                                
                                return;                                         
                        split_huge_page(page);                                  
                        unlock_page(page);                                      
                }

The trylock_page() is faulty. I'm digging in further.

  Luis

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 18:46     ` Luis Chamberlain
@ 2024-08-29 19:55       ` Matthew Wilcox
  2024-08-29 22:12         ` Zi Yan
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2024-08-29 19:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Sven Schnelle, Pankaj Raghav (Samsung), brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, Zi Yan, yang, linux-kernel, linux-mm, john.g.garry,
	cl, p.raghav, ryan.roberts, David Howells, linux-s390

On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote:
> With vm debugging however I get more information about the issue:
> 
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!

This is in folio_unlock().  We're trying to unlock a folio which isn't
locked!

> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
[...]
> Looking at the KSM code, ksm_scan_thread+0x175b lands in the mm/ksm.c routine
> cmp_and_merge_page(), in the split case:
> 
>                 } else if (split) {                                             
>                         /*                                                      
>                          * We are here if we tried to merge two pages and       
>                          * failed because they both belonged to the same        
>                          * compound page. We will split the page now, but no    
>                          * merging will take place.                             
>                          * We do not want to add the cost of a full lock; if    
>                          * the page is locked, it is better to skip it and      
>                          * perhaps try again later.                             
>                          */                                                     
>                         if (!trylock_page(page))                                
>                                 return;                                         
>                         split_huge_page(page);                                  
>                         unlock_page(page);                                      

Obviously the page is locked when we call split_huge_page().  There's
an assert inside it.  And the lock bit is _supposed_ to be transferred
to the head page of the page which is being split.  My guess is that
this is messed up somehow; we're perhaps transferring the lock bit to
the wrong page?
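
To make that contract concrete, here is a toy userspace model (purely
illustrative; the array below stands in for the per-folio lock bit and
is not the kernel data structure):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 4			/* one order-2 compound page */

static bool locked[NR_PAGES];		/* per-folio lock bit after the split */

/*
 * Before the split the lock lives on the head; after the split it must
 * follow the caller's precise page rather than stay on the old head.
 */
static void split_compound(int callers_page, bool keep_lock_on_head)
{
	for (int i = 0; i < NR_PAGES; i++)
		locked[i] = false;
	locked[keep_lock_on_head ? 0 : callers_page] = true;
}

int main(void)
{
	int callers_page = 2;		/* the page the caller locked */

	split_compound(callers_page, false);	/* expected behaviour */
	assert(locked[callers_page]);		/* unlock_page(page) is balanced */

	split_compound(callers_page, true);	/* the suspected mix-up */
	assert(!locked[callers_page]);		/* unlock would hit an unlocked folio */

	printf("the lock must follow page %d across the split\n", callers_page);
	return 0;
}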

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-22 13:50 ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
  2024-08-29 10:51   ` Sven Schnelle
@ 2024-08-29 22:11   ` Matthew Wilcox
  2024-09-06  6:52   ` Lai, Yi
  2 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2024-08-29 22:11 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells

On Thu, Aug 22, 2024 at 03:50:12PM +0200, Pankaj Raghav (Samsung) wrote:
> @@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> +int split_folio_to_list(struct folio *folio, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_huge_page_to_list_to_order(page, NULL, 0);
> +	return split_folio(page_folio(page));

Oh!  You can't do this!

split_huge_page() takes a precise page, NOT a folio.  That page is
locked.  When we return from split_huge_page(), the new folio which
contains the precise page is locked.

You've made it so that the caller's page's folio won't necessarily
be locked.  More testing was needed ;-P

>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> +
> +static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	return 0;
> +}
> +
>  static inline void deferred_split_folio(struct folio *folio) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
> @@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
>  	return split_folio_to_list_to_order(folio, NULL, new_order);
>  }
>  
> -#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> -#define split_folio(f) split_folio_to_order(f, 0)
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cf8e34f62976f..06384b85a3a20 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>   * released, or if some unexpected race happened (e.g., anon VMA disappeared,
>   * truncation).
>   *
> + * Callers should ensure that the order respects the address space mapping
> + * min-order if one is set for non-anonymous folios.
> + *
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> @@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		mapping = NULL;
>  		anon_vma_lock_write(anon_vma);
>  	} else {
> +		unsigned int min_order;
>  		gfp_t gfp;
>  
>  		mapping = folio->mapping;
> @@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  			goto out;
>  		}
>  
> +		min_order = mapping_min_folio_order(folio->mapping);
> +		if (new_order < min_order) {
> +			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
> +				     min_order);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>  		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>  							GFP_RECLAIM_MASK);
>  
> @@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	unsigned int min_order = 0;
> +
> +	if (folio_test_anon(folio))
> +		goto out;
> +
> +	if (!folio->mapping) {
> +		if (folio_test_pmd_mappable(folio))
> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> +		return -EBUSY;
> +	}
> +
> +	min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +	return split_huge_page_to_list_to_order(&folio->page, list,
> +							min_order);
> +}
> +
>  void __folio_undo_large_rmappable(struct folio *folio)
>  {
>  	struct deferred_split *ds_queue;
> @@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		struct vm_area_struct *vma = vma_lookup(mm, addr);
>  		struct folio_walk fw;
>  		struct folio *folio;
> +		struct address_space *mapping;
> +		unsigned int target_order = new_order;
>  
>  		if (!vma)
>  			break;
> @@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		if (!is_transparent_hugepage(folio))
>  			goto next;
>  
> -		if (new_order >= folio_order(folio))
> +		if (!folio_test_anon(folio)) {
> +			mapping = folio->mapping;
> +			target_order = max(new_order,
> +					   mapping_min_folio_order(mapping));
> +		}
> +
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		total++;
> @@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		folio_get(folio);
>  		folio_walk_end(&fw, vma);
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (!folio_test_anon(folio) && folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
> +
>  		folio_unlock(folio);
>  		folio_put(folio);
>  
> @@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  	pgoff_t index;
>  	int nr_pages = 1;
>  	unsigned long total = 0, split = 0;
> +	unsigned int min_order;
> +	unsigned int target_order;
>  
>  	file = getname_kernel(file_path);
>  	if (IS_ERR(file))
> @@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		 file_path, off_start, off_end);
>  
>  	mapping = candidate->f_mapping;
> +	min_order = mapping_min_folio_order(mapping);
> +	target_order = max(new_order, min_order);
>  
>  	for (index = off_start; index < off_end; index += nr_pages) {
>  		struct folio *folio = filemap_get_folio(mapping, index);
> @@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		total++;
>  		nr_pages = folio_nr_pages(folio);
>  
> -		if (new_order >= folio_order(folio))
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		if (!folio_trylock(folio))
>  			goto next;
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
>  		folio_unlock(folio);
>  next:
>  		folio_put(folio);
> -- 
> 2.44.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 19:55       ` Matthew Wilcox
@ 2024-08-29 22:12         ` Zi Yan
  2024-08-29 23:41           ` Luis Chamberlain
  0 siblings, 1 reply; 31+ messages in thread
From: Zi Yan @ 2024-08-29 22:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Sven Schnelle, Pankaj Raghav (Samsung), brauner,
	akpm, chandan.babu, linux-fsdevel, djwong, hare, gost.dev,
	linux-xfs, hch, david, yang, linux-kernel, linux-mm, john.g.garry,
	cl, p.raghav, ryan.roberts, David Howells, linux-s390

[-- Attachment #1: Type: text/plain, Size: 7906 bytes --]

On 29 Aug 2024, at 15:55, Matthew Wilcox wrote:

> On Thu, Aug 29, 2024 at 11:46:42AM -0700, Luis Chamberlain wrote:
>> With vm debugging however I get more information about the issue:
>>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page: refcount:1 mapcount:1 mapping:0000000000000000 index:0x7f589dd7f pfn:0x211d7f
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: memcg:ffff93ba245b8800
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: anon flags: 0x17fffe000020838(uptodate|dirty|lru|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x1ffff)
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 017fffe000020838 ffffe59008475f88 ffffe59008476008 ffff93ba2abca5b1
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: raw: 00000007f589dd7f 0000000000000000 0000000100000000 ffff93ba245b8800
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: ------------[ cut here ]------------
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: kernel BUG at mm/filemap.c:1509!
>
> This is in folio_unlock().  We're trying to unlock a folio which isn't
> locked!
>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CPU: 2 UID: 0 PID: 74 Comm: ksmd Not tainted 6.11.0-rc5-next-20240827 #56
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RIP: 0010:folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Code: 93 fc ff ff f0 80 30 01 78 06 5b c3 cc cc cc cc 48 89 df 31 f6 5b e9 dc fc ff ff 48 c7 c6 a0 56 49 89 48 89 df e8 2d 03 05 00 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RSP: 0018:ffffbb1dc02afe38 EFLAGS: 00010246
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RAX: 000000000000003f RBX: ffffe59008475fc0 RCX: 0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000003
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R10: ffffbb1dc02afce0 R11: ffffffff896c3608 R12: ffffe59008475fc0
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: R13: 0000000000000000 R14: ffffe59008470000 R15: ffffffff89f88060
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff93c15fc80000(0000) knlGS:0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: CR2: 0000558e368d9c48 CR3: 000000010ca66004 CR4: 0000000000770ef0
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: PKRU: 55555554
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel: Call Trace:
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  <TASK>
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? die+0x32/0x80
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_trap+0xd9/0x100
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? do_error_trap+0x6a/0x90
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? exc_invalid_op+0x4c/0x60
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? asm_exc_invalid_op+0x16/0x20
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? folio_unlock+0x43/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ksm_scan_thread+0x175b/0x1d30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  kthread+0xda/0x110
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork+0x2d/0x50
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ? __pfx_kthread+0x10/0x10
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  ret_from_fork_asm+0x1a/0x30
>> Aug 29 18:08:22 nvme-xfs-reflink-4k kernel:  </TASK>
> [...]
>> Looking at the KSM code, ksm_scan_thread+0x175b lands in the mm/ksm.c routine
>> cmp_and_merge_page(), in the split case:
>>
>>                 } else if (split) {
>>                         /*
>>                          * We are here if we tried to merge two pages and
>>                          * failed because they both belonged to the same
>>                          * compound page. We will split the page now, but no
>>                          * merging will take place.
>>                          * We do not want to add the cost of a full lock; if
>>                          * the page is locked, it is better to skip it and
>>                          * perhaps try again later.
>>                          */
>>                         if (!trylock_page(page))
>>                                 return;
>>                         split_huge_page(page);
>>                         unlock_page(page);
>
> Obviously the page is locked when we call split_huge_page().  There's
> an assert inside it.  And the lock bit is _supposed_ to be transferred
> to the head page of the page which is being split.  My guess is that
> this is messed up somehow; we're perhaps transferring the lock bit to
> the wrong page?

The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
unlock the wrong subpage. split_huge_page() used to pass the “page” pointer
to split_huge_page_to_list_to_order(), which kept that “page” locked.
But this patch changes the “page” passed into split_huge_page_to_list_to_order()
to always be the head page.

This fixes the crash on my x86 VM, but it can be improved:

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7c50aeed0522..eff5d2fb5d4e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                unsigned int new_order);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
-       return split_folio(page_folio(page));
-}
+int split_huge_page(struct page *page);
 void deferred_split_folio(struct folio *folio);

 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c29af9451d92..4d723dab4336 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
        return ret;
 }

+int split_huge_page(struct page *page)
+{
+       unsigned int min_order = 0;
+       struct folio *folio = page_folio(page);
+
+       if (folio_test_anon(folio))
+               goto out;
+
+       if (!folio->mapping) {
+               if (folio_test_pmd_mappable(folio))
+                       count_vm_event(THP_SPLIT_PAGE_FAILED);
+               return -EBUSY;
+       }
+
+       min_order = mapping_min_folio_order(folio->mapping);
+out:
+       return split_huge_page_to_list_to_order(page, NULL, min_order);
+}
+
 int split_folio_to_list(struct folio *folio, struct list_head *list)
 {
        unsigned int min_order = 0;

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 22:12         ` Zi Yan
@ 2024-08-29 23:41           ` Luis Chamberlain
  2024-08-30  5:57             ` Sven Schnelle
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Luis Chamberlain @ 2024-08-29 23:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, Sven Schnelle, Pankaj Raghav (Samsung), brauner,
	akpm, chandan.babu, linux-fsdevel, djwong, hare, gost.dev,
	linux-xfs, hch, david, yang, linux-kernel, linux-mm, john.g.garry,
	cl, p.raghav, ryan.roberts, David Howells, linux-s390

On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
> always to the head page.
> 
> This fixes the crash on my x86 VM, but it can be improved:
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7c50aeed0522..eff5d2fb5d4e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                 unsigned int new_order);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
> -static inline int split_huge_page(struct page *page)
> -{
> -       return split_folio(page_folio(page));
> -}
> +int split_huge_page(struct page *page);
>  void deferred_split_folio(struct folio *folio);
> 
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29af9451d92..4d723dab4336 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>         return ret;
>  }
> 
> +int split_huge_page(struct page *page)
> +{
> +       unsigned int min_order = 0;
> +       struct folio *folio = page_folio(page);
> +
> +       if (folio_test_anon(folio))
> +               goto out;
> +
> +       if (!folio->mapping) {
> +               if (folio_test_pmd_mappable(folio))
> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
> +               return -EBUSY;
> +       }
> +
> +       min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
> +}
> +
>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>  {
>         unsigned int min_order = 0;


Confirmed. Although you suggest it can be improved, I thought we could do
that by sharing more code and moving things into the headers. The diff below
also fixes the issue and shares more code, but I think it is harder to
understand than your patch.

So I think your patch is cleaner and easier as a fix.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..99cd9c7bf55b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
+#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
@@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
-int split_folio_to_list(struct folio *folio, struct list_head *list);
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_folio(page_folio(page));
+	return split_page_folio_to_list(page, page_folio(page), NULL);
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
 	return 0;
 }
 
-static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+static inline int split_page_folio_to_list(struct page *page,
+					   struct folio *folio,
+					   struct list_head *list)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b115bfe63b52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int split_page_folio_to_list(struct page *page, struct folio *folio,
+			     struct list_head *list)
 {
 	unsigned int min_order = 0;
 
@@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 
 	min_order = mapping_min_folio_order(folio->mapping);
 out:
-	return split_huge_page_to_list_to_order(&folio->page, list,
-							min_order);
+	return split_huge_page_to_list_to_order(page, list, min_order);
 }
 
 void __folio_undo_large_rmappable(struct folio *folio)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 23:41           ` Luis Chamberlain
@ 2024-08-30  5:57             ` Sven Schnelle
  2024-08-30 11:58             ` Daniel Gomez
  2024-08-30 14:59             ` Pankaj Raghav
  2 siblings, 0 replies; 31+ messages in thread
From: Sven Schnelle @ 2024-08-30  5:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Zi Yan, Matthew Wilcox, Pankaj Raghav (Samsung), brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>> 
>> This fixes the crash on my x86 VM, but it can be improved:

It also fixes the issue on s390x. Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 23:41           ` Luis Chamberlain
  2024-08-30  5:57             ` Sven Schnelle
@ 2024-08-30 11:58             ` Daniel Gomez
  2024-08-30 14:59             ` Pankaj Raghav
  2 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2024-08-30 11:58 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Zi Yan, Matthew Wilcox, Sven Schnelle, Pankaj Raghav (Samsung),
	brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, yang, linux-kernel, linux-mm,
	john.g.garry, cl, p.raghav, ryan.roberts, David Howells,
	linux-s390

On Thu, Aug 29, 2024 at 04:41:48PM -0700, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
> > The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
> > unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
> > to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
> > But this patch changes the “page” passed into split_huge_page_to_list_to_order()
> > always to the head page.
> > 
> > This fixes the crash on my x86 VM, but it can be improved:
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 7c50aeed0522..eff5d2fb5d4e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
> >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >                 unsigned int new_order);
> >  int split_folio_to_list(struct folio *folio, struct list_head *list);
> > -static inline int split_huge_page(struct page *page)
> > -{
> > -       return split_folio(page_folio(page));
> > -}
> > +int split_huge_page(struct page *page);
> >  void deferred_split_folio(struct folio *folio);
> > 
> >  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c29af9451d92..4d723dab4336 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >         return ret;
> >  }
> > 
> > +int split_huge_page(struct page *page)
> > +{
> > +       unsigned int min_order = 0;
> > +       struct folio *folio = page_folio(page);
> > +
> > +       if (folio_test_anon(folio))
> > +               goto out;
> > +
> > +       if (!folio->mapping) {
> > +               if (folio_test_pmd_mappable(folio))
> > +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
> > +               return -EBUSY;
> > +       }
> > +
> > +       min_order = mapping_min_folio_order(folio->mapping);
> > +out:
> > +       return split_huge_page_to_list_to_order(page, NULL, min_order);
> > +}
> > +
> >  int split_folio_to_list(struct folio *folio, struct list_head *list)
> >  {
> >         unsigned int min_order = 0;
> 
> 
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
> 
> So I think your patch is cleaner and easier as a fix.

I reproduced it on arm64 as well, and both of the fixes provided solved the issue.

> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
>  #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>  
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
>  	return 0;
>  }
>  
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
>  {
>  	return 0;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
>  {
>  	unsigned int min_order = 0;
>  
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  
>  	min_order = mapping_min_folio_order(folio->mapping);
>  out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -							min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
>  }
>  
>  void __folio_undo_large_rmappable(struct folio *folio)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-29 23:41           ` Luis Chamberlain
  2024-08-30  5:57             ` Sven Schnelle
  2024-08-30 11:58             ` Daniel Gomez
@ 2024-08-30 14:59             ` Pankaj Raghav
  2024-08-30 17:12               ` Luis Chamberlain
                                 ` (2 more replies)
  2 siblings, 3 replies; 31+ messages in thread
From: Pankaj Raghav @ 2024-08-30 14:59 UTC (permalink / raw)
  To: Luis Chamberlain, Zi Yan
  Cc: Matthew Wilcox, Sven Schnelle, brauner, akpm, chandan.babu,
	linux-fsdevel, djwong, hare, gost.dev, linux-xfs, hch, david,
	yang, linux-kernel, linux-mm, john.g.garry, cl, p.raghav,
	ryan.roberts, David Howells, linux-s390

On 30/08/2024 01:41, Luis Chamberlain wrote:
> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>> always to the head page.
>>
>> This fixes the crash on my x86 VM, but it can be improved:
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7c50aeed0522..eff5d2fb5d4e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>                 unsigned int new_order);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>> -static inline int split_huge_page(struct page *page)
>> -{
>> -       return split_folio(page_folio(page));
>> -}
>> +int split_huge_page(struct page *page);
>>  void deferred_split_folio(struct folio *folio);
>>
>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29af9451d92..4d723dab4336 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>         return ret;
>>  }
>>
>> +int split_huge_page(struct page *page)
>> +{
>> +       unsigned int min_order = 0;
>> +       struct folio *folio = page_folio(page);
>> +
>> +       if (folio_test_anon(folio))
>> +               goto out;
>> +
>> +       if (!folio->mapping) {
>> +               if (folio_test_pmd_mappable(folio))
>> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
>> +               return -EBUSY;
>> +       }
>> +
>> +       min_order = mapping_min_folio_order(folio->mapping);
>> +out:
>> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
>> +}
>> +
>>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>>  {
>>         unsigned int min_order = 0;
> 
> 
> Confirmed, and also although you suggest it can be improved, I thought
> that we could do that by sharing more code and putting things in the
> headers, the below also fixes this but tries to share more code, but
> I think it is perhaps less easier to understand than your patch.
> 
It feels a bit weird to pass both the folio and the page to `split_page_folio_to_list()`.

How about we extract the code that returns the min order so that we don't repeat it.

Something like this:


diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c275aa9cc105..d27febd5c639 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                unsigned int new_order);
+int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-       return split_folio(page_folio(page));
+       struct folio *folio = page_folio(page);
+       int ret = min_order_for_split(folio);
+
+       if (ret)
+               return ret;
+
+       /*
+        * split_huge_page() locks the page before splitting and
+        * expects the same page that has been split to be locked when
+        * returned. split_folio_to_list() cannot be used here because
+        * it converts the page to folio and passes the head page to be
+        * split.
+        */
+       return split_huge_page_to_list_to_order(page, NULL, ret);
 }
 void deferred_split_folio(struct folio *folio);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169f1a71c95d..b167e036d01b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
        return ret;
 }

-int split_folio_to_list(struct folio *folio, struct list_head *list)
+int min_order_for_split(struct folio *folio)
 {
-       unsigned int min_order = 0;
-
        if (folio_test_anon(folio))
-               goto out;
+               return 0;

        if (!folio->mapping) {
                if (folio_test_pmd_mappable(folio))
@@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
                return -EBUSY;
        }

-       min_order = mapping_min_folio_order(folio->mapping);
-out:
-       return split_huge_page_to_list_to_order(&folio->page, list,
-                                                       min_order);
+       return mapping_min_folio_order(folio->mapping);
+}
+
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+       int ret = min_order_for_split(folio);
+
+       if (ret)
+               return ret;
+
+       return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }

 void __folio_undo_large_rmappable(struct folio *folio)

> So I think your patch is cleaner and easier as a fix.
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..99cd9c7bf55b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,6 +97,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
>  #define split_folio(f) split_folio_to_list(f, NULL)
> +#define split_folio_to_list(f, list) split_page_folio_to_list(&f->page, f, list)
>  
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
> @@ -331,10 +332,11 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> -int split_folio_to_list(struct folio *folio, struct list_head *list);
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_folio(page_folio(page));
> +	return split_page_folio_to_list(page, page_folio(page), NULL);
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -511,7 +513,9 @@ static inline int split_huge_page(struct page *page)
>  	return 0;
>  }
>  
> -static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +static inline int split_page_folio_to_list(struct page *page,
> +					   struct folio *folio,
> +					   struct list_head *list)
>  {
>  	return 0;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b115bfe63b52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,7 +3529,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int split_page_folio_to_list(struct page *page, struct folio *folio,
> +			     struct list_head *list)
>  {
>  	unsigned int min_order = 0;
>  
> @@ -3544,8 +3545,7 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>  
>  	min_order = mapping_min_folio_order(folio->mapping);
>  out:
> -	return split_huge_page_to_list_to_order(&folio->page, list,
> -							min_order);
> +	return split_huge_page_to_list_to_order(page, list, min_order);
>  }
>  
>  void __folio_undo_large_rmappable(struct folio *folio)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 14:59             ` Pankaj Raghav
@ 2024-08-30 17:12               ` Luis Chamberlain
  2024-08-31 22:38                 ` Zi Yan
  2024-08-30 22:42               ` Matthew Wilcox
  2024-08-31 22:35               ` Zi Yan
  2 siblings, 1 reply; 31+ messages in thread
From: Luis Chamberlain @ 2024-08-30 17:12 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Zi Yan, Matthew Wilcox, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

Agreed..

> How about we extract the code that returns the min order so that we don't repeat.
> Something like this:

Of all solutions I like this the most, Zi, do you have any preference?

  Luis

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 14:59             ` Pankaj Raghav
  2024-08-30 17:12               ` Luis Chamberlain
@ 2024-08-30 22:42               ` Matthew Wilcox
  2024-08-31 22:35               ` Zi Yan
  2 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2024-08-30 22:42 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Zi Yan, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.

We do that in the rmap code.

But this is not a performance path.  We should keep this as simple as
possible.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 14:59             ` Pankaj Raghav
  2024-08-30 17:12               ` Luis Chamberlain
  2024-08-30 22:42               ` Matthew Wilcox
@ 2024-08-31 22:35               ` Zi Yan
  2 siblings, 0 replies; 31+ messages in thread
From: Zi Yan @ 2024-08-31 22:35 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Matthew Wilcox, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

[-- Attachment #1: Type: text/plain, Size: 5966 bytes --]

On 30 Aug 2024, at 10:59, Pankaj Raghav wrote:

> On 30/08/2024 01:41, Luis Chamberlain wrote:
>> On Thu, Aug 29, 2024 at 06:12:26PM -0400, Zi Yan wrote:
>>> The issue is that the change to split_huge_page() makes split_huge_page_to_list_to_order()
>>> unlocks the wrong subpage. split_huge_page() used to pass the “page” pointer
>>> to split_huge_page_to_list_to_order(), which keeps that “page” still locked.
>>> But this patch changes the “page” passed into split_huge_page_to_list_to_order()
>>> always to the head page.
>>>
>>> This fixes the crash on my x86 VM, but it can be improved:
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 7c50aeed0522..eff5d2fb5d4e 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -320,10 +320,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins);
>>>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>                 unsigned int new_order);
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> -static inline int split_huge_page(struct page *page)
>>> -{
>>> -       return split_folio(page_folio(page));
>>> -}
>>> +int split_huge_page(struct page *page);
>>>  void deferred_split_folio(struct folio *folio);
>>>
>>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c29af9451d92..4d723dab4336 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3297,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>         return ret;
>>>  }
>>>
>>> +int split_huge_page(struct page *page)
>>> +{
>>> +       unsigned int min_order = 0;
>>> +       struct folio *folio = page_folio(page);
>>> +
>>> +       if (folio_test_anon(folio))
>>> +               goto out;
>>> +
>>> +       if (!folio->mapping) {
>>> +               if (folio_test_pmd_mappable(folio))
>>> +                       count_vm_event(THP_SPLIT_PAGE_FAILED);
>>> +               return -EBUSY;
>>> +       }
>>> +
>>> +       min_order = mapping_min_folio_order(folio->mapping);
>>> +out:
>>> +       return split_huge_page_to_list_to_order(page, NULL, min_order);
>>> +}
>>> +
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list)
>>>  {
>>>         unsigned int min_order = 0;
>>
>>
>> Confirmed, and also although you suggest it can be improved, I thought
>> that we could do that by sharing more code and putting things in the
>> headers, the below also fixes this but tries to share more code, but
>> I think it is perhaps less easier to understand than your patch.
>>
> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> How about we extract the code that returns the min order so that we don't repeat.
>
> Something like this:
>
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c275aa9cc105..d27febd5c639 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -331,10 +331,24 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                 unsigned int new_order);
> +int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);

Since split_folio() is no longer used below, this line can be removed.

>  static inline int split_huge_page(struct page *page)
>  {
> -       return split_folio(page_folio(page));
> +       struct folio *folio = page_folio(page);
> +       int ret = min_order_for_split(folio);
> +
> +       if (ret)
> +               return ret;

min_order_for_split() returns -EBUSY, 0, or a positive min_order. This if
statement should be "if (ret < 0)"?
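
For example, a minimal sketch of the corrected check, reusing the names from
your hunk above:

        int ret = min_order_for_split(folio);

        if (ret < 0)            /* only bail out on an error such as -EBUSY */
                return ret;

        /* ret is now 0 or the mapping's minimum folio order */
        return split_huge_page_to_list_to_order(page, NULL, ret);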

> +
> +       /*
> +        * split_huge_page() locks the page before splitting and
> +        * expects the same page that has been split to be locked when
> +        * returned. split_folio_to_list() cannot be used here because
> +        * it converts the page to folio and passes the head page to be
> +        * split.
> +        */
> +       return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 169f1a71c95d..b167e036d01b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3529,12 +3529,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>         return ret;
>  }
>
> -int split_folio_to_list(struct folio *folio, struct list_head *list)
> +int min_order_for_split(struct folio *folio)
>  {
> -       unsigned int min_order = 0;
> -
>         if (folio_test_anon(folio))
> -               goto out;
> +               return 0;
>
>         if (!folio->mapping) {
>                 if (folio_test_pmd_mappable(folio))
> @@ -3542,10 +3540,17 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>                 return -EBUSY;
>         }
>
> -       min_order = mapping_min_folio_order(folio->mapping);
> -out:
> -       return split_huge_page_to_list_to_order(&folio->page, list,
> -                                                       min_order);
> +       return mapping_min_folio_order(folio->mapping);
> +}
> +
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +       int ret = min_order_for_split(folio);
> +
> +       if (ret)
> +               return ret;

Ditto.

> +
> +       return split_huge_page_to_list_to_order(&folio->page, list, ret);
>  }
>
>  void __folio_undo_large_rmappable(struct folio *folio)
>


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-30 17:12               ` Luis Chamberlain
@ 2024-08-31 22:38                 ` Zi Yan
  0 siblings, 0 replies; 31+ messages in thread
From: Zi Yan @ 2024-08-31 22:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Pankaj Raghav, Matthew Wilcox, Sven Schnelle, brauner, akpm,
	chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	hch, david, yang, linux-kernel, linux-mm, john.g.garry, cl,
	p.raghav, ryan.roberts, David Howells, linux-s390

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]

On 30 Aug 2024, at 13:12, Luis Chamberlain wrote:

> On Fri, Aug 30, 2024 at 04:59:57PM +0200, Pankaj Raghav wrote:
>> It feels a bit weird to pass both folio and the page in `split_page_folio_to_list()`.
>
> Agreed..
>
>> How about we extract the code that returns the min order so that we don't repeat.
>> Something like this:
>
> Of all solutions I like this the most, Zi, do you have any preference?

No preference, as long as Pankaj fixes the "if (ret)" issue I mentioned in the
other email.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 10/10] xfs: enable block size larger than page size support
  2024-08-22 13:50 ` [PATCH v13 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-09-03 12:29   ` Daniel Gomez
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2024-09-03 12:29 UTC (permalink / raw)
  To: brauner
  Cc: Pankaj Raghav (Samsung), akpm, chandan.babu, linux-fsdevel,
	djwong, hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof,
	ryan.roberts, Dave Chinner

On Thu, Aug 22, 2024 at 03:50:18PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Page cache now has the ability to have a minimum order when allocating
> a folio which is a prerequisite to add support for block size > page
> size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
>  fs/xfs/libxfs/xfs_shared.h |  3 +++
>  fs/xfs/xfs_icache.c        |  6 ++++--
>  fs/xfs/xfs_mount.c         |  1 -
>  fs/xfs/xfs_super.c         | 28 ++++++++++++++++++++--------
>  include/linux/pagemap.h    | 13 +++++++++++++
>  6 files changed, 45 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 0af5b7a33d055..1921b689888b8 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -3033,6 +3033,11 @@ xfs_ialloc_setup_geometry(
>  		igeo->ialloc_align = mp->m_dalign;
>  	else
>  		igeo->ialloc_align = 0;
> +
> +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> +	else
> +		igeo->min_folio_order = 0;
>  }
>  
>  /* Compute the location of the root directory inode that is laid out by mkfs. */
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 2f7413afbf46c..33b84a3a83ff6 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -224,6 +224,9 @@ struct xfs_ino_geometry {
>  	/* precomputed value for di_flags2 */
>  	uint64_t	new_diflags2;
>  
> +	/* minimum folio order of a page cache allocation */
> +	unsigned int	min_folio_order;
> +
>  };
>  
>  #endif /* __XFS_SHARED_H__ */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index cf629302d48e7..0fcf235e50235 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -88,7 +88,8 @@ xfs_inode_alloc(
>  
>  	/* VFS doesn't initialise i_mode! */
>  	VFS_I(ip)->i_mode = 0;
> -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> +	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
> +				    M_IGEO(mp)->min_folio_order);
>  
>  	XFS_STATS_INC(mp, vn_active);
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> @@ -325,7 +326,8 @@ xfs_reinit_inode(
>  	inode->i_uid = uid;
>  	inode->i_gid = gid;
>  	inode->i_state = state;
> -	mapping_set_large_folios(inode->i_mapping);
> +	mapping_set_folio_min_order(inode->i_mapping,
> +				    M_IGEO(mp)->min_folio_order);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 3949f720b5354..c6933440f8066 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -134,7 +134,6 @@ xfs_sb_validate_fsb_count(
>  {
>  	uint64_t		max_bytes;
>  
> -	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
>  
>  	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 210481b03fdb4..8cd76a01b543f 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1638,16 +1638,28 @@ xfs_fs_fill_super(
>  		goto out_free_sb;
>  	}
>  
> -	/*
> -	 * Until this is fixed only page-sized or smaller data blocks work.
> -	 */
>  	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> -		xfs_warn(mp,
> -		"File system with blocksize %d bytes. "
> -		"Only pagesize (%ld) or less will currently work.",
> +		size_t max_folio_size = mapping_max_folio_size_supported();
> +
> +		if (!xfs_has_crc(mp)) {
> +			xfs_warn(mp,
> +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
>  				mp->m_sb.sb_blocksize, PAGE_SIZE);
> -		error = -ENOSYS;
> -		goto out_free_sb;
> +			error = -ENOSYS;
> +			goto out_free_sb;
> +		}
> +
> +		if (mp->m_sb.sb_blocksize > max_folio_size) {
> +			xfs_warn(mp,
> +"block size (%u bytes) not supported; Only block size (%ld) or less is supported",

This small fix [1] is missing in the linux-next and vfs trees. Can it be picked up?

[1] https://lore.kernel.org/all/Zs_vIaw8ESLN2TwY@casper.infradead.org/

> +				mp->m_sb.sb_blocksize, max_folio_size);
> +			error = -ENOSYS;
> +			goto out_free_sb;
> +		}
> +
> +		xfs_warn(mp,
> +"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
> +			mp->m_sb.sb_blocksize);
>  	}
>  
>  	/* Ensure this filesystem fits in the page cache limits */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4cc170949e9c0..55b254d951da7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -374,6 +374,19 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
>  #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
>  
> +/*
> + * mapping_max_folio_size_supported() - Check the max folio size supported
> + *
> + * The filesystem should call this function at mount time if there is a
> + * requirement on the folio mapping size in the page cache.
> + */
> +static inline size_t mapping_max_folio_size_supported(void)
> +{
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
> +	return PAGE_SIZE;
> +}
> +
>  /*
>   * mapping_set_folio_order_range() - Set the orders supported by a file.
>   * @mapping: The address space of the file.
> -- 
> 2.44.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-08-22 13:50 ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
  2024-08-29 10:51   ` Sven Schnelle
  2024-08-29 22:11   ` Matthew Wilcox
@ 2024-09-06  6:52   ` Lai, Yi
  2024-09-06  8:01     ` Pankaj Raghav (Samsung)
  2 siblings, 1 reply; 31+ messages in thread
From: Lai, Yi @ 2024-09-06  6:52 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells, pengfei.xu

Hi Luis,

Greetings!

I used Syzkaller and found that there is a task hang in soft_offline_page in the Linux-next tree - next-20240902.

After bisection, the first bad commit is:
"
fd031210c9ce mm: split a folio in minimum folio order chunks
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log

"
[  447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
[  447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  447.977858]  soft_offline_page+0x97/0xc0
[  447.978281]  do_madvise.part.0+0x1a45/0x2a30
[  447.978742]  ? __pfx___lock_acquire+0x10/0x10
[  447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
[  447.979716]  ? __this_cpu_preempt_check+0x21/0x30
[  447.980225]  ? __this_cpu_preempt_check+0x21/0x30
[  447.980729]  ? lock_release+0x441/0x870
[  447.981160]  ? __this_cpu_preempt_check+0x21/0x30
[  447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[  447.982321]  ? lockdep_hardirqs_on+0x89/0x110
[  447.982771]  ? trace_hardirqs_on+0x51/0x60
[  447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[  447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[  447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
[  447.984673]  __x64_sys_madvise+0x139/0x180
[  447.984997]  x64_sys_call+0x19a5/0x2140
[  447.985307]  do_syscall_64+0x6d/0x140
[  447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  447.986011] RIP: 0033:0x7f782623ee5d
[  447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
[  447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
[  447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
[  447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
[  447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
[  447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
[  447.988799]  </TASK>
[  447.988921]
[  447.988921] Showing all locks held in the system:
[  447.989237] 1 lock held by khungtaskd/33:
[  447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
[  447.989947] 1 lock held by repro/628:
[  447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
[  447.990611]
[  447.990701] =============================================

"

I hope you find it useful.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem, or if you
already have a reproduction environment set up, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64; I used v7.1.0
  // start3.sh will load the bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You can change bzImage_xxx as you want
  // You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost

After logging in to the vm (virtual machine) successfully, you can transfer the reproducer
binary to the vm as shown below, and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for the target kernel:
Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should be equal to or less than the number of CPUs your PC has

Fill the bzImage file into start3.sh above to load the target kernel in the vm.


Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

On Thu, Aug 22, 2024 at 03:50:12PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
> 
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
> 
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: David Howells <dhowells@redhat.com>
> ---
>  include/linux/huge_mm.h | 14 +++++++---
>  mm/huge_memory.c        | 60 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 66 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4c32058cacfec..70424d55da088 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -96,6 +96,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>  	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>  
> +#define split_folio(f) split_folio_to_list(f, NULL)
> +
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
>  #define HPAGE_PUD_SHIFT PUD_SHIFT
> @@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> +int split_folio_to_list(struct folio *folio, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> -	return split_huge_page_to_list_to_order(page, NULL, 0);
> +	return split_folio(page_folio(page));
>  }
>  void deferred_split_folio(struct folio *folio);
>  
> @@ -495,6 +498,12 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> +
> +static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	return 0;
> +}
> +
>  static inline void deferred_split_folio(struct folio *folio) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
> @@ -622,7 +631,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
>  	return split_folio_to_list_to_order(folio, NULL, new_order);
>  }
>  
> -#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
> -#define split_folio(f) split_folio_to_order(f, 0)
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cf8e34f62976f..06384b85a3a20 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3303,6 +3303,9 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>   * released, or if some unexpected race happened (e.g., anon VMA disappeared,
>   * truncation).
>   *
> + * Callers should ensure that the order respects the address space mapping
> + * min-order if one is set for non-anonymous folios.
> + *
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> @@ -3384,6 +3387,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		mapping = NULL;
>  		anon_vma_lock_write(anon_vma);
>  	} else {
> +		unsigned int min_order;
>  		gfp_t gfp;
>  
>  		mapping = folio->mapping;
> @@ -3394,6 +3398,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  			goto out;
>  		}
>  
> +		min_order = mapping_min_folio_order(folio->mapping);
> +		if (new_order < min_order) {
> +			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
> +				     min_order);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>  		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>  							GFP_RECLAIM_MASK);
>  
> @@ -3506,6 +3518,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	return ret;
>  }
>  
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	unsigned int min_order = 0;
> +
> +	if (folio_test_anon(folio))
> +		goto out;
> +
> +	if (!folio->mapping) {
> +		if (folio_test_pmd_mappable(folio))
> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> +		return -EBUSY;
> +	}
> +
> +	min_order = mapping_min_folio_order(folio->mapping);
> +out:
> +	return split_huge_page_to_list_to_order(&folio->page, list,
> +							min_order);
> +}
> +
>  void __folio_undo_large_rmappable(struct folio *folio)
>  {
>  	struct deferred_split *ds_queue;
> @@ -3736,6 +3767,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		struct vm_area_struct *vma = vma_lookup(mm, addr);
>  		struct folio_walk fw;
>  		struct folio *folio;
> +		struct address_space *mapping;
> +		unsigned int target_order = new_order;
>  
>  		if (!vma)
>  			break;
> @@ -3753,7 +3786,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		if (!is_transparent_hugepage(folio))
>  			goto next;
>  
> -		if (new_order >= folio_order(folio))
> +		if (!folio_test_anon(folio)) {
> +			mapping = folio->mapping;
> +			target_order = max(new_order,
> +					   mapping_min_folio_order(mapping));
> +		}
> +
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		total++;
> @@ -3771,9 +3810,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		folio_get(folio);
>  		folio_walk_end(&fw, vma);
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (!folio_test_anon(folio) && folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
> +
>  		folio_unlock(folio);
>  		folio_put(folio);
>  
> @@ -3802,6 +3846,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  	pgoff_t index;
>  	int nr_pages = 1;
>  	unsigned long total = 0, split = 0;
> +	unsigned int min_order;
> +	unsigned int target_order;
>  
>  	file = getname_kernel(file_path);
>  	if (IS_ERR(file))
> @@ -3815,6 +3861,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		 file_path, off_start, off_end);
>  
>  	mapping = candidate->f_mapping;
> +	min_order = mapping_min_folio_order(mapping);
> +	target_order = max(new_order, min_order);
>  
>  	for (index = off_start; index < off_end; index += nr_pages) {
>  		struct folio *folio = filemap_get_folio(mapping, index);
> @@ -3829,15 +3877,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  		total++;
>  		nr_pages = folio_nr_pages(folio);
>  
> -		if (new_order >= folio_order(folio))
> +		if (target_order >= folio_order(folio))
>  			goto next;
>  
>  		if (!folio_trylock(folio))
>  			goto next;
>  
> -		if (!split_folio_to_order(folio, new_order))
> +		if (folio->mapping != mapping)
> +			goto unlock;
> +
> +		if (!split_folio_to_order(folio, target_order))
>  			split++;
>  
> +unlock:
>  		folio_unlock(folio);
>  next:
>  		folio_put(folio);
> -- 
> 2.44.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-09-06  6:52   ` Lai, Yi
@ 2024-09-06  8:01     ` Pankaj Raghav (Samsung)
  2024-09-09  9:06       ` Lai, Yi
  0 siblings, 1 reply; 31+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-09-06  8:01 UTC (permalink / raw)
  To: Lai, Yi
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells, pengfei.xu

On Fri, Sep 06, 2024 at 02:52:38PM +0800, Lai, Yi wrote:
Hi Yi,

> 
> I used Syzkaller and found that there is task hang in soft_offline_page in Linux-next tree - next-20240902.

I don't know if it is related, but we had a fix for this commit, for an LTP
failure due to locking issues, that is in next-20240905 but not
in next-20240902.

Fix: https://lore.kernel.org/linux-next/20240902124931.506061-2-kernel@pankajraghav.com/

Is this reproducible also on next-20240905?

> 
> After bisection and the first bad commit is:
> "
> fd031210c9ce mm: split a folio in minimum folio order chunks
> "
> 
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
> 
> "
> [  447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
> [  447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [  447.977858]  soft_offline_page+0x97/0xc0
> [  447.978281]  do_madvise.part.0+0x1a45/0x2a30
> [  447.978742]  ? __pfx___lock_acquire+0x10/0x10
> [  447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
> [  447.979716]  ? __this_cpu_preempt_check+0x21/0x30
> [  447.980225]  ? __this_cpu_preempt_check+0x21/0x30
> [  447.980729]  ? lock_release+0x441/0x870
> [  447.981160]  ? __this_cpu_preempt_check+0x21/0x30
> [  447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
> [  447.982321]  ? lockdep_hardirqs_on+0x89/0x110
> [  447.982771]  ? trace_hardirqs_on+0x51/0x60
> [  447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> [  447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> [  447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> [  447.984673]  __x64_sys_madvise+0x139/0x180
> [  447.984997]  x64_sys_call+0x19a5/0x2140
> [  447.985307]  do_syscall_64+0x6d/0x140
> [  447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  447.986011] RIP: 0033:0x7f782623ee5d
> [  447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> [  447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
> [  447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
> [  447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
> [  447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
> [  447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
> [  447.988799]  </TASK>
> [  447.988921]
> [  447.988921] Showing all locks held in the system:
> [  447.989237] 1 lock held by khungtaskd/33:
> [  447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
> [  447.989947] 1 lock held by repro/628:
> [  447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
> [  447.990611]
> [  447.990701] =============================================
> 
> "
> 
> I hope you find it useful.
> 
> Regards,
> Yi Lai
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v13 04/10] mm: split a folio in minimum folio order chunks
  2024-09-06  8:01     ` Pankaj Raghav (Samsung)
@ 2024-09-09  9:06       ` Lai, Yi
  0 siblings, 0 replies; 31+ messages in thread
From: Lai, Yi @ 2024-09-09  9:06 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts,
	David Howells, pengfei.xu

Hi,

I have tried running repro.c for 1 hour using the next-20240905 kernel. The issue
cannot be reproduced.

Thank you.

Regards,
Yi Lai

On Fri, Sep 06, 2024 at 08:01:20AM +0000, Pankaj Raghav (Samsung) wrote:
> On Fri, Sep 06, 2024 at 02:52:38PM +0800, Lai, Yi wrote:
> Hi Yi,
> 
> > 
> > I used Syzkaller and found that there is a task hang in soft_offline_page in the Linux-next tree (next-20240902).
> 
> I don't know if it is related, but we had a fix for this commit for an
> LTP failure due to locking issues; that fix is in next-20240905 but not
> in next-20240902.
> 
> Fix: https://lore.kernel.org/linux-next/20240902124931.506061-2-kernel@pankajraghav.com/
> 
> Is this reproducible also on next-20240905?
> 
> > 
> > After bisection and the first bad commit is:
> > "
> > fd031210c9ce mm: split a folio in minimum folio order chunks
> > "
> > 
> > All detailed into can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.c
> > Syzkaller repro syscall steps:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.prog
> > Syzkaller report:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/repro.report
> > Kconfig(make olddefconfig):
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/kconfig_origin
> > Bisect info:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/240904_155526_soft_offline_page/bisect_info.log
> > bzImage:
> > https://github.com/laifryiee/syzkaller_logs/raw/f633dcbc3a8e4ca5f52f0110bc75ff17d9885db4/240904_155526_soft_offline_page/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0
> > Issue dmesg:
> > https://github.com/laifryiee/syzkaller_logs/blob/main/240904_155526_soft_offline_page/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
> > 
> > "
> > [  447.976688]  ? __pfx_soft_offline_page.part.0+0x10/0x10
> > [  447.977255]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> > [  447.977858]  soft_offline_page+0x97/0xc0
> > [  447.978281]  do_madvise.part.0+0x1a45/0x2a30
> > [  447.978742]  ? __pfx___lock_acquire+0x10/0x10
> > [  447.979227]  ? __pfx_do_madvise.part.0+0x10/0x10
> > [  447.979716]  ? __this_cpu_preempt_check+0x21/0x30
> > [  447.980225]  ? __this_cpu_preempt_check+0x21/0x30
> > [  447.980729]  ? lock_release+0x441/0x870
> > [  447.981160]  ? __this_cpu_preempt_check+0x21/0x30
> > [  447.981656]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
> > [  447.982321]  ? lockdep_hardirqs_on+0x89/0x110
> > [  447.982771]  ? trace_hardirqs_on+0x51/0x60
> > [  447.983191]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
> > [  447.983819]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
> > [  447.984282]  ? ktime_get_coarse_real_ts64+0xbf/0xf0
> > [  447.984673]  __x64_sys_madvise+0x139/0x180
> > [  447.984997]  x64_sys_call+0x19a5/0x2140
> > [  447.985307]  do_syscall_64+0x6d/0x140
> > [  447.985600]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  447.986011] RIP: 0033:0x7f782623ee5d
> > [  447.986248] RSP: 002b:00007fff9ddaffb8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> > [  447.986709] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f782623ee5d
> > [  447.987147] RDX: 0000000000000065 RSI: 0000000000003000 RDI: 0000000020d51000
> > [  447.987584] RBP: 00007fff9ddaffc0 R08: 00007fff9ddafff0 R09: 00007fff9ddafff0
> > [  447.988022] R10: 00007fff9ddafff0 R11: 0000000000000217 R12: 00007fff9ddb0118
> > [  447.988428] R13: 0000000000401716 R14: 0000000000403e08 R15: 00007f782645d000
> > [  447.988799]  </TASK>
> > [  447.988921]
> > [  447.988921] Showing all locks held in the system:
> > [  447.989237] 1 lock held by khungtaskd/33:
> > [  447.989447]  #0: ffffffff8705c500 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
> > [  447.989947] 1 lock held by repro/628:
> > [  447.990144]  #0: ffffffff87258a28 (mf_mutex){+.+.}-{3:3}, at: soft_offline_page.part.0+0xda/0xf40
> > [  447.990611]
> > [  447.990701] =============================================
> > 
> > "
> > 
> > I hope you find it useful.
> > 
> > Regards,
> > Yi Lai
> > 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-09-09  9:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-22 13:50 [PATCH v13 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-08-23 13:09   ` Daniel Gomez
2024-08-22 13:50 ` [PATCH v13 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
2024-08-29 10:51   ` Sven Schnelle
2024-08-29 18:46     ` Luis Chamberlain
2024-08-29 19:55       ` Matthew Wilcox
2024-08-29 22:12         ` Zi Yan
2024-08-29 23:41           ` Luis Chamberlain
2024-08-30  5:57             ` Sven Schnelle
2024-08-30 11:58             ` Daniel Gomez
2024-08-30 14:59             ` Pankaj Raghav
2024-08-30 17:12               ` Luis Chamberlain
2024-08-31 22:38                 ` Zi Yan
2024-08-30 22:42               ` Matthew Wilcox
2024-08-31 22:35               ` Zi Yan
2024-08-29 22:11   ` Matthew Wilcox
2024-09-06  6:52   ` Lai, Yi
2024-09-06  8:01     ` Pankaj Raghav (Samsung)
2024-09-09  9:06       ` Lai, Yi
2024-08-22 13:50 ` [PATCH v13 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-08-22 13:50 ` [PATCH v13 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-09-03 12:29   ` [PATCH v13 10/10] xfs: enable block size larger than page size support Daniel Gomez
2024-08-22 21:23 ` [PATCH v13 00/10] enable bs > ps in XFS Luis Chamberlain
2024-08-23 12:36   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).