* [PATCH v10 00/10] enable bs > ps in XFS
@ 2024-07-15 9:44 Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
` (9 more replies)
0 siblings, 10 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
This is the tenth version of the series that enables block size > page size
(Large Block Size) in XFS.
The context and motivation can be seen in the cover letter of RFC v1 [0].
We also recorded a talk about this effort at LPC [1] for anyone who would
like more background.
A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.
Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.
I also ran the split_huge_page_test selftest on an XFS filesystem to verify
that huge page splits into min-order chunks are done correctly.
No regressions were found with these patches added on top.
Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes. To
compare with existing support, an ARM VM with a 64k base page size (without
our patches) was used as a reference to check for actual failures due to LBS
support on a 4k base page size system.
There are some tests that assume block size < page size and need to be fixed.
We have a tree with fixes for xfstests [4]; most of the changes have been posted
already, and only a few minor changes remain to be posted. Part of these
changes has already been upstreamed to fstests, and new tests have been written
and are out for review, namely for mmap zeroing-around corner cases, compaction
and fsstress races on mm, and stress testing folio truncation on file-mapped
folios.
No new failures were found with the LBS support.
We've done some preliminary performance tests with fio on XFS with a 4k block
size against pmem and NVMe, using both buffered IO and Direct IO, comparing
vanilla kernels against kernels with these patches applied, and detected no
regressions.
We also wrote an eBPF tool called blkalgn [5] to see if IO sent to the device
is aligned and at least filesystem block size in length.
For those who want this in a git tree, we have it up on the kdevops
large-block-minorder-for-next-v10 tag [6].
[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung
tasks and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t
Changes since v9:
- Added a mapping_max_folio_size_supported() that filesystems can call
at mount time to check the maximum folio size the page cache supports.
- Changed split_folio_to_list() to count THP_SPLIT_PAGE_FAILED for
pmd folios.
- Formatting changes in the first patch
- Collected Reviewed-by tags from Hannes, Zi Yan, Darrick and Dave.
Dave Chinner (1):
xfs: use kvmalloc for xattr buffers
Luis Chamberlain (1):
mm: split a folio in minimum folio order chunks
Matthew Wilcox (Oracle) (1):
fs: Allow fine-grained control of folio sizes
Pankaj Raghav (7):
filemap: allocate mapping_min_order folios in the page cache
readahead: allocate folios with mapping_min_order in readahead
filemap: cap PTE range to be created to allowed zero fill in
folio_map_range()
iomap: fix iomap_dio_zero() for fs bs > system page size
xfs: expose block size in stat
xfs: make the calculation generic in xfs_sb_validate_fsb_count()
xfs: enable block size larger than page size support
fs/iomap/buffered-io.c | 4 +-
fs/iomap/direct-io.c | 45 ++++++++++--
fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++--
fs/xfs/libxfs/xfs_ialloc.c | 5 ++
fs/xfs/libxfs/xfs_shared.h | 3 +
fs/xfs/xfs_icache.c | 6 +-
fs/xfs/xfs_iops.c | 2 +-
fs/xfs/xfs_mount.c | 8 ++-
fs/xfs/xfs_super.c | 30 +++++---
include/linux/huge_mm.h | 14 ++--
include/linux/pagemap.h | 127 ++++++++++++++++++++++++++++++----
mm/filemap.c | 36 ++++++----
mm/huge_memory.c | 59 ++++++++++++++--
mm/readahead.c | 83 ++++++++++++++++------
14 files changed, 353 insertions(+), 84 deletions(-)
base-commit: 0b58e108042b0ed28a71cd7edf5175999955b233
--
2.44.1
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-16 15:26 ` Matthew Wilcox
2024-07-15 9:44 ` [PATCH v10 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
` (8 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
We need filesystems to be able to communicate acceptable folio sizes
to the pagecache for a variety of uses (e.g. large block sizes).
Support a range of folio sizes between order-0 and order-31.
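For illustration, here is a minimal userspace sketch of the two 5-bit order
fields this patch packs into mapping->flags; 5 bits per field cover orders
0-31. The constants mirror the ones added below, but the standalone program
itself is illustrative, not kernel code:

#include <stdio.h>

#define AS_FOLIO_ORDER_BITS	5
#define AS_FOLIO_ORDER_MIN	16
#define AS_FOLIO_ORDER_MAX	(AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS)
#define AS_FOLIO_ORDER_MASK	((1u << AS_FOLIO_ORDER_BITS) - 1)

int main(void)
{
	unsigned long flags = 0;
	unsigned int min = 2, max = 11;	/* e.g. min-order 2 on 4k pages */

	/* clear both 5-bit fields, then store the two orders */
	flags &= ~((unsigned long)AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN);
	flags &= ~((unsigned long)AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX);
	flags |= ((unsigned long)min << AS_FOLIO_ORDER_MIN) |
		 ((unsigned long)max << AS_FOLIO_ORDER_MAX);

	printf("min=%lu max=%lu\n",
	       (flags >> AS_FOLIO_ORDER_MIN) & AS_FOLIO_ORDER_MASK,
	       (flags >> AS_FOLIO_ORDER_MAX) & AS_FOLIO_ORDER_MASK);
	return 0;
}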
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
include/linux/pagemap.h | 107 +++++++++++++++++++++++++++++++++++-----
mm/filemap.c | 6 +--
mm/readahead.c | 4 +-
3 files changed, 98 insertions(+), 19 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8026a8a433d36..8d2b5c51461b0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -204,14 +204,21 @@ enum mapping_flags {
AS_EXITING = 4, /* final truncate in progress */
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
- AS_LARGE_FOLIO_SUPPORT = 6,
- AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */
- AS_STABLE_WRITES, /* must wait for writeback before modifying
+ AS_RELEASE_ALWAYS = 6, /* Call ->release_folio(), even if no private data */
+ AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
folio contents */
- AS_UNMOVABLE, /* The mapping cannot be moved, ever */
- AS_INACCESSIBLE, /* Do not attempt direct R/W access to the mapping */
+ AS_UNMOVABLE = 8, /* The mapping cannot be moved, ever */
+ AS_INACCESSIBLE = 9, /* Do not attempt direct R/W access to the mapping */
+ /* Bits 16-25 are used for FOLIO_ORDER */
+ AS_FOLIO_ORDER_BITS = 5,
+ AS_FOLIO_ORDER_MIN = 16,
+ AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
};
+#define AS_FOLIO_ORDER_MASK ((1u << AS_FOLIO_ORDER_BITS) - 1)
+#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
+#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+
/**
* mapping_set_error - record a writeback error in the address_space
* @mapping: the mapping in which an error should be set
@@ -367,9 +374,70 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
#define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
#define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
+/*
+ * mapping_max_folio_size_supported() - Check the max folio size supported
+ *
+ * The filesystem should call this function at mount time if there is a
+ * requirement on the folio mapping size in the page cache.
+ */
+static inline size_t mapping_max_folio_size_supported(void)
+{
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
+ return PAGE_SIZE;
+}
+
+/*
+ * mapping_set_folio_order_range() - Set the orders supported by a file.
+ * @mapping: The address space of the file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file. This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+ unsigned int min,
+ unsigned int max)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return;
+
+ if (min > MAX_PAGECACHE_ORDER) {
+ VM_WARN_ONCE(1,
+ "min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
+ min = MAX_PAGECACHE_ORDER;
+ }
+
+ if (max > MAX_PAGECACHE_ORDER) {
+ VM_WARN_ONCE(1,
+ "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
+ max = MAX_PAGECACHE_ORDER;
+ }
+
+ if (max < min)
+ max = min;
+
+ mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+ (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+ unsigned int min)
+{
+ mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
/**
* mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address space of the file.
*
* The filesystem should call this function in its inode constructor to
* indicate that the VFS can use large folios to cache the contents of
@@ -380,7 +448,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
*/
static inline void mapping_set_large_folios(struct address_space *mapping)
{
- __set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+ mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int
+mapping_max_folio_order(const struct address_space *mapping)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 0;
+ return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int
+mapping_min_folio_order(const struct address_space *mapping)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 0;
+ return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
}
/*
@@ -393,16 +477,13 @@ static inline bool mapping_large_folio_support(struct address_space *mapping)
VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
"Anonymous mapping always supports large folio");
- return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
- test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+ return mapping_max_folio_order(mapping) > 0;
}
/* Return the maximum folio size for this pagecache mapping, in bytes. */
-static inline size_t mapping_max_folio_size(struct address_space *mapping)
+static inline size_t mapping_max_folio_size(const struct address_space *mapping)
{
- if (mapping_large_folio_support(mapping))
- return PAGE_SIZE << MAX_PAGECACHE_ORDER;
- return PAGE_SIZE;
+ return PAGE_SIZE << mapping_max_folio_order(mapping);
}
static inline int filemap_nr_thps(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index d62150418b910..ad5e4a848070e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1933,10 +1933,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
fgp_flags |= FGP_LOCK;
- if (!mapping_large_folio_support(mapping))
- order = 0;
- if (order > MAX_PAGECACHE_ORDER)
- order = MAX_PAGECACHE_ORDER;
+ if (order > mapping_max_folio_order(mapping))
+ order = mapping_max_folio_order(mapping);
/* If we're not aligned, allocate a smaller folio */
if (index & ((1UL << order) - 1))
order = __ffs(index);
diff --git a/mm/readahead.c b/mm/readahead.c
index 517c0be7ce665..3e5239e9e1777 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -449,10 +449,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
limit = min(limit, index + ra->size - 1);
- if (new_order < MAX_PAGECACHE_ORDER)
+ if (new_order < mapping_max_folio_order(mapping))
new_order += 2;
- new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
+ new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
/* See comment in page_cache_ra_unbounded() */
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 02/10] filemap: allocate mapping_min_order folios in the page cache
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
` (7 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had the higher order hint set, but it would
fall back to an order-0 folio if the higher order memory allocation failed.
Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
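For illustration, a minimal userspace sketch of the index alignment this
requires (the helper name is illustrative; the kernel side adds
mapping_align_index() below):

#include <stdio.h>

/* round a page-cache index down to a multiple of the min folio size */
static unsigned long align_index(unsigned long index, unsigned int min_order)
{
	unsigned long min_nrpages = 1UL << min_order;

	return index & ~(min_nrpages - 1);	/* same as round_down() */
}

int main(void)
{
	/* min_order = 2, e.g. 16k blocks on 4k pages: index 7 maps to 4 */
	printf("%lu\n", align_index(7, 2));
	return 0;
}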
Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/pagemap.h | 20 ++++++++++++++++++++
mm/filemap.c | 24 ++++++++++++++++--------
2 files changed, 36 insertions(+), 8 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8d2b5c51461b0..68edbea9ae25a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -467,6 +467,26 @@ mapping_min_folio_order(const struct address_space *mapping)
return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
}
+static inline unsigned long
+mapping_min_folio_nrpages(struct address_space *mapping)
+{
+ return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_index() - Align index for this mapping.
+ * @mapping: The address_space.
+ *
+ * The index of a folio must be naturally aligned. If you are adding a
+ * new folio to the page cache and need to know what index to give it,
+ * call this function.
+ */
+static inline pgoff_t mapping_align_index(struct address_space *mapping,
+ pgoff_t index)
+{
+ return round_down(index, mapping_min_folio_nrpages(mapping));
+}
+
/*
* Large folio support currently depends on THP. These dependencies are
* being worked on but are not yet fixed.
diff --git a/mm/filemap.c b/mm/filemap.c
index ad5e4a848070e..d27e9ac54309d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -859,6 +859,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+ VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+ folio);
mapping_set_update(&xas, mapping);
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
@@ -1919,8 +1921,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio_wait_stable(folio);
no_page:
if (!folio && (fgp_flags & FGP_CREAT)) {
- unsigned order = FGF_GET_ORDER(fgp_flags);
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
int err;
+ index = mapping_align_index(mapping, index);
if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
gfp |= __GFP_WRITE;
@@ -1943,7 +1947,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
gfp_t alloc_gfp = gfp;
err = -ENOMEM;
- if (order > 0)
+ if (order > min_order)
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
folio = filemap_alloc_folio(alloc_gfp, order);
if (!folio)
@@ -1958,7 +1962,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
break;
folio_put(folio);
folio = NULL;
- } while (order-- > 0);
+ } while (order-- > min_order);
if (err == -EEXIST)
goto repeat;
@@ -2447,13 +2451,15 @@ static int filemap_update_page(struct kiocb *iocb,
}
static int filemap_create_folio(struct file *file,
- struct address_space *mapping, pgoff_t index,
+ struct address_space *mapping, loff_t pos,
struct folio_batch *fbatch)
{
struct folio *folio;
int error;
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ pgoff_t index;
- folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+ folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
@@ -2471,6 +2477,7 @@ static int filemap_create_folio(struct file *file,
* well to keep locking rules simple.
*/
filemap_invalidate_lock_shared(mapping);
+ index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
error = filemap_add_folio(mapping, folio, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
@@ -2531,8 +2538,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
if (!folio_batch_count(fbatch)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
return -EAGAIN;
- err = filemap_create_folio(filp, mapping,
- iocb->ki_pos >> PAGE_SHIFT, fbatch);
+ err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
@@ -3748,9 +3754,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
repeat:
folio = filemap_get_folio(mapping, index);
if (IS_ERR(folio)) {
- folio = filemap_alloc_folio(gfp, 0);
+ folio = filemap_alloc_folio(gfp,
+ mapping_min_folio_order(mapping));
if (!folio)
return ERR_PTR(-ENOMEM);
+ index = mapping_align_index(mapping, index);
err = filemap_add_folio(mapping, folio, index, gfp);
if (unlikely(err)) {
folio_put(folio);
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 03/10] readahead: allocate folios with mapping_min_order in readahead
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
` (6 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
page_cache_ra_unbounded() was allocating single pages (order-0 folios)
if no folio was found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.
page_cache_ra_order() tries to allocate folios in the page cache with a
higher order if the index aligns with that order. Modify it so that the
order does not go below the mapping_min_order requirement of the page
cache. This function will do the right thing even if the new_order passed
is less than the mapping_min_order.
When adding new folios to the page cache we must also ensure the index
used is aligned to the mapping_min_order as the page cache requires the
index to be aligned to the order of the folio.
readahead_expand() is called from readahead aops to extend the range of
the readahead, so this function can assume ractl->_index is aligned to
min_order.
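For illustration, a small userspace sketch of how the readahead mark is
kept aligned to min_nrpages (values and names are illustrative):

#include <stdio.h>

/* round_up for power-of-two y, as used in the kernel */
#define round_up(x, y)	(((x) + (y) - 1) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long index = 0, nr_to_read = 13, lookahead_size = 3;
	unsigned long min_nrpages = 4;	/* 1 << min_order */

	/* first index of the lookahead region, kept folio-aligned */
	unsigned long ra_folio_index =
		round_up(index + nr_to_read - lookahead_size, min_nrpages);

	printf("readahead mark at offset %lu\n", ra_folio_index - index);
	return 0;
}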
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Co-developed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
mm/readahead.c | 79 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 18 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 3e5239e9e1777..2078c42777a62 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
unsigned long nr_to_read, unsigned long lookahead_size)
{
struct address_space *mapping = ractl->mapping;
- unsigned long index = readahead_index(ractl);
+ unsigned long ra_folio_index, index = readahead_index(ractl);
gfp_t gfp_mask = readahead_gfp_mask(mapping);
- unsigned long i;
+ unsigned long mark, i = 0;
+ unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
/*
* Partway through the readahead operation, we will have added
@@ -223,10 +224,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
unsigned int nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
+ index = mapping_align_index(mapping, index);
+
+ /*
+ * As iterator `i` is aligned to min_nrpages, round_up the
+ * difference between nr_to_read and lookahead_size to mark the
+ * index that only has lookahead or "async_region" to set the
+ * readahead flag.
+ */
+ ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+ min_nrpages);
+ mark = ra_folio_index - index;
+ nr_to_read += readahead_index(ractl) - index;
+ ractl->_index = index;
+
/*
* Preallocate as many pages as we will need.
*/
- for (i = 0; i < nr_to_read; i++) {
+ while (i < nr_to_read) {
struct folio *folio = xa_load(&mapping->i_pages, index + i);
int ret;
@@ -240,12 +255,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
* not worth getting one just for that.
*/
read_pages(ractl);
- ractl->_index++;
- i = ractl->_index + ractl->_nr_pages - index - 1;
+ ractl->_index += min_nrpages;
+ i = ractl->_index + ractl->_nr_pages - index;
continue;
}
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask,
+ mapping_min_folio_order(mapping));
if (!folio)
break;
@@ -255,14 +271,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
if (ret == -ENOMEM)
break;
read_pages(ractl);
- ractl->_index++;
- i = ractl->_index + ractl->_nr_pages - index - 1;
+ ractl->_index += min_nrpages;
+ i = ractl->_index + ractl->_nr_pages - index;
continue;
}
- if (i == nr_to_read - lookahead_size)
+ if (i == mark)
folio_set_readahead(folio);
ractl->_workingset |= folio_test_workingset(folio);
- ractl->_nr_pages++;
+ ractl->_nr_pages += min_nrpages;
+ i += min_nrpages;
}
/*
@@ -438,13 +455,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
struct address_space *mapping = ractl->mapping;
pgoff_t start = readahead_index(ractl);
pgoff_t index = start;
+ unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
pgoff_t mark = index + ra->size - ra->async_size;
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
+ unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
- if (!mapping_large_folio_support(mapping) || ra->size < 4)
+ /*
+ * Fallback when size < min_nrpages as each folio should be
+ * at least min_nrpages anyway.
+ */
+ if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
goto fallback;
limit = min(limit, index + ra->size - 1);
@@ -454,10 +477,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+ new_order = max(new_order, min_order);
/* See comment in page_cache_ra_unbounded() */
nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
+ /*
+ * If the new_order is greater than min_order and index is
+ * already aligned to new_order, then this will be noop as index
+ * aligned to new_order should also be aligned to min_order.
+ */
+ ractl->_index = mapping_align_index(mapping, index);
+ index = readahead_index(ractl);
+
while (index <= limit) {
unsigned int order = new_order;
@@ -465,7 +497,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
if (index & ((1UL << order) - 1))
order = __ffs(index);
/* Don't allocate pages past EOF */
- while (index + (1UL << order) - 1 > limit)
+ while (order > min_order && index + (1UL << order) - 1 > limit)
order--;
err = ra_alloc_folio(ractl, index, mark, order, gfp);
if (err)
@@ -703,8 +735,15 @@ void readahead_expand(struct readahead_control *ractl,
struct file_ra_state *ra = ractl->ra;
pgoff_t new_index, new_nr_pages;
gfp_t gfp_mask = readahead_gfp_mask(mapping);
+ unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+ unsigned int min_order = mapping_min_folio_order(mapping);
new_index = new_start / PAGE_SIZE;
+ /*
+ * Readahead code should have aligned the ractl->_index to
+ * min_nrpages before calling readahead aops.
+ */
+ VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
/* Expand the leading edge downwards */
while (ractl->_index > new_index) {
@@ -714,9 +753,11 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask, min_order);
if (!folio)
return;
+
+ index = mapping_align_index(mapping, index);
if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
folio_put(folio);
return;
@@ -726,7 +767,7 @@ void readahead_expand(struct readahead_control *ractl,
ractl->_workingset = true;
psi_memstall_enter(&ractl->_pflags);
}
- ractl->_nr_pages++;
+ ractl->_nr_pages += min_nrpages;
ractl->_index = folio->index;
}
@@ -741,9 +782,11 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask, min_order);
if (!folio)
return;
+
+ index = mapping_align_index(mapping, index);
if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
folio_put(folio);
return;
@@ -753,10 +796,10 @@ void readahead_expand(struct readahead_control *ractl,
ractl->_workingset = true;
psi_memstall_enter(&ractl->_pflags);
}
- ractl->_nr_pages++;
+ ractl->_nr_pages += min_nrpages;
if (ra) {
- ra->size++;
- ra->async_size++;
+ ra->size += min_nrpages;
+ ra->async_size += min_nrpages;
}
}
}
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 04/10] mm: split a folio in minimum folio order chunks
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (2 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
` (5 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: Luis Chamberlain <mcgrof@kernel.org>
split_folio() and split_folio_to_list() assume order 0. To support
min order for non-anonymous folios, we must expand these to check the
folio mapping order and use that.
Set new_order to be at least the minimum folio order if it is set in
split_huge_page_to_list() so that we can maintain the minimum folio order
requirement in the page cache.
Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.
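For illustration, the enforcement reduces to taking the larger of the
requested order and the mapping's min order; a tiny userspace sketch
(names illustrative):

#include <stdio.h>

/* a file-backed folio must not be split below the mapping's min order */
static unsigned int clamp_split_order(unsigned int new_order,
				      unsigned int min_order)
{
	return new_order > min_order ? new_order : min_order;
}

int main(void)
{
	/* a split to order 0 on a min-order-2 mapping is raised to 2 */
	printf("%u\n", clamp_split_order(0, 2));
	return 0;
}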
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
---
include/linux/huge_mm.h | 14 +++++++---
mm/huge_memory.c | 59 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 65 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index cee3c5da8f0ed..b6024bf39a9fe 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -90,6 +90,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
+#define split_folio(f) split_folio_to_list(f, NULL)
+
#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -323,9 +325,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
bool can_split_folio(struct folio *folio, int *pextra_pins);
int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
static inline int split_huge_page(struct page *page)
{
- return split_huge_page_to_list_to_order(page, NULL, 0);
+ return split_folio(page_folio(page));
}
void deferred_split_folio(struct folio *folio);
@@ -490,6 +493,12 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+ return 0;
+}
+
static inline void deferred_split_folio(struct folio *folio) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
@@ -604,7 +613,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
return split_folio_to_list_to_order(folio, NULL, new_order);
}
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 251d6932130fa..af080296e11b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3062,6 +3062,9 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
* released, or if some unexpected race happened (e.g., anon VMA disappeared,
* truncation).
*
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
* Returns -EINVAL when trying to split to an order that is incompatible
* with the folio. Splitting to order 0 is compatible with all folios.
*/
@@ -3143,6 +3146,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
mapping = NULL;
anon_vma_lock_write(anon_vma);
} else {
+ unsigned int min_order;
gfp_t gfp;
mapping = folio->mapping;
@@ -3153,6 +3157,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
goto out;
}
+ min_order = mapping_min_folio_order(folio->mapping);
+ if (new_order < min_order) {
+ VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+ min_order);
+ ret = -EINVAL;
+ goto out;
+ }
+
gfp = current_gfp_context(mapping_gfp_mask(mapping) &
GFP_RECLAIM_MASK);
@@ -3265,6 +3277,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
return ret;
}
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+ unsigned int min_order = 0;
+
+ if (folio_test_anon(folio))
+ goto out;
+
+ if (!folio->mapping) {
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_SPLIT_PAGE_FAILED);
+ return -EBUSY;
+ }
+
+ min_order = mapping_min_folio_order(folio->mapping);
+out:
+ return split_huge_page_to_list_to_order(&folio->page, list,
+ min_order);
+}
+
void __folio_undo_large_rmappable(struct folio *folio)
{
struct deferred_split *ds_queue;
@@ -3496,6 +3527,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
struct vm_area_struct *vma = vma_lookup(mm, addr);
struct page *page;
struct folio *folio;
+ struct address_space *mapping;
+ unsigned int target_order = new_order;
if (!vma)
break;
@@ -3516,7 +3549,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
if (!is_transparent_hugepage(folio))
goto next;
- if (new_order >= folio_order(folio))
+ if (!folio_test_anon(folio)) {
+ mapping = folio->mapping;
+ target_order = max(new_order,
+ mapping_min_folio_order(mapping));
+ }
+
+ if (target_order >= folio_order(folio))
goto next;
total++;
@@ -3532,9 +3571,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
if (!folio_trylock(folio))
goto next;
- if (!split_folio_to_order(folio, new_order))
+ if (!folio_test_anon(folio) && folio->mapping != mapping)
+ goto unlock;
+
+ if (!split_folio_to_order(folio, target_order))
split++;
+unlock:
folio_unlock(folio);
next:
folio_put(folio);
@@ -3559,6 +3602,7 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
pgoff_t index;
int nr_pages = 1;
unsigned long total = 0, split = 0;
+ unsigned int min_order;
file = getname_kernel(file_path);
if (IS_ERR(file))
@@ -3572,9 +3616,11 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
file_path, off_start, off_end);
mapping = candidate->f_mapping;
+ min_order = mapping_min_folio_order(mapping);
for (index = off_start; index < off_end; index += nr_pages) {
struct folio *folio = filemap_get_folio(mapping, index);
+ unsigned int target_order = new_order;
nr_pages = 1;
if (IS_ERR(folio))
@@ -3583,18 +3629,23 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
if (!folio_test_large(folio))
goto next;
+ target_order = max(new_order, min_order);
total++;
nr_pages = folio_nr_pages(folio);
- if (new_order >= folio_order(folio))
+ if (target_order >= folio_order(folio))
goto next;
if (!folio_trylock(folio))
goto next;
- if (!split_folio_to_order(folio, new_order))
+ if (folio->mapping != mapping)
+ goto unlock;
+
+ if (!split_folio_to_order(folio, target_order))
split++;
+unlock:
folio_unlock(folio);
next:
folio_put(folio);
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (3 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
` (4 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
Usually the page cache does not extend beyond the size of the inode,
therefore, no PTEs are created for folios that extend beyond the size.
But with LBS support, we might extend page cache beyond the size of the
inode as we need to guarantee folios of minimum order. While doing a
read, do_fault_around() can create PTEs for pages that lie beyond the
EOF, leading to an incorrect error return when accessing a page beyond the
mapped file.
Cap the PTE range to be created for the page cache at the end of
file (EOF) in filemap_map_pages() so that the error codes returned are
consistent with POSIX [1] for LBS configurations.
generic/749 (currently in the xfstests-dev patches-in-queue branch [0]) has
been created to trigger this edge case. This also fixes generic/749 for
tmpfs with huge=always on systems with 4k base page size.
[0] https://lore.kernel.org/all/20240615002935.1033031-3-mcgrof@kernel.org/
[1](from mmap(2)) SIGBUS
Attempted access to a page of the buffer that lies beyond the end
of the mapped file. For an explanation of the treatment of the
bytes in the page that corresponds to the end of a mapped file
that is not a multiple of the page size, see NOTES.
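For illustration, a minimal userspace sketch of the EOF cap applied in
filemap_map_pages() below (values illustrative):

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long page_size = 4096;
	unsigned long i_size = 5000;	/* file is just over one page */
	unsigned long end_pgoff = 10;	/* fault-around would map this far */

	/* last page index that contains file data */
	unsigned long file_end = DIV_ROUND_UP(i_size, page_size) - 1;

	if (end_pgoff > file_end)
		end_pgoff = file_end;

	printf("end_pgoff capped to %lu\n", end_pgoff);	/* prints 1 */
	return 0;
}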
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
mm/filemap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index d27e9ac54309d..d322109274532 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3608,7 +3608,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct vm_area_struct *vma = vmf->vma;
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
- pgoff_t last_pgoff = start_pgoff;
+ pgoff_t file_end, last_pgoff = start_pgoff;
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct folio *folio;
@@ -3634,6 +3634,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
goto out;
}
+ file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+ if (end_pgoff > file_end)
+ end_pgoff = file_end;
+
folio_type = mm_counter_file(folio);
do {
unsigned long end;
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (4 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
` (3 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan, Dave Chinner
From: Pankaj Raghav <p.raghav@samsung.com>
iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.
If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.
iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
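For illustration, a userspace sketch of the head/tail pad arithmetic
(values illustrative). With a 16k block on a 4k page system either pad can
exceed PAGE_SIZE, which is why a single ZERO_PAGE(0) no longer suffices
and the patch below switches to a larger zeroed buffer:

#include <stdio.h>

int main(void)
{
	unsigned int fs_block_size = 16384;	/* 16k blocks on 4k pages */
	unsigned long long start = 5120, end = 9216;	/* unaligned write */

	/* zero [start - head_pad, start): block start to write offset */
	unsigned int head_pad = start & (fs_block_size - 1);

	/* zero [end, end + tail_len): end of write to end of block */
	unsigned int tail_pad = end & (fs_block_size - 1);
	unsigned int tail_len = tail_pad ? fs_block_size - tail_pad : 0;

	printf("head=%u tail=%u\n", head_pad, tail_len);
	return 0;
}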
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
fs/iomap/buffered-io.c | 4 ++--
fs/iomap/direct-io.c | 45 ++++++++++++++++++++++++++++++++++++------
2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc..d745f718bcde8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -2007,10 +2007,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
}
EXPORT_SYMBOL_GPL(iomap_writepages);
-static int __init iomap_init(void)
+static int __init iomap_buffered_init(void)
{
return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
offsetof(struct iomap_ioend, io_bio),
BIOSET_NEED_BVECS);
}
-fs_initcall(iomap_init);
+fs_initcall(iomap_buffered_init);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46e..c02b266bba525 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -11,6 +11,7 @@
#include <linux/iomap.h>
#include <linux/backing-dev.h>
#include <linux/uio.h>
+#include <linux/set_memory.h>
#include <linux/task_io_accounting_ops.h>
#include "trace.h"
@@ -27,6 +28,13 @@
#define IOMAP_DIO_WRITE (1U << 30)
#define IOMAP_DIO_DIRTY (1U << 31)
+/*
+ * Used for sub block zeroing in iomap_dio_zero()
+ */
+#define IOMAP_ZERO_PAGE_SIZE (SZ_64K)
+#define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE))
+static struct page *zero_page;
+
struct iomap_dio {
struct kiocb *iocb;
const struct iomap_dio_ops *dops;
@@ -232,13 +240,20 @@ void iomap_dio_bio_end_io(struct bio *bio)
}
EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
-static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
+static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
loff_t pos, unsigned len)
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
- struct page *page = ZERO_PAGE(0);
struct bio *bio;
+ if (!len)
+ return 0;
+ /*
+ * Max block size supported is 64k
+ */
+ if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE))
+ return -EINVAL;
+
bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
@@ -246,8 +261,9 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ __bio_add_page(bio, zero_page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
+ return 0;
}
/*
@@ -356,8 +372,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
if (need_zeroout) {
/* zero out from the start of the block to the write offset */
pad = pos & (fs_block_size - 1);
- if (pad)
- iomap_dio_zero(iter, dio, pos - pad, pad);
+
+ ret = iomap_dio_zero(iter, dio, pos - pad, pad);
+ if (ret)
+ goto out;
}
/*
@@ -431,7 +449,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
/* zero out from the end of the write to the end of the block */
pad = pos & (fs_block_size - 1);
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ ret = iomap_dio_zero(iter, dio, pos,
+ fs_block_size - pad);
}
out:
/* Undo iter limitation to current extent */
@@ -753,3 +772,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+ zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ IOMAP_ZERO_PAGE_ORDER);
+
+ if (!zero_page)
+ return -ENOMEM;
+
+ set_memory_ro((unsigned long)page_address(zero_page),
+ 1U << IOMAP_ZERO_PAGE_ORDER);
+ return 0;
+}
+fs_initcall(iomap_dio_init);
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 07/10] xfs: use kvmalloc for xattr buffers
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (5 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
` (2 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:
static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
struct page *page;
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...
Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().
This is not an issue that production systems will see as
filesystems with block size > page size cannot be mounted by the
kernel; Pankaj is developing this functionality right now.
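For reference, the shape of the fix as a sketch (not the full diff):
kvmalloc() strips the NOFAIL from the kmalloc attempt and falls back to a
__GFP_NOFAIL vmalloc() if that fails, so no high-order NOFAIL kmalloc is
ever attempted:

	/* before: kmalloc() may attempt an order > 1 allocation with
	 * __GFP_NOFAIL, tripping the WARN_ON_ONCE() shown above */
	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
	...
	kfree(tmpbuffer);

	/* after: physically-contiguous fast path, vmalloc fallback */
	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
	...
	kvfree(tmpbuffer);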
Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b9e98950eb3d8..09f4cb061a6e0 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1138,10 +1138,7 @@ xfs_attr3_leaf_to_shortform(
trace_xfs_attr_leaf_to_sf(args);
- tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
- if (!tmpbuffer)
- return -ENOMEM;
-
+ tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1205,7 +1202,7 @@ xfs_attr3_leaf_to_shortform(
error = 0;
out:
- kfree(tmpbuffer);
+ kvfree(tmpbuffer);
return error;
}
@@ -1613,7 +1610,7 @@ xfs_attr3_leaf_compact(
trace_xfs_attr_leaf_compact(args);
- tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+ tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
memset(bp->b_addr, 0, args->geo->blksize);
leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1651,7 +1648,7 @@ xfs_attr3_leaf_compact(
*/
xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
- kfree(tmpbuffer);
+ kvfree(tmpbuffer);
}
/*
@@ -2330,7 +2327,7 @@ xfs_attr3_leaf_unbalance(
struct xfs_attr_leafblock *tmp_leaf;
struct xfs_attr3_icleaf_hdr tmphdr;
- tmp_leaf = kzalloc(state->args->geo->blksize,
+ tmp_leaf = kvzalloc(state->args->geo->blksize,
GFP_KERNEL | __GFP_NOFAIL);
/*
@@ -2371,7 +2368,7 @@ xfs_attr3_leaf_unbalance(
}
memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
savehdr = tmphdr; /* struct copy */
- kfree(tmp_leaf);
+ kvfree(tmp_leaf);
}
xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 08/10] xfs: expose block size in stat
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (6 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan, Dave Chinner
From: Pankaj Raghav <p.raghav@samsung.com>
For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.
This change is based on a patch originally from Dave Chinner.[1]
[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
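For illustration, a userspace sketch of the resulting st_blksize choice
(the allocsize and realtime cases handled earlier in xfs_stat_blksize()
are unchanged; names illustrative):

#include <stdio.h>

/* preferred IO size: at least a page, never less than the fs block */
static unsigned int stat_blksize(unsigned int page_size,
				 unsigned int blocksize)
{
	return blocksize > page_size ? blocksize : page_size;
}

int main(void)
{
	printf("%u\n", stat_blksize(4096, 16384));	/* 16384 */
	printf("%u\n", stat_blksize(4096, 1024));	/* 4096 */
	return 0;
}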
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_iops.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a00dcbc77e12b..da5c13150315e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -562,7 +562,7 @@ xfs_stat_blksize(
return 1U << mp->m_allocsize_log;
}
- return PAGE_SIZE;
+ return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
}
STATIC int
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (7 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
9 siblings, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan, Dave Chinner
From: Pankaj Raghav <p.raghav@samsung.com>
Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that page cache count can be calculated
correctly for LBS.
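For illustration, a userspace stand-in for the check_shl_overflow() based
validation (names illustrative):

#include <stdint.h>
#include <stdio.h>

/* return 1 if nblocks << blocklog would overflow a 64-bit byte count */
static int shl_overflows(uint64_t nblocks, unsigned int blocklog,
			 uint64_t *max_bytes)
{
	if (blocklog >= 64 || nblocks > (UINT64_MAX >> blocklog))
		return 1;
	*max_bytes = nblocks << blocklog;
	return 0;
}

int main(void)
{
	uint64_t max_bytes;

	/* a block count that is fine with a 12-bit blocklog ... */
	if (!shl_overflows(1ULL << 40, 12, &max_bytes))
		printf("max_bytes=%llu\n", (unsigned long long)max_bytes);

	/* ... and one that would wrap, i.e. -EFBIG in the kernel code */
	if (shl_overflows(UINT64_MAX / 2, 16, &max_bytes))
		printf("too many blocks\n");
	return 0;
}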
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_mount.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 09eef1721ef4f..3949f720b5354 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,11 +132,16 @@ xfs_sb_validate_fsb_count(
xfs_sb_t *sbp,
uint64_t nblocks)
{
+ uint64_t max_bytes;
+
ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
ASSERT(sbp->sb_blocklog >= BBSHIFT);
+ if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+ return -EFBIG;
+
/* Limited by ULONG_MAX of page cache index */
- if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+ if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
return -EFBIG;
return 0;
}
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (8 preceding siblings ...)
2024-07-15 9:44 ` [PATCH v10 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-07-15 9:44 ` Pankaj Raghav (Samsung)
2024-07-15 16:46 ` Darrick J. Wong
2024-07-16 15:29 ` Matthew Wilcox
9 siblings, 2 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-15 9:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, ryan.roberts,
hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
The page cache now has the ability to have a minimum order when allocating
a folio, which is a prerequisite for adding support for block size > page
size.
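For illustration, the derived minimum folio order is just the difference
between the block log and the page shift; a tiny userspace sketch (values
illustrative):

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;	/* 4k base pages */
	unsigned int blocklogs[] = { 12, 13, 14, 16 };	/* 4k..64k blocks */

	for (int i = 0; i < 4; i++) {
		unsigned int blocklog = blocklogs[i];
		unsigned int min_order =
			blocklog > page_shift ? blocklog - page_shift : 0;

		printf("bs=%u -> min_folio_order=%u\n",
		       1u << blocklog, min_order);
	}
	return 0;
}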
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/xfs/libxfs/xfs_ialloc.c | 5 +++++
fs/xfs/libxfs/xfs_shared.h | 3 +++
fs/xfs/xfs_icache.c | 6 ++++--
fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_super.c | 30 ++++++++++++++++++++++--------
5 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5bb..1e76431d75a4b 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
igeo->ialloc_align = mp->m_dalign;
else
igeo->ialloc_align = 0;
+
+ if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+ igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+ else
+ igeo->min_folio_order = 0;
}
/* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 34f104ed372c0..e67a1c7cc0b02 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -231,6 +231,9 @@ struct xfs_ino_geometry {
/* precomputed value for di_flags2 */
uint64_t new_diflags2;
+ /* minimum folio order of a page cache allocation */
+ unsigned int min_folio_order;
+
};
#endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e7..0fcf235e50235 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,7 +88,8 @@ xfs_inode_alloc(
/* VFS doesn't initialise i_mode! */
VFS_I(ip)->i_mode = 0;
- mapping_set_large_folios(VFS_I(ip)->i_mapping);
+ mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+ M_IGEO(mp)->min_folio_order);
XFS_STATS_INC(mp, vn_active);
ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -325,7 +326,8 @@ xfs_reinit_inode(
inode->i_uid = uid;
inode->i_gid = gid;
inode->i_state = state;
- mapping_set_large_folios(inode->i_mapping);
+ mapping_set_folio_min_order(inode->i_mapping,
+ M_IGEO(mp)->min_folio_order);
return error;
}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3949f720b5354..c6933440f8066 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -134,7 +134,6 @@ xfs_sb_validate_fsb_count(
{
uint64_t max_bytes;
- ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
ASSERT(sbp->sb_blocklog >= BBSHIFT);
if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7f..3c455ef588d48 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
goto out_free_sb;
}
- /*
- * Until this is fixed only page-sized or smaller data blocks work.
- */
if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
- xfs_warn(mp,
- "File system with blocksize %d bytes. "
- "Only pagesize (%ld) or less will currently work.",
+ size_t max_folio_size = mapping_max_folio_size_supported();
+
+ if (!xfs_has_crc(mp)) {
+ xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
mp->m_sb.sb_blocksize, PAGE_SIZE);
- error = -ENOSYS;
- goto out_free_sb;
+ error = -ENOSYS;
+ goto out_free_sb;
+ }
+
+ if (mp->m_sb.sb_blocksize > max_folio_size) {
+ xfs_warn(mp,
+"block size (%u bytes) not supported; maximum folio size supported in "\
+"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
+ mp->m_sb.sb_blocksize, max_folio_size,
+ MAX_PAGECACHE_ORDER);
+ error = -ENOSYS;
+ goto out_free_sb;
+ }
+
+ xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+ mp->m_sb.sb_blocksize);
}
/* Ensure this filesystem fits in the page cache limits */
--
2.44.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-15 9:44 ` [PATCH v10 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-07-15 16:46 ` Darrick J. Wong
2024-07-22 14:12 ` Pankaj Raghav (Samsung)
2024-07-16 15:29 ` Matthew Wilcox
1 sibling, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2024-07-15 16:46 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
On Mon, Jul 15, 2024 at 11:44:57AM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> The page cache now has the ability to have a minimum order when allocating
> a folio, which is a prerequisite for adding support for block size > page
> size.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/xfs/libxfs/xfs_ialloc.c | 5 +++++
> fs/xfs/libxfs/xfs_shared.h | 3 +++
> fs/xfs/xfs_icache.c | 6 ++++--
> fs/xfs/xfs_mount.c | 1 -
> fs/xfs/xfs_super.c | 30 ++++++++++++++++++++++--------
> 5 files changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 14c81f227c5bb..1e76431d75a4b 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
> igeo->ialloc_align = mp->m_dalign;
> else
> igeo->ialloc_align = 0;
> +
> + if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> + igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> + else
> + igeo->min_folio_order = 0;
> }
>
> /* Compute the location of the root directory inode that is laid out by mkfs. */
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 34f104ed372c0..e67a1c7cc0b02 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -231,6 +231,9 @@ struct xfs_ino_geometry {
> /* precomputed value for di_flags2 */
> uint64_t new_diflags2;
>
> + /* minimum folio order of a page cache allocation */
> + unsigned int min_folio_order;
> +
> };
>
> #endif /* __XFS_SHARED_H__ */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index cf629302d48e7..0fcf235e50235 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -88,7 +88,8 @@ xfs_inode_alloc(
>
> /* VFS doesn't initialise i_mode! */
> VFS_I(ip)->i_mode = 0;
> - mapping_set_large_folios(VFS_I(ip)->i_mapping);
> + mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
> + M_IGEO(mp)->min_folio_order);
>
> XFS_STATS_INC(mp, vn_active);
> ASSERT(atomic_read(&ip->i_pincount) == 0);
> @@ -325,7 +326,8 @@ xfs_reinit_inode(
> inode->i_uid = uid;
> inode->i_gid = gid;
> inode->i_state = state;
> - mapping_set_large_folios(inode->i_mapping);
> + mapping_set_folio_min_order(inode->i_mapping,
> + M_IGEO(mp)->min_folio_order);
> return error;
> }
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 3949f720b5354..c6933440f8066 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -134,7 +134,6 @@ xfs_sb_validate_fsb_count(
> {
> uint64_t max_bytes;
>
> - ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> ASSERT(sbp->sb_blocklog >= BBSHIFT);
>
> if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 27e9f749c4c7f..3c455ef588d48 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
> goto out_free_sb;
> }
>
> - /*
> - * Until this is fixed only page-sized or smaller data blocks work.
> - */
> if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> - xfs_warn(mp,
> - "File system with blocksize %d bytes. "
> - "Only pagesize (%ld) or less will currently work.",
> + size_t max_folio_size = mapping_max_folio_size_supported();
> +
> + if (!xfs_has_crc(mp)) {
> + xfs_warn(mp,
> +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
> mp->m_sb.sb_blocksize, PAGE_SIZE);
> - error = -ENOSYS;
> - goto out_free_sb;
> + error = -ENOSYS;
> + goto out_free_sb;
> + }
> +
> + if (mp->m_sb.sb_blocksize > max_folio_size) {
> + xfs_warn(mp,
> +"block size (%u bytes) not supported; maximum folio size supported in "\
> +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> + mp->m_sb.sb_blocksize, max_folio_size,
> + MAX_PAGECACHE_ORDER);
> + error = -ENOSYS;
> + goto out_free_sb;
Nit: Continuation lines should be indented, not lined up with the next
statement:
xfs_warn(mp,
"block size (%u bytes) not supported; maximum folio size supported in "\
"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
mp->m_sb.sb_blocksize,
max_folio_size,
MAX_PAGECACHE_ORDER);
error = -ENOSYS;
goto out_free_sb;
With that fixed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> + }
> +
> + xfs_warn(mp,
> +"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
> + mp->m_sb.sb_blocksize);
> }
>
> /* Ensure this filesystem fits in the page cache limits */
> --
> 2.44.1
>
>
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-15 9:44 ` [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-07-16 15:26 ` Matthew Wilcox
2024-07-17 9:46 ` Pankaj Raghav (Samsung)
2024-07-22 14:19 ` Pankaj Raghav (Samsung)
0 siblings, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2024-07-16 15:26 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
On Mon, Jul 15, 2024 at 11:44:48AM +0200, Pankaj Raghav (Samsung) wrote:
> +/*
> + * mapping_max_folio_size_supported() - Check the max folio size supported
> + *
> + * The filesystem should call this function at mount time if there is a
> + * requirement on the folio mapping size in the page cache.
> + */
> +static inline size_t mapping_max_folio_size_supported(void)
> +{
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
> + return PAGE_SIZE;
> +}
There's no need for this to be part of this patch. I've removed stuff
from this patch before that's not needed, please stop adding unnecessary
functions. This would logically be part of patch 10.
> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> + unsigned int min,
> + unsigned int max)
> +{
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return;
> +
> + if (min > MAX_PAGECACHE_ORDER) {
> + VM_WARN_ONCE(1,
> + "min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
> + min = MAX_PAGECACHE_ORDER;
> + }
This is really too much. It's something that will never happen. Just
delete the message.
> + if (max > MAX_PAGECACHE_ORDER) {
> + VM_WARN_ONCE(1,
> + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
> + max = MAX_PAGECACHE_ORDER;
Absolutely not. If the filesystem declares it can support a block size
of 4TB, then good for it. We just silently clamp it.
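
For reference, a minimal sketch of the silently-clamping variant being asked
for here (untested; assumes the AS_FOLIO_ORDER_* bit definitions from the
unquoted parts of patch 1):

	static inline void mapping_set_folio_order_range(struct address_space *mapping,
							 unsigned int min,
							 unsigned int max)
	{
		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
			return;

		/* Silently clamp both bounds to what the page cache supports. */
		if (min > MAX_PAGECACHE_ORDER)
			min = MAX_PAGECACHE_ORDER;
		if (max > MAX_PAGECACHE_ORDER)
			max = MAX_PAGECACHE_ORDER;
		if (max < min)
			max = min;

		mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
				 (min << AS_FOLIO_ORDER_MIN) |
				 (max << AS_FOLIO_ORDER_MAX);
	}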
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-15 9:44 ` [PATCH v10 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-07-15 16:46 ` Darrick J. Wong
@ 2024-07-16 15:29 ` Matthew Wilcox
2024-07-16 17:40 ` Darrick J. Wong
1 sibling, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2024-07-16 15:29 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
On Mon, Jul 15, 2024 at 11:44:57AM +0200, Pankaj Raghav (Samsung) wrote:
> +++ b/fs/xfs/xfs_super.c
> @@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
> goto out_free_sb;
> }
>
> - /*
> - * Until this is fixed only page-sized or smaller data blocks work.
> - */
> if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> - xfs_warn(mp,
> - "File system with blocksize %d bytes. "
> - "Only pagesize (%ld) or less will currently work.",
> + size_t max_folio_size = mapping_max_folio_size_supported();
> +
> + if (!xfs_has_crc(mp)) {
> + xfs_warn(mp,
> +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
> mp->m_sb.sb_blocksize, PAGE_SIZE);
> - error = -ENOSYS;
> - goto out_free_sb;
> + error = -ENOSYS;
> + goto out_free_sb;
> + }
> +
> + if (mp->m_sb.sb_blocksize > max_folio_size) {
> + xfs_warn(mp,
> +"block size (%u bytes) not supported; maximum folio size supported in "\
> +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> + mp->m_sb.sb_blocksize, max_folio_size,
> + MAX_PAGECACHE_ORDER);
Again, too much message. Way too much. We shouldn't even allow block
devices to be created if their block size is larger than the max supported
by the page cache.
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-16 15:29 ` Matthew Wilcox
@ 2024-07-16 17:40 ` Darrick J. Wong
2024-07-16 17:46 ` Matthew Wilcox
0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2024-07-16 17:40 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav (Samsung), david, chandan.babu, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, ryan.roberts, hch,
Zi Yan
On Tue, Jul 16, 2024 at 04:29:05PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 15, 2024 at 11:44:57AM +0200, Pankaj Raghav (Samsung) wrote:
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
> > goto out_free_sb;
> > }
> >
> > - /*
> > - * Until this is fixed only page-sized or smaller data blocks work.
> > - */
> > if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > - xfs_warn(mp,
> > - "File system with blocksize %d bytes. "
> > - "Only pagesize (%ld) or less will currently work.",
> > + size_t max_folio_size = mapping_max_folio_size_supported();
> > +
> > + if (!xfs_has_crc(mp)) {
> > + xfs_warn(mp,
> > +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
> > mp->m_sb.sb_blocksize, PAGE_SIZE);
> > - error = -ENOSYS;
> > - goto out_free_sb;
> > + error = -ENOSYS;
> > + goto out_free_sb;
> > + }
> > +
> > + if (mp->m_sb.sb_blocksize > max_folio_size) {
> > + xfs_warn(mp,
> > +"block size (%u bytes) not supported; maximum folio size supported in "\
> > +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > + mp->m_sb.sb_blocksize, max_folio_size,
> > + MAX_PAGECACHE_ORDER);
>
> Again, too much message. Way too much. We shouldn't even allow block
> devices to be created if their block size is larger than the max supported
> by the page cache.
Filesystem blocksize != block device blocksize. xfs still needs this
check because one can xfs_copy a 64k-fsblock xfs to a hdd with 512b
sectors and try to mount that on x86.
Assuming there /is/ some fs that allows 1G blocksize, you'd then really
want a mount check that would prevent you from mounting that.
--D
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-16 17:40 ` Darrick J. Wong
@ 2024-07-16 17:46 ` Matthew Wilcox
2024-07-16 22:37 ` Darrick J. Wong
2024-07-17 10:02 ` Pankaj Raghav (Samsung)
0 siblings, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2024-07-16 17:46 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Pankaj Raghav (Samsung), david, chandan.babu, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, ryan.roberts, hch,
Zi Yan
On Tue, Jul 16, 2024 at 10:40:16AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 16, 2024 at 04:29:05PM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 15, 2024 at 11:44:57AM +0200, Pankaj Raghav (Samsung) wrote:
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
> > > goto out_free_sb;
> > > }
> > >
> > > - /*
> > > - * Until this is fixed only page-sized or smaller data blocks work.
> > > - */
> > > if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > > - xfs_warn(mp,
> > > - "File system with blocksize %d bytes. "
> > > - "Only pagesize (%ld) or less will currently work.",
> > > + size_t max_folio_size = mapping_max_folio_size_supported();
> > > +
> > > + if (!xfs_has_crc(mp)) {
> > > + xfs_warn(mp,
> > > +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
> > > mp->m_sb.sb_blocksize, PAGE_SIZE);
> > > - error = -ENOSYS;
> > > - goto out_free_sb;
> > > + error = -ENOSYS;
> > > + goto out_free_sb;
> > > + }
> > > +
> > > + if (mp->m_sb.sb_blocksize > max_folio_size) {
> > > + xfs_warn(mp,
> > > +"block size (%u bytes) not supported; maximum folio size supported in "\
> > > +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > > + mp->m_sb.sb_blocksize, max_folio_size,
> > > + MAX_PAGECACHE_ORDER);
> >
> > Again, too much message. Way too much. We shouldn't even allow block
> > devices to be created if their block size is larger than the max supported
> > by the page cache.
>
> Filesystem blocksize != block device blocksize. xfs still needs this
> check because one can xfs_copy a 64k-fsblock xfs to a hdd with 512b
> sectors and try to mount that on x86.
>
> Assuming there /is/ some fs that allows 1G blocksize, you'd then really
> want a mount check that would prevent you from mounting that.
Absolutely, we need to have an fs blocksize check in the fs (if only
because fs fuzzers will put random values in fields and expect the system
to not crash). But that should have nothing to do with page cache size.
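
For reference, XFS already range-checks the on-disk block size in its
superblock verifier; a condensed sketch (the real xfs_validate_sb_common()
validates many more fields):

	if (sbp->sb_blocksize < XFS_MIN_BLOCKSIZE ||
	    sbp->sb_blocksize > XFS_MAX_BLOCKSIZE ||
	    sbp->sb_blocksize != (1 << sbp->sb_blocklog)) {
		xfs_notice(mp, "SB sanity check failed");
		return -EFSCORRUPTED;
	}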
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-16 17:46 ` Matthew Wilcox
@ 2024-07-16 22:37 ` Darrick J. Wong
2024-07-17 10:02 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2024-07-16 22:37 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav (Samsung), david, chandan.babu, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, ryan.roberts, hch,
Zi Yan
On Tue, Jul 16, 2024 at 06:46:40PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 16, 2024 at 10:40:16AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 16, 2024 at 04:29:05PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 15, 2024 at 11:44:57AM +0200, Pankaj Raghav (Samsung) wrote:
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
> > > > goto out_free_sb;
> > > > }
> > > >
> > > > - /*
> > > > - * Until this is fixed only page-sized or smaller data blocks work.
> > > > - */
> > > > if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > > > - xfs_warn(mp,
> > > > - "File system with blocksize %d bytes. "
> > > > - "Only pagesize (%ld) or less will currently work.",
> > > > + size_t max_folio_size = mapping_max_folio_size_supported();
> > > > +
> > > > + if (!xfs_has_crc(mp)) {
> > > > + xfs_warn(mp,
> > > > +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
> > > > mp->m_sb.sb_blocksize, PAGE_SIZE);
> > > > - error = -ENOSYS;
> > > > - goto out_free_sb;
> > > > + error = -ENOSYS;
> > > > + goto out_free_sb;
> > > > + }
> > > > +
> > > > + if (mp->m_sb.sb_blocksize > max_folio_size) {
> > > > + xfs_warn(mp,
> > > > +"block size (%u bytes) not supported; maximum folio size supported in "\
> > > > +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > > > + mp->m_sb.sb_blocksize, max_folio_size,
> > > > + MAX_PAGECACHE_ORDER);
> > >
> > > Again, too much message. Way too much. We shouldn't even allow block
> > > devices to be created if their block size is larger than the max supported
> > > by the page cache.
> >
> > Filesystem blocksize != block device blocksize. xfs still needs this
> > check because one can xfs_copy a 64k-fsblock xfs to a hdd with 512b
> > sectors and try to mount that on x86.
> >
> > Assuming there /is/ some fs that allows 1G blocksize, you'd then really
> > want a mount check that would prevent you from mounting that.
>
> Absolutely, we need to have an fs blocksize check in the fs (if only
> because fs fuzzers will put random values in fields and expect the system
> to not crash). But that should have nothing to do with page cache size.
I don't understand your objection -- we're setting the minimum folio
order on a file's pagecache to match the fs-wide blocksize. If the
pagecache can't possibly fulfill our fs-wide requirement, then why would
we continue the mount?
Let's pretend that MAX_PAGECACHE_ORDER is 1. The filesystem has 16k
blocks, the CPU has 4k base pages. xfs will try to set the min folio
order to 2 via mapping_set_folio_order_range. That function clamps it
to 1, so we try to cache a 16k fsblock with 8k pages. Does that
actually work?
If not, then doesn't it make more sense to fail the mount?
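
Spelling out the arithmetic in that hypothetical (illustrative only):

	/*
	 * MAX_PAGECACHE_ORDER == 1, 4k base pages, 16k fsblocks:
	 *
	 *   wanted min order  = sb_blocklog - PAGE_SHIFT = 14 - 12 = 2  (16k)
	 *   clamped min order = MAX_PAGECACHE_ORDER      = 1            (8k)
	 *
	 * An 8k folio cannot span a 16k fsblock, so a single block would
	 * straddle folios -- exactly what the minimum order is meant to
	 * prevent. Hence the argument for failing the mount instead.
	 */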
--D
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-16 15:26 ` Matthew Wilcox
@ 2024-07-17 9:46 ` Pankaj Raghav (Samsung)
2024-07-17 9:59 ` Ryan Roberts
2024-07-22 14:19 ` Pankaj Raghav (Samsung)
1 sibling, 1 reply; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-17 9:46 UTC (permalink / raw)
To: Matthew Wilcox
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
On Tue, Jul 16, 2024 at 04:26:10PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 15, 2024 at 11:44:48AM +0200, Pankaj Raghav (Samsung) wrote:
> > +/*
> > + * mapping_max_folio_size_supported() - Check the max folio size supported
> > + *
> > + * The filesystem should call this function at mount time if there is a
> > + * requirement on the folio mapping size in the page cache.
> > + */
> > +static inline size_t mapping_max_folio_size_supported(void)
> > +{
> > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > + return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
> > + return PAGE_SIZE;
> > +}
>
> There's no need for this to be part of this patch. I've removed stuff
> from this patch before that's not needed, please stop adding unnecessary
> functions. This would logically be part of patch 10.
That makes sense. I will move it to the last patch.
>
> > +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> > + unsigned int min,
> > + unsigned int max)
> > +{
> > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > + return;
> > +
> > + if (min > MAX_PAGECACHE_ORDER) {
> > + VM_WARN_ONCE(1,
> > + "min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
> > + min = MAX_PAGECACHE_ORDER;
> > + }
>
> This is really too much. It's something that will never happen. Just
> delete the message.
>
> > + if (max > MAX_PAGECACHE_ORDER) {
> > + VM_WARN_ONCE(1,
> > + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
> > + max = MAX_PAGECACHE_ORDER;
>
> Absolutely not. If the filesystem declares it can support a block size
> of 4TB, then good for it. We just silently clamp it.
Hmm, but you raised the point about clamping in the previous patches[1]
after Ryan pointed out that we should not silently clamp the order.
```
> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> whatever values are passed in are a hard requirement? So wouldn't want them to
> be silently reduced. (Especially given the recent change to reduce the size of
> MAX_PAGECACHE_ORDER to less then PMD size in some cases).
Hm, yes. We should probably make this return an errno. Including
returning an errno for !IS_ENABLED() and min > 0.
```
It was not clear from the conversation in the previous patches that we
decided to just clamp the order (like it was done before).
So let's just stick with how it was done before where we clamp the
values if min and max > MAX_PAGECACHE_ORDER?
[1] https://lore.kernel.org/linux-fsdevel/Zoa9rQbEUam467-q@casper.infradead.org/
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-17 9:46 ` Pankaj Raghav (Samsung)
@ 2024-07-17 9:59 ` Ryan Roberts
2024-07-17 15:12 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2024-07-17 9:59 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), Matthew Wilcox
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On 17/07/2024 10:46, Pankaj Raghav (Samsung) wrote:
> On Tue, Jul 16, 2024 at 04:26:10PM +0100, Matthew Wilcox wrote:
>> On Mon, Jul 15, 2024 at 11:44:48AM +0200, Pankaj Raghav (Samsung) wrote:
>>> +/*
>>> + * mapping_max_folio_size_supported() - Check the max folio size supported
>>> + *
>>> + * The filesystem should call this function at mount time if there is a
>>> + * requirement on the folio mapping size in the page cache.
>>> + */
>>> +static inline size_t mapping_max_folio_size_supported(void)
>>> +{
>>> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>>> + return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
>>> + return PAGE_SIZE;
>>> +}
>>
>> There's no need for this to be part of this patch. I've removed stuff
>> from this patch before that's not needed, please stop adding unnecessary
>> functions. This would logically be part of patch 10.
>
> That makes sense. I will move it to the last patch.
>
>>
>>> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
>>> + unsigned int min,
>>> + unsigned int max)
>>> +{
>>> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>>> + return;
>>> +
>>> + if (min > MAX_PAGECACHE_ORDER) {
>>> + VM_WARN_ONCE(1,
>>> + "min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
>>> + min = MAX_PAGECACHE_ORDER;
>>> + }
>>
>> This is really too much. It's something that will never happen. Just
>> delete the message.
>>
>>> + if (max > MAX_PAGECACHE_ORDER) {
>>> + VM_WARN_ONCE(1,
>>> + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
>>> + max = MAX_PAGECACHE_ORDER;
>>
>> Absolutely not. If the filesystem declares it can support a block size
>> of 4TB, then good for it. We just silently clamp it.
>
> Hmm, but you raised the point about clamping in the previous patches[1]
> after Ryan pointed out that we should not silently clamp the order.
>
> ```
>> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
>> whatever values are passed in are a hard requirement? So wouldn't want them to
>> be silently reduced. (Especially given the recent change to reduce the size of
>> MAX_PAGECACHE_ORDER to less then PMD size in some cases).
>
> Hm, yes. We should probably make this return an errno. Including
> returning an errno for !IS_ENABLED() and min > 0.
> ```
>
> It was not clear from the conversation in the previous patches that we
> decided to just clamp the order (like it was done before).
>
> So let's just stick with how it was done before where we clamp the
> values if min and max > MAX_PAGECACHE_ORDER?
>
> [1] https://lore.kernel.org/linux-fsdevel/Zoa9rQbEUam467-q@casper.infradead.org/
The way I see it, there are 2 approaches we could take:
1. Implement mapping_max_folio_size_supported(), write a headerdoc for
mapping_set_folio_order_range() that says min must be lte max, max must be lte
mapping_max_folio_size_supported(). Then emit VM_WARN() in
mapping_set_folio_order_range() if the constraints are violated, and clamp to
make it safe (from page cache's perspective). The VM_WARN()s can just be inline
in the if statements to keep them clean. The FS is responsible for checking
mapping_max_folio_size_supported() and ensuring min and max meet requirements.
2. Return an error from mapping_set_folio_order_range() (and the other functions
that set min/max). No need for warning. No state changed if error is returned.
FS can emit warning on error if it wants.
Personally I prefer option 2, but 1 is definitely less churn.
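
For concreteness, a rough sketch of option 2 (untested; errno choices are
placeholders, and the AS_FOLIO_ORDER_* bits are assumed from patch 1):

	static inline int mapping_set_folio_order_range(struct address_space *mapping,
							unsigned int min,
							unsigned int max)
	{
		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
			return min ? -EINVAL : 0;	/* no large folio support */

		if (min > max || max > MAX_PAGECACHE_ORDER)
			return -EINVAL;			/* unsatisfiable range */

		/* No state is changed unless the whole range is valid. */
		mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
				 (min << AS_FOLIO_ORDER_MIN) |
				 (max << AS_FOLIO_ORDER_MAX);
		return 0;
	}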
Thanks,
Ryan
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-16 17:46 ` Matthew Wilcox
2024-07-16 22:37 ` Darrick J. Wong
@ 2024-07-17 10:02 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-17 10:02 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Darrick J. Wong, david, chandan.babu, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
On Tue, Jul 16, 2024 at 06:46:40PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 16, 2024 at 10:40:16AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 16, 2024 at 04:29:05PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 15, 2024 at 11:44:57AM +0200, Pankaj Raghav (Samsung) wrote:
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -1638,16 +1638,30 @@ xfs_fs_fill_super(
> > > > goto out_free_sb;
> > > > }
> > > >
> > > > - /*
> > > > - * Until this is fixed only page-sized or smaller data blocks work.
> > > > - */
> > > > if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > > > - xfs_warn(mp,
> > > > - "File system with blocksize %d bytes. "
> > > > - "Only pagesize (%ld) or less will currently work.",
> > > > + size_t max_folio_size = mapping_max_folio_size_supported();
> > > > +
> > > > + if (!xfs_has_crc(mp)) {
> > > > + xfs_warn(mp,
> > > > +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
> > > > mp->m_sb.sb_blocksize, PAGE_SIZE);
> > > > - error = -ENOSYS;
> > > > - goto out_free_sb;
> > > > + error = -ENOSYS;
> > > > + goto out_free_sb;
> > > > + }
> > > > +
> > > > + if (mp->m_sb.sb_blocksize > max_folio_size) {
> > > > + xfs_warn(mp,
> > > > +"block size (%u bytes) not supported; maximum folio size supported in "\
> > > > +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > > > + mp->m_sb.sb_blocksize, max_folio_size,
> > > > + MAX_PAGECACHE_ORDER);
> > >
> > > Again, too much message. Way too much. We shouldn't even allow block
> > > devices to be created if their block size is larger than the max supported
> > > by the page cache.
> >
> > Filesystem blocksize != block device blocksize. xfs still needs this
> > check because one can xfs_copy a 64k-fsblock xfs to a hdd with 512b
> > sectors and try to mount that on x86.
> >
> > Assuming there /is/ some fs that allows 1G blocksize, you'd then really
> > want a mount check that would prevent you from mounting that.
>
> Absolutely, we need to have an fs blocksize check in the fs (if only
> because fs fuzzers will put random values in fields and expect the system
> to not crash). But that should have nothing to do with page cache size.
Ok, now I am not sure if I completely misunderstood the previous
comments.
One of the comments you gave in the previous series is this[1]:
```
> What are callers supposed to do with an error? In the case of
> setting up a newly allocated inode in XFS, the error would be
> returned in the middle of a transaction and so this failure would
> result in a filesystem shutdown.
I suggest you handle it better than this. If the device is asking for a
blocksize > PMD_SIZE, you should fail to mount it. If the device is
asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
not set, you should also decline to mount the filesystem.
```
That is exactly what we are doing here: we check what the page cache can
support and decline the mount if the maximum folio size supported is
smaller than the block size of the filesystem.
Maybe we can trim the error message to just:
"block size (%u bytes) not supported; only block size (%ld) or less is supported.",
mp->m_sb.sb_blocksize,
max_folio_size);
Let me know what you think.
[1]https://lore.kernel.org/linux-fsdevel/Zoc2rCPC5thSIuoR@casper.infradead.org/
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-17 9:59 ` Ryan Roberts
@ 2024-07-17 15:12 ` Pankaj Raghav (Samsung)
2024-07-17 15:25 ` Darrick J. Wong
2024-07-17 15:26 ` Ryan Roberts
0 siblings, 2 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-17 15:12 UTC (permalink / raw)
To: Ryan Roberts
Cc: Matthew Wilcox, david, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
> >>
> >> This is really too much. It's something that will never happen. Just
> >> delete the message.
> >>
> >>> + if (max > MAX_PAGECACHE_ORDER) {
> >>> + VM_WARN_ONCE(1,
> >>> + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
> >>> + max = MAX_PAGECACHE_ORDER;
> >>
> >> Absolutely not. If the filesystem declares it can support a block size
> >> of 4TB, then good for it. We just silently clamp it.
> >
> > Hmm, but you raised the point about clamping in the previous patches[1]
> > after Ryan pointed out that we should not silently clamp the order.
> >
> > ```
> >> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> >> whatever values are passed in are a hard requirement? So wouldn't want them to
> >> be silently reduced. (Especially given the recent change to reduce the size of
> >> MAX_PAGECACHE_ORDER to less then PMD size in some cases).
> >
> > Hm, yes. We should probably make this return an errno. Including
> > returning an errno for !IS_ENABLED() and min > 0.
> > ```
> >
> > It was not clear from the conversation in the previous patches that we
> > decided to just clamp the order (like it was done before).
> >
> > So let's just stick with how it was done before where we clamp the
> > values if min and max > MAX_PAGECACHE_ORDER?
> >
> > [1] https://lore.kernel.org/linux-fsdevel/Zoa9rQbEUam467-q@casper.infradead.org/
>
> The way I see it, there are 2 approaches we could take:
>
> 1. Implement mapping_max_folio_size_supported(), write a headerdoc for
> mapping_set_folio_order_range() that says min must be lte max, max must be lte
> mapping_max_folio_size_supported(). Then emit VM_WARN() in
> mapping_set_folio_order_range() if the constraints are violated, and clamp to
> make it safe (from page cache's perspective). The VM_WARN()s can just be inline
Inlining with the `if` is not possible since:
91241681c62a ("include/linux/mmdebug.h: make VM_WARN* non-rvals")
> in the if statements to keep them clean. The FS is responsible for checking
> mapping_max_folio_size_supported() and ensuring min and max meet requirements.
This is sort of what is done here, but IIUC willy's reply to the patch,
he prefers silent clamping over warnings. I think that is because we
check the constraints at mount time, so it should be safe to call this?
>
> 2. Return an error from mapping_set_folio_order_range() (and the other functions
> that set min/max). No need for warning. No state changed if error is returned.
> FS can emit warning on error if it wants.
I think Chinner was not happy with this approach because it is done per
inode: we would basically just shut down the filesystem on the first
inode allocation instead of refusing the mount, even though we already
know MAX_PAGECACHE_ORDER during the mount phase.
--
Pankaj
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-17 15:12 ` Pankaj Raghav (Samsung)
@ 2024-07-17 15:25 ` Darrick J. Wong
2024-07-17 15:26 ` Ryan Roberts
1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2024-07-17 15:25 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Ryan Roberts, Matthew Wilcox, david, chandan.babu, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Wed, Jul 17, 2024 at 03:12:51PM +0000, Pankaj Raghav (Samsung) wrote:
> > >>
> > >> This is really too much. It's something that will never happen. Just
> > >> delete the message.
> > >>
> > >>> + if (max > MAX_PAGECACHE_ORDER) {
> > >>> + VM_WARN_ONCE(1,
> > >>> + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
> > >>> + max = MAX_PAGECACHE_ORDER;
> > >>
> > >> Absolutely not. If the filesystem declares it can support a block size
> > >> of 4TB, then good for it. We just silently clamp it.
> > >
> > > Hmm, but you raised the point about clamping in the previous patches[1]
> > > after Ryan pointed out that we should not silently clamp the order.
> > >
> > > ```
> > >> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> > >> whatever values are passed in are a hard requirement? So wouldn't want them to
> > >> be silently reduced. (Especially given the recent change to reduce the size of
> > >> MAX_PAGECACHE_ORDER to less then PMD size in some cases).
> > >
> > > Hm, yes. We should probably make this return an errno. Including
> > > returning an errno for !IS_ENABLED() and min > 0.
> > > ```
> > >
> > > It was not clear from the conversation in the previous patches that we
> > > decided to just clamp the order (like it was done before).
> > >
> > > So let's just stick with how it was done before where we clamp the
> > > values if min and max > MAX_PAGECACHE_ORDER?
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/Zoa9rQbEUam467-q@casper.infradead.org/
> >
> > The way I see it, there are 2 approaches we could take:
> >
> > 1. Implement mapping_max_folio_size_supported(), write a headerdoc for
> > mapping_set_folio_order_range() that says min must be lte max, max must be lte
> > mapping_max_folio_size_supported(). Then emit VM_WARN() in
> > mapping_set_folio_order_range() if the constraints are violated, and clamp to
> > make it safe (from page cache's perspective). The VM_WARN()s can just be inline
>
> Inlining with the `if` is not possible since:
> 91241681c62a ("include/linux/mmdebug.h: make VM_WARN* non-rvals")
>
> > in the if statements to keep them clean. The FS is responsible for checking
> > mapping_max_folio_size_supported() and ensuring min and max meet requirements.
>
> This is sort of what is done here, but IIUC willy's reply to the patch,
> he prefers silent clamping over warnings. I think that is because we
> check the constraints at mount time, so it should be safe to call this?
That's my read of the situation, but I'll ask about it at the next thp
meeting if that helps.
> >
> > 2. Return an error from mapping_set_folio_order_range() (and the other functions
> > that set min/max). No need for warning. No state changed if error is returned.
> > FS can emit warning on error if it wants.
>
> I think Chinner was not happy with this approach because it is done per
> inode: we would basically just shut down the filesystem on the first
> inode allocation instead of refusing the mount, even though we already
> know MAX_PAGECACHE_ORDER during the mount phase.
I agree. Filesystem-wide properties (e.g. fs blocksize) should cause
the mount to fail if the pagecache cannot possibly handle any file
blocks. Inode-specific properties (e.g. the forcealign+notears write
work John Garry is working on) could error out of open() with -EIO, but
that's a specialty file property.
--D
> --
> Pankaj
>
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-17 15:12 ` Pankaj Raghav (Samsung)
2024-07-17 15:25 ` Darrick J. Wong
@ 2024-07-17 15:26 ` Ryan Roberts
1 sibling, 0 replies; 26+ messages in thread
From: Ryan Roberts @ 2024-07-17 15:26 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Matthew Wilcox, david, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On 17/07/2024 16:12, Pankaj Raghav (Samsung) wrote:
>>>>
>>>> This is really too much. It's something that will never happen. Just
>>>> delete the message.
>>>>
>>>>> + if (max > MAX_PAGECACHE_ORDER) {
>>>>> + VM_WARN_ONCE(1,
>>>>> + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
>>>>> + max = MAX_PAGECACHE_ORDER;
>>>>
>>>> Absolutely not. If the filesystem declares it can support a block size
>>>> of 4TB, then good for it. We just silently clamp it.
>>>
>>> Hmm, but you raised the point about clamping in the previous patches[1]
>>> after Ryan pointed out that we should not silently clamp the order.
>>>
>>> ```
>>>> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
>>>> whatever values are passed in are a hard requirement? So wouldn't want them to
>>>> be silently reduced. (Especially given the recent change to reduce the size of
>>>> MAX_PAGECACHE_ORDER to less then PMD size in some cases).
>>>
>>> Hm, yes. We should probably make this return an errno. Including
>>> returning an errno for !IS_ENABLED() and min > 0.
>>> ```
>>>
>>> It was not clear from the conversation in the previous patches that we
>>> decided to just clamp the order (like it was done before).
>>>
>>> So let's just stick with how it was done before where we clamp the
>>> values if min and max > MAX_PAGECACHE_ORDER?
>>>
>>> [1] https://lore.kernel.org/linux-fsdevel/Zoa9rQbEUam467-q@casper.infradead.org/
>>
>> The way I see it, there are 2 approaches we could take:
>>
>> 1. Implement mapping_max_folio_size_supported(), write a headerdoc for
>> mapping_set_folio_order_range() that says min must be lte max, max must be lte
>> mapping_max_folio_size_supported(). Then emit VM_WARN() in
>> mapping_set_folio_order_range() if the constraints are violated, and clamp to
>> make it safe (from page cache's perspective). The VM_WARN()s can just be inline
>
> Inlining with the `if` is not possible since:
> 91241681c62a ("include/linux/mmdebug.h: make VM_WARN* non-rvals")
Ahh my bad. Could use WARN_ON()?
>
>> in the if statements to keep them clean. The FS is responsible for checking
>> mapping_max_folio_size_supported() and ensuring min and max meet requirements.
>
> This is sort of what is done here, but IIUC willy's reply to the patch,
> he prefers silent clamping over warnings. I think that is because we
> check the constraints at mount time, so it should be safe to call this?
I don't want to put words in his mouth, but I thought he was complaining about
the verbosity of the warnings, not their presence.
>
>>
>> 2. Return an error from mapping_set_folio_order_range() (and the other functions
>> that set min/max). No need for warning. No state changed if error is returned.
>> FS can emit warning on error if it wants.
>
> I think Chinner was not happy with this approach because it is done per
> inode: we would basically just shut down the filesystem on the first
> inode allocation instead of refusing the mount, even though we already
> know MAX_PAGECACHE_ORDER during the mount phase.
Ahh that makes sense. Understood.
>
> --
> Pankaj
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-15 16:46 ` Darrick J. Wong
@ 2024-07-22 14:12 ` Pankaj Raghav (Samsung)
2024-07-22 18:49 ` Darrick J. Wong
0 siblings, 1 reply; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-22 14:12 UTC (permalink / raw)
To: Darrick J. Wong
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
> > +
> > + if (mp->m_sb.sb_blocksize > max_folio_size) {
> > + xfs_warn(mp,
> > +"block size (%u bytes) not supported; maximum folio size supported in "\
> > +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > + mp->m_sb.sb_blocksize, max_folio_size,
> > + MAX_PAGECACHE_ORDER);
> > + error = -ENOSYS;
> > + goto out_free_sb;
>
> Nit: Continuation lines should be indented, not lined up with the next
> statement:
>
> xfs_warn(mp,
> "block size (%u bytes) not supported; maximum folio size supported in "\
> "the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> mp->m_sb.sb_blocksize,
> max_folio_size,
> MAX_PAGECACHE_ORDER);
> error = -ENOSYS;
> goto out_free_sb;
@Darrick: As willy pointed out, the error message is a bit long here.
Can we make it as follows:
"block size (%u bytes) not supported; only block size (%ld) or less is supported.",
mp->m_sb.sb_blocksize,
max_folio_size);
This is similar to the previous error and it is more concise IMO.
>
> With that fixed,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D
>
* Re: [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes
2024-07-16 15:26 ` Matthew Wilcox
2024-07-17 9:46 ` Pankaj Raghav (Samsung)
@ 2024-07-22 14:19 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-22 14:19 UTC (permalink / raw)
To: Matthew Wilcox
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
@willy:
I want to clarify before sending the next round of patches, as I didn't
get a reply to my previous email.
To make sure I understand your comments properly:
- I will go back to silent clamping in mapping_set_folio_order_range as
before and remove VM_WARN_ONCE().
- I will move mapping_max_folio_size_supported() to patch 10; filesystems
can use it to check the maximum block size that can be supported and
take the appropriate action (sketched below).
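
For example, the mount-time check could look roughly like this (sketch only,
modelled on the xfs_fs_fill_super() hunk in patch 10; %zu because the helper
returns size_t):

	if (mp->m_sb.sb_blocksize > mapping_max_folio_size_supported()) {
		xfs_warn(mp,
	"block size (%u bytes) not supported; only block size (%zu) or less is supported.",
			mp->m_sb.sb_blocksize,
			mapping_max_folio_size_supported());
		error = -ENOSYS;
		goto out_free_sb;
	}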
--
Pankaj
On Tue, Jul 16, 2024 at 04:26:10PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 15, 2024 at 11:44:48AM +0200, Pankaj Raghav (Samsung) wrote:
> > +/*
> > + * mapping_max_folio_size_supported() - Check the max folio size supported
> > + *
> > + * The filesystem should call this function at mount time if there is a
> > + * requirement on the folio mapping size in the page cache.
> > + */
> > +static inline size_t mapping_max_folio_size_supported(void)
> > +{
> > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > + return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
> > + return PAGE_SIZE;
> > +}
>
> There's no need for this to be part of this patch. I've removed stuff
> from this patch before that's not needed, please stop adding unnecessary
> functions. This would logically be part of patch 10.
>
> > +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> > + unsigned int min,
> > + unsigned int max)
> > +{
> > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > + return;
> > +
> > + if (min > MAX_PAGECACHE_ORDER) {
> > + VM_WARN_ONCE(1,
> > + "min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
> > + min = MAX_PAGECACHE_ORDER;
> > + }
>
> This is really too much. It's something that will never happen. Just
> delete the message.
>
> > + if (max > MAX_PAGECACHE_ORDER) {
> > + VM_WARN_ONCE(1,
> > + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
> > + max = MAX_PAGECACHE_ORDER;
>
> Absolutely not. If the filesystem declares it can support a block size
> of 4TB, then good for it. We just silently clamp it.
>
* Re: [PATCH v10 10/10] xfs: enable block size larger than page size support
2024-07-22 14:12 ` Pankaj Raghav (Samsung)
@ 2024-07-22 18:49 ` Darrick J. Wong
0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2024-07-22 18:49 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, ryan.roberts, hch, Zi Yan
On Mon, Jul 22, 2024 at 02:12:20PM +0000, Pankaj Raghav (Samsung) wrote:
> > > +
> > > + if (mp->m_sb.sb_blocksize > max_folio_size) {
> > > + xfs_warn(mp,
> > > +"block size (%u bytes) not supported; maximum folio size supported in "\
> > > +"the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > > + mp->m_sb.sb_blocksize, max_folio_size,
> > > + MAX_PAGECACHE_ORDER);
> > > + error = -ENOSYS;
> > > + goto out_free_sb;
> >
> > Nit: Continuation lines should be indented, not lined up with the next
> > statement:
> >
> > xfs_warn(mp,
> > "block size (%u bytes) not supported; maximum folio size supported in "\
> > "the page cache is (%ld bytes). Check MAX_PAGECACHE_ORDER (%d)",
> > mp->m_sb.sb_blocksize,
> > max_folio_size,
> > MAX_PAGECACHE_ORDER);
> > error = -ENOSYS;
> > goto out_free_sb;
>
> @Darrick: As willy pointed out, the error message is a bit long here.
> Can we make it as follows:
>
> "block size (%u bytes) not supported; only block size (%ld) or less is supported.",
> mp->m_sb.sb_blocksize,
> max_folio_size);
>
> This is similar to the previous error and it is more concise IMO.
Ah, ok. I suppose printing max_folio_size *and* MAX_PAGECACHE_ORDER is
redundant. The shortened version above is ok by me.
--D
> >
> > With that fixed,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >
> > --D
> >
>
Thread overview: 26+ messages
2024-07-15 9:44 [PATCH v10 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-07-16 15:26 ` Matthew Wilcox
2024-07-17 9:46 ` Pankaj Raghav (Samsung)
2024-07-17 9:59 ` Ryan Roberts
2024-07-17 15:12 ` Pankaj Raghav (Samsung)
2024-07-17 15:25 ` Darrick J. Wong
2024-07-17 15:26 ` Ryan Roberts
2024-07-22 14:19 ` Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-07-15 9:44 ` [PATCH v10 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-07-15 16:46 ` Darrick J. Wong
2024-07-22 14:12 ` Pankaj Raghav (Samsung)
2024-07-22 18:49 ` Darrick J. Wong
2024-07-16 15:29 ` Matthew Wilcox
2024-07-16 17:40 ` Darrick J. Wong
2024-07-16 17:46 ` Matthew Wilcox
2024-07-16 22:37 ` Darrick J. Wong
2024-07-17 10:02 ` Pankaj Raghav (Samsung)