linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v7 00/11] enable bs > ps in XFS
@ 2024-06-07 14:58 Pankaj Raghav (Samsung)
  2024-06-07 14:58 ` [PATCH v7 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
                   ` (10 more replies)
  0 siblings, 11 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

This is the seventh version of the series that enables block size > page size
(Large Block Size) in XFS, targeted for inclusion in 6.11.
The context and motivation can be seen in the cover letter of the RFC v1 [0].
We also recorded a talk about this effort at LPC [1] for anyone who would
like more context.

The major change since v6 is that we retry getting a folio, and we warn
if we fail to get a folio in __filemap_get_folio() when the
order <= min_order (Patch 3) [7].

A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.

I also ran the split_huge_page_test selftest on an XFS filesystem to check
that huge page splits in min order chunks are done correctly.

No regressions were found with these patches added on top.

Progression testing:
For progression testing, we tested for 8k, 16k, 32k and 64k block sizes. To
compare it with existing support, an ARM VM with a 64k base page size (without
our patches) was used as a reference to identify actual failures due to LBS
support on a 4k base page size system.

There are some tests that assume block size < page size that need to be fixed.
We have a tree with fixes for xfstests [4]; most of the changes have been posted
already, and only a few minor changes still need to be posted. Part of these
changes has already been upstreamed to fstests, and new tests have also been
written and are out for review, namely for mmap zeroing-around corner cases,
compaction and fsstress races on mm, and stress testing folio truncation on
file-mapped folios.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k block
size against pmem and NVMe, with buffered IO and Direct IO, comparing vanilla
against these patches applied, and detected no regressions.

We also wrote an eBPF tool called blkalgn [5] to see if IO sent to the device
is aligned and at least filesystem block size in length.

For those who want this in a git tree, it is available in the kdevops tree
under the large-block-minorder-for-next-v7 tag [6].

[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung
tasks and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t

Changes since v6:
- Warn users if we can't get a min order folio in __filemap_get_folio().
- Added iomap_dio_init() function and moved zero buffer init into that.
- Modified split_huge_pages_pid() to also consider non-anonymous memory
  and removed condition for anonymous memory in split_huge_pages_file().
- Collected RVB from Hannes.

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Hannes Reinecke (1):
  readahead: rework loop in page_cache_ra_unbounded()

Luis Chamberlain (1):
  mm: split a folio in minimum folio order chunks

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in
    folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/internal.h                 |   5 ++
 fs/iomap/buffered-io.c        |   6 ++
 fs/iomap/direct-io.c          |  26 ++++++++-
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |  11 +++-
 fs/xfs/xfs_super.c            |  18 +++---
 include/linux/huge_mm.h       |  14 +++--
 include/linux/pagemap.h       | 106 +++++++++++++++++++++++++++++-----
 mm/filemap.c                  |  38 +++++++-----
 mm/huge_memory.c              |  55 ++++++++++++++++--
 mm/readahead.c                |  98 ++++++++++++++++++++++++-------
 15 files changed, 330 insertions(+), 78 deletions(-)


base-commit: d97496ca23a2d4ee80b7302849404859d9058bcd
-- 
2.44.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v7 01/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-07 14:58 ` [PATCH v7 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Hannes Reinecke <hare@suse.de>

Rework the loop in page_cache_ra_unbounded() to advance with
the number of pages in a folio instead of just one page at a time.

Note that the index is incremented by 1 if filemap_add_folio() fails
because the size of the folio we are trying to add is 1 (order 0).

Signed-off-by: Hannes Reinecke <hare@suse.de>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c1b23989d9ca..75e934a1fd78 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -208,7 +208,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -226,7 +226,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 		int ret;
 
@@ -240,8 +240,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -256,13 +256,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 				break;
 			read_pages(ractl);
 			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 
 	/*
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 02/11] fs: Allow fine-grained control of folio sizes
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-06-07 14:58 ` [PATCH v7 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-12 15:38   ` Darrick J. Wong
  2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We need filesystems to be able to communicate acceptable folio sizes
to the pagecache for a variety of uses (e.g. large block sizes).
Support a range of folio sizes between order-0 and order-31.
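
To make the encoding concrete: the minimum order is stored in bits 16-20 of
mapping->flags and the maximum order in bits 21-25. The following is a
standalone userspace sketch of that packing (not part of this patch; the
example orders are chosen only for illustration):

#include <stdio.h>

#define AS_FOLIO_ORDER_BITS	5
#define AS_FOLIO_ORDER_MIN	16
#define AS_FOLIO_ORDER_MAX	(AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS)
#define AS_FOLIO_ORDER_MASK	((1u << AS_FOLIO_ORDER_BITS) - 1)

int main(void)
{
	unsigned long flags = 0;
	unsigned int min = 2, max = 8;	/* e.g. 16k min folios on 4k pages */

	/* pack both orders into the per-mapping flags word */
	flags |= (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);

	printf("min order %lu, max order %lu\n",
	       (flags >> AS_FOLIO_ORDER_MIN) & AS_FOLIO_ORDER_MASK,
	       (flags >> AS_FOLIO_ORDER_MAX) & AS_FOLIO_ORDER_MASK);
	return 0;
}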

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/pagemap.h | 86 ++++++++++++++++++++++++++++++++++-------
 mm/filemap.c            |  6 +--
 mm/readahead.c          |  4 +-
 3 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8f09ed4a4451..228275e7049f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -204,14 +204,21 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
-	AS_INACCESSIBLE,	/* Do not attempt direct R/W access to the mapping */
+	AS_UNMOVABLE = 8,	/* The mapping cannot be moved, ever */
+	AS_INACCESSIBLE = 9,	/* Do not attempt direct R/W access to the mapping */
+	/* Bits 16-25 are used for FOLIO_ORDER */
+	AS_FOLIO_ORDER_BITS = 5,
+	AS_FOLIO_ORDER_MIN = 16,
+	AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
 };
 
+#define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)
+#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
+#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -360,9 +367,49 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #define MAX_PAGECACHE_ORDER	8
 #endif
 
+/*
+ * mapping_set_folio_order_range() - Set the orders supported by a file.
+ * @mapping: The address space of the file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file.  This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+						 unsigned int min,
+						 unsigned int max)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return;
+
+	if (min > MAX_PAGECACHE_ORDER)
+		min = MAX_PAGECACHE_ORDER;
+	if (max > MAX_PAGECACHE_ORDER)
+		max = MAX_PAGECACHE_ORDER;
+	if (max < min)
+		max = min;
+
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+		(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address space of the file.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -373,7 +420,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline
+unsigned int mapping_max_folio_order(const struct address_space *mapping)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 0;
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline
+unsigned int mapping_min_folio_order(const struct address_space *mapping)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 0;
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
 /*
@@ -382,16 +445,13 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return mapping_max_folio_order(mapping) > 0;
 }
 
 /* Return the maximum folio size for this pagecache mapping, in bytes. */
-static inline size_t mapping_max_folio_size(struct address_space *mapping)
+static inline size_t mapping_max_folio_size(const struct address_space *mapping)
 {
-	if (mapping_large_folio_support(mapping))
-		return PAGE_SIZE << MAX_PAGECACHE_ORDER;
-	return PAGE_SIZE;
+	return PAGE_SIZE << mapping_max_folio_order(mapping);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index 37061aafd191..46c7a6f59788 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1933,10 +1933,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		if (!mapping_large_folio_support(mapping))
-			order = 0;
-		if (order > MAX_PAGECACHE_ORDER)
-			order = MAX_PAGECACHE_ORDER;
+		if (order > mapping_max_folio_order(mapping))
+			order = mapping_max_folio_order(mapping);
 		/* If we're not aligned, allocate a smaller folio */
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
diff --git a/mm/readahead.c b/mm/readahead.c
index 75e934a1fd78..da34b28da02c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -504,9 +504,9 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < MAX_PAGECACHE_ORDER) {
+	if (new_order < mapping_max_folio_order(mapping)) {
 		new_order += 2;
-		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
+		new_order = min(mapping_max_folio_order(mapping), new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 	}
 
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-06-07 14:58 ` [PATCH v7 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
  2024-06-07 14:58 ` [PATCH v7 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-12  9:01   ` Hannes Reinecke
                     ` (3 more replies)
  2024-06-07 14:58 ` [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  10 siblings, 4 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would default
to an order-0 folio if the higher order memory allocation failed.

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
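
For example (a standalone userspace sketch, not part of this patch): with a
16k block size on a 4k page size system, mapping_min_order is 2 and
min_nrpages is 4, so an index of 5 is rounded down to 4 before the folio is
added to the page cache:

#include <stdio.h>

/* userspace sketch mirroring mapping_align_start_index() */
static unsigned long align_start_index(unsigned long index, unsigned int min_order)
{
	unsigned long min_nrpages = 1UL << min_order;

	return index & ~(min_nrpages - 1);	/* round_down(index, min_nrpages) */
}

int main(void)
{
	printf("index 5  -> %lu\n", align_start_index(5, 2));	/* 4 */
	printf("index 11 -> %lu\n", align_start_index(11, 2));	/* 8 */
	return 0;
}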

Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/pagemap.h | 20 ++++++++++++++++++++
 mm/filemap.c            | 26 ++++++++++++++++++--------
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 228275e7049f..899b8d751768 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -439,6 +439,26 @@ unsigned int mapping_min_folio_order(const struct address_space *mapping)
 	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
+static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_start_index() - Align starting index based on the min
+ * folio order of the page cache.
+ * @mapping: The address_space.
+ *
+ * Ensure the index used is aligned to the minimum folio order when adding
+ * new folios to the page cache by rounding down to the nearest minimum
+ * folio number of pages.
+ */
+static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
+						pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
+}
+
 /*
  * Large folio support currently depends on THP.  These dependencies are
  * being worked on but are not yet fixed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 46c7a6f59788..8bb0d2bc93c5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -859,6 +859,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
@@ -1919,8 +1921,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_start_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1943,7 +1947,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order > 0)
+			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
 			folio = filemap_alloc_folio(alloc_gfp, order);
 			if (!folio)
@@ -1958,7 +1962,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2447,13 +2451,16 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -2471,6 +2478,8 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	/* index in PAGE units but aligned to min_order number of pages. */
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2531,8 +2540,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3748,9 +3756,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
+		index = mapping_align_start_index(mapping, index);
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-12 18:50   ` Matthew Wilcox
  2024-06-07 14:58 ` [PATCH v7 05/11] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios)
if there was no folio found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.
When read_pages() is triggered and if a page is already present, check
for truncation and move the ractl->_index by mapping_min_nrpages if that
folio was truncated. This is done to ensure we keep the alignment
requirement while adding a folio to the page cache.

page_cache_ra_order() tries to allocate folios in the page cache with a
higher order if the index aligns with that order. Modify it so that the
order does not go below the mapping_min_order requirement of the page
cache. This function will do the right thing even if the new_order passed
is less than the mapping_min_order.
When adding new folios to the page cache we must also ensure the index
used is aligned to the mapping_min_order as the page cache requires the
index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of
the readahead so this function can assume ractl->_index to be aligned with
min_order.
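
To illustrate the index and mark adjustment (a standalone userspace sketch,
not part of this patch), assume a 16k block size on 4k pages (min_nrpages = 4)
and a readahead request starting at index 6 with nr_to_read = 16 and
lookahead_size = 8:

#include <stdio.h>

#define round_up(x, y)   ((((x) + (y) - 1) / (y)) * (y))
#define round_down(x, y) (((x) / (y)) * (y))

int main(void)
{
	unsigned long orig_index = 6, nr_to_read = 16, lookahead_size = 8;
	unsigned long min_nrpages = 4;		/* 16k blocks on 4k pages */
	unsigned long index, ra_folio_index, mark;

	/* align the start index down to the minimum folio order */
	index = round_down(orig_index, min_nrpages);			/* 4 */
	/* round the async readahead mark up to a folio boundary */
	ra_folio_index = round_up(orig_index + nr_to_read - lookahead_size,
				  min_nrpages);				/* 16 */
	mark = ra_folio_index - index;					/* 12 */
	/* grow nr_to_read by the amount the start index moved back */
	if (index != orig_index)
		nr_to_read += orig_index - index;			/* 18 */

	printf("index=%lu mark=%lu nr_to_read=%lu\n", index, mark, nr_to_read);
	return 0;
}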

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 85 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 71 insertions(+), 14 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index da34b28da02c..389cd802da63 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long ra_folio_index, index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i = 0;
+	unsigned long mark, i = 0;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,6 +224,22 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_start_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	if (index != readahead_index(ractl)) {
+		nr_to_read += readahead_index(ractl) - index;
+		ractl->_index = index;
+	}
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -230,7 +247,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 		int ret;
 
+
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -240,12 +259,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_pages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 
@@ -255,11 +286,11 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			if (ret == -ENOMEM)
 				break;
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages += folio_nr_pages(folio);
@@ -493,13 +524,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -508,11 +545,20 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		new_order += 2;
 		new_order = min(mapping_max_folio_order(mapping), new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+		new_order = max(new_order, min_order);
 	}
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_start_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -520,7 +566,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -784,8 +830,15 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -795,9 +848,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -807,7 +862,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -822,9 +877,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -834,10 +891,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-06-07 14:58 ` [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-07 16:58   ` Zi Yan
  2024-06-12  9:02   ` Hannes Reinecke
  2024-06-07 14:58 ` [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Luis Chamberlain <mcgrof@kernel.org>

split_folio() and split_folio_to_list() assume order 0. To support
minorder for non-anonymous folios, we must expand these to check the
folio mapping order and use that.

Set new_order to be at least the minimum folio order, if one is set, in
split_huge_page_to_list() so that we can maintain the minimum folio order
requirement in the page cache.

Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.
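
In effect, the split target order for a file-backed folio is clamped to the
mapping's minimum order. A standalone userspace sketch of that clamping (not
part of this patch; the example min order is chosen for illustration):

#include <stdio.h>

/* clamp the requested split order to the mapping's minimum order */
static unsigned int clamp_split_order(unsigned int new_order, unsigned int min_order)
{
	return new_order > min_order ? new_order : min_order;
}

int main(void)
{
	unsigned int min_order = 3;	/* e.g. 32k blocks on 4k pages */

	printf("requested order 0 -> split to order %u\n",
	       clamp_split_order(0, min_order));	/* 3 */
	printf("requested order 5 -> split to order %u\n",
	       clamp_split_order(5, min_order));	/* 5 */
	return 0;
}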

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/huge_mm.h | 14 ++++++++---
 mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 020e2344eb86..15caa4e7b00e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -88,6 +88,8 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
+#define split_folio(f) split_folio_to_list(f, NULL)
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -307,9 +309,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_huge_page_to_list_to_order(page, NULL, 0);
+	return split_folio(page_folio(page));
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -474,6 +477,12 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
@@ -578,7 +587,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
 	return split_folio_to_list_to_order(folio, NULL, new_order);
 }
 
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e49f402d7c7..399a4f5125c7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3068,6 +3068,9 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
  * released, or if some unexpected race happened (e.g., anon VMA disappeared,
  * truncation).
  *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
@@ -3143,6 +3146,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
+		unsigned int min_order;
 		gfp_t gfp;
 
 		mapping = folio->mapping;
@@ -3153,6 +3157,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		min_order = mapping_min_folio_order(folio->mapping);
+		if (new_order < min_order) {
+			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+				     min_order);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
@@ -3264,6 +3276,21 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	unsigned int min_order = 0;
+
+	if (!folio_test_anon(folio)) {
+		if (!folio->mapping) {
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+			return -EBUSY;
+		}
+		min_order = mapping_min_folio_order(folio->mapping);
+	}
+
+	return split_huge_page_to_list_to_order(&folio->page, list, min_order);
+}
+
 void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
@@ -3493,6 +3520,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct page *page;
 		struct folio *folio;
+		struct address_space *mapping;
+		unsigned int target_order = new_order;
 
 		if (!vma)
 			break;
@@ -3513,7 +3542,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!is_transparent_hugepage(folio))
 			goto next;
 
-		if (new_order >= folio_order(folio))
+		if (!folio_test_anon(folio)) {
+			mapping = folio->mapping;
+			target_order = max(new_order,
+					   mapping_min_folio_order(mapping));
+		}
+
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		total++;
@@ -3529,9 +3564,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!folio_test_anon(folio) && folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
@@ -3556,6 +3595,7 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	pgoff_t index;
 	int nr_pages = 1;
 	unsigned long total = 0, split = 0;
+	unsigned int min_order;
 
 	file = getname_kernel(file_path);
 	if (IS_ERR(file))
@@ -3569,9 +3609,11 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		 file_path, off_start, off_end);
 
 	mapping = candidate->f_mapping;
+	min_order = mapping_min_folio_order(mapping);
 
 	for (index = off_start; index < off_end; index += nr_pages) {
 		struct folio *folio = filemap_get_folio(mapping, index);
+		unsigned int target_order = new_order;
 
 		nr_pages = 1;
 		if (IS_ERR(folio))
@@ -3580,18 +3622,23 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (!folio_test_large(folio))
 			goto next;
 
+		target_order = max(new_order, min_order);
 		total++;
 		nr_pages = folio_nr_pages(folio);
 
-		if (new_order >= folio_order(folio))
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-06-07 14:58 ` [PATCH v7 05/11] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-12 19:08   ` Matthew Wilcox
  2024-06-07 14:58 ` [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

Usually the page cache does not extend beyond the size of the inode;
therefore, no PTEs are created for folios that extend beyond the size.

But with LBS support, we might extend the page cache beyond the size of the
inode as we need to guarantee folios of minimum order. Cap the PTE range
to be created for the page cache up to the maximum allowed zero-fill file
end, which is aligned to PAGE_SIZE.

An fstests test has been created to trigger this edge case [0].

[0] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/
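
As a worked example (a standalone userspace sketch, not part of this patch;
numbers chosen only for illustration): with a 10000-byte file on 4k pages, the
last page cache index backed by data is 2, so a fault over a 32k folio
(end_pgoff = 7) is capped to 2:

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long page_size = 4096;
	unsigned long i_size = 10000;		/* file size in bytes */
	unsigned long end_pgoff = 7;		/* last index of a 32k folio */
	unsigned long file_end;

	file_end = DIV_ROUND_UP(i_size, page_size) - 1;		/* 2 */
	if (end_pgoff > file_end)
		end_pgoff = file_end;

	printf("PTEs are created up to index %lu\n", end_pgoff);
	return 0;
}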

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/filemap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8bb0d2bc93c5..0e48491b3d10 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3610,7 +3610,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	pgoff_t last_pgoff = start_pgoff;
+	pgoff_t file_end, last_pgoff = start_pgoff;
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
@@ -3636,6 +3636,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (end_pgoff > file_end)
+		end_pgoff = file_end;
+
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-06-07 14:58 ` [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-11  7:38   ` John Garry
  2024-06-12 20:40   ` Darrick J. Wong
  2024-06-07 14:58 ` [PATCH v7 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
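
The fix below preallocates a 64k zeroed buffer at iomap init time (64k being
the largest block size supported by this series) and pads from that instead
of ZERO_PAGE(0). A standalone sketch of the order calculation for that buffer,
assuming a 4k page size (not the kernel's get_order() implementation):

#include <stdio.h>

/* rough userspace equivalent of get_order() for a 4k page size */
static int order_for_size(unsigned long size)
{
	unsigned long pages = (size + 4095) / 4096;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	/* a 64k zero buffer needs an order-4 allocation on 4k pages */
	printf("ZERO_FSB_ORDER = %d\n", order_for_size(65536));
	return 0;
}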

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 fs/internal.h          |  5 +++++
 fs/iomap/buffered-io.c |  6 ++++++
 fs/iomap/direct-io.c   | 26 ++++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 84f371193f74..30217f0ff4c6 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,11 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 		get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/direct-io.c
+ */
+int iomap_dio_init(void);
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 49938419fcc7..9f791db473e4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1990,6 +1990,12 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
 
 static int __init iomap_init(void)
 {
+	int ret;
+
+	ret = iomap_dio_init();
+	if (ret)
+		return ret;
+
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..b95600b254a3 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -27,6 +27,13 @@
 #define IOMAP_DIO_WRITE		(1U << 30)
 #define IOMAP_DIO_DIRTY		(1U << 31)
 
+/*
+ * Used for sub block zeroing in iomap_dio_zero()
+ */
+#define ZERO_FSB_SIZE (65536)
+#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
+static struct page *zero_fs_block;
+
 struct iomap_dio {
 	struct kiocb		*iocb;
 	const struct iomap_dio_ops *dops;
@@ -52,6 +59,16 @@ struct iomap_dio {
 	};
 };
 
+int iomap_dio_init(void)
+{
+	zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
+
+	if (!zero_fs_block)
+		return -ENOMEM;
+
+	return 0;
+}
+
 static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
 		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
 {
@@ -236,17 +253,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
+	/*
+	 * Max block size supported is 64k
+	 */
+	WARN_ON_ONCE(len > ZERO_FSB_SIZE);
+
 	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	__bio_add_page(bio, zero_fs_block, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 08/11] xfs: use kvmalloc for xattr buffers
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-06-07 14:58 ` [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-06-07 14:58 ` Pankaj Raghav (Samsung)
  2024-06-07 14:59 ` [PATCH v7 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:58 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry,
	Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:

static inline
struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
>>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...

Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().

This is not an issue that production systems will see, as
filesystems with block size > page size cannot be mounted by the
kernel; Pankaj is developing this functionality right now.

Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b9e98950eb3d..09f4cb061a6e 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1138,10 +1138,7 @@ xfs_attr3_leaf_to_shortform(
 
 	trace_xfs_attr_leaf_to_sf(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
-	if (!tmpbuffer)
-		return -ENOMEM;
-
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 
 	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1205,7 +1202,7 @@ xfs_attr3_leaf_to_shortform(
 	error = 0;
 
 out:
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 	return error;
 }
 
@@ -1613,7 +1610,7 @@ xfs_attr3_leaf_compact(
 
 	trace_xfs_attr_leaf_compact(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 	memset(bp->b_addr, 0, args->geo->blksize);
 	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1651,7 +1648,7 @@ xfs_attr3_leaf_compact(
 	 */
 	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
 
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 }
 
 /*
@@ -2330,7 +2327,7 @@ xfs_attr3_leaf_unbalance(
 		struct xfs_attr_leafblock *tmp_leaf;
 		struct xfs_attr3_icleaf_hdr tmphdr;
 
-		tmp_leaf = kzalloc(state->args->geo->blksize,
+		tmp_leaf = kvzalloc(state->args->geo->blksize,
 				GFP_KERNEL | __GFP_NOFAIL);
 
 		/*
@@ -2371,7 +2368,7 @@ xfs_attr3_leaf_unbalance(
 		}
 		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
 		savehdr = tmphdr; /* struct copy */
-		kfree(tmp_leaf);
+		kvfree(tmp_leaf);
 	}
 
 	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 09/11] xfs: expose block size in stat
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-06-07 14:58 ` [PATCH v7 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-06-07 14:59 ` Pankaj Raghav (Samsung)
  2024-06-07 14:59 ` [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
  2024-06-07 14:59 ` [PATCH v7 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  10 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:59 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
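
For example, after this change stat() on a file in an XFS filesystem with 16k
blocks reports st_blksize = 16384 instead of PAGE_SIZE. A minimal userspace
check (a sketch, not part of this patch; the path argument is whatever file
lives on the LBS filesystem):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) != 0)
		return 1;

	/* with this patch, st_blksize is max(PAGE_SIZE, fs block size) */
	printf("st_blksize = %ld\n", (long)st.st_blksize);
	return 0;
}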

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..a7883303dee8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -560,7 +560,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-06-07 14:59 ` [PATCH v7 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-06-07 14:59 ` Pankaj Raghav (Samsung)
  2024-06-13  8:45   ` Christoph Hellwig
  2024-06-07 14:59 ` [PATCH v7 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  10 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:59 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that the page cache count can be calculated
correctly for LBS.
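
The old check computed nblocks >> (PAGE_SHIFT - sb_blocklog), which no longer
works once sb_blocklog > PAGE_SHIFT (the shift count would go negative). The
reworked calculation shifts up to bytes with an overflow check and then down
by PAGE_SHIFT. A standalone userspace sketch of the same logic (not part of
this patch; check_shl_overflow() is approximated, values chosen for
illustration):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* rough userspace stand-in for check_shl_overflow() */
static bool shl_overflows(uint64_t val, unsigned int shift, uint64_t *res)
{
	if (shift >= 64)
		return true;
	*res = val << shift;
	return val && (*res >> shift) != val;
}

int main(void)
{
	uint64_t nblocks = 1ULL << 50;	/* hypothetical block count */
	unsigned int blocklog = 16;	/* 64k blocks */
	unsigned int page_shift = 12;	/* 4k pages */
	uint64_t max_bytes;

	if (shl_overflows(nblocks, blocklog, &max_bytes)) {
		printf("-EFBIG: byte count overflows\n");
		return 0;
	}

	/* limited by ULONG_MAX of the page cache index */
	printf("max page cache index = %llu\n",
	       (unsigned long long)(max_bytes >> page_shift));
	return 0;
}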

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_mount.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 09eef1721ef4..46cb0384143b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,11 +132,19 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
+	uint64_t		max_index;
+	uint64_t		max_bytes;
+
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
+
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	max_index = max_bytes >> PAGE_SHIFT;
+
+	if (max_index > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v7 11/11] xfs: enable block size larger than page size support
  2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-06-07 14:59 ` [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-06-07 14:59 ` Pankaj Raghav (Samsung)
  2024-06-13  8:47   ` Christoph Hellwig
  10 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 14:59 UTC (permalink / raw)
  To: david, djwong, chandan.babu, brauner, akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, kernel, hch, gost.dev, cl, john.g.garry

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to have a minimum order when allocating
a folio, which is a prerequisite for adding support for block size > page
size.
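
The minimum folio order is derived directly from the block size:
min_folio_order = sb_blocklog - PAGE_SHIFT whenever the block size exceeds
the page size. A standalone userspace sketch of that mapping, assuming a 4k
page size (not part of this patch):

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;				/* 4k pages */
	unsigned int blocklogs[] = { 12, 13, 14, 15, 16 };	/* 4k..64k */
	unsigned int i;

	for (i = 0; i < sizeof(blocklogs) / sizeof(blocklogs[0]); i++) {
		unsigned int min_order = blocklogs[i] > page_shift ?
					 blocklogs[i] - page_shift : 0;

		printf("block size %6u -> min folio order %u (%2u pages)\n",
		       1u << blocklogs[i], min_order, 1u << min_order);
	}
	return 0;
}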

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 18 ++++++++++--------
 5 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..1e76431d75a4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 34f104ed372c..e67a1c7cc0b0 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -231,6 +231,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0953163a2d84..5ed3dc9e7d90 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -324,7 +325,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 46cb0384143b..a99454208807 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -135,7 +135,6 @@ xfs_sb_validate_fsb_count(
 	uint64_t		max_index;
 	uint64_t		max_bytes;
 
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..b8a93a8f35ca 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,18 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
+		if (!xfs_has_crc(mp)) {
+			xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
 				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 14:58 ` [PATCH v7 05/11] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-06-07 16:58   ` Zi Yan
  2024-06-07 17:01     ` Matthew Wilcox
  2024-06-07 20:30     ` Pankaj Raghav (Samsung)
  2024-06-12  9:02   ` Hannes Reinecke
  1 sibling, 2 replies; 58+ messages in thread
From: Zi Yan @ 2024-06-07 16:58 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

[-- Attachment #1: Type: text/plain, Size: 1552 bytes --]

Hi Pankaj,

Can you use ziy@nvidia.com instead of zi.yan@sent.com? Since I just use the latter
to send patches. Thanks.

On 7 Jun 2024, at 10:58, Pankaj Raghav (Samsung) wrote:

> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  include/linux/huge_mm.h | 14 ++++++++---
>  mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 61 insertions(+), 8 deletions(-)
>

<snip>

>
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> +	unsigned int min_order = 0;
> +
> +	if (!folio_test_anon(folio)) {
> +		if (!folio->mapping) {
> +			count_vm_event(THP_SPLIT_PAGE_FAILED);

You should only increase this counter when the input folio is a THP, namely
folio_test_pmd_mappable(folio) is true. For other large folios, we will
need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
See enum mthp_stat_item in include/linux/huge_mm.h.
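
(For illustration only, a minimal sketch of the suggested guard, keeping the
existing THP event for PMD-sized folios; the error path is assumed from
context and any new mTHP counter is left out since it does not exist yet:)

	if (!folio_test_anon(folio)) {
		if (!folio->mapping) {
			/* only PMD-sized folios count against the THP stat */
			if (folio_test_pmd_mappable(folio))
				count_vm_event(THP_SPLIT_PAGE_FAILED);
			return -EBUSY;	/* return value assumed */
		}
	}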


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 16:58   ` Zi Yan
@ 2024-06-07 17:01     ` Matthew Wilcox
  2024-06-07 20:45       ` Pankaj Raghav (Samsung)
  2024-06-07 20:30     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-07 17:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: Pankaj Raghav (Samsung), david, djwong, chandan.babu, brauner,
	akpm, mcgrof, linux-mm, hare, linux-kernel, yang, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
> > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > +{
> > +	unsigned int min_order = 0;
> > +
> > +	if (!folio_test_anon(folio)) {
> > +		if (!folio->mapping) {
> > +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> 
> You should only increase this counter when the input folio is a THP, namely
> folio_test_pmd_mappable(folio) is true. For other large folios, we will
> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> See enum mthp_stat_item in include/linux/huge_mm.h.

Also, why should this count as a split failure?  If we see a NULL
mapping, the folio has been truncated and so no longer needs to be
split.  I understand we currently count it as a failure, but I
don't think we should.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 16:58   ` Zi Yan
  2024-06-07 17:01     ` Matthew Wilcox
@ 2024-06-07 20:30     ` Pankaj Raghav (Samsung)
  2024-06-07 20:51       ` Zi Yan
  1 sibling, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 20:30 UTC (permalink / raw)
  To: Zi Yan
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
> Hi Pankaj,
> 
> Can you use ziy@nvidia.com instead of zi.yan@sent.com? Since I just use the latter
> to send patches. Thanks.

Got it!

> 
> On 7 Jun 2024, at 10:58, Pankaj Raghav (Samsung) wrote:
> 
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > split_folio() and split_folio_to_list() assume order 0, to support
> > minorder for non-anonymous folios, we must expand these to check the
> > folio mapping order and use that.
> >
> > Set new_order to be at least minimum folio order if it is set in
> > split_huge_page_to_list() so that we can maintain minimum folio order
> > requirement in the page cache.
> >
> > Update the debugfs write files used for testing to ensure the order
> > is respected as well. We simply enforce the min order when a file
> > mapping is used.
> >
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  include/linux/huge_mm.h | 14 ++++++++---
> >  mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 61 insertions(+), 8 deletions(-)
> >
> 
> <snip>
> 
> >
> > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > +{
> > +	unsigned int min_order = 0;
> > +
> > +	if (!folio_test_anon(folio)) {
> > +		if (!folio->mapping) {
> > +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> 
> You should only increase this counter when the input folio is a THP, namely
> folio_test_pmd_mappable(folio) is true. For other large folios, we will
> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> See enum mthp_stat_item in include/linux/huge_mm.h.
> 
Hmm, but we don't have mTHP support for non-anonymous memory right? In
that case it won't be applicable for file backed memory? 

I am not an expert there so correct me if I am wrong.

--
Regards,
Pankaj


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 17:01     ` Matthew Wilcox
@ 2024-06-07 20:45       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-07 20:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, david, djwong, chandan.babu, brauner, akpm, mcgrof,
	linux-mm, hare, linux-kernel, yang, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry, kirill.shutemov

On Fri, Jun 07, 2024 at 06:01:56PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
> > > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > > +{
> > > +	unsigned int min_order = 0;
> > > +
> > > +	if (!folio_test_anon(folio)) {
> > > +		if (!folio->mapping) {
> > > +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> > 
> > You should only increase this counter when the input folio is a THP, namely
> > folio_test_pmd_mappable(folio) is true. For other large folios, we will
> > need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> > See enum mthp_stat_item in include/linux/huge_mm.h.
> 
> Also, why should this count as a split failure?  If we see a NULL
> mapping, the folio has been truncated and so no longer needs to be
> split.  I understand we currently count it as a failure, but I
> don't think we should.

I also thought about this. If the folio was under writeback, we don't
account it as a failure, so why should we account it as one when it was
truncated?

I can remove the accounting that we added as part of this series in the
next version and address the upstream changes [1] in a separate
standalone patch?
I prefer to address these kinds of still-open upstream changes
separately so that we don't delay this series.

Let me know what you think. CCing Kirill as he made those changes.

[1]
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 399a4f5125c7..21f2dd5eb4c5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3152,10 +3152,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                mapping = folio->mapping;
 
                /* Truncated ? */
-               if (!mapping) {
+               if (!mapping)
                        ret = -EBUSY;
-                       goto out;
-               }
 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 20:30     ` Pankaj Raghav (Samsung)
@ 2024-06-07 20:51       ` Zi Yan
  2024-06-10  7:26         ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 58+ messages in thread
From: Zi Yan @ 2024-06-07 20:51 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), willy
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, linux-xfs, p.raghav, linux-fsdevel, hch,
	gost.dev, cl, john.g.garry

[-- Attachment #1: Type: text/plain, Size: 2252 bytes --]

On 7 Jun 2024, at 16:30, Pankaj Raghav (Samsung) wrote:

> On Fri, Jun 07, 2024 at 12:58:33PM -0400, Zi Yan wrote:
>> Hi Pankaj,
>>
>> Can you use ziy@nvidia.com instead of zi.yan@sent.com? Since I just use the latter
>> to send patches. Thanks.
>
> Got it!
>
>>
>> On 7 Jun 2024, at 10:58, Pankaj Raghav (Samsung) wrote:
>>
>>> From: Luis Chamberlain <mcgrof@kernel.org>
>>>
>>> split_folio() and split_folio_to_list() assume order 0, to support
>>> minorder for non-anonymous folios, we must expand these to check the
>>> folio mapping order and use that.
>>>
>>> Set new_order to be at least minimum folio order if it is set in
>>> split_huge_page_to_list() so that we can maintain minimum folio order
>>> requirement in the page cache.
>>>
>>> Update the debugfs write files used for testing to ensure the order
>>> is respected as well. We simply enforce the min order when a file
>>> mapping is used.
>>>
>>> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
>>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>>> ---
>>>  include/linux/huge_mm.h | 14 ++++++++---
>>>  mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
>>>  2 files changed, 61 insertions(+), 8 deletions(-)
>>>
>>
>> <snip>
>>
>>>
>>> +int split_folio_to_list(struct folio *folio, struct list_head *list)
>>> +{
>>> +	unsigned int min_order = 0;
>>> +
>>> +	if (!folio_test_anon(folio)) {
>>> +		if (!folio->mapping) {
>>> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
>>
>> You should only increase this counter when the input folio is a THP, namely
>> folio_test_pmd_mappable(folio) is true. For other large folios, we will
>> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
>> See enum mthp_stat_item in include/linux/huge_mm.h.
>>
> Hmm, but we don't have mTHP support for non-anonymous memory right? In
> that case it won't be applicable for file backed memory?

Large folio support in page cache precedes mTHP (large anonymous folio),
thanks to willy's work. mTHP is more like a subset of large folio.
There are no specific counters for page cache large folios. If you think
it is worth tracking folios with orders between 0 and 9 (exclusive),
you can add counters. Matthew, what is your take on this?


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 20:51       ` Zi Yan
@ 2024-06-10  7:26         ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-10  7:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: willy, david, djwong, chandan.babu, brauner, akpm, mcgrof,
	linux-mm, hare, linux-kernel, yang, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 04:51:04PM -0400, Zi Yan wrote:
> On 7 Jun 2024, at 16:30, Pankaj Raghav (Samsung) wrote:
> >>> +		if (!folio->mapping) {
> >>> +			count_vm_event(THP_SPLIT_PAGE_FAILED);
> >>
> >> You should only increase this counter when the input folio is a THP, namely
> >> folio_test_pmd_mappable(folio) is true. For other large folios, we will
> >> need a separate counter. Something like MTHP_STAT_FILE_SPLIT_FAILED.
> >> See enum mthp_stat_item in include/linux/huge_mm.h.
> >>
> > Hmm, but we don't have mTHP support for non-anonymous memory right? In
> > that case it won't be applicable for file backed memory?
> 
> Large folio support in page cache precedes mTHP (large anonymous folio),
> thanks to willy's work. mTHP is more like a subset of large folio.
> There are no specific counters for page cache large folios. If you think
> it is worth tracking folios with orders between 0 and 9 (exclusive),
> you can add counters. Matthew, what is your take on this?

Got it. I think this is out of scope for this series but something we
could consider as a future enhancement?

In any case, we need to decide whether we need to count truncation as a 
VM event or not.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-06-07 14:58 ` [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-06-11  7:38   ` John Garry
  2024-06-11  9:41     ` Pankaj Raghav (Samsung)
  2024-06-12 20:40   ` Darrick J. Wong
  1 sibling, 1 reply; 58+ messages in thread
From: John Garry @ 2024-06-11  7:38 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), david, djwong, chandan.babu, brauner,
	akpm, willy
  Cc: mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl

On 07/06/2024 15:58, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to zero page(as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>   fs/internal.h          |  5 +++++
>   fs/iomap/buffered-io.c |  6 ++++++
>   fs/iomap/direct-io.c   | 26 ++++++++++++++++++++++++--
>   3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index 84f371193f74..30217f0ff4c6 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -35,6 +35,11 @@ static inline void bdev_cache_init(void)
>   int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
>   		get_block_t *get_block, const struct iomap *iomap);
>   
> +/*
> + * iomap/direct-io.c
> + */
> +int iomap_dio_init(void);
> +
>   /*
>    * char_dev.c
>    */
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 49938419fcc7..9f791db473e4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1990,6 +1990,12 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
>   
>   static int __init iomap_init(void)
>   {
> +	int ret;
> +
> +	ret = iomap_dio_init();
> +	if (ret)
> +		return ret;
> +
>   	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
>   			   offsetof(struct iomap_ioend, io_bio),
>   			   BIOSET_NEED_BVECS);

I suppose that it does not matter that zero_fs_block is leaked if this 
fails (or is it even leaked?), as I don't think that failing that 
bioset_init() call is handled at all.

> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..b95600b254a3 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -27,6 +27,13 @@
>   #define IOMAP_DIO_WRITE		(1U << 30)
>   #define IOMAP_DIO_DIRTY		(1U << 31)
>   
> +/*
> + * Used for sub block zeroing in iomap_dio_zero()
> + */
> +#define ZERO_FSB_SIZE (65536)
> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> +static struct page *zero_fs_block;
> +
>   struct iomap_dio {
>   	struct kiocb		*iocb;
>   	const struct iomap_dio_ops *dops;
> @@ -52,6 +59,16 @@ struct iomap_dio {
>   	};
>   };
>   
> +int iomap_dio_init(void)
> +{
> +	zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
> +
> +	if (!zero_fs_block)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
>   static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
>   		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
>   {
> @@ -236,17 +253,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   		loff_t pos, unsigned len)
>   {
>   	struct inode *inode = file_inode(dio->iocb->ki_filp);
> -	struct page *page = ZERO_PAGE(0);
>   	struct bio *bio;
>   
> +	/*
> +	 * Max block size supported is 64k
> +	 */
> +	WARN_ON_ONCE(len > ZERO_FSB_SIZE);

JFYI, As mentioned in 
https://lore.kernel.org/linux-xfs/20240429174746.2132161-1-john.g.garry@oracle.com/T/#m5354e2b2531a5552a8b8acd4a95342ed4d7500f2, 
we would like to support an arbitrary size. Maybe I will need to loop 
for zeroing sizes > 64K.

> +
>   	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>   	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>   				  GFP_KERNEL);
> +
>   	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>   	bio->bi_private = dio;
>   	bio->bi_end_io = iomap_dio_bio_end_io;
>   
> -	__bio_add_page(bio, page, len, 0);
> +	__bio_add_page(bio, zero_fs_block, len, 0);
>   	iomap_dio_submit_bio(iter, dio, bio, pos);
>   }
>   


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-06-11  7:38   ` John Garry
@ 2024-06-11  9:41     ` Pankaj Raghav (Samsung)
  2024-06-11 10:00       ` John Garry
  0 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-11  9:41 UTC (permalink / raw)
  To: John Garry
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl

> > index 49938419fcc7..9f791db473e4 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1990,6 +1990,12 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
> >   static int __init iomap_init(void)
> >   {
> > +	int ret;
> > +
> > +	ret = iomap_dio_init();
> > +	if (ret)
> > +		return ret;
> > +
> >   	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> >   			   offsetof(struct iomap_ioend, io_bio),
> >   			   BIOSET_NEED_BVECS);
> 
> I suppose that it does not matter that zero_fs_block is leaked if this fails
> (or is it even leaked?), as I don't think that failing that bioset_init()
> call is handled at all.

If bioset_init fails, then we have even more problems than just a leaked
64k memory? ;)

Do you have something like this in mind?

diff --git a/fs/internal.h b/fs/internal.h
index 30217f0ff4c6..def96c7ed9ea 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -39,6 +39,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
  * iomap/direct-io.c
  */
 int iomap_dio_init(void);
+void iomap_dio_exit(void);
 
 /*
  * char_dev.c
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9f791db473e4..8d8b9e62201f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1994,10 +1994,16 @@ static int __init iomap_init(void)
 
        ret = iomap_dio_init();
        if (ret)
-               return ret;
+               goto out;
 
-       return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
+       ret = bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
                           offsetof(struct iomap_ioend, io_bio),
                           BIOSET_NEED_BVECS);
+       if (!ret)
+               goto out;
+
+       iomap_dio_exit();
+out:
+       return ret;
 }
 fs_initcall(iomap_init);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b95600b254a3..f4c9445ca50d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -69,6 +69,12 @@ int iomap_dio_init(void)
        return 0;
 }
 
+void iomap_dio_exit(void)
+{
+       __free_pages(zero_fs_block, ZERO_FSB_ORDER);
+
+}
+
 static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
                struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
 {

> 
> > +
> >   static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
> >   		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
> >   {
> > @@ -236,17 +253,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> >   		loff_t pos, unsigned len)
> >   {
> >   	struct inode *inode = file_inode(dio->iocb->ki_filp);
> > -	struct page *page = ZERO_PAGE(0);
> >   	struct bio *bio;
> > +	/*
> > +	 * Max block size supported is 64k
> > +	 */
> > +	WARN_ON_ONCE(len > ZERO_FSB_SIZE);
> 
> JFYI, As mentioned in https://lore.kernel.org/linux-xfs/20240429174746.2132161-1-john.g.garry@oracle.com/T/#m5354e2b2531a5552a8b8acd4a95342ed4d7500f2,
> we would like to support an arbitrary size. Maybe I will need to loop for
> zeroing sizes > 64K.

The initial patches were looping with a ZERO_PAGE(0), but the initial
feedback was to use a huge zero page. But when I discussed that at LSF,
the people thought we will be using a lot of memory for sub-block
memory, especially on architectures with 64k base page size.

So for now a good tradeoff between memory usage and efficiency was to
use a 64k buffer as that is the maximum FSB we support.[1]

IIUC, you will be using this function also to zero out the extent and
not just a FSB?

I think we could resort to looping until we have a way to request
arbitrary zero folios without having to allocate at it in
iomap_dio_alloc_bio() for every IO.

[1] https://lore.kernel.org/linux-xfs/20240529134509.120826-8-kernel@pankajraghav.com/
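
(Roughly, such a loop could reuse the 64k zero buffer as sketched below; this
is only an illustration of the idea, not part of the series, and the bio would
need to be allocated with enough vecs, e.g. DIV_ROUND_UP(len, ZERO_FSB_SIZE),
rather than 1:)

	while (len) {
		unsigned int io_len = min_t(unsigned int, len, ZERO_FSB_SIZE);

		__bio_add_page(bio, zero_fs_block, io_len, 0);
		len -= io_len;
	}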

--
Pankaj

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-06-11  9:41     ` Pankaj Raghav (Samsung)
@ 2024-06-11 10:00       ` John Garry
  0 siblings, 0 replies; 58+ messages in thread
From: John Garry @ 2024-06-11 10:00 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl

On 11/06/2024 10:41, Pankaj Raghav (Samsung) wrote:
>>> index 49938419fcc7..9f791db473e4 100644
>>> --- a/fs/iomap/buffered-io.c
>>> +++ b/fs/iomap/buffered-io.c
>>> @@ -1990,6 +1990,12 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
>>>    static int __init iomap_init(void)
>>>    {
>>> +	int ret;
>>> +
>>> +	ret = iomap_dio_init();
>>> +	if (ret)
>>> +		return ret;
>>> +
>>>    	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
>>>    			   offsetof(struct iomap_ioend, io_bio),
>>>    			   BIOSET_NEED_BVECS);
>> I suppose that it does not matter that zero_fs_block is leaked if this fails
>> (or is it even leaked?), as I don't think that failing that bioset_init()
>> call is handled at all.
> If bioset_init fails, then we have even more problems than just a leaked
> 64k memory? 😉
> 

Right

 > Do you have something like this in mind?

I wouldn't propose anything myself. AFAICS, we don't gracefully handle 
bioset_init() failing and iomap_ioend_bioset not being initialized properly.

>   static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
>                  struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
>   {
> 
>>> +
>>>    static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
>>>    		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
>>>    {
>>> @@ -236,17 +253,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>>>    		loff_t pos, unsigned len)
>>>    {
>>>    	struct inode *inode = file_inode(dio->iocb->ki_filp);
>>> -	struct page *page = ZERO_PAGE(0);
>>>    	struct bio *bio;
>>> +	/*
>>> +	 * Max block size supported is 64k
>>> +	 */
>>> +	WARN_ON_ONCE(len > ZERO_FSB_SIZE);
>> JFYI, As mentioned in https://lore.kernel.org/linux-xfs/20240429174746.2132161-1-john.g.garry@oracle.com/T/#m5354e2b2531a5552a8b8acd4a95342ed4d7500f2,
>> we would like to support an arbitrary size. Maybe I will need to loop for
>> zeroing sizes > 64K.
> The initial patches were looping with a ZERO_PAGE(0), but the initial
> feedback was to use a huge zero page. But when I discussed that at LSF,
> the people thought we will be using a lot of memory for sub-block
> memory, especially on architectures with 64k base page size.
> 
> So for now a good tradeoff between memory usage and efficiency was to
> use a 64k buffer as that is the maximum FSB we support.[1]
> 
> IIUC, you will be using this function also to zero out the extent and
> not just a FSB?

Right. Or more specifically, the FS can ask for the zeroing size. 
Typically it will be inode i_blocksize, but the FS can ask for a larger 
size to zero out to extent alignment boundaries.

> 
> I think we could resort to looping until we have a way to request
> arbitrary zero folios without having to allocate at it in
> iomap_dio_alloc_bio() for every IO.
> 

ok

> [1] https://lore.kernel.org/linux-xfs/20240529134509.120826-8-kernel@pankajraghav.com/

Thanks,
John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-06-12  9:01   ` Hannes Reinecke
  2024-06-12 15:40   ` Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Hannes Reinecke @ 2024-06-12  9:01 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), david, djwong, chandan.babu, brauner,
	akpm, willy
  Cc: mcgrof, linux-mm, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On 6/7/24 16:58, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folio of order 0. __filemap_get_folio was trying to allocate higher
> order folios when fgp_flags had higher order hint set but it will default
> to order 0 folio if higher order memory allocation fails.
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. When adding new
> folios to the page cache we must also ensure the index used is aligned to
> the mapping_min_order as the page cache requires the index to be aligned
> to the order of the folio.
> 
> Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   include/linux/pagemap.h | 20 ++++++++++++++++++++
>   mm/filemap.c            | 26 ++++++++++++++++++--------
>   2 files changed, 38 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 05/11] mm: split a folio in minimum folio order chunks
  2024-06-07 14:58 ` [PATCH v7 05/11] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
  2024-06-07 16:58   ` Zi Yan
@ 2024-06-12  9:02   ` Hannes Reinecke
  1 sibling, 0 replies; 58+ messages in thread
From: Hannes Reinecke @ 2024-06-12  9:02 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), david, djwong, chandan.babu, brauner,
	akpm, willy
  Cc: mcgrof, linux-mm, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On 6/7/24 16:58, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
> 
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
> 
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   include/linux/huge_mm.h | 14 ++++++++---
>   mm/huge_memory.c        | 55 ++++++++++++++++++++++++++++++++++++++---
>   2 files changed, 61 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 02/11] fs: Allow fine-grained control of folio sizes
  2024-06-07 14:58 ` [PATCH v7 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-06-12 15:38   ` Darrick J. Wong
  0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2024-06-12 15:38 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, chandan.babu, brauner, akpm, willy, mcgrof, linux-mm, hare,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:53PM +0000, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We need filesystems to be able to communicate acceptable folio sizes
> to the pagecache for a variety of uses (e.g. large block sizes).
> Support a range of folio sizes between order-0 and order-31.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  include/linux/pagemap.h | 86 ++++++++++++++++++++++++++++++++++-------
>  mm/filemap.c            |  6 +--
>  mm/readahead.c          |  4 +-
>  3 files changed, 77 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 8f09ed4a4451..228275e7049f 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -204,14 +204,21 @@ enum mapping_flags {
>  	AS_EXITING	= 4, 	/* final truncate in progress */
>  	/* writeback related tags are not used */
>  	AS_NO_WRITEBACK_TAGS = 5,
> -	AS_LARGE_FOLIO_SUPPORT = 6,
> -	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
> -	AS_STABLE_WRITES,	/* must wait for writeback before modifying
> +	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
> +	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
>  				   folio contents */
> -	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
> -	AS_INACCESSIBLE,	/* Do not attempt direct R/W access to the mapping */
> +	AS_UNMOVABLE = 8,	/* The mapping cannot be moved, ever */
> +	AS_INACCESSIBLE = 9,	/* Do not attempt direct R/W access to the mapping */
> +	/* Bits 16-25 are used for FOLIO_ORDER */
> +	AS_FOLIO_ORDER_BITS = 5,
> +	AS_FOLIO_ORDER_MIN = 16,
> +	AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
>  };
>  
> +#define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)
> +#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
> +#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
> +
>  /**
>   * mapping_set_error - record a writeback error in the address_space
>   * @mapping: the mapping in which an error should be set
> @@ -360,9 +367,49 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  #define MAX_PAGECACHE_ORDER	8
>  #endif
>  
> +/*
> + * mapping_set_folio_order_range() - Set the orders supported by a file.
> + * @mapping: The address space of the file.
> + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
> + *
> + * The filesystem should call this function in its inode constructor to
> + * indicate which base size (min) and maximum size (max) of folio the VFS
> + * can use to cache the contents of the file.  This should only be used
> + * if the filesystem needs special handling of folio sizes (ie there is
> + * something the core cannot know).
> + * Do not tune it based on, eg, i_size.
> + *
> + * Context: This should not be called while the inode is active as it
> + * is non-atomic.
> + */
> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> +						 unsigned int min,
> +						 unsigned int max)
> +{
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return;
> +
> +	if (min > MAX_PAGECACHE_ORDER)
> +		min = MAX_PAGECACHE_ORDER;
> +	if (max > MAX_PAGECACHE_ORDER)
> +		max = MAX_PAGECACHE_ORDER;
> +	if (max < min)
> +		max = min;
> +
> +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> +		(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
> +}
> +
> +static inline void mapping_set_folio_min_order(struct address_space *mapping,
> +					       unsigned int min)
> +{
> +	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
> +}
> +
>  /**
>   * mapping_set_large_folios() - Indicate the file supports large folios.
> - * @mapping: The file.
> + * @mapping: The address space of the file.
>   *
>   * The filesystem should call this function in its inode constructor to
>   * indicate that the VFS can use large folios to cache the contents of
> @@ -373,7 +420,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>   */
>  static inline void mapping_set_large_folios(struct address_space *mapping)
>  {
> -	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
> +}
> +
> +static inline
> +unsigned int mapping_max_folio_order(const struct address_space *mapping)
> +{
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return 0;
> +	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
> +}
> +
> +static inline
> +unsigned int mapping_min_folio_order(const struct address_space *mapping)
> +{
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return 0;
> +	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
>  }
>  
>  /*
> @@ -382,16 +445,13 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
>   */
>  static inline bool mapping_large_folio_support(struct address_space *mapping)
>  {
> -	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	return mapping_max_folio_order(mapping) > 0;
>  }
>  
>  /* Return the maximum folio size for this pagecache mapping, in bytes. */
> -static inline size_t mapping_max_folio_size(struct address_space *mapping)
> +static inline size_t mapping_max_folio_size(const struct address_space *mapping)
>  {
> -	if (mapping_large_folio_support(mapping))
> -		return PAGE_SIZE << MAX_PAGECACHE_ORDER;
> -	return PAGE_SIZE;
> +	return PAGE_SIZE << mapping_max_folio_order(mapping);
>  }
>  
>  static inline int filemap_nr_thps(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 37061aafd191..46c7a6f59788 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1933,10 +1933,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
>  			fgp_flags |= FGP_LOCK;
>  
> -		if (!mapping_large_folio_support(mapping))
> -			order = 0;
> -		if (order > MAX_PAGECACHE_ORDER)
> -			order = MAX_PAGECACHE_ORDER;
> +		if (order > mapping_max_folio_order(mapping))
> +			order = mapping_max_folio_order(mapping);
>  		/* If we're not aligned, allocate a smaller folio */
>  		if (index & ((1UL << order) - 1))
>  			order = __ffs(index);
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 75e934a1fd78..da34b28da02c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -504,9 +504,9 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  
>  	limit = min(limit, index + ra->size - 1);
>  
> -	if (new_order < MAX_PAGECACHE_ORDER) {
> +	if (new_order < mapping_max_folio_order(mapping)) {
>  		new_order += 2;
> -		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> +		new_order = min(mapping_max_folio_order(mapping), new_order);
>  		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  	}
>  
> -- 
> 2.44.1
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
  2024-06-12  9:01   ` Hannes Reinecke
@ 2024-06-12 15:40   ` Darrick J. Wong
  2024-06-12 17:24   ` Matthew Wilcox
  2024-06-13  8:44   ` Christoph Hellwig
  3 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2024-06-12 15:40 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, chandan.babu, brauner, akpm, willy, mcgrof, linux-mm, hare,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:54PM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folio of order 0. __filemap_get_folio was trying to allocate higher
> order folios when fgp_flags had higher order hint set but it will default
> to order 0 folio if higher order memory allocation fails.
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. When adding new
> folios to the page cache we must also ensure the index used is aligned to
> the mapping_min_order as the page cache requires the index to be aligned
> to the order of the folio.
> 
> Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Seems pretty straightforward, so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  include/linux/pagemap.h | 20 ++++++++++++++++++++
>  mm/filemap.c            | 26 ++++++++++++++++++--------
>  2 files changed, 38 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 228275e7049f..899b8d751768 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -439,6 +439,26 @@ unsigned int mapping_min_folio_order(const struct address_space *mapping)
>  	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
>  }
>  
> +static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
> +{
> +	return 1UL << mapping_min_folio_order(mapping);
> +}
> +
> +/**
> + * mapping_align_start_index() - Align starting index based on the min
> + * folio order of the page cache.
> + * @mapping: The address_space.
> + *
> + * Ensure the index used is aligned to the minimum folio order when adding
> + * new folios to the page cache by rounding down to the nearest minimum
> + * folio number of pages.
> + */
> +static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
> +						pgoff_t index)
> +{
> +	return round_down(index, mapping_min_folio_nrpages(mapping));
> +}
> +
>  /*
>   * Large folio support currently depends on THP.  These dependencies are
>   * being worked on but are not yet fixed.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 46c7a6f59788..8bb0d2bc93c5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -859,6 +859,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
> +			folio);
>  	mapping_set_update(&xas, mapping);
>  
>  	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
> @@ -1919,8 +1921,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> -		unsigned order = FGF_GET_ORDER(fgp_flags);
> +		unsigned int min_order = mapping_min_folio_order(mapping);
> +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>  		int err;
> +		index = mapping_align_start_index(mapping, index);
>  
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
> @@ -1943,7 +1947,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  			gfp_t alloc_gfp = gfp;
>  
>  			err = -ENOMEM;
> -			if (order > 0)
> +			if (order > min_order)
>  				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>  			folio = filemap_alloc_folio(alloc_gfp, order);
>  			if (!folio)
> @@ -1958,7 +1962,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  				break;
>  			folio_put(folio);
>  			folio = NULL;
> -		} while (order-- > 0);
> +		} while (order-- > min_order);
>  
>  		if (err == -EEXIST)
>  			goto repeat;
> @@ -2447,13 +2451,16 @@ static int filemap_update_page(struct kiocb *iocb,
>  }
>  
>  static int filemap_create_folio(struct file *file,
> -		struct address_space *mapping, pgoff_t index,
> +		struct address_space *mapping, loff_t pos,
>  		struct folio_batch *fbatch)
>  {
>  	struct folio *folio;
>  	int error;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	pgoff_t index;
>  
> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> +				    min_order);
>  	if (!folio)
>  		return -ENOMEM;
>  
> @@ -2471,6 +2478,8 @@ static int filemap_create_folio(struct file *file,
>  	 * well to keep locking rules simple.
>  	 */
>  	filemap_invalidate_lock_shared(mapping);
> +	/* index in PAGE units but aligned to min_order number of pages. */
> +	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
>  	error = filemap_add_folio(mapping, folio, index,
>  			mapping_gfp_constraint(mapping, GFP_KERNEL));
>  	if (error == -EEXIST)
> @@ -2531,8 +2540,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	if (!folio_batch_count(fbatch)) {
>  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>  			return -EAGAIN;
> -		err = filemap_create_folio(filp, mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> +		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
>  		if (err == AOP_TRUNCATED_PAGE)
>  			goto retry;
>  		return err;
> @@ -3748,9 +3756,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  repeat:
>  	folio = filemap_get_folio(mapping, index);
>  	if (IS_ERR(folio)) {
> -		folio = filemap_alloc_folio(gfp, 0);
> +		folio = filemap_alloc_folio(gfp,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			return ERR_PTR(-ENOMEM);
> +		index = mapping_align_start_index(mapping, index);
>  		err = filemap_add_folio(mapping, folio, index, gfp);
>  		if (unlikely(err)) {
>  			folio_put(folio);
> -- 
> 2.44.1
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
  2024-06-12  9:01   ` Hannes Reinecke
  2024-06-12 15:40   ` Darrick J. Wong
@ 2024-06-12 17:24   ` Matthew Wilcox
  2024-06-13  8:44   ` Christoph Hellwig
  3 siblings, 0 replies; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-12 17:24 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:54PM +0000, Pankaj Raghav (Samsung) wrote:
> +/**
> + * mapping_align_start_index() - Align starting index based on the min
> + * folio order of the page cache.

_short_ description.  "Align index appropriately for this mapping".
And maybe that means we should call it "mapping_align_index" instead
of mapping_align_start_index?

> + * @mapping: The address_space.
> + *
> + * Ensure the index used is aligned to the minimum folio order when adding
> + * new folios to the page cache by rounding down to the nearest minimum
> + * folio number of pages.

How about:

 * The index of a folio must be naturally aligned.  If you are adding a
 * new folio to the page cache and need to know what index to give it,
 * call this function.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-07 14:58 ` [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-06-12 18:50   ` Matthew Wilcox
  2024-06-14  9:26     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-12 18:50 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:55PM +0000, Pankaj Raghav (Samsung) wrote:
> @@ -230,7 +247,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  		struct folio *folio = xa_load(&mapping->i_pages, index + i);
>  		int ret;
>  
> +

Spurious newline

>  		if (folio && !xa_is_value(folio)) {
> +			long nr_pages = folio_nr_pages(folio);

Hm, but we don't have a reference on this folio.  So this isn't safe.

> @@ -240,12 +259,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index += folio_nr_pages(folio);
> +
> +			/*
> +			 * Move the ractl->_index by at least min_pages
> +			 * if the folio got truncated to respect the
> +			 * alignment constraint in the page cache.
> +			 *
> +			 */
> +			if (mapping != folio->mapping)
> +				nr_pages = min_nrpages;
> +
> +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> +			ractl->_index += nr_pages;

Why not just:
			ractl->_index += min_nrpages;

like you do below?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-07 14:58 ` [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-06-12 19:08   ` Matthew Wilcox
  2024-06-13  7:57     ` Luis Chamberlain
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-12 19:08 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Usually the page cache does not extend beyond the size of the inode,
> therefore, no PTEs are created for folios that extend beyond the size.
> 
> But with LBS support, we might extend page cache beyond the size of the
> inode as we need to guarantee folios of minimum order. Cap the PTE range
> to be created for the page cache up to the max allowed zero-fill file
> end, which is aligned to the PAGE_SIZE.

I think this is slightly misleading because we might well zero-fill
to the end of the folio.  The issue is that we're supposed to SIGBUS
if userspace accesses pages which lie entirely beyond the end of this
file.  Can you rephrase this?

(from mmap(2))
       SIGBUS Attempted access to a page of the buffer that lies beyond the end
              of the mapped file.  For an explanation of the treatment  of  the
              bytes  in  the  page that corresponds to the end of a mapped file
              that is not a multiple of the page size, see NOTES.
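
(A worked example of that boundary, for illustration: with 4k pages, a 16k
block size and i_size = 5000, file_end = DIV_ROUND_UP(5000, 4096) - 1 = 1, so
the hunk below only creates PTEs for pages 0 and 1 of the folio; a later
access to page 2 or 3 takes a regular fault, is rejected against i_size and
raises SIGBUS as mmap(2) requires.)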


The code is good though.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

> An fstests test has been created to trigger this edge case [0].
> 
> [0] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/filemap.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8bb0d2bc93c5..0e48491b3d10 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3610,7 +3610,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> -	pgoff_t last_pgoff = start_pgoff;
> +	pgoff_t file_end, last_pgoff = start_pgoff;
>  	unsigned long addr;
>  	XA_STATE(xas, &mapping->i_pages, start_pgoff);
>  	struct folio *folio;
> @@ -3636,6 +3636,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  		goto out;
>  	}
>  
> +	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> +	if (end_pgoff > file_end)
> +		end_pgoff = file_end;
> +
>  	folio_type = mm_counter_file(folio);
>  	do {
>  		unsigned long end;
> -- 
> 2.44.1
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-06-07 14:58 ` [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
  2024-06-11  7:38   ` John Garry
@ 2024-06-12 20:40   ` Darrick J. Wong
  2024-06-17 15:08     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2024-06-12 20:40 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, chandan.babu, brauner, akpm, willy, mcgrof, linux-mm, hare,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:58PM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to zero page(as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>  fs/internal.h          |  5 +++++
>  fs/iomap/buffered-io.c |  6 ++++++
>  fs/iomap/direct-io.c   | 26 ++++++++++++++++++++++++--
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index 84f371193f74..30217f0ff4c6 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -35,6 +35,11 @@ static inline void bdev_cache_init(void)
>  int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
>  		get_block_t *get_block, const struct iomap *iomap);
>  
> +/*
> + * iomap/direct-io.c
> + */
> +int iomap_dio_init(void);
> +
>  /*
>   * char_dev.c
>   */
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 49938419fcc7..9f791db473e4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1990,6 +1990,12 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
>  
>  static int __init iomap_init(void)
>  {
> +	int ret;
> +
> +	ret = iomap_dio_init();
> +	if (ret)
> +		return ret;
> +
>  	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
>  			   offsetof(struct iomap_ioend, io_bio),
>  			   BIOSET_NEED_BVECS);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..b95600b254a3 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -27,6 +27,13 @@
>  #define IOMAP_DIO_WRITE		(1U << 30)
>  #define IOMAP_DIO_DIRTY		(1U << 31)
>  
> +/*
> + * Used for sub block zeroing in iomap_dio_zero()
> + */
> +#define ZERO_FSB_SIZE (65536)
> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> +static struct page *zero_fs_block;

Er... zero_page_64k ?

Since it's a permanent allocation, can we also mark the memory ro?

> +
>  struct iomap_dio {
>  	struct kiocb		*iocb;
>  	const struct iomap_dio_ops *dops;
> @@ -52,6 +59,16 @@ struct iomap_dio {
>  	};
>  };
>  
> +int iomap_dio_init(void)
> +{
> +	zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
> +
> +	if (!zero_fs_block)
> +		return -ENOMEM;
> +
> +	return 0;
> +}

Can't we just turn this into another fs_initcall() instead of exporting
it just so we can call it from iomap_init?  And maybe rename the
existing iomap_init to iomap_pagecache_init or something, for clarity's
sake?

--D

> +
>  static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
>  		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
>  {
> @@ -236,17 +253,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  		loff_t pos, unsigned len)
>  {
>  	struct inode *inode = file_inode(dio->iocb->ki_filp);
> -	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
>  
> +	/*
> +	 * Max block size supported is 64k
> +	 */
> +	WARN_ON_ONCE(len > ZERO_FSB_SIZE);
> +
>  	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	__bio_add_page(bio, page, len, 0);
> +	__bio_add_page(bio, zero_fs_block, len, 0);
>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
>  
> -- 
> 2.44.1
> 
> 
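A rough sketch of what the two suggestions above could look like together,
assuming the rename to zero_page_64k and using set_memory_ro() to
write-protect the permanent allocation (illustrative only, not the posted
patch):

/* fs/iomap/direct-io.c -- illustrative sketch only */
#include <linux/set_memory.h>

#define ZERO_PAGE_64K_SIZE	(65536)
#define ZERO_PAGE_64K_ORDER	(get_order(ZERO_PAGE_64K_SIZE))

static struct page *zero_page_64k;

static int __init iomap_dio_init(void)
{
	zero_page_64k = alloc_pages(GFP_KERNEL | __GFP_ZERO,
				    ZERO_PAGE_64K_ORDER);
	if (!zero_page_64k)
		return -ENOMEM;
	/*
	 * The allocation lives for the lifetime of the kernel, so
	 * write-protect it to catch anyone scribbling on it.
	 */
	set_memory_ro((unsigned long)page_address(zero_page_64k),
		      1U << ZERO_PAGE_64K_ORDER);
	return 0;
}
fs_initcall(iomap_dio_init);

With a dedicated fs_initcall() like this, nothing needs to be declared in
fs/internal.h and the existing iomap_init() in buffered-io.c stays untouched.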

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-12 19:08   ` Matthew Wilcox
@ 2024-06-13  7:57     ` Luis Chamberlain
  2024-06-13  8:07       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Luis Chamberlain @ 2024-06-13  7:57 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand, Hugh Dickins, yang, linmiaohe,
	muchun.song, osalvador
  Cc: Pankaj Raghav (Samsung), david, djwong, chandan.babu, brauner,
	akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Usually the page cache does not extend beyond the size of the inode,
> > therefore, no PTEs are created for folios that extend beyond the size.
> > 
> > But with LBS support, we might extend page cache beyond the size of the
> > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > to be created for the page cache up to the max allowed zero-fill file
> > end, which is aligned to the PAGE_SIZE.
> 
> I think this is slightly misleading because we might well zero-fill
> to the end of the folio.  The issue is that we're supposed to SIGBUS
> if userspace accesses pages which lie entirely beyond the end of this
> file.  Can you rephrase this?
> 
> (from mmap(2))
>        SIGBUS Attempted access to a page of the buffer that lies beyond the end
>               of the mapped file.  For an explanation of the treatment  of  the
>               bytes  in  the  page that corresponds to the end of a mapped file
>               that is not a multiple of the page size, see NOTES.
> 
> 
> The code is good though.
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Since I've been curating the respective fstests test to test for this
POSIX corner case [0] I wanted to enable the test for tmpfs instead of
skipping it as I originally had it, and that meant also realizing mmap(2)
specifically says this now:

Huge page (Huge TLB) mappings
...
       For mmap(), offset must be a multiple of the underlying huge page
       size. The system automatically aligns length to be a multiple of
       the underlying huge page size.

So do we need to adjust this patch with this:

diff --git a/mm/filemap.c b/mm/filemap.c
index ea78963f0956..9c8897ba90ff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	vm_fault_t ret = 0;
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+	unsigned int align = PAGE_SIZE;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3636,7 +3637,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (folio_test_pmd_mappable(folio))
+		align = 1 << folio_order(folio);
+
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
 	if (end_pgoff > file_end)
 		end_pgoff = file_end;

[0] https://lore.kernel.org/all/20240611030203.1719072-3-mcgrof@kernel.org/

  Luis

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13  7:57     ` Luis Chamberlain
@ 2024-06-13  8:07       ` David Hildenbrand
  2024-06-13  8:13         ` Luis Chamberlain
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2024-06-13  8:07 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox, Hugh Dickins, yang, linmiaohe,
	muchun.song, osalvador
  Cc: Pankaj Raghav (Samsung), david, djwong, chandan.babu, brauner,
	akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On 13.06.24 09:57, Luis Chamberlain wrote:
> On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
>> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>
>>> Usually the page cache does not extend beyond the size of the inode,
>>> therefore, no PTEs are created for folios that extend beyond the size.
>>>
>>> But with LBS support, we might extend page cache beyond the size of the
>>> inode as we need to guarantee folios of minimum order. Cap the PTE range
>>> to be created for the page cache up to the max allowed zero-fill file
>>> end, which is aligned to the PAGE_SIZE.
>>
>> I think this is slightly misleading because we might well zero-fill
>> to the end of the folio.  The issue is that we're supposed to SIGBUS
>> if userspace accesses pages which lie entirely beyond the end of this
>> file.  Can you rephrase this?
>>
>> (from mmap(2))
>>         SIGBUS Attempted access to a page of the buffer that lies beyond the end
>>                of the mapped file.  For an explanation of the treatment  of  the
>>                bytes  in  the  page that corresponds to the end of a mapped file
>>                that is not a multiple of the page size, see NOTES.
>>
>>
>> The code is good though.
>>
>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> Since I've been curating the respective fstests test to test for this
> POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> skipping it as I originally had it, and that meant also realizing mmap(2)
> specifically says this now:
> 
> Huge page (Huge TLB) mappings

Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP 
and friends.

So it might not be required for below changes.

> ...
>         For mmap(), offset must be a multiple of the underlying huge page
>         size. The system automatically aligns length to be a multiple of
>         the underlying huge page size.
> 
> So do we need to adjust this patch with this:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ea78963f0956..9c8897ba90ff 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   	vm_fault_t ret = 0;
>   	unsigned long rss = 0;
>   	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> +	unsigned int align = PAGE_SIZE;
>   
>   	rcu_read_lock();
>   	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> @@ -3636,7 +3637,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   		goto out;
>   	}
>   
> -	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> +	if (folio_test_pmd_mappable(folio))
> +		align = 1 << folio_order(folio);
> +
> +	file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
>   	if (end_pgoff > file_end)
>   		end_pgoff = file_end;
> 
> [0] https://lore.kernel.org/all/20240611030203.1719072-3-mcgrof@kernel.org/
> 
>    Luis
> 

-- 
Cheers,

David / dhildenb
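To make the distinction above concrete, an illustrative userspace sketch
(not from the thread; length and tmpfs_fd are placeholders): the quoted
mmap(2) section covers explicit MAP_HUGETLB mappings, while a tmpfs file
that happens to be backed by THP/large folios is mapped like any other
file and keeps the usual PAGE_SIZE mmap semantics.

#define _GNU_SOURCE
#include <sys/mman.h>

/* Explicit hugetlb mapping: the case the quoted man page text is about.
 * offset/length restrictions are in units of the huge page size. */
static void *map_hugetlb_anon(size_t length)
{
	return mmap(NULL, length, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

/* Ordinary mapping of a tmpfs file: PAGE_SIZE-granularity semantics
 * apply even if tmpfs transparently backs it with huge pages. */
static void *map_tmpfs_file(int tmpfs_fd, size_t length)
{
	return mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
		    tmpfs_fd, 0);
}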


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13  8:07       ` David Hildenbrand
@ 2024-06-13  8:13         ` Luis Chamberlain
  2024-06-13  8:16           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Luis Chamberlain @ 2024-06-13  8:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
> On 13.06.24 09:57, Luis Chamberlain wrote:
> > On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > 
> > > > Usually the page cache does not extend beyond the size of the inode,
> > > > therefore, no PTEs are created for folios that extend beyond the size.
> > > > 
> > > > But with LBS support, we might extend page cache beyond the size of the
> > > > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > > > to be created for the page cache up to the max allowed zero-fill file
> > > > end, which is aligned to the PAGE_SIZE.
> > > 
> > > I think this is slightly misleading because we might well zero-fill
> > > to the end of the folio.  The issue is that we're supposed to SIGBUS
> > > if userspace accesses pages which lie entirely beyond the end of this
> > > file.  Can you rephrase this?
> > > 
> > > (from mmap(2))
> > >         SIGBUS Attempted access to a page of the buffer that lies beyond the end
> > >                of the mapped file.  For an explanation of the treatment  of  the
> > >                bytes  in  the  page that corresponds to the end of a mapped file
> > >                that is not a multiple of the page size, see NOTES.
> > > 
> > > 
> > > The code is good though.
> > > 
> > > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > 
> > Since I've been curating the respective fstests test to test for this
> > POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> > skipping it as I originally had it, and that meant also realizing mmap(2)
> > specifically says this now:
> > 
> > Huge page (Huge TLB) mappings
> 
> Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
> friends.
> 
> So it might not be required for below changes.

Thanks, I had to ask as we're dusting off this little obscure corner of
the universe. The reason I ask is that the test fails for tmpfs with huge
pages, and this patch fixes it, but it got me wondering whether the above
also applies to tmpfs with huge pages.

  Luis

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13  8:13         ` Luis Chamberlain
@ 2024-06-13  8:16           ` David Hildenbrand
  2024-06-13 15:27             ` Luis Chamberlain
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2024-06-13  8:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On 13.06.24 10:13, Luis Chamberlain wrote:
> On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
>> On 13.06.24 09:57, Luis Chamberlain wrote:
>>> On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
>>>> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>>
>>>>> Usually the page cache does not extend beyond the size of the inode,
>>>>> therefore, no PTEs are created for folios that extend beyond the size.
>>>>>
>>>>> But with LBS support, we might extend page cache beyond the size of the
>>>>> inode as we need to guarantee folios of minimum order. Cap the PTE range
>>>>> to be created for the page cache up to the max allowed zero-fill file
>>>>> end, which is aligned to the PAGE_SIZE.
>>>>
>>>> I think this is slightly misleading because we might well zero-fill
>>>> to the end of the folio.  The issue is that we're supposed to SIGBUS
>>>> if userspace accesses pages which lie entirely beyond the end of this
>>>> file.  Can you rephrase this?
>>>>
>>>> (from mmap(2))
>>>>          SIGBUS Attempted access to a page of the buffer that lies beyond the end
>>>>                 of the mapped file.  For an explanation of the treatment  of  the
>>>>                 bytes  in  the  page that corresponds to the end of a mapped file
>>>>                 that is not a multiple of the page size, see NOTES.
>>>>
>>>>
>>>> The code is good though.
>>>>
>>>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>
>>> Since I've been curating the respective fstests test to test for this
>>> POSIX corner case [0] I wanted to enable the test for tmpfs instead of
>>> skipping it as I originally had it, and that meant also realizing mmap(2)
>>> specifically says this now:
>>>
>>> Huge page (Huge TLB) mappings
>>
>> Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
>> friends.
>>
>> So it might not be required for below changes.
> 
> Thanks, I had to ask as we're dusting off this little obscure corner of
> the universe. Reason I ask, is the test fails for tmpfs with huge pages,
> and this patch fixes it, but it got me wondering the above applies also
> to tmpfs with huge pages.

Is it tmpfs with THP/large folios or shmem with hugetlb? I assume the 
tmpfs with THP. There are not really mmap/munmap restrictions to THP and 
friends (because it's supposed to be "transparent" :) ).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
                     ` (2 preceding siblings ...)
  2024-06-12 17:24   ` Matthew Wilcox
@ 2024-06-13  8:44   ` Christoph Hellwig
  2024-06-17  9:58     ` Pankaj Raghav (Samsung)
  3 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2024-06-13  8:44 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:58:54PM +0000, Pankaj Raghav (Samsung) wrote:
> +static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
> +{
> +	return 1UL << mapping_min_folio_order(mapping);
> +}

Overly long line here, just line break after the return type.

Then again it only has a single user just below and no documentation
so maybe just fold it into the caller?

>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> -		unsigned order = FGF_GET_ORDER(fgp_flags);
> +		unsigned int min_order = mapping_min_folio_order(mapping);
> +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>  		int err;
> +		index = mapping_align_start_index(mapping, index);

I wonder if at some point splitting this block that actually allocates
a new folio into a separate helper would be nice.  It just keeps growing
in size and complexity.

> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> +				    min_order);

Nit: no need to split this into multiple lines.

>  	if (!folio)
>  		return -ENOMEM;
>  
> @@ -2471,6 +2478,8 @@ static int filemap_create_folio(struct file *file,
>  	 * well to keep locking rules simple.
>  	 */
>  	filemap_invalidate_lock_shared(mapping);
> +	/* index in PAGE units but aligned to min_order number of pages. */

in PAGE_SIZE units?  Maybe also make this a complete sentence?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-06-07 14:59 ` [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-06-13  8:45   ` Christoph Hellwig
  2024-06-17 16:09     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2024-06-13  8:45 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

> +	uint64_t		max_index;
> +	uint64_t		max_bytes;
> +
>  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
>  
> +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> +		return -EFBIG;
> +
>  	/* Limited by ULONG_MAX of page cache index */
> -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> +	max_index = max_bytes >> PAGE_SHIFT;
> +
> +	if (max_index > ULONG_MAX)

Do we really need the max_index variable for a single user here?
Or do you plan to add more uses of it later (can't really think of one
though)?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 11/11] xfs: enable block size larger than page size support
  2024-06-07 14:59 ` [PATCH v7 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-06-13  8:47   ` Christoph Hellwig
  2024-06-17  1:29     ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2024-06-13  8:47 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 07, 2024 at 02:59:02PM +0000, Pankaj Raghav (Samsung) wrote:
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
>  		igeo->ialloc_align = mp->m_dalign;
>  	else
>  		igeo->ialloc_align = 0;
> +
> +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> +	else
> +		igeo->min_folio_order = 0;
>  }

The minimum folio order isn't really part of the inode (allocation)
geometry, is it?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13  8:16           ` David Hildenbrand
@ 2024-06-13 15:27             ` Luis Chamberlain
  2024-06-13 15:32               ` Matthew Wilcox
  0 siblings, 1 reply; 58+ messages in thread
From: Luis Chamberlain @ 2024-06-13 15:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 10:16:10AM +0200, David Hildenbrand wrote:
> On 13.06.24 10:13, Luis Chamberlain wrote:
> > On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
> > > On 13.06.24 09:57, Luis Chamberlain wrote:
> > > > On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> > > > > On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > > > 
> > > > > > Usually the page cache does not extend beyond the size of the inode,
> > > > > > therefore, no PTEs are created for folios that extend beyond the size.
> > > > > > 
> > > > > > But with LBS support, we might extend page cache beyond the size of the
> > > > > > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > > > > > to be created for the page cache up to the max allowed zero-fill file
> > > > > > end, which is aligned to the PAGE_SIZE.
> > > > > 
> > > > > I think this is slightly misleading because we might well zero-fill
> > > > > to the end of the folio.  The issue is that we're supposed to SIGBUS
> > > > > if userspace accesses pages which lie entirely beyond the end of this
> > > > > file.  Can you rephrase this?
> > > > > 
> > > > > (from mmap(2))
> > > > >          SIGBUS Attempted access to a page of the buffer that lies beyond the end
> > > > >                 of the mapped file.  For an explanation of the treatment  of  the
> > > > >                 bytes  in  the  page that corresponds to the end of a mapped file
> > > > >                 that is not a multiple of the page size, see NOTES.
> > > > > 
> > > > > 
> > > > > The code is good though.
> > > > > 
> > > > > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > 
> > > > Since I've been curating the respective fstests test to test for this
> > > > POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> > > > skipping it as I originally had it, and that meant also realizing mmap(2)
> > > > specifically says this now:
> > > > 
> > > > Huge page (Huge TLB) mappings
> > > 
> > > Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
> > > friends.
> > > 
> > > So it might not be required for below changes.
> > 
> > Thanks, I had to ask as we're dusting off this little obscure corner of
> > the universe. Reason I ask, is the test fails for tmpfs with huge pages,
> > and this patch fixes it, but it got me wondering the above applies also
> > to tmpfs with huge pages.
> 
> Is it tmpfs with THP/large folios or shmem with hugetlb? I assume the tmpfs
> with THP. There are not really mmap/munmap restrictions to THP and friends
> (because it's supposed to be "transparent" :) ).

The case I tested that failed the test was tmpfs with huge pages (not
large folios). So should we then have this:

diff --git a/mm/filemap.c b/mm/filemap.c
index ea78963f0956..649beb9bbc6b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	vm_fault_t ret = 0;
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+	unsigned int align = PAGE_SIZE;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3636,7 +3637,16 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	/*
+	 * As per the mmap(2) mmap(), the offset must be a multiple of the
+	 * underlying huge page size. The system automatically aligns length to
+	 * be a multiple of the underlying huge page size.
+	 */
+	if (folio_test_pmd_mappable(folio) &&
+	    (shmem_mapping(mapping) || folio_test_hugetlb(folio)))
+		align = 1 << folio_order(folio);
+
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
 	if (end_pgoff > file_end)
 		end_pgoff = file_end;
 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13 15:27             ` Luis Chamberlain
@ 2024-06-13 15:32               ` Matthew Wilcox
  2024-06-13 15:38                 ` Luis Chamberlain
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-13 15:32 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> The case I tested that failed the test was tmpfs with huge pages (not
> large folios). So should we then have this:

No.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13 15:32               ` Matthew Wilcox
@ 2024-06-13 15:38                 ` Luis Chamberlain
  2024-06-13 15:40                   ` Matthew Wilcox
  0 siblings, 1 reply; 58+ messages in thread
From: Luis Chamberlain @ 2024-06-13 15:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > The case I tested that failed the test was tmpfs with huge pages (not
> > large folios). So should we then have this:
> 
> No.

OK so this does have a change for tmpfs with huge pages enabled, do we
take the position then this is a fix for that?

  Luis

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13 15:38                 ` Luis Chamberlain
@ 2024-06-13 15:40                   ` Matthew Wilcox
  2024-06-13 19:39                     ` Luis Chamberlain
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-13 15:40 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 08:38:15AM -0700, Luis Chamberlain wrote:
> On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > > The case I tested that failed the test was tmpfs with huge pages (not
> > > large folios). So should we then have this:
> > 
> > No.
> 
> OK so this does have a change for tmpfs with huge pages enabled, do we
> take the position then this is a fix for that?

You literally said it was a fix just a few messages up thread?

Besides, the behaviour changes (currently) depending on whether
you specify "within_size" or "always".  This patch makes it consistent.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-06-13 15:40                   ` Matthew Wilcox
@ 2024-06-13 19:39                     ` Luis Chamberlain
  0 siblings, 0 replies; 58+ messages in thread
From: Luis Chamberlain @ 2024-06-13 19:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Hugh Dickins, yang, linmiaohe, muchun.song,
	osalvador, Pankaj Raghav (Samsung), david, djwong, chandan.babu,
	brauner, akpm, linux-mm, hare, linux-kernel, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 04:40:28PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 13, 2024 at 08:38:15AM -0700, Luis Chamberlain wrote:
> > On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > > > The case I tested that failed the test was tmpfs with huge pages (not
> > > > large folios). So should we then have this:
> > > 
> > > No.
> > 
> > OK so this does have a change for tmpfs with huge pages enabled, do we
> > take the position then this is a fix for that?
> 
> You literally said it was a fix just a few messages up thread?
> 
> Besides, the behaviour changes (currently) depending on whether
> you specify "within_size" or "always".  This patch makes it consistent.

The quoted mmap(2) text made me doubt it, and I was looking for
clarification. It seems clear now, based on the feedback, that the text
does not apply to tmpfs with huge pages, and so we'll just annotate this
as a fix for tmpfs with huge pages.

It makes sense that it does not apply. I mean, why *would* you assume you
will have an extended zeroed-out range to muck around with beyond
PAGE_SIZE just because huge pages were used, when all other filesystem
APIs count on the mmap(2) PAGE_SIZE boundary?

Thanks!

  Luis

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-12 18:50   ` Matthew Wilcox
@ 2024-06-14  9:26     ` Pankaj Raghav (Samsung)
  2024-06-17 12:32       ` Matthew Wilcox
  0 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-14  9:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Wed, Jun 12, 2024 at 07:50:49PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 02:58:55PM +0000, Pankaj Raghav (Samsung) wrote:
> > @@ -230,7 +247,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  		struct folio *folio = xa_load(&mapping->i_pages, index + i);
> >  		int ret;
> >  
> > +
> 
> Spurious newline
Oops.
> 
> >  		if (folio && !xa_is_value(folio)) {
> > +			long nr_pages = folio_nr_pages(folio);
> 
> Hm, but we don't have a reference on this folio.  So this isn't safe.
> 

That is why I added a check for mapping after read_pages(). You are
right, we can make it better.

> > @@ -240,12 +259,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  			 * not worth getting one just for that.
> >  			 */
> >  			read_pages(ractl);
> > -			ractl->_index += folio_nr_pages(folio);
> > +
> > +			/*
> > +			 * Move the ractl->_index by at least min_pages
> > +			 * if the folio got truncated to respect the
> > +			 * alignment constraint in the page cache.
> > +			 *
> > +			 */
> > +			if (mapping != folio->mapping)
> > +				nr_pages = min_nrpages;
> > +
> > +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> > +			ractl->_index += nr_pages;
> 
> Why not just:
> 			ractl->_index += min_nrpages;

Then we will only move min_nrpages even if the folio we found had a
bigger order. Hannes' patch (the first patch) made sure we move
ractl->_index by folio_nr_pages() instead of 1, and making this change
would defeat that purpose because without a mapping order set,
min_nrpages will be 1.

What I could do is the following:

diff --git a/mm/readahead.c b/mm/readahead.c
index 389cd802da63..92cf45cdb4d3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -249,7 +249,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 
 
                if (folio && !xa_is_value(folio)) {
-                       long nr_pages = folio_nr_pages(folio);
+                       long nr_pages;
                        /*
                         * Page already present?  Kick off the current batch
                         * of contiguous pages before continuing with the
@@ -266,10 +266,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
                         * alignment constraint in the page cache.
                         *
                         */
-                       if (mapping != folio->mapping)
-                               nr_pages = min_nrpages;
+                       nr_pages = max(folio_nr_pages(folio), (long)min_nrpages);
 
-                       VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
                        ractl->_index += nr_pages;
                        i = ractl->_index + ractl->_nr_pages - index;
                        continue;

Now we will still move respecting the min order constraint but if we had
a bigger folio and we do have a reference, then we move folio_nr_pages.
> 
> like you do below?
Below we add a folio of min_order, so if that fails for some reason, we
can unconditionally move min_nrpages.

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 11/11] xfs: enable block size larger than page size support
  2024-06-13  8:47   ` Christoph Hellwig
@ 2024-06-17  1:29     ` Dave Chinner
  2024-06-17  6:51       ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2024-06-17  1:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav (Samsung), djwong, chandan.babu, brauner, akpm,
	willy, mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan,
	linux-xfs, p.raghav, linux-fsdevel, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 10:47:25AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 02:59:02PM +0000, Pankaj Raghav (Samsung) wrote:
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
> >  		igeo->ialloc_align = mp->m_dalign;
> >  	else
> >  		igeo->ialloc_align = 0;
> > +
> > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > +	else
> > +		igeo->min_folio_order = 0;
> >  }
> 
> The minimum folio order isn't really part of the inode (allocation)
> geometry, is it?

I suggested it last time around instead of calculating the same
constant on every inode allocation. We're already storing in-memory
strunct xfs_inode allocation init values in this structure. e.g. in
xfs_inode_alloc() we see things like this:

	ip->i_diflags2 = mp->m_ino_geo.new_diflags2;

So that we only calculate the default values for the filesystem once
instead of on every inode allocation. This isn't unreasonable
because xfs_inode_alloc() is a fairly hot path.

The only other place we might store it is the struct xfs_mount, but
given all the inode allocation constants are already in the embedded
mp->m_ino_geo structure, it just seems like a much better idea to
put it with all the other inode allocation constants than dump it
randomly into the struct xfs_mount....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
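As a rough illustration of the consumer side (assuming the series'
mapping_set_folio_orders() helper and the igeo field discussed above; not
the exact posted code), the per-inode setup then becomes a plain field
load with no blocksize comparison:

	/* illustrative sketch, e.g. in xfs_inode_alloc()/xfs_reinit_inode() */
	mapping_set_folio_orders(VFS_I(ip)->i_mapping,
				 M_IGEO(mp)->min_folio_order,
				 MAX_PAGECACHE_ORDER);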

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 11/11] xfs: enable block size larger than page size support
  2024-06-17  1:29     ` Dave Chinner
@ 2024-06-17  6:51       ` Christoph Hellwig
  2024-06-17 16:31         ` Pankaj Raghav (Samsung)
  2024-06-17 23:18         ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: Christoph Hellwig @ 2024-06-17  6:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pankaj Raghav (Samsung), djwong, chandan.babu,
	brauner, akpm, willy, mcgrof, linux-mm, hare, linux-kernel, yang,
	Zi Yan, linux-xfs, p.raghav, linux-fsdevel, gost.dev, cl,
	john.g.garry

On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > +	else
> > > +		igeo->min_folio_order = 0;
> > >  }
> > 
> > The minimum folio order isn't really part of the inode (allocation)
> > geometry, is it?
> 
> I suggested it last time around instead of calculating the same
> constant on every inode allocation. We're already storing in-memory
> struct xfs_inode allocation init values in this structure. e.g. in
> xfs_inode_alloc() we see things like this:

While new_diflags2 isn't exactly inode geometry, it at least is part
of the inode allocation.  Folio min order for file data has nothing
to do with this at all.

> The only other place we might store it is the struct xfs_mount, but
> given all the inode allocation constants are already in the embedded
> mp->m_ino_geo structure, it just seems like a much better idea to
> put it with all the other inode allocation constants than dump it
> randomly into the struct xfs_mount....

Well, it is very closely related to, say, the m_blockmask field in
struct xfs_mount.  Then again, modern CPUs tend to get you a simple
subtraction for free in most pipelines doing other things, so I'm
not really sure it's worth caching for use in inode allocation to
start with, but I don't care strongly about that.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-13  8:44   ` Christoph Hellwig
@ 2024-06-17  9:58     ` Pankaj Raghav (Samsung)
  2024-06-17 12:34       ` Matthew Wilcox
  0 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-17  9:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 10:44:10AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 02:58:54PM +0000, Pankaj Raghav (Samsung) wrote:
> > +static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
> > +{
> > +	return 1UL << mapping_min_folio_order(mapping);
> > +}
> 
> Overly long line here, just line break after the return type.
> 
> Then again it only has a single user just below and no documentation
> so maybe just fold it into the caller?

I do use it in later patches. I will adjust the long line here :)

> 
> >  no_page:
> >  	if (!folio && (fgp_flags & FGP_CREAT)) {
> > -		unsigned order = FGF_GET_ORDER(fgp_flags);
> > +		unsigned int min_order = mapping_min_folio_order(mapping);
> > +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
> >  		int err;
> > +		index = mapping_align_start_index(mapping, index);
> 
> I wonder if at some point splitting this block that actually allocates
> a new folio into a separate helper would be nice.  It just keeps growing
> in size and complexity.
> 

I agree with that. I will put it in my future todo backlog.

> > -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> > +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> > +				    min_order);
> 
> Nit: no need to split this into multiple lines.

Ok.
> 
> >  	if (!folio)
> >  		return -ENOMEM;
> >  
> > @@ -2471,6 +2478,8 @@ static int filemap_create_folio(struct file *file,
> >  	 * well to keep locking rules simple.
> >  	 */
> >  	filemap_invalidate_lock_shared(mapping);
> > +	/* index in PAGE units but aligned to min_order number of pages. */
> 
> in PAGE_SIZE units?  Maybe also make this a complete sentence?
Yes, will do.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-14  9:26     ` Pankaj Raghav (Samsung)
@ 2024-06-17 12:32       ` Matthew Wilcox
  2024-06-17 16:04         ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-17 12:32 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Fri, Jun 14, 2024 at 09:26:02AM +0000, Pankaj Raghav (Samsung) wrote:
> > Hm, but we don't have a reference on this folio.  So this isn't safe.
> 
> That is why I added a check for mapping after read_pages(). You are
> right, we can make it better.

That's not enough.

> > > +			if (mapping != folio->mapping)
> > > +				nr_pages = min_nrpages;
> > > +
> > > +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> > > +			ractl->_index += nr_pages;
> > 
> > Why not just:
> > 			ractl->_index += min_nrpages;
> 
> Then we will only move min_nrpages even if the folio we found had a
> bigger order. Hannes patches (first patch) made sure we move the
> ractl->index by folio_nr_pages instead of 1 and making this change will
> defeat the purpose because without mapping order set, min_nrpages will
> be 1.

Hannes' patch is wrong.  It's not safe to call folio_nr_pages() unless
you have a reference to the folio.

> @@ -266,10 +266,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>                          * alignment constraint in the page cache.
>                          *
>                          */
> -                       if (mapping != folio->mapping)
> -                               nr_pages = min_nrpages;
> +                       nr_pages = max(folio_nr_pages(folio), (long)min_nrpages);

No.

> Now we will still move respecting the min order constraint but if we had
> a bigger folio and we do have a reference, then we move folio_nr_pages.

You don't have a reference, so it's never safe.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-06-17  9:58     ` Pankaj Raghav (Samsung)
@ 2024-06-17 12:34       ` Matthew Wilcox
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-17 12:34 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, david, djwong, chandan.babu, brauner, akpm,
	mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs,
	p.raghav, linux-fsdevel, gost.dev, cl, john.g.garry

On Mon, Jun 17, 2024 at 09:58:37AM +0000, Pankaj Raghav (Samsung) wrote:
> > > @@ -2471,6 +2478,8 @@ static int filemap_create_folio(struct file *file,
> > >  	 * well to keep locking rules simple.
> > >  	 */
> > >  	filemap_invalidate_lock_shared(mapping);
> > > +	/* index in PAGE units but aligned to min_order number of pages. */
> > 
> > in PAGE_SIZE units?  Maybe also make this a complete sentence?
> Yes, will do.

I'd delete the comment entirely.  Anyone working on this code should
already know that folio indices are in units of PAGE_SIZE and must
be aligned, so I'm not sure what value this comment adds.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-06-12 20:40   ` Darrick J. Wong
@ 2024-06-17 15:08     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-17 15:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, chandan.babu, brauner, akpm, willy, mcgrof, linux-mm, hare,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On Wed, Jun 12, 2024 at 01:40:25PM -0700, Darrick J. Wong wrote:
> On Fri, Jun 07, 2024 at 02:58:58PM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index f3b43d223a46..b95600b254a3 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -27,6 +27,13 @@
> >  #define IOMAP_DIO_WRITE		(1U << 30)
> >  #define IOMAP_DIO_DIRTY		(1U << 31)
> >  
> > +/*
> > + * Used for sub block zeroing in iomap_dio_zero()
> > + */
> > +#define ZERO_FSB_SIZE (65536)
> > +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> > +static struct page *zero_fs_block;
> 
> Er... zero_page_64k ?
> 
> Since it's a permanent allocation, can we also mark the memory ro?

Sounds good.
> 
> > +
> >  struct iomap_dio {
> >  	struct kiocb		*iocb;
> >  	const struct iomap_dio_ops *dops;
> > @@ -52,6 +59,16 @@ struct iomap_dio {
> >  	};
> >  };
> >  
> > +int iomap_dio_init(void)
> > +{
> > +	zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
> > +
> > +	if (!zero_fs_block)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> 
> Can't we just turn this into another fs_initcall() instead of exporting
> it just so we can call it from iomap_init?  And maybe rename the
> existing iomap_init to iomap_pagecache_init or something, for clarity's
> sake?

Yeah, probably iomap_pagecache_init() in fs/iomap/buffered-io.c and
iomap_dio_init() in fs/iomap/direct-io.c 
> 
> --D
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-17 12:32       ` Matthew Wilcox
@ 2024-06-17 16:04         ` Pankaj Raghav (Samsung)
  2024-06-17 16:10           ` Matthew Wilcox
  0 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-17 16:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 14, 2024 at 09:26:02AM +0000, Pankaj Raghav (Samsung) wrote:
> > > Hm, but we don't have a reference on this folio.  So this isn't safe.
> > 
> > That is why I added a check for mapping after read_pages(). You are
> > right, we can make it better.
> 
> That's not enough.
> 
> > > > +			if (mapping != folio->mapping)
> > > > +				nr_pages = min_nrpages;
> > > > +
> > > > +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> > > > +			ractl->_index += nr_pages;
> > > 
> > > Why not just:
> > > 			ractl->_index += min_nrpages;
> > 
> > Then we will only move min_nrpages even if the folio we found had a
> > bigger order. Hannes patches (first patch) made sure we move the
> > ractl->index by folio_nr_pages instead of 1 and making this change will
> > defeat the purpose because without mapping order set, min_nrpages will
> > be 1.
> 
> Hannes' patch is wrong.  It's not safe to call folio_nr_pages() unless
> you have a reference to the folio.
> 
> > @@ -266,10 +266,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >                          * alignment constraint in the page cache.
> >                          *
> >                          */
> > -                       if (mapping != folio->mapping)
> > -                               nr_pages = min_nrpages;
> > +                       nr_pages = max(folio_nr_pages(folio), (long)min_nrpages);
> 
> No.
> 
> > Now we will still move respecting the min order constraint but if we had
> > a bigger folio and we do have a reference, then we move folio_nr_pages.
> 
> You don't have a reference, so it's never safe.
I am hitting my head now because you have literally mentioned that in
the comment:

	 * next batch.  This page may be the one we would
	 * have intended to mark as Readahead, but we don't
	 * have a stable reference to this page, and it's
	 * not worth getting one just for that.

I will move it by min_nrpages as follows:
>	ractl->_index += min_nrpages;


So the following can still be there from Hannes patch as we have a 
stable reference:

 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 

Thanks for the clarification.

--
Pankaj

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-06-13  8:45   ` Christoph Hellwig
@ 2024-06-17 16:09     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-17 16:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, gost.dev, cl, john.g.garry

On Thu, Jun 13, 2024 at 10:45:45AM +0200, Christoph Hellwig wrote:
> > +	uint64_t		max_index;
> > +	uint64_t		max_bytes;
> > +
> >  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> >  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> >  
> > +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> > +		return -EFBIG;
> > +
> >  	/* Limited by ULONG_MAX of page cache index */
> > -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> > +	max_index = max_bytes >> PAGE_SHIFT;
> > +
> > +	if (max_index > ULONG_MAX)
> 
> Do we really need the max_index variable for a single user here?
> Or do you plan to add more uses of it later (can't really think of one
> though)?

Yeah, we could just inline it:

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index a99454208807..c6933440f806 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,7 +132,6 @@ xfs_sb_validate_fsb_count(
        xfs_sb_t        *sbp,
        uint64_t        nblocks)
 {
-       uint64_t                max_index;
        uint64_t                max_bytes;
 
        ASSERT(sbp->sb_blocklog >= BBSHIFT);
@@ -141,9 +140,7 @@ xfs_sb_validate_fsb_count(
                return -EFBIG;
 
        /* Limited by ULONG_MAX of page cache index */
-       max_index = max_bytes >> PAGE_SHIFT;
-
-       if (max_index > ULONG_MAX)
+       if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
                return -EFBIG;
        return 0;
 }

> 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-17 16:04         ` Pankaj Raghav (Samsung)
@ 2024-06-17 16:10           ` Matthew Wilcox
  2024-06-17 16:39             ` Pankaj Raghav (Samsung)
  2024-06-18  6:52             ` Hannes Reinecke
  0 siblings, 2 replies; 58+ messages in thread
From: Matthew Wilcox @ 2024-06-17 16:10 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Mon, Jun 17, 2024 at 04:04:20PM +0000, Pankaj Raghav (Samsung) wrote:
> On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
> So the following can still be there from Hannes patch as we have a 
> stable reference:
> 
>  		ractl->_workingset |= folio_test_workingset(folio);
> -		ractl->_nr_pages++;
> +		ractl->_nr_pages += folio_nr_pages(folio);
> +		i += folio_nr_pages(folio);
>  	}

We _can_, but we just allocated it, so we know what size it is already.
I'm starting to feel that Hannes' patch should be combined with this
one.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 11/11] xfs: enable block size larger than page size support
  2024-06-17  6:51       ` Christoph Hellwig
@ 2024-06-17 16:31         ` Pankaj Raghav (Samsung)
  2024-06-17 23:18         ` Dave Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-17 16:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, djwong, chandan.babu, brauner, akpm, willy, mcgrof,
	linux-mm, hare, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, gost.dev, cl, john.g.garry

On Mon, Jun 17, 2024 at 08:51:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > > +	else
> > > > +		igeo->min_folio_order = 0;
> > > >  }
> > > 
> > > The minimum folio order isn't really part of the inode (allocation)
> > > geometry, is it?
> > 
> > I suggested it last time around instead of calculating the same
> > constant on every inode allocation. We're already storing in-memory
> > strunct xfs_inode allocation init values in this structure. e.g. in
> > xfs_inode_alloc() we see things like this:
> 
> While new_diflags2 isn't exactly inode geometry, it at least is part
> of the inode allocation.  Folio min order for file data has nothing
> to do with this at all.
> 
> > The only other place we might store it is the struct xfs_mount, but
> > given all the inode allocation constants are already in the embedded
> > mp->m_ino_geo structure, it just seems like a much better idea to
> > put it will all the other inode allocation constants than dump it
> > randomly into the struct xfs_mount....
> 
> Well, it is very closely related to, say, the m_blockmask field in
> struct xfs_mount.  Then again, modern CPUs tend to get you a simple
> subtraction for free in most pipelines doing other things, so I'm
> not really sure it's worth caching for use in inode allocation to
> start with, but I don't care strongly about that.

But there will also be an extra conditional, apart from the subtraction,
right?

Initially it was something like this:

@@ -73,6 +73,7 @@ xfs_inode_alloc(
 	xfs_ino_t		ino)
 {
 	struct xfs_inode	*ip;
+	int			min_order = 0;
 
 	/*
 	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
@@ -88,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -313,6 +315,7 @@ xfs_reinit_inode(
 	dev_t			dev = inode->i_rdev;
 	kuid_t			uid = inode->i_uid;
 	kgid_t			gid = inode->i_gid;
+	int			min_order = 0;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -323,7 +326,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 	return error;
 }

It does introduce a conditional in the inode allocation hot path so I
went with what Chinner proposed as it is something we use when we
initialize an inode.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-17 16:10           ` Matthew Wilcox
@ 2024-06-17 16:39             ` Pankaj Raghav (Samsung)
  2024-06-18  6:56               ` Hannes Reinecke
  2024-06-18  6:52             ` Hannes Reinecke
  1 sibling, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-17 16:39 UTC (permalink / raw)
  To: Matthew Wilcox, hare
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On Mon, Jun 17, 2024 at 05:10:15PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 17, 2024 at 04:04:20PM +0000, Pankaj Raghav (Samsung) wrote:
> > On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
> > So the following can still be there from Hannes patch as we have a 
> > stable reference:
> > 
> >  		ractl->_workingset |= folio_test_workingset(folio);
> > -		ractl->_nr_pages++;
> > +		ractl->_nr_pages += folio_nr_pages(folio);
> > +		i += folio_nr_pages(folio);
> >  	}
> 
> We _can_, but we just allocated it, so we know what size it is already.
Yes.

> I'm starting to feel that Hannes' patch should be combined with this
> one.

Fine by me. @Hannes, is that ok with you?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v7 11/11] xfs: enable block size larger than page size support
  2024-06-17  6:51       ` Christoph Hellwig
  2024-06-17 16:31         ` Pankaj Raghav (Samsung)
@ 2024-06-17 23:18         ` Dave Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2024-06-17 23:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav (Samsung), djwong, chandan.babu, brauner, akpm,
	willy, mcgrof, linux-mm, hare, linux-kernel, yang, Zi Yan,
	linux-xfs, p.raghav, linux-fsdevel, gost.dev, cl, john.g.garry

On Mon, Jun 17, 2024 at 08:51:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > > +	else
> > > > +		igeo->min_folio_order = 0;
> > > >  }
> > > 
> > > The minimum folio order isn't really part of the inode (allocation)
> > > geometry, is it?
> > 
> > I suggested it last time around instead of calculating the same
> > constant on every inode allocation. We're already storing in-memory
> > struct xfs_inode allocation init values in this structure. e.g. in
> > xfs_inode_alloc() we see things like this:
> 
> While new_diflags2 isn't exactly inode geometry, it at least is part
> of the inode allocation.  Folio min order for file data has nothing
> to do with this at all.

Yet ip->i_diflags2 is *not* initialised in xfs_init_new_inode()
when we physically allocate and initialise a new inode. It is set
for all inodes when they are allocated in memory, regardless of
their use.

Whether that is the right thing to do or not is a separate issue -
xfs_inode_from_disk() will overwrite it in every inode read case
that isn't a create.

Indeed, we could do the folio order initialisation in
xfs_setup_inode() where we set up the mapping gfp mask, but that
doesn't change the fact we set it up for every inode that is
instantiated in memory or that we want it pre-calculated...
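
Illustratively (a sketch only, not the posted patch), that would sit
right next to the existing gfp mask setup:

	/* in xfs_setup_inode(), alongside the gfp mask setup */
	gfp_mask = mapping_gfp_mask(inode->i_mapping);
	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
	mapping_set_folio_orders(inode->i_mapping,
				 M_IGEO(ip->i_mount)->min_folio_order,
				 MAX_PAGECACHE_ORDER);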

> > The only other place we might store it is the struct xfs_mount, but
> > given all the inode allocation constants are already in the embedded
> > mp->m_ino_geo structure, it just seems like a much better idea to
> > put it with all the other inode allocation constants than to dump it
> > randomly into the struct xfs_mount....
> 
> Well, it is very closely related to, say, the m_blockmask field in
> struct xfs_mount.

Not really. The block mask is a property of the filesystem and is used
primarily for converting lengths between units of FSB and byte counts.
It is used all over the place, and the only guaranteed
common structure that all those callers have access to is the
xfs_mount.

OTOH, the folio order is only used for regular files to tell the
page cache how to behave. The scope of the folio order setup is the
same as mapping_set_gfp_mask() - it is only set in one place and
used only for inode configuration. I may have called the structure "inode
geometry" because that described what it contained when I first
implemented it, but that doesn't mean that is all it can contain.
It contains static, precalculated inode configuration
values, and that is what we are adding here...

> Then again, modern CPUs tend to get a simple
> subtraction for free in most pipelines doing other things, so I'm
> not really sure it's worth caching for use in inode allocation to
> start with, but I don't care strongly about that.

It's not the cost of a subtraction that is the problem -
precalculation is about avoiding a potential branch misprediction in
a hot path that would stall the CPU. If there were no branches, it
wouldn't be an issue, but this value cannot be calculated without at
least one branch in the logic.
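
To illustrate the difference (not code from the patch):

	/* per-inode calculation: at least one branch (or cmov) every time */
	min_order = mp->m_sb.sb_blocksize > PAGE_SIZE ?
			mp->m_sb.sb_blocklog - PAGE_SHIFT : 0;

	/* precalculated: a straight load, nothing to mispredict */
	min_order = M_IGEO(mp)->min_folio_order;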

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-17 16:10           ` Matthew Wilcox
  2024-06-17 16:39             ` Pankaj Raghav (Samsung)
@ 2024-06-18  6:52             ` Hannes Reinecke
  1 sibling, 0 replies; 58+ messages in thread
From: Hannes Reinecke @ 2024-06-18  6:52 UTC (permalink / raw)
  To: Matthew Wilcox, Pankaj Raghav (Samsung)
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On 6/17/24 18:10, Matthew Wilcox wrote:
> On Mon, Jun 17, 2024 at 04:04:20PM +0000, Pankaj Raghav (Samsung) wrote:
>> On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
>> So the following can still be there from Hannes patch as we have a
>> stable reference:
>>
>>   		ractl->_workingset |= folio_test_workingset(folio);
>> -		ractl->_nr_pages++;
>> +		ractl->_nr_pages += folio_nr_pages(folio);
>> +		i += folio_nr_pages(folio);
>>   	}
> 
> We _can_, but we just allocated it, so we know what size it is already.
> I'm starting to feel that Hannes' patch should be combined with this
> one.

And we could even make it conditional; on recent devices allocating 64k
(or even 2M) worth of zero pages is not a big deal. And if you have 
machines where this is an issue, maybe using large folios isn't the best
of ideas to start with.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-17 16:39             ` Pankaj Raghav (Samsung)
@ 2024-06-18  6:56               ` Hannes Reinecke
  2024-06-21 12:19                 ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 58+ messages in thread
From: Hannes Reinecke @ 2024-06-18  6:56 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Matthew Wilcox
  Cc: david, djwong, chandan.babu, brauner, akpm, mcgrof, linux-mm,
	linux-kernel, yang, Zi Yan, linux-xfs, p.raghav, linux-fsdevel,
	hch, gost.dev, cl, john.g.garry

On 6/17/24 18:39, Pankaj Raghav (Samsung) wrote:
> On Mon, Jun 17, 2024 at 05:10:15PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 17, 2024 at 04:04:20PM +0000, Pankaj Raghav (Samsung) wrote:
>>> On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
>>> So the following can still be there from Hannes patch as we have a
>>> stable reference:
>>>
>>>   		ractl->_workingset |= folio_test_workingset(folio);
>>> -		ractl->_nr_pages++;
>>> +		ractl->_nr_pages += folio_nr_pages(folio);
>>> +		i += folio_nr_pages(folio);
>>>   	}
>>
>> We _can_, but we just allocated it, so we know what size it is already.
> Yes.
> 
>> I'm starting to feel that Hannes' patch should be combined with this
>> one.
> 
> Fine by me. @Hannes, is that ok with you?

Sure. I was about to re-send my patchset anyway, so feel free to wrap it in.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-18  6:56               ` Hannes Reinecke
@ 2024-06-21 12:19                 ` Pankaj Raghav (Samsung)
  2024-06-21 13:28                   ` Hannes Reinecke
  0 siblings, 1 reply; 58+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-21 12:19 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, david, djwong, chandan.babu, brauner, akpm,
	mcgrof, linux-mm, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On Tue, Jun 18, 2024 at 08:56:53AM +0200, Hannes Reinecke wrote:
> On 6/17/24 18:39, Pankaj Raghav (Samsung) wrote:
> > On Mon, Jun 17, 2024 at 05:10:15PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jun 17, 2024 at 04:04:20PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
> > > > So the following can still be there from Hannes patch as we have a
> > > > stable reference:
> > > > 
> > > >   		ractl->_workingset |= folio_test_workingset(folio);
> > > > -		ractl->_nr_pages++;
> > > > +		ractl->_nr_pages += folio_nr_pages(folio);
> > > > +		i += folio_nr_pages(folio);
> > > >   	}
> > > 
> > > We _can_, but we just allocated it, so we know what size it is already.
> > Yes.
> > 
> > > I'm starting to feel that Hannes' patch should be combined with this
> > > one.
> > 
> > Fine by me. @Hannes, is that ok with you?
> 
> Sure. I was about to re-send my patchset anyway, so feel free to wrap it in.
Is it ok if I add your Co-developed-by and Signed-off-by tags?
This is what I have after combining your patch with mine and applying
willy's suggested changes:

diff --git a/mm/readahead.c b/mm/readahead.c
index 389cd802da63..f56da953c130 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -247,9 +247,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
                struct folio *folio = xa_load(&mapping->i_pages, index + i);
                int ret;
 
-
                if (folio && !xa_is_value(folio)) {
-                       long nr_pages = folio_nr_pages(folio);
                        /*
                         * Page already present?  Kick off the current batch
                         * of contiguous pages before continuing with the
@@ -259,18 +257,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
                         * not worth getting one just for that.
                         */
                        read_pages(ractl);
-
-                       /*
-                        * Move the ractl->_index by at least min_pages
-                        * if the folio got truncated to respect the
-                        * alignment constraint in the page cache.
-                        *
-                        */
-                       if (mapping != folio->mapping)
-                               nr_pages = min_nrpages;
-
-                       VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
-                       ractl->_index += nr_pages;
+                       ractl->_index += min_nrpages;
                        i = ractl->_index + ractl->_nr_pages - index;
                        continue;
                }
@@ -293,8 +280,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
                if (i == mark)
                        folio_set_readahead(folio);
                ractl->_workingset |= folio_test_workingset(folio);
-               ractl->_nr_pages += folio_nr_pages(folio);
-               i += folio_nr_pages(folio);
+               ractl->_nr_pages += min_nrpages;
+               i += min_nrpages;
        }
 
        /*


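To make the end state explicit, the loop stepping with those hunks
applied reads roughly as follows (reconstructed from the fragments
above, so it may not match the final patch line for line):

		if (folio && !xa_is_value(folio)) {
			/* Already present: flush the current batch and skip
			 * ahead by the minimum mapping order. */
			read_pages(ractl);
			ractl->_index += min_nrpages;
			i = ractl->_index + ractl->_nr_pages - index;
			continue;
		}
		...
		/* Newly allocated folio is min_order sized, so step by that. */
		ractl->_workingset |= folio_test_workingset(folio);
		ractl->_nr_pages += min_nrpages;
		i += min_nrpages;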

* Re: [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-06-21 12:19                 ` Pankaj Raghav (Samsung)
@ 2024-06-21 13:28                   ` Hannes Reinecke
  0 siblings, 0 replies; 58+ messages in thread
From: Hannes Reinecke @ 2024-06-21 13:28 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Matthew Wilcox, david, djwong, chandan.babu, brauner, akpm,
	mcgrof, linux-mm, linux-kernel, yang, Zi Yan, linux-xfs, p.raghav,
	linux-fsdevel, hch, gost.dev, cl, john.g.garry

On 6/21/24 14:19, Pankaj Raghav (Samsung) wrote:
> On Tue, Jun 18, 2024 at 08:56:53AM +0200, Hannes Reinecke wrote:
>> On 6/17/24 18:39, Pankaj Raghav (Samsung) wrote:
>>> On Mon, Jun 17, 2024 at 05:10:15PM +0100, Matthew Wilcox wrote:
>>>> On Mon, Jun 17, 2024 at 04:04:20PM +0000, Pankaj Raghav (Samsung) wrote:
>>>>> On Mon, Jun 17, 2024 at 01:32:42PM +0100, Matthew Wilcox wrote:
>>>>> So the following can still be there from Hannes patch as we have a
>>>>> stable reference:
>>>>>
>>>>>    		ractl->_workingset |= folio_test_workingset(folio);
>>>>> -		ractl->_nr_pages++;
>>>>> +		ractl->_nr_pages += folio_nr_pages(folio);
>>>>> +		i += folio_nr_pages(folio);
>>>>>    	}
>>>>
>>>> We _can_, but we just allocated it, so we know what size it is already.
>>> Yes.
>>>
>>>> I'm starting to feel that Hannes' patch should be combined with this
>>>> one.
>>>
>>> Fine by me. @Hannes, is that ok with you?
>>
>> Sure. I was about to re-send my patchset anyway, so feel free to wrap it in.
> Is it ok if I add your Co-developed and Signed-off tag?
> This is what I have combining your patch with mine and making willy's
> changes:
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 389cd802da63..f56da953c130 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -247,9 +247,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>                  struct folio *folio = xa_load(&mapping->i_pages, index + i);
>                  int ret;
>   
> -
>                  if (folio && !xa_is_value(folio)) {
> -                       long nr_pages = folio_nr_pages(folio);
>                          /*
>                           * Page already present?  Kick off the current batch
>                           * of contiguous pages before continuing with the
> @@ -259,18 +257,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>                           * not worth getting one just for that.
>                           */
>                          read_pages(ractl);
> -
> -                       /*
> -                        * Move the ractl->_index by at least min_pages
> -                        * if the folio got truncated to respect the
> -                        * alignment constraint in the page cache.
> -                        *
> -                        */
> -                       if (mapping != folio->mapping)
> -                               nr_pages = min_nrpages;
> -
> -                       VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> -                       ractl->_index += nr_pages;
> +                       ractl->_index += min_nrpages;
>                          i = ractl->_index + ractl->_nr_pages - index;
>                          continue;
>                  }
> @@ -293,8 +280,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>                  if (i == mark)
>                          folio_set_readahead(folio);
>                  ractl->_workingset |= folio_test_workingset(folio);
> -               ractl->_nr_pages += folio_nr_pages(folio);
> -               i += folio_nr_pages(folio);
> +               ractl->_nr_pages += min_nrpages;
> +               i += min_nrpages;
>          }
>   
>          /*
> 
Yes, that looks fine.
Go.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



end of thread, other threads:[~2024-06-21 13:28 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
2024-06-07 14:58 [PATCH v7 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-06-07 14:58 ` [PATCH v7 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
2024-06-07 14:58 ` [PATCH v7 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-06-12 15:38   ` Darrick J. Wong
2024-06-07 14:58 ` [PATCH v7 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-06-12  9:01   ` Hannes Reinecke
2024-06-12 15:40   ` Darrick J. Wong
2024-06-12 17:24   ` Matthew Wilcox
2024-06-13  8:44   ` Christoph Hellwig
2024-06-17  9:58     ` Pankaj Raghav (Samsung)
2024-06-17 12:34       ` Matthew Wilcox
2024-06-07 14:58 ` [PATCH v7 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-06-12 18:50   ` Matthew Wilcox
2024-06-14  9:26     ` Pankaj Raghav (Samsung)
2024-06-17 12:32       ` Matthew Wilcox
2024-06-17 16:04         ` Pankaj Raghav (Samsung)
2024-06-17 16:10           ` Matthew Wilcox
2024-06-17 16:39             ` Pankaj Raghav (Samsung)
2024-06-18  6:56               ` Hannes Reinecke
2024-06-21 12:19                 ` Pankaj Raghav (Samsung)
2024-06-21 13:28                   ` Hannes Reinecke
2024-06-18  6:52             ` Hannes Reinecke
2024-06-07 14:58 ` [PATCH v7 05/11] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
2024-06-07 16:58   ` Zi Yan
2024-06-07 17:01     ` Matthew Wilcox
2024-06-07 20:45       ` Pankaj Raghav (Samsung)
2024-06-07 20:30     ` Pankaj Raghav (Samsung)
2024-06-07 20:51       ` Zi Yan
2024-06-10  7:26         ` Pankaj Raghav (Samsung)
2024-06-12  9:02   ` Hannes Reinecke
2024-06-07 14:58 ` [PATCH v7 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
2024-06-12 19:08   ` Matthew Wilcox
2024-06-13  7:57     ` Luis Chamberlain
2024-06-13  8:07       ` David Hildenbrand
2024-06-13  8:13         ` Luis Chamberlain
2024-06-13  8:16           ` David Hildenbrand
2024-06-13 15:27             ` Luis Chamberlain
2024-06-13 15:32               ` Matthew Wilcox
2024-06-13 15:38                 ` Luis Chamberlain
2024-06-13 15:40                   ` Matthew Wilcox
2024-06-13 19:39                     ` Luis Chamberlain
2024-06-07 14:58 ` [PATCH v7 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-06-11  7:38   ` John Garry
2024-06-11  9:41     ` Pankaj Raghav (Samsung)
2024-06-11 10:00       ` John Garry
2024-06-12 20:40   ` Darrick J. Wong
2024-06-17 15:08     ` Pankaj Raghav (Samsung)
2024-06-07 14:58 ` [PATCH v7 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-06-07 14:59 ` [PATCH v7 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-06-07 14:59 ` [PATCH v7 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-06-13  8:45   ` Christoph Hellwig
2024-06-17 16:09     ` Pankaj Raghav (Samsung)
2024-06-07 14:59 ` [PATCH v7 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-06-13  8:47   ` Christoph Hellwig
2024-06-17  1:29     ` Dave Chinner
2024-06-17  6:51       ` Christoph Hellwig
2024-06-17 16:31         ` Pankaj Raghav (Samsung)
2024-06-17 23:18         ` Dave Chinner
