* [RFC 00/23] Enable block size > page size in XFS
@ 2023-09-15 18:38 Pankaj Raghav
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
	linux-mm, chandan.babu, mcgrof, gost.dev

From: Pankaj Raghav <p.raghav@samsung.com>

There have been efforts over the last 16 years to enable Large Block
Sizes (LBS), that is block sizes in filesystems where bs > page
size [1] [2]. Through these efforts we have learned that one of the
main blockers to supporting bs > ps in filesystems has been a way to
allocate pages that are at least the filesystem block size in the page
cache where bs > ps [3]. Another blocker was the changes required in
filesystems due to buffer-heads. Thanks to these previous efforts, the
surgery by Matthew Wilcox in the page cache to adopt xarray's
multi-index support, and iomap support, supporting bs > ps in XFS is now
possible with only a few lines of change to XFS itself. Most of the
changes are to the page cache to support a minimum folio order matching
the target block size on the filesystem.

A new motivation for LBS today is to support high-capacity (many
terabytes) QLC SSDs where the internal Indirection Unit (IU) is
typically greater than 4k [4], to help reduce DRAM usage and in turn
cost and space. In practice this allows different architectures to use
a base page size of 4k while still enabling support for block sizes
aligned to the larger IUs, by relying on high order folios in the page
cache when needed. It also makes it possible to take advantage of these
same drives' support for atomics larger than 4k with buffered IO
support in Linux. As described this year at LSFMM, supporting large
atomics greater than 4k enables databases to remove the need to rely on
their own journaling, so they can disable double buffered writes [5], a
feature which different cloud providers are already enabling for
customers through custom storage solutions.

This series still needs some polishing and fixing of some crashes, but
it is mainly targeted at getting initial feedback from the community
and enabling initial experimentation, hence the RFC. It's being posted
now given that our testing is showing much better results than
expected, and we hope to polish this up together with the community.
After all, this has been a 16 year effort and none of this could have
been possible without it.

Implementation:

This series only adds the notion of a minimum folio order in the
page cache, as initially proposed by Willy. The minimum folio order
requirement is set during inode creation. The minimum order will
typically correspond to the filesystem block size. The page cache will
in turn respect the minimum folio order requirement while allocating a
folio. This series mainly changes the page cache's filemap, readahead,
and truncation code to allocate and align folios to the minimum order
set for the inode's address space mapping.
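
To make the idea concrete, below is a minimal sketch of the page cache
side. The helper mapping_min_folio_order() is an assumed name used for
illustration only and may differ from what the series introduces; the
point is just that the index is aligned down to a minimum-order
boundary and no folio smaller than the minimum order is allocated.

/*
 * Illustrative sketch only; mapping_min_folio_order() is an assumed
 * helper name, not necessarily the one added by this series.
 */
static struct folio *add_min_order_folio(struct address_space *mapping,
					 pgoff_t index, gfp_t gfp)
{
	unsigned int min_order = mapping_min_folio_order(mapping); /* assumed */
	struct folio *folio;
	int err;

	/* Align the index down to a boundary of the minimum folio order. */
	index = round_down(index, 1UL << min_order);

	/* Never allocate a folio smaller than the minimum order. */
	folio = filemap_alloc_folio(gfp, min_order);
	if (!folio)
		return ERR_PTR(-ENOMEM);

	err = filemap_add_folio(mapping, folio, index, gfp);
	if (err) {
		folio_put(folio);
		return ERR_PTR(err);
	}
	return folio;
}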

Only XFS was enabled and tested as a part of this series as it has
supported block sizes up to 64k and sector sizes up to 32k for years.
The only thing missing was the page cache magic to enable bs > ps.
However, any filesystem that doesn't depend on buffer-heads and already
supports larger block sizes should be able to leverage this effort to
also support LBS, bs > ps.
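
As a rough illustration of the filesystem side, a sketch of how XFS
might derive the minimum folio order from its block size at inode
setup follows. mapping_set_folio_min_order() is an assumed helper name
here; the actual interface used by the series may differ.

/*
 * Illustrative sketch; mapping_set_folio_min_order() is an assumed
 * helper name. The point is only that the minimum order is derived
 * from the filesystem block size when it exceeds the page size.
 */
static void xfs_setup_min_folio_order(struct inode *inode,
				      struct xfs_mount *mp)
{
	unsigned int order = 0;

	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
		order = mp->m_sb.sb_blocklog - PAGE_SHIFT;

	mapping_set_folio_min_order(inode->i_mapping, order); /* assumed */
}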

This also paves the way for supporting block devices whose logical
block size > page size in the future, by leveraging iomap's address
space operations added to the block device cache by Christoph Hellwig
[6]. We have work in progress to enable support for this, enabling
LBAs > 4k on NVMe, while at the same time allowing coexistence with
buffer-heads on the same block device, so that a drive can switch
between filesystems which depend on buffer-heads and filesystems which
need the iomap address space operations for the block device cache.
Patches for this will be posted shortly after this patch series.

Testing:

The test results show this isn't so scary. There are only a few
regressions so far on XFS with CRCs disabled on block sizes smaller
than 4k, and some generic tests crash the system for bs > 4k. The
crashes are at most a handful at this point. This series has been
cleaned up 3 times now after we passed our first billion fsx ops on
different block sizes. Not surprisingly there are a few test bugs for
the bs > ps world.

We first established a baseline against linux-next with the 14
different XFS test profiles as maintained in kdevops [7]:

xfs_crc
xfs_reflink
xfs_reflink_normapbt
xfs_reflink_1024
xfs_reflink_2k
xfs_reflink_4k
xfs_nocrc
xfs_nocrc_512
xfs_nocrc_1k
xfs_nocrc_2k
xfs_nocrc_4k
xfs_logdev
xfs_rtdev
xfs_rtlogdev

We established a high confidence baseline for linux-next first and have
kept following that to ensure we don't regress against it. The majority
of regressions are fsx ops on no-CRC block sizes of 512 and 2k, and we
plan to fix that, but we welcome others to jump in and collaborate at
this point.

The list of known possible regressions can then be seen in kdevops
with git grep:

git grep regression workflows/fstests/expunges/6.6.0-rc1-large-block-20230914/ | awk -F"unassigned/" '{print $2}'
xfs_nocrc_2k.txt:generic/075 # possible regression
xfs_nocrc_2k.txt:generic/112 # possible regression
xfs_nocrc_2k.txt:generic/127 # possible regression
xfs_nocrc_2k.txt:generic/231 # possible regression
xfs_nocrc_2k.txt:generic/263 # possible regression
xfs_nocrc_2k.txt:generic/469 # possible regression
xfs_nocrc_512.txt:generic/075 # possible regression
xfs_nocrc_512.txt:generic/112 # possible regression
xfs_nocrc_512.txt:generic/127 # possible regression
xfs_nocrc_512.txt:generic/231 # possible regression
xfs_nocrc_512.txt:generic/263 # possible regression
xfs_nocrc_512.txt:generic/469 # possible regression
xfs_reflink_1024.txt:generic/457 # possible regression crash https://gist.github.com/mcgrof/f182b250a9d091f77dc85782a83224b3
xfs_rtdev.txt:generic/333 # might crash might be a regression, takes forever...

A billion fsx ops have passed with a 16k block size, and hundreds of
millions of fsx ops have so far also been successful against 32k and
64k block sizes with a 4k sector size.

To verify that larger IOs are being used, we have been using Daniel
Gomez's lbs-ctl tool, which uses eBPF to observe IO counts at the
block layer. That tool will soon be published.

For more details please refer to the kernel newbies page on LBS [8].

[1] https://lwn.net/Articles/231793/
[2] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/
[3] https://lore.kernel.org/linux-mm/20230308075952.GU2825702@dread.disaster.area/
[4] https://cdrdv2-public.intel.com/605724/Achieving_Optimal_Perf_IU_SSDs-338395-003US.pdf
[5] https://lwn.net/Articles/932900/
[6] https://lore.kernel.org/lkml/20230801172201.1923299-2-hch@lst.de/T/
[7] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/xfs/xfs.config
[8] https://kernelnewbies.org/KernelProjects/large-block-size

--
Regards,
Pankaj
Luis

Dave Chinner (1):
  xfs: expose block size in stat

Luis Chamberlain (12):
  filemap: set the order of the index in page_cache_delete_batch()
  filemap: align index to mapping_min_order in filemap_range_has_page()
  mm: call xas_set_order() in replace_page_cache_folio()
  filemap: align the index to mapping_min_order in __filemap_add_folio()
  filemap: align the index to mapping_min_order in
    filemap_get_folios_tag()
  filemap: align the index to mapping_min_order in filemap_get_pages()
  readahead: set file_ra_state->ra_pages to be at least
    mapping_min_order
  readahead: add folio with at least mapping_min_order in
    page_cache_ra_order
  readahead: set the minimum ra size in get_(init|next)_ra
  readahead: align ra start and size to mapping_min_order in
    ondemand_ra()
  truncate: align index to mapping_min_order
  mm: round down folio split requirements

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (9):
  pagemap: use mapping_min_order in fgf_set_order()
  filemap: add folio with at least mapping_min_order in
    __filemap_get_folio
  filemap: use mapping_min_order while allocating folios
  filemap: align the index to mapping_min_order in
    do_[a]sync_mmap_readahead
  filemap: align index to mapping_min_order in filemap_fault()
  readahead: allocate folios with mapping_min_order in ra_unbounded()
  readahead: align with mapping_min_order in force_page_cache_ra()
  xfs: enable block size larger than page size support
  xfs: set minimum order folio for page cache based on blocksize

 fs/iomap/buffered-io.c  |  2 +-
 fs/xfs/xfs_icache.c     |  8 +++-
 fs/xfs/xfs_iops.c       |  4 +-
 fs/xfs/xfs_mount.c      |  9 ++++-
 fs/xfs/xfs_super.c      |  7 +---
 include/linux/pagemap.h | 87 ++++++++++++++++++++++++++++++-----------
 mm/filemap.c            | 87 +++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c        | 14 +++++--
 mm/readahead.c          | 86 ++++++++++++++++++++++++++++++++++------
 mm/truncate.c           | 34 +++++++++++-----
 10 files changed, 263 insertions(+), 75 deletions(-)


base-commit: e143016b56ecb0fcda5bb6026b0a25fe55274f56
-- 
2.40.1



