* [PATCH v8 00/10] enable bs > ps in XFS
@ 2024-06-25 11:44 Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
` (9 more replies)
0 siblings, 10 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
This is the eighth version of the series that enables block size > page size
(Large Block Size) in XFS.
The context and motivation can be seen in the cover letter of RFC v1 [0].
We also recorded a talk about this effort at LPC [1] for anyone who would
like more context.
A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.
Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.
I also ran the split_huge_page_test selftest on an XFS filesystem to verify
that huge page splits into min-order chunks are done correctly.
No regressions were found with these patches applied on top.
Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes. To
compare with existing support, an ARM VM with a 64k base page size (without
our patches) was used as a reference to check for actual failures due to LBS
support on a 4k base page size system.
There are some tests that assume block size < page size and need to be fixed.
We have a tree with fixes for xfstests [4]; most of the changes have already
been posted, and only a few minor ones remain. Some of these changes have
already been upstreamed to fstests, and new tests have been written and are
out for review, namely for mmap zeroing-around corner cases, compaction and
fsstress races on mm, and stress testing folio truncation on file-mapped
folios.
No new failures were found with the LBS support.
We've done some preliminary performance tests with fio on XFS with a 4k
block size against pmem and NVMe, with buffered IO and Direct IO, comparing
vanilla kernels against these patches, and detected no regressions.
We also wrote an eBPF tool called blkalgn [5] to see if IO sent to the device
is aligned and at least filesystem block size in length.
For those who want this in a git tree, it is available in our kdevops linux
tree under the large-block-minorder-for-next-v8 tag [6].
[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung
tasks and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t
Changes since v7:
- Move by min_nrpages in page_cache_ra_unbounded if we found a folio as
we don't have a stable reference.
- Rename iomap_init to iomap_pagecache_init, add a new iomap_dio_init,
and mark the zero_page_64k memory as RO.
- Simplified calculation in xfs_sb_validate_fsb_count().
- Collected RVB from willy, Darrick and Hannes.
Dave Chinner (1):
xfs: use kvmalloc for xattr buffers
Luis Chamberlain (1):
mm: split a folio in minimum folio order chunks
Matthew Wilcox (Oracle) (1):
fs: Allow fine-grained control of folio sizes
Pankaj Raghav (7):
filemap: allocate mapping_min_order folios in the page cache
readahead: allocate folios with mapping_min_order in readahead
filemap: cap PTE range to be created to allowed zero fill in
folio_map_range()
iomap: fix iomap_dio_zero() for fs bs > system page size
xfs: expose block size in stat
xfs: make the calculation generic in xfs_sb_validate_fsb_count()
xfs: enable block size larger than page size support
fs/iomap/buffered-io.c | 4 +-
fs/iomap/direct-io.c | 30 +++++++++-
fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++---
fs/xfs/libxfs/xfs_ialloc.c | 5 ++
fs/xfs/libxfs/xfs_shared.h | 3 +
fs/xfs/xfs_icache.c | 6 +-
fs/xfs/xfs_iops.c | 2 +-
fs/xfs/xfs_mount.c | 8 ++-
fs/xfs/xfs_super.c | 18 +++---
include/linux/huge_mm.h | 14 +++--
include/linux/pagemap.h | 109 +++++++++++++++++++++++++++++-----
mm/filemap.c | 36 +++++++----
mm/huge_memory.c | 55 +++++++++++++++--
mm/readahead.c | 85 +++++++++++++++++++-------
14 files changed, 309 insertions(+), 81 deletions(-)
base-commit: b992b79ca8bc336fa8e2c80990b5af80ed8f36fd
--
2.44.1
* [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-04 12:23 ` Ryan Roberts
2024-07-09 16:29 ` Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
` (8 subsequent siblings)
9 siblings, 2 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
We need filesystems to be able to communicate acceptable folio sizes
to the pagecache for a variety of uses (e.g. large block sizes).
Support a range of folio sizes between order-0 and order-31.
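To make the encoding concrete, here is a rough userspace sketch of the bit
layout this patch introduces (mirroring the new AS_FOLIO_ORDER_* definitions;
illustration only, not kernel code):

#include <assert.h>
#include <stdio.h>

#define AS_FOLIO_ORDER_BITS     5
#define AS_FOLIO_ORDER_MIN      16
#define AS_FOLIO_ORDER_MAX      (AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS)
#define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)

int main(void)
{
        /* e.g. 64k min / 1M max folios on a 4k page system */
        unsigned long min = 4, max = 8, flags;

        flags = (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
        assert(((flags >> AS_FOLIO_ORDER_MIN) & AS_FOLIO_ORDER_MASK) == min);
        assert(((flags >> AS_FOLIO_ORDER_MAX) & AS_FOLIO_ORDER_MASK) == max);
        printf("flags = %#lx\n", flags);        /* 0x1040000 */
        return 0;
}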
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
include/linux/pagemap.h | 86 ++++++++++++++++++++++++++++++++++-------
mm/filemap.c | 6 +--
mm/readahead.c | 4 +-
3 files changed, 77 insertions(+), 19 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b71d581091f..0c51154cdb57 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -204,14 +204,21 @@ enum mapping_flags {
AS_EXITING = 4, /* final truncate in progress */
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
- AS_LARGE_FOLIO_SUPPORT = 6,
- AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */
- AS_STABLE_WRITES, /* must wait for writeback before modifying
+ AS_RELEASE_ALWAYS = 6, /* Call ->release_folio(), even if no private data */
+ AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
folio contents */
- AS_UNMOVABLE, /* The mapping cannot be moved, ever */
- AS_INACCESSIBLE, /* Do not attempt direct R/W access to the mapping */
+ AS_UNMOVABLE = 8, /* The mapping cannot be moved, ever */
+ AS_INACCESSIBLE = 9, /* Do not attempt direct R/W access to the mapping */
+ /* Bits 16-25 are used for FOLIO_ORDER */
+ AS_FOLIO_ORDER_BITS = 5,
+ AS_FOLIO_ORDER_MIN = 16,
+ AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
};
+#define AS_FOLIO_ORDER_MASK ((1u << AS_FOLIO_ORDER_BITS) - 1)
+#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
+#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+
/**
* mapping_set_error - record a writeback error in the address_space
* @mapping: the mapping in which an error should be set
@@ -360,9 +367,49 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
#define MAX_PAGECACHE_ORDER 8
#endif
+/*
+ * mapping_set_folio_order_range() - Set the orders supported by a file.
+ * @mapping: The address space of the file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file. This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+ unsigned int min,
+ unsigned int max)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return;
+
+ if (min > MAX_PAGECACHE_ORDER)
+ min = MAX_PAGECACHE_ORDER;
+ if (max > MAX_PAGECACHE_ORDER)
+ max = MAX_PAGECACHE_ORDER;
+ if (max < min)
+ max = min;
+
+	mapping->flags = (mapping->flags & ~(AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)) |
+			(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+ unsigned int min)
+{
+ mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
/**
* mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address space of the file.
*
* The filesystem should call this function in its inode constructor to
* indicate that the VFS can use large folios to cache the contents of
@@ -373,7 +420,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
*/
static inline void mapping_set_large_folios(struct address_space *mapping)
{
- __set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+ mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline
+unsigned int mapping_max_folio_order(const struct address_space *mapping)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 0;
+ return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline
+unsigned int mapping_min_folio_order(const struct address_space *mapping)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 0;
+ return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
}
/*
@@ -386,16 +449,13 @@ static inline bool mapping_large_folio_support(struct address_space *mapping)
VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
"Anonymous mapping always supports large folio");
- return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
- test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+ return mapping_max_folio_order(mapping) > 0;
}
/* Return the maximum folio size for this pagecache mapping, in bytes. */
-static inline size_t mapping_max_folio_size(struct address_space *mapping)
+static inline size_t mapping_max_folio_size(const struct address_space *mapping)
{
- if (mapping_large_folio_support(mapping))
- return PAGE_SIZE << MAX_PAGECACHE_ORDER;
- return PAGE_SIZE;
+ return PAGE_SIZE << mapping_max_folio_order(mapping);
}
static inline int filemap_nr_thps(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b8c732bb643..d617c9afca51 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1933,10 +1933,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
fgp_flags |= FGP_LOCK;
- if (!mapping_large_folio_support(mapping))
- order = 0;
- if (order > MAX_PAGECACHE_ORDER)
- order = MAX_PAGECACHE_ORDER;
+ if (order > mapping_max_folio_order(mapping))
+ order = mapping_max_folio_order(mapping);
/* If we're not aligned, allocate a smaller folio */
if (index & ((1UL << order) - 1))
order = __ffs(index);
diff --git a/mm/readahead.c b/mm/readahead.c
index c1b23989d9ca..66058ae02f2e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -503,9 +503,9 @@ void page_cache_ra_order(struct readahead_control *ractl,
limit = min(limit, index + ra->size - 1);
- if (new_order < MAX_PAGECACHE_ORDER) {
+ if (new_order < mapping_max_folio_order(mapping)) {
new_order += 2;
- new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
+ new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
}
--
2.44.1
* [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-06-25 15:52 ` Matthew Wilcox
2024-06-25 11:44 ` [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
` (7 subsequent siblings)
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would fall
back to an order-0 folio if the higher order memory allocation failed.
Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
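As a toy illustration of the alignment rule (plain userspace C; the kernel
helper added below does this with round_down()):

#include <stdio.h>

/* round an index down to a multiple of the min folio's page count */
static unsigned long align_index(unsigned long index, unsigned int min_order)
{
        unsigned long min_nrpages = 1UL << min_order;

        return index & ~(min_nrpages - 1);
}

int main(void)
{
        /* min order 2 => 4-page folios: index 7 must become 4 */
        printf("%lu\n", align_index(7, 2));
        return 0;
}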
Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
include/linux/pagemap.h | 23 ++++++++++++++++++++++-
mm/filemap.c | 24 ++++++++++++++++--------
2 files changed, 38 insertions(+), 9 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0c51154cdb57..7f1355abd8a2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -439,6 +439,27 @@ unsigned int mapping_min_folio_order(const struct address_space *mapping)
return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
}
+static inline
+unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
+{
+ return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_index() - Align index based on the min
+ * folio order of the page cache.
+ * @mapping: The address_space.
+ *
+ * The index of a folio must be naturally aligned. If you are adding a
+ * new folio to the page cache and need to know what index to give it,
+ * call this function.
+ */
+static inline pgoff_t mapping_align_index(struct address_space *mapping,
+ pgoff_t index)
+{
+ return round_down(index, mapping_min_folio_nrpages(mapping));
+}
+
/*
* Large folio support currently depends on THP. These dependencies are
* being worked on but are not yet fixed.
@@ -1165,7 +1186,7 @@ static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
void folio_wait_bit(struct folio *folio, int bit_nr);
int folio_wait_bit_killable(struct folio *folio, int bit_nr);
-/*
+/*
* Wait for a folio to be unlocked.
*
* This must be called with the caller "holding" the folio,
diff --git a/mm/filemap.c b/mm/filemap.c
index d617c9afca51..8eafbd4a4d0c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -859,6 +859,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+ VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+ folio);
mapping_set_update(&xas, mapping);
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
@@ -1919,8 +1921,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio_wait_stable(folio);
no_page:
if (!folio && (fgp_flags & FGP_CREAT)) {
- unsigned order = FGF_GET_ORDER(fgp_flags);
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
int err;
+ index = mapping_align_index(mapping, index);
if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
gfp |= __GFP_WRITE;
@@ -1943,7 +1947,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
gfp_t alloc_gfp = gfp;
err = -ENOMEM;
- if (order > 0)
+ if (order > min_order)
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
folio = filemap_alloc_folio(alloc_gfp, order);
if (!folio)
@@ -1958,7 +1962,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
break;
folio_put(folio);
folio = NULL;
- } while (order-- > 0);
+ } while (order-- > min_order);
if (err == -EEXIST)
goto repeat;
@@ -2451,13 +2455,15 @@ static int filemap_update_page(struct kiocb *iocb,
}
static int filemap_create_folio(struct file *file,
- struct address_space *mapping, pgoff_t index,
+ struct address_space *mapping, loff_t pos,
struct folio_batch *fbatch)
{
struct folio *folio;
int error;
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ pgoff_t index;
- folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+ folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
@@ -2475,6 +2481,7 @@ static int filemap_create_folio(struct file *file,
* well to keep locking rules simple.
*/
filemap_invalidate_lock_shared(mapping);
+ index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
error = filemap_add_folio(mapping, folio, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
@@ -2535,8 +2542,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
if (!folio_batch_count(fbatch)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
return -EAGAIN;
- err = filemap_create_folio(filp, mapping,
- iocb->ki_pos >> PAGE_SHIFT, fbatch);
+ err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
@@ -3752,9 +3758,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
repeat:
folio = filemap_get_folio(mapping, index);
if (IS_ERR(folio)) {
- folio = filemap_alloc_folio(gfp, 0);
+ folio = filemap_alloc_folio(gfp,
+ mapping_min_folio_order(mapping));
if (!folio)
return ERR_PTR(-ENOMEM);
+ index = mapping_align_index(mapping, index);
err = filemap_add_folio(mapping, folio, index, gfp);
if (unlikely(err)) {
folio_put(folio);
--
2.44.1
* [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-02 19:38 ` Darrick J. Wong
2024-07-04 14:24 ` Ryan Roberts
2024-06-25 11:44 ` [PATCH v8 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
` (6 subsequent siblings)
9 siblings, 2 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
page_cache_ra_unbounded() was allocating single pages (order-0 folios)
if no folio was found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.
While we are at it, rework the loop in page_cache_ra_unbounded() to
advance with the number of pages in a folio instead of just one page at
a time.
page_cache_ra_order() tries to allocate folios for the page cache with a
higher order if the index aligns with that order. Modify it so that the
order does not go below the mapping_min_order requirement of the page
cache. This function will do the right thing even if the new_order passed
is less than the mapping_min_order.
When adding new folios to the page cache we must also ensure the index
used is aligned to the mapping_min_order as the page cache requires the
index to be aligned to the order of the folio.
readahead_expand() is called from readahead aops to extend the range of
the readahead so this function can assume ractl->_index to be aligned with
min_order.
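A simplified userspace model of the reworked iteration (illustration only;
the real loop also handles folios already present in the xarray and
allocation failures):

#include <stdio.h>

int main(void)
{
        /* with min order 2, each folio covers 4 pages, so the
         * iterator advances 4 indexes per allocation instead of 1 */
        unsigned long nr_to_read = 16, min_nrpages = 4;
        unsigned long i, folios = 0;

        for (i = 0; i < nr_to_read; i += min_nrpages)
                folios++;       /* one min-order folio per step */
        printf("%lu folios cover %lu pages\n", folios, nr_to_read);
        return 0;
}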
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Co-developed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
mm/readahead.c | 81 +++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 63 insertions(+), 18 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 66058ae02f2e..2acfd6447d7b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
unsigned long nr_to_read, unsigned long lookahead_size)
{
struct address_space *mapping = ractl->mapping;
- unsigned long index = readahead_index(ractl);
+ unsigned long ra_folio_index, index = readahead_index(ractl);
gfp_t gfp_mask = readahead_gfp_mask(mapping);
- unsigned long i;
+ unsigned long mark, i = 0;
+ unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
/*
* Partway through the readahead operation, we will have added
@@ -223,10 +224,26 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
unsigned int nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
+ index = mapping_align_index(mapping, index);
+
+ /*
+ * As iterator `i` is aligned to min_nrpages, round_up the
+ * difference between nr_to_read and lookahead_size to mark the
+ * index that only has lookahead or "async_region" to set the
+ * readahead flag.
+ */
+ ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+ min_nrpages);
+ mark = ra_folio_index - index;
+ if (index != readahead_index(ractl)) {
+ nr_to_read += readahead_index(ractl) - index;
+ ractl->_index = index;
+ }
+
/*
* Preallocate as many pages as we will need.
*/
- for (i = 0; i < nr_to_read; i++) {
+ while (i < nr_to_read) {
struct folio *folio = xa_load(&mapping->i_pages, index + i);
int ret;
@@ -240,12 +257,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
* not worth getting one just for that.
*/
read_pages(ractl);
- ractl->_index++;
- i = ractl->_index + ractl->_nr_pages - index - 1;
+ ractl->_index += min_nrpages;
+ i = ractl->_index + ractl->_nr_pages - index;
continue;
}
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask,
+ mapping_min_folio_order(mapping));
if (!folio)
break;
@@ -255,14 +273,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
if (ret == -ENOMEM)
break;
read_pages(ractl);
- ractl->_index++;
- i = ractl->_index + ractl->_nr_pages - index - 1;
+ ractl->_index += min_nrpages;
+ i = ractl->_index + ractl->_nr_pages - index;
continue;
}
- if (i == nr_to_read - lookahead_size)
+ if (i == mark)
folio_set_readahead(folio);
ractl->_workingset |= folio_test_workingset(folio);
- ractl->_nr_pages++;
+ ractl->_nr_pages += min_nrpages;
+ i += min_nrpages;
}
/*
@@ -492,13 +511,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
{
struct address_space *mapping = ractl->mapping;
pgoff_t index = readahead_index(ractl);
+ unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
pgoff_t mark = index + ra->size - ra->async_size;
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
+ unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
- if (!mapping_large_folio_support(mapping) || ra->size < 4)
+ /*
+ * Fallback when size < min_nrpages as each folio should be
+ * at least min_nrpages anyway.
+ */
+ if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
goto fallback;
limit = min(limit, index + ra->size - 1);
@@ -507,11 +532,20 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order += 2;
new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+ new_order = max(new_order, min_order);
}
/* See comment in page_cache_ra_unbounded() */
nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
+ /*
+ * If the new_order is greater than min_order and index is
+ * already aligned to new_order, then this will be noop as index
+ * aligned to new_order should also be aligned to min_order.
+ */
+ ractl->_index = mapping_align_index(mapping, index);
+ index = readahead_index(ractl);
+
while (index <= limit) {
unsigned int order = new_order;
@@ -519,7 +553,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
if (index & ((1UL << order) - 1))
order = __ffs(index);
/* Don't allocate pages past EOF */
- while (index + (1UL << order) - 1 > limit)
+ while (order > min_order && index + (1UL << order) - 1 > limit)
order--;
err = ra_alloc_folio(ractl, index, mark, order, gfp);
if (err)
@@ -783,8 +817,15 @@ void readahead_expand(struct readahead_control *ractl,
struct file_ra_state *ra = ractl->ra;
pgoff_t new_index, new_nr_pages;
gfp_t gfp_mask = readahead_gfp_mask(mapping);
+ unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+ unsigned int min_order = mapping_min_folio_order(mapping);
new_index = new_start / PAGE_SIZE;
+ /*
+ * Readahead code should have aligned the ractl->_index to
+ * min_nrpages before calling readahead aops.
+ */
+ VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
/* Expand the leading edge downwards */
while (ractl->_index > new_index) {
@@ -794,9 +835,11 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask, min_order);
if (!folio)
return;
+
+ index = mapping_align_index(mapping, index);
if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
folio_put(folio);
return;
@@ -806,7 +849,7 @@ void readahead_expand(struct readahead_control *ractl,
ractl->_workingset = true;
psi_memstall_enter(&ractl->_pflags);
}
- ractl->_nr_pages++;
+ ractl->_nr_pages += min_nrpages;
ractl->_index = folio->index;
}
@@ -821,9 +864,11 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask, min_order);
if (!folio)
return;
+
+ index = mapping_align_index(mapping, index);
if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
folio_put(folio);
return;
@@ -833,10 +878,10 @@ void readahead_expand(struct readahead_control *ractl,
ractl->_workingset = true;
psi_memstall_enter(&ractl->_pflags);
}
- ractl->_nr_pages++;
+ ractl->_nr_pages += min_nrpages;
if (ra) {
- ra->size++;
- ra->async_size++;
+ ra->size += min_nrpages;
+ ra->async_size += min_nrpages;
}
}
}
--
2.44.1
* [PATCH v8 04/10] mm: split a folio in minimum folio order chunks
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (2 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-06-25 14:45 ` Zi Yan
2024-06-25 11:44 ` [PATCH v8 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
` (5 subsequent siblings)
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Luis Chamberlain <mcgrof@kernel.org>
split_folio() and split_folio_to_list() assume order 0; to support
minorder for non-anonymous folios, we must expand these to check the
folio mapping order and use that.
Set new_order to be at least minimum folio order if it is set in
split_huge_page_to_list() so that we can maintain minimum folio order
requirement in the page cache.
Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.
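A toy model of the clamping rule (userspace C; split_target_order() is a
made-up name for illustration, the kernel change below implements this in
split_folio_to_list() and the debugfs paths):

#include <stdbool.h>
#include <stdio.h>

/* a file-backed folio must never be split below the mapping's
 * minimum order, so clamp the requested target order */
static unsigned int split_target_order(bool anon, unsigned int new_order,
                                       unsigned int mapping_min_order)
{
        if (anon)
                return new_order;
        return new_order > mapping_min_order ? new_order : mapping_min_order;
}

int main(void)
{
        /* splitting to order 0 on a min-order-4 mapping yields order 4 */
        printf("%u\n", split_target_order(false, 0, 4));
        return 0;
}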
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
There was a discussion about whether truncation of a folio should be
considered a split failure or not [1]. The new code retains the existing
behaviour of returning a failure if the folio was truncated. I think we
need a separate discussion on whether or not to consider it a failure.
include/linux/huge_mm.h | 14 ++++++++---
mm/huge_memory.c | 55 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 212cca384d7e..70d80d17c3ff 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -90,6 +90,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
+#define split_folio(f) split_folio_to_list(f, NULL)
+
#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -320,9 +322,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
bool can_split_folio(struct folio *folio, int *pextra_pins);
int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
static inline int split_huge_page(struct page *page)
{
- return split_huge_page_to_list_to_order(page, NULL, 0);
+ return split_folio(page_folio(page));
}
void deferred_split_folio(struct folio *folio);
@@ -487,6 +490,12 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+ return 0;
+}
+
static inline void deferred_split_folio(struct folio *folio) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
@@ -601,7 +610,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
return split_folio_to_list_to_order(folio, NULL, new_order);
}
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0fffaa58a47a..51fda5f9ac90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3065,6 +3065,9 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
* released, or if some unexpected race happened (e.g., anon VMA disappeared,
* truncation).
*
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
* Returns -EINVAL when trying to split to an order that is incompatible
* with the folio. Splitting to order 0 is compatible with all folios.
*/
@@ -3146,6 +3149,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
mapping = NULL;
anon_vma_lock_write(anon_vma);
} else {
+ unsigned int min_order;
gfp_t gfp;
mapping = folio->mapping;
@@ -3156,6 +3160,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
goto out;
}
+ min_order = mapping_min_folio_order(folio->mapping);
+ if (new_order < min_order) {
+ VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+ min_order);
+ ret = -EINVAL;
+ goto out;
+ }
+
gfp = current_gfp_context(mapping_gfp_mask(mapping) &
GFP_RECLAIM_MASK);
@@ -3267,6 +3279,21 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
return ret;
}
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+ unsigned int min_order = 0;
+
+ if (!folio_test_anon(folio)) {
+ if (!folio->mapping) {
+ count_vm_event(THP_SPLIT_PAGE_FAILED);
+ return -EBUSY;
+ }
+ min_order = mapping_min_folio_order(folio->mapping);
+ }
+
+ return split_huge_page_to_list_to_order(&folio->page, list, min_order);
+}
+
void __folio_undo_large_rmappable(struct folio *folio)
{
struct deferred_split *ds_queue;
@@ -3496,6 +3523,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
struct vm_area_struct *vma = vma_lookup(mm, addr);
struct page *page;
struct folio *folio;
+ struct address_space *mapping;
+ unsigned int target_order = new_order;
if (!vma)
break;
@@ -3516,7 +3545,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
if (!is_transparent_hugepage(folio))
goto next;
- if (new_order >= folio_order(folio))
+ if (!folio_test_anon(folio)) {
+ mapping = folio->mapping;
+ target_order = max(new_order,
+ mapping_min_folio_order(mapping));
+ }
+
+ if (target_order >= folio_order(folio))
goto next;
total++;
@@ -3532,9 +3567,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
if (!folio_trylock(folio))
goto next;
- if (!split_folio_to_order(folio, new_order))
+ if (!folio_test_anon(folio) && folio->mapping != mapping)
+ goto unlock;
+
+ if (!split_folio_to_order(folio, target_order))
split++;
+unlock:
folio_unlock(folio);
next:
folio_put(folio);
@@ -3559,6 +3598,7 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
pgoff_t index;
int nr_pages = 1;
unsigned long total = 0, split = 0;
+ unsigned int min_order;
file = getname_kernel(file_path);
if (IS_ERR(file))
@@ -3572,9 +3612,11 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
file_path, off_start, off_end);
mapping = candidate->f_mapping;
+ min_order = mapping_min_folio_order(mapping);
for (index = off_start; index < off_end; index += nr_pages) {
struct folio *folio = filemap_get_folio(mapping, index);
+ unsigned int target_order = new_order;
nr_pages = 1;
if (IS_ERR(folio))
@@ -3583,18 +3625,23 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
if (!folio_test_large(folio))
goto next;
+ target_order = max(new_order, min_order);
total++;
nr_pages = folio_nr_pages(folio);
- if (new_order >= folio_order(folio))
+ if (target_order >= folio_order(folio))
goto next;
if (!folio_trylock(folio))
goto next;
- if (!split_folio_to_order(folio, new_order))
+ if (folio->mapping != mapping)
+ goto unlock;
+
+ if (!split_folio_to_order(folio, target_order))
split++;
+unlock:
folio_unlock(folio);
next:
folio_put(folio);
--
2.44.1
* [PATCH v8 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (3 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-01 23:39 ` Darrick J. Wong
2024-06-25 11:44 ` [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
` (4 subsequent siblings)
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
Usually the page cache does not extend beyond the size of the inode, so
no PTEs are created for folios that extend beyond that size.
But with LBS support, we might extend the page cache beyond the size of
the inode, as we need to guarantee folios of minimum order. While doing a
read, do_fault_around() can create PTEs for pages that lie beyond the
EOF, leading to an incorrect error return when accessing a page beyond the
mapped file.
Cap the PTE range to be created for the page cache to the end of
file (EOF) in filemap_map_pages() so that the returned error codes are
consistent with POSIX [1] for LBS configurations.
generic/749 (currently in the xfstests-dev patches-in-queue branch [0]) has
been created to trigger this edge case. This also fixes generic/749 for
tmpfs with huge=always on systems with a 4k base page size.
[0] https://lore.kernel.org/all/20240615002935.1033031-3-mcgrof@kernel.org/
[1](from mmap(2)) SIGBUS
Attempted access to a page of the buffer that lies beyond the end
of the mapped file. For an explanation of the treatment of the
bytes in the page that corresponds to the end of a mapped file
that is not a multiple of the page size, see NOTES.
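A userspace sketch of the capping arithmetic (4k pages assumed; the kernel
change below does the same using i_size_read()):

#include <stdio.h>

#define PAGE_SIZE               4096UL
#define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))

int main(void)
{
        /* a 10000-byte file: the last page that may get a PTE is
         * index 2, even if the min-order folio extends further */
        unsigned long i_size = 10000;
        unsigned long end_pgoff = 7;    /* fault-around wanted up to here */
        unsigned long file_end = DIV_ROUND_UP(i_size, PAGE_SIZE) - 1;

        if (end_pgoff > file_end)
                end_pgoff = file_end;
        printf("end_pgoff = %lu\n", end_pgoff); /* prints 2 */
        return 0;
}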
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/filemap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8eafbd4a4d0c..56ff1d936aa8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3612,7 +3612,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct vm_area_struct *vma = vmf->vma;
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
- pgoff_t last_pgoff = start_pgoff;
+ pgoff_t file_end, last_pgoff = start_pgoff;
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct folio *folio;
@@ -3638,6 +3638,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
goto out;
}
+ file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+ if (end_pgoff > file_end)
+ end_pgoff = file_end;
+
folio_type = mm_counter_file(folio);
do {
unsigned long end;
--
2.44.1
* [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (4 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-01 2:37 ` Dave Chinner
` (2 more replies)
2024-06-25 11:44 ` [PATCH v8 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
` (3 subsequent siblings)
9 siblings, 3 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
iomap_dio_zero() will pad an fs block with zeroes if the direct IO size
is smaller than the fs block size. iomap_dio_zero() has an implicit
assumption that fs block size <= page size. This is true for most
filesystems at the moment.
If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.
iomap is generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
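A quick sketch of the arithmetic that breaks the old single-zero-page
assumption (userspace C; a 16k block size on 4k pages is assumed):

#include <stdio.h>

int main(void)
{
        /* a 4k DIO write at offset 4096 of a 16k fs block needs 4k of
         * leading and 8k of trailing zeroes within its block */
        unsigned long blocksize = 16384, pos = 4096, len = 4096;
        unsigned long head = pos % blocksize;
        unsigned long tail = blocksize - head - len;

        printf("zero %lu bytes before and %lu after the write\n",
               head, tail);     /* 4096 and 8192: tail > PAGE_SIZE */
        return 0;
}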
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
fs/iomap/buffered-io.c | 4 ++--
fs/iomap/direct-io.c | 30 ++++++++++++++++++++++++++++--
2 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86ac..9a9e94c7ed1d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -2007,10 +2007,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
}
EXPORT_SYMBOL_GPL(iomap_writepages);
-static int __init iomap_init(void)
+static int __init iomap_pagecache_init(void)
{
return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
offsetof(struct iomap_ioend, io_bio),
BIOSET_NEED_BVECS);
}
-fs_initcall(iomap_init);
+fs_initcall(iomap_pagecache_init);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..61d09d2364f7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -11,6 +11,7 @@
#include <linux/iomap.h>
#include <linux/backing-dev.h>
#include <linux/uio.h>
+#include <linux/set_memory.h>
#include <linux/task_io_accounting_ops.h>
#include "trace.h"
@@ -27,6 +28,13 @@
#define IOMAP_DIO_WRITE (1U << 30)
#define IOMAP_DIO_DIRTY (1U << 31)
+/*
+ * Used for sub block zeroing in iomap_dio_zero()
+ */
+#define ZERO_PAGE_64K_SIZE (65536)
+#define ZERO_PAGE_64K_ORDER (get_order(ZERO_PAGE_64K_SIZE))
+static struct page *zero_page_64k;
+
struct iomap_dio {
struct kiocb *iocb;
const struct iomap_dio_ops *dops;
@@ -236,9 +244,13 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
loff_t pos, unsigned len)
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
- struct page *page = ZERO_PAGE(0);
struct bio *bio;
+ /*
+ * Max block size supported is 64k
+ */
+ WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
+
bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
@@ -246,7 +258,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ __bio_add_page(bio, zero_page_64k, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
@@ -753,3 +765,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+ zero_page_64k = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ ZERO_PAGE_64K_ORDER);
+
+ if (!zero_page_64k)
+ return -ENOMEM;
+
+ set_memory_ro((unsigned long)page_address(zero_page_64k),
+ 1U << ZERO_PAGE_64K_ORDER);
+ return 0;
+}
+fs_initcall(iomap_dio_init);
--
2.44.1
* [PATCH v8 07/10] xfs: use kvmalloc for xattr buffers
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (5 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-06-25 18:07 ` Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
` (2 subsequent siblings)
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan,
Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:
static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
struct page *page;
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...
Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().
This is not an issue that production systems will see, as
filesystems with block size > page size cannot currently be mounted by the
kernel; Pankaj is developing this functionality right now.
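A small userspace illustration of why these allocations trip the warning
(order_for() is a stand-in for the kernel's get_order()):

#include <stdio.h>

/* allocation order needed to hold one fs block of blksize bytes */
static int order_for(unsigned long blksize, unsigned long page_size)
{
        int order = 0;

        while ((page_size << order) < blksize)
                order++;
        return order;
}

int main(void)
{
        /* a 64k xattr buffer on 4k pages is an order-4 request, far
         * above the order-1 limit that __GFP_NOFAIL kmalloc() warns on */
        printf("order = %d\n", order_for(65536, 4096));
        return 0;
}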
Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b9e98950eb3d..09f4cb061a6e 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1138,10 +1138,7 @@ xfs_attr3_leaf_to_shortform(
trace_xfs_attr_leaf_to_sf(args);
- tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
- if (!tmpbuffer)
- return -ENOMEM;
-
+ tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1205,7 +1202,7 @@ xfs_attr3_leaf_to_shortform(
error = 0;
out:
- kfree(tmpbuffer);
+ kvfree(tmpbuffer);
return error;
}
@@ -1613,7 +1610,7 @@ xfs_attr3_leaf_compact(
trace_xfs_attr_leaf_compact(args);
- tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+ tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
memset(bp->b_addr, 0, args->geo->blksize);
leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1651,7 +1648,7 @@ xfs_attr3_leaf_compact(
*/
xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
- kfree(tmpbuffer);
+ kvfree(tmpbuffer);
}
/*
@@ -2330,7 +2327,7 @@ xfs_attr3_leaf_unbalance(
struct xfs_attr_leafblock *tmp_leaf;
struct xfs_attr3_icleaf_hdr tmphdr;
- tmp_leaf = kzalloc(state->args->geo->blksize,
+ tmp_leaf = kvzalloc(state->args->geo->blksize,
GFP_KERNEL | __GFP_NOFAIL);
/*
@@ -2371,7 +2368,7 @@ xfs_attr3_leaf_unbalance(
}
memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
savehdr = tmphdr; /* struct copy */
- kfree(tmp_leaf);
+ kvfree(tmp_leaf);
}
xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
--
2.44.1
* [PATCH v8 08/10] xfs: expose block size in stat
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (6 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-01 2:33 ` Dave Chinner
2024-06-25 11:44 ` [PATCH v8 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.
This change is based on a patch originally from Dave Chinner.[1]
[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
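The visible effect can be checked with a plain stat(2) call (the 16384
figure assumes a 16k-block filesystem):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct stat st;

        if (argc < 2 || stat(argv[1], &st) != 0)
                return 1;
        /* on a 16k-block XFS this now reports 16384, not PAGE_SIZE,
         * so tools like fsx pick block-aligned operation ranges */
        printf("st_blksize = %ld\n", (long)st.st_blksize);
        return 0;
}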
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_iops.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a00dcbc77e12..da5c13150315 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -562,7 +562,7 @@ xfs_stat_blksize(
return 1U << mp->m_allocsize_log;
}
- return PAGE_SIZE;
+ return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
}
STATIC int
--
2.44.1
* [PATCH v8 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (7 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-01 2:34 ` Dave Chinner
2024-06-25 11:44 ` [PATCH v8 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that the page cache count can be
calculated correctly for LBS.
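A userspace model of the overflow-safe form used below (shl_overflows()
is a stand-in for the kernel's check_shl_overflow()):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool shl_overflows(uint64_t v, unsigned int shift, uint64_t *res)
{
        if (shift >= 64) {
                *res = 0;
                return v != 0;
        }
        *res = v << shift;
        return (*res >> shift) != v;
}

int main(void)
{
        uint64_t bytes;

        /* 2^50 blocks with a 64k block size (blocklog 16) would be a
         * 2^66-byte filesystem: reject it with -EFBIG instead of
         * shifting the block count by a negative amount */
        if (shl_overflows(1ULL << 50, 16, &bytes))
                printf("-EFBIG\n");
        return 0;
}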
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_mount.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 09eef1721ef4..3949f720b535 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,11 +132,16 @@ xfs_sb_validate_fsb_count(
xfs_sb_t *sbp,
uint64_t nblocks)
{
+ uint64_t max_bytes;
+
ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
ASSERT(sbp->sb_blocklog >= BBSHIFT);
+ if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+ return -EFBIG;
+
/* Limited by ULONG_MAX of page cache index */
- if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+ if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
return -EFBIG;
return 0;
}
--
2.44.1
* [PATCH v8 10/10] xfs: enable block size larger than page size support
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
` (8 preceding siblings ...)
2024-06-25 11:44 ` [PATCH v8 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-06-25 11:44 ` Pankaj Raghav (Samsung)
2024-07-01 2:34 ` Dave Chinner
9 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 11:44 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, kernel, hch, Zi Yan
From: Pankaj Raghav <p.raghav@samsung.com>
The page cache now has the ability to enforce a minimum order when
allocating folios, which is a prerequisite for adding support for
block size > page size.
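For illustration, the minimum order computed at mount time amounts to the
following (userspace sketch, 4k base page assumed):

#include <stdio.h>

#define PAGE_SHIFT      12      /* 4k base pages */

int main(void)
{
        /* same computation as xfs_ialloc_setup_geometry() below:
         * a 16k block (blocklog 14) needs order-2 folios at minimum */
        unsigned int sb_blocklog = 14;
        unsigned int min_folio_order =
                sb_blocklog > PAGE_SHIFT ? sb_blocklog - PAGE_SHIFT : 0;

        printf("min_folio_order = %u\n", min_folio_order);
        return 0;
}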
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
@hch and @Dave: I have retained min_folio_order in the inode geometry
struct as the discussion about moving it to xfs_mount is still open.
fs/xfs/libxfs/xfs_ialloc.c | 5 +++++
fs/xfs/libxfs/xfs_shared.h | 3 +++
fs/xfs/xfs_icache.c | 6 ++++--
fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_super.c | 18 ++++++++++--------
5 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..1e76431d75a4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
igeo->ialloc_align = mp->m_dalign;
else
igeo->ialloc_align = 0;
+
+ if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+ igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+ else
+ igeo->min_folio_order = 0;
}
/* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 34f104ed372c..e67a1c7cc0b0 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -231,6 +231,9 @@ struct xfs_ino_geometry {
/* precomputed value for di_flags2 */
uint64_t new_diflags2;
+ /* minimum folio order of a page cache allocation */
+ unsigned int min_folio_order;
+
};
#endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 088ac200b026..e0f911f326e6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,7 +88,8 @@ xfs_inode_alloc(
/* VFS doesn't initialise i_mode! */
VFS_I(ip)->i_mode = 0;
- mapping_set_large_folios(VFS_I(ip)->i_mapping);
+ mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+ M_IGEO(mp)->min_folio_order);
XFS_STATS_INC(mp, vn_active);
ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -325,7 +326,8 @@ xfs_reinit_inode(
inode->i_uid = uid;
inode->i_gid = gid;
inode->i_state = state;
- mapping_set_large_folios(inode->i_mapping);
+ mapping_set_folio_min_order(inode->i_mapping,
+ M_IGEO(mp)->min_folio_order);
return error;
}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3949f720b535..c6933440f806 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -134,7 +134,6 @@ xfs_sb_validate_fsb_count(
{
uint64_t max_bytes;
- ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
ASSERT(sbp->sb_blocklog >= BBSHIFT);
if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..b8a93a8f35ca 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,18 @@ xfs_fs_fill_super(
goto out_free_sb;
}
- /*
- * Until this is fixed only page-sized or smaller data blocks work.
- */
if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
- xfs_warn(mp,
- "File system with blocksize %d bytes. "
- "Only pagesize (%ld) or less will currently work.",
+ if (!xfs_has_crc(mp)) {
+ xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
mp->m_sb.sb_blocksize, PAGE_SIZE);
- error = -ENOSYS;
- goto out_free_sb;
+ error = -ENOSYS;
+ goto out_free_sb;
+ }
+
+ xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+ mp->m_sb.sb_blocksize);
}
/* Ensure this filesystem fits in the page cache limits */
--
2.44.1
* Re: [PATCH v8 04/10] mm: split a folio in minimum folio order chunks
2024-06-25 11:44 ` [PATCH v8 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-06-25 14:45 ` Zi Yan
2024-06-25 17:20 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 61+ messages in thread
From: Zi Yan @ 2024-06-25 14:45 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), david, willy, chandan.babu, djwong,
brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue Jun 25, 2024 at 7:44 AM EDT, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> split_folio() and split_folio_to_list() assume order 0, to support
> minorder for non-anonymous folios, we must expand these to check the
> folio mapping order and use that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
> There was a discussion about whether we need to consider truncation of
> folio to be considered a split failure or not [1]. The new code has
> retained the existing behaviour of returning a failure if the folio was
> truncated. I think we need to have a separate discussion whethere or not
> to consider it as a failure.
<snip>
>
> +int split_folio_to_list(struct folio *folio, struct list_head *list)
> +{
> + unsigned int min_order = 0;
> +
> + if (!folio_test_anon(folio)) {
> + if (!folio->mapping) {
> + count_vm_event(THP_SPLIT_PAGE_FAILED);
Regardless of whether this folio split is from a truncation or not, you
should not count every folio split failure as THP_SPLIT_PAGE_FAILED,
since not every folio is a THP. You need to do:
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_SPLIT_PAGE_FAILED);
See commit 835c3a25aa37 ("mm: huge_memory: add the missing
folio_test_pmd_mappable() for THP split statistics")
> + return -EBUSY;
> + }
> + min_order = mapping_min_folio_order(folio->mapping);
> + }
> +
> + return split_huge_page_to_list_to_order(&folio->page, list, min_order);
> +}
> +
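Folding that suggestion into the quoted helper gives roughly the
following sketch (illustrative only; the final patch may differ in
detail):

int split_folio_to_list(struct folio *folio, struct list_head *list)
{
	unsigned int min_order = 0;

	if (!folio_test_anon(folio)) {
		if (!folio->mapping) {
			/* Only failed splits of PMD-mappable THPs
			 * should bump the THP split counter. */
			if (folio_test_pmd_mappable(folio))
				count_vm_event(THP_SPLIT_PAGE_FAILED);
			return -EBUSY;
		}
		min_order = mapping_min_folio_order(folio->mapping);
	}

	return split_huge_page_to_list_to_order(&folio->page, list, min_order);
}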
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache
2024-06-25 11:44 ` [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-06-25 15:52 ` Matthew Wilcox
2024-06-25 18:06 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 61+ messages in thread
From: Matthew Wilcox @ 2024-06-25 15:52 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:12AM +0000, Pankaj Raghav (Samsung) wrote:
> Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
after fixing the nits below
> +/**
> + * mapping_align_index() - Align index based on the min
> + * folio order of the page cache.
+ * mapping_align_index - Align index for this mapping.
> @@ -1165,7 +1186,7 @@ static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
> void folio_wait_bit(struct folio *folio, int bit_nr);
> int folio_wait_bit_killable(struct folio *folio, int bit_nr);
>
> -/*
> +/*
> * Wait for a folio to be unlocked.
> *
> * This must be called with the caller "holding" the folio,
Unnecessary whitespace change
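For reference, mapping_align_index() being documented here rounds an
index down to the mapping's minimum folio order; a minimal sketch of
its shape, assuming the mapping_min_folio_nrpages() accessor from this
series (the actual patch may differ in detail):

static inline pgoff_t mapping_align_index(struct address_space *mapping,
					  pgoff_t index)
{
	/* Folio indices in the page cache must be naturally aligned to
	 * the folio order, so round down to the minimum folio size. */
	return round_down(index, mapping_min_folio_nrpages(mapping));
}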
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 04/10] mm: split a folio in minimum folio order chunks
2024-06-25 14:45 ` Zi Yan
@ 2024-06-25 17:20 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 17:20 UTC (permalink / raw)
To: Zi Yan
Cc: david, willy, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 10:45:09AM -0400, Zi Yan wrote:
> On Tue Jun 25, 2024 at 7:44 AM EDT, Pankaj Raghav (Samsung) wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > split_folio() and split_folio_to_list() assume order 0; to support
> > minorder for non-anonymous folios, we must expand these to check the
> > folio mapping order and use that.
> >
> > Set new_order to be at least the minimum folio order if it is set in
> > split_huge_page_to_list() so that we can maintain the minimum folio
> > order requirement in the page cache.
> >
> > Update the debugfs write files used for testing to ensure the order
> > is respected as well. We simply enforce the min order when a file
> > mapping is used.
> >
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > ---
> > There was a discussion about whether the truncation of a folio should
> > be considered a split failure or not [1]. The new code has retained the
> > existing behaviour of returning a failure if the folio was truncated.
> > I think we need to have a separate discussion on whether or not to
> > consider it a failure.
>
> <snip>
>
> >
> > +int split_folio_to_list(struct folio *folio, struct list_head *list)
> > +{
> > + unsigned int min_order = 0;
> > +
> > + if (!folio_test_anon(folio)) {
> > + if (!folio->mapping) {
> > + count_vm_event(THP_SPLIT_PAGE_FAILED);
>
> Regardless of whether this folio split is from a truncation or not, you
> should not count every folio split as a THP_SPLIT_PAGE_FAILED, since not
> every folio is a THP. You need to do:
>
> if (folio_test_pmd_mappable(folio))
> count_vm_event(THP_SPLIT_PAGE_FAILED);
>
> See commit 835c3a25aa37 ("mm: huge_memory: add the missing
> folio_test_pmd_mappable() for THP split statistics")
You are right, I will change that. I didn't notice this commit.
>
> > + return -EBUSY;
> > + }
> > + min_order = mapping_min_folio_order(folio->mapping);
> > + }
> > +
> > + return split_huge_page_to_list_to_order(&folio->page, list, min_order);
> > +}
> > +
>
> --
> Best Regards,
> Yan, Zi
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache
2024-06-25 15:52 ` Matthew Wilcox
@ 2024-06-25 18:06 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 18:06 UTC (permalink / raw)
To: Matthew Wilcox
Cc: david, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 04:52:03PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 25, 2024 at 11:44:12AM +0000, Pankaj Raghav (Samsung) wrote:
> > Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Thanks!
>
> after fixing the nits below
>
> > +/**
> > + * mapping_align_index() - Align index based on the min
> > + * folio order of the page cache.
>
> + * mapping_align_index - Align index for this mapping.
>
> > @@ -1165,7 +1186,7 @@ static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
> > void folio_wait_bit(struct folio *folio, int bit_nr);
> > int folio_wait_bit_killable(struct folio *folio, int bit_nr);
> >
> > -/*
> > +/*
> > * Wait for a folio to be unlocked.
> > *
> > * This must be called with the caller "holding" the folio,
>
> Unnecessary whitespace change
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 07/10] xfs: use kvmalloc for xattr buffers
2024-06-25 11:44 ` [PATCH v8 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-06-25 18:07 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-06-25 18:07 UTC (permalink / raw)
To: david, willy, chandan.babu, djwong, brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan,
Dave Chinner
On Tue, Jun 25, 2024 at 11:44:17AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Pankaj Raghav reported that when filesystem block size is larger
> than page size, the xattr code can use kmalloc() for high order
> allocations. This triggers a useless warning in the allocator as it
> is a __GFP_NOFAIL allocation here:
>
> static inline
> struct page *rmqueue(struct zone *preferred_zone,
> struct zone *zone, unsigned int order,
> gfp_t gfp_flags, unsigned int alloc_flags,
> int migratetype)
> {
> struct page *page;
>
> /*
> * We most definitely don't want callers attempting to
> * allocate greater than order-1 page units with __GFP_NOFAIL.
> */
> >>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> ...
>
> Fix this by changing all these call sites to use kvmalloc(), which
> will strip the NOFAIL from the kmalloc attempt and if that fails
> will do a __GFP_NOFAIL vmalloc().
>
> This is not an issue that production systems will see, as
> filesystems with block size > page size cannot be mounted by the
> kernel; Pankaj is developing this functionality right now.
>
> Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
> Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
@Chandan: As we discussed today, do you want to pick this up for 6.11?
Then I can drop it from my patch series.
> ---
> fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index b9e98950eb3d..09f4cb061a6e 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -1138,10 +1138,7 @@ xfs_attr3_leaf_to_shortform(
>
> trace_xfs_attr_leaf_to_sf(args);
>
> - tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> - if (!tmpbuffer)
> - return -ENOMEM;
> -
> + tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
>
> leaf = (xfs_attr_leafblock_t *)tmpbuffer;
> @@ -1205,7 +1202,7 @@ xfs_attr3_leaf_to_shortform(
> error = 0;
>
> out:
> - kfree(tmpbuffer);
> + kvfree(tmpbuffer);
> return error;
> }
>
> @@ -1613,7 +1610,7 @@ xfs_attr3_leaf_compact(
>
> trace_xfs_attr_leaf_compact(args);
>
> - tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> + tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
> memset(bp->b_addr, 0, args->geo->blksize);
> leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
> @@ -1651,7 +1648,7 @@ xfs_attr3_leaf_compact(
> */
> xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
>
> - kfree(tmpbuffer);
> + kvfree(tmpbuffer);
> }
>
> /*
> @@ -2330,7 +2327,7 @@ xfs_attr3_leaf_unbalance(
> struct xfs_attr_leafblock *tmp_leaf;
> struct xfs_attr3_icleaf_hdr tmphdr;
>
> - tmp_leaf = kzalloc(state->args->geo->blksize,
> + tmp_leaf = kvzalloc(state->args->geo->blksize,
> GFP_KERNEL | __GFP_NOFAIL);
>
> /*
> @@ -2371,7 +2368,7 @@ xfs_attr3_leaf_unbalance(
> }
> memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
> savehdr = tmphdr; /* struct copy */
> - kfree(tmp_leaf);
> + kvfree(tmp_leaf);
> }
>
> xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
> --
> 2.44.1
>
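For readers outside XFS, the shape of the fix is the usual kvmalloc()
pattern; a minimal sketch with hypothetical names (blksize, src), not
the patch itself:

	/* kmalloc() warns when __GFP_NOFAIL is combined with an
	 * order > 1 request. kvmalloc() first tries kmalloc() without
	 * the NOFAIL semantics and, if that fails, falls back to a
	 * __GFP_NOFAIL vmalloc(), so large block-size buffers never
	 * trip the allocator warning. */
	void *tmpbuffer = kvmalloc(blksize, GFP_KERNEL | __GFP_NOFAIL);

	memcpy(tmpbuffer, src, blksize);
	/* ... work on the copy ... */
	kvfree(tmpbuffer);	/* handles both kmalloc and vmalloc memory */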
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 08/10] xfs: expose block size in stat
2024-06-25 11:44 ` [PATCH v8 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-07-01 2:33 ` Dave Chinner
0 siblings, 0 replies; 61+ messages in thread
From: Dave Chinner @ 2024-07-01 2:33 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: willy, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:18AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> For block size larger than page size, the unit of efficient IO is
> the block size, not the page size. Leaving stat() to report
> PAGE_SIZE as the block size causes test programs like fsx to issue
> illegal ranges for operations that require block size alignment
> (e.g. fallocate() insert range). Hence update the preferred IO size
> to reflect the block size in this case.
>
> This change is based on a patch originally from Dave Chinner.[1]
>
> [1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> fs/xfs/xfs_iops.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a00dcbc77e12..da5c13150315 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -562,7 +562,7 @@ xfs_stat_blksize(
> return 1U << mp->m_allocsize_log;
> }
>
> - return PAGE_SIZE;
> + return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
> }
Looks fine.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
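The user-visible effect is easy to check from userspace (illustrative
only; the device and sizes are hypothetical):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) != 0)
		return 1;
	/* On an LBS XFS (e.g. 16k blocks on a 4k-page system) this now
	 * prints 16384 instead of 4096, so tools like fsx pick up the
	 * correct alignment for block-granular operations. */
	printf("st_blksize = %ld\n", (long)st.st_blksize);
	return 0;
}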
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
2024-06-25 11:44 ` [PATCH v8 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-07-01 2:34 ` Dave Chinner
0 siblings, 0 replies; 61+ messages in thread
From: Dave Chinner @ 2024-07-01 2:34 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: willy, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:19AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> make the calculation generic so that page cache count can be calculated
> correctly for LBS.
>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> fs/xfs/xfs_mount.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 09eef1721ef4..3949f720b535 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -132,11 +132,16 @@ xfs_sb_validate_fsb_count(
> xfs_sb_t *sbp,
> uint64_t nblocks)
> {
> + uint64_t max_bytes;
> +
> ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> ASSERT(sbp->sb_blocklog >= BBSHIFT);
>
> + if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> + return -EFBIG;
> +
> /* Limited by ULONG_MAX of page cache index */
> - if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> + if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
> return -EFBIG;
> return 0;
looks fine.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
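For context, the hazard in the old expression is easiest to see with
concrete numbers (hypothetical values):

	/* Old check: nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX
	 *
	 * With 64k blocks (sb_blocklog = 16) on 4k pages (PAGE_SHIFT = 12),
	 * the shift count is 12 - 16 = -4, undefined behaviour in C.
	 * The reworked check computes the byte count first:
	 *
	 *   if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
	 *           return -EFBIG;   // nblocks << blocklog overflowed
	 *   if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
	 *           return -EFBIG;   // too many page cache indices
	 *
	 * which is valid whether blocklog is above or below PAGE_SHIFT.
	 */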
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 10/10] xfs: enable block size larger than page size support
2024-06-25 11:44 ` [PATCH v8 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-07-01 2:34 ` Dave Chinner
0 siblings, 0 replies; 61+ messages in thread
From: Dave Chinner @ 2024-07-01 2:34 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: willy, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:20AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Page cache now has the ability to have a minimum order when allocating
> a folio which is a prerequisite to add support for block size > page
> size.
>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> @hch and @Dave I have retained the min_folio_order to be in the inode
> struct as the discussion about moving this to xfs_mount is still open.
>
> fs/xfs/libxfs/xfs_ialloc.c | 5 +++++
> fs/xfs/libxfs/xfs_shared.h | 3 +++
> fs/xfs/xfs_icache.c | 6 ++++--
> fs/xfs/xfs_mount.c | 1 -
> fs/xfs/xfs_super.c | 18 ++++++++++--------
> 5 files changed, 22 insertions(+), 11 deletions(-)
all looks fine to me.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-06-25 11:44 ` [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-07-01 2:37 ` Dave Chinner
2024-07-01 11:22 ` Pankaj Raghav (Samsung)
2024-07-01 23:40 ` Darrick J. Wong
2024-07-02 7:42 ` Christoph Hellwig
2 siblings, 1 reply; 61+ messages in thread
From: Dave Chinner @ 2024-07-01 2:37 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: willy, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:16AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size, this will send the contents of the page
> > next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
Looks fine, so:
Reviewed-by: Dave Chinner <dchinner@redhat.com>
but....
> +/*
> + * Used for sub block zeroing in iomap_dio_zero()
> + */
> +#define ZERO_PAGE_64K_SIZE (65536)
> +#define ZERO_PAGE_64K_ORDER (get_order(ZERO_PAGE_64K_SIZE))
> +static struct page *zero_page_64k;
.....
> @@ -753,3 +765,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> return iomap_dio_complete(dio);
> }
> EXPORT_SYMBOL_GPL(iomap_dio_rw);
> +
> +static int __init iomap_dio_init(void)
> +{
> + zero_page_64k = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> + ZERO_PAGE_64K_ORDER);
> +
> + if (!zero_page_64k)
> + return -ENOMEM;
> +
> + set_memory_ro((unsigned long)page_address(zero_page_64k),
> + 1U << ZERO_PAGE_64K_ORDER);
^^^^^^^^^^^^^^^^^^^^^^^^^
isn't that just ZERO_PAGE_64K_SIZE?
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-01 2:37 ` Dave Chinner
@ 2024-07-01 11:22 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-01 11:22 UTC (permalink / raw)
To: Dave Chinner
Cc: willy, chandan.babu, djwong, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Mon, Jul 01, 2024 at 12:37:10PM +1000, Dave Chinner wrote:
> On Tue, Jun 25, 2024 at 11:44:16AM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> >
> > If the block size > page size, this will send the contents of the page
> > > next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> >
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> >
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
>
> Looks fine, so:
>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
Thanks!
>
> but....
>
> > +
> > + if (!zero_page_64k)
> > + return -ENOMEM;
> > +
> > + set_memory_ro((unsigned long)page_address(zero_page_64k),
> > + 1U << ZERO_PAGE_64K_ORDER);
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> isn't that just ZERO_PAGE_64K_SIZE?
Nope, set_memory_ro() takes a number of pages, not a size in bytes :)
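For the record, the page count works out as follows (assuming a 4k
base page):

	/* With PAGE_SHIFT = 12:
	 *
	 *   ZERO_PAGE_64K_ORDER              = get_order(65536)  = 4
	 *   1U << ZERO_PAGE_64K_ORDER        = 16 pages
	 *   ZERO_PAGE_64K_SIZE >> PAGE_SHIFT = 65536 >> 12       = 16 pages
	 *
	 * set_memory_ro(addr, numpages) wants the page count, so the
	 * shifted order is passed rather than the byte size.
	 */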
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
2024-06-25 11:44 ` [PATCH v8 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-07-01 23:39 ` Darrick J. Wong
0 siblings, 0 replies; 61+ messages in thread
From: Darrick J. Wong @ 2024-07-01 23:39 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:15AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Usually the page cache does not extend beyond the size of the inode,
> therefore, no PTEs are created for folios that extend beyond the size.
>
> But with LBS support, we might extend page cache beyond the size of the
> inode as we need to guarantee folios of minimum order. While doing a
> read, do_fault_around() can create PTEs for pages that lie beyond the
> EOF leading to incorrect error return when accessing a page beyond the
> mapped file.
>
> Cap the PTE range to be created for the page cache up to the end of
> file (EOF) in filemap_map_pages() so that the returned error codes are consistent
> with POSIX[1] for LBS configurations.
>
> generic/749 (currently in xfstest-dev patches-in-queue branch [0]) has
> been created to trigger this edge case. This also fixes generic/749 for
> tmpfs with huge=always on systems with 4k base page size.
>
> [0] https://lore.kernel.org/all/20240615002935.1033031-3-mcgrof@kernel.org/
> [1](from mmap(2)) SIGBUS
> Attempted access to a page of the buffer that lies beyond the end
> of the mapped file. For an explanation of the treatment of the
> bytes in the page that corresponds to the end of a mapped file
> that is not a multiple of the page size, see NOTES.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Heh, another fun mmap wart!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> mm/filemap.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8eafbd4a4d0c..56ff1d936aa8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3612,7 +3612,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> struct vm_area_struct *vma = vmf->vma;
> struct file *file = vma->vm_file;
> struct address_space *mapping = file->f_mapping;
> - pgoff_t last_pgoff = start_pgoff;
> + pgoff_t file_end, last_pgoff = start_pgoff;
> unsigned long addr;
> XA_STATE(xas, &mapping->i_pages, start_pgoff);
> struct folio *folio;
> @@ -3638,6 +3638,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> goto out;
> }
>
> + file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> + if (end_pgoff > file_end)
> + end_pgoff = file_end;
> +
> folio_type = mm_counter_file(folio);
> do {
> unsigned long end;
> --
> 2.44.1
>
>
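To make the cap concrete (hypothetical numbers, not from the patch):

	/* With i_size = 10000 bytes and 4k pages:
	 *
	 *   file_end = DIV_ROUND_UP(10000, 4096) - 1 = 3 - 1 = 2
	 *
	 * so fault-around creates PTEs only for pgoff 0..2, the pages
	 * that contain file data. An access at pgoff 3 or later
	 * (possible when a min-order folio extends the page cache past
	 * EOF) now takes the regular fault path and raises SIGBUS,
	 * matching the POSIX mmap() semantics quoted above.
	 */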
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-06-25 11:44 ` [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-07-01 2:37 ` Dave Chinner
@ 2024-07-01 23:40 ` Darrick J. Wong
2024-07-02 7:42 ` Christoph Hellwig
2 siblings, 0 replies; 61+ messages in thread
From: Darrick J. Wong @ 2024-07-01 23:40 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:16AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size, this will send the contents of the page
> next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/iomap/buffered-io.c | 4 ++--
> fs/iomap/direct-io.c | 30 ++++++++++++++++++++++++++++--
> 2 files changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f420c53d86ac..9a9e94c7ed1d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -2007,10 +2007,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
> }
> EXPORT_SYMBOL_GPL(iomap_writepages);
>
> -static int __init iomap_init(void)
> +static int __init iomap_pagecache_init(void)
> {
> return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> offsetof(struct iomap_ioend, io_bio),
> BIOSET_NEED_BVECS);
> }
> -fs_initcall(iomap_init);
> +fs_initcall(iomap_pagecache_init);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..61d09d2364f7 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -11,6 +11,7 @@
> #include <linux/iomap.h>
> #include <linux/backing-dev.h>
> #include <linux/uio.h>
> +#include <linux/set_memory.h>
> #include <linux/task_io_accounting_ops.h>
> #include "trace.h"
>
> @@ -27,6 +28,13 @@
> #define IOMAP_DIO_WRITE (1U << 30)
> #define IOMAP_DIO_DIRTY (1U << 31)
>
> +/*
> + * Used for sub block zeroing in iomap_dio_zero()
> + */
> +#define ZERO_PAGE_64K_SIZE (65536)
> +#define ZERO_PAGE_64K_ORDER (get_order(ZERO_PAGE_64K_SIZE))
> +static struct page *zero_page_64k;
> +
> struct iomap_dio {
> struct kiocb *iocb;
> const struct iomap_dio_ops *dops;
> @@ -236,9 +244,13 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> loff_t pos, unsigned len)
> {
> struct inode *inode = file_inode(dio->iocb->ki_filp);
> - struct page *page = ZERO_PAGE(0);
> struct bio *bio;
>
> + /*
> + * Max block size supported is 64k
> + */
> + WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
> +
> bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> @@ -246,7 +258,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - __bio_add_page(bio, page, len, 0);
> + __bio_add_page(bio, zero_page_64k, len, 0);
> iomap_dio_submit_bio(iter, dio, bio, pos);
> }
>
> @@ -753,3 +765,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> return iomap_dio_complete(dio);
> }
> EXPORT_SYMBOL_GPL(iomap_dio_rw);
> +
> +static int __init iomap_dio_init(void)
> +{
> + zero_page_64k = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> + ZERO_PAGE_64K_ORDER);
> +
> + if (!zero_page_64k)
> + return -ENOMEM;
> +
> + set_memory_ro((unsigned long)page_address(zero_page_64k),
> + 1U << ZERO_PAGE_64K_ORDER);
> + return 0;
> +}
> +fs_initcall(iomap_dio_init);
> --
> 2.44.1
>
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-06-25 11:44 ` [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-07-01 2:37 ` Dave Chinner
2024-07-01 23:40 ` Darrick J. Wong
@ 2024-07-02 7:42 ` Christoph Hellwig
2024-07-02 10:15 ` Pankaj Raghav (Samsung)
2 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2024-07-02 7:42 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:16AM +0000, Pankaj Raghav (Samsung) wrote:
> -static int __init iomap_init(void)
> +static int __init iomap_pagecache_init(void)
> {
> return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> offsetof(struct iomap_ioend, io_bio),
> BIOSET_NEED_BVECS);
> }
> -fs_initcall(iomap_init);
> +fs_initcall(iomap_pagecache_init);
s/iomap_pagecache_init/iomap_buffered_init/
We don't use pagecache naming anywhere else in the file.
> +/*
> + * Used for sub block zeroing in iomap_dio_zero()
> + */
> +#define ZERO_PAGE_64K_SIZE (65536)
just use SZ_64K
> +#define ZERO_PAGE_64K_ORDER (get_order(ZERO_PAGE_64K_SIZE))
No real point in having this.
> +static struct page *zero_page_64k;
This should be a folio. Encoding the size in the name is also really
weird and just creates churn when we have to increase it.
> + /*
> + * Max block size supported is 64k
> + */
> + WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
A WARN_ON without actually erroring out here is highly dangerous.
> +
> bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
Overly long line here.
> +
> +static int __init iomap_dio_init(void)
> +{
> + zero_page_64k = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> + ZERO_PAGE_64K_ORDER);
> +
> + if (!zero_page_64k)
> + return -ENOMEM;
> +
> + set_memory_ro((unsigned long)page_address(zero_page_64k),
> + 1U << ZERO_PAGE_64K_ORDER);
What's the point of the set_memory_ro here? Yes, we won't write to
it, but it's hardly an attack vector and fragments the direct map.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 7:42 ` Christoph Hellwig
@ 2024-07-02 10:15 ` Pankaj Raghav (Samsung)
2024-07-02 12:02 ` Christoph Hellwig
2024-07-02 13:49 ` Luis Chamberlain
0 siblings, 2 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-02 10:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: david, willy, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, Zi Yan
> > +fs_initcall(iomap_pagecache_init);
>
> s/iomap_pagecache_init/iomap_buffered_init/
>
> We don't use pagecache naming anywhere else in the file.
Got it.
>
> > +/*
> > + * Used for sub block zeroing in iomap_dio_zero()
> > + */
> > +#define ZERO_PAGE_64K_SIZE (65536)
>
> just use SZ_64K
>
> > +#define ZERO_PAGE_64K_ORDER (get_order(ZERO_PAGE_64K_SIZE))
>
> No real point in having this.
Hmm, I used it twice, hence the define. But if we decide to get rid of
set_memory_ro(), then this does not make sense.
>
> > +static struct page *zero_page_64k;
>
> This should be a folio. Encoding the size in the name is also really
> weird and just creates churn when we have to increase it.
Willy suggested we could use raw pages as we don't need the metadata
from using a folio. [0]
>
>
> > + /*
> > + * Max block size supported is 64k
> > + */
> > + WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
>
>
> A WARN_ON without actually erroring out here is highly dangerous.
I agree, but I think we decided that we are safe with 64k for now, as
filesystems that use iomap will not have a block size > 64k.
But this function needs changes when we decide to go beyond 64k:
it should return an error instead of returning nothing.
Until then, WARN_ON_ONCE is a good stop gap for people developing
the feature to go beyond a 64k block size[1].
>
> > +
> > bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>
> Overly long line here.
>
Not a part of my change, so I didn't bother reformatting it. :)
> > +
> > +static int __init iomap_dio_init(void)
> > +{
> > + zero_page_64k = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> > + ZERO_PAGE_64K_ORDER);
>
> > +
> > + if (!zero_page_64k)
> > + return -ENOMEM;
> > +
> > + set_memory_ro((unsigned long)page_address(zero_page_64k),
> > + 1U << ZERO_PAGE_64K_ORDER);
>
> What's the point of the set_memory_ro here? Yes, we won't write to
> it, but it's hardly an attack vector and fragments the direct map.
That is a good point. Darrick suggested adding the read-only protection
as we never write to it, but I did not know about the direct map
fragmentation it causes. So there is probably no value in calling
set_memory_ro() here.
--
Pankaj
[0] https://lore.kernel.org/linux-fsdevel/ZkT46AsZ3WghOArL@casper.infradead.org/
[1] I spent a lot of time banging my head over why I was getting FS
corruption when doing direct IO in XFS while adding LBS support, before
I found the PAGE_SIZE assumption here.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 10:15 ` Pankaj Raghav (Samsung)
@ 2024-07-02 12:02 ` Christoph Hellwig
2024-07-02 14:01 ` Pankaj Raghav (Samsung)
2024-07-02 16:50 ` Matthew Wilcox
2024-07-02 13:49 ` Luis Chamberlain
1 sibling, 2 replies; 61+ messages in thread
From: Christoph Hellwig @ 2024-07-02 12:02 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Christoph Hellwig, david, willy, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 10:15:56AM +0000, Pankaj Raghav (Samsung) wrote:
> Willy suggested we could use raw pages as we don't need the metadata
> from using a folio. [0]
Ok, that feels weird but I'll defer to his opinion in that case.
> > > + /*
> > > + * Max block size supported is 64k
> > > + */
> > > + WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
> >
> >
> > A WARN_ON without actually erroring out here is highly dangerous.
>
> I agree, but I think we decided that we are safe with 64k for now, as
> filesystems that use iomap will not have a block size > 64k.
>
> But this function needs changes when we decide to go beyond 64k:
> it should return an error instead of returning nothing.
> Until then, WARN_ON_ONCE is a good stop gap for people developing
> the feature to go beyond a 64k block size[1].
Sure, but please make it return an error and return that instead of
just warning and going beyond the allocated page.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 10:15 ` Pankaj Raghav (Samsung)
2024-07-02 12:02 ` Christoph Hellwig
@ 2024-07-02 13:49 ` Luis Chamberlain
1 sibling, 0 replies; 61+ messages in thread
From: Luis Chamberlain @ 2024-07-02 13:49 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), Mike Rapoport
Cc: Christoph Hellwig, david, willy, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 10:15:56AM +0000, Pankaj Raghav (Samsung) wrote:
> > > + set_memory_ro((unsigned long)page_address(zero_page_64k),
> > > + 1U << ZERO_PAGE_64K_ORDER);
> >
> > What's the point of the set_memory_ro here? Yes, we won't write to
> > it, but it's hardly an attack vector and fragments the direct map.
>
> That is a good point. Darrick suggested adding the read-only protection
> as we never write to it, but I did not know about the direct map
> fragmentation it causes. So there is probably no value in calling
> set_memory_ro() here.
Mike Rapoport already did the thankless hard work of evaluating whether
direct map fragmentation causes a performance issue, and it does not [0].
Either way, this is a *one* time thing, not something that happens as
often as other things that aggravate direct map fragmentation, like
eBPF, so from my perspective there is no issue with using
set_memory_ro() if we want it.
[0] https://lwn.net/Articles/931406/
Luis
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 12:02 ` Christoph Hellwig
@ 2024-07-02 14:01 ` Pankaj Raghav (Samsung)
2024-07-02 15:42 ` Christoph Hellwig
2024-07-02 16:50 ` Matthew Wilcox
1 sibling, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-02 14:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: david, willy, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, Zi Yan
> > > A WARN_ON without actually erroring out here is highly dangerous.
> >
> > I agree but I think we decided that we are safe with 64k for now as fs
> > that uses iomap will not have a block size > 64k.
> >
> > But this function needs some changes when we decide to go beyond 64k
> > by returning error instead of not returning anything.
> > Until then WARN_ON_ONCE would be a good stop gap for people developing
> > the feature to go beyond 64k block size[1].
>
> Sure, but please make it return an error and return that instead of
> just warning and going beyond the allocated page.
Does this make sense?
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 61d09d2364f7..14be34703588 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -240,16 +240,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
}
EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
-static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
+static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
loff_t pos, unsigned len)
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
struct bio *bio;
+ if (!len)
+ return 0;
/*
* Max block size supported is 64k
*/
- WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
+ if (len > ZERO_PAGE_64K_SIZE)
+ return -EINVAL;
bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
@@ -260,6 +263,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
__bio_add_page(bio, zero_page_64k, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
+ return 0;
}
/*
@@ -368,8 +372,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
if (need_zeroout) {
/* zero out from the start of the block to the write offset */
pad = pos & (fs_block_size - 1);
- if (pad)
- iomap_dio_zero(iter, dio, pos - pad, pad);
+
+ ret = iomap_dio_zero(iter, dio, pos - pad, pad);
+ if (ret)
+ goto out;
}
/*
@@ -443,7 +449,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
/* zero out from the end of the write to the end of the block */
pad = pos & (fs_block_size - 1);
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ ret = iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
}
out:
/* Undo iter limitation to current extent */
--
Pankaj
^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 14:01 ` Pankaj Raghav (Samsung)
@ 2024-07-02 15:42 ` Christoph Hellwig
2024-07-02 16:13 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2024-07-02 15:42 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Christoph Hellwig, david, willy, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 02:01:23PM +0000, Pankaj Raghav (Samsung) wrote:
> +static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> loff_t pos, unsigned len)
> {
> struct inode *inode = file_inode(dio->iocb->ki_filp);
> struct bio *bio;
>
> + if (!len)
> + return 0;
> /*
> * Max block size supported is 64k
> */
> - WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
> + if (len > ZERO_PAGE_64K_SIZE)
> + return -EINVAL;
This should probably be a WARN_ON_ONCE in addition to the error
return (and ZERO_PAGE_64K_SIZE really needs to go away..)
> + ret = iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
Overly long line here.
Otherwise this looks good.
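Folding both points together, the check might end up as (a sketch;
naming was still under discussion downthread):

	/* Max block size supported is 64k. */
	if (WARN_ON_ONCE(len > SZ_64K))
		return -EINVAL;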
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 15:42 ` Christoph Hellwig
@ 2024-07-02 16:13 ` Pankaj Raghav (Samsung)
2024-07-02 16:51 ` Matthew Wilcox
0 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-02 16:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: david, willy, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 05:42:16PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 02:01:23PM +0000, Pankaj Raghav (Samsung) wrote:
> > +static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > loff_t pos, unsigned len)
> > {
> > struct inode *inode = file_inode(dio->iocb->ki_filp);
> > struct bio *bio;
> >
> > + if (!len)
> > + return 0;
> > /*
> > * Max block size supported is 64k
> > */
> > - WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
> > + if (len > ZERO_PAGE_64K_SIZE)
> > + return -EINVAL;
>
> This should probably be a WARN_ON_ONCE in addition to the error
> return (and ZERO_PAGE_64K_SIZE really needs to go away..)
Yes, I will rename it to ZERO_PAGE_SZ_64K as you suggested.
>
> > + ret = iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
>
> Overly lone line here.
>
> Otherwise this looks good.
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 12:02 ` Christoph Hellwig
2024-07-02 14:01 ` Pankaj Raghav (Samsung)
@ 2024-07-02 16:50 ` Matthew Wilcox
1 sibling, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-02 16:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pankaj Raghav (Samsung), david, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 02:02:50PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 10:15:56AM +0000, Pankaj Raghav (Samsung) wrote:
> > Willy suggested we could use raw pages as we don't need the metadata
> > from using a folio. [0]
>
> Ok, that feels weird but I'll defer to his opinion in that case.
Let me see if I can make you feel less weird about it, since I think
this is something that people should have a clear feeling about.
In the Glorious Future, when we've separated pages and folios from each
other, folios are conceptually memory that gets mapped to userspace.
They have refcounts, mapcounts, a pointer to a file's mapping or an anon
vma's anon_vma, an index within that object, an LRU list, a dirty flag,
a lock bit, and so on.
We don't need any of that here. We might choose to use a special memdesc
for accounting purposes, but there's no need to allocate a folio for it.
For now, leaving it as a plain allocation of pages seems like the smartest
option, and we can revisit in the future.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 16:13 ` Pankaj Raghav (Samsung)
@ 2024-07-02 16:51 ` Matthew Wilcox
2024-07-02 17:10 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-02 16:51 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Christoph Hellwig, david, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 04:13:29PM +0000, Pankaj Raghav (Samsung) wrote:
> On Tue, Jul 02, 2024 at 05:42:16PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 02, 2024 at 02:01:23PM +0000, Pankaj Raghav (Samsung) wrote:
> > > +static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > > loff_t pos, unsigned len)
> > > {
> > > struct inode *inode = file_inode(dio->iocb->ki_filp);
> > > struct bio *bio;
> > >
> > > + if (!len)
> > > + return 0;
> > > /*
> > > * Max block size supported is 64k
> > > */
> > > - WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
> > > + if (len > ZERO_PAGE_64K_SIZE)
> > > + return -EINVAL;
> >
> > This should probably be a WARN_ON_ONCE in addition to the error
> > return (and ZERO_PAGE_64K_SIZE really needs to go away..)
>
> Yes, I will rename it to ZERO_PAGE_SZ_64K as you suggested.
No. It needs a symbolic name that doesn't include the actual size.
Maybe ZERO_PAGE_IO_MAX. Christoph suggested using SZ_64K to define
it, not to include it in the name.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 16:51 ` Matthew Wilcox
@ 2024-07-02 17:10 ` Pankaj Raghav (Samsung)
2024-07-03 5:16 ` Christoph Hellwig
0 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-02 17:10 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, david, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, Zi Yan
On Tue, Jul 02, 2024 at 05:51:54PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 02, 2024 at 04:13:29PM +0000, Pankaj Raghav (Samsung) wrote:
> > On Tue, Jul 02, 2024 at 05:42:16PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 02, 2024 at 02:01:23PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > +static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > > > loff_t pos, unsigned len)
> > > > {
> > > > struct inode *inode = file_inode(dio->iocb->ki_filp);
> > > > struct bio *bio;
> > > >
> > > > + if (!len)
> > > > + return 0;
> > > > /*
> > > > * Max block size supported is 64k
> > > > */
> > > > - WARN_ON_ONCE(len > ZERO_PAGE_64K_SIZE);
> > > > + if (len > ZERO_PAGE_64K_SIZE)
> > > > + return -EINVAL;
> > >
> > > This should probably be a WARN_ON_ONCE in addition to the error
> > > return (and ZERO_PAGE_64K_SIZE really needs to go away..)
> >
> > Yes, I will rename it to ZERO_PAGE_SZ_64K as you suggested.
>
> No. It needs a symbolic name that doesn't include the actual size.
> Maybe ZERO_PAGE_IO_MAX. Christoph suggested using SZ_64K to define
> it, not to include it in the name.
Initially I kept the name as ZERO_FSB_PAGE as it is used to do sub-block
zeroing. But I know John from Oracle is already working on using it for
rt extent zeroing. So I will just go with ZERO_PAGE_IO_MAX for now.
Understood about the SZ_64K part. Thanks for the clarification.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead
2024-06-25 11:44 ` [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-07-02 19:38 ` Darrick J. Wong
2024-07-03 14:10 ` Pankaj Raghav (Samsung)
2024-07-04 14:24 ` Ryan Roberts
1 sibling, 1 reply; 61+ messages in thread
From: Darrick J. Wong @ 2024-07-02 19:38 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jun 25, 2024 at 11:44:13AM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> page_cache_ra_unbounded() was allocating single pages (0 order folios)
> if there was no folio found at an index. Allocate mapping_min_order folios
> as we need to guarantee the minimum order if it is set.
> While we are at it, rework the loop in page_cache_ra_unbounded() to
> advance with the number of pages in a folio instead of just one page at
> a time.
Ok, sounds pretty straightforward so far.
> page_cache_ra_order() tries to allocate folios in the page cache with a
> higher order if the index aligns with that order. Modify it so that the
> order does not go below the mapping_min_order requirement of the page
> cache. This function will do the right thing even if the new_order passed
> is less than the mapping_min_order.
Hmm. So if I'm understanding this correctly: Currently,
page_cache_ra_order tries to allocate higher order folios if the
readahead index happens to be aligned to one of those higher orders.
With the minimum mapping order requirement, it now expands the readahead
range upwards and downwards to maintain the mapping_min_order
requirement. Right?
> When adding new folios to the page cache we must also ensure the index
> used is aligned to the mapping_min_order as the page cache requires the
> index to be aligned to the order of the folio.
>
> readahead_expand() is called from readahead aops to extend the range of
> the readahead so this function can assume ractl->_index to be aligned with
> min_order.
...and I guess this function also has to be modified to expand the ra
range even further if necessary to align with mapping_min_order. Right?
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Co-developed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
> mm/readahead.c | 81 +++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 63 insertions(+), 18 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 66058ae02f2e..2acfd6447d7b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> unsigned long nr_to_read, unsigned long lookahead_size)
> {
> struct address_space *mapping = ractl->mapping;
> - unsigned long index = readahead_index(ractl);
> + unsigned long ra_folio_index, index = readahead_index(ractl);
> gfp_t gfp_mask = readahead_gfp_mask(mapping);
> - unsigned long i;
> + unsigned long mark, i = 0;
> + unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>
> /*
> * Partway through the readahead operation, we will have added
> @@ -223,10 +224,26 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
I'm not as familiar with this function since xfs/iomap don't use it.
Does anyone actually pass nonzero lookahead size?
What does ext4_read_merkle_tree_page do??
folio = __filemap_get_folio(inode->i_mapping, index, FGP_ACCESSED, 0);
if (IS_ERR(folio) || !folio_test_uptodate(folio)) {
DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
if (!IS_ERR(folio))
folio_put(folio);
else if (num_ra_pages > 1)
page_cache_ra_unbounded(&ractl, num_ra_pages, 0);
So we try to get the folio. If the folio is an errptr then we try
unbounded readahead, which I guess works for ENOENT or EAGAIN; maybe
less well if __filemap_get_folio returns ENOMEM.
If @folio is a real but !uptodate folio then we put the folio and read
it again, but without doing readahead. <shrug>
> unsigned int nofs = memalloc_nofs_save();
>
> filemap_invalidate_lock_shared(mapping);
> + index = mapping_align_index(mapping, index);
> +
> + /*
> + * As iterator `i` is aligned to min_nrpages, round_up the
> + * difference between nr_to_read and lookahead_size to mark the
> + * index that only has lookahead or "async_region" to set the
> + * readahead flag.
> + */
> + ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
> + min_nrpages);
So at this point we've rounded index down and the readahead region up to
fit the min_nrpages requirement. I'm not sure what the lookahead region
does, since nobody passes nonzero. Judging from the other functions, I
guess that's the region that we're allowed to do asynchronously?
> + mark = ra_folio_index - index;
Ah, ok, yes. We mark the first folio in the "async" region so that we
(re)start readahead when someone accesses that folio.
> + if (index != readahead_index(ractl)) {
> + nr_to_read += readahead_index(ractl) - index;
> + ractl->_index = index;
> + }
So then, if we rounded index down, we now have to add that difference
back to the ra region.
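A worked example (hypothetical values): with 64k folios on a 4k-page
system, min_nrpages = 16; suppose the caller passes index = 20,
nr_to_read = 32 and lookahead_size = 8. Then:

	index          = round_down(20, 16)        = 16
	ra_folio_index = round_up(20 + 32 - 8, 16) = 48
	mark           = 48 - 16                   = 32
	nr_to_read    += 20 - 16                   -> 36

The loop below advances i in steps of min_nrpages, so the folio
allocated at i == mark (pgoff 48) gets the readahead flag, which is the
rounded-up start of the async region.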
> +
> /*
> * Preallocate as many pages as we will need.
> */
> - for (i = 0; i < nr_to_read; i++) {
> + while (i < nr_to_read) {
> struct folio *folio = xa_load(&mapping->i_pages, index + i);
> int ret;
>
> @@ -240,12 +257,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> * not worth getting one just for that.
> */
> read_pages(ractl);
> - ractl->_index++;
> - i = ractl->_index + ractl->_nr_pages - index - 1;
> + ractl->_index += min_nrpages;
> + i = ractl->_index + ractl->_nr_pages - index;
> continue;
> }
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask,
> + mapping_min_folio_order(mapping));
> if (!folio)
> break;
>
> @@ -255,14 +273,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> if (ret == -ENOMEM)
> break;
> read_pages(ractl);
> - ractl->_index++;
> - i = ractl->_index + ractl->_nr_pages - index - 1;
> + ractl->_index += min_nrpages;
> + i = ractl->_index + ractl->_nr_pages - index;
> continue;
> }
> - if (i == nr_to_read - lookahead_size)
> + if (i == mark)
> folio_set_readahead(folio);
> ractl->_workingset |= folio_test_workingset(folio);
> - ractl->_nr_pages++;
> + ractl->_nr_pages += min_nrpages;
> + i += min_nrpages;
> }
>
> /*
> @@ -492,13 +511,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
> {
> struct address_space *mapping = ractl->mapping;
> pgoff_t index = readahead_index(ractl);
> + unsigned int min_order = mapping_min_folio_order(mapping);
> pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> pgoff_t mark = index + ra->size - ra->async_size;
> unsigned int nofs;
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> + unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>
> - if (!mapping_large_folio_support(mapping) || ra->size < 4)
> + /*
> + * Fallback when size < min_nrpages as each folio should be
> + * at least min_nrpages anyway.
> + */
> + if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> goto fallback;
>
> limit = min(limit, index + ra->size - 1);
> @@ -507,11 +532,20 @@ void page_cache_ra_order(struct readahead_control *ractl,
> new_order += 2;
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> + new_order = max(new_order, min_order);
> }
>
> /* See comment in page_cache_ra_unbounded() */
> nofs = memalloc_nofs_save();
> filemap_invalidate_lock_shared(mapping);
> + /*
> + * If the new_order is greater than min_order and index is
> + * already aligned to new_order, then this will be noop as index
> + * aligned to new_order should also be aligned to min_order.
> + */
> + ractl->_index = mapping_align_index(mapping, index);
> + index = readahead_index(ractl);
I guess this also rounds index down to mapping_min_order...
> +
> while (index <= limit) {
> unsigned int order = new_order;
>
> @@ -519,7 +553,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> if (index & ((1UL << order) - 1))
> order = __ffs(index);
> /* Don't allocate pages past EOF */
> - while (index + (1UL << order) - 1 > limit)
> + while (order > min_order && index + (1UL << order) - 1 > limit)
> order--;
...and then we try to find an order that works and doesn't go below
min_order. We already rounded index down to mapping_min_order, so that
will always succeed. Right?
> err = ra_alloc_folio(ractl, index, mark, order, gfp);
> if (err)
> @@ -783,8 +817,15 @@ void readahead_expand(struct readahead_control *ractl,
> struct file_ra_state *ra = ractl->ra;
> pgoff_t new_index, new_nr_pages;
> gfp_t gfp_mask = readahead_gfp_mask(mapping);
> + unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
> + unsigned int min_order = mapping_min_folio_order(mapping);
>
> new_index = new_start / PAGE_SIZE;
> + /*
> + * Readahead code should have aligned the ractl->_index to
> + * min_nrpages before calling readahead aops.
> + */
> + VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
>
> /* Expand the leading edge downwards */
> while (ractl->_index > new_index) {
> @@ -794,9 +835,11 @@ void readahead_expand(struct readahead_control *ractl,
> if (folio && !xa_is_value(folio))
> return; /* Folio apparently present */
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask, min_order);
> if (!folio)
> return;
> +
> + index = mapping_align_index(mapping, index);
> if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> folio_put(folio);
> return;
> @@ -806,7 +849,7 @@ void readahead_expand(struct readahead_control *ractl,
> ractl->_workingset = true;
> psi_memstall_enter(&ractl->_pflags);
> }
> - ractl->_nr_pages++;
> + ractl->_nr_pages += min_nrpages;
> ractl->_index = folio->index;
> }
>
> @@ -821,9 +864,11 @@ void readahead_expand(struct readahead_control *ractl,
> if (folio && !xa_is_value(folio))
> return; /* Folio apparently present */
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask, min_order);
> if (!folio)
> return;
> +
> + index = mapping_align_index(mapping, index);
> if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> folio_put(folio);
> return;
> @@ -833,10 +878,10 @@ void readahead_expand(struct readahead_control *ractl,
> ractl->_workingset = true;
> psi_memstall_enter(&ractl->_pflags);
> }
> - ractl->_nr_pages++;
> + ractl->_nr_pages += min_nrpages;
> if (ra) {
> - ra->size++;
> - ra->async_size++;
> + ra->size += min_nrpages;
> + ra->async_size += min_nrpages;
...and here we are expanding the ra window yet again, only now taking
min order into account. Right? Looks ok to me, though again, iomap/xfs
don't use this function so I'm not that familiar with it.
If the answer to /all/ the questions is 'yes' then
Acked-by: Darrick J. Wong <djwong@kernel.org>
--D
> }
> }
> }
> --
> 2.44.1
>
>
* Re: [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-07-02 17:10 ` Pankaj Raghav (Samsung)
@ 2024-07-03 5:16 ` Christoph Hellwig
0 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2024-07-03 5:16 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Matthew Wilcox, Christoph Hellwig, david, chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
Zi Yan
On Tue, Jul 02, 2024 at 05:10:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > > Yes, I will rename it to ZERO_PAGE_SZ_64K as you suggested.
> >
> > No. It needs a symbolic name that doesn't include the actual size.
> > Maybe ZERO_PAGE_IO_MAX. Christoph suggested using SZ_64K to define
> > it, not to include it in the name.
>
> Initially I kept the name as ZERO_FSB_PAGE as it is used to do sub-block
> zeroing. But I know John from Oracle is already working on using it for
> rt extent zeroing. So I will just go with ZERO_PAGE_IO_MAX for now.
> Understood about the SZ_64K part. Thanks for the clarification.
IOMAP_ZERO_PAGE_SIZE ?
(I kind of regret not going for just adding a zero page as originally
suggested, but then we'd keep arguing about the nr_vecs sizing and
naming :))
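For reference, a minimal sketch of the kind of definition being
bikeshedded here (the name is still undecided; SZ_64K comes from
<linux/sizes.h>):

	/* buffer used by iomap_dio_zero() for sub-block zeroing,
	 * zeroed once and marked read-only after init; name not final */
	#define IOMAP_ZERO_PAGE_SIZE	SZ_64K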
* Re: [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead
2024-07-02 19:38 ` Darrick J. Wong
@ 2024-07-03 14:10 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-03 14:10 UTC (permalink / raw)
To: Darrick J. Wong
Cc: david, willy, chandan.babu, brauner, akpm, linux-kernel, yang,
linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof,
gost.dev, cl, linux-xfs, hch, Zi Yan
On Tue, Jul 02, 2024 at 12:38:30PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 25, 2024 at 11:44:13AM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > page_cache_ra_unbounded() was allocating single pages (0 order folios)
> > if there was no folio found at an index. Allocate mapping_min_order folios
> > as we need to guarantee the minimum order if it is set.
> > While we are at it, rework the loop in page_cache_ra_unbounded() to
> > advance with the number of pages in a folio instead of just one page at
> > a time.
>
> Ok, sounds pretty straightforward so far.
>
> > page_cache_ra_order() tries to allocate a folio to the page cache with a
> > higher order if the index aligns with that order. Modify it so that the
> > order does not go below the mapping_min_order requirement of the page
> > cache. This function will do the right thing even if the new_order passed
> > is less than the mapping_min_order.
>
> Hmm. So if I'm understanding this correctly: Currently,
> page_cache_ra_order tries to allocate higher order folios if the
> readahead index happens to be aligned to one of those higher orders.
> With the minimum mapping order requirement, it now expands the readahead
> range upwards and downwards to maintain the mapping_min_order
> requirement. Right?
>
Yes. We only expand because the index that was passed needs to be
included and not excluded.
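To make the round-down concrete, a minimal standalone sketch (assumed
numbers; not code from the patch):

#include <assert.h>

/* the same round-down that mapping_align_index() performs */
static unsigned long align_down(unsigned long index, unsigned int min_order)
{
	return index & ~((1UL << min_order) - 1);
}

int main(void)
{
	/* min_order = 2 (min_nrpages = 4): index 5 rounds down to 4, so
	 * the order-2 folio covering pages [4, 7] still includes index 5 */
	assert(align_down(5, 2) == 4);
	return 0;
}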
> > When adding new folios to the page cache we must also ensure the index
> > used is aligned to the mapping_min_order as the page cache requires the
> > index to be aligned to the order of the folio.
> >
> > readahead_expand() is called from readahead aops to extend the range of
> > the readahead so this function can assume ractl->_index to be aligned with
> > min_order.
>
> ...and I guess this function also has to be modified to expand the ra
> range even further if necessary to align with mapping_min_order. Right?
Yes! This function is a bit different from the other two because it is
called from the readahead aops callback, while the other two are
responsible for calling the readahead aops. That is why we can assume
the index passed is aligned to the min order.
>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Co-developed-by: Hannes Reinecke <hare@suse.de>
> > Signed-off-by: Hannes Reinecke <hare@suse.de>
> > ---
> > mm/readahead.c | 81 +++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 63 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 66058ae02f2e..2acfd6447d7b 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > unsigned long nr_to_read, unsigned long lookahead_size)
> > {
> > struct address_space *mapping = ractl->mapping;
> > - unsigned long index = readahead_index(ractl);
> > + unsigned long ra_folio_index, index = readahead_index(ractl);
> > gfp_t gfp_mask = readahead_gfp_mask(mapping);
> > - unsigned long i;
> > + unsigned long mark, i = 0;
> > + unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> >
> > /*
> > * Partway through the readahead operation, we will have added
> > @@ -223,10 +224,26 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>
> I'm not as familiar with this function since xfs/iomap don't use it.
This function ultimately invokes xfs_vm_readahead through read_pages().
So it sort of sits above xfs aops.
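Roughly, the call chain being described:

	page_cache_ra_unbounded()
	  -> read_pages(ractl)
	       -> mapping->a_ops->readahead()	/* xfs_vm_readahead() for XFS */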
> > unsigned int nofs = memalloc_nofs_save();
> >
> > filemap_invalidate_lock_shared(mapping);
> > + index = mapping_align_index(mapping, index);
> > +
> > + /*
> > + * As iterator `i` is aligned to min_nrpages, round_up the
> > + * difference between nr_to_read and lookahead_size to mark the
> > + * index that only has lookahead or "async_region" to set the
> > + * readahead flag.
> > + */
> > + ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
> > + min_nrpages);
>
> So at this point we've rounded index down and the readahead region up to
> fit the min_nrpages requirement. I'm not sure what the lookahead region
> does, since nobody passes nonzero. Judging from the other functions, I
> guess that's the region that we're allowed to do asynchronously?
>
> > + mark = ra_folio_index - index;
>
> Ah, ok, yes. We mark the first folio in the "async" region so that we
> (re)start readahead when someone accesses that folio.
Yes. I think we consider it as a hit, so we could expand the readahead
window now.
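As a worked example with assumed numbers (min_nrpages = 4,
readahead_index(ractl) = 10, nr_to_read = 8, lookahead_size = 2):

	index          = 10 & ~(4 - 1);            /* = 8, rounded down */
	ra_folio_index = round_up(10 + 8 - 2, 4);  /* = 16, rounded up  */
	mark           = 16 - 8;                   /* = 8               */

so the folio allocated at index + mark == 16 is the one that gets the
readahead flag.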
>
> > + if (index != readahead_index(ractl)) {
> > + nr_to_read += readahead_index(ractl) - index;
> > + ractl->_index = index;
> > + }
>
> So then if we rounded index down, now we have to add that to the ra
> region.
Yes. I could also make it unconditional because if index ==
readahead_index(ractl), nr_to_read and ractl->_index will just remain
the same. Probably that is more efficient than having a conditional and
then a subtraction.
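That is, the unconditional form would simply be:

	nr_to_read += readahead_index(ractl) - index;	/* adds 0 when equal */
	ractl->_index = index;				/* stores the same value */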
>
> > +
> > /*
> > + if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> > goto fallback;
> >
> > limit = min(limit, index + ra->size - 1);
> > @@ -507,11 +532,20 @@ void page_cache_ra_order(struct readahead_control *ractl,
> > new_order += 2;
> > new_order = min(mapping_max_folio_order(mapping), new_order);
> > new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> > + new_order = max(new_order, min_order);
> > }
> >
> > /* See comment in page_cache_ra_unbounded() */
> > nofs = memalloc_nofs_save();
> > filemap_invalidate_lock_shared(mapping);
> > + /*
> > + * If the new_order is greater than min_order and index is
> > + * already aligned to new_order, then this will be noop as index
> > + * aligned to new_order should also be aligned to min_order.
> > + */
> > + ractl->_index = mapping_align_index(mapping, index);
> > + index = readahead_index(ractl);
>
> I guess this also rounds index down to mapping_min_order...
Yes.
>
> > +
> > while (index <= limit) {
> > unsigned int order = new_order;
> >
> > @@ -519,7 +553,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> > if (index & ((1UL << order) - 1))
> > order = __ffs(index);
> > /* Don't allocate pages past EOF */
> > - while (index + (1UL << order) - 1 > limit)
> > + while (order > min_order && index + (1UL << order) - 1 > limit)
> > order--;
>
> ...and then we try to find an order that works and doesn't go below
> min_order. We already rounded index down to mapping_min_order, so that
> will always succeed. Right?
Yes. We never go below the min order, even if it means extending
slightly beyond i_size (hence also the mmap fix later on in the
series).
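A quick worked example with assumed numbers (min_order = 2, index = 8,
limit = 9, new_order = 4):

	/* 8 + (1 << 4) - 1 = 23 > 9, so order drops 4 -> 3 -> 2 and then
	 * stops at min_order, even though 8 + (1 << 2) - 1 = 11 still
	 * exceeds limit: the order-2 folio [8, 11] extends past EOF. */
	while (order > min_order && index + (1UL << order) - 1 > limit)
		order--;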
>
> > err = ra_alloc_folio(ractl, index, mark, order, gfp);
> > if (err)
> >
> > @@ -821,9 +864,11 @@ void readahead_expand(struct readahead_control *ractl,
> > if (folio && !xa_is_value(folio))
> > return; /* Folio apparently present */
> >
> > - folio = filemap_alloc_folio(gfp_mask, 0);
> > + folio = filemap_alloc_folio(gfp_mask, min_order);
> > if (!folio)
> > return;
> > +
> > + index = mapping_align_index(mapping, index);
> > if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> > folio_put(folio);
> > return;
> > @@ -833,10 +878,10 @@ void readahead_expand(struct readahead_control *ractl,
> > ractl->_workingset = true;
> > psi_memstall_enter(&ractl->_pflags);
> > }
> > - ractl->_nr_pages++;
> > + ractl->_nr_pages += min_nrpages;
> > if (ra) {
> > - ra->size++;
> > - ra->async_size++;
> > + ra->size += min_nrpages;
> > + ra->async_size += min_nrpages;
>
> ...and here we are expanding the ra window yet again, only now taking
> min order into account. Right? Looks ok to me, though again, iomap/xfs
> don't use this function so I'm not that familiar with it.
It is used only by a few filesystems, but that is the idea. Before, we
used to add single pages, but now we add min_order worth of pages if it
is set. Very similar to the previous patch.
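In other words, with an assumed min_order = 2 (min_nrpages = 4), each
pass of the expansion loop now accounts for one order-2 folio, so the
window grows in steps of 4 pages rather than 1:

	ractl->_nr_pages += min_nrpages;	/* +4 pages per folio */
	ra->size         += min_nrpages;
	ra->async_size   += min_nrpages;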
>
> If the answer to /all/ the questions is 'yes' then
>
> Acked-by: Darrick J. Wong <djwong@kernel.org>
Thanks, I will add it :)
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-06-25 11:44 ` [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-07-04 12:23 ` Ryan Roberts
2024-07-04 15:20 ` Matthew Wilcox
2024-07-04 21:34 ` Pankaj Raghav (Samsung)
2024-07-09 16:29 ` Pankaj Raghav (Samsung)
1 sibling, 2 replies; 61+ messages in thread
From: Ryan Roberts @ 2024-07-04 12:23 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), david, willy, chandan.babu, djwong,
brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
Hi,
Here are some drive-by review comments as I'm evaluating whether these patches
can help me with what I'm trying to do at
https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/...
On 25/06/2024 12:44, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> We need filesystems to be able to communicate acceptable folio sizes
> to the pagecache for a variety of uses (e.g. large block sizes).
> Support a range of folio sizes between order-0 and order-31.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
> include/linux/pagemap.h | 86 ++++++++++++++++++++++++++++++++++-------
> mm/filemap.c | 6 +--
> mm/readahead.c | 4 +-
> 3 files changed, 77 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4b71d581091f..0c51154cdb57 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -204,14 +204,21 @@ enum mapping_flags {
> AS_EXITING = 4, /* final truncate in progress */
> /* writeback related tags are not used */
> AS_NO_WRITEBACK_TAGS = 5,
> - AS_LARGE_FOLIO_SUPPORT = 6,
nit: this removed enum is still referenced in a comment further down the file.
> - AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */
> - AS_STABLE_WRITES, /* must wait for writeback before modifying
> + AS_RELEASE_ALWAYS = 6, /* Call ->release_folio(), even if no private data */
> + AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
> folio contents */
> - AS_UNMOVABLE, /* The mapping cannot be moved, ever */
> - AS_INACCESSIBLE, /* Do not attempt direct R/W access to the mapping */
> + AS_UNMOVABLE = 8, /* The mapping cannot be moved, ever */
> + AS_INACCESSIBLE = 9, /* Do not attempt direct R/W access to the mapping */
> + /* Bits 16-25 are used for FOLIO_ORDER */
> + AS_FOLIO_ORDER_BITS = 5,
> + AS_FOLIO_ORDER_MIN = 16,
> + AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
nit: These 3 new enums seem a bit odd. It might be clearer if you just reserve
the bits for the fields here? AS_FOLIO_ORDER_BITS isn't actually a flags bit and
the MAX value is currently the start of the max field, not the end.
#define AS_FOLIO_ORDER_BITS 5
enum mapping_flags {
...
AS_FOLIO_ORDERS_FIRST = 16,
AS_FOLIO_ORDERS_LAST = AS_FOLIO_ORDERS_FIRST+(2*AS_FOLIO_ORDER_BITS)-1,
...
};
#define AS_FOLIO_ORDER_MIN_MASK \
GENMASK(AS_FOLIO_ORDERS_FIRST + AS_FOLIO_ORDER_BITS - 1, \
AS_FOLIO_ORDERS_FIRST)
#define AS_FOLIO_ORDER_MAX_MASK \
GENMASK(AS_FOLIO_ORDERS_LAST, \
AS_FOLIO_ORDERS_LAST - AS_FOLIO_ORDER_BITS + 1)
> };
>
> +#define AS_FOLIO_ORDER_MASK ((1u << AS_FOLIO_ORDER_BITS) - 1)
> +#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
> +#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
> +
> /**
> * mapping_set_error - record a writeback error in the address_space
> * @mapping: the mapping in which an error should be set
> @@ -360,9 +367,49 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> #define MAX_PAGECACHE_ORDER 8
> #endif
>
> +/*
> + * mapping_set_folio_order_range() - Set the orders supported by a file.
> + * @mapping: The address space of the file.
> + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
> + *
> + * The filesystem should call this function in its inode constructor to
> + * indicate which base size (min) and maximum size (max) of folio the VFS
> + * can use to cache the contents of the file. This should only be used
> + * if the filesystem needs special handling of folio sizes (ie there is
> + * something the core cannot know).
> + * Do not tune it based on, eg, i_size.
> + *
> + * Context: This should not be called while the inode is active as it
> + * is non-atomic.
> + */
> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> + unsigned int min,
> + unsigned int max)
> +{
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return;
> +
> + if (min > MAX_PAGECACHE_ORDER)
> + min = MAX_PAGECACHE_ORDER;
> + if (max > MAX_PAGECACHE_ORDER)
> + max = MAX_PAGECACHE_ORDER;
> + if (max < min)
> + max = min;
It seems strange to silently clamp these? Presumably for the bs>ps usecase,
whatever values are passed in are a hard requirement? So wouldn't want them to
be silently reduced. (Especially given the recent change to reduce the size of
MAX_PAGECACHE_ORDER to less than PMD size in some cases).
> +
> + mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> + (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
> +}
> +
> +static inline void mapping_set_folio_min_order(struct address_space *mapping,
> + unsigned int min)
> +{
> + mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
> +}
> +
> /**
> * mapping_set_large_folios() - Indicate the file supports large folios.
> - * @mapping: The file.
> + * @mapping: The address space of the file.
> *
> * The filesystem should call this function in its inode constructor to
> * indicate that the VFS can use large folios to cache the contents of
> @@ -373,7 +420,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> */
> static inline void mapping_set_large_folios(struct address_space *mapping)
> {
> - __set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> + mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
> +}
> +
> +static inline
> +unsigned int mapping_max_folio_order(const struct address_space *mapping)
> +{
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> + return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
> +}
> +
> +static inline
> +unsigned int mapping_min_folio_order(const struct address_space *mapping)
> +{
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> + return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
> }
>
> /*
> @@ -386,16 +449,13 @@ static inline bool mapping_large_folio_support(struct address_space *mapping)
> VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
> "Anonymous mapping always supports large folio");
>
> - return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> - test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> + return mapping_max_folio_order(mapping) > 0;
> }
>
> /* Return the maximum folio size for this pagecache mapping, in bytes. */
> -static inline size_t mapping_max_folio_size(struct address_space *mapping)
> +static inline size_t mapping_max_folio_size(const struct address_space *mapping)
> {
> - if (mapping_large_folio_support(mapping))
> - return PAGE_SIZE << MAX_PAGECACHE_ORDER;
> - return PAGE_SIZE;
> + return PAGE_SIZE << mapping_max_folio_order(mapping);
> }
>
> static inline int filemap_nr_thps(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0b8c732bb643..d617c9afca51 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1933,10 +1933,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
> fgp_flags |= FGP_LOCK;
>
> - if (!mapping_large_folio_support(mapping))
> - order = 0;
> - if (order > MAX_PAGECACHE_ORDER)
> - order = MAX_PAGECACHE_ORDER;
> + if (order > mapping_max_folio_order(mapping))
> + order = mapping_max_folio_order(mapping);
> /* If we're not aligned, allocate a smaller folio */
> if (index & ((1UL << order) - 1))
> order = __ffs(index);
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c1b23989d9ca..66058ae02f2e 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -503,9 +503,9 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
> limit = min(limit, index + ra->size - 1);
>
> - if (new_order < MAX_PAGECACHE_ORDER) {
> + if (new_order < mapping_max_folio_order(mapping)) {
> new_order += 2;
> - new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> + new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
I wonder if it's possible that ra->size could ever be less than
mapping_min_folio_order()? Do you need to handle that?
Thanks,
Ryan
> }
>
* Re: [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead
2024-06-25 11:44 ` [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-07-02 19:38 ` Darrick J. Wong
@ 2024-07-04 14:24 ` Ryan Roberts
2024-07-04 14:29 ` Matthew Wilcox
1 sibling, 1 reply; 61+ messages in thread
From: Ryan Roberts @ 2024-07-04 14:24 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), david, willy, chandan.babu, djwong,
brauner, akpm
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On 25/06/2024 12:44, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> page_cache_ra_unbounded() was allocating single pages (0 order folios)
> if there was no folio found at an index. Allocate mapping_min_order folios
> as we need to guarantee the minimum order if it is set.
> While we are at it, rework the loop in page_cache_ra_unbounded() to
> advance with the number of pages in a folio instead of just one page at
> a time.
>
> page_cache_ra_order() tries to allocate a folio to the page cache with a
> higher order if the index aligns with that order. Modify it so that the
> order does not go below the mapping_min_order requirement of the page
> cache. This function will do the right thing even if the new_order passed
> is less than the mapping_min_order.
> When adding new folios to the page cache we must also ensure the index
> used is aligned to the mapping_min_order as the page cache requires the
> index to be aligned to the order of the folio.
>
> readahead_expand() is called from readahead aops to extend the range of
> the readahead so this function can assume ractl->_index to be aligned with
> min_order.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Co-developed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
> mm/readahead.c | 81 +++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 63 insertions(+), 18 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 66058ae02f2e..2acfd6447d7b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> unsigned long nr_to_read, unsigned long lookahead_size)
> {
> struct address_space *mapping = ractl->mapping;
> - unsigned long index = readahead_index(ractl);
> + unsigned long ra_folio_index, index = readahead_index(ractl);
> gfp_t gfp_mask = readahead_gfp_mask(mapping);
> - unsigned long i;
> + unsigned long mark, i = 0;
> + unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>
> /*
> * Partway through the readahead operation, we will have added
> @@ -223,10 +224,26 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> unsigned int nofs = memalloc_nofs_save();
>
> filemap_invalidate_lock_shared(mapping);
> + index = mapping_align_index(mapping, index);
> +
> + /*
> + * As iterator `i` is aligned to min_nrpages, round_up the
> + * difference between nr_to_read and lookahead_size to mark the
> + * index that only has lookahead or "async_region" to set the
> + * readahead flag.
> + */
> + ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
> + min_nrpages);
> + mark = ra_folio_index - index;
> + if (index != readahead_index(ractl)) {
> + nr_to_read += readahead_index(ractl) - index;
> + ractl->_index = index;
> + }
> +
> /*
> * Preallocate as many pages as we will need.
> */
> - for (i = 0; i < nr_to_read; i++) {
> + while (i < nr_to_read) {
> struct folio *folio = xa_load(&mapping->i_pages, index + i);
> int ret;
>
> @@ -240,12 +257,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> * not worth getting one just for that.
> */
For the case that the folio is already in the xarray, perhaps it's worth
asserting that the folio is at least min_nrpages?
> read_pages(ractl);
> - ractl->_index++;
> - i = ractl->_index + ractl->_nr_pages - index - 1;
> + ractl->_index += min_nrpages;
> + i = ractl->_index + ractl->_nr_pages - index;
> continue;
> }
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask,
> + mapping_min_folio_order(mapping));
> if (!folio)
> break;
>
> @@ -255,14 +273,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> if (ret == -ENOMEM)
> break;
> read_pages(ractl);
> - ractl->_index++;
> - i = ractl->_index + ractl->_nr_pages - index - 1;
> + ractl->_index += min_nrpages;
> + i = ractl->_index + ractl->_nr_pages - index;
> continue;
> }
> - if (i == nr_to_read - lookahead_size)
> + if (i == mark)
> folio_set_readahead(folio);
> ractl->_workingset |= folio_test_workingset(folio);
> - ractl->_nr_pages++;
> + ractl->_nr_pages += min_nrpages;
> + i += min_nrpages;
> }
>
> /*
> @@ -492,13 +511,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
> {
> struct address_space *mapping = ractl->mapping;
> pgoff_t index = readahead_index(ractl);
> + unsigned int min_order = mapping_min_folio_order(mapping);
> pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> pgoff_t mark = index + ra->size - ra->async_size;
> unsigned int nofs;
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> + unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>
> - if (!mapping_large_folio_support(mapping) || ra->size < 4)
> + /*
> + * Fallback when size < min_nrpages as each folio should be
> + * at least min_nrpages anyway.
> + */
> + if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> goto fallback;
>
> limit = min(limit, index + ra->size - 1);
> @@ -507,11 +532,20 @@ void page_cache_ra_order(struct readahead_control *ractl,
> new_order += 2;
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> + new_order = max(new_order, min_order);
> }
>
> /* See comment in page_cache_ra_unbounded() */
> nofs = memalloc_nofs_save();
> filemap_invalidate_lock_shared(mapping);
> + /*
> + * If the new_order is greater than min_order and index is
> + * already aligned to new_order, then this will be noop as index
> + * aligned to new_order should also be aligned to min_order.
> + */
> + ractl->_index = mapping_align_index(mapping, index);
> + index = readahead_index(ractl);
> +
> while (index <= limit) {
> unsigned int order = new_order;
>
> @@ -519,7 +553,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> if (index & ((1UL << order) - 1))
> order = __ffs(index);
> /* Don't allocate pages past EOF */
> - while (index + (1UL << order) - 1 > limit)
> + while (order > min_order && index + (1UL << order) - 1 > limit)
> order--;
> err = ra_alloc_folio(ractl, index, mark, order, gfp);
> if (err)
> @@ -783,8 +817,15 @@ void readahead_expand(struct readahead_control *ractl,
> struct file_ra_state *ra = ractl->ra;
> pgoff_t new_index, new_nr_pages;
> gfp_t gfp_mask = readahead_gfp_mask(mapping);
> + unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
> + unsigned int min_order = mapping_min_folio_order(mapping);
>
> new_index = new_start / PAGE_SIZE;
> + /*
> + * Readahead code should have aligned the ractl->_index to
> + * min_nrpages before calling readahead aops.
> + */
> + VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
>
> /* Expand the leading edge downwards */
> while (ractl->_index > new_index) {
> @@ -794,9 +835,11 @@ void readahead_expand(struct readahead_control *ractl,
> if (folio && !xa_is_value(folio))
> return; /* Folio apparently present */
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask, min_order);
> if (!folio)
> return;
> +
> + index = mapping_align_index(mapping, index);
> if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> folio_put(folio);
> return;
> @@ -806,7 +849,7 @@ void readahead_expand(struct readahead_control *ractl,
> ractl->_workingset = true;
> psi_memstall_enter(&ractl->_pflags);
> }
> - ractl->_nr_pages++;
> + ractl->_nr_pages += min_nrpages;
> ractl->_index = folio->index;
> }
>
> @@ -821,9 +864,11 @@ void readahead_expand(struct readahead_control *ractl,
> if (folio && !xa_is_value(folio))
> return; /* Folio apparently present */
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask, min_order);
> if (!folio)
> return;
> +
> + index = mapping_align_index(mapping, index);
> if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> folio_put(folio);
> return;
> @@ -833,10 +878,10 @@ void readahead_expand(struct readahead_control *ractl,
> ractl->_workingset = true;
> psi_memstall_enter(&ractl->_pflags);
> }
> - ractl->_nr_pages++;
> + ractl->_nr_pages += min_nrpages;
> if (ra) {
> - ra->size++;
> - ra->async_size++;
> + ra->size += min_nrpages;
> + ra->async_size += min_nrpages;
> }
> }
> }
* Re: [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead
2024-07-04 14:24 ` Ryan Roberts
@ 2024-07-04 14:29 ` Matthew Wilcox
0 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-04 14:29 UTC (permalink / raw)
To: Ryan Roberts
Cc: Pankaj Raghav (Samsung), david, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Thu, Jul 04, 2024 at 03:24:10PM +0100, Ryan Roberts wrote:
> > @@ -240,12 +257,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > * not worth getting one just for that.
> > */
>
> For the case that the folio is already in the xarray, perhaps it's worth
> asserting that the folio is at least min_nrpages?
We'd have to get a reference on the folio to be able to do that safely.
Not worth it.
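For the record, the race being avoided (a sketch, not code from the
patch):

	folio = xa_load(&mapping->i_pages, index + i);
	/* no reference is held here: the folio can be freed, split or
	 * replaced under us, so calling e.g. folio_order(folio) to
	 * assert its size would not be safe */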
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 12:23 ` Ryan Roberts
@ 2024-07-04 15:20 ` Matthew Wilcox
2024-07-04 15:52 ` Ryan Roberts
` (2 more replies)
2024-07-04 21:34 ` Pankaj Raghav (Samsung)
1 sibling, 3 replies; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-04 15:20 UTC (permalink / raw)
To: Ryan Roberts
Cc: Pankaj Raghav (Samsung), david, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Thu, Jul 04, 2024 at 01:23:20PM +0100, Ryan Roberts wrote:
> > - AS_LARGE_FOLIO_SUPPORT = 6,
>
> nit: this removed enum is still referenced in a comment further down the file.
Thanks. Pankaj, let me know if you want me to send you a patch or if
you'll do it directly.
> > + /* Bits 16-25 are used for FOLIO_ORDER */
> > + AS_FOLIO_ORDER_BITS = 5,
> > + AS_FOLIO_ORDER_MIN = 16,
> > + AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
>
> nit: These 3 new enums seem a bit odd.
Yes, this is "too many helpful suggestions" syndrome. It made a lot
more sense originally.
https://lore.kernel.org/linux-fsdevel/ZlUQcEaP3FDXpCge@dread.disaster.area/
> > +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> > + unsigned int min,
> > + unsigned int max)
> > +{
> > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > + return;
> > +
> > + if (min > MAX_PAGECACHE_ORDER)
> > + min = MAX_PAGECACHE_ORDER;
> > + if (max > MAX_PAGECACHE_ORDER)
> > + max = MAX_PAGECACHE_ORDER;
> > + if (max < min)
> > + max = min;
>
> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> whatever values are passed in are a hard requirement? So wouldn't want them to
> be silently reduced. (Especially given the recent change to reduce the size of
> MAX_PAGECACHE_ORDER to less than PMD size in some cases).
Hm, yes. We should probably make this return an errno. Including
returning an errno for !IS_ENABLED() and min > 0.
> > - if (new_order < MAX_PAGECACHE_ORDER) {
> > + if (new_order < mapping_max_folio_order(mapping)) {
> > new_order += 2;
> > - new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> > + new_order = min(mapping_max_folio_order(mapping), new_order);
> > new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>
> I wonder if its possible that ra->size could ever be less than
> mapping_min_folio_order()? Do you need to handle that?
I think we take care of that in later patches? This patch is mostly
about honouring the max properly and putting in the infrastructure for
the min, but not doing all the necessary work for min.
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 15:20 ` Matthew Wilcox
@ 2024-07-04 15:52 ` Ryan Roberts
2024-07-04 21:28 ` Pankaj Raghav (Samsung)
2024-07-04 22:06 ` Dave Chinner
2 siblings, 0 replies; 61+ messages in thread
From: Ryan Roberts @ 2024-07-04 15:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav (Samsung), david, chandan.babu, djwong, brauner,
akpm, linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel,
hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On 04/07/2024 16:20, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 01:23:20PM +0100, Ryan Roberts wrote:
>>> - AS_LARGE_FOLIO_SUPPORT = 6,
>>
>> nit: this removed enum is still referenced in a comment further down the file.
>
> Thanks. Pankaj, let me know if you want me to send you a patch or if
> you'll do it directly.
>
>>> + /* Bits 16-25 are used for FOLIO_ORDER */
>>> + AS_FOLIO_ORDER_BITS = 5,
>>> + AS_FOLIO_ORDER_MIN = 16,
>>> + AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
>>
>> nit: These 3 new enums seem a bit odd.
>
> Yes, this is "too many helpful suggestions" syndrome. It made a lot
> more sense originally.
Well now you can add my "helpful" suggestion to that list :)
>
> https://lore.kernel.org/linux-fsdevel/ZlUQcEaP3FDXpCge@dread.disaster.area/
>
>>> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
>>> + unsigned int min,
>>> + unsigned int max)
>>> +{
>>> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>>> + return;
>>> +
>>> + if (min > MAX_PAGECACHE_ORDER)
>>> + min = MAX_PAGECACHE_ORDER;
>>> + if (max > MAX_PAGECACHE_ORDER)
>>> + max = MAX_PAGECACHE_ORDER;
>>> + if (max < min)
>>> + max = min;
>>
>> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
>> whatever values are passed in are a hard requirement? So wouldn't want them to
>> be silently reduced. (Especially given the recent change to reduce the size of
>> MAX_PAGECACHE_ORDER to less than PMD size in some cases).
>
> Hm, yes. We should probably make this return an errno. Including
> returning an errno for !IS_ENABLED() and min > 0.
Right.
>
>>> - if (new_order < MAX_PAGECACHE_ORDER) {
>>> + if (new_order < mapping_max_folio_order(mapping)) {
>>> new_order += 2;
>>> - new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
>>> + new_order = min(mapping_max_folio_order(mapping), new_order);
>>> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>>
>> I wonder if its possible that ra->size could ever be less than
>> mapping_min_folio_order()? Do you need to handle that?
>
> I think we take care of that in later patches?
Yes I saw that once I got to it. You can ignore this.
> This patch is mostly
> about honouring the max properly and putting in the infrastructure for
> the min, but not doing all the necessary work for min.
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 15:20 ` Matthew Wilcox
2024-07-04 15:52 ` Ryan Roberts
@ 2024-07-04 21:28 ` Pankaj Raghav (Samsung)
2024-07-04 22:06 ` Dave Chinner
2 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-04 21:28 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ryan Roberts, david, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Thu, Jul 04, 2024 at 04:20:13PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 01:23:20PM +0100, Ryan Roberts wrote:
> > > - AS_LARGE_FOLIO_SUPPORT = 6,
> >
> > nit: this removed enum is still referenced in a comment further down the file.
Good catch.
>
> Thanks. Pankaj, let me know if you want me to send you a patch or if
> you'll do it directly.
Yes, I will fold the changes.
>
> > > +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> > > + unsigned int min,
> > > + unsigned int max)
> > > +{
> > > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > + return;
> > > +
> > > + if (min > MAX_PAGECACHE_ORDER)
> > > + min = MAX_PAGECACHE_ORDER;
> > > + if (max > MAX_PAGECACHE_ORDER)
> > > + max = MAX_PAGECACHE_ORDER;
> > > + if (max < min)
> > > + max = min;
> >
> > It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> > whatever values are passed in are a hard requirement? So wouldn't want them to
> > be silently reduced. (Especially given the recent change to reduce the size of
> > MAX_PAGECACHE_ORDER to less than PMD size in some cases).
>
> Hm, yes. We should probably make this return an errno. Including
> returning an errno for !IS_ENABLED() and min > 0.
>
Something like this? (I also need to change xfs_icache.c to
use this return value in the last patch)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 14e1415f7dcf..04916720f807 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -390,28 +390,27 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
* Context: This should not be called while the inode is active as it
* is non-atomic.
*/
-static inline void mapping_set_folio_order_range(struct address_space *mapping,
+static inline int mapping_set_folio_order_range(struct address_space *mapping,
unsigned int min,
unsigned int max)
{
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
- return;
+ return -EINVAL;
- if (min > MAX_PAGECACHE_ORDER)
- min = MAX_PAGECACHE_ORDER;
- if (max > MAX_PAGECACHE_ORDER)
- max = MAX_PAGECACHE_ORDER;
+ if (min > MAX_PAGECACHE_ORDER || max > MAX_PAGECACHE_ORDER)
+ return -EINVAL;
if (max < min)
max = min;
mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
+ return 0;
}
-static inline void mapping_set_folio_min_order(struct address_space *mapping,
+static inline int mapping_set_folio_min_order(struct address_space *mapping,
unsigned int min)
{
- mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+ return mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
}
@@ -428,6 +427,10 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
*/
static inline void mapping_set_large_folios(struct address_space *mapping)
{
+ /*
+ * The return value can be safely ignored because this range
+ * will always be supported by the page cache.
+ */
mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
}
--
Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 12:23 ` Ryan Roberts
2024-07-04 15:20 ` Matthew Wilcox
@ 2024-07-04 21:34 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-04 21:34 UTC (permalink / raw)
To: Ryan Roberts
Cc: david, willy, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Thu, Jul 04, 2024 at 01:23:20PM +0100, Ryan Roberts wrote:
> Hi,
>
> Here are some drive-by review comments as I'm evaluating whether these patches
> can help me with what I'm trying to do at
> https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/...
Just FYI, I posted the 9th version[0] today before these comments landed,
so it does not include your proposed changes.
It would also be better if you could put your comments on the latest
version, just to avoid fragmentation. :)
[0]https://lore.kernel.org/linux-fsdevel/20240704112320.82104-1-kernel@pankajraghav.com/
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 15:20 ` Matthew Wilcox
2024-07-04 15:52 ` Ryan Roberts
2024-07-04 21:28 ` Pankaj Raghav (Samsung)
@ 2024-07-04 22:06 ` Dave Chinner
2024-07-04 23:56 ` Matthew Wilcox
2 siblings, 1 reply; 61+ messages in thread
From: Dave Chinner @ 2024-07-04 22:06 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ryan Roberts, Pankaj Raghav (Samsung), chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
hch, Zi Yan
On Thu, Jul 04, 2024 at 04:20:13PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 04, 2024 at 01:23:20PM +0100, Ryan Roberts wrote:
> > > - AS_LARGE_FOLIO_SUPPORT = 6,
> >
> > nit: this removed enum is still referenced in a comment further down the file.
>
> Thanks. Pankaj, let me know if you want me to send you a patch or if
> you'll do it directly.
>
> > > + /* Bits 16-25 are used for FOLIO_ORDER */
> > > + AS_FOLIO_ORDER_BITS = 5,
> > > + AS_FOLIO_ORDER_MIN = 16,
> > > + AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
> >
> > nit: These 3 new enums seem a bit odd.
>
> Yes, this is "too many helpful suggestions" syndrome. It made a lot
> more sense originally.
>
> https://lore.kernel.org/linux-fsdevel/ZlUQcEaP3FDXpCge@dread.disaster.area/
>
> > > +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> > > + unsigned int min,
> > > + unsigned int max)
> > > +{
> > > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > + return;
> > > +
> > > + if (min > MAX_PAGECACHE_ORDER)
> > > + min = MAX_PAGECACHE_ORDER;
> > > + if (max > MAX_PAGECACHE_ORDER)
> > > + max = MAX_PAGECACHE_ORDER;
> > > + if (max < min)
> > > + max = min;
> >
> > It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> > whatever values are passed in are a hard requirement? So wouldn't want them to
> > be silently reduced. (Especially given the recent change to reduce the size of
> > MAX_PAGECACHE_ORDER to less than PMD size in some cases).
>
> Hm, yes. We should probably make this return an errno. Including
> returning an errno for !IS_ENABLED() and min > 0.
What are callers supposed to do with an error? In the case of
setting up a newly allocated inode in XFS, the error would be
returned in the middle of a transaction and so this failure would
result in a filesystem shutdown.
Regardless, the filesystem should never be passing min >
MAX_PAGECACHE_ORDER any time soon for bs > ps configurations. Block
sizes go up to 64kB, which is a lot smaller than what
MAX_PAGECACHE_ORDER allows. IOWs, seeing min > MAX_PAGECACHE_ORDER is
indicative of a severe bug; it should be considered a fatal developer
mistake and the kernel terminated immediately. Such mistakes should
-never, ever- happen on production systems. IOWs, this is a
situation where we should assert or bug and kill the kernel
immediately, or at minimum warn-on-once() and truncate the value and
hope things don't get immediately worse.
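A sketch of that assert-style hard fail (hypothetical, not the posted
patch):

static inline void mapping_set_folio_order_range(struct address_space *mapping,
						 unsigned int min,
						 unsigned int max)
{
	VM_BUG_ON(min > MAX_PAGECACHE_ORDER);
	VM_BUG_ON(max > MAX_PAGECACHE_ORDER);
	/* ... pack min/max into mapping->flags as before ... */
}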
If we kill the kernel because min is out of range, the system will
fail on the first inode instantiation on that filesystem.
Filesystem developers should notice that sort of failure pretty
quickly and realise they've done something that isn't currently
supported...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 22:06 ` Dave Chinner
@ 2024-07-04 23:56 ` Matthew Wilcox
2024-07-05 4:32 ` Dave Chinner
0 siblings, 1 reply; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-04 23:56 UTC (permalink / raw)
To: Dave Chinner
Cc: Ryan Roberts, Pankaj Raghav (Samsung), chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
hch, Zi Yan
On Fri, Jul 05, 2024 at 08:06:51AM +1000, Dave Chinner wrote:
> > > It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> > > whatever values are passed in are a hard requirement? So wouldn't want them to
> > > be silently reduced. (Especially given the recent change to reduce the size of
> > > MAX_PAGECACHE_ORDER to less than PMD size in some cases).
> >
> > Hm, yes. We should probably make this return an errno. Including
> > returning an errno for !IS_ENABLED() and min > 0.
>
> What are callers supposed to do with an error? In the case of
> setting up a newly allocated inode in XFS, the error would be
> returned in the middle of a transaction and so this failure would
> result in a filesystem shutdown.
I suggest you handle it better than this. If the device is asking for a
blocksize > PMD_SIZE, you should fail to mount it. If the device is
asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
not set, you should also decline to mount the filesystem.
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-04 23:56 ` Matthew Wilcox
@ 2024-07-05 4:32 ` Dave Chinner
2024-07-05 9:03 ` Ryan Roberts
` (2 more replies)
0 siblings, 3 replies; 61+ messages in thread
From: Dave Chinner @ 2024-07-05 4:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ryan Roberts, Pankaj Raghav (Samsung), chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
hch, Zi Yan
On Fri, Jul 05, 2024 at 12:56:28AM +0100, Matthew Wilcox wrote:
> On Fri, Jul 05, 2024 at 08:06:51AM +1000, Dave Chinner wrote:
> > > > It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> > > > whatever values are passed in are a hard requirement? So wouldn't want them to
> > > > be silently reduced. (Especially given the recent change to reduce the size of
> > > > MAX_PAGECACHE_ORDER to less than PMD size in some cases).
> > >
> > > Hm, yes. We should probably make this return an errno. Including
> > > returning an errno for !IS_ENABLED() and min > 0.
> >
> > What are callers supposed to do with an error? In the case of
> > setting up a newly allocated inode in XFS, the error would be
> > returned in the middle of a transaction and so this failure would
> > result in a filesystem shutdown.
>
> I suggest you handle it better than this. If the device is asking for a
> blocksize > PMD_SIZE, you should fail to mount it.
That's my point: we already do that.
The largest block size we support is 64kB and that's way smaller
than PMD_SIZE on all platforms and we always check for bs > ps
support at mount time when the filesystem bs > ps.
Hence we're never going to set the min value to anything unsupported
unless someone makes a massive programming mistake. At which point,
we want a *hard, immediate fail* so the developer notices their
mistake immediately. All filesystems and block devices need to
behave this way so the limits should be encoded as asserts in the
function to trigger such behaviour.
> If the device is
> asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
> not set, you should also decline to mount the filesystem.
What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
being able to use large folios?
If that's an actual dependency of using large folios, then we're at
the point where the mm side of large folios needs to be divorced
from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to selected by the
block layer and also every filesystem that wants to support
sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
support needs to *always* be enabled on systems that say
CONFIG_BLOCK=y.
I'd much prefer the former occurs, because making the block layer
and filesystems dependent on an mm feature they don't actually use
is kinda weird...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 4:32 ` Dave Chinner
@ 2024-07-05 9:03 ` Ryan Roberts
2024-07-05 12:45 ` Pankaj Raghav (Samsung)
2024-07-05 13:24 ` Pankaj Raghav (Samsung)
2024-07-05 15:14 ` Matthew Wilcox
2 siblings, 1 reply; 61+ messages in thread
From: Ryan Roberts @ 2024-07-05 9:03 UTC (permalink / raw)
To: Dave Chinner, Matthew Wilcox
Cc: Pankaj Raghav (Samsung), chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On 05/07/2024 05:32, Dave Chinner wrote:
> On Fri, Jul 05, 2024 at 12:56:28AM +0100, Matthew Wilcox wrote:
>> On Fri, Jul 05, 2024 at 08:06:51AM +1000, Dave Chinner wrote:
>>>>> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
>>>>> whatever values are passed in are a hard requirement? So wouldn't want them to
>>>>> be silently reduced. (Especially given the recent change to reduce the size of
>>>>> MAX_PAGECACHE_ORDER to less than PMD size in some cases).
>>>>
>>>> Hm, yes. We should probably make this return an errno. Including
>>>> returning an errno for !IS_ENABLED() and min > 0.
>>>
>>> What are callers supposed to do with an error? In the case of
>>> setting up a newly allocated inode in XFS, the error would be
>>> returned in the middle of a transaction and so this failure would
>>> result in a filesystem shutdown.
>>
>> I suggest you handle it better than this. If the device is asking for a
>> blocksize > PMD_SIZE, you should fail to mount it.
A detail, but MAX_PAGECACHE_ORDER may be smaller than the PMD order even on
systems with CONFIG_TRANSPARENT_HUGEPAGE, as of a fix that is currently in
mm-unstable:
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define PREFERRED_MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER
#else
#define PREFERRED_MAX_PAGECACHE_ORDER 8
#endif
/*
* xas_split_alloc() does not support arbitrary orders. This implies no
* 512MB THP on ARM64 with 64KB base page size.
*/
#define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
#define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
But that also implies that the page cache can handle up to order-8 without
CONFIG_TRANSPARENT_HUGEPAGE, so it sounds like there isn't a dependency on
CONFIG_TRANSPARENT_HUGEPAGE in this respect?
>
> That's my point: we already do that.
>
> The largest block size we support is 64kB and that's way smaller
> than PMD_SIZE on all platforms and we always check for bs > ps
> support at mount time when the filesystem bs > ps.
>
> Hence we're never going to set the min value to anything unsupported
> unless someone makes a massive programming mistake. At which point,
> we want a *hard, immediate fail* so the developer notices their
> mistake immediately. All filesystems and block devices need to
> behave this way so the limits should be encoded as asserts in the
> function to trigger such behaviour.
>
>> If the device is
>> asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
>> not set, you should also decline to mount the filesystem.
>
> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
> being able to use large folios?
>
> If that's an actual dependency of using large folios, then we're at
> the point where the mm side of large folios needs to be divorced
> from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
> Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to selected by the
> block layer and also every filesystem that wants to support
> sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
> support needs to *always* be enabled on systems that say
> CONFIG_BLOCK=y.
>
> I'd much prefer the former occurs, because making the block layer
> and filesystems dependent on an mm feature they don't actually use
> is kinda weird...
>
> -Dave.
>
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 9:03 ` Ryan Roberts
@ 2024-07-05 12:45 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-05 12:45 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dave Chinner, Matthew Wilcox, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On Fri, Jul 05, 2024 at 10:03:31AM +0100, Ryan Roberts wrote:
> On 05/07/2024 05:32, Dave Chinner wrote:
> > On Fri, Jul 05, 2024 at 12:56:28AM +0100, Matthew Wilcox wrote:
> >> On Fri, Jul 05, 2024 at 08:06:51AM +1000, Dave Chinner wrote:
> >>>>> It seems strange to silently clamp these? Presumably for the bs>ps usecase,
> >>>>> whatever values are passed in are a hard requirement? So wouldn't want them to
> >>>>> be silently reduced. (Especially given the recent change to reduce the size of
> >>>>> MAX_PAGECACHE_ORDER to less than PMD size in some cases).
> >>>>
> >>>> Hm, yes. We should probably make this return an errno. Including
> >>>> returning an errno for !IS_ENABLED() and min > 0.
> >>>
> >>> What are callers supposed to do with an error? In the case of
> >>> setting up a newly allocated inode in XFS, the error would be
> >>> returned in the middle of a transaction and so this failure would
> >>> result in a filesystem shutdown.
> >>
> >> I suggest you handle it better than this. If the device is asking for a
> >> blocksize > PMD_SIZE, you should fail to mount it.
>
> A detail, but MAX_PAGECACHE_ORDER may be smaller than PMD_SIZE even on systems
> with CONFIG_TRANSPARENT_HUGEPAGE as of a fix that is currently in mm-unstable:
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define PREFERRED_MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER
> #else
> #define PREFERRED_MAX_PAGECACHE_ORDER 8
> #endif
>
> /*
> * xas_split_alloc() does not support arbitrary orders. This implies no
> * 512MB THP on ARM64 with 64KB base page size.
> */
> #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
> #define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER,
> PREFERRED_MAX_PAGECACHE_ORDER)
>
> But that also implies that the page cache can handle up to order-8 without
> CONFIG_TRANSPARENT_HUGEPAGE so sounds like there isn't a dependcy on
> CONFIG_TRANSPARENT_HUGEPAGE in this respect?
>
I remember seeing willy's comment about the dependency on CONFIG_TRANSPARENT_HUGEPAGE
for large folios[0]:

    Some parts of the VM still depend on THP to handle large folios
    correctly. Until those are fixed, prevent creating large folios
    if THP are disabled.

This was in Jan 2022. I don't know the status of it now, but we enable
large folios for filesystems only when THP is enabled (as you can see in
the helper mapping_set_folio_order_range()).
[0] https://lore.kernel.org/lkml/20220116121822.1727633-8-willy@infradead.org/
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 4:32 ` Dave Chinner
2024-07-05 9:03 ` Ryan Roberts
@ 2024-07-05 13:24 ` Pankaj Raghav (Samsung)
2024-07-05 13:31 ` Ryan Roberts
2024-07-05 15:14 ` Matthew Wilcox
2 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-05 13:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Matthew Wilcox, Ryan Roberts, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
> > I suggest you handle it better than this. If the device is asking for a
> > blocksize > PMD_SIZE, you should fail to mount it.
>
> That's my point: we already do that.
>
> The largest block size we support is 64kB and that's way smaller
> than PMD_SIZE on all platforms and we always check for bs > ps
> support at mount time when the filesystem bs > ps.
>
> Hence we're never going to set the min value to anything unsupported
> unless someone makes a massive programming mistake. At which point,
> we want a *hard, immediate fail* so the developer notices their
> mistake immediately. All filesystems and block devices need to
> behave this way so the limits should be encoded as asserts in the
> function to trigger such behaviour.
I agree, this kind of bug will be encountered only during development
and not in actual production, due to the limit we have on the fs block
size in XFS.
>
> > If the device is
> > asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
> > not set, you should also decline to mount the filesystem.
>
> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
> being able to use large folios?
>
> If that's an actual dependency of using large folios, then we're at
> the point where the mm side of large folios needs to be divorced
> from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
> Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to selected by the
> block layer and also every filesystem that wants to support
> sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
> support needs to *always* be enabled on systems that say
> CONFIG_BLOCK=y.
Why CONFIG_BLOCK? I think it is enough if it comes from the FS side,
right? And for now, the only FS that needs that sort of bs > ps
guarantee is XFS with this series. Other filesystems such as bcachefs
that call mapping_set_large_folios() only enable it as an optimization
and it is not needed for the filesystem to function.
So this is my conclusion from the conversation:
- Add a dependency in Kconfig on THP for XFS until we fix the dependency
of large folios on THP
- Add a BUILD_BUG_ON(XFS_MAX_BLOCKSIZE > MAX_PAGECACHE_ORDER)
- Add a WARN_ON_ONCE() and clamp the min and max value in
mapping_set_folio_order_range() ?
Let me know what you all think @willy, @dave and @ryan.
--
Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 13:24 ` Pankaj Raghav (Samsung)
@ 2024-07-05 13:31 ` Ryan Roberts
2024-07-05 14:14 ` Pankaj Raghav (Samsung)
2024-07-08 23:01 ` Dave Chinner
0 siblings, 2 replies; 61+ messages in thread
From: Ryan Roberts @ 2024-07-05 13:31 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), Dave Chinner
Cc: Matthew Wilcox, chandan.babu, djwong, brauner, akpm, linux-kernel,
yang, linux-mm, john.g.garry, linux-fsdevel, hare, p.raghav,
mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
On 05/07/2024 14:24, Pankaj Raghav (Samsung) wrote:
>>> I suggest you handle it better than this. If the device is asking for a
>>> blocksize > PMD_SIZE, you should fail to mount it.
>>
>> That's my point: we already do that.
>>
>> The largest block size we support is 64kB and that's way smaller
>> than PMD_SIZE on all platforms and we always check for bs > ps
>> support at mount time when the filesystem bs > ps.
>>
>> Hence we're never going to set the min value to anything unsupported
>> unless someone makes a massive programming mistake. At which point,
>> we want a *hard, immediate fail* so the developer notices their
>> mistake immediately. All filesystems and block devices need to
>> behave this way so the limits should be encoded as asserts in the
>> function to trigger such behaviour.
>
> I agree, this kind of bug will be encountered only during development
> and not in actual production, due to the limit we have on the fs block size
> in XFS.
>
>>
>>> If the device is
>>> asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
>>> not set, you should also decline to mount the filesystem.
>>
>> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
>> being able to use large folios?
>>
>> If that's an actual dependency of using large folios, then we're at
>> the point where the mm side of large folios needs to be divorced
>> from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
>> Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to be selected by the
>> block layer and also every filesystem that wants to support
>> sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
>> support needs to *always* be enabled on systems that say
>> CONFIG_BLOCK=y.
>
> Why CONFIG_BLOCK? I think it is enough if it comes from the FS side
> right? And for now, the only FS that needs that sort of bs > ps
> guarantee is XFS with this series. Other filesystems such as bcachefs
> that call mapping_set_large_folios() only enable it as an optimization
> and it is not needed for the filesystem to function.
>
> So this is my conclusion from the conversation:
> - Add a dependency in Kconfig on THP for XFS until we fix the dependency
> of large folios on THP
THP isn't supported on some arches, so isn't this effectively saying XFS can no
longer be used with those arches, even if the bs <= ps? I think while pagecache
large folios depend on THP, you need to make this a mount-time check in the FS?
But ideally, MAX_PAGECACHE_ORDER would be set to 0 for
!CONFIG_TRANSPARENT_HUGEPAGE so you can just check against that and don't have
to worry about THP availability directly.
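i.e. something like this for the existing define (a sketch, reusing the
PREFERRED_MAX_PAGECACHE_ORDER definition quoted earlier in the thread):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
#else
#define PREFERRED_MAX_PAGECACHE_ORDER	0	/* order-0 only: no large folios without THP */
#endif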
Willy: why is MAX_PAGECACHE_ORDER currently set to 8 when THP is disabled?
> - Add a BUILD_BUG_ON(XFS_MAX_BLOCKSIZE > MAX_PAGECACHE_ORDER)
> - Add a WARN_ON_ONCE() and clamp the min and max value in
> mapping_set_folio_order_range() ?
>
> Let me know what you all think @willy, @dave and @ryan.
>
> --
> Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 13:31 ` Ryan Roberts
@ 2024-07-05 14:14 ` Pankaj Raghav (Samsung)
2024-07-08 23:01 ` Dave Chinner
1 sibling, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-05 14:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dave Chinner, Matthew Wilcox, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
> >>
> >>> If the device is
> >>> asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
> >>> not set, you should also decline to mount the filesystem.
> >>
> >> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
> >> being able to use large folios?
> >>
> >> If that's an actual dependency of using large folios, then we're at
> >> the point where the mm side of large folios needs to be divorced
> >> from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
> >> Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to be selected by the
> >> block layer and also every filesystem that wants to support
> >> sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
> >> support needs to *always* be enabled on systems that say
> >> CONFIG_BLOCK=y.
> >
> > Why CONFIG_BLOCK? I think it is enough if it comes from the FS side
> > right? And for now, the only FS that needs that sort of bs > ps
> > guarantee is XFS with this series. Other filesystems such as bcachefs
> > that call mapping_set_large_folios() only enable it as an optimization
> > and it is not needed for the filesystem to function.
> >
> > So this is my conclusion from the conversation:
> > - Add a dependency in Kconfig on THP for XFS until we fix the dependency
> > of large folios on THP
>
> THP isn't supported on some arches, so isn't this effectively saying XFS can no
> longer be used with those arches, even if the bs <= ps? I think while pagecache
> large folios depend on THP, you need to make this a mount-time check in the FS?
>
> But ideally, MAX_PAGECACHE_ORDER would be set to 0 for
> !CONFIG_TRANSPARENT_HUGEPAGE so you can just check against that and don't have
> to worry about THP availability directly.
Yes, that would be better. We should have a way to probe it during mount
time without requiring any address_space mapping. We could have a helper
something like the following:

static inline unsigned int mapping_max_folio_order_supported()
{
	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return 0;
	return MAX_PAGECACHE_ORDER;
}
This could be used by the FS to verify the requested block size at mount time.
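e.g. roughly like this (a sketch; the exact placement in the mount path is up
to the FS):

	if (mp->m_sb.sb_blocklog - PAGE_SHIFT >
	    mapping_max_folio_order_supported())
		return -ENOSYS;	/* bs > ps needs a folio order we cannot provide */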
>
> Willy: why is MAX_PAGECACHE_ORDER currently set to 8 when THP is disabled?
>
This appeared in this patch with the following comment:
https://lore.kernel.org/linux-fsdevel/20230710130253.3484695-8-willy@infradead.org/
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER. Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size. I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
> > - Add a BUILD_BUG_ON(XFS_MAX_BLOCKSIZE > MAX_PAGECACHE_ORDER)
> > - Add a WARN_ON_ONCE() and clamp the min and max value in
> > mapping_set_folio_order_range() ?
> >
> > Let me know what you all think @willy, @dave and @ryan.
> >
> > --
> > Pankaj
>
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 4:32 ` Dave Chinner
2024-07-05 9:03 ` Ryan Roberts
2024-07-05 13:24 ` Pankaj Raghav (Samsung)
@ 2024-07-05 15:14 ` Matthew Wilcox
2 siblings, 0 replies; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-05 15:14 UTC (permalink / raw)
To: Dave Chinner
Cc: Ryan Roberts, Pankaj Raghav (Samsung), chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
hch, Zi Yan
On Fri, Jul 05, 2024 at 02:32:37PM +1000, Dave Chinner wrote:
> > If the device is
> > asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
> > not set, you should also decline to mount the filesystem.
>
> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
> being able to use large folios?
You can't have large folios without CONFIG_TRANSPARENT_HUGEPAGE.
There's a bunch of infrastructure that's gated behind that. I used to say
"somebody should fix that up", but honestly the amount of work I'm willing
to do to support hobbyist architectures has decreased dramatically.
Any useful system has THP support.
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-05 13:31 ` Ryan Roberts
2024-07-05 14:14 ` Pankaj Raghav (Samsung)
@ 2024-07-08 23:01 ` Dave Chinner
2024-07-09 8:11 ` Ryan Roberts
2024-07-09 13:08 ` Pankaj Raghav (Samsung)
1 sibling, 2 replies; 61+ messages in thread
From: Dave Chinner @ 2024-07-08 23:01 UTC (permalink / raw)
To: Ryan Roberts
Cc: Pankaj Raghav (Samsung), Matthew Wilcox, chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
hch, Zi Yan
On Fri, Jul 05, 2024 at 02:31:08PM +0100, Ryan Roberts wrote:
> On 05/07/2024 14:24, Pankaj Raghav (Samsung) wrote:
> >>> I suggest you handle it better than this. If the device is asking for a
> >>> blocksize > PMD_SIZE, you should fail to mount it.
> >>
> >> That's my point: we already do that.
> >>
> >> The largest block size we support is 64kB and that's way smaller
> >> than PMD_SIZE on all platforms and we always check for bs > ps
> >> support at mount time when the filesystem bs > ps.
> >>
> >> Hence we're never going to set the min value to anything unsupported
> >> unless someone makes a massive programming mistake. At which point,
> >> we want a *hard, immediate fail* so the developer notices their
> >> mistake immediately. All filesystems and block devices need to
> >> behave this way so the limits should be encoded as asserts in the
> >> function to trigger such behaviour.
> >
> > I agree, this kind of bug will be encountered only during development
> > and not in actual production, due to the limit we have on the fs block size
> > in XFS.
> >
> >>
> >>> If the device is
> >>> asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
> >>> not set, you should also decline to mount the filesystem.
> >>
> >> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
> >> being able to use large folios?
> >>
> >> If that's an actual dependency of using large folios, then we're at
> >> the point where the mm side of large folios needs to be divorced
> >> from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
> >> Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to be selected by the
> >> block layer and also every filesystem that wants to support
> >> sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
> >> support needs to *always* be enabled on systems that say
> >> CONFIG_BLOCK=y.
> >
> > Why CONFIG_BLOCK? I think it is enough if it comes from the FS side
> > right? And for now, the only FS that needs that sort of bs > ps
> > guarantee is XFS with this series. Other filesystems such as bcachefs
> > that call mapping_set_large_folios() only enable it as an optimization
> > and it is not needed for the filesystem to function.
> >
> > So this is my conclusion from the conversation:
> > - Add a dependency in Kconfig on THP for XFS until we fix the dependency
> > of large folios on THP
>
> THP isn't supported on some arches, so isn't this effectively saying XFS can no
> longer be used with those arches, even if the bs <= ps?
I'm good with that - we're already long past the point where we try
to support XFS on every Linux platform. Indeed, we've recently been
musing about making XFS depend on 64 bit only - 32 bit systems don't
have the memory capacity to run the full xfs tool chain (e.g.
xfs_repair) on filesystems over about a TB in size, and they are
greatly limited in kernel memory and vmap areas, both of which XFS
makes heavy use of. Basically, friends don't let friends use XFS on
32 bit systems, and that's been true for about 20 years now.
Our problem is the test matrix - if we now have to explicitly test
XFS both with and without large folios enabled to support these
platforms, we've just doubled our test matrix. The test matrix is
already far too large to robustly cover, so anything that requires
doubling the number of kernel configs we have to test is, IMO, a
non-starter.
That's why we really don't support XFS on 32 bit systems anymore and
why we're talking about making that official with a config option.
If we're at the point where XFS will now depend on large folios (i.e.
THP), then we need to seriously consider reducing the supported
arches to just those that support both 64 bit and THP. If niche
arches want to support THP, or enable large folios without the need
for THP, then they can do that work and then they get XFS for
free.
Just because an arch might run a Linux kernel, it doesn't mean we
have to support XFS on it....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-08 23:01 ` Dave Chinner
@ 2024-07-09 8:11 ` Ryan Roberts
2024-07-09 13:08 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 61+ messages in thread
From: Ryan Roberts @ 2024-07-09 8:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Pankaj Raghav (Samsung), Matthew Wilcox, chandan.babu, djwong,
brauner, akpm, linux-kernel, yang, linux-mm, john.g.garry,
linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl, linux-xfs,
hch, Zi Yan
On 09/07/2024 00:01, Dave Chinner wrote:
> On Fri, Jul 05, 2024 at 02:31:08PM +0100, Ryan Roberts wrote:
>> On 05/07/2024 14:24, Pankaj Raghav (Samsung) wrote:
>>>>> I suggest you handle it better than this. If the device is asking for a
>>>>> blocksize > PMD_SIZE, you should fail to mount it.
>>>>
>>>> That's my point: we already do that.
>>>>
>>>> The largest block size we support is 64kB and that's way smaller
>>>> than PMD_SIZE on all platforms and we always check for bs > ps
>>>> support at mount time when the filesystem bs > ps.
>>>>
>>>> Hence we're never going to set the min value to anything unsupported
>>>> unless someone makes a massive programming mistake. At which point,
>>>> we want a *hard, immediate fail* so the developer notices their
>>>> mistake immediately. All filesystems and block devices need to
>>>> behave this way so the limits should be encoded as asserts in the
>>>> function to trigger such behaviour.
>>>
>>> I agree, this kind of bug will be encountered only during development
>>> and not in actual production, due to the limit we have on the fs block size
>>> in XFS.
>>>
>>>>
>>>>> If the device is
>>>>> asking for a blocksize > PAGE_SIZE and CONFIG_TRANSPARENT_HUGEPAGE is
>>>>> not set, you should also decline to mount the filesystem.
>>>>
>>>> What does CONFIG_TRANSPARENT_HUGEPAGE have to do with filesystems
>>>> being able to use large folios?
>>>>
>>>> If that's an actual dependency of using large folios, then we're at
>>>> the point where the mm side of large folios needs to be divorced
>>>> from CONFIG_TRANSPARENT_HUGEPAGE and always supported.
>>>> Alternatively, CONFIG_TRANSPARENT_HUGEPAGE needs to be selected by the
>>>> block layer and also every filesystem that wants to support
>>>> sector/blocks sizes larger than PAGE_SIZE. IOWs, large folio
>>>> support needs to *always* be enabled on systems that say
>>>> CONFIG_BLOCK=y.
>>>
>>> Why CONFIG_BLOCK? I think it is enough if it comes from the FS side
>>> right? And for now, the only FS that needs that sort of bs > ps
>>> guarantee is XFS with this series. Other filesystems such as bcachefs
>>> that call mapping_set_large_folios() only enable it as an optimization
>>> and it is not needed for the filesystem to function.
>>>
>>> So this is my conclusion from the conversation:
>>> - Add a dependency in Kconfig on THP for XFS until we fix the dependency
>>> of large folios on THP
>>
>> THP isn't supported on some arches, so isn't this effectively saying XFS can no
>> longer be used with those arches, even if the bs <= ps?
>
> I'm good with that - we're already long past the point where we try
> to support XFS on every Linux platform. Indeed, we've recently been
> musing about making XFS depend on 64 bit only - 32 bit systems don't
> have the memory capacity to run the full xfs tool chain (e.g.
> xfs_repair) on filesystems over about a TB in size, and they are
> greatly limited in kernel memory and vmap areas, both of which XFS
> makes heavy use of. Basically, friends don't let friends use XFS on
> 32 bit systems, and that's been true for about 20 years now.
>
> Our problem is the test matrix - if we now have to explicitly test
> XFS both with and without large folios enabled to support these
> platforms, we've just doubled our test matrix. The test matrix is
> already far too large to robustly cover, so anything that requires
> doubling the number of kernel configs we have to test is, IMO, a
> non-starter.
>
> That's why we really don't support XFS on 32 bit systems anymore and
> why we're talking about making that official with a config option.
> If we're at the point where XFS will now depend on large folios (i.e.
> THP), then we need to seriously consider reducing the supported
> arches to just those that support both 64 bit and THP. If niche
> arches want to support THP, or enable large folios without the need
> for THP, then they can do that work and then they get XFS for
> free.
>
> Just because an arch might run a Linux kernel, it doesn't mean we
> have to support XFS on it....
OK. I was just pointing out the impact of adding this Kconfig dependency. If
that impact is explicitly considered and desired, then great. I'll leave you to it.
Thanks,
Ryan
>
> -Dave.
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-08 23:01 ` Dave Chinner
2024-07-09 8:11 ` Ryan Roberts
@ 2024-07-09 13:08 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-09 13:08 UTC (permalink / raw)
To: Dave Chinner
Cc: Ryan Roberts, Matthew Wilcox, chandan.babu, djwong, brauner, akpm,
linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan
> > >
> > > Why CONFIG_BLOCK? I think it is enough if it comes from the FS side
> > > right? And for now, the only FS that needs that sort of bs > ps
> > > guarantee is XFS with this series. Other filesystems such as bcachefs
> > > that call mapping_set_large_folios() only enable it as an optimization
> > > and it is not needed for the filesystem to function.
> > >
> > > So this is my conclusion from the conversation:
> > > - Add a dependency in Kconfig on THP for XFS until we fix the dependency
> > > of large folios on THP
> >
> > THP isn't supported on some arches, so isn't this effectively saying XFS can no
> > longer be used with those arches, even if the bs <= ps?
>
> I'm good with that - we're already long past the point where we try
From my cursory review, I can see that the following arches support THP
(* indicates some dependency on another Kconfig parameter):
arc*, arm*, arm64, loongarch, mips*, powerpc*, riscv*, s390, sparc, x86.
and the following do not have THP support:
alpha, csky, hexagon, m68k, microblaze, nios2, openrisc, parisc, sh, um,
xtensa.
Looks like the arches that do not have THP support are either old or embedded
processors that mainly target 32-bit.
So are we OK with this?
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index d41edd30388b7..be2c1c0e9fe8b 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -5,6 +5,7 @@ config XFS_FS
 	select EXPORTFS
 	select LIBCRC32C
 	select FS_IOMAP
+	select TRANSPARENT_HUGEPAGE
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform. It is completely multi-threaded, can
> to support XFS on every Linux platform. Indeed, we've recently been
> musing about making XFS depend on 64 bit only - 32 bit systems don't
> have the memory capacity to run the full xfs tool chain (e.g.
> xfs_repair) on filesystems over about a TB in size, and they are
> greatly limited in kernel memory and vmap areas, both of which XFS
> makes heavy use of. Basically, friends don't let friends use XFS on
> 32 bit systems, and that's been true for about 20 years now.
>
> Our problem is the test matrix - if we now have to explicitly test
> XFS both with and without large folios enabled to support these
> platforms, we've just doubled our test matrix. The test matrix is
> already far too large to robustly cover, so anything that requires
> doubling the number of kernel configs we have to test is, IMO, a
> non-starter.
>
> That's why we really don't support XFS on 32 bit systems anymore and
> why we're talking about making that official with a config option.
> If we're at the point where XFS will now depend on large folios (i.e.
> THP), then we need to seriously consider reducing the supported
> arches to just those that support both 64 bit and THP. If niche
> arches want to support THP, or enable large folios without the need
> for THP, then they can do that work and then they get XFS for
> free.
>
> Just because an arch might run a Linux kernel, it doesn't mean we
> have to support XFS on it....
The other option is that we can expose a simple helper from the page cache as
follows:

static inline unsigned int mapping_max_folio_order_supported()
{
	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return 0;
	return MAX_PAGECACHE_ORDER;
}
This could be used to know the maximum order supported at mount time and to
deny the mount for LBS configs if the maximum folio order supported is
smaller than what the requested block size needs.
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-06-25 11:44 ` [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-07-04 12:23 ` Ryan Roberts
@ 2024-07-09 16:29 ` Pankaj Raghav (Samsung)
2024-07-09 16:38 ` Matthew Wilcox
2024-07-09 16:50 ` Darrick J. Wong
1 sibling, 2 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-09 16:29 UTC (permalink / raw)
To: david, willy, ryan.roberts, djwong
Cc: linux-kernel, yang, linux-mm, john.g.garry, linux-fsdevel, hare,
p.raghav, mcgrof, gost.dev, cl, linux-xfs, hch, Zi Yan, akpm,
chandan.babu
For now, this is the only patch blocking the next version.
Based on the discussion, does the following sound logical, @ryan, @dave and
@willy?
- We give an explicit VM_WARN_ONCE if we try to set the folio order range
while THP is disabled, or when min or max is greater than MAX_PAGECACHE_ORDER.
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 14e1415f7dcf4..313c9fad61859 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -394,13 +394,24 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
 						 unsigned int min,
 						 unsigned int max)
 {
-	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		VM_WARN_ONCE(1,
+			"THP needs to be enabled to support mapping folio order range");
 		return;
+	}
 
-	if (min > MAX_PAGECACHE_ORDER)
+	if (min > MAX_PAGECACHE_ORDER) {
+		VM_WARN_ONCE(1,
+			"min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
 		min = MAX_PAGECACHE_ORDER;
-	if (max > MAX_PAGECACHE_ORDER)
+	}
+
+	if (max > MAX_PAGECACHE_ORDER) {
+		VM_WARN_ONCE(1,
+			"max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
 		max = MAX_PAGECACHE_ORDER;
+	}
+
 	if (max < min)
 		max = min;
- We make THP an explicit dependency for XFS:
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index d41edd30388b7..be2c1c0e9fe8b 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -5,6 +5,7 @@ config XFS_FS
 	select EXPORTFS
 	select LIBCRC32C
 	select FS_IOMAP
+	select TRANSPARENT_HUGEPAGE
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform. It is completely multi-threaded, can
OR
We create a helper in the page cache that FSs can use to check whether a
specific order can be supported at mount time:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 14e1415f7dcf..9be775ef11a5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -374,6 +374,14 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
#define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
#define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
+
+static inline unsigned int mapping_max_folio_order_supported()
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+      return 0;
+	return MAX_PAGECACHE_ORDER;
+}
+
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b8a93a8f35cac..e2be8743c2c20 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1647,6 +1647,15 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
+	if (mp->m_sb.sb_blocklog - PAGE_SHIFT >
+	    mapping_max_folio_order_supported()) {
+		xfs_warn(mp,
+"Block Size (%d bytes) is not supported. Check MAX_PAGECACHE_ORDER",
+			mp->m_sb.sb_blocksize);
+		error = -ENOSYS;
+		goto out_free_sb;
+	}
+
 	xfs_warn(mp,
 "EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
 		mp->m_sb.sb_blocksize);
--
Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-09 16:29 ` Pankaj Raghav (Samsung)
@ 2024-07-09 16:38 ` Matthew Wilcox
2024-07-09 17:33 ` Pankaj Raghav (Samsung)
2024-07-09 16:50 ` Darrick J. Wong
1 sibling, 1 reply; 61+ messages in thread
From: Matthew Wilcox @ 2024-07-09 16:38 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, ryan.roberts, djwong, linux-kernel, yang, linux-mm,
john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl,
linux-xfs, hch, Zi Yan, akpm, chandan.babu
On Tue, Jul 09, 2024 at 04:29:07PM +0000, Pankaj Raghav (Samsung) wrote:
> +++ b/include/linux/pagemap.h
> @@ -394,13 +394,24 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
> unsigned int min,
> unsigned int max)
> {
> - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + VM_WARN_ONCE(1,
> + "THP needs to be enabled to support mapping folio order range");
> return;
> + }
No. Filesystems call mapping_set_folio_order_range() without it being
conditional on CONFIG_TRANSPARENT_HUGEPAGE. Usually that takes the
form of an unconditional call to mapping_set_large_folios().
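For context, in this series mapping_set_large_folios() is roughly this thin
wrapper (a sketch), so the proposed WARN would trip on every !THP build of
any FS that calls it:

static inline void mapping_set_large_folios(struct address_space *mapping)
{
	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
}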
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-09 16:29 ` Pankaj Raghav (Samsung)
2024-07-09 16:38 ` Matthew Wilcox
@ 2024-07-09 16:50 ` Darrick J. Wong
2024-07-09 21:08 ` Pankaj Raghav (Samsung)
1 sibling, 1 reply; 61+ messages in thread
From: Darrick J. Wong @ 2024-07-09 16:50 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, ryan.roberts, linux-kernel, yang, linux-mm,
john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl,
linux-xfs, hch, Zi Yan, akpm, chandan.babu
On Tue, Jul 09, 2024 at 04:29:07PM +0000, Pankaj Raghav (Samsung) wrote:
> For now, this is the only patch that is blocking for the next version.
>
> Based on the discussion, is the following logical @ryan, @dave and
> @willy?
>
> - We give explicit VM_WARN_ONCE if we try to set folio order range if
> the THP is disabled, min and max is greater than MAX_PAGECACHE_ORDER.
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 14e1415f7dcf4..313c9fad61859 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -394,13 +394,24 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
> unsigned int min,
> unsigned int max)
> {
> - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + VM_WARN_ONCE(1,
> + "THP needs to be enabled to support mapping folio order range");
> return;
> + }
>
> - if (min > MAX_PAGECACHE_ORDER)
> + if (min > MAX_PAGECACHE_ORDER) {
> + VM_WARN_ONCE(1,
> + "min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
> min = MAX_PAGECACHE_ORDER;
> - if (max > MAX_PAGECACHE_ORDER)
> + }
> +
> + if (max > MAX_PAGECACHE_ORDER) {
> + VM_WARN_ONCE(1,
> + "max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
> max = MAX_PAGECACHE_ORDER;
> + }
> +
> if (max < min)
> max = min;
>
> - We make THP an explicit dependency for XFS:
>
> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
> index d41edd30388b7..be2c1c0e9fe8b 100644
> --- a/fs/xfs/Kconfig
> +++ b/fs/xfs/Kconfig
> @@ -5,6 +5,7 @@ config XFS_FS
> select EXPORTFS
> select LIBCRC32C
> select FS_IOMAP
> + select TRANSPARENT_HUGEPAGE
> help
> XFS is a high performance journaling filesystem which originated
> on the SGI IRIX platform. It is completely multi-threaded, can
>
> OR
>
> We create a helper in page cache that FSs can use to check if a specific
> order can be supported at mount time:
I like this solution better; if XFS is going to drop support for o[ld]d
architectures I think we need /some/ sort of notice period. Or at least
a better story than "we want to support 64k fsblocks on x64 so we're
withdrawing support even for 4k fsblocks and smallish filesystems on
m68k".
You probably don't want bs>ps support to block on some arcane discussion
about 32-bit, right? ;)
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 14e1415f7dcf..9be775ef11a5 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -374,6 +374,14 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
> #define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
>
> +
> +static inline unsigned int mapping_max_folio_order_supported()
> +{
> > +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > +      return 0;

Shouldn't this line be indented by two tabs, not six spaces?

> > +	return MAX_PAGECACHE_ORDER;
> > +}
Alternately, should this return the max folio size in bytes?

static inline size_t mapping_max_folio_size(void)
{
	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
	return PAGE_SIZE;
}
Then the validation looks like:

	const size_t max_folio_size = mapping_max_folio_size();

	if (mp->m_sb.sb_blocksize > max_folio_size) {
		xfs_warn(mp,
"block size (%u bytes) not supported; maximum folio size is %zu.",
			mp->m_sb.sb_blocksize, max_folio_size);
		error = -ENOSYS;
		goto out_free_sb;
	}
(Don't mind me bikeshedding here.)
> +
>
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index b8a93a8f35cac..e2be8743c2c20 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1647,6 +1647,15 @@ xfs_fs_fill_super(
> goto out_free_sb;
> }
>
> + if (mp->m_sb.sb_blocklog - PAGE_SHIFT >
> + mapping_max_folio_order_supported()) {
> + xfs_warn(mp,
> +"Block Size (%d bytes) is not supported. Check MAX_PAGECACHE_ORDER",
> + mp->m_sb.sb_blocksize);
You might as well print MAX_PAGECACHE_ORDER here to make analysis
easier on less-familiar architectures:
	xfs_warn(mp,
"block size (%d bytes) is not supported; max folio size is %u.",
		mp->m_sb.sb_blocksize,
		1U << (PAGE_SHIFT + mapping_max_folio_order_supported()));
(I wrote this comment first.)
--D
> + error = -ENOSYS;
> + goto out_free_sb;
> + }
> +
> xfs_warn(mp,
> "EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
> mp->m_sb.sb_blocksize);
>
>
> --
> Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-09 16:38 ` Matthew Wilcox
@ 2024-07-09 17:33 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-09 17:33 UTC (permalink / raw)
To: Matthew Wilcox
Cc: david, ryan.roberts, djwong, linux-kernel, yang, linux-mm,
john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl,
linux-xfs, hch, Zi Yan, akpm, chandan.babu
On Tue, Jul 09, 2024 at 05:38:16PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 09, 2024 at 04:29:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > +++ b/include/linux/pagemap.h
> > @@ -394,13 +394,24 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
> > unsigned int min,
> > unsigned int max)
> > {
> > - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > + VM_WARN_ONCE(1,
> > + "THP needs to be enabled to support mapping folio order range");
> > return;
> > + }
>
> No. Filesystems call mapping_set_folio_order_range() without it being
> conditional on CONFIG_TRANSPARENT_HUGEPAGE. Usually that takes the
> form of an unconditional call to mapping_set_large_folios().
Ah, you are right.
Actually, thinking more about it, we don't need the VM_WARN_ONCE on the
CONFIG_TRANSPARENT_HUGEPAGE IS_ENABLED() check, because if we go the route
where a FS calls something like `mapping_max_folio_order_supported()` at
mount time, that will already return `0` as the maximum order that is
supported.
So just something like this should be enough:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 14e1415f7dcf..ef6b13854385 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -397,10 +397,18 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	if (min > MAX_PAGECACHE_ORDER)
+	if (min > MAX_PAGECACHE_ORDER) {
+		VM_WARN_ONCE(1,
+			"min order > MAX_PAGECACHE_ORDER. Setting min_order to MAX_PAGECACHE_ORDER");
 		min = MAX_PAGECACHE_ORDER;
-	if (max > MAX_PAGECACHE_ORDER)
+	}
+
+	if (max > MAX_PAGECACHE_ORDER) {
+		VM_WARN_ONCE(1,
+			"max order > MAX_PAGECACHE_ORDER. Setting max_order to MAX_PAGECACHE_ORDER");
 		max = MAX_PAGECACHE_ORDER;
+	}
+
 	if (max < min)
 		max = min;
We would then have a helper such as mapping_max_folio_order_supported() that
could be invoked by FSs to see what the page cache can support. And FSs that
call mapping_set_large_folios() as an optimization will not see these random
WARNINGs, because mapping_set_large_folios() calls this function with min = 0
and max = MAX_PAGECACHE_ORDER, which are always within the limits.
Let me know what you think.
--
Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-09 16:50 ` Darrick J. Wong
@ 2024-07-09 21:08 ` Pankaj Raghav (Samsung)
2024-07-09 21:59 ` Darrick J. Wong
0 siblings, 1 reply; 61+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-07-09 21:08 UTC (permalink / raw)
To: Darrick J. Wong
Cc: david, willy, ryan.roberts, linux-kernel, yang, linux-mm,
john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl,
linux-xfs, hch, Zi Yan, akpm, chandan.babu
> >
> > - We make THP an explicit dependency for XFS:
> >
> > diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
> > index d41edd30388b7..be2c1c0e9fe8b 100644
> > --- a/fs/xfs/Kconfig
> > +++ b/fs/xfs/Kconfig
> > @@ -5,6 +5,7 @@ config XFS_FS
> > select EXPORTFS
> > select LIBCRC32C
> > select FS_IOMAP
> > + select TRANSPARENT_HUGEPAGE
> > help
> > XFS is a high performance journaling filesystem which originated
> > on the SGI IRIX platform. It is completely multi-threaded, can
> >
> > OR
> >
> > We create a helper in page cache that FSs can use to check if a specific
> > order can be supported at mount time:
>
> I like this solution better; if XFS is going to drop support for o[ld]d
> architectures I think we need /some/ sort of notice period. Or at least
> a better story than "we want to support 64k fsblocks on x64 so we're
> withdrawing support even for 4k fsblocks and smallish filesystems on
> m68k".
>
> You probably don't want bs>ps support to block on some arcane discussion
> about 32-bit, right? ;)
>
:)
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 14e1415f7dcf..9be775ef11a5 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -374,6 +374,14 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> > #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
> > #define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
> >
> > +
> > +static inline unsigned int mapping_max_folio_order_supported()
> > +{
> > +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > +      return 0;
>
> Shouldn't this line be indented by two tabs, not six spaces?
>
> > +	return MAX_PAGECACHE_ORDER;
> > +}
>
> Alternately, should this return the max folio size in bytes?
>
> static inline size_t mapping_max_folio_size(void)
> {
> if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
> return PAGE_SIZE;
> }
We already have mapping_max_folio_size(mapping), which returns the
maximum folio size set for that mapping, so the new helper could be called
mapping_max_folio_size_supported(). We could just have that instead of
mapping_max_folio_order_supported(), as you suggest.
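i.e., your size-returning helper from above under the new name (a sketch,
just renamed):

static inline size_t mapping_max_folio_size_supported(void)
{
	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
	return PAGE_SIZE;
}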
>
> Then the validation looks like:
>
> const size_t max_folio_size = mapping_max_folio_size();
>
> if (mp->m_sb.sb_blocksize > max_folio_size) {
> xfs_warn(mp,
> "block size (%u bytes) not supported; maximum folio size is %u.",
> mp->m_sb.sb_blocksize, max_folio_size);
> error = -ENOSYS;
> goto out_free_sb;
> }
>
> (Don't mind me bikeshedding here.)
>
> > +
> >
> >
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index b8a93a8f35cac..e2be8743c2c20 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1647,6 +1647,15 @@ xfs_fs_fill_super(
> > goto out_free_sb;
> > }
> >
> > + if (mp->m_sb.sb_blocklog - PAGE_SHIFT >
> > + mapping_max_folio_order_supported()) {
> > + xfs_warn(mp,
> > +"Block Size (%d bytes) is not supported. Check MAX_PAGECACHE_ORDER",
> > + mp->m_sb.sb_blocksize);
>
> You might as well print MAX_PAGECACHE_ORDER here to make analysis
> easier on less-familiar architectures:
Yes!
>
> xfs_warn(mp,
> "block size (%d bytes) is not supported; max folio size is %u.",
> mp->m_sb.sb_blocksize,
> 1U << mapping_max_folio_order_supported());
>
> (I wrote this comment first.)
>
> --D
>
> > + error = -ENOSYS;
> > + goto out_free_sb;
> > + }
> > +
> > xfs_warn(mp,
> > "EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
> > mp->m_sb.sb_blocksize);
> >
> >
> > --
> > Pankaj
* Re: [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes
2024-07-09 21:08 ` Pankaj Raghav (Samsung)
@ 2024-07-09 21:59 ` Darrick J. Wong
0 siblings, 0 replies; 61+ messages in thread
From: Darrick J. Wong @ 2024-07-09 21:59 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: david, willy, ryan.roberts, linux-kernel, yang, linux-mm,
john.g.garry, linux-fsdevel, hare, p.raghav, mcgrof, gost.dev, cl,
linux-xfs, hch, Zi Yan, akpm, chandan.babu
On Tue, Jul 09, 2024 at 09:08:29PM +0000, Pankaj Raghav (Samsung) wrote:
> > >
> > > - We make THP an explicit dependency for XFS:
> > >
> > > diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
> > > index d41edd30388b7..be2c1c0e9fe8b 100644
> > > --- a/fs/xfs/Kconfig
> > > +++ b/fs/xfs/Kconfig
> > > @@ -5,6 +5,7 @@ config XFS_FS
> > > select EXPORTFS
> > > select LIBCRC32C
> > > select FS_IOMAP
> > > + select TRANSPARENT_HUGEPAGE
> > > help
> > > XFS is a high performance journaling filesystem which originated
> > > on the SGI IRIX platform. It is completely multi-threaded, can
> > >
> > > OR
> > >
> > > We create a helper in page cache that FSs can use to check if a specific
> > > order can be supported at mount time:
> >
> > I like this solution better; if XFS is going to drop support for o[ld]d
> > architectures I think we need /some/ sort of notice period. Or at least
> > a better story than "we want to support 64k fsblocks on x64 so we're
> > withdrawing support even for 4k fsblocks and smallish filesystems on
> > m68k".
> >
> > You probably don't want bs>ps support to block on some arcane discussion
> > about 32-bit, right? ;)
> >
>
> :)
>
> > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > index 14e1415f7dcf..9be775ef11a5 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -374,6 +374,14 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> > > #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
> > > #define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
> > >
> > > +
> > > +static inline unsigned int mapping_max_folio_order_supported()
> > > +{
> > > +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > +      return 0;
> >
> > Shouldn't this line be indented by two tabs, not six spaces?
> >
> > > +	return MAX_PAGECACHE_ORDER;
> > > +}
> >
> > Alternately, should this return the max folio size in bytes?
> >
> > static inline size_t mapping_max_folio_size(void)
> > {
> > if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
> > return PAGE_SIZE;
> > }
>
> We already have mapping_max_folio_size(mapping) which returns the
> maximum folio order set for that mapping. So this could be called as
> mapping_max_folio_size_supported().
>
> So we could just have mapping_max_folio_size_supported() instead of
> having mapping_max_folio_order_supported as you suggest.
<nod>
> >
> > Then the validation looks like:
> >
> > const size_t max_folio_size = mapping_max_folio_size();
> >
> > if (mp->m_sb.sb_blocksize > max_folio_size) {
> > xfs_warn(mp,
> > "block size (%u bytes) not supported; maximum folio size is %u.",
> > mp->m_sb.sb_blocksize, max_folio_size);
> > error = -ENOSYS;
> > goto out_free_sb;
> > }
> >
> > (Don't mind me bikeshedding here.)
> >
> > > +
> > >
> > >
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index b8a93a8f35cac..e2be8743c2c20 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1647,6 +1647,15 @@ xfs_fs_fill_super(
> > > goto out_free_sb;
> > > }
> > >
> > > + if (mp->m_sb.sb_blocklog - PAGE_SHIFT >
> > > + mapping_max_folio_order_supported()) {
> > > + xfs_warn(mp,
> > > +"Block Size (%d bytes) is not supported. Check MAX_PAGECACHE_ORDER",
> > > + mp->m_sb.sb_blocksize);
> >
> > You might as well print MAX_PAGECACHE_ORDER here to make analysis
> > easier on less-familiar architectures:
>
> Yes!
Thanks.
--D
> >
> > 	xfs_warn(mp,
> > "block size (%d bytes) is not supported; max folio size is %u.",
> > 		mp->m_sb.sb_blocksize,
> > 		1U << (PAGE_SHIFT + mapping_max_folio_order_supported()));
> >
> > (I wrote this comment first.)
>
> >
> > --D
> >
> > > + error = -ENOSYS;
> > > + goto out_free_sb;
> > > + }
> > > +
> > > xfs_warn(mp,
> > > "EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
> > > mp->m_sb.sb_blocksize);
> > >
> > >
> > > --
> > > Pankaj
>
end of thread, other threads: [~2024-07-09 21:59 UTC | newest]
Thread overview: 61+ messages
2024-06-25 11:44 [PATCH v8 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-07-04 12:23 ` Ryan Roberts
2024-07-04 15:20 ` Matthew Wilcox
2024-07-04 15:52 ` Ryan Roberts
2024-07-04 21:28 ` Pankaj Raghav (Samsung)
2024-07-04 22:06 ` Dave Chinner
2024-07-04 23:56 ` Matthew Wilcox
2024-07-05 4:32 ` Dave Chinner
2024-07-05 9:03 ` Ryan Roberts
2024-07-05 12:45 ` Pankaj Raghav (Samsung)
2024-07-05 13:24 ` Pankaj Raghav (Samsung)
2024-07-05 13:31 ` Ryan Roberts
2024-07-05 14:14 ` Pankaj Raghav (Samsung)
2024-07-08 23:01 ` Dave Chinner
2024-07-09 8:11 ` Ryan Roberts
2024-07-09 13:08 ` Pankaj Raghav (Samsung)
2024-07-05 15:14 ` Matthew Wilcox
2024-07-04 21:34 ` Pankaj Raghav (Samsung)
2024-07-09 16:29 ` Pankaj Raghav (Samsung)
2024-07-09 16:38 ` Matthew Wilcox
2024-07-09 17:33 ` Pankaj Raghav (Samsung)
2024-07-09 16:50 ` Darrick J. Wong
2024-07-09 21:08 ` Pankaj Raghav (Samsung)
2024-07-09 21:59 ` Darrick J. Wong
2024-06-25 11:44 ` [PATCH v8 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-06-25 15:52 ` Matthew Wilcox
2024-06-25 18:06 ` Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-07-02 19:38 ` Darrick J. Wong
2024-07-03 14:10 ` Pankaj Raghav (Samsung)
2024-07-04 14:24 ` Ryan Roberts
2024-07-04 14:29 ` Matthew Wilcox
2024-06-25 11:44 ` [PATCH v8 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
2024-06-25 14:45 ` Zi Yan
2024-06-25 17:20 ` Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
2024-07-01 23:39 ` Darrick J. Wong
2024-06-25 11:44 ` [PATCH v8 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-07-01 2:37 ` Dave Chinner
2024-07-01 11:22 ` Pankaj Raghav (Samsung)
2024-07-01 23:40 ` Darrick J. Wong
2024-07-02 7:42 ` Christoph Hellwig
2024-07-02 10:15 ` Pankaj Raghav (Samsung)
2024-07-02 12:02 ` Christoph Hellwig
2024-07-02 14:01 ` Pankaj Raghav (Samsung)
2024-07-02 15:42 ` Christoph Hellwig
2024-07-02 16:13 ` Pankaj Raghav (Samsung)
2024-07-02 16:51 ` Matthew Wilcox
2024-07-02 17:10 ` Pankaj Raghav (Samsung)
2024-07-03 5:16 ` Christoph Hellwig
2024-07-02 16:50 ` Matthew Wilcox
2024-07-02 13:49 ` Luis Chamberlain
2024-06-25 11:44 ` [PATCH v8 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-06-25 18:07 ` Pankaj Raghav (Samsung)
2024-06-25 11:44 ` [PATCH v8 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-07-01 2:33 ` Dave Chinner
2024-06-25 11:44 ` [PATCH v8 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-07-01 2:34 ` Dave Chinner
2024-06-25 11:44 ` [PATCH v8 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-07-01 2:34 ` Dave Chinner