* [RFC 01/23] fs: Allow fine-grained control of folio sizes
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:03 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order() Pankaj Raghav
` (23 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Some filesystems want to be able to limit the maximum size of folios,
and some want to be able to ensure that folios are at least a certain
size. Add mapping_set_folio_orders() to allow this level of control.
The max folio order parameter is currently ignored; the stored maximum is
always set to MAX_PAGECACHE_ORDER.
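As an illustration (not part of this patch), a filesystem with a 16KiB block
size on a 4KiB page machine needs folios of at least order 2 and could call
this from its inode setup roughly like so:

	/* hypothetical caller: i_blkbits = 14, PAGE_SHIFT = 12 -> min order 2 */
	mapping_set_folio_orders(inode->i_mapping,
				 inode->i_blkbits - PAGE_SHIFT,
				 MAX_PAGECACHE_ORDER);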
[Pankaj]: added mapping_min_folio_order(), changed MAX_MASK to 0x0003e000
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
[mcgrof: rebase in light of "mm, netfs, fscache: stop read optimisation
when folio removed from pagecache" which adds AS_RELEASE_ALWAYS]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/pagemap.h | 78 +++++++++++++++++++++++++++++++----------
1 file changed, 60 insertions(+), 18 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 759b29d9a69a..d2b5308cc59e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,10 +202,16 @@ enum mapping_flags {
AS_EXITING = 4, /* final truncate in progress */
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
- AS_LARGE_FOLIO_SUPPORT = 6,
- AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */
+ AS_RELEASE_ALWAYS = 6, /* Call ->release_folio(), even if no private data */
+ AS_FOLIO_ORDER_MIN = 8,
+ AS_FOLIO_ORDER_MAX = 13,
+ /* 8-17 are used for FOLIO_ORDER */
};
+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
/**
* mapping_set_error - record a writeback error in the address_space
* @mapping: the mapping in which an error should be set
@@ -310,6 +316,46 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
m->gfp_mask = mask;
}
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER. Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size. I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER 8
+#endif
+
+/*
+ * mapping_set_folio_orders() - Set the range of folio sizes supported.
+ * @mapping: The file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which sizes of folio the VFS can use to cache the contents
+ * of the file. This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+ unsigned int min, unsigned int max)
+{
+ /*
+ * XXX: max is ignored as only minimum folio order is supported
+ * currently.
+ */
+ mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+ (min << AS_FOLIO_ORDER_MIN) |
+ (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
+}
+
/**
* mapping_set_large_folios() - Indicate the file supports large folios.
* @mapping: The file.
@@ -323,7 +369,17 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
*/
static inline void mapping_set_large_folios(struct address_space *mapping)
{
- __set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+ mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
+{
+ return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
+{
+ return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
}
/*
@@ -332,8 +388,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
*/
static inline bool mapping_large_folio_support(struct address_space *mapping)
{
- return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
- test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+ return mapping_max_folio_order(mapping) > 0;
}
static inline int filemap_nr_thps(struct address_space *mapping)
@@ -494,19 +549,6 @@ static inline void *detach_page_private(struct page *page)
return folio_detach_private(page_folio(page));
}
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER. Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size. I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER 8
-#endif
-
#ifdef CONFIG_NUMA
struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
#else
--
2.40.1
* Re: [RFC 01/23] fs: Allow fine-grained control of folio sizes
2023-09-15 18:38 ` [RFC 01/23] fs: Allow fine-grained control of folio sizes Pankaj Raghav
@ 2023-09-15 19:03 ` Matthew Wilcox
0 siblings, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:03 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:26PM +0200, Pankaj Raghav wrote:
> +static inline void mapping_set_folio_orders(struct address_space *mapping,
> + unsigned int min, unsigned int max)
> +{
> + /*
> + * XXX: max is ignored as only minimum folio order is supported
> + * currently.
> + */
I think we need some sanity checking ...
	if (min == 1)
		min = 2;
	if (max < min)
		max = min;
	if (max > MAX_PAGECACHE_ORDER)
		max = MAX_PAGECACHE_ORDER;
> + mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> + (min << AS_FOLIO_ORDER_MIN) |
> + (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> +}
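Folded into the helper, those checks would look roughly like this (a sketch
only; the version posted above still ignores @max and always stores
MAX_PAGECACHE_ORDER):

	static inline void mapping_set_folio_orders(struct address_space *mapping,
						    unsigned int min, unsigned int max)
	{
		if (min == 1)
			min = 2;	/* order-1 folios are not supported */
		if (max < min)
			max = min;
		if (max > MAX_PAGECACHE_ORDER)
			max = MAX_PAGECACHE_ORDER;

		mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
				 (min << AS_FOLIO_ORDER_MIN) |
				 (max << AS_FOLIO_ORDER_MAX);
	}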
* [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
2023-09-15 18:38 ` [RFC 01/23] fs: Allow fine-grained control of folio sizes Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:55 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio Pankaj Raghav
` (22 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
fgf_set_order() encodes the optimal order in the fgp flags. Set it to at
least the mapping_min_order from the page cache. Default to the old behaviour
if min_order is not set.
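For context, the encoded order lands in the top bits of the fgf_t value and is
read back with FGF_GET_ORDER() in __filemap_get_folio(). A rough caller sketch
(mapping, pos and folio are assumed locals), for a 64KiB request on 4KiB pages
where ilog2(65536) - PAGE_SHIFT = 4 and min_order is 2, so order 4 is encoded:

	fgf_t fgp = FGP_LOCK | FGP_CREAT;

	/* hint the preferred folio order; never below mapping_min_folio_order() */
	fgp |= fgf_set_order(mapping, 64 * 1024);

	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				    mapping_gfp_mask(mapping));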
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/iomap/buffered-io.c | 2 +-
include/linux/pagemap.h | 9 +++++----
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ae8673ce08b1..d4613fd550c4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -549,7 +549,7 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
if (iter->flags & IOMAP_NOWAIT)
fgp |= FGP_NOWAIT;
- fgp |= fgf_set_order(len);
+ fgp |= fgf_set_order(iter->inode->i_mapping, len);
return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
fgp, mapping_gfp_mask(iter->inode->i_mapping));
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d2b5308cc59e..5d392366420a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -620,6 +620,7 @@ typedef unsigned int __bitwise fgf_t;
/**
* fgf_set_order - Encode a length in the fgf_t flags.
+ * @mapping: address_space struct from the inode
* @size: The suggested size of the folio to create.
*
* The caller of __filemap_get_folio() can use this to suggest a preferred
@@ -629,13 +630,13 @@ typedef unsigned int __bitwise fgf_t;
* due to alignment constraints, memory pressure, or the presence of
* other folios at nearby indices.
*/
-static inline fgf_t fgf_set_order(size_t size)
+static inline fgf_t fgf_set_order(struct address_space *mapping, size_t size)
{
unsigned int shift = ilog2(size);
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ int order = max(min_order, shift - PAGE_SHIFT);
- if (shift <= PAGE_SHIFT)
- return 0;
- return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+ return (__force fgf_t)((order) << 26);
}
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
--
2.40.1
* Re: [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order()
2023-09-15 18:38 ` [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order() Pankaj Raghav
@ 2023-09-15 18:55 ` Matthew Wilcox
2023-09-20 7:46 ` Pankaj Raghav
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 18:55 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:27PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> fgf_set_order() encodes optimal order in fgp flags. Set it to at least
> mapping_min_order from the page cache. Default to the old behaviour if
> min_order is not set.
Why not simply:
+++ b/mm/filemap.c
@@ -1906,9 +1906,12 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio_wait_stable(folio);
no_page:
if (!folio && (fgp_flags & FGP_CREAT)) {
- unsigned order = FGF_GET_ORDER(fgp_flags);
+ unsigned order;
int err;
+ order = min(mapping_min_folio_order(mapping),
+ FGF_GET_ORDER(fgp_flags));
* Re: [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order()
2023-09-15 18:55 ` Matthew Wilcox
@ 2023-09-20 7:46 ` Pankaj Raghav
0 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-20 7:46 UTC (permalink / raw)
To: Matthew Wilcox, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, david, da.gomez, akpm, linux-kernel,
djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On 2023-09-15 20:55, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:27PM +0200, Pankaj Raghav wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> fgf_set_order() encodes optimal order in fgp flags. Set it to at least
>> mapping_min_order from the page cache. Default to the old behaviour if
>> min_order is not set.
>
> Why not simply:
>
That is a good idea to move this to filemap instead of changing it in iomap. I will do that!
> +++ b/mm/filemap.c
> @@ -1906,9 +1906,12 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> folio_wait_stable(folio);
> no_page:
> if (!folio && (fgp_flags & FGP_CREAT)) {
> - unsigned order = FGF_GET_ORDER(fgp_flags);
> + unsigned order;
> int err;
>
> + order = min(mapping_min_folio_order(mapping),
> + FGF_GET_ORDER(fgp_flags));
>
I think this needs to be max(mapping..., FGF...)
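That is, the hunk above with min() replaced by max():

	order = max(mapping_min_folio_order(mapping),
		    FGF_GET_ORDER(fgp_flags));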
* [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
2023-09-15 18:38 ` [RFC 01/23] fs: Allow fine-grained control of folio sizes Pankaj Raghav
2023-09-15 18:38 ` [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:00 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch() Pankaj Raghav
` (21 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
__filemap_get_folio() with FGP_CREAT should allocate a folio of at least the
mapping's min_order, as set via mapping_set_folio_orders().
A folio of order higher than min_order is by definition a multiple of the
min_order size. If an index is aligned to an order higher than min_order, it
is also aligned to min_order.
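For example, with 4KiB pages and min_order = 2 (16KiB folios), an order-4
folio can only start at an index that is a multiple of 16, and every multiple
of 16 is also a multiple of 4 = 1 << min_order, so an index rounded down to
the min_order boundary remains valid for any order >= min_order.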
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/filemap.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8962d1255905..b1ce63143df5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1862,6 +1862,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp)
{
struct folio *folio;
+ int min_order = mapping_min_folio_order(mapping);
+ int nr_of_pages = (1U << min_order);
+
+ index = round_down(index, nr_of_pages);
repeat:
folio = filemap_get_entry(mapping, index);
@@ -1929,8 +1933,14 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
err = -ENOMEM;
if (order == 1)
order = 0;
+ if (order < min_order)
+ order = min_order;
if (order > 0)
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+
+ if (min_order)
+ VM_BUG_ON(index & ((1UL << order) - 1));
+
folio = filemap_alloc_folio(alloc_gfp, order);
if (!folio)
continue;
@@ -1944,7 +1954,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
break;
folio_put(folio);
folio = NULL;
- } while (order-- > 0);
+ } while (order-- > min_order);
if (err == -EEXIST)
goto repeat;
--
2.40.1
* Re: [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio
2023-09-15 18:38 ` [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio Pankaj Raghav
@ 2023-09-15 19:00 ` Matthew Wilcox
2023-09-20 8:06 ` Pankaj Raghav
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:00 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:28PM +0200, Pankaj Raghav wrote:
> +++ b/mm/filemap.c
> @@ -1862,6 +1862,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> fgf_t fgp_flags, gfp_t gfp)
> {
> struct folio *folio;
> + int min_order = mapping_min_folio_order(mapping);
> + int nr_of_pages = (1U << min_order);
> +
> + index = round_down(index, nr_of_pages);
>
> repeat:
> folio = filemap_get_entry(mapping, index);
> @@ -1929,8 +1933,14 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> err = -ENOMEM;
> if (order == 1)
> order = 0;
> + if (order < min_order)
> + order = min_order;
... oh, you do something similar here to what I recommend in my previous
response. I don't understand why you need the previous patch.
> + if (min_order)
> + VM_BUG_ON(index & ((1UL << order) - 1));
You don't need the 'if' here; index & ((1 << 0) - 1) becomes false.
* Re: [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio
2023-09-15 19:00 ` Matthew Wilcox
@ 2023-09-20 8:06 ` Pankaj Raghav
0 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-20 8:06 UTC (permalink / raw)
To: Matthew Wilcox, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, david, da.gomez, akpm, linux-kernel,
djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On 2023-09-15 21:00, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:28PM +0200, Pankaj Raghav wrote:
>> +++ b/mm/filemap.c
>> @@ -1862,6 +1862,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>> fgf_t fgp_flags, gfp_t gfp)
>> {
>> struct folio *folio;
>> + int min_order = mapping_min_folio_order(mapping);
>> + int nr_of_pages = (1U << min_order);
>> +
>> + index = round_down(index, nr_of_pages);
>>
>> repeat:
>> folio = filemap_get_entry(mapping, index);
>> @@ -1929,8 +1933,14 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>> err = -ENOMEM;
>> if (order == 1)
>> order = 0;
>> + if (order < min_order)
>> + order = min_order;
>
> ... oh, you do something similar here to what I recommend in my previous
> response. I don't understand why you need the previous patch.
>
Hmm, we made changes here a bit later, which is why it is duplicated, I
guess, in both the iomap fgf order and the clamping of the order here to
min_order. We could remove the previous patch and retain this one.
>> + if (min_order)
>> + VM_BUG_ON(index & ((1UL << order) - 1));
>
> You don't need the 'if' here; index & ((1 << 0) - 1) becomes false.
>
Sounds good!
* [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (2 preceding siblings ...)
2023-09-15 18:38 ` [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:43 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page() Pankaj Raghav
` (20 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Similar to page_cache_delete(), call xas_set_order() for non-hugetlb pages
while deleting an entry from the page cache. Also add a VM_BUG_ON_FOLIO()
check if the order of the folio is less than the mapping's min_order.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/filemap.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index b1ce63143df5..2c47729dc8b0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -126,6 +126,7 @@
static void page_cache_delete(struct address_space *mapping,
struct folio *folio, void *shadow)
{
+ unsigned int min_order = mapping_min_folio_order(mapping);
XA_STATE(xas, &mapping->i_pages, folio->index);
long nr = 1;
@@ -134,6 +135,7 @@ static void page_cache_delete(struct address_space *mapping,
xas_set_order(&xas, folio->index, folio_order(folio));
nr = folio_nr_pages(folio);
+ VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
xas_store(&xas, shadow);
@@ -276,6 +278,7 @@ void filemap_remove_folio(struct folio *folio)
static void page_cache_delete_batch(struct address_space *mapping,
struct folio_batch *fbatch)
{
+ unsigned int min_order = mapping_min_folio_order(mapping);
XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index);
long total_pages = 0;
int i = 0;
@@ -304,6 +307,11 @@ static void page_cache_delete_batch(struct address_space *mapping,
WARN_ON_ONCE(!folio_test_locked(folio));
+ /* hugetlb pages are represented by a single entry in the xarray */
+ if (!folio_test_hugetlb(folio)) {
+ VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
+ xas_set_order(&xas, folio->index, folio_order(folio));
+ }
folio->mapping = NULL;
/* Leave folio->index set: truncation lookup relies on it */
--
2.40.1
* Re: [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch()
2023-09-15 18:38 ` [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch() Pankaj Raghav
@ 2023-09-15 19:43 ` Matthew Wilcox
2023-09-18 18:20 ` Luis Chamberlain
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:43 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:29PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> Similar to page_cache_delete(), call xas_set_order for non-hugetlb pages
> while deleting an entry from the page cache.
Is this necessary? As I read xas_store(), if you're storing NULL, it
will wipe out all sibling entries. Was this based on "oops, no, it
doesn't" or "here is a gratuitous difference, change it"?
* Re: [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch()
2023-09-15 19:43 ` Matthew Wilcox
@ 2023-09-18 18:20 ` Luis Chamberlain
0 siblings, 0 replies; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-18 18:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, david,
da.gomez, akpm, linux-kernel, djwong, linux-mm, chandan.babu,
gost.dev
On Fri, Sep 15, 2023 at 08:43:28PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:29PM +0200, Pankaj Raghav wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > Similar to page_cache_delete(), call xas_set_order for non-hugetlb pages
> > while deleting an entry from the page cache.
>
> Is this necessary? As I read xas_store(), if you're storing NULL, it
> will wipe out all sibling entries. Was this based on "oops, no, it
> doesn't" or "here is a gratuitous difference, change it"?
Based on code inspection, I saw page_cache_delete() did it. The xarray docs
and xarray selftests were not clear about how the advanced API handles this
case, and the use of xas_set_order() in page_cache_delete() gave me concerns
that we needed it here too.
We do have some enhancements to the xarray selftests to use the advanced API,
which we could extend with this particular case before posting, to prove or
disprove whether this is really needed.
Why would it be needed in page_cache_delete() but not here?
Luis
* [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (3 preceding siblings ...)
2023-09-15 18:38 ` [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:45 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio() Pankaj Raghav
` (19 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
The page cache is aligned to the mapping's min_folio_order. Use
min_folio_order to align the start_byte and end_byte in
filemap_range_has_page().
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/filemap.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 2c47729dc8b0..4dee24b5b61c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -477,9 +477,12 @@ EXPORT_SYMBOL(filemap_flush);
bool filemap_range_has_page(struct address_space *mapping,
loff_t start_byte, loff_t end_byte)
{
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1UL << min_order;
+ pgoff_t index = round_down(start_byte >> PAGE_SHIFT, nrpages);
struct folio *folio;
- XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
- pgoff_t max = end_byte >> PAGE_SHIFT;
+ XA_STATE(xas, &mapping->i_pages, index);
+ pgoff_t max = round_down(end_byte >> PAGE_SHIFT, nrpages);
if (end_byte < start_byte)
return false;
--
2.40.1
* Re: [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page()
2023-09-15 18:38 ` [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page() Pankaj Raghav
@ 2023-09-15 19:45 ` Matthew Wilcox
2023-09-18 18:25 ` Luis Chamberlain
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:45 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:30PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> page cache is mapping min_folio_order aligned. Use mapping min_folio_order
> to align the start_byte and end_byte in filemap_range_has_page().
What goes wrong if you don't? Seems to me like it should work.
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> mm/filemap.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2c47729dc8b0..4dee24b5b61c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -477,9 +477,12 @@ EXPORT_SYMBOL(filemap_flush);
> bool filemap_range_has_page(struct address_space *mapping,
> loff_t start_byte, loff_t end_byte)
> {
> + unsigned int min_order = mapping_min_folio_order(mapping);
> + unsigned int nrpages = 1UL << min_order;
> + pgoff_t index = round_down(start_byte >> PAGE_SHIFT, nrpages);
> struct folio *folio;
> - XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
> - pgoff_t max = end_byte >> PAGE_SHIFT;
> + XA_STATE(xas, &mapping->i_pages, index);
> + pgoff_t max = round_down(end_byte >> PAGE_SHIFT, nrpages);
>
> if (end_byte < start_byte)
> return false;
> --
> 2.40.1
>
* Re: [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page()
2023-09-15 19:45 ` Matthew Wilcox
@ 2023-09-18 18:25 ` Luis Chamberlain
0 siblings, 0 replies; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-18 18:25 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, david,
da.gomez, akpm, linux-kernel, djwong, linux-mm, chandan.babu,
gost.dev
On Fri, Sep 15, 2023 at 08:45:20PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:30PM +0200, Pankaj Raghav wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > page cache is mapping min_folio_order aligned. Use mapping min_folio_order
> > to align the start_byte and end_byte in filemap_range_has_page().
>
> What goes wrong if you don't? Seems to me like it should work.
Will drop from the series after confirming, thanks.
Luis
* [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (4 preceding siblings ...)
2023-09-15 18:38 ` [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:46 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio() Pankaj Raghav
` (18 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Call xas_set_order() in replace_page_cache_folio() for non hugetlb
pages.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/filemap.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dee24b5b61c..33de71bfa953 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -815,12 +815,14 @@ EXPORT_SYMBOL(file_write_and_wait_range);
void replace_page_cache_folio(struct folio *old, struct folio *new)
{
struct address_space *mapping = old->mapping;
+ unsigned int min_order = mapping_min_folio_order(mapping);
void (*free_folio)(struct folio *) = mapping->a_ops->free_folio;
pgoff_t offset = old->index;
XA_STATE(xas, &mapping->i_pages, offset);
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
+ VM_BUG_ON_FOLIO(folio_order(new) != folio_order(old), new);
VM_BUG_ON_FOLIO(new->mapping, new);
folio_get(new);
@@ -829,6 +831,11 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
mem_cgroup_migrate(old, new);
+ if (!folio_test_hugetlb(new)) {
+ VM_BUG_ON_FOLIO(folio_order(new) < min_order, new);
+ xas_set_order(&xas, offset, folio_order(new));
+ }
+
xas_lock_irq(&xas);
xas_store(&xas, new);
--
2.40.1
* Re: [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio()
2023-09-15 18:38 ` [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio() Pankaj Raghav
@ 2023-09-15 19:46 ` Matthew Wilcox
2023-09-18 18:27 ` Luis Chamberlain
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:46 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:31PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> Call xas_set_order() in replace_page_cache_folio() for non hugetlb
> pages.
This function definitely should work without this patch. What goes wrong?
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> mm/filemap.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4dee24b5b61c..33de71bfa953 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -815,12 +815,14 @@ EXPORT_SYMBOL(file_write_and_wait_range);
> void replace_page_cache_folio(struct folio *old, struct folio *new)
> {
> struct address_space *mapping = old->mapping;
> + unsigned int min_order = mapping_min_folio_order(mapping);
> void (*free_folio)(struct folio *) = mapping->a_ops->free_folio;
> pgoff_t offset = old->index;
> XA_STATE(xas, &mapping->i_pages, offset);
>
> VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
> VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
> + VM_BUG_ON_FOLIO(folio_order(new) != folio_order(old), new);
> VM_BUG_ON_FOLIO(new->mapping, new);
>
> folio_get(new);
> @@ -829,6 +831,11 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
>
> mem_cgroup_migrate(old, new);
>
> + if (!folio_test_hugetlb(new)) {
> + VM_BUG_ON_FOLIO(folio_order(new) < min_order, new);
> + xas_set_order(&xas, offset, folio_order(new));
> + }
> +
> xas_lock_irq(&xas);
> xas_store(&xas, new);
>
> --
> 2.40.1
>
* Re: [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio()
2023-09-15 19:46 ` Matthew Wilcox
@ 2023-09-18 18:27 ` Luis Chamberlain
0 siblings, 0 replies; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-18 18:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, david,
da.gomez, akpm, linux-kernel, djwong, linux-mm, chandan.babu,
gost.dev
On Fri, Sep 15, 2023 at 08:46:10PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:31PM +0200, Pankaj Raghav wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > Call xas_set_order() in replace_page_cache_folio() for non hugetlb
> > pages.
>
> This function definitely should work without this patch. What goes wrong?
As with the batch delete, I was just trying to take care to be explicit about
setting the order for a) addition and b) removal. Will drop this as well after
confirming, thanks!
Luis
* [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (5 preceding siblings ...)
2023-09-15 18:38 ` [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:48 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag() Pankaj Raghav
` (17 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Align the index to the mapping_min_order number of pages when setting up the
XA_STATE and calling xas_set_order().
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/filemap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 33de71bfa953..15bc810bfc89 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -859,7 +859,10 @@ EXPORT_SYMBOL_GPL(replace_page_cache_folio);
noinline int __filemap_add_folio(struct address_space *mapping,
struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
- XA_STATE(xas, &mapping->i_pages, index);
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nr_of_pages = (1U << min_order);
+ pgoff_t rounded_index = round_down(index, nr_of_pages);
+ XA_STATE(xas, &mapping->i_pages, rounded_index);
int huge = folio_test_hugetlb(folio);
bool charged = false;
long nr = 1;
@@ -875,8 +878,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
charged = true;
}
- VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
- xas_set_order(&xas, index, folio_order(folio));
+ VM_BUG_ON_FOLIO(rounded_index & (folio_nr_pages(folio) - 1), folio);
+ xas_set_order(&xas, rounded_index, folio_order(folio));
nr = folio_nr_pages(folio);
gfp &= GFP_RECLAIM_MASK;
@@ -913,6 +916,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
}
}
+ VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
xas_store(&xas, folio);
if (xas_error(&xas))
goto unlock;
--
2.40.1
* Re: [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio()
2023-09-15 18:38 ` [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio() Pankaj Raghav
@ 2023-09-15 19:48 ` Matthew Wilcox
2023-09-18 18:32 ` Luis Chamberlain
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:48 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:32PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> Align the index to the mapping_min_order number of pages while setting
> the XA_STATE and xas_set_order().
Not sure why this one's necessary either. The index should already be
aligned to folio_order.
Some bits of it are clearly needed, like checking that folio_order() >=
min_order.
* Re: [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio()
2023-09-15 19:48 ` Matthew Wilcox
@ 2023-09-18 18:32 ` Luis Chamberlain
0 siblings, 0 replies; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-18 18:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, david,
da.gomez, akpm, linux-kernel, djwong, linux-mm, chandan.babu,
gost.dev
On Fri, Sep 15, 2023 at 08:48:43PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:32PM +0200, Pankaj Raghav wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > Align the index to the mapping_min_order number of pages while setting
> > the XA_STATE and xas_set_order().
>
> Not sure why this one's necessary either. The index should already be
> aligned to folio_order.
Oh, it was not obvious, would a VM_BUG_ON_FOLIO() be OK then?
> Some bits of it are clearly needed, like checking that folio_order() >=
> min_order.
Thanks,
Luis
* [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (6 preceding siblings ...)
2023-09-15 18:38 ` [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:50 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 09/23] filemap: use mapping_min_order while allocating folios Pankaj Raghav
` (16 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Align the index to the mapping_min_order number of pages while setting
the XA_STATE in filemap_get_folios_tag().
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/filemap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 15bc810bfc89..21e1341526ab 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2280,7 +2280,9 @@ EXPORT_SYMBOL(filemap_get_folios_contig);
unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch)
{
- XA_STATE(xas, &mapping->i_pages, *start);
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1UL << min_order;
+ XA_STATE(xas, &mapping->i_pages, round_down(*start, nrpages));
struct folio *folio;
rcu_read_lock();
--
2.40.1
* Re: [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag()
2023-09-15 18:38 ` [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag() Pankaj Raghav
@ 2023-09-15 19:50 ` Matthew Wilcox
2023-09-18 18:36 ` Luis Chamberlain
0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:50 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:33PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
>
> Align the index to the mapping_min_order number of pages while setting
> the XA_STATE in filemap_get_folios_tag().
... because? It should already search backwards in the page cache,
otherwise calling sync_file_range() would skip the start if it landed
in a tail page of a folio.
* Re: [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag()
2023-09-15 19:50 ` Matthew Wilcox
@ 2023-09-18 18:36 ` Luis Chamberlain
0 siblings, 0 replies; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-18 18:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, david,
da.gomez, akpm, linux-kernel, djwong, linux-mm, chandan.babu,
gost.dev
On Fri, Sep 15, 2023 at 08:50:59PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:33PM +0200, Pankaj Raghav wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> >
> > Align the index to the mapping_min_order number of pages while setting
> > the XA_STATE in filemap_get_folios_tag().
>
> ... because? It should already search backwards in the page cache,
> otherwise calling sync_file_range() would skip the start if it landed
> in a tail page of a folio.
Thanks! Will drop and verify!
Luis
* [RFC 09/23] filemap: use mapping_min_order while allocating folios
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (7 preceding siblings ...)
2023-09-15 18:38 ` [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 19:54 ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 10/23] filemap: align the index to mapping_min_order in filemap_get_pages() Pankaj Raghav
` (15 subsequent siblings)
24 siblings, 1 reply; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Allocate at least a mapping_min_order folio when creating a new folio for the
page cache in filemap_create_folio() and do_read_cache_folio().
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/filemap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 21e1341526ab..e4d46f79e95d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2502,7 +2502,8 @@ static int filemap_create_folio(struct file *file,
struct folio *folio;
int error;
- folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+ folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+ mapping_min_folio_order(mapping));
if (!folio)
return -ENOMEM;
@@ -3696,7 +3697,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
repeat:
folio = filemap_get_folio(mapping, index);
if (IS_ERR(folio)) {
- folio = filemap_alloc_folio(gfp, 0);
+ folio = filemap_alloc_folio(gfp,
+ mapping_min_folio_order(mapping));
if (!folio)
return ERR_PTR(-ENOMEM);
err = filemap_add_folio(mapping, folio, index, gfp);
--
2.40.1
* Re: [RFC 09/23] filemap: use mapping_min_order while allocating folios
2023-09-15 18:38 ` [RFC 09/23] filemap: use mapping_min_order while allocating folios Pankaj Raghav
@ 2023-09-15 19:54 ` Matthew Wilcox
0 siblings, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 19:54 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:34PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Allocate at least a mapping_min_order folio when creating a new folio for
> the page cache in filemap_create_folio() and do_read_cache_folio().
This patch is where you should be doing:
index &= ~(folio_nr_pages(folio) - 1UL);
(or similar)
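Applied to filemap_create_folio() in this patch, that suggestion would look
roughly like this (a sketch of the idea, not a posted change):

	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
				    mapping_min_folio_order(mapping));
	if (!folio)
		return -ENOMEM;

	/* align the insertion index to the folio just allocated */
	index &= ~(folio_nr_pages(folio) - 1UL);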
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> mm/filemap.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 21e1341526ab..e4d46f79e95d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2502,7 +2502,8 @@ static int filemap_create_folio(struct file *file,
> struct folio *folio;
> int error;
>
> - folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> + folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> + mapping_min_folio_order(mapping));
> if (!folio)
> return -ENOMEM;
>
> @@ -3696,7 +3697,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
> repeat:
> folio = filemap_get_folio(mapping, index);
> if (IS_ERR(folio)) {
> - folio = filemap_alloc_folio(gfp, 0);
> + folio = filemap_alloc_folio(gfp,
> + mapping_min_folio_order(mapping));
> if (!folio)
> return ERR_PTR(-ENOMEM);
> err = filemap_add_folio(mapping, folio, index, gfp);
> --
> 2.40.1
>
* [RFC 10/23] filemap: align the index to mapping_min_order in filemap_get_pages()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (8 preceding siblings ...)
2023-09-15 18:38 ` [RFC 09/23] filemap: use mapping_min_order while allocating folios Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 11/23] filemap: align the index to mapping_min_order in do_[a]sync_mmap_readahead Pankaj Raghav
` (14 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Align the index to the mapping_min_order number of pages in
filemap_get_pages().
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
generic/451 triggers a crash in this path for bs = 16k.
mm/filemap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index e4d46f79e95d..8a4bbddcf575 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2558,14 +2558,17 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
{
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1UL << min_order;
struct file_ra_state *ra = &filp->f_ra;
- pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
+ pgoff_t index = round_down(iocb->ki_pos >> PAGE_SHIFT, nrpages);
pgoff_t last_index;
struct folio *folio;
int err = 0;
/* "last_index" is the index of the page beyond the end of the read */
last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
+ last_index = round_up(last_index, nrpages);
retry:
if (fatal_signal_pending(current))
return -EINTR;
@@ -2581,8 +2584,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
if (!folio_batch_count(fbatch)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
return -EAGAIN;
- err = filemap_create_folio(filp, mapping,
- iocb->ki_pos >> PAGE_SHIFT, fbatch);
+ err = filemap_create_folio(filp, mapping, index, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
--
2.40.1
* [RFC 11/23] filemap: align the index to mapping_min_order in do_[a]sync_mmap_readahead
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (9 preceding siblings ...)
2023-09-15 18:38 ` [RFC 10/23] filemap: align the index to mapping_min_order in filemap_get_pages() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 12/23] filemap: align index to mapping_min_order in filemap_fault() Pankaj Raghav
` (13 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Align the index to the mapping_min_order number of pages in
do_[a]sync_mmap_readahead().
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/filemap.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a4bbddcf575..3853df90f9cf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3164,7 +3164,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
struct file *file = vmf->vma->vm_file;
struct file_ra_state *ra = &file->f_ra;
struct address_space *mapping = file->f_mapping;
- DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
+ int order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1U << order;
+ pgoff_t index = round_down(vmf->pgoff, nrpages);
+ DEFINE_READAHEAD(ractl, file, ra, mapping, index);
struct file *fpin = NULL;
unsigned long vm_flags = vmf->vma->vm_flags;
unsigned int mmap_miss;
@@ -3216,10 +3219,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
*/
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+ ra->start = round_down(ra->start, nrpages);
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
ractl._index = ra->start;
- page_cache_ra_order(&ractl, ra, 0);
+ page_cache_ra_order(&ractl, ra, order);
return fpin;
}
@@ -3233,7 +3237,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
{
struct file *file = vmf->vma->vm_file;
struct file_ra_state *ra = &file->f_ra;
- DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
+ int order = mapping_min_folio_order(file->f_mapping);
+ unsigned int nrpages = 1U << order;
+ pgoff_t index = round_down(vmf->pgoff, nrpages);
+ DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, index);
struct file *fpin = NULL;
unsigned int mmap_miss;
--
2.40.1
* [RFC 12/23] filemap: align index to mapping_min_order in filemap_fault()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (10 preceding siblings ...)
2023-09-15 18:38 ` [RFC 11/23] filemap: align the index to mapping_min_order in do_[a]sync_mmap_readahead Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 13/23] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav
` (12 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Align the indices to the mapping_min_order number of pages in
filemap_fault().
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/filemap.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 3853df90f9cf..f97099de80b3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3288,13 +3288,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
struct file *file = vmf->vma->vm_file;
struct file *fpin = NULL;
struct address_space *mapping = file->f_mapping;
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1UL << min_order;
struct inode *inode = mapping->host;
- pgoff_t max_idx, index = vmf->pgoff;
+ pgoff_t max_idx, index = round_down(vmf->pgoff, nrpages);
struct folio *folio;
vm_fault_t ret = 0;
bool mapping_locked = false;
max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ max_idx = round_up(max_idx, nrpages);
+
if (unlikely(index >= max_idx))
return VM_FAULT_SIGBUS;
@@ -3386,13 +3390,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* We must recheck i_size under page lock.
*/
max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ max_idx = round_up(max_idx, nrpages);
+
if (unlikely(index >= max_idx)) {
folio_unlock(folio);
folio_put(folio);
return VM_FAULT_SIGBUS;
}
- vmf->page = folio_file_page(folio, index);
+ VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
+
+ vmf->page = folio_file_page(folio, vmf->pgoff);
return ret | VM_FAULT_LOCKED;
page_not_uptodate:
--
2.40.1
* [RFC 13/23] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (11 preceding siblings ...)
2023-09-15 18:38 ` [RFC 12/23] filemap: align index to mapping_min_order in filemap_fault() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 14/23] readahead: allocate folios with mapping_min_order in ra_unbounded() Pankaj Raghav
` (11 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Set file_ra_state->ra_pages in file_ra_state_init() to at least
mapping_min_order pages if bdi->ra_pages is less than that.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/readahead.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/readahead.c b/mm/readahead.c
index ef3b23a41973..5c4e7ee64dc1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -138,7 +138,13 @@
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
+ unsigned int order = mapping_min_folio_order(mapping);
+ unsigned int min_nrpages = 1U << order;
+ unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
+
ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
+ if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
+ ra->ra_pages = min_nrpages;
ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
--
2.40.1
* [RFC 14/23] readahead: allocate folios with mapping_min_order in ra_unbounded()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (12 preceding siblings ...)
2023-09-15 18:38 ` [RFC 13/23] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 15/23] readahead: align with mapping_min_order in force_page_cache_ra() Pankaj Raghav
` (10 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Allocate folios of order mapping_min_order in page_cache_ra_unbounded(). Also
adjust the accounting in the loop to advance by folio_nr_pages().
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/readahead.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 5c4e7ee64dc1..2a9e9020b7cf 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -250,7 +250,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
continue;
}
- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask,
+ mapping_min_folio_order(mapping));
if (!folio)
break;
if (filemap_add_folio(mapping, folio, index + i,
@@ -264,7 +265,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
if (i == nr_to_read - lookahead_size)
folio_set_readahead(folio);
ractl->_workingset |= folio_test_workingset(folio);
- ractl->_nr_pages++;
+ ractl->_nr_pages += folio_nr_pages(folio);
+ i += folio_nr_pages(folio) - 1;
}
/*
--
2.40.1
* [RFC 15/23] readahead: align with mapping_min_order in force_page_cache_ra()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (13 preceding siblings ...)
2023-09-15 18:38 ` [RFC 14/23] readahead: allocate folios with mapping_min_order in ra_unbounded() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 16/23] readahead: add folio with at least mapping_min_order in page_cache_ra_order Pankaj Raghav
` (9 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Align the index to mapping_min_order in force_page_cache_ra(). This ensures
that the folios allocated for readahead and added to the page cache are
aligned to mapping_min_order.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/readahead.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/readahead.c b/mm/readahead.c
index 2a9e9020b7cf..838dd9ca8dad 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -318,6 +318,8 @@ void force_page_cache_ra(struct readahead_control *ractl,
struct file_ra_state *ra = ractl->ra;
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
unsigned long max_pages, index;
+ unsigned int folio_order = mapping_min_folio_order(mapping);
+ unsigned int nr_of_pages = (1 << folio_order);
if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
return;
@@ -327,6 +329,13 @@ void force_page_cache_ra(struct readahead_control *ractl,
* be up to the optimal hardware IO size
*/
index = readahead_index(ractl);
+ if (folio_order && (index & (nr_of_pages - 1))) {
+ unsigned long old_index = index;
+
+ index = round_down(index, nr_of_pages);
+ nr_to_read += (old_index - index);
+ }
+
max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
while (nr_to_read) {
@@ -335,6 +344,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
ractl->_index = index;
+ VM_BUG_ON(index & (nr_of_pages - 1));
do_page_cache_ra(ractl, this_chunk, 0);
index += this_chunk;
--
2.40.1
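A minimal standalone sketch of the index alignment above (userspace C with a
simplified round_down() stand-in for the kernel macro; the readahead numbers
are illustrative):

#include <stdio.h>

#define round_down(x, y)	((x) & ~((y) - 1))	/* power-of-two y only */

int main(void)
{
	unsigned int folio_order = 2;		/* e.g. 16k folios on 4k pages */
	unsigned long nr_of_pages = 1UL << folio_order;
	unsigned long index = 13;		/* misaligned readahead index */
	unsigned long nr_to_read = 8;

	if (folio_order && (index & (nr_of_pages - 1))) {
		unsigned long old_index = index;

		index = round_down(index, nr_of_pages);
		/* grow the request so it still covers the original range */
		nr_to_read += old_index - index;
	}

	printf("index=%lu nr_to_read=%lu\n", index, nr_to_read);	/* 12 and 9 */
	return 0;
}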
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 16/23] readahead: add folio with at least mapping_min_order in page_cache_ra_order
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (14 preceding siblings ...)
2023-09-15 18:38 ` [RFC 15/23] readahead: align with mapping_min_order in force_page_cache_ra() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 17/23] readahead: set the minimum ra size in get_(init|next)_ra Pankaj Raghav
` (8 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Set the folio order to at least mapping_min_order before calling
ra_alloc_folio().
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/readahead.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 838dd9ca8dad..fb5ff180c39e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -506,6 +506,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
{
struct address_space *mapping = ractl->mapping;
pgoff_t index = readahead_index(ractl);
+ unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
pgoff_t mark = index + ra->size - ra->async_size;
int err = 0;
@@ -535,10 +536,16 @@ void page_cache_ra_order(struct readahead_control *ractl,
order = 0;
}
/* Don't allocate pages past EOF */
- while (index + (1UL << order) - 1 > limit) {
+ while (order > min_order && index + (1UL << order) - 1 > limit) {
if (--order == 1)
order = 0;
}
+
+ if (order < min_order)
+ order = min_order;
+
+ VM_BUG_ON(index & ((1UL << order) - 1));
+
err = ra_alloc_folio(ractl, index, mark, order, gfp);
if (err)
break;
--
2.40.1
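A minimal standalone sketch of the order clamping above (userspace C; only the
arithmetic is mirrored, the allocation itself is elided and the values are
illustrative):

#include <stdio.h>

static unsigned int clamp_order(unsigned int order, unsigned int min_order,
				unsigned long index, unsigned long limit)
{
	/* don't allocate pages past EOF, but never go below the minimum order */
	while (order > min_order && index + (1UL << order) - 1 > limit) {
		if (--order == 1)
			order = 0;
	}

	if (order < min_order)
		order = min_order;

	return order;
}

int main(void)
{
	/* limit (last valid page index) is 17; an order-4 folio at 16 overruns it */
	printf("%u\n", clamp_order(4, 0, 16, 17));	/* 0: shrunk for EOF */
	printf("%u\n", clamp_order(4, 2, 16, 17));	/* 2: floored at min_order */
	return 0;
}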
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 17/23] readahead: set the minimum ra size in get_(init|next)_ra
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (15 preceding siblings ...)
2023-09-15 18:38 ` [RFC 16/23] readahead: add folio with at least mapping_min_order in page_cache_ra_order Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 18/23] readahead: align ra start and size to mapping_min_order in ondemand_ra() Pankaj Raghav
` (7 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Make sure the minimum ra size is based on mapping_min_order in
get_init_ra_size() and get_next_ra_size(). If the requested ra size is
greater than mapping_min_order worth of pages, align it to a multiple of
mapping_min_order pages.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/readahead.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index fb5ff180c39e..7c2660815a01 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -357,9 +357,17 @@ void force_page_cache_ra(struct readahead_control *ractl,
* for small size, x 4 for medium, and x 2 for large
* for 128k (32 page) max ra
* 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
+ *
+ * For higher order address space requirements we ensure no initial reads
+ * are ever less than the min number of pages required.
+ *
+ * We *always* cap the max io size allowed by the device.
*/
-static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
+static unsigned long get_init_ra_size(unsigned long size,
+ unsigned int min_order,
+ unsigned long max)
{
+ unsigned int min_nrpages = 1UL << min_order;
unsigned long newsize = roundup_pow_of_two(size);
if (newsize <= max / 32)
@@ -369,6 +377,15 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
else
newsize = max;
+ if (newsize < min_nrpages) {
+ if (min_nrpages <= max)
+ newsize = min_nrpages;
+ else
+ newsize = round_up(max, min_nrpages);
+ }
+
+ VM_BUG_ON(newsize & (min_nrpages - 1));
+
return newsize;
}
@@ -377,14 +394,19 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
* return it as the new window size.
*/
static unsigned long get_next_ra_size(struct file_ra_state *ra,
+ unsigned int min_order,
unsigned long max)
{
- unsigned long cur = ra->size;
+ unsigned int min_nrpages = 1UL << min_order;
+ unsigned long cur = max(ra->size, min_nrpages);
+
+ cur = round_down(cur, min_nrpages);
if (cur < max / 16)
return 4 * cur;
if (cur <= max / 2)
return 2 * cur;
+
return max;
}
--
2.40.1
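A minimal standalone sketch of the modified get_init_ra_size() logic above
(userspace C; roundup_pow_of_two() and round_up() are simplified stand-ins for
the kernel helpers, and the middle scaling branches, which the hunk elides,
are reproduced from the upstream helper; the numbers are illustrative):

#include <stdio.h>

static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)	/* power-of-two y only */

static unsigned long init_ra_size(unsigned long size, unsigned int min_order,
				  unsigned long max)
{
	unsigned long min_nrpages = 1UL << min_order;
	unsigned long newsize = roundup_pow_of_two(size);

	if (newsize <= max / 32)
		newsize = newsize * 4;
	else if (newsize <= max / 4)
		newsize = newsize * 2;
	else
		newsize = max;

	/* never hand out an initial window below the minimum folio size */
	if (newsize < min_nrpages) {
		if (min_nrpages <= max)
			newsize = min_nrpages;
		else
			newsize = round_up(max, min_nrpages);
	}
	return newsize;
}

int main(void)
{
	/* 32-page (128k) max window, 4k pages */
	printf("%lu\n", init_ra_size(2, 2, 32));	/* 4: heuristic already aligned */
	printf("%lu\n", init_ra_size(1, 4, 32));	/* 16: clamped up to min_nrpages */
	return 0;
}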
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 18/23] readahead: align ra start and size to mapping_min_order in ondemand_ra()
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (16 preceding siblings ...)
2023-09-15 18:38 ` [RFC 17/23] readahead: set the minimum ra size in get_(init|next)_ra Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 19/23] truncate: align index to mapping_min_order Pankaj Raghav
` (6 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Align ra->start and ra->size to mapping_min_order in
ondemand_readahead(). This ensures the folios added to the page cache
are aligned to mapping_min_order number of pages.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/readahead.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 7c2660815a01..03fa6f6c8145 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -605,7 +605,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
unsigned long add_pages;
pgoff_t index = readahead_index(ractl);
pgoff_t expected, prev_index;
- unsigned int order = folio ? folio_order(folio) : 0;
+ unsigned int min_order = mapping_min_folio_order(ractl->mapping);
+ unsigned int min_nrpages = 1UL << min_order;
+ unsigned int order = folio ? folio_order(folio) : min_order;
+
+ VM_BUG_ON(ractl->_index & (min_nrpages - 1));
/*
* If the request exceeds the readahead window, allow the read to
@@ -627,9 +631,13 @@ static void ondemand_readahead(struct readahead_control *ractl,
expected = round_up(ra->start + ra->size - ra->async_size,
1UL << order);
if (index == expected || index == (ra->start + ra->size)) {
- ra->start += ra->size;
- ra->size = get_next_ra_size(ra, max_pages);
+ ra->start += round_down(ra->size, min_nrpages);
+ ra->size = get_next_ra_size(ra, min_order, max_pages);
ra->async_size = ra->size;
+
+ VM_BUG_ON(ra->size & ((1UL << min_order) - 1));
+ VM_BUG_ON(ra->start & ((1UL << min_order) - 1));
+
goto readit;
}
@@ -647,13 +655,19 @@ static void ondemand_readahead(struct readahead_control *ractl,
max_pages);
rcu_read_unlock();
+ start = round_down(start, min_nrpages);
+
+ VM_BUG_ON(start & (min_nrpages - 1));
+ VM_BUG_ON(folio->index & (folio_nr_pages(folio) - 1));
+
if (!start || start - index > max_pages)
return;
ra->start = start;
ra->size = start - index; /* old async_size */
- ra->size += req_size;
- ra->size = get_next_ra_size(ra, max_pages);
+ VM_BUG_ON(ra->size & (min_nrpages - 1));
+ ra->size += round_up(req_size, min_nrpages);
+ ra->size = get_next_ra_size(ra, min_order, max_pages);
ra->async_size = ra->size;
goto readit;
}
@@ -690,7 +704,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
initial_readahead:
ra->start = index;
- ra->size = get_init_ra_size(req_size, max_pages);
+ ra->size = get_init_ra_size(req_size, min_order, max_pages);
ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
readit:
@@ -701,7 +715,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
* Take care of maximum IO pages as above.
*/
if (index == ra->start && ra->size == ra->async_size) {
- add_pages = get_next_ra_size(ra, max_pages);
+ add_pages = get_next_ra_size(ra, min_order, max_pages);
if (ra->size + add_pages <= max_pages) {
ra->async_size = add_pages;
ra->size += add_pages;
@@ -712,6 +726,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
}
ractl->_index = ra->start;
+ VM_BUG_ON(ractl->_index & (min_nrpages - 1));
page_cache_ra_order(ractl, ra, order);
}
--
2.40.1
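A minimal standalone sketch of the sequential-hit window advance above
(userspace C; the sizing helper mirrors the get_next_ra_size() hunk from the
previous patch, round_down() is a simplified stand-in for the kernel macro,
and the numbers are illustrative):

#include <stdio.h>

#define round_down(x, y)	((x) & ~((y) - 1))	/* power-of-two y only */

/* mirrors the get_next_ra_size() hunk from the previous patch */
static unsigned long next_ra_size(unsigned long size, unsigned int min_order,
				  unsigned long max)
{
	unsigned long min_nrpages = 1UL << min_order;
	unsigned long cur = size > min_nrpages ? size : min_nrpages;

	cur = round_down(cur, min_nrpages);
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}

int main(void)
{
	unsigned int min_order = 2;		/* 16k folios on 4k pages */
	unsigned long min_nrpages = 1UL << min_order;
	unsigned long max_pages = 128;
	unsigned long ra_start = 20, ra_size = 8;

	/* sequential hit: advance the window by an aligned amount, then grow it */
	ra_start += round_down(ra_size, min_nrpages);
	ra_size = next_ra_size(ra_size, min_order, max_pages);

	printf("start=%lu size=%lu\n", ra_start, ra_size);	/* start=28 size=16 */
	return 0;
}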
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 19/23] truncate: align index to mapping_min_order
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (17 preceding siblings ...)
2023-09-15 18:38 ` [RFC 18/23] readahead: align ra start and size to mapping_min_order in ondemand_ra() Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 20/23] mm: round down folio split requirements Pankaj Raghav
` (5 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
Align indices to mapping_min_order in invalidate_inode_pages2_range(),
mapping_try_invalidate() and truncate_inode_pages_range(). This is
necessary to keep the folios added to the page cache aligned with
mapping_min_order.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/truncate.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index 8e3aa9e8618e..d5ce8e30df70 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -337,6 +337,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
int i;
struct folio *folio;
bool same_folio;
+ unsigned int order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1U << order;
if (mapping_empty(mapping))
return;
@@ -347,7 +349,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
* start of the range and 'partial_end' at the end of the range.
* Note that 'end' is exclusive while 'lend' is inclusive.
*/
- start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ start = (lstart + (nrpages * PAGE_SIZE) - 1) >> PAGE_SHIFT;
+ start = round_down(start, nrpages);
+
if (lend == -1)
/*
* lend == -1 indicates end-of-file so we have to set 'end'
@@ -356,7 +360,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
*/
end = -1;
else
- end = (lend + 1) >> PAGE_SHIFT;
+ end = round_down((lend + 1) >> PAGE_SHIFT, nrpages);
folio_batch_init(&fbatch);
index = start;
@@ -372,8 +376,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
cond_resched();
}
- same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
- folio = __filemap_get_folio(mapping, lstart >> PAGE_SHIFT, FGP_LOCK, 0);
+ same_folio = round_down(lstart >> PAGE_SHIFT, nrpages) ==
+ round_down(lend >> PAGE_SHIFT, nrpages);
+ folio = __filemap_get_folio(mapping, start, FGP_LOCK, 0);
if (!IS_ERR(folio)) {
same_folio = lend < folio_pos(folio) + folio_size(folio);
if (!truncate_inode_partial_folio(folio, lstart, lend)) {
@@ -387,7 +392,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
}
if (!same_folio) {
- folio = __filemap_get_folio(mapping, lend >> PAGE_SHIFT,
+ folio = __filemap_get_folio(mapping,
+ round_down(lend >> PAGE_SHIFT, nrpages),
FGP_LOCK, 0);
if (!IS_ERR(folio)) {
if (!truncate_inode_partial_folio(folio, lstart, lend))
@@ -497,15 +503,18 @@ EXPORT_SYMBOL(truncate_inode_pages_final);
unsigned long mapping_try_invalidate(struct address_space *mapping,
pgoff_t start, pgoff_t end, unsigned long *nr_failed)
{
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1UL << min_order;
pgoff_t indices[PAGEVEC_SIZE];
struct folio_batch fbatch;
- pgoff_t index = start;
+ pgoff_t index = round_up(start, nrpages);
+ pgoff_t end_idx = round_down(end, nrpages);
unsigned long ret;
unsigned long count = 0;
int i;
folio_batch_init(&fbatch);
- while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
+ while (find_lock_entries(mapping, &index, end_idx, &fbatch, indices)) {
for (i = 0; i < folio_batch_count(&fbatch); i++) {
struct folio *folio = fbatch.folios[i];
@@ -618,9 +627,11 @@ static int folio_launder(struct address_space *mapping, struct folio *folio)
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
+ unsigned int min_order = mapping_min_folio_order(mapping);
+ unsigned int nrpages = 1UL << min_order;
pgoff_t indices[PAGEVEC_SIZE];
struct folio_batch fbatch;
- pgoff_t index;
+ pgoff_t index, end_idx;
int i;
int ret = 0;
int ret2 = 0;
@@ -630,8 +641,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
return 0;
folio_batch_init(&fbatch);
- index = start;
- while (find_get_entries(mapping, &index, end, &fbatch, indices)) {
+ index = round_up(start, nrpages);
+ end_idx = round_down(end, nrpages);
+ while (find_get_entries(mapping, &index, end_idx, &fbatch, indices)) {
for (i = 0; i < folio_batch_count(&fbatch); i++) {
struct folio *folio = fbatch.folios[i];
@@ -660,6 +672,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
continue;
}
VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
+ VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
+ VM_BUG_ON_FOLIO(folio->index & (nrpages - 1), folio);
folio_wait_writeback(folio);
if (folio_mapped(folio))
--
2.40.1
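A minimal standalone sketch of the truncate range alignment above (userspace
C, assuming 4k pages and a minimum folio order of 2; round_down() is a
simplified stand-in for the kernel macro and the byte offsets are made up):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define round_down(x, y)	((x) & ~((y) - 1))	/* power-of-two y only */

int main(void)
{
	unsigned int order = 2;
	unsigned long nrpages = 1UL << order;
	long long lstart = 6 * PAGE_SIZE + 123;	/* byte offsets into the file */
	long long lend = 21 * PAGE_SIZE;
	unsigned long start, end;

	/* first whole page index, then rounded down to a min-folio boundary */
	start = (lstart + (nrpages * PAGE_SIZE) - 1) >> PAGE_SHIFT;
	start = round_down(start, nrpages);

	/* exclusive end index, rounded down to a min-folio boundary */
	end = round_down((lend + 1) >> PAGE_SHIFT, nrpages);

	printf("start=%lu end=%lu\n", start, end);	/* start=8 end=20 */
	return 0;
}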
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 20/23] mm: round down folio split requirements
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (18 preceding siblings ...)
2023-09-15 18:38 ` [RFC 19/23] truncate: align index to mapping_min_order Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 21/23] xfs: expose block size in stat Pankaj Raghav
` (4 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Luis Chamberlain <mcgrof@kernel.org>
When we truncate we always check whether we can split a large folio. We
do this by checking that the userspace-mapped pages match
folio_nr_pages() - 1, but if we are using a filesystem or a block device
which has a min order, that order must be respected, and we should only
split after rounding down to the min order page requirements.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
mm/huge_memory.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f899b3500419..e608a805c79f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2617,16 +2617,24 @@ static void __split_huge_page(struct page *page, struct list_head *list,
bool can_split_folio(struct folio *folio, int *pextra_pins)
{
int extra_pins;
+ unsigned int min_order = 0;
+ unsigned int nrpages;
/* Additional pins from page cache */
- if (folio_test_anon(folio))
+ if (folio_test_anon(folio)) {
extra_pins = folio_test_swapcache(folio) ?
folio_nr_pages(folio) : 0;
- else
+ } else {
extra_pins = folio_nr_pages(folio);
+ if (folio->mapping)
+ min_order = mapping_min_folio_order(folio->mapping);
+ }
+
+ nrpages = 1UL << min_order;
+
if (pextra_pins)
*pextra_pins = extra_pins;
- return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1;
+ return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - nrpages;
}
/*
--
2.40.1
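A minimal standalone sketch of the adjusted pin accounting above (userspace C;
the reference counts are made up and only the comparison is mirrored, not the
kernel helpers):

#include <stdio.h>
#include <stdbool.h>

static bool can_split(unsigned long mapcount, unsigned long refcount,
		      unsigned long extra_pins, unsigned int min_order)
{
	unsigned long nrpages = 1UL << min_order;

	/* only the expected min-order worth of page cache references may remain */
	return mapcount == refcount - extra_pins - nrpages;
}

int main(void)
{
	/* a file-backed folio with extra_pins == folio_nr_pages(folio) == 16 */
	printf("%d\n", can_split(0, 17, 16, 0));	/* 1: order-0 minimum */
	printf("%d\n", can_split(0, 20, 16, 2));	/* 1: 4 expected references */
	printf("%d\n", can_split(0, 17, 16, 2));	/* 0: unexpected pin count */
	return 0;
}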
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 21/23] xfs: expose block size in stat
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (19 preceding siblings ...)
2023-09-15 18:38 ` [RFC 20/23] mm: round down folio split requirements Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 22/23] xfs: enable block size larger than page size support Pankaj Raghav
` (3 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev, Dave Chinner
From: Dave Chinner <dchinner@redhat.com>
For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[mcgrof: forward rebase in consideration for commit
dd2d535e3fb29d ("xfs: cleanup calculating the stat optimal I/O size")]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/xfs/xfs_iops.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2ededd3f6b8c..080a79a81c46 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -515,6 +515,8 @@ xfs_stat_blksize(
struct xfs_inode *ip)
{
struct xfs_mount *mp = ip->i_mount;
+ unsigned long default_size = max_t(unsigned long, PAGE_SIZE,
+ mp->m_sb.sb_blocksize);
/*
* If the file blocks are being allocated from a realtime volume, then
@@ -543,7 +545,7 @@ xfs_stat_blksize(
return 1U << mp->m_allocsize_log;
}
- return PAGE_SIZE;
+ return default_size;
}
STATIC int
--
2.40.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 22/23] xfs: enable block size larger than page size support
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (20 preceding siblings ...)
2023-09-15 18:38 ` [RFC 21/23] xfs: expose block size in stat Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 23/23] xfs: set minimum order folio for page cache based on blocksize Pankaj Raghav
` (2 subsequent siblings)
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Currently we don't support a blocksize that is exactly twice the page size,
due to the requirement that a large folio contain at least three pages[1].
[1] https://lore.kernel.org/all/ZH0GvxAdw1RO2Shr@casper.infradead.org/
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_mount.c | 9 +++++++--
fs/xfs/xfs_super.c | 7 ++-----
2 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index aed5be5508fe..4272898c508a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,11 +131,16 @@ xfs_sb_validate_fsb_count(
xfs_sb_t *sbp,
uint64_t nblocks)
{
- ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
ASSERT(sbp->sb_blocklog >= BBSHIFT);
+ unsigned long mapping_count;
+
+ if (sbp->sb_blocklog <= PAGE_SHIFT)
+ mapping_count = nblocks >> (PAGE_SHIFT - sbp->sb_blocklog);
+ else
+ mapping_count = nblocks << (sbp->sb_blocklog - PAGE_SHIFT);
/* Limited by ULONG_MAX of page cache index */
- if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+ if (mapping_count > ULONG_MAX)
return -EFBIG;
return 0;
}
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1f77014c6e1a..75bf4d23051c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1651,13 +1651,10 @@ xfs_fs_fill_super(
goto out_free_sb;
}
- /*
- * Until this is fixed only page-sized or smaller data blocks work.
- */
- if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
+ if (mp->m_sb.sb_blocksize == (2 * PAGE_SIZE)) {
xfs_warn(mp,
"File system with blocksize %d bytes. "
- "Only pagesize (%ld) or less will currently work.",
+ "Blocksize that is twice the pagesize %ld does not currently work.",
mp->m_sb.sb_blocksize, PAGE_SIZE);
error = -ENOSYS;
goto out_free_sb;
--
2.40.1
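A minimal standalone sketch of the xfs_sb_validate_fsb_count() arithmetic
above (userspace C, assuming 4k pages; the block counts are illustrative):

#include <stdio.h>

#define PAGE_SHIFT 12

/* page cache pages needed to map nblocks filesystem blocks */
static unsigned long long mapping_count(unsigned int blocklog,
					unsigned long long nblocks)
{
	if (blocklog <= PAGE_SHIFT)
		return nblocks >> (PAGE_SHIFT - blocklog);
	return nblocks << (blocklog - PAGE_SHIFT);
}

int main(void)
{
	/* 1k blocks (blocklog 10): 8 blocks fit in 2 pages */
	printf("%llu\n", mapping_count(10, 8));		/* 2 */
	/* 64k blocks (blocklog 16): 8 blocks need 128 pages */
	printf("%llu\n", mapping_count(16, 8));		/* 128 */
	return 0;
}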
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [RFC 23/23] xfs: set minimum order folio for page cache based on blocksize
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (21 preceding siblings ...)
2023-09-15 18:38 ` [RFC 22/23] xfs: enable block size larger than page size support Pankaj Raghav
@ 2023-09-15 18:38 ` Pankaj Raghav
2023-09-15 18:50 ` [RFC 00/23] Enable block size > page size in XFS Matthew Wilcox
2023-09-17 22:05 ` Dave Chinner
24 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, mcgrof, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Enabling a block size > PAGE_SIZE is only possible if we can
ensure that the filesystem allocations for the block size are treated
atomically, and we do this with the minimum order folio requirement for
the inode. This allows the page cache to treat this inode atomically
even if the block layer may treat it separately.
For instance, on x86 this enables eventual usage of block sizes > 4k
so long as you use a sector size of 4k.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/xfs/xfs_icache.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index aacc7eec2497..81f07503f5ca 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -73,6 +73,7 @@ xfs_inode_alloc(
xfs_ino_t ino)
{
struct xfs_inode *ip;
+ int min_order = 0;
/*
* XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
@@ -88,7 +89,8 @@ xfs_inode_alloc(
/* VFS doesn't initialise i_mode or i_state! */
VFS_I(ip)->i_mode = 0;
VFS_I(ip)->i_state = 0;
- mapping_set_large_folios(VFS_I(ip)->i_mapping);
+ min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+ mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
XFS_STATS_INC(mp, vn_active);
ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -313,6 +315,7 @@ xfs_reinit_inode(
dev_t dev = inode->i_rdev;
kuid_t uid = inode->i_uid;
kgid_t gid = inode->i_gid;
+ int min_order = 0;
error = inode_init_always(mp->m_super, inode);
@@ -323,7 +326,8 @@ xfs_reinit_inode(
inode->i_rdev = dev;
inode->i_uid = uid;
inode->i_gid = gid;
- mapping_set_large_folios(inode->i_mapping);
+ min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+ mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);
return error;
}
--
2.40.1
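A minimal standalone sketch of the min_order calculation above (userspace C,
assuming 4k pages; ilog2() here is a simple stand-in for the kernel helper):

#include <stdio.h>

#define PAGE_SHIFT 12

static int ilog2(unsigned long n)
{
	int l = -1;

	while (n) {
		n >>= 1;
		l++;
	}
	return l;
}

/* minimum folio order the inode's mapping needs for a given block size */
static int min_folio_order(unsigned long blocksize)
{
	int min_order = 0;
	int diff = ilog2(blocksize) - PAGE_SHIFT;

	return diff > min_order ? diff : min_order;
}

int main(void)
{
	printf("%d\n", min_folio_order(4096));	/* 0: bs == ps, no constraint */
	printf("%d\n", min_folio_order(16384));	/* 2: 16k blocks need order-2 folios */
	printf("%d\n", min_folio_order(65536));	/* 4: 64k blocks need order-4 folios */
	return 0;
}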
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (22 preceding siblings ...)
2023-09-15 18:38 ` [RFC 23/23] xfs: set minimum order folio for page cache based on blocksize Pankaj Raghav
@ 2023-09-15 18:50 ` Matthew Wilcox
2023-09-18 12:35 ` Pankaj Raghav
2023-09-17 22:05 ` Dave Chinner
24 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-09-15 18:50 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, david, da.gomez, akpm,
linux-kernel, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> Only XFS was enabled and tested as a part of this series as it has
> supported block sizes up to 64k and sector sizes up to 32k for years.
> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
> that doesn't depend on buffer-heads and support larger block sizes
> already should be able to leverage this effort to also support LBS,
> bs > ps.
I think you should choose whether you're going to use 'bs > ps' or LBS
and stick to it. They're both pretty inscrutable and using both
interchangeably is worse.
But I think filesystems which use buffer_heads should be fine to support
bs > ps. The problems with the buffer cache are really when you try to
support small block sizes and large folio sizes (eg arrays of bhs on
the stack). Supporting bs == folio_size shouldn't be a problem.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-15 18:50 ` [RFC 00/23] Enable block size > page size in XFS Matthew Wilcox
@ 2023-09-18 12:35 ` Pankaj Raghav
0 siblings, 0 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-18 12:35 UTC (permalink / raw)
To: Matthew Wilcox, Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, david, da.gomez, akpm, linux-kernel,
djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On 2023-09-15 20:50, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
>> Only XFS was enabled and tested as a part of this series as it has
>> supported block sizes up to 64k and sector sizes up to 32k for years.
>> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
>> that doesn't depend on buffer-heads and support larger block sizes
>> already should be able to leverage this effort to also support LBS,
>> bs > ps.
>
> I think you should choose whether you're going to use 'bs > ps' or LBS
> and stick to it. They're both pretty inscrutable and using both
> interchanagbly is worse.
>
Got it! I will probably stick to "large block size" and explain what it means
at the start of the patchset.
> But I think filesystems which use buffer_heads should be fine to support
> bs > ps. The problems with the buffer cache are really when you try to
> support small block sizes and large folio sizes (eg arrays of bhs on
> the stack). Supporting bs == folio_size shouldn't be a problem.
>
I remember some patches from you trying to avoid the stack limitation while working
with bh. Thanks for the clarification!
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
` (23 preceding siblings ...)
2023-09-15 18:50 ` [RFC 00/23] Enable block size > page size in XFS Matthew Wilcox
@ 2023-09-17 22:05 ` Dave Chinner
2023-09-18 2:04 ` Luis Chamberlain
24 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2023-09-17 22:05 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-fsdevel, p.raghav, da.gomez, akpm, linux-kernel,
willy, djwong, linux-mm, chandan.babu, mcgrof, gost.dev
On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There have been efforts over the last 16 years to enable Large
> Block Sizes (LBS), that is, block sizes in filesystems where bs > page
> size [1] [2]. Through these efforts we have learned that one of the
> main blockers to supporting bs > ps in filesystems has been a way to
> allocate pages that are at least the filesystem block size in the page
> cache where bs > ps [3]. Another blocker was the changes needed in
> filesystems due to buffer-heads. Thanks to these previous efforts, the
> surgery by Matthew Wilcox in the page cache for adopting xarray's
> multi-index support, and iomap support, supporting bs > ps in XFS is
> possible with only a few lines of change to XFS. Most of the changes are
> to the page cache to support minimum order folio support for the target
> block size on the filesystem.
>
> A new motivation for LBS today is to support high-capacity (large amount
> of Terabytes) QLC SSDs where the internal Indirection Unit (IU) are
> typically greater than 4k [4] to help reduce DRAM and so in turn cost
> and space. In practice this then allows different architectures to use a
> base page size of 4k while still enabling support for block sizes
> aligned to the larger IUs by relying on high order folios on the page
> cache when needed. It also enables to take advantage of these same
> drive's support for larger atomics than 4k with buffered IO support in
> Linux. As described this year at LSFMM, supporting large atomics greater
> than 4k enables databases to remove the need to rely on their own
> journaling, so they can disable double buffered writes [5], which is a
> feature different cloud providers are already innovating and enabling
> customers for through custom storage solutions.
>
> This series still needs some polishing and fixing some crashes, but it is
> mainly targeted to get initial feedback from the community, enable initial
> experimentation, hence the RFC. It's being posted now given the results from
> our testing are proving much better results than expected and we hope to
> polish this up together with the community. After all, this has been a 16
> year old effort and none of this could have been possible without that effort.
>
> Implementation:
>
> This series only adds the notion of a minimum order of a folio in the
> page cache that was initially proposed by Willy. The minimum folio order
> requirement is set during inode creation. The minimum order will
> typically correspond to the filesystem block size. The page cache will
> in turn respect the minimum folio order requirement while allocating a
> folio. This series mainly changes the page cache's filemap, readahead, and
> truncation code to allocate and align the folios to the minimum order set for the
> filesystem's inode's respective address space mapping.
>
> Only XFS was enabled and tested as a part of this series as it has
> supported block sizes up to 64k and sector sizes up to 32k for years.
> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
> that doesn't depend on buffer-heads and support larger block sizes
> already should be able to leverage this effort to also support LBS,
> bs > ps.
>
> This also paves the way for supporting block devices where their logical
> block size > page size in the future by leveraging iomap's address space
> operation added to the block device cache by Christoph Hellwig [6]. We
> have work to enable support for this, enabling LBAs > 4k on NVME, and
> at the same time allow coexistence with buffer-heads on the same block
> device so to enable support allow for a drive to use filesystem's to
> switch between filesystem's which may depend on buffer-heads or need the
> iomap address space operations for the block device cache. Patches for
> this will be posted shortly after this patch series.
Do you have a git tree branch that I can pull this from
somewhere?
As it is, I'd really prefer stuff that adds significant XFS
functionality that we need to test to be based on a current Linus
TOT kernel so that we can test it without being impacted by all
the random unrelated breakages that regularly happen in linux-next
kernels....
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-17 22:05 ` Dave Chinner
@ 2023-09-18 2:04 ` Luis Chamberlain
2023-09-18 5:07 ` Dave Chinner
0 siblings, 1 reply; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-18 2:04 UTC (permalink / raw)
To: Dave Chinner
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, da.gomez, akpm,
linux-kernel, willy, djwong, linux-mm, chandan.babu, gost.dev
On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > [...]
>
> Do you have a git tree branch that I can pull this from
> somewhere?
>
> As it is, I'd really prefer stuff that adds significant XFS
> functionality that we need to test to be based on a current Linus
> TOT kernel so that we can test it without being impacted by all
> the random unrelated breakages that regularly happen in linux-next
> kernels....
That's understandable! I just rebased onto Linus' tree; this only
has the bs > ps support on a 4k sector size:
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
I just did a cursory build / boot / fsx test with a 16k block size / 4k sector
size on this tree only. I haven't run fstests on it.
Just a heads up: using a 512 byte sector size will fail for now; it's a
regression we have to fix. Likewise, block sizes of 1k and 2k will also
regress on fsx right now. These are regressions we are aware of but
haven't had time yet to bisect / fix.
Luis
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-18 2:04 ` Luis Chamberlain
@ 2023-09-18 5:07 ` Dave Chinner
2023-09-18 12:29 ` Pankaj Raghav
0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2023-09-18 5:07 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, p.raghav, da.gomez, akpm,
linux-kernel, willy, djwong, linux-mm, chandan.babu, gost.dev
On Sun, Sep 17, 2023 at 07:04:24PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > > [...]
> >
> > Do you have a git tree branch that I can pull this from
> > somewhere?
> >
> > As it is, I'd really prefer stuff that adds significant XFS
> > functionality that we need to test to be based on a current Linus
> > TOT kernel so that we can test it without being impacted by all
> > the random unrelated breakages that regularly happen in linux-next
> > kernels....
>
> That's understandable! I just rebased onto Linus' tree, this only
> has the bs > ps support on 4k sector size:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
> test with this tree only. I havne't ran fstests on it.
W/ 64k block size, generic/042 fails (maybe just a test block size
thing), generic/091 fails (data corruption on read after ~70 ops)
and then generic/095 hung with a crash in iomap_readpage_iter()
during readahead.
Looks like a null folio was passed to ifs_alloc(), which implies the
iomap_readpage_ctx didn't have a folio attached to it. Something
isn't working properly in the readahead code, which would also
explain the quick fsx failure...
> Just a heads up, using 512 byte sector size will fail for now, it's a
> regression we have to fix. Likewise using block sizes 1k, 2k will also
> regress on fsx right now. These are regressions we are aware of but
> haven't had time yet to bisect / fix.
I'm betting that the recently added sub-folio dirty tracking code
got broken by this patchset....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-18 5:07 ` Dave Chinner
@ 2023-09-18 12:29 ` Pankaj Raghav
2023-09-19 11:56 ` Ritesh Harjani
2023-09-21 3:00 ` Luis Chamberlain
0 siblings, 2 replies; 54+ messages in thread
From: Pankaj Raghav @ 2023-09-18 12:29 UTC (permalink / raw)
To: Dave Chinner, Luis Chamberlain
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, da.gomez, akpm,
linux-kernel, willy, djwong, linux-mm, chandan.babu, gost.dev,
riteshh
>>>
>>> As it is, I'd really prefer stuff that adds significant XFS
>>> functionality that we need to test to be based on a current Linus
>>> TOT kernel so that we can test it without being impacted by all
>>> the random unrelated breakages that regularly happen in linux-next
>>> kernels....
>>
>> That's understandable! I just rebased onto Linus' tree, this only
>> has the bs > ps support on 4k sector size:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>
I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
sync with Luis offline regarding that.
>
>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
>> test with this tree only. I havne't ran fstests on it.
>
> W/ 64k block size, generic/042 fails (maybe just a test block size
> thing), generic/091 fails (data corruption on read after ~70 ops)
> and then generic/095 hung with a crash in iomap_readpage_iter()
> during readahead.
>
> Looks like a null folio was passed to ifs_alloc(), which implies the
> iomap_readpage_ctx didn't have a folio attached to it. Something
> isn't working properly in the readahead code, which would also
> explain the quick fsx failure...
>
Yeah, I have noticed this as well. This is the main crash scenario I am noticing
when I am running xfstests, and hopefully we will be able to fix it soon.
In general, we have had better results with 16k block size than 64k block size. I still don't
know why, but the ifs_alloc crash happens in generic/451 with 16k block size.
>> Just a heads up, using 512 byte sector size will fail for now, it's a
>> regression we have to fix. Likewise using block sizes 1k, 2k will also
>> regress on fsx right now. These are regressions we are aware of but
>> haven't had time yet to bisect / fix.
>
> I'm betting that the recently added sub-folio dirty tracking code
> got broken by this patchset....
>
Hmm, this crossed my mind as well. I am assuming I can only really test the sub-folio dirty
tracking code on a system which has a page size greater than the block size? Or are there
some tests that can already exercise this? CCing Ritesh as well.
> Cheers,
>
> Dave.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-18 12:29 ` Pankaj Raghav
@ 2023-09-19 11:56 ` Ritesh Harjani
2023-09-19 21:15 ` Luis Chamberlain
2023-09-21 3:00 ` Luis Chamberlain
1 sibling, 1 reply; 54+ messages in thread
From: Ritesh Harjani @ 2023-09-19 11:56 UTC (permalink / raw)
To: Pankaj Raghav, Dave Chinner, Luis Chamberlain
Cc: Pankaj Raghav, linux-xfs, linux-fsdevel, da.gomez, akpm,
linux-kernel, willy, djwong, linux-mm, chandan.babu, gost.dev,
riteshh
Pankaj Raghav <p.raghav@samsung.com> writes:
>>>>
>>>> As it is, I'd really prefer stuff that adds significant XFS
>>>> functionality that we need to test to be based on a current Linus
>>>> TOT kernel so that we can test it without being impacted by all
>>>> the random unrelated breakages that regularly happen in linux-next
>>>> kernels....
>>>
>>> That's understandable! I just rebased onto Linus' tree, this only
>>> has the bs > ps support on 4k sector size:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>>
>
> I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
> sync with Luis offline regarding that.
>
>>
>>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
>>> test with this tree only. I havne't ran fstests on it.
>>
>> W/ 64k block size, generic/042 fails (maybe just a test block size
>> thing), generic/091 fails (data corruption on read after ~70 ops)
>> and then generic/095 hung with a crash in iomap_readpage_iter()
>> during readahead.
>>
>> Looks like a null folio was passed to ifs_alloc(), which implies the
>> iomap_readpage_ctx didn't have a folio attached to it. Something
>> isn't working properly in the readahead code, which would also
>> explain the quick fsx failure...
>>
>
> Yeah, I have noticed this as well. This is the main crash scenario I am noticing
> when I am running xfstests, and hopefully we will be able to fix it soon.
>
> In general, we have had better results with 16k block size than 64k block size. I still don't
> know why, but the ifs_alloc crash happens in generic/451 with 16k block size.
>
>
>>> Just a heads up, using 512 byte sector size will fail for now, it's a
>>> regression we have to fix. Likewise using block sizes 1k, 2k will also
>>> regress on fsx right now. These are regressions we are aware of but
>>> haven't had time yet to bisect / fix.
>>
>> I'm betting that the recently added sub-folio dirty tracking code
>> got broken by this patchset....
>>
>
> Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
> tracking code on a system which has a page size greater than the block size? Or is there
> some tests that can already test this? CCing Ritesh as well.
>
Sorry, I haven't looked into this series yet. I will spend some time
reading it and will also give fstests a spin.
But to answer your question on how to test the sub-folio dirty
tracking code[1] [2] with XFS: just use blocksize < pagesize in the mkfs
options and run fstests. There are a number of tests which check data
correctness for various types of writes.
1. Test 1k blocksize on a 4k pagesize machine (as long as bs < ps)
2. Test 4k blocksize on a 64k pagesize machine (if you have one) (as long as bs < ps)
3. Or enable large folio support and test bs < ps
(with large folios the system starts instantiating folios > 4k on a 4k
pagesize machine, so the blocksize automatically becomes less than the folio size)
You will need CONFIG_TRANSPARENT_HUGEPAGE to be enabled along with
willy's series which enables large folios in the buffered write path [3].
(This is already in linux 6.6-rc1)
<snip>
/*
* Large folio support currently depends on THP. These dependencies are
* being worked on but are not yet fixed.
*/
static inline bool mapping_large_folio_support(struct address_space *mapping)
{
return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}
<links>
[1]: https://lore.kernel.org/linux-xfs/20230725122932.144426-1-ritesh.list@gmail.com/
[2]:
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=4ce02c67972211be488408c275c8fbf19faf29b3
[3]: https://lore.kernel.org/all/ZLVrEkVU2YCneoXR@casper.infradead.org/
Hope this helps!
-ritesh
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-19 11:56 ` Ritesh Harjani
@ 2023-09-19 21:15 ` Luis Chamberlain
0 siblings, 0 replies; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-19 21:15 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Pankaj Raghav, Dave Chinner, Pankaj Raghav, linux-xfs,
linux-fsdevel, da.gomez, akpm, linux-kernel, willy, djwong,
linux-mm, chandan.babu, gost.dev, riteshh
On Tue, Sep 19, 2023 at 05:26:44PM +0530, Ritesh Harjani wrote:
> Pankaj Raghav <p.raghav@samsung.com> writes:
>
> >>>>
> >>>> As it is, I'd really prefer stuff that adds significant XFS
> >>>> functionality that we need to test to be based on a current Linus
> >>>> TOT kernel so that we can test it without being impacted by all
> >>>> the random unrelated breakages that regularly happen in linux-next
> >>>> kernels....
> >>>
> >>> That's understandable! I just rebased onto Linus' tree, this only
> >>> has the bs > ps support on 4k sector size:
> >>>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
> >>
> >
> > I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
> > sync with Luis offline regarding that.
> >
> >>
> >>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
> >>> test with this tree only. I havne't ran fstests on it.
> >>
> >> W/ 64k block size, generic/042 fails (maybe just a test block size
> >> thing), generic/091 fails (data corruption on read after ~70 ops)
> >> and then generic/095 hung with a crash in iomap_readpage_iter()
> >> during readahead.
> >>
> >> Looks like a null folio was passed to ifs_alloc(), which implies the
> >> iomap_readpage_ctx didn't have a folio attached to it. Something
> >> isn't working properly in the readahead code, which would also
> >> explain the quick fsx failure...
> >>
> >
> > Yeah, I have noticed this as well. This is the main crash scenario I am noticing
> > when I am running xfstests, and hopefully we will be able to fix it soon.
> >
> > In general, we have had better results with 16k block size than 64k block size. I still don't
> > know why, but the ifs_alloc crash happens in generic/451 with 16k block size.
> >
> >
> >>> Just a heads up, using 512 byte sector size will fail for now, it's a
> >>> regression we have to fix. Likewise using block sizes 1k, 2k will also
> >>> regress on fsx right now. These are regressions we are aware of but
> >>> haven't had time yet to bisect / fix.
> >>
> >> I'm betting that the recently added sub-folio dirty tracking code
> >> got broken by this patchset....
> >>
> >
> > Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
> > tracking code on a system which has a page size greater than the block size? Or is there
> > some tests that can already test this? CCing Ritesh as well.
> >
>
> Sorry I haven't yet looked into this series yet. I will spend sometime
> reading it. Will also give a spin to run the fstests.
Ritesh,
You can save yourself time by not testing the patch series with fstests
for block sizes below ps, as we are already aware that a patch in the
series breaks this. We just wanted to get the patch series out early for
review given the progress. There's probably one patch which regresses
this; if each patch regressed it, that would be a bigger issue :P
Luis
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC 00/23] Enable block size > page size in XFS
2023-09-18 12:29 ` Pankaj Raghav
2023-09-19 11:56 ` Ritesh Harjani
@ 2023-09-21 3:00 ` Luis Chamberlain
[not found] ` <ZQvNVAfZMjE3hgmN@bombadil.infradead.org>
1 sibling, 1 reply; 54+ messages in thread
From: Luis Chamberlain @ 2023-09-21 3:00 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Dave Chinner, Pankaj Raghav, linux-xfs, linux-fsdevel, da.gomez,
akpm, linux-kernel, willy, djwong, linux-mm, chandan.babu,
gost.dev, riteshh
On Mon, Sep 18, 2023 at 02:29:22PM +0200, Pankaj Raghav wrote:
> I think this tree doesn't have some of the last minute changes I did
> before I sent the RFC. I will sync with Luis offline regarding that.
OK, we sorted out the small changes, and the patch series as posted is now
rebased onto Linus' v6.6-rc2 and available here, for those who want more
stability than the wild wild linux-next:
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev
If you want to muck with the coexistence stuff, which you will need in
order to actually use an LBS device, use this patch series plus the
coexistence patches:
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
Given this is a fresh rebase, I started running fsx on the nobdev branch
which only has this series and managed to get fsx ops up to over 1 million
for:
512 sector size:
* 16k block size
* 32k block size
* 64k block size
4k sector size:
* 16k block size
* 32k block size
* 64k block size
It's at least enough of a cursory test to git push it. I haven't yet tested
the second branch I pushed, but it applied without any changes
so it should be good (usual famous last words).
Luis
^ permalink raw reply [flat|nested] 54+ messages in thread