* [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
[not found] <20260520150018.2491267-1-riel@surriel.com>
@ 2026-05-20 14:59 ` Rik van Riel
2026-05-20 17:47 ` Boris Burkov
2026-05-21 7:39 ` [syzbot ci] Re: mm: reliable 1GB page allocation syzbot ci
2026-06-27 9:28 ` [RFC PATCH 00/40] " Lorenzo Stoakes
2 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2026-05-20 14:59 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, fvdl, Rik van Riel, Chris Mason, David Sterba,
Boris Burkov, linux-btrfs
Extent buffer pages allocated by alloc_extent_buffer() are attached to
btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
migratable in practice once their owning extent buffer hits refs == 1,
which happens naturally as tree roots rotate. The buddy allocator
classifies them by GFP, however, and bare GFP_NOFS lands them in
MIGRATE_UNMOVABLE pageblocks. The result: every btree_inode page we
read in pins an unmovable pageblock from the page-superblock allocator's
perspective, even though the page itself can be moved.
Add __GFP_MOVABLE to that one allocation site (alloc_extent_buffer's
call to alloc_eb_folio_array). Plumb the flag through
alloc_eb_folio_array → btrfs_alloc_page_array as a `gfp_t extra_gfp`
parameter. All other call sites pass 0.
Three categories of caller stay on bare GFP_NOFS, deliberately:
- alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
the folios never enter LRU, never get migrate_folio aops. Tagging
them __GFP_MOVABLE would violate the page allocator's migrability
contract and they would defeat compaction in MOVABLE pageblocks
where isolate_migratepages_block skips non-LRU non-movable_ops
pages outright.
- btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (uring
encoded reads), fs/btrfs/relocation.c (relocation buffers): same
contract violation. raid56 stripe_pages additionally persist in
the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
so they are not transient enough to hand-wave the contract.
- btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
folios): same -- stripe->folios[] are private buffers freed via
folio_put in release_scrub_stripe.
This change targets the dominant fragmentation source observed on the
page-superblock series: ~28 GB of btree_inode pages parked across
many tainted superpageblocks on a 250 GB test system with btrfs root,
preventing 1 GiB hugepage allocation from those regions. With the
movable hint, those pages now land in MOVABLE pageblocks where the
existing background defragger drains them through the standard
PB_has_movable gate, no LRU-sample fallback needed.
Cc: Chris Mason <clm@meta.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
fs/btrfs/extent_io.c | 69 ++++++++++++++++++++++++++++++-------------
fs/btrfs/extent_io.h | 4 +--
fs/btrfs/inode.c | 2 +-
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/raid56.c | 6 ++--
fs/btrfs/relocation.c | 2 +-
fs/btrfs/scrub.c | 3 +-
7 files changed, 59 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2275189b7860..563c4a7eaa36 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -620,24 +620,33 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
}
/*
- * Populate every free slot in a provided array with folios using GFP_NOFS.
+ * Populate every free slot in a provided array with folios using
+ * GFP_NOFS plus optional caller-supplied flags.
*
- * @nr_folios: number of folios to allocate
- * @order: the order of the folios to be allocated
- * @folio_array: the array to fill with folios; any existing non-NULL entries in
- * the array will be skipped
+ * @nr_folios: number of folios to allocate
+ * @order: folio order
+ * @folio_array: array to fill with folios; non-NULL entries are skipped
+ * @extra_gfp: extra GFP flags OR'd into GFP_NOFS. The only value used
+ * today is __GFP_MOVABLE, which the extent-buffer real-mapping
+ * path (alloc_extent_buffer) passes when the resulting folios
+ * will be attached to btree_inode->i_mapping (added to LRU,
+ * served by the btree_migrate_folio aops). Pass 0 for
+ * everything else; folios allocated by other callers stay in
+ * driver-owned arrays, never reach LRU and never register
+ * movable_ops, so they cannot satisfy the __GFP_MOVABLE
+ * migrability contract.
*
* Return: 0 if all folios were able to be allocated;
* -ENOMEM otherwise, the partially allocated folios would be freed and
* the array slots zeroed
*/
int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
- struct folio **folio_array)
+ struct folio **folio_array, gfp_t extra_gfp)
{
for (int i = 0; i < nr_folios; i++) {
if (folio_array[i])
continue;
- folio_array[i] = folio_alloc(GFP_NOFS, order);
+ folio_array[i] = folio_alloc(GFP_NOFS | extra_gfp, order);
if (!folio_array[i])
goto error;
}
@@ -652,21 +661,27 @@ int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
}
/*
- * Populate every free slot in a provided array with pages, using GFP_NOFS.
+ * Populate every free slot in a provided array with pages, using GFP_NOFS
+ * plus optional caller-supplied flags.
*
- * @nr_pages: number of pages to allocate
- * @page_array: the array to fill with pages; any existing non-null entries in
- * the array will be skipped
- * @nofail: whether using __GFP_NOFAIL flag
+ * @nr_pages: number of pages to allocate
+ * @page_array: array to fill; non-NULL entries are skipped
+ * @nofail: whether to use __GFP_NOFAIL
+ * @extra_gfp: extra GFP flags OR'd into the base mask. The only value used
+ * today is __GFP_MOVABLE, which the extent-buffer real-mapping
+ * path passes when the resulting pages will be attached to
+ * btree_inode->i_mapping. See btrfs_alloc_folio_array() for
+ * the full migrability rationale.
*
* Return: 0 if all pages were able to be allocated;
* -ENOMEM otherwise, the partially allocated pages would be freed and
* the array slots zeroed
*/
int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
- bool nofail)
+ bool nofail, gfp_t extra_gfp)
{
- const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
+ const gfp_t gfp = (nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS) |
+ extra_gfp;
unsigned int allocated;
for (allocated = 0; allocated < nr_pages;) {
@@ -689,14 +704,23 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
* Populate needed folios for the extent buffer.
*
* For now, the folios populated are always in order 0 (aka, single page).
+ *
+ * @movable: pass true only when the resulting pages will be attached to
+ * btree_inode->i_mapping (the alloc_extent_buffer real path).
+ * Cloned/dummy extent buffers (EXTENT_BUFFER_UNMAPPED) leave
+ * folio->mapping NULL, never enter the LRU, and never get the
+ * btree_migrate_folio aops, so __GFP_MOVABLE would violate the
+ * page-allocator's migrability contract for them.
*/
-static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
+static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail,
+ bool movable)
{
struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
int num_pages = num_extent_pages(eb);
int ret;
- ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
+ ret = btrfs_alloc_page_array(num_pages, page_array, nofail,
+ movable ? __GFP_MOVABLE : 0);
if (ret < 0)
return ret;
@@ -3097,7 +3121,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
*/
set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
- ret = alloc_eb_folio_array(new, false);
+ ret = alloc_eb_folio_array(new, false, false);
if (ret)
goto release_eb;
@@ -3138,7 +3162,7 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
if (!eb)
return NULL;
- ret = alloc_eb_folio_array(eb, false);
+ ret = alloc_eb_folio_array(eb, false, false);
if (ret)
goto release_eb;
@@ -3491,8 +3515,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
}
reallocate:
- /* Allocate all pages first. */
- ret = alloc_eb_folio_array(eb, true);
+ /*
+ * Allocate all pages first. These will be attached to
+ * btree_inode->i_mapping below (added to LRU, served by
+ * btree_migrate_folio), so request __GFP_MOVABLE so the
+ * page allocator places them in MOVABLE pageblocks.
+ */
+ ret = alloc_eb_folio_array(eb, true, true);
if (ret < 0) {
btrfs_free_folio_state(prealloc);
goto out;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b310a5145cf6..5e263f07b59d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -387,9 +387,9 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
struct extent_buffer *buf);
int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
- bool nofail);
+ bool nofail, gfp_t extra_gfp);
int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
- struct folio **folio_array);
+ struct folio **folio_array, gfp_t extra_gfp);
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
bool find_lock_delalloc_range(struct inode *inode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 906d5c21ebc4..85f56ab815f9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9659,7 +9659,7 @@ ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
pages = kzalloc_objs(struct page *, nr_pages, GFP_NOFS);
if (!pages)
return -ENOMEM;
- ret = btrfs_alloc_page_array(nr_pages, pages, false);
+ ret = btrfs_alloc_page_array(nr_pages, pages, false, 0);
if (ret) {
ret = -ENOMEM;
goto out;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a39460bf68a7..77091915cacc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4621,7 +4621,7 @@ static int btrfs_uring_read_extent(struct kiocb *iocb, struct iov_iter *iter,
pages = kzalloc_objs(struct page *, nr_pages, GFP_NOFS);
if (!pages)
return -ENOMEM;
- ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+ ret = btrfs_alloc_page_array(nr_pages, pages, 0, 0);
if (ret) {
ret = -ENOMEM;
goto out_fail;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 08ee8f316d96..4135bac62be1 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1123,7 +1123,7 @@ static int alloc_rbio_pages(struct btrfs_raid_bio *rbio)
{
int ret;
- ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false);
+ ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false, 0);
if (ret < 0)
return ret;
/* Mapping all sectors */
@@ -1138,7 +1138,7 @@ static int alloc_rbio_parity_pages(struct btrfs_raid_bio *rbio)
int ret;
ret = btrfs_alloc_page_array(rbio->nr_pages - data_pages,
- rbio->stripe_pages + data_pages, false);
+ rbio->stripe_pages + data_pages, false, 0);
if (ret < 0)
return ret;
@@ -1732,7 +1732,7 @@ static int alloc_rbio_data_pages(struct btrfs_raid_bio *rbio)
const int data_pages = rbio->nr_data * rbio->stripe_npages;
int ret;
- ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false);
+ ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false, 0);
if (ret < 0)
return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3ebaf5880125..6f6d25724fb8 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4038,7 +4038,7 @@ static int copy_remapped_data(struct btrfs_fs_info *fs_info, u64 old_addr,
if (!pages)
return -ENOMEM;
- ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+ ret = btrfs_alloc_page_array(nr_pages, pages, 0, 0);
if (ret) {
ret = -ENOMEM;
goto end;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1ac609239cbe..4089e80077cc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -369,7 +369,8 @@ static int init_scrub_stripe(struct btrfs_fs_info *fs_info,
ASSERT(BTRFS_STRIPE_LEN >> min_folio_shift <= SCRUB_STRIPE_MAX_FOLIOS);
ret = btrfs_alloc_folio_array(BTRFS_STRIPE_LEN >> min_folio_shift,
- fs_info->block_min_order, stripe->folios);
+ fs_info->block_min_order, stripe->folios,
+ 0);
if (ret < 0)
goto error;
--
2.54.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
2026-05-20 14:59 ` [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable Rik van Riel
@ 2026-05-20 17:47 ` Boris Burkov
2026-05-23 15:58 ` David Sterba
0 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2026-05-20 17:47 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ljs, ziy, usama.arif, fvdl, Chris Mason, David Sterba,
linux-btrfs
On Wed, May 20, 2026 at 10:59:40AM -0400, Rik van Riel wrote:
> Extent buffer pages allocated by alloc_extent_buffer() are attached to
> btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
> served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
> migratable in practice once their owning extent buffer hits refs == 1,
> which happens naturally as tree roots rotate. The buddy allocator
Small correction to a generally correct story:
every extent_buffer we have used (it is newly cowed or we have read it
for some reason) will settle into a state where extent_buffer->refs == 1
and it is eligible for release_folio(). If a block gets cowed, then its
extent_buffer should be fully released/freed right away, with a tricky
EXTENT_BUFFER_STALE bit dance to prevent accidental reuse during this
process.
So I don't think cow or trees moving around is all that relevant here.
The folios you are after are the ones which are sitting around with
extent_buffer->refs == 1 and which will return 0 to release_folio(),
which is any valid, non-stale extent_buffer no btrfs operation is
currently using (typically referred to from a btrfs_path). These are
already available to be reclaimed by memory pressure based reclaim,
drop_caches, etc.
> classifies them by GFP, however, and bare GFP_NOFS lands them in
> MIGRATE_UNMOVABLE pageblocks. The result: every btree_inode page we
> read in pins an unmovable pageblock from the page-superblock allocator's
> perspective, even though the page itself can be moved.
>
> Add __GFP_MOVABLE to that one allocation site (alloc_extent_buffer's
> call to alloc_eb_folio_array). Plumb the flag through
> alloc_eb_folio_array → btrfs_alloc_page_array as a `gfp_t extra_gfp`
> parameter. All other call sites pass 0.
>
> Three categories of caller stay on bare GFP_NOFS, deliberately:
>
> - alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
> resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
> the folios never enter LRU, never get migrate_folio aops. Tagging
> them __GFP_MOVABLE would violate the page allocator's migrability
> contract and they would defeat compaction in MOVABLE pageblocks
> where isolate_migratepages_block skips non-LRU non-movable_ops
> pages outright.
>
> - btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
> pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (uring
> encoded reads), fs/btrfs/relocation.c (relocation buffers): same
> contract violation. raid56 stripe_pages additionally persist in
> the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
> so they are not transient enough to hand-wave the contract.
>
> - btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
> folios): same -- stripe->folios[] are private buffers freed via
> folio_put in release_scrub_stripe.
>
> This change targets the dominant fragmentation source observed on the
> page-superblock series: ~28 GB of btree_inode pages parked across
> many tainted superpageblocks on a 250 GB test system with btrfs root,
> preventing 1 GiB hugepage allocation from those regions. With the
> movable hint, those pages now land in MOVABLE pageblocks where the
> existing background defragger drains them through the standard
> PB_has_movable gate, no LRU-sample fallback needed.
>
I left a small code nit inline, but whether you take that suggestion or
leave it, you can add:
Reviewed-by: Boris Burkov <boris@bur.io>
Thanks for the patch,
Boris
> Cc: Chris Mason <clm@meta.com>
> Cc: David Sterba <dsterba@suse.com>
> Cc: Boris Burkov <boris@bur.io>
> Cc: linux-btrfs@vger.kernel.org
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Assisted-by: Claude:claude-opus-4.7 syzkaller
> ---
> fs/btrfs/extent_io.c | 69 ++++++++++++++++++++++++++++++-------------
> fs/btrfs/extent_io.h | 4 +--
> fs/btrfs/inode.c | 2 +-
> fs/btrfs/ioctl.c | 2 +-
> fs/btrfs/raid56.c | 6 ++--
> fs/btrfs/relocation.c | 2 +-
> fs/btrfs/scrub.c | 3 +-
> 7 files changed, 59 insertions(+), 29 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 2275189b7860..563c4a7eaa36 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -620,24 +620,33 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
> }
>
> /*
> - * Populate every free slot in a provided array with folios using GFP_NOFS.
> + * Populate every free slot in a provided array with folios using
> + * GFP_NOFS plus optional caller-supplied flags.
> *
> - * @nr_folios: number of folios to allocate
> - * @order: the order of the folios to be allocated
> - * @folio_array: the array to fill with folios; any existing non-NULL entries in
> - * the array will be skipped
> + * @nr_folios: number of folios to allocate
> + * @order: folio order
> + * @folio_array: array to fill with folios; non-NULL entries are skipped
> + * @extra_gfp: extra GFP flags OR'd into GFP_NOFS. The only value used
> + * today is __GFP_MOVABLE, which the extent-buffer real-mapping
> + * path (alloc_extent_buffer) passes when the resulting folios
> + * will be attached to btree_inode->i_mapping (added to LRU,
> + * served by the btree_migrate_folio aops). Pass 0 for
> + * everything else; folios allocated by other callers stay in
> + * driver-owned arrays, never reach LRU and never register
> + * movable_ops, so they cannot satisfy the __GFP_MOVABLE
> + * migrability contract.
> *
> * Return: 0 if all folios were able to be allocated;
> * -ENOMEM otherwise, the partially allocated folios would be freed and
> * the array slots zeroed
> */
> int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
> - struct folio **folio_array)
> + struct folio **folio_array, gfp_t extra_gfp)
> {
> for (int i = 0; i < nr_folios; i++) {
> if (folio_array[i])
> continue;
> - folio_array[i] = folio_alloc(GFP_NOFS, order);
> + folio_array[i] = folio_alloc(GFP_NOFS | extra_gfp, order);
> if (!folio_array[i])
> goto error;
> }
> @@ -652,21 +661,27 @@ int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
> }
>
> /*
> - * Populate every free slot in a provided array with pages, using GFP_NOFS.
> + * Populate every free slot in a provided array with pages, using GFP_NOFS
> + * plus optional caller-supplied flags.
> *
> - * @nr_pages: number of pages to allocate
> - * @page_array: the array to fill with pages; any existing non-null entries in
> - * the array will be skipped
> - * @nofail: whether using __GFP_NOFAIL flag
> + * @nr_pages: number of pages to allocate
> + * @page_array: array to fill; non-NULL entries are skipped
> + * @nofail: whether to use __GFP_NOFAIL
> + * @extra_gfp: extra GFP flags OR'd into the base mask. The only value used
> + * today is __GFP_MOVABLE, which the extent-buffer real-mapping
> + * path passes when the resulting pages will be attached to
> + * btree_inode->i_mapping. See btrfs_alloc_folio_array() for
> + * the full migrability rationale.
> *
> * Return: 0 if all pages were able to be allocated;
> * -ENOMEM otherwise, the partially allocated pages would be freed and
> * the array slots zeroed
> */
> int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
> - bool nofail)
> + bool nofail, gfp_t extra_gfp)
> {
> - const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
> + const gfp_t gfp = (nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS) |
> + extra_gfp;
> unsigned int allocated;
>
> for (allocated = 0; allocated < nr_pages;) {
> @@ -689,14 +704,23 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
> * Populate needed folios for the extent buffer.
> *
> * For now, the folios populated are always in order 0 (aka, single page).
> + *
> + * @movable: pass true only when the resulting pages will be attached to
> + * btree_inode->i_mapping (the alloc_extent_buffer real path).
> + * Cloned/dummy extent buffers (EXTENT_BUFFER_UNMAPPED) leave
> + * folio->mapping NULL, never enter the LRU, and never get the
> + * btree_migrate_folio aops, so __GFP_MOVABLE would violate the
> + * page-allocator's migrability contract for them.
> */
My gut says it is nicer to just transparently pass along a gfp flag
rather than a carefully documented movable boolean and let the callers
do the right thing. Just a nit, I don't care strongly about it.
> -static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
> +static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail,
> + bool movable)
> {
> struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
> int num_pages = num_extent_pages(eb);
> int ret;
>
> - ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
> + ret = btrfs_alloc_page_array(num_pages, page_array, nofail,
> + movable ? __GFP_MOVABLE : 0);
> if (ret < 0)
> return ret;
>
> @@ -3097,7 +3121,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
> */
> set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>
> - ret = alloc_eb_folio_array(new, false);
> + ret = alloc_eb_folio_array(new, false, false);
> if (ret)
> goto release_eb;
>
> @@ -3138,7 +3162,7 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> if (!eb)
> return NULL;
>
> - ret = alloc_eb_folio_array(eb, false);
> + ret = alloc_eb_folio_array(eb, false, false);
> if (ret)
> goto release_eb;
>
> @@ -3491,8 +3515,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> }
>
> reallocate:
> - /* Allocate all pages first. */
> - ret = alloc_eb_folio_array(eb, true);
> + /*
> + * Allocate all pages first. These will be attached to
> + * btree_inode->i_mapping below (added to LRU, served by
> + * btree_migrate_folio), so request __GFP_MOVABLE so the
> + * page allocator places them in MOVABLE pageblocks.
> + */
i.e., just pass GFP_MOVABLE here with a comment explaining why
> + ret = alloc_eb_folio_array(eb, true, true);
> if (ret < 0) {
> btrfs_free_folio_state(prealloc);
> goto out;
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index b310a5145cf6..5e263f07b59d 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -387,9 +387,9 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
> struct extent_buffer *buf);
>
> int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
> - bool nofail);
> + bool nofail, gfp_t extra_gfp);
> int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
> - struct folio **folio_array);
> + struct folio **folio_array, gfp_t extra_gfp);
>
> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> bool find_lock_delalloc_range(struct inode *inode,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 906d5c21ebc4..85f56ab815f9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9659,7 +9659,7 @@ ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
> pages = kzalloc_objs(struct page *, nr_pages, GFP_NOFS);
> if (!pages)
> return -ENOMEM;
> - ret = btrfs_alloc_page_array(nr_pages, pages, false);
> + ret = btrfs_alloc_page_array(nr_pages, pages, false, 0);
> if (ret) {
> ret = -ENOMEM;
> goto out;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index a39460bf68a7..77091915cacc 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4621,7 +4621,7 @@ static int btrfs_uring_read_extent(struct kiocb *iocb, struct iov_iter *iter,
> pages = kzalloc_objs(struct page *, nr_pages, GFP_NOFS);
> if (!pages)
> return -ENOMEM;
> - ret = btrfs_alloc_page_array(nr_pages, pages, 0);
> + ret = btrfs_alloc_page_array(nr_pages, pages, 0, 0);
> if (ret) {
> ret = -ENOMEM;
> goto out_fail;
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 08ee8f316d96..4135bac62be1 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1123,7 +1123,7 @@ static int alloc_rbio_pages(struct btrfs_raid_bio *rbio)
> {
> int ret;
>
> - ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false);
> + ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false, 0);
> if (ret < 0)
> return ret;
> /* Mapping all sectors */
> @@ -1138,7 +1138,7 @@ static int alloc_rbio_parity_pages(struct btrfs_raid_bio *rbio)
> int ret;
>
> ret = btrfs_alloc_page_array(rbio->nr_pages - data_pages,
> - rbio->stripe_pages + data_pages, false);
> + rbio->stripe_pages + data_pages, false, 0);
> if (ret < 0)
> return ret;
>
> @@ -1732,7 +1732,7 @@ static int alloc_rbio_data_pages(struct btrfs_raid_bio *rbio)
> const int data_pages = rbio->nr_data * rbio->stripe_npages;
> int ret;
>
> - ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false);
> + ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false, 0);
> if (ret < 0)
> return ret;
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 3ebaf5880125..6f6d25724fb8 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4038,7 +4038,7 @@ static int copy_remapped_data(struct btrfs_fs_info *fs_info, u64 old_addr,
> if (!pages)
> return -ENOMEM;
>
> - ret = btrfs_alloc_page_array(nr_pages, pages, 0);
> + ret = btrfs_alloc_page_array(nr_pages, pages, 0, 0);
> if (ret) {
> ret = -ENOMEM;
> goto end;
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 1ac609239cbe..4089e80077cc 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -369,7 +369,8 @@ static int init_scrub_stripe(struct btrfs_fs_info *fs_info,
>
> ASSERT(BTRFS_STRIPE_LEN >> min_folio_shift <= SCRUB_STRIPE_MAX_FOLIOS);
> ret = btrfs_alloc_folio_array(BTRFS_STRIPE_LEN >> min_folio_shift,
> - fs_info->block_min_order, stripe->folios);
> + fs_info->block_min_order, stripe->folios,
> + 0);
> if (ret < 0)
> goto error;
>
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [syzbot ci] Re: mm: reliable 1GB page allocation
[not found] <20260520150018.2491267-1-riel@surriel.com>
2026-05-20 14:59 ` [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable Rik van Riel
@ 2026-05-21 7:39 ` syzbot ci
2026-06-27 9:28 ` [RFC PATCH 00/40] " Lorenzo Stoakes
2 siblings, 0 replies; 12+ messages in thread
From: syzbot ci @ 2026-05-21 7:39 UTC (permalink / raw)
To: boris, clm, david, dsterba, fvdl, hannes, hnaz, kernel-team, lenb,
linux-btrfs, linux-kernel, linux-mm, linux-pm, ljs, pavel, rafael,
riel, surenb, usama.arif, willy, ziy
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] mm: reliable 1GB page allocation
https://lore.kernel.org/all/20260520150018.2491267-1-riel@surriel.com
* [RFC PATCH 01/40] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data
* [RFC PATCH 02/40] mm: page_alloc: per-cpu pageblock buddy allocator
* [RFC PATCH 03/40] mm: page_alloc: split-path PCP free with local-trylock + remote-llist
* [RFC PATCH 04/40] mm: mm_init: fix zone assignment for pages in unavailable ranges
* [RFC PATCH 05/40] mm: page_alloc: remove watermark boost mechanism
* [RFC PATCH 06/40] mm: page_alloc: async evacuation of stolen movable pageblocks
* [RFC PATCH 07/40] mm: page_alloc: track actual page contents in pageblock flags
* [RFC PATCH 08/40] mm: page_alloc: superpageblock metadata for 1GB anti-fragmentation
* [RFC PATCH 09/40] mm: page_alloc: support superpageblock resize for memory hotplug
* [RFC PATCH 10/40] mm: page_alloc: add superpageblock fullness lists for allocation steering
* [RFC PATCH 11/40] mm: page_alloc: steer pageblock stealing to tainted superpageblocks
* [RFC PATCH 12/40] mm: page_alloc: steer movable allocations to fullest clean superpageblocks
* [RFC PATCH 13/40] mm: page_alloc: extract claim_whole_block from try_to_claim_block
* [RFC PATCH 14/40] mm: page_alloc: add per-superpageblock free lists
* [RFC PATCH 15/40] mm: page_alloc: add background superpageblock defragmentation worker
* [RFC PATCH 16/40] mm: compaction: walk per-superpageblock free lists for migration targets
* [RFC PATCH 17/40] mm: page_alloc: superpageblock-aware contiguous and higher order allocation
* [RFC PATCH 18/40] mm: page_alloc: prevent atomic allocations from tainting clean SPBs
* [RFC PATCH 19/40] mm: page_alloc: aggressively pack non-movable allocs in tainted SPBs on large systems
* [RFC PATCH 20/40] mm: page_alloc: prefer reclaim over tainting clean superpageblocks
* [RFC PATCH 21/40] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks
* [RFC PATCH 22/40] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters
* [RFC PATCH 23/40] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs
* [RFC PATCH 24/40] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks
* [RFC PATCH 25/40] mm: trigger deferred SPB evac when atomic allocs would taint a clean SPB
* [RFC PATCH 26/40] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback
* [RFC PATCH 27/40] mm: page_alloc: cross-migratetype buddy borrow within tainted SPBs
* [RFC PATCH 28/40] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure
* [RFC PATCH 29/40] mm: page_reporting: walk per-superpageblock free lists
* [RFC PATCH 30/40] mm: show_mem: collect migratetype letters from per-superpageblock lists
* [RFC PATCH 31/40] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache
* [RFC PATCH 32/40] mm: debug: prevent infinite recursion in dump_page() with CMA
* [RFC PATCH 33/40] PM: hibernate: walk per-superpageblock free lists in mark_free_pages
* [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
* [RFC PATCH 35/40] mm: page_alloc: refuse best-effort high-order allocs servable at lower orders
* [RFC PATCH 36/40] mm: page_alloc: set ALLOC_NOFRAGMENT on alloc_frozen_pages_nolock_noprof
* [RFC PATCH 37/40] mm: page_alloc: move spb_get_category and spb_tainted_reserve to mmzone.h
* [RFC PATCH 38/40] mm: compaction: skip empty tainted superpageblocks as migration source
* [RFC PATCH 39/40] mm: compaction: respect tainted SPB reserve in destination selection
* [RFC PATCH 40/40] mm: page_alloc: SPB tracepoint instrumentation [DO-NOT-MERGE]
and found the following issue:
WARNING in preempt_count_sub
Full report is available here:
https://ci.syzbot.org/series/8eef71e1-708c-40d0-9f9d-3a1bd637fc80
***
WARNING in preempt_count_sub
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: 4b83cbc4c15f09b000cc06f033f64b0824b6dc87
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/de024c89-d472-4f06-b5be-4f75f2acd1d9/config
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(val > preempt_count())
WARNING: kernel/sched/core.c:5883 at preempt_count_sub+0x9e/0x170, CPU#0: kworker/0:0/9
Modules linked in:
CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: events drain_vmap_area_work
RIP: 0010:preempt_count_sub+0xa5/0x170
Code: 29 31 90 48 c1 e8 03 0f b6 04 18 84 c0 0f 85 88 00 00 00 83 3d ff 50 9d 0e 00 75 13 48 8d 3d 32 37 a0 0e 48 c7 c6 c0 f8 cb 8b <67> 48 0f b9 3a 90 eb b8 90 e8 ed d6 1d 03 85 c0 74 2f 48 c7 c0 c4
RSP: 0000:ffffc900000e7590 EFLAGS: 00010246
RAX: 0000000000000000 RBX: dffffc0000000000 RCX: ffff8881007d5880
RDX: 0000000000000000 RSI: ffffffff8bcbf8c0 RDI: ffffffff90341000
RBP: ffff888121042dc0 R08: ffffffff903129c3 R09: 1ffffffff2062538
R10: dffffc0000000000 R11: fffffbfff2062539 R12: 0000000000000020
R13: dffffc0000000000 R14: ffff888121042d80 R15: ffff88815fffbae8
FS: 0000000000000000(0000) GS:ffff88818dc7e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e74a000 CR4: 00000000000006f0
Call Trace:
<TASK>
free_frozen_page_commit+0x737/0x1290
__free_frozen_pages+0x873/0x1070
kasan_depopulate_vmalloc_pte+0x6d/0x90
__apply_to_page_range+0xbdc/0x1420
__kasan_release_vmalloc+0xa2/0xd0
purge_vmap_node+0x220/0x960
__purge_vmap_area_lazy+0x779/0xb40
drain_vmap_area_work+0x27/0x40
process_scheduled_works+0xb5d/0x1860
worker_thread+0xa53/0xfc0
kthread+0x388/0x470
ret_from_fork+0x514/0xb70
ret_from_fork_asm+0x1a/0x30
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
2026-05-20 17:47 ` Boris Burkov
@ 2026-05-23 15:58 ` David Sterba
2026-05-24 1:43 ` Rik van Riel
0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2026-05-23 15:58 UTC (permalink / raw)
To: Boris Burkov
Cc: Rik van Riel, linux-kernel, kernel-team, linux-mm, david, willy,
surenb, hannes, ljs, ziy, usama.arif, fvdl, Chris Mason,
David Sterba, linux-btrfs
On Wed, May 20, 2026 at 10:47:49AM -0700, Boris Burkov wrote:
> > @@ -689,14 +704,23 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
> > * Populate needed folios for the extent buffer.
> > *
> > * For now, the folios populated are always in order 0 (aka, single page).
> > + *
> > + * @movable: pass true only when the resulting pages will be attached to
> > + * btree_inode->i_mapping (the alloc_extent_buffer real path).
> > + * Cloned/dummy extent buffers (EXTENT_BUFFER_UNMAPPED) leave
> > + * folio->mapping NULL, never enter the LRU, and never get the
> > + * btree_migrate_folio aops, so __GFP_MOVABLE would violate the
> > + * page-allocator's migrability contract for them.
> > */
>
> My gut says it is nicer to just transparently pass along a gfp flag
> rather than a carefully documented movable boolean and let the callers
> do the right thing. Just a nit, I don't care strongly about it.
Agreed, the GFP flags would be better and also a bit more self
documenting where the functions are called. The patch is not standalone
so I don't mind having the booleans, we can clean it up later to let the
40-patch series proceed.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
2026-05-23 15:58 ` David Sterba
@ 2026-05-24 1:43 ` Rik van Riel
2026-05-24 19:59 ` Matthew Wilcox
0 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2026-05-24 1:43 UTC (permalink / raw)
To: dsterba, Boris Burkov
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ljs, ziy, usama.arif, fvdl, Chris Mason, David Sterba,
linux-btrfs
On Sat, 2026-05-23 at 17:58 +0200, David Sterba wrote:
> On Wed, May 20, 2026 at 10:47:49AM -0700, Boris Burkov wrote:
> > > @@ -689,14 +704,23 @@ int btrfs_alloc_page_array(unsigned int
> > > nr_pages, struct page **page_array,
> > > * Populate needed folios for the extent buffer.
> > > *
> > > * For now, the folios populated are always in order 0 (aka,
> > > single page).
> > > + *
> > > + * @movable: pass true only when the resulting pages will be
> > > attached to
> > > + * btree_inode->i_mapping (the alloc_extent_buffer
> > > real path).
> > > + * Cloned/dummy extent buffers
> > > (EXTENT_BUFFER_UNMAPPED) leave
> > > + * folio->mapping NULL, never enter the LRU, and never
> > > get the
> > > + * btree_migrate_folio aops, so __GFP_MOVABLE would
> > > violate the
> > > + * page-allocator's migrability contract for them.
> > > */
> >
> > My gut says it is nicer to just transparently pass along a gfp flag
> > rather than a carefully documented movable boolean and let the
> > callers
> > do the right thing. Just a nit, I don't care strongly about it.
>
> Agreed, the GFP flags would be better and also a bit more self
> documenting where the functions are called. The patch is not
> standalone
> so I don't mind having the booleans, we can clean it up later to let
> the
> 40-patch series proceed.
I think this patch could go in by itself. It doesn't
depend on anything else in the series.
I sent it with this series mostly because it's needed
to make the series work, and to provide context on why
it's needed.
I'm happy to resend it with a GFP mask passed in by
each caller. That would look better, indeed!
I'll send you guys a cleaned up patch this coming week.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
2026-05-24 1:43 ` Rik van Riel
@ 2026-05-24 19:59 ` Matthew Wilcox
2026-05-25 6:57 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-05-24 19:59 UTC (permalink / raw)
To: Rik van Riel
Cc: dsterba, Boris Burkov, linux-kernel, kernel-team, linux-mm, david,
surenb, hannes, ljs, ziy, usama.arif, fvdl, Chris Mason,
David Sterba, linux-btrfs
On Sat, May 23, 2026 at 09:43:23PM -0400, Rik van Riel wrote:
> On Sat, 2026-05-23 at 17:58 +0200, David Sterba wrote:
> > On Wed, May 20, 2026 at 10:47:49AM -0700, Boris Burkov wrote:
> > > My gut says it is nicer to just transparently pass along a gfp flag
> > > rather than a carefully documented movable boolean and let the
> > > callers
> > > do the right thing. Just a nit, I don't care strongly about it.
> >
> > Agreed, the GFP flags would be better and also a bit more self
> > documenting where the functions are called. The patch is not
> > standalone
> > so I don't mind having the booleans, we can clean it up later to let
> > the
> > 40-patch series proceed.
>
>
> I think this patch could go in by itself. It doesn't
> depend on anything else in the series.
>
> I sent it with this series mostly because it's needed
> to make the series work, and to provide context on why
> it's needed.
>
> I'm happy to resend it with a GFP mask passed in by
> each caller. That would look better, indeed!
If you're respinning anyway ...
Every use of GFP_NOFS is a bad code smell. Filesystems should be using
memalloc_nofs_save/restore. The problem is that it takes deep filesystem
knowledge to understand where the memalloc_nofs_save() should start.
It would be good to lift the GFP_NOFS to the callers of this function as
it gets us slightly closer to being able to replace it with GFP_KERNEL
in the places it's not actually needed.
ie what I'm asking for is, please don't do the 'extra_gfp' thing.
Just pass in gfp from the callers and make them all pass in GFP_NOFS |
__GFP_MOVABLE (or a plain GFP_NOFS) for now. Someone can come along
later and change them to GFP_KERNEL later where warranted.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable
2026-05-24 19:59 ` Matthew Wilcox
@ 2026-05-25 6:57 ` Christoph Hellwig
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-05-25 6:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Rik van Riel, dsterba, Boris Burkov, linux-kernel, kernel-team,
linux-mm, david, surenb, hannes, ljs, ziy, usama.arif, fvdl,
Chris Mason, David Sterba, linux-btrfs
On Sun, May 24, 2026 at 08:59:09PM +0100, Matthew Wilcox wrote:
> Every use of GFP_NOFS is a bad code smell. Filesystems should be using
> memalloc_nofs_save/restore. The problem is that it takes deep filesystem
> knowledge to understand where the memalloc_nofs_save() should start.
> It would be good to lift the GFP_NOFS to the callers of this function as
> it gets us slightly closer to being able to replace it with GFP_KERNEL
> in the places it's not actually needed.
>
> ie what I'm asking for is, please don't do the 'extra_gfp' thing.
> Just pass in gfp from the callers and make them all pass in GFP_NOFS |
> __GFP_MOVABLE (or a plain GFP_NOFS) for now. Someone can come along
> later and change them to GFP_KERNEL later where warranted.
Note that is should be pretty trivial to just bubble it up a layer
for changes like this. This will naturally gravitate to more useful
places. To make it fully correct it will need manual intervention,
but generally bubbling up should improve the situation.
Same for GFP_NOIO.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
[not found] <20260520150018.2491267-1-riel@surriel.com>
2026-05-20 14:59 ` [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable Rik van Riel
2026-05-21 7:39 ` [syzbot ci] Re: mm: reliable 1GB page allocation syzbot ci
@ 2026-06-27 9:28 ` Lorenzo Stoakes
2026-06-27 13:36 ` Rik van Riel
2 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2026-06-27 9:28 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Vlastimil Babka, Steven Rostedt,
Masami Hiramatsu, Rafael J. Wysocki, Oscar Salvador,
Mike Rapoport, linux-doc, linux-btrfs, linux-trace-kernel,
linux-pm, linux-cxl, Linus Torvalds
+cc missing maintainers.
+cc Linus for general issues raised.
(Apologies, this email is VERY long, even for me)
Hi Rik,
I appreciate this is an RFC, but obviously part of that is early
feedback. So if this is meant to be an (EXTREMELY) rough pre-RFC
proof-of-concept then fine (however your 'todo' suggests otherwise).
And of course, before I get critical :) thank you for looking into this,
it's very interesting work. I want this feature to land. BUT :)
So if this in any way resembles what you plan to send upstream un-RFC'd I
need to pour a fairly large bucket of very cold water over this.
TL;DR: The series is completely unmergeable as it stands. Not even close.
Before we get into the code, please please please make sure you cc- the
right people. You're doing very invasive, very major work here.
You failed to even cc- the page allocator maintainer (Vlastimil) for a
series that radically alters page allocation.
MAKE SURE you cc- Vlastimil and all other maintainers and reviewers + lists
on future (RFC!!) revisions of this please.
If you want to do a quiet off-list pre-RFC that's fine, but it's
unforgivable to leave Vlastimil out of this even for that.
It's 30 seconds running a script (filtering for maintainers alone to save
on noise):
$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor 20260520150018.2491267-1-riel@surriel.com.mbx
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
David Sterba <dsterba@suse.com> (maintainer:BTRFS FILE SYSTEM)
Vlastimil Babka <vbabka@kernel.org> (maintainer:MEMORY MANAGEMENT - PAGE ALLOCATOR)
Steven Rostedt <rostedt@goodmis.org> (maintainer:TRACING)
Masami Hiramatsu <mhiramat@kernel.org> (maintainer:TRACING)
"Rafael J. Wysocki" <rafael@kernel.org> (maintainer:HIBERNATION (aka Software Suspend, aka swsusp))
Oscar Salvador <osalvador@suse.de> (maintainer:MEMORY HOT(UN)PLUG)
Mike Rapoport <rppt@kernel.org> (maintainer:MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION)
Johannes Weiner <hannes@cmpxchg.org> (maintainer:MEMORY MANAGEMENT - RECLAIM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-btrfs@vger.kernel.org (open list:BTRFS FILE SYSTEM)
linux-trace-kernel@vger.kernel.org (open list:TRACING)
linux-pm@vger.kernel.org (open list:HIBERNATION (aka Software Suspend, aka swsusp))
linux-cxl@vger.kernel.org (open list:MEMORY HOT(UN)PLUG)
HIBERNATION (aka Software Suspend, aka swsusp) status: Supported
SUSPEND TO RAM status: Supported
OK, so getting into the code. You didn't include an overall diffstat. So let's
see:
$ git checkout e1914add2799
...
$ b4 shazam 20260520150018.2491267-1-riel@surriel.com
...
$ git format-patch --cover-letter HEAD~40
...
$ tail -31 0000-cover-letter.patch
Documentation/admin-guide/sysctl/vm.rst | 21 -
Documentation/mm/physical_memory.rst | 13 +-
fs/btrfs/extent_io.c | 69 +-
fs/btrfs/extent_io.h | 4 +-
fs/btrfs/inode.c | 2 +-
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/raid56.c | 6 +-
fs/btrfs/relocation.c | 2 +-
fs/btrfs/scrub.c | 3 +-
include/linux/mmzone.h | 236 +-
include/linux/page-flags.h | 9 +
include/linux/pageblock-flags.h | 10 +
include/linux/vm_event_item.h | 10 +
include/trace/events/kmem.h | 373 ++
kernel/power/snapshot.c | 35 +-
mm/compaction.c | 360 +-
mm/debug.c | 19 +-
mm/internal.h | 39 +
mm/memory_hotplug.c | 4 +
mm/mm_init.c | 400 +-
mm/page_alloc.c | 5064 ++++++++++++++++++++---
mm/page_reporting.c | 149 +-
mm/show_mem.c | 27 +-
mm/sparse.c | 3 +-
mm/vmscan.c | 115 +-
mm/vmstat.c | 71 +-
26 files changed, 6131 insertions(+), 915 deletions(-)
This is completely insane :) I do hope the omission of the diffstat was not
to hide this, but at any rate, adding ~5,000 lines of code to page_alloc.c
is a complete non-starter.
In any case I think clearly more files are required, mm/super_pageblock.c
or whatever it'll be, but at any rate this much code being added to me
clearly indicates something is terribly wrong here anyway.
And you added no tests whatsoever. I get that some things are hard to test
in the kernel and mm, but not even an attempt? Surely you have some
localised stuff that you've been using to exercise this, couldn't some of
that be made into selftests?
Certainly if tests are not present, this needs to be justified in the cover
letter.
(Again I appreciate this is an RFC, so perhaps those are coming.)
Anyway let's look at some of this code. Johannes's patches all look
reasonable, but as soon as we start seeing:
Assisted-by: Claude:claude-opus-4.7
Things start to go awry.
The series is full of extremely dense comments that contain far too much
specific information to be parseable. E.g.:
+ * - owner_cpu == this CPU (or no owner): take the local PCP
+ * lock with spin_trylock and enqueue normally. The trylock
+ * fails only on rare local self re-entry (IRQ/NMI fires
+ * while the interrupted task already holds the lock) or
+ * while a remote drain is active; either way, fall back to
+ * free_one_page (or the zone-llist for FPI_TRYLOCK). No
+ * irqsave: the trylock cannot block on self, and remote
+ * CPUs never take this pcp->lock (they go via free_llist),
+ * so an interruption cannot deadlock against another freer.
This isn't readable. This isn't acceptable. It matches exactly what I've
found LLMs to do, adding far too much detail in way that isn't
human-readable.
You must redo all of these.
The patches are all quite dense. 40 patches is already far too many IMO for
a series this massive, but you're also piling in tonnes of complexity in
every patch.
The series should really be broken out into separate series that each
slowly build up the prerequisites, so they are reviewable, parseable, can
be seen in action before we move to the superblock concept and each having
tests or means of asserting that they in fact function.
It feels that there are patches that indeed can be broken out sensibly like
this,
e.g. https://lore.kernel.org/all/20260520150018.2491267-5-riel@surriel.com/
(which also seems to be a fix? So maybe should be sent as one with
appropriate tags?), so I think that's really the only viable way to land
something this huge.
I see Usama commented on you breaking userspace, and you said you are not
going to do that after all (thanks)... but it worries me that you
essentially render the watermark boost broken? I don't think that's OK?
The code in general dumps large blocks of code into already existing huge
functions but, even more unforgivably, turns small, maintainable, easily
understood functions into completely disgusting horror shows.
For instance, __rmqueue_smallest() goes from ~20 lines to 604 lines. This
is completely insane and totally unacceptable.
The code is also full of software engineering 101 fails. For instance in
__rmqueue_sb_find_fallback():
if (search_cats & SB_SEARCH_PREFERRED) {
...
for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
...
if (movable && cat == SB_TAINTED &&
sb->nr_free <= spb_tainted_reserve(sb))
continue;
...
for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
...
if (page) {
...
}
}
}
}
}
Note the mix of insane nesting AND guard clauses to make it even more
indecipherable.
There's no world in which I, or any other mm maintainer (I would venture to
say) want to maintain stuff like this.
A particularly concerning patch is
https://lore.kernel.org/all/20260520150018.2491267-15-riel@surriel.com/ -
The commit messages suffer from the same issue as the comments - incredibly
dense material that is far too detailed to the point of absolutely
clarifying nothing. E.g.:
"The SPB list heads (zone->spb_empty and the spb_lists[cat][full] matrix)
are initialized only by setup_superpageblocks(), which is __init and runs
only at boot. Hot-add into a previously-empty zone invokes
init_one_superpageblock() with zero-initialized list_heads, and the
inlined list_add_tail() NULL-derefs walking ->next->prev."
Is just complete word salad. It may as well be written in Egyptian
hieroglyphs.
You're explaining spaghetti code using English language which makes it even
more indecipherable, and not getting to the root of what you are actually
doing.
Again this feels very LLM-generated. I've noticed that they tend to do this
'write out the code again but in English' stuff (and is why I do not
allow LLMs to generate code or comments for me!)
You need to make commit messages and comments as succinct as possible and
as clear as possible for _human beings_ :)
I also notice you have this huge blocks of word salad comments all over the
place but I haven't seen a single kdoc comment (EDIT: noticed at least 1 later!)
This is a far more helpful, structured, form of describing functions and
for anything put in a header you really do need to provide them.
Back to the patch. The diffstat is:
include/linux/mmzone.h | 10 +
mm/compaction.c | 36 +-
mm/internal.h | 10 +
mm/mm_init.c | 146 +++++--
mm/page_alloc.c | 853 ++++++++++++++++++++++++++++++++---------
mm/vmstat.c | 66 ++--
6 files changed, 883 insertions(+), 238 deletions(-)
This is (generally speaking, and certainly in this case) far, far too much
for a single patch.
But as per above this is the patch where you commit what are essentially
code war crimes against __rmqueue_smallest(), adding code so densely nested
that you have to lop off the end of code to fit:
+ if (!movable && !is_migrate_cma(migratetype)) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ if (!sb->nr_free)
+ continue;
+ for (current_order = max_t(unsigned int,
+ order, pageblock_order);
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, MIGRATE_MOVABLE);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page = claim_whole_block(zone, page,
+ current_order, order,
+ migratetype, MIGRATE_MOVABLE);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
I mean in what world are we taking code like this?
https://lore.kernel.org/all/20260520150018.2491267-22-riel@surriel.com/ has
what seem to be more human comments (good!) but then so densely nested that
they have to be cropped (bad).
using smaller, separate, functions and a structured programming approach
here is the way to go.
There's random smaller issues too like
https://lore.kernel.org/all/20260520150018.2491267-23-riel@surriel.com/
seems to assume the compiler can't figure out to remove dead code (note the
wild inconsistency too in comments)
You seem in some commits to undo or correct stuff you did in previous ones
like
https://lore.kernel.org/all/20260520150018.2491267-24-riel@surriel.com/
I mean perhaps I'm mistaken, but otherwise, can you instead rebase your
series please?
And again we see the world salad:
+ * Maximum tainted superpageblock candidates per spb_evacuate_for_order call.
+ * Collected under zone->lock, then evacuated without it. Larger than the
+ * contig-allocation candidate cap because evacuation runs from the slowpath
+ * after reclaim/compaction failed: we need a meaningful chance of freeing a
+ * non-MOV-claimable pageblock before the slowpath escalates to dropping
+ * ALLOC_NOFRAGMENT (which lets __rmqueue_claim taint clean SPBs). Sized to
+ * scan a meaningful fraction of a typical tainted-pool population.
No kdocs, some functions/complicated logic missing any comments at all,
then piles of words thrown at you like a rock.
This isn't pleasant to read, it adds no clarity, it just makes the whole
thing a mess.
The code in that one is at least vaguely ok (though still too nested).
I feel like where you've manually written code the quality substantially
improves, where the LLM has, the quality nosedives into oblivion.
For instance
https://lore.kernel.org/all/20260520150018.2491267-26-riel@surriel.com/
seems to be radically better, albeit adding far too much code. So I assume
that was more manual.
Stuff like
https://lore.kernel.org/all/20260520150018.2491267-27-riel@surriel.com/
shows more of a structural issue.
In:
+ * Mark callers that have a cheap fallback if the page allocator returns
+ * NULL, so __rmqueue can refuse to taint a clean SPB when an existing
+ * tainted SPB still has free pageblocks waiting to be evacuated.
+ *
+ * Two shapes qualify:
+ *
+ * 1. Explicit fallback declaration: __GFP_NORETRY without
+ * __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
+ * skb_page_frag_refill on full sockets, etc.
+ *
+ * 2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no __GFP_NOMEMALLOC,
+ * no __GFP_NOFAIL. These callers (GFP_ATOMIC, GFP_NOWAIT, including
+ * ALLOC_HIGHATOMIC consumers) have implicit fallbacks: drop the
+ * packet, demote the slab order, return ENOMEM up the slowpath,
+ * retry from process context with GFP_KERNEL, etc. ALLOC_HIGHATOMIC
+ * callers also get a second crack at the dedicated MIGRATE_HIGHATOMIC
+ * reserve in rmqueue_buddy after __rmqueue returns NULL.
+ * Tainting a 1 GiB SPB to satisfy any of them is a long-lived
+ * fragmentation event for short-lived data.
+ *
The comment is vastly better than most, but you seem to be tying far too
much up in assumptions about what particular workloads do.
This is indicative perhaps of a need to refactor to more reasonably
determine these.
Adding a function and state, say, that expresses these properties would
allow you to break out these concepts and have the code be self-documenting
(as well as adding suitable actual documentation in comments that would be
more succinct).
https://lore.kernel.org/all/20260520150018.2491267-29-riel@surriel.com/
also shows some real issues with how you've implemented this, e.g.:
+ * Called from each PASS_1/2/2B/2C/2D success path after a successful
+ * allocation against a tainted SPB. If the SPB is below its shrink
+ * high-water mark, queue the SPB-driven slab shrink and try to start
+ * the per-SPB defrag worker. Both have their own cooldown gates inside,
+ * so this is cheap to call on every such allocation.
Now you're putting information from one gargantuan function with labels
about 'passes' (rather reminiscent of VMA merge 'cases') into another.
This is a clear sign of a broken abstraction and the code not being
structured correctly.
You should _express_ state encoded in these 'passes' in the _actual code_,
break that code up sensibly and in a way that can be assessed for
correctness, not put a bit-rotting comment on top of a function whose
correctness is much harder to confirm.
Again this patch has similar issues with ridiculously dense indecipherable
comments, e.g.:
+ * Last-chance defrag trigger before tainting a fresh clean SPB.
+ * Walk the tainted-SPB list and try to wake the per-SPB defrag
+ * worker on each. Catches SPBs that are stuck in expired-cooldown
+ * state because no allocator activity has touched them recently
+ * (the routine event-driven trigger from spb_update_list only
+ * fires on bucket transitions, not on every alloc). Once the
+ * cooldown has expired, spb_maybe_start_defrag() will requeue
+ * work; otherwise the gate inside spb_needs_defrag() no-ops
+ * cheaply. Bounded by nr_tainted_spbs and only runs when we are
+ * already on the slow path of fragmenting the clean pool.
The spoondecker is montiplexed in the fradupple complex dedadderated in the
splunkyfied concratanator underfined by the transpontaculatoration
matrixifier... :)
I think this is symptomatic of the abstraction being fundamentally broken.
If you have to establish vast swathes of cognitive context like this mid
way through a huge function (again poor __rmqueue_smallest() who will never
forgive you), then you've basically failed to abstract it.
Please I beg you ADD FUNCTIONS :) structured programming is a thing,
struct's are a thing, abstraction is a thing :)
Pouring spaghetti into a single function is something you expect to see on
a 1990's PHP website, not in core mm code ;)
https://lore.kernel.org/all/20260520150018.2491267-30-riel@surriel.com/
seems to be another example of you just rethinking parts of what you
already submitted midway through the series.
Again I might be missing something here, but if you are doing this, please
just rebase your series to use $NEWIDEA from the start.
https://lore.kernel.org/all/20260520150018.2491267-35-riel@surriel.com/ is
another wall of text (newlines! Please :) but it seems like something you
can break out separately no?
It seems at least reviewable :)
https://lore.kernel.org/all/20260520150018.2491267-36-riel@surriel.com/
contains more of the same broken abstraction stuff, terrible nesting stuff,
word salad stuff but is at least mercifully smaller.
https://lore.kernel.org/all/20260520150018.2491267-37-riel@surriel.com/ is
almost emblematic of the terrible (LLM I hope?) comment issue in the
series. Overly dense hieroglyphs.
https://lore.kernel.org/all/20260520150018.2491267-38-riel@surriel.com/ is
another 'why didn't you rebase?' patch (again maybe I'm missing something
here!)
https://lore.kernel.org/all/20260520150018.2491267-39-riel@surriel.com/
seems more reasonable although newlines please :)
And finally
https://lore.kernel.org/all/20260520150018.2491267-40-riel@surriel.com/ -
while it's a do-not-merge patch in an RFC that changes tracing so maybe
unfair, but the comments are suffering the same symptoms as the rest of the
series.
~
So really, you need to start again, from scratch, and without the use of an
LLM for generating code, or at least with it kept on a (very very short)
leash.
And to be clear, I _want_ this concept of GB superpageblocks to land. It's
a really exciting concept.
Pulling compaction kicking and screaming into 2026 stands to significantly
benefit linux users and developers.
But the execution has to be _completely_ rethought.
I also worry about correctness - I simply cannot see how you can have sense
of it, given the state of the code. For something so invasive and so
critical to kernel functionality, code quality is simply not optional here.
Practical thoughts on how to rework the series:
- Properly abstract the concepts
- Properly separate out functions and data structures
- Add _human-readable_ comments that are succinct + clear as possible
- If a function becomes too nested, separate into smaller functions
- Add tests, if at all or in any way possible (and justify if you can't)
- Clarify commit messages, don't just rewrite code in English
- Use newlines for new paragraphs everywhere :)
- Refactor existing code in preparation for your changes
- Add a new file to contain your changes (+ add to page alloc MAINTAINERS
section)
- Add kdocs for anything not static, and clearly describe static functions
- Split into separate series as much as possible, gradually building
foundations for your changes
- Make everything less dense and more abstracted
In general, write the series with reviewers and other kernel developers in
mind - write clear explanations in comments and commit messages, have the
series slowly build to what you're implementing.
That way, we can see that correctness is maintained throughout, can review
what's there sensibly, and the series becomes upstreamable.
IOW I say we take off and nuke the entire site from orbit. It's the only
way to be sure :)
~
Another issue here is maintainer time - even this _extremely_ light-touch
review has taken me a few hours (of my weekend :). To review it in detail
would take probably DAYS of dedicated work.
In general, I'm concerned that we're going to get caught in a cycle of LLM
code sent - reviewers spending hours reviewing - LLM generates new revision
- etc.
This simply isn't scalable, maintainers cannot do this in the face of the
sheer quantity and complexity of code that LLMs can generate.
We are simply going to have to NAK. And that helps nobody.
Luckily for me, this isn't in a sub-(sub-)system I maintain, so I am not
obliged, but I do have empathy for my fellow maintainers, and am VERY
concerned about this trend.
There has recently been an absolute wave of LLM code, some acked as such
(and I think you for doing so here!) but others unacknowledged entirely,
and the workload, which was already too much, has risen significantly (Jon
has noted the rise in commit count for instance in LWN).
Treating maintainer time as without value was already an issue, but I fear
that we'll see a significant increase in maintainer stress and sadly,
burnout.
As such, I feel that we will have to implement measures at some point to
deprioritise/quickly dismiss such series, unfortunately.
But series from smart and capable engineers such as yourselfk who are well
respected in the communityk will likely not bek and thus will suffer from
this issue indefinitely.
So on that basis, I ask respectfully that you account for this when using
this tooling.
(Also, I appreciate that this is an RFC, but your recent non-RFC GUP series
https://lore.kernel.org/all/20260616190300.1509639-1-riel@surriel.com/
(revision v2 at
https://lore.kernel.org/all/20260625015053.2445008-1-riel@surriel.com/) has
all the same hallmarks as this one, so I feel this point needs to be
underlined).
~
Speaking more generally across the industry, I've been reading about
companies generating code blindly and running into terrible problems with
software that ends up becoming totally unmanageable.
I won't tolerate this happening happening in mm, and strongly object to the
concept (held by some AI proponents) that code is simply an unimportant
byproduct of AI.
The kernel's code (and especially mm's) is _critically_ important, quite
literally, as it forms the basis of the world's critical technical
infrastructure.
I am not opposed to the use of LLMs, but they _must_ serve as tools to
_assist_ experts at their job, not a means by which code bases are
degraded.
~
OK with all that said - to be absolutely clear - I respect you a great
deal, and I KNOW you're (much, much) better than this.
And, to repeat, this idea is very exciting and I _want_ to see this land.
But I feel you've rather let the LLM run amok and it's selling you (very,
very) short, given just how smart and capable you are.
Let's try again :)
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
2026-06-27 9:28 ` [RFC PATCH 00/40] " Lorenzo Stoakes
@ 2026-06-27 13:36 ` Rik van Riel
2026-06-29 9:29 ` Lorenzo Stoakes
0 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2026-06-27 13:36 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Vlastimil Babka, Steven Rostedt,
Masami Hiramatsu, Rafael J. Wysocki, Oscar Salvador,
Mike Rapoport, linux-doc, linux-btrfs, linux-trace-kernel,
linux-pm, linux-cxl, Linus Torvalds
On Sat, 2026-06-27 at 10:28 +0100, Lorenzo Stoakes wrote:
>
> So really, you need to start again, from scratch, and without the use
> of an
> LLM for generating code, or at least with it kept on a (very very
> short)
> leash.
>
> And to be clear, I _want_ this concept of GB superpageblocks to land.
> It's
> a really exciting concept.
That is the one reason I sent out RFC code before it
is ready. I am looking for feedback on the concepts
in this series.
How do people feel about splitting up the free lists,
so each gigabyte (well, PUD sized) chunk of memory
has its own free lists?
How can we balance the desire for higher-order kernel
allocations, against the desire to preserve gigabyte
sized chunks of memory that can be used for user space?
>
> Pulling compaction kicking and screaming into 2026 stands to
> significantly
> benefit linux users and developers.
That's another big question. How do we balance the
desire to keep compaction overhead low with the desire
to do higher order allocations almost everywhere?
>
> But the execution has to be _completely_ rethought.
There's no argument there.
I am just hoping to figure out what I should be
doing on a conceptual level, before figuring out
how to do it cleanly.
The mess in the RFC is the result of trying something
that seemed right, watching it fail in some subtle
way, and trying to fix it up.
Once I know what I need to do, coming up with a
cleaner implementation is very doable.
>
> IOW I say we take off and nuke the entire site from orbit. It's the
> only
> way to be sure :)
>
BOOM?
> Another issue here is maintainer time - even this _extremely_ light-
> touch
> review has taken me a few hours (of my weekend :). To review it in
> detail
> would take probably DAYS of dedicated work.
I suspect there is a mismatch in expectations here.
I already knew this code has to be totally redone.
I was looking for feedback on the basic concepts
and design in the patch series, but failed to
clearly communicate that.
You provided some detailed feedback on the code,
but as of yet nobody has really provided any
opinions on things like whether it is desirable
at all to have the free lists per gigablock,
or whether we need to come up with some totally
different approach.
How do we better communicate that kind of thing
in the future?
Is that something to spell out more clearly in
the cover letter?
Is that kind of feedback something developers
could even reasonably ask for? (if not, how do
we figure out what maintainers want?)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
2026-06-27 13:36 ` Rik van Riel
@ 2026-06-29 9:29 ` Lorenzo Stoakes
2026-06-29 10:03 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2026-06-29 9:29 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Vlastimil Babka, Steven Rostedt,
Masami Hiramatsu, Rafael J. Wysocki, Oscar Salvador,
Mike Rapoport, linux-doc, linux-btrfs, linux-trace-kernel,
linux-pm, linux-cxl, Linus Torvalds
TL;DR - please don't send unfiltered LLM code to list _at all_. If you want
to share it, link to a repo.
On Sat, Jun 27, 2026 at 09:36:51AM -0400, Rik van Riel wrote:
> That is the one reason I sent out RFC code before it
> is ready. I am looking for feedback on the concepts
> in this series.
...
> Once I know what I need to do, coming up with a
> cleaner implementation is very doable.
...
> The mess in the RFC is the result of trying something
> that seemed right, watching it fail in some subtle
> way, and trying to fix it up.
...
> > But the execution has to be _completely_ rethought.
>
> There's no argument there.
...
> > Another issue here is maintainer time - even this _extremely_ light-
> > touch
> > review has taken me a few hours (of my weekend :). To review it in
> > detail
> > would take probably DAYS of dedicated work.
>
> I suspect there is a mismatch in expectations here.
>
> I already knew this code has to be totally redone.
I'm glad we are in agreement on this :)
But in general I feel you have sent this and at least one other series like this
without being as clear as you should have been.
I hate to belabour the point but just to be clear:
* You label one patch [DO-NOT-MERGE], but none of the others (implying they
are candidates for being merged) [0] and the cover letter has TODOs,
including trivia like naming, but nothing about the code.
* You sent a non-RFC series with identical code quality issues [1]
recently.
* Until I pointed it out, you were responding to other review here as if
the series was genuinely was intended for (eventual) merge:
- "This is a userspace-visible removal. Writes to
/proc/sys/vm/watermark_boost_factor will now return -ENOENT instead of
being accepted, breaking userspace." [2]
<-: "I'll just drop this patch for now." [3]
- "I left a small code nit inline, but whether you take that suggestion
or leave it, you can add Reviewed-by: ..." [4]
<-: "I sent it with this series mostly because it's needed to make the
series work, and to provide context on why it's needed. I'm happy to
resend it with a GFP mask passed in by each caller. That would look
better, indeed!" [5]
So to be concrete, if you send really rough code, Use [pre-RFC] or [DO NOT
MERGE] (on the series as a whole) to make that clear and say so in the
cover letter VERY VERY clearly.
Or, you can put it in a repo somewhere and link it in an email discussing
the concepts (like I did with scalable CoW for instance).
Also if people respond to the series as if it isn't pre-RFC, I'd suggest in
your replies saying something like 'I intend to completely rework all this
anyway' or something like that! :)
> How do people feel about splitting up the free lists,
> so each gigabyte (well, PUD sized) chunk of memory
> has its own free lists?
>
> How can we balance the desire for higher-order kernel
> allocations, against the desire to preserve gigabyte
> sized chunks of memory that can be used for user space?
...
> That's another big question. How do we balance the
> desire to keep compaction overhead low with the desire
> to do higher order allocations almost everywhere?
>
> >
...
>
> I am just hoping to figure out what I should be
> doing on a conceptual level, before figuring out
> how to do it cleanly.
>
...
>
> I was looking for feedback on the basic concepts
> and design in the patch series, but failed to
> clearly communicate that.
>
> You provided some detailed feedback on the code,
> but as of yet nobody has really provided any
> opinions on things like whether it is desirable
> at all to have the free lists per gigablock,
> or whether we need to come up with some totally
> different approach.
>
> How do we better communicate that kind of thing
> in the future?
>
> Is that something to spell out more clearly in
> the cover letter?
>
> Is that kind of feedback something developers
> could even reasonably ask for? (if not, how do
> we figure out what maintainers want?)
As above, firstly make it clear that the code you are sending for review is
not to be reviewed so people don't waste highly contended maintainer time
on that! :)
Also, you didn't respond to my point regarding cc'ing the right people -
but that's clearly something you need to get right if you want this kind of
feedback to start with.
For instance, you didn't cc- the page allocator maintainer (Vlastimil) on a
series that is fundamentally changing the page allocator. That's not going
to help with feedback.
In general, this area of the page allocator and compaction isn't my
specialism in the kernel so I can't give you the in-depth feedback you need
on that.
But I do have thoughts in general as to how to achieve what you want here:
Firstly - you should try to summarise what you're doing here and what
you're changing alongside the trade-offs as clearly as you can in the cover
letter.
Then highlight what it is you need feedback on, broken out into clear
questions or points that make it easy for people to respond to.
And _you have already done this_ in your reply here:
* "How do people feel about splitting up the free lists, so each gigabyte
(well, PUD sized) chunk of memory has its own free lists?"
* "How can we balance the desire for higher-order kernel allocations,
against the desire to preserve gigabyte sized chunks of memory that can
be used for user space?"
* "How do we balance the desire to keep compaction overhead low with the
desire to do higher order allocations almost everywhere?"
I think a really good way of doing this would be to start out with
something like:
Right now compaction often fails to achieve what we need, with
fragmentation occurring anyway and (for instance) THP stalling on
the availability of higher order folios.
etc. etc.
Summarising _the problem_.
Then a section about your proposed solution, e.g.:
I propose a means by which we proactively achieve gigabyte-sized
pageblocks with logic which maintains these as physically
contiguous under both ordinary and contended workloads
Then list out the "secret sauce" of your approach, e.g.:
This works by arranging memory such that unmovable allocations are
grouped at <blah blah blah> etc.
Then raise your questions e.g.:
I'd like to ask the community - how do people feel about splitting
up the free lists, so each gigabyte (well, PUD sized) chunk of
memory has its own free lists? <etc. etc.>
Then make it clear whether this is an RFC that is ready for primetime or
not:
This series is simply intended as a proof-of-concept - PLEASE DO
NOT REVIEW THE CODE per-se, but rather comment on the concepts!
(And obviously as above, if that _is_ what you intend, underline it with
[DO NOT MERGE] or [pre-RFC] or something like that).
I'd also very strongly suggest (as I did in my original reply) breaking out
parts that can be broken out as prerequisite series.
If you're doing something good or useful _anyway_ then just send that
separately first, and have later work rely on the earlier work.
There's no rush, this is huge and will take time.
A final KEY point:
NEVER submit unfiltered code generated by an LLMs to the list in _any_
form. If you want people to access code like that to test or something,
then put it in a remote repo and link to it.
The code is SO overly complicated and SO messy that it's really difficult
for people to understand what's actually going on.
At the heart of what you need here is CLARITY.
You need to CLEARLY communicate what it is you're doing so busy maintainers
can examine it. That's the _only_ way you're going to get something like
this merged.
The LLM-generated code is so awful that ain't nobody got the time to try to
understand what it's doing.
The workload for this really has to be on submitters, not maintainers.
And what you've done, even if not intended, is workslopping, and that's
really not acceptable. Quoting the kernel process on tool-generated content
[6]:
"If tools permit you to generate a contribution automatically, expect
additional scrutiny in proportion to how much of it was generated.
As with the output of any tooling, the result may be incorrect or
inappropriate. You are expected to understand and to be able to defend
everything you submit. If you are unable to do so, then do not submit the
resulting changes.
If you do so anyway, maintainers are entitled to reject your series without
detailed review."
As per this and my previous reply, AI slop doesn't scale, even as an RFC -
I won't have time to reply like this in future, and we will just have to
reject your series out of hand, which helps nobody.
>
>
> --
> All Rights Reversed.
Thanks, Lorenzo
[0]:https://lore.kernel.org/all/20260520150018.2491267-41-riel@surriel.com/
[1]:https://lore.kernel.org/linux-mm/20260616190300.1509639-1-riel@surriel.com/
[2]:https://lore.kernel.org/all/20260526140204.1390573-1-usama.arif@linux.dev/
[3]:https://lore.kernel.org/all/2ecf71858845e7d14c718b1a6845389cb78b986e.camel@surriel.com/
[4]:https://lore.kernel.org/all/20260520174749.GA1458531@zen.localdomain/
[5]:https://lore.kernel.org/all/daa29c92f055d028a5b3ec0e42cfb1ee1496a593.camel@surriel.com/
[6]:https://docs.kernel.org/process/generated-content.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
2026-06-29 9:29 ` Lorenzo Stoakes
@ 2026-06-29 10:03 ` Vlastimil Babka (SUSE)
2026-06-29 14:39 ` Rik van Riel
0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-29 10:03 UTC (permalink / raw)
To: Lorenzo Stoakes, Rik van Riel
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Steven Rostedt, Masami Hiramatsu,
Rafael J. Wysocki, Oscar Salvador, Mike Rapoport, linux-doc,
linux-btrfs, linux-trace-kernel, linux-pm, linux-cxl,
Linus Torvalds
On 6/29/26 11:29, Lorenzo Stoakes wrote:
> TL;DR - please don't send unfiltered LLM code to list _at all_. If you want
> to share it, link to a repo.
>
> On Sat, Jun 27, 2026 at 09:36:51AM -0400, Rik van Riel wrote:
>> That is the one reason I sent out RFC code before it
>> is ready. I am looking for feedback on the concepts
>> in this series.
> ...
>> Once I know what I need to do, coming up with a
>> cleaner implementation is very doable.
> ...
>> The mess in the RFC is the result of trying something
>> that seemed right, watching it fail in some subtle
>> way, and trying to fix it up.
> ...
>> > But the execution has to be _completely_ rethought.
>>
>> There's no argument there.
> ...
>> > Another issue here is maintainer time - even this _extremely_ light-
>> > touch
>> > review has taken me a few hours (of my weekend :). To review it in
>> > detail
>> > would take probably DAYS of dedicated work.
>>
>> I suspect there is a mismatch in expectations here.
>>
>> I already knew this code has to be totally redone.
>
> I'm glad we are in agreement on this :)
>
> But in general I feel you have sent this and at least one other series like this
> without being as clear as you should have been.
>
> I hate to belabour the point but just to be clear:
>
> * You label one patch [DO-NOT-MERGE], but none of the others (implying they
> are candidates for being merged) [0] and the cover letter has TODOs,
> including trivia like naming, but nothing about the code.
>
> * You sent a non-RFC series with identical code quality issues [1]
> recently.
>
> * Until I pointed it out, you were responding to other review here as if
> the series was genuinely was intended for (eventual) merge:
>
> - "This is a userspace-visible removal. Writes to
> /proc/sys/vm/watermark_boost_factor will now return -ENOENT instead of
> being accepted, breaking userspace." [2]
>
> <-: "I'll just drop this patch for now." [3]
>
> - "I left a small code nit inline, but whether you take that suggestion
> or leave it, you can add Reviewed-by: ..." [4]
>
> <-: "I sent it with this series mostly because it's needed to make the
> series work, and to provide context on why it's needed. I'm happy to
> resend it with a GFP mask passed in by each caller. That would look
> better, indeed!" [5]
>
> So to be concrete, if you send really rough code, Use [pre-RFC] or [DO NOT
> MERGE] (on the series as a whole) to make that clear and say so in the
> cover letter VERY VERY clearly.
Yes please. [POC NOT-FOR-MERGE] perhaps?
> Or, you can put it in a repo somewhere and link it in an email discussing
> the concepts (like I did with scalable CoW for instance).
Indeed.
> As above, firstly make it clear that the code you are sending for review is
> not to be reviewed so people don't waste highly contended maintainer time
> on that! :)
>
> Also, you didn't respond to my point regarding cc'ing the right people -
> but that's clearly something you need to get right if you want this kind of
> feedback to start with.
>
> For instance, you didn't cc- the page allocator maintainer (Vlastimil) on a
> series that is fundamentally changing the page allocator. That's not going
> to help with feedback.
Right! Thanks a lot for adding me, Lorenzo.
> In general, this area of the page allocator and compaction isn't my
> specialism in the kernel so I can't give you the in-depth feedback you need
> on that.
>
> But I do have thoughts in general as to how to achieve what you want here:
>
> Firstly - you should try to summarise what you're doing here and what
> you're changing alongside the trade-offs as clearly as you can in the cover
> letter.
>
> Then highlight what it is you need feedback on, broken out into clear
> questions or points that make it easy for people to respond to.
Yep.
> And _you have already done this_ in your reply here:
>
> * "How do people feel about splitting up the free lists, so each gigabyte
> (well, PUD sized) chunk of memory has its own free lists?"
My immediate response is that now we'd need to search multiple sets of lists
instead of a single one? What about the overhead?
Having a POC (even vibe-coded) for measuring that overhead might be actually
useful to quickly figure out whether the idea is viable or not.
But then the code doesn't need to be sent as a huge series if it's not for
review. As Lorenzo said, git repo link is enough.
> * "How can we balance the desire for higher-order kernel allocations,
> against the desire to preserve gigabyte sized chunks of memory that can
> be used for user space?"
>
> * "How do we balance the desire to keep compaction overhead low with the
> desire to do higher order allocations almost everywhere?"
How can we have a cake and eat it too? :)
> I think a really good way of doing this would be to start out with
> something like:
>
> Right now compaction often fails to achieve what we need, with
> fragmentation occurring anyway and (for instance) THP stalling on
> the availability of higher order folios.
>
> etc. etc.
>
> Summarising _the problem_.
>
> Then a section about your proposed solution, e.g.:
>
> I propose a means by which we proactively achieve gigabyte-sized
> pageblocks with logic which maintains these as physically
> contiguous under both ordinary and contended workloads
>
> Then list out the "secret sauce" of your approach, e.g.:
>
> This works by arranging memory such that unmovable allocations are
> grouped at <blah blah blah> etc.
>
> Then raise your questions e.g.:
>
> I'd like to ask the community - how do people feel about splitting
> up the free lists, so each gigabyte (well, PUD sized) chunk of
> memory has its own free lists? <etc. etc.>
>
> Then make it clear whether this is an RFC that is ready for primetime or
> not:
>
> This series is simply intended as a proof-of-concept - PLEASE DO
> NOT REVIEW THE CODE per-se, but rather comment on the concepts!
>
> (And obviously as above, if that _is_ what you intend, underline it with
> [DO NOT MERGE] or [pre-RFC] or something like that).
Ack.
> I'd also very strongly suggest (as I did in my original reply) breaking out
> parts that can be broken out as prerequisite series.
>
> If you're doing something good or useful _anyway_ then just send that
> separately first, and have later work rely on the earlier work.
Ack.
> There's no rush, this is huge and will take time.
>
> A final KEY point:
>
> NEVER submit unfiltered code generated by an LLMs to the list in _any_
> form. If you want people to access code like that to test or something,
> then put it in a remote repo and link to it.
>
> The code is SO overly complicated and SO messy that it's really difficult
> for people to understand what's actually going on.
>
> At the heart of what you need here is CLARITY.
>
> You need to CLEARLY communicate what it is you're doing so busy maintainers
> can examine it. That's the _only_ way you're going to get something like
> this merged.
>
> The LLM-generated code is so awful that ain't nobody got the time to try to
> understand what it's doing.
Indeed.
> The workload for this really has to be on submitters, not maintainers.
>
> And what you've done, even if not intended, is workslopping, and that's
> really not acceptable. Quoting the kernel process on tool-generated content
> [6]:
>
> "If tools permit you to generate a contribution automatically, expect
> additional scrutiny in proportion to how much of it was generated.
>
> As with the output of any tooling, the result may be incorrect or
> inappropriate. You are expected to understand and to be able to defend
> everything you submit. If you are unable to do so, then do not submit the
> resulting changes.
>
> If you do so anyway, maintainers are entitled to reject your series without
> detailed review."
>
> As per this and my previous reply, AI slop doesn't scale, even as an RFC -
> I won't have time to reply like this in future, and we will just have to
> reject your series out of hand, which helps nobody.
True. Thanks a lot for going out of your way on this!
>>
>>
>> --
>> All Rights Reversed.
>
> Thanks, Lorenzo
>
> [0]:https://lore.kernel.org/all/20260520150018.2491267-41-riel@surriel.com/
> [1]:https://lore.kernel.org/linux-mm/20260616190300.1509639-1-riel@surriel.com/
> [2]:https://lore.kernel.org/all/20260526140204.1390573-1-usama.arif@linux.dev/
> [3]:https://lore.kernel.org/all/2ecf71858845e7d14c718b1a6845389cb78b986e.camel@surriel.com/
> [4]:https://lore.kernel.org/all/20260520174749.GA1458531@zen.localdomain/
> [5]:https://lore.kernel.org/all/daa29c92f055d028a5b3ec0e42cfb1ee1496a593.camel@surriel.com/
> [6]:https://docs.kernel.org/process/generated-content.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
2026-06-29 10:03 ` Vlastimil Babka (SUSE)
@ 2026-06-29 14:39 ` Rik van Riel
0 siblings, 0 replies; 12+ messages in thread
From: Rik van Riel @ 2026-06-29 14:39 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), Lorenzo Stoakes
Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
Chris Mason, David Sterba, Steven Rostedt, Masami Hiramatsu,
Rafael J. Wysocki, Oscar Salvador, Mike Rapoport, linux-doc,
linux-btrfs, linux-trace-kernel, linux-pm, linux-cxl,
Linus Torvalds
On Mon, 2026-06-29 at 12:03 +0200, Vlastimil Babka (SUSE) wrote:
> On 6/29/26 11:29, Lorenzo Stoakes wrote:
> >
> > So to be concrete, if you send really rough code, Use [pre-RFC] or
> > [DO NOT
> > MERGE] (on the series as a whole) to make that clear and say so in
> > the
> > cover letter VERY VERY clearly.
>
> Yes please. [POC NOT-FOR-MERGE] perhaps?
>
> > Or, you can put it in a repo somewhere and link it in an email
> > discussing
> > the concepts (like I did with scalable CoW for instance).
>
> Indeed.
I'll do that for the next version.
I suspect it will take a while to beat this thing
into shape.
>
> > And _you have already done this_ in your reply here:
> >
> > * "How do people feel about splitting up the free lists, so each
> > gigabyte
> > (well, PUD sized) chunk of memory has its own free lists?"
>
> My immediate response is that now we'd need to search multiple sets
> of lists
> instead of a single one? What about the overhead?
The current code is clearly not good enough. It
has to try several gigablocks almost blindly,
because there is no efficient way to find the
right gigablock.
I have an idea on how to fix that with bitmaps.
We could have one bitmap per order, indicating which
gigablocks have order 0 pages, order 1 pages, etc
Then a second set of bitmaps indicating which gigablocks
have unmovable / reclaimable pages.
At that point, finding a good gigablock to allocate
from can be done with a bitmap_and and a search.
These bitmaps would only need to be changed when the
status of a gigablock changes, eg. going from having
order 0 pages free, to not having any order 0 pages
free.
Does that seem like a workable approach?
Once we can quickly pinpoint a gigablock for the
page allocator to grab pages from, we can also
split out the "pick a gigablock" code from the
"allocate a page" code.
>
> > * "How can we balance the desire for higher-order kernel
> > allocations,
> > against the desire to preserve gigabyte sized chunks of memory
> > that can
> > be used for user space?"
> >
> > * "How do we balance the desire to keep compaction overhead low
> > with the
> > desire to do higher order allocations almost everywhere?"
>
> How can we have a cake and eat it too? :)
Pretty much :/
I suspect it's going to require some fun interactions
between allocation, reclaim, and compaction.
However, with everybody from networking, to filesystems,
to anonymous memory wanting to use higher order allocations
of differing sizes, it seems like we're going to have to
tackle this somehow.
>
> > I'd also very strongly suggest (as I did in my original reply)
> > breaking out
> > parts that can be broken out as prerequisite series.
> >
> > If you're doing something good or useful _anyway_ then just send
> > that
> > separately first, and have later work rely on the earlier work.
>
That becomes cleaner with the "post a link to
a tree" thing, as well.
The pcpbuddy stuff is likely to go in separately.
Johannes is still working on that code.
The "make btrfs inode cache pages movable" thing
already went in.
I think I have a few more things in the tree that
can go in separately, but hopefully that will grow
as this code solidifies.
On the flip side, things like "making compaction
scale" may well end up depending on the gigablock
stuff, because lack of targeting data seems like a
likely cause for why compaction has to try so hard.
I'll make sure to go over every point raised by
you guys before writing the next version of the
code, and again before posting a link to the
tree.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-06-29 14:39 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260520150018.2491267-1-riel@surriel.com>
2026-05-20 14:59 ` [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable Rik van Riel
2026-05-20 17:47 ` Boris Burkov
2026-05-23 15:58 ` David Sterba
2026-05-24 1:43 ` Rik van Riel
2026-05-24 19:59 ` Matthew Wilcox
2026-05-25 6:57 ` Christoph Hellwig
2026-05-21 7:39 ` [syzbot ci] Re: mm: reliable 1GB page allocation syzbot ci
2026-06-27 9:28 ` [RFC PATCH 00/40] " Lorenzo Stoakes
2026-06-27 13:36 ` Rik van Riel
2026-06-29 9:29 ` Lorenzo Stoakes
2026-06-29 10:03 ` Vlastimil Babka (SUSE)
2026-06-29 14:39 ` Rik van Riel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox