* [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
@ 2026-03-02 8:00 Qu Wenruo
2026-03-05 2:56 ` David Sterba
0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2026-03-02 8:00 UTC (permalink / raw)
To: linux-btrfs
[GLOBAL POOL]
Btrfs has maintained a global (per-module) pool of folios for compressed
reads and writes.

That pool is maintained as an LRU list of cached folios, with a shrinker
to free all cached folios when needed.

btrfs_alloc_compr_folio() first tries to grab an existing folio from
that LRU list, and falls back to regular folio allocation if the list is
empty.

And btrfs_free_compr_folio() puts the folio back into the LRU list if
the current cache level is below the threshold (256 pages).
[EXISTING LIMITS]
Since the global pool is per-module, there is no way to support different
folio sizes. This limitation is already affecting bs > ps support, so
bs > ps cases just bypass the global pool completely.

Another limitation is for bs < ps cases, which are more common.
In that case we always allocate a full page (which can be as large as
64K) for the compressed bio; this can be very wasteful if the on-disk
extent is pretty small.
[POTENTIAL BUGS]
Recently David has been reporting bugs related to compression:

- List corruption on that LRU list

- Folio refcount reaching zero in btrfs_free_compr_folio()

Although I haven't yet found a concrete chain of events for how this
happened, I suspect the semantics of folio->lru differ from page->lru.
In most cases folio members are 1:1 mapped to page members, but I have
not seen any driver/fs using folio->lru for its internal bookkeeping.
(There are users of page->lru though.)
[REMOVAL]
Personally I'm not a huge fan of maintaining a local folio list just for
compression.

I believe MM has seen much more pressure and can still properly hand us
folios/pages.
I do not think we are doing better than MM, so let's just use regular
folio allocation instead.
And hopefully this will address David's recent crash (as usual I'm not
able to reproduce locally).
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:

Firstly, I cannot reproduce the problem hit by David; although all call
traces point to compr_pool, there is no concrete explanation yet.

Secondly, this change itself may cause performance changes depending on
the memory pressure of the system, and there is quite some room for
extra fine tuning.

E.g. for the compressed write path, if we fail to allocate a folio we
can always fall back to an uncompressed write.
Thus in that case we could use __GFP_NOWARN or even __GFP_NORETRY.
---
fs/btrfs/compression.c | 132 +----------------------------------------
fs/btrfs/compression.h | 5 +-
fs/btrfs/inode.c | 2 +-
fs/btrfs/lzo.c | 8 +--
fs/btrfs/zlib.c | 10 ++--
fs/btrfs/zstd.c | 8 +--
6 files changed, 18 insertions(+), 147 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 192f133d9eb5..a96e4dc6624e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -126,107 +126,6 @@ static int compression_decompress(int type, struct list_head *ws,
static int btrfs_decompress_bio(struct compressed_bio *cb);
-/*
- * Global cache of last unused pages for compression/decompression.
- */
-static struct btrfs_compr_pool {
- struct shrinker *shrinker;
- spinlock_t lock;
- struct list_head list;
- int count;
- int thresh;
-} compr_pool;
-
-static unsigned long btrfs_compr_pool_count(struct shrinker *sh, struct shrink_control *sc)
-{
- int ret;
-
- /*
- * We must not read the values more than once if 'ret' gets expanded in
- * the return statement so we don't accidentally return a negative
- * number, even if the first condition finds it positive.
- */
- ret = READ_ONCE(compr_pool.count) - READ_ONCE(compr_pool.thresh);
-
- return ret > 0 ? ret : 0;
-}
-
-static unsigned long btrfs_compr_pool_scan(struct shrinker *sh, struct shrink_control *sc)
-{
- LIST_HEAD(remove);
- struct list_head *tmp, *next;
- int freed;
-
- if (compr_pool.count == 0)
- return SHRINK_STOP;
-
- /* For now, just simply drain the whole list. */
- spin_lock(&compr_pool.lock);
- list_splice_init(&compr_pool.list, &remove);
- freed = compr_pool.count;
- compr_pool.count = 0;
- spin_unlock(&compr_pool.lock);
-
- list_for_each_safe(tmp, next, &remove) {
- struct page *page = list_entry(tmp, struct page, lru);
-
- ASSERT(page_ref_count(page) == 1);
- put_page(page);
- }
-
- return freed;
-}
-
-/*
- * Common wrappers for page allocation from compression wrappers
- */
-struct folio *btrfs_alloc_compr_folio(struct btrfs_fs_info *fs_info)
-{
- struct folio *folio = NULL;
-
- /* For bs > ps cases, no cached folio pool for now. */
- if (fs_info->block_min_order)
- goto alloc;
-
- spin_lock(&compr_pool.lock);
- if (compr_pool.count > 0) {
- folio = list_first_entry(&compr_pool.list, struct folio, lru);
- list_del_init(&folio->lru);
- compr_pool.count--;
- }
- spin_unlock(&compr_pool.lock);
-
- if (folio)
- return folio;
-
-alloc:
- return folio_alloc(GFP_NOFS, fs_info->block_min_order);
-}
-
-void btrfs_free_compr_folio(struct folio *folio)
-{
- bool do_free = false;
-
- /* The folio is from bs > ps fs, no cached pool for now. */
- if (folio_order(folio))
- goto free;
-
- spin_lock(&compr_pool.lock);
- if (compr_pool.count > compr_pool.thresh) {
- do_free = true;
- } else {
- list_add(&folio->lru, &compr_pool.list);
- compr_pool.count++;
- }
- spin_unlock(&compr_pool.lock);
-
- if (!do_free)
- return;
-
-free:
- folio_put(folio);
-}
-
static void end_bbio_compressed_read(struct btrfs_bio *bbio)
{
struct compressed_bio *cb = to_compressed_bio(bbio);
@@ -238,7 +137,7 @@ static void end_bbio_compressed_read(struct btrfs_bio *bbio)
btrfs_bio_end_io(cb->orig_bbio, status);
bio_for_each_folio_all(fi, &bbio->bio)
- btrfs_free_compr_folio(fi.folio);
+ folio_put(fi.folio);
bio_put(&bbio->bio);
}
@@ -298,7 +197,7 @@ static void end_bbio_compressed_write(struct btrfs_bio *bbio)
end_compressed_writeback(cb);
/* Note, our inode could be gone now. */
bio_for_each_folio_all(fi, &bbio->bio)
- btrfs_free_compr_folio(fi.folio);
+ folio_put(fi.folio);
bio_put(&cb->bbio.bio);
}
@@ -568,7 +467,7 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio)
struct folio *folio;
u32 cur_len = min(compressed_len - i * min_folio_size, min_folio_size);
- folio = btrfs_alloc_compr_folio(fs_info);
+ folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (!folio) {
ret = -ENOMEM;
goto out_free_bio;
@@ -996,11 +895,6 @@ int btrfs_compress_filemap_get_folio(struct address_space *mapping, u64 start,
* @cb->bbio.bio.bi_iter.bi_size will indicate the compressed data size.
* The bi_size may not be sectorsize aligned, thus the caller still need
* to do the round up before submission.
- *
- * This function will allocate compressed folios with btrfs_alloc_compr_folio(),
- * thus callers must make sure the endio function and error handling are using
- * btrfs_free_compr_folio() to release those folios.
- * This is already done in end_bbio_compressed_write() and cleanup_compressed_bio().
*/
struct compressed_bio *btrfs_compress_bio(struct btrfs_inode *inode,
u64 start, u32 len, unsigned int type,
@@ -1133,31 +1027,11 @@ int __init btrfs_init_compress(void)
offsetof(struct compressed_bio, bbio.bio),
BIOSET_NEED_BVECS))
return -ENOMEM;
-
- compr_pool.shrinker = shrinker_alloc(SHRINKER_NONSLAB, "btrfs-compr-pages");
- if (!compr_pool.shrinker)
- return -ENOMEM;
-
- spin_lock_init(&compr_pool.lock);
- INIT_LIST_HEAD(&compr_pool.list);
- compr_pool.count = 0;
- /* 128K / 4K = 32, for 8 threads is 256 pages. */
- compr_pool.thresh = BTRFS_MAX_COMPRESSED / PAGE_SIZE * 8;
- compr_pool.shrinker->count_objects = btrfs_compr_pool_count;
- compr_pool.shrinker->scan_objects = btrfs_compr_pool_scan;
- compr_pool.shrinker->batch = 32;
- compr_pool.shrinker->seeks = DEFAULT_SEEKS;
- shrinker_register(compr_pool.shrinker);
-
return 0;
}
void __cold btrfs_exit_compress(void)
{
- /* For now scan drains all pages and does not touch the parameters. */
- btrfs_compr_pool_scan(NULL, NULL);
- shrinker_free(compr_pool.shrinker);
-
bioset_exit(&btrfs_compressed_bioset);
}
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 84600b284e1e..f3bb565ede6a 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -95,9 +95,6 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio);
int btrfs_compress_str2level(unsigned int type, const char *str, int *level_ret);
-struct folio *btrfs_alloc_compr_folio(struct btrfs_fs_info *fs_info);
-void btrfs_free_compr_folio(struct folio *folio);
-
struct workspace_manager {
struct list_head idle_ws;
spinlock_t ws_lock;
@@ -144,7 +141,7 @@ static inline void cleanup_compressed_bio(struct compressed_bio *cb)
struct folio_iter fi;
bio_for_each_folio_all(fi, bio)
- btrfs_free_compr_folio(fi.folio);
+ folio_put(fi.folio);
bio_put(bio);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ea3746c14760..c9c58e192a56 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9980,7 +9980,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
size_t bytes = min(min_folio_size, iov_iter_count(from));
char *kaddr;
- folio = btrfs_alloc_compr_folio(fs_info);
+ folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (!folio) {
ret = -ENOMEM;
goto out_cb;
diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 0c9093770739..4cf54de274c9 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -218,7 +218,7 @@ static int copy_compressed_data_to_bio(struct btrfs_fs_info *fs_info,
ASSERT((old_size >> sectorsize_bits) == (old_size + LZO_LEN - 1) >> sectorsize_bits);
if (!*out_folio) {
- *out_folio = btrfs_alloc_compr_folio(fs_info);
+ *out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (!*out_folio)
return -ENOMEM;
}
@@ -245,7 +245,7 @@ static int copy_compressed_data_to_bio(struct btrfs_fs_info *fs_info,
return -E2BIG;
if (!*out_folio) {
- *out_folio = btrfs_alloc_compr_folio(fs_info);
+ *out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (!*out_folio)
return -ENOMEM;
}
@@ -296,7 +296,7 @@ int lzo_compress_bio(struct list_head *ws, struct compressed_bio *cb)
ASSERT(bio->bi_iter.bi_size == 0);
ASSERT(len);
- folio_out = btrfs_alloc_compr_folio(fs_info);
+ folio_out = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (!folio_out)
return -ENOMEM;
@@ -372,7 +372,7 @@ int lzo_compress_bio(struct list_head *ws, struct compressed_bio *cb)
* the endio function of bio.
*/
if (folio_out && IS_ALIGNED(total_out, min_folio_size)) {
- btrfs_free_compr_folio(folio_out);
+ folio_put(folio_out);
folio_out = NULL;
}
if (folio_in)
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index 1a5093525e32..09957bf5ad23 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -175,7 +175,7 @@ int zlib_compress_bio(struct list_head *ws, struct compressed_bio *cb)
workspace->strm.total_in = 0;
workspace->strm.total_out = 0;
- out_folio = btrfs_alloc_compr_folio(fs_info);
+ out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (out_folio == NULL) {
ret = -ENOMEM;
goto out;
@@ -258,7 +258,7 @@ int zlib_compress_bio(struct list_head *ws, struct compressed_bio *cb)
goto out;
}
- out_folio = btrfs_alloc_compr_folio(fs_info);
+ out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (out_folio == NULL) {
ret = -ENOMEM;
goto out;
@@ -296,7 +296,7 @@ int zlib_compress_bio(struct list_head *ws, struct compressed_bio *cb)
goto out;
}
/* Get another folio for the stream end. */
- out_folio = btrfs_alloc_compr_folio(fs_info);
+ out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (out_folio == NULL) {
ret = -ENOMEM;
goto out;
@@ -316,7 +316,7 @@ int zlib_compress_bio(struct list_head *ws, struct compressed_bio *cb)
}
} else {
/* The last folio hasn't' been utilized. */
- btrfs_free_compr_folio(out_folio);
+ folio_put(out_folio);
}
out_folio = NULL;
ASSERT(bio->bi_iter.bi_size == workspace->strm.total_out);
@@ -330,7 +330,7 @@ int zlib_compress_bio(struct list_head *ws, struct compressed_bio *cb)
ret = 0;
out:
if (out_folio)
- btrfs_free_compr_folio(out_folio);
+ folio_put(out_folio);
if (data_in) {
kunmap_local(data_in);
folio_put(in_folio);
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 41547ff187f6..cb19d2ad24ea 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -439,7 +439,7 @@ int zstd_compress_bio(struct list_head *ws, struct compressed_bio *cb)
workspace->in_buf.size = btrfs_calc_input_length(in_folio, end, start);
/* Allocate and map in the output buffer. */
- out_folio = btrfs_alloc_compr_folio(fs_info);
+ out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (out_folio == NULL) {
ret = -ENOMEM;
goto out;
@@ -482,7 +482,7 @@ int zstd_compress_bio(struct list_head *ws, struct compressed_bio *cb)
goto out;
}
- out_folio = btrfs_alloc_compr_folio(fs_info);
+ out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (out_folio == NULL) {
ret = -ENOMEM;
goto out;
@@ -555,7 +555,7 @@ int zstd_compress_bio(struct list_head *ws, struct compressed_bio *cb)
ret = -E2BIG;
goto out;
}
- out_folio = btrfs_alloc_compr_folio(fs_info);
+ out_folio = folio_alloc(GFP_NOFS, fs_info->block_min_order);
if (out_folio == NULL) {
ret = -ENOMEM;
goto out;
@@ -574,7 +574,7 @@ int zstd_compress_bio(struct list_head *ws, struct compressed_bio *cb)
ASSERT(tot_out == bio->bi_iter.bi_size);
out:
if (out_folio)
- btrfs_free_compr_folio(out_folio);
+ folio_put(out_folio);
if (workspace->in_buf.src) {
kunmap_local(workspace->in_buf.src);
folio_put(in_folio);
--
2.53.0
^ permalink raw reply related [flat|nested] 7+ messages in thread

* Re: [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
2026-03-02 8:00 [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio() Qu Wenruo
@ 2026-03-05 2:56 ` David Sterba
2026-03-05 3:09 ` David Sterba
2026-03-05 17:46 ` Boris Burkov
0 siblings, 2 replies; 7+ messages in thread
From: David Sterba @ 2026-03-05 2:56 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Mar 02, 2026 at 06:30:30PM +1030, Qu Wenruo wrote:
> [GLOBAL POOL]
> Btrfs has maintained a global (per-module) pool of folios for compressed
> reads and writes.
>
> That pool is maintained as an LRU list of cached folios, with a shrinker
> to free all cached folios when needed.
>
> btrfs_alloc_compr_folio() first tries to grab an existing folio from
> that LRU list, and falls back to regular folio allocation if the list is
> empty.
>
> And btrfs_free_compr_folio() puts the folio back into the LRU list if
> the current cache level is below the threshold (256 pages).
>
> [EXISTING LIMITS]
> Since the global pool is per-module, there is no way to support different
> folio sizes. This limitation is already affecting bs > ps support, so
> bs > ps cases just bypass the global pool completely.
>
> Another limitation is for bs < ps cases, which are more common.
> In that case we always allocate a full page (which can be as large as
> 64K) for the compressed bio; this can be very wasteful if the on-disk
> extent is pretty small.
>
> [POTENTIAL BUGS]
> Recently David has been reporting bugs related to compression:
>
> - List corruption on that LRU list
>
> - Folio refcount reaching zero in btrfs_free_compr_folio()
>
> Although I haven't yet found a concrete chain of events for how this
> happened, I suspect the semantics of folio->lru differ from page->lru.
> In most cases folio members are 1:1 mapped to page members, but I have
> not seen any driver/fs using folio->lru for its internal bookkeeping.
> (There are users of page->lru though.)
The pool worked well for pages, and it is possible that ->lru has
different semantics that do not carry over to folios. With the support
for large folios and bs < ps the pool has lost its appeal.
> [REMOVAL]
> Personally I'm not a huge fan of maintaining a local folio list just for
> compression.
>
> I believe MM has seen much more pressure and can still properly hand us
> folios/pages.
> I do not think we are doing better than MM, so let's just use regular
> folio allocation instead.
While in principle we should not try to be smarter than the allocator,
there's still a difference when we don't have to use it at all while
memory pressure wants to write out data. The allocation is a place where
writeout can stall or increase the pressure on the rest of the system.

When we cycle through a few pages in the pool we can write lots of data.
Returning pages to the allocator does not guarantee we'll get them back.
Placing that just in the compression path was an exception because there
should ideally be no allocation required on the writeout path. As
there's no generic layer that would do that transparently, it's btrfs
being nice at the small cost of a few pages.
So I stand by the reasoning behind the pool, but with the evolution of
folios and bs < ps the cost could be too high. I can still see the pool
for a subset of the combinations, for some common scenario like 4K block
size on a 64K page host (e.g. ARM).
> And hopefully this will address David's recent crash (as usual I'm not
> able to reproduce locally).
I'll run the test with this patch.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
2026-03-05 2:56 ` David Sterba
@ 2026-03-05 3:09 ` David Sterba
2026-03-05 4:33 ` Qu Wenruo
2026-03-05 17:46 ` Boris Burkov
1 sibling, 1 reply; 7+ messages in thread
From: David Sterba @ 2026-03-05 3:09 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Mar 05, 2026 at 03:56:11AM +0100, David Sterba wrote:
> On Mon, Mar 02, 2026 at 06:30:30PM +1030, Qu Wenruo wrote:
> > And hopefully this will address David's recent crash (as usual I'm not
> > able to reproduce locally).
>
> I'll run the test with this patch.
Still crashes so the lru is a false hunch.
[ 110.693070] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888100000000 pfn:0x111262
[ 110.694052] flags: 0x4000000000000000(node=0|zone=2)
[ 110.694596] raw: 4000000000000000 ffffea00040f2008 ffffea00042088c8 0000000000000000
[ 110.695383] raw: ffff888100000000 0000000000000000 00000000ffffffff 0000000000000000
[ 110.696164] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[ 110.696925] ------------[ cut here ]------------
[ 110.697414] kernel BUG at ./include/linux/mm.h:1493!
[ 110.697955] Oops: invalid opcode: 0000 [#1] SMP KASAN
[ 110.698482] CPU: 8 UID: 0 PID: 12 Comm: kworker/u64:0 Not tainted 7.0.0-rc1-default+ #626 PREEMPT(full)
[ 110.699385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[ 110.700464] Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
[ 110.701110] RIP: 0010:btrfs_compress_bio+0x5c2/0x6a0 [btrfs]
[ 110.702716] RSP: 0018:ffff8881003b79a0 EFLAGS: 00010286
[ 110.703082] RAX: 000000000000003e RBX: ffff88810a83d5f8 RCX: 0000000000000000
[ 110.703550] RDX: 000000000000003e RSI: 0000000000000004 RDI: ffffed1020076f27
[ 110.704019] RBP: 1ffff11020076f37 R08: ffffffff8a444651 R09: fffffbfff195c438
[ 110.704484] R10: 0000000000000003 R11: 0000000000000001 R12: ffffea00044498c0
[ 110.704956] R13: ffffea00044498b4 R14: 0000000000000000 R15: ffffea0004449880
[ 110.705555] FS: 0000000000000000(0000) GS:ffff88818baa0000(0000) knlGS:0000000000000000
[ 110.706197] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 110.706664] CR2: 00007f4aa3c715a0 CR3: 0000000097a59000 CR4: 00000000000006b0
[ 110.707131] Call Trace:
[ 110.707335] <TASK>
[ 110.707522] ? btrfs_compress_filemap_get_folio+0x130/0x130 [btrfs]
[ 110.707999] ? _raw_spin_unlock+0x1a/0x30
[ 110.708307] ? btrfs_compress_heuristic+0x48c/0x700 [btrfs]
[ 110.708766] compress_file_range+0x7b7/0x1640 [btrfs]
[ 110.709169] ? cow_file_range_inline.constprop.0+0x1b0/0x1b0 [btrfs]
[ 110.709629] ? __lock_acquire+0x568/0xbd0
[ 110.709934] ? lock_acquire.part.0+0xad/0x230
[ 110.710240] ? process_one_work+0x7ec/0x1590
[ 110.710550] ? submit_one_async_extent+0xb00/0xb00 [btrfs]
[ 110.710970] btrfs_work_helper+0x1c1/0x760 [btrfs]
[ 110.711354] ? lock_acquire+0x128/0x150
[ 110.711635] process_one_work+0x86b/0x1590
[ 110.711934] ? pwq_dec_nr_in_flight+0x720/0x720
[ 110.712255] ? lock_is_held_type+0x83/0xe0
[ 110.712584] worker_thread+0x5e9/0xfc0
[ 110.712869] ? process_one_work+0x1590/0x1590
[ 110.713179] kthread+0x323/0x410
[ 110.713430] ? _raw_spin_unlock_irq+0x1f/0x40
[ 110.713741] ? kthread_affine_node+0x1c0/0x1c0
[ 110.714058] ret_from_fork+0x476/0x5f0
[ 110.714339] ? arch_exit_to_user_mode_prepare.isra.0+0x60/0x60
[ 110.714730] ? __switch_to+0x22/0xe00
[ 110.715011] ? kthread_affine_node+0x1c0/0x1c0
[ 110.715327] ret_from_fork_asm+0x11/0x20
[ 110.715616] </TASK>
[ 110.715806] Modules linked in: btrfs xor raid6_pq loop
[ 110.716186] ---[ end trace 0000000000000000 ]---
[ 110.716538] RIP: 0010:btrfs_compress_bio+0x5c2/0x6a0 [btrfs]
[ 110.718125] RSP: 0018:ffff8881003b79a0 EFLAGS: 00010286
[ 110.718488] RAX: 000000000000003e RBX: ffff88810a83d5f8 RCX: 0000000000000000
[ 110.718958] RDX: 000000000000003e RSI: 0000000000000004 RDI: ffffed1020076f27
[ 110.719448] RBP: 1ffff11020076f37 R08: ffffffff8a444651 R09: fffffbfff195c438
[ 110.719912] R10: 0000000000000003 R11: 0000000000000001 R12: ffffea00044498c0
[ 110.720373] R13: ffffea00044498b4 R14: 0000000000000000 R15: ffffea0004449880
[ 110.720871] FS: 0000000000000000(0000) GS:ffff88818baa0000(0000) knlGS:0000000000000000
[ 110.721409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 110.721800] CR2: 00007f4aa3c715a0 CR3: 0000000097a59000 CR4: 00000000000006b0
Looks like folio references are wrong, the assert in zlib related to the page
pool was just a symptom and I think actually correct.
The line numbers do not tell anything interesting:
(gdb) l *(btrfs_compress_bio+0x5c2)
0x1f38e2 is in btrfs_compress_bio (./include/linux/mm.h:1493).
1488 /*
1489 * Drop a ref, return true if the refcount fell to zero (the page has no users)
1490 */
1491 static inline int put_page_testzero(struct page *page)
1492 {
1493 VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
1494 return page_ref_dec_and_test(page);
1495 }
1496
1497 static inline int folio_put_testzero(struct folio *folio)
(gdb) l *(compress_file_range+0x7b7)
0xde947 is in compress_file_range (fs/btrfs/inode.c:1014).
1009 } else if (inode->prop_compress) {
1010 compress_type = inode->prop_compress;
1011 }
1012
1013 /* Compression level is applied here. */
1014 cb = btrfs_compress_bio(inode, start, cur_len, compress_type,
1015 compress_level, async_chunk->write_flags);
1016 if (IS_ERR(cb)) {
1017 cb = NULL;
1018 goto mark_incompressible;
(gdb) l *(btrfs_compress_filemap_get_folio+0x130)
0x1f3320 is in btrfs_compress_bio (fs/btrfs/compression.c:902).
897 * to do the round up before submission.
898 */
899 struct compressed_bio *btrfs_compress_bio(struct btrfs_inode *inode,
900 u64 start, u32 len, unsigned int type,
901 int level, blk_opf_t write_flags)
902 {
903 struct btrfs_fs_info *fs_info = inode->root->fs_info;
904 struct list_head *workspace;
905 struct compressed_bio *cb;
906 int ret;
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
2026-03-05 3:09 ` David Sterba
@ 2026-03-05 4:33 ` Qu Wenruo
0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2026-03-05 4:33 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs
On 2026/3/5 13:39, David Sterba wrote:
> On Thu, Mar 05, 2026 at 03:56:11AM +0100, David Sterba wrote:
>> On Mon, Mar 02, 2026 at 06:30:30PM +1030, Qu Wenruo wrote:
>>> And hopefully this will address David's recent crash (as usual I'm not
>>> able to reproduce locally).
>>
>> I'll run the test with this patch.
>
> Still crashes so the lru is a false hunch.
Would you mind trying this branch again?
https://github.com/adam900710/linux/tree/debug
It has only one patch on top of for-next, which adds several
trace_printk() calls related to zlib compression.

My current guess is that zlib compression can add a folio into the bio
but later call folio_put() on that folio, resulting in the problem.
The problem is that just staring at the code gives me no clue.
Also, please enable the "kernel.ftrace_dump_on_oops = 1" sysctl so that
on a crash the ftrace buffer will also be dumped into dmesg.

Finally, this ftrace setup can result in lengthy output, and the default
buffer is very large.
It is recommended to use "trace-cmd set -b 2" to limit the buffer to
2KiB per CPU.
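Collected as a config snippet (run as root; these are exactly the
settings suggested above):

```shell
# Dump the ftrace buffer into dmesg when the kernel oopses.
sysctl kernel.ftrace_dump_on_oops=1

# Limit the ftrace ring buffer to 2KiB per CPU so the dump stays short.
trace-cmd set -b 2
```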
Thanks,
Qu
>
> [ 110.693070] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888100000000 pfn:0x111262
> [ 110.694052] flags: 0x4000000000000000(node=0|zone=2)
> [ 110.694596] raw: 4000000000000000 ffffea00040f2008 ffffea00042088c8 0000000000000000
> [ 110.695383] raw: ffff888100000000 0000000000000000 00000000ffffffff 0000000000000000
> [ 110.696164] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> [ 110.696925] ------------[ cut here ]------------
> [ 110.697414] kernel BUG at ./include/linux/mm.h:1493!
> [ 110.697955] Oops: invalid opcode: 0000 [#1] SMP KASAN
> [ 110.698482] CPU: 8 UID: 0 PID: 12 Comm: kworker/u64:0 Not tainted 7.0.0-rc1-default+ #626 PREEMPT(full)
> [ 110.699385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
> [ 110.700464] Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
> [ 110.701110] RIP: 0010:btrfs_compress_bio+0x5c2/0x6a0 [btrfs]
> [ 110.702716] RSP: 0018:ffff8881003b79a0 EFLAGS: 00010286
> [ 110.703082] RAX: 000000000000003e RBX: ffff88810a83d5f8 RCX: 0000000000000000
> [ 110.703550] RDX: 000000000000003e RSI: 0000000000000004 RDI: ffffed1020076f27
> [ 110.704019] RBP: 1ffff11020076f37 R08: ffffffff8a444651 R09: fffffbfff195c438
> [ 110.704484] R10: 0000000000000003 R11: 0000000000000001 R12: ffffea00044498c0
> [ 110.704956] R13: ffffea00044498b4 R14: 0000000000000000 R15: ffffea0004449880
> [ 110.705555] FS: 0000000000000000(0000) GS:ffff88818baa0000(0000) knlGS:0000000000000000
> [ 110.706197] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 110.706664] CR2: 00007f4aa3c715a0 CR3: 0000000097a59000 CR4: 00000000000006b0
> [ 110.707131] Call Trace:
> [ 110.707335] <TASK>
> [ 110.707522] ? btrfs_compress_filemap_get_folio+0x130/0x130 [btrfs]
> [ 110.707999] ? _raw_spin_unlock+0x1a/0x30
> [ 110.708307] ? btrfs_compress_heuristic+0x48c/0x700 [btrfs]
> [ 110.708766] compress_file_range+0x7b7/0x1640 [btrfs]
> [ 110.709169] ? cow_file_range_inline.constprop.0+0x1b0/0x1b0 [btrfs]
> [ 110.709629] ? __lock_acquire+0x568/0xbd0
> [ 110.709934] ? lock_acquire.part.0+0xad/0x230
> [ 110.710240] ? process_one_work+0x7ec/0x1590
> [ 110.710550] ? submit_one_async_extent+0xb00/0xb00 [btrfs]
> [ 110.710970] btrfs_work_helper+0x1c1/0x760 [btrfs]
> [ 110.711354] ? lock_acquire+0x128/0x150
> [ 110.711635] process_one_work+0x86b/0x1590
> [ 110.711934] ? pwq_dec_nr_in_flight+0x720/0x720
> [ 110.712255] ? lock_is_held_type+0x83/0xe0
> [ 110.712584] worker_thread+0x5e9/0xfc0
> [ 110.712869] ? process_one_work+0x1590/0x1590
> [ 110.713179] kthread+0x323/0x410
> [ 110.713430] ? _raw_spin_unlock_irq+0x1f/0x40
> [ 110.713741] ? kthread_affine_node+0x1c0/0x1c0
> [ 110.714058] ret_from_fork+0x476/0x5f0
> [ 110.714339] ? arch_exit_to_user_mode_prepare.isra.0+0x60/0x60
> [ 110.714730] ? __switch_to+0x22/0xe00
> [ 110.715011] ? kthread_affine_node+0x1c0/0x1c0
> [ 110.715327] ret_from_fork_asm+0x11/0x20
> [ 110.715616] </TASK>
> [ 110.715806] Modules linked in: btrfs xor raid6_pq loop
> [ 110.716186] ---[ end trace 0000000000000000 ]---
> [ 110.716538] RIP: 0010:btrfs_compress_bio+0x5c2/0x6a0 [btrfs]
> [ 110.718125] RSP: 0018:ffff8881003b79a0 EFLAGS: 00010286
> [ 110.718488] RAX: 000000000000003e RBX: ffff88810a83d5f8 RCX: 0000000000000000
> [ 110.718958] RDX: 000000000000003e RSI: 0000000000000004 RDI: ffffed1020076f27
> [ 110.719448] RBP: 1ffff11020076f37 R08: ffffffff8a444651 R09: fffffbfff195c438
> [ 110.719912] R10: 0000000000000003 R11: 0000000000000001 R12: ffffea00044498c0
> [ 110.720373] R13: ffffea00044498b4 R14: 0000000000000000 R15: ffffea0004449880
> [ 110.720871] FS: 0000000000000000(0000) GS:ffff88818baa0000(0000) knlGS:0000000000000000
> [ 110.721409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 110.721800] CR2: 00007f4aa3c715a0 CR3: 0000000097a59000 CR4: 00000000000006b0
>
> Looks like folio references are wrong, the assert in zlib related to the page
> pool was just a symptom and I think actually correct.
>
> The line numbers do not tell anything interesting:
>
> (gdb) l *(btrfs_compress_bio+0x5c2)
> 0x1f38e2 is in btrfs_compress_bio (./include/linux/mm.h:1493).
> 1488 /*
> 1489 * Drop a ref, return true if the refcount fell to zero (the page has no users)
> 1490 */
> 1491 static inline int put_page_testzero(struct page *page)
> 1492 {
> 1493 VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> 1494 return page_ref_dec_and_test(page);
> 1495 }
> 1496
> 1497 static inline int folio_put_testzero(struct folio *folio)
>
> (gdb) l *(compress_file_range+0x7b7)
> 0xde947 is in compress_file_range (fs/btrfs/inode.c:1014).
> 1009 } else if (inode->prop_compress) {
> 1010 compress_type = inode->prop_compress;
> 1011 }
> 1012
> 1013 /* Compression level is applied here. */
> 1014 cb = btrfs_compress_bio(inode, start, cur_len, compress_type,
> 1015 compress_level, async_chunk->write_flags);
> 1016 if (IS_ERR(cb)) {
> 1017 cb = NULL;
> 1018 goto mark_incompressible;
>
> (gdb) l *(btrfs_compress_filemap_get_folio+0x130)
> 0x1f3320 is in btrfs_compress_bio (fs/btrfs/compression.c:902).
> 897 * to do the round up before submission.
> 898 */
> 899 struct compressed_bio *btrfs_compress_bio(struct btrfs_inode *inode,
> 900 u64 start, u32 len, unsigned int type,
> 901 int level, blk_opf_t write_flags)
> 902 {
> 903 struct btrfs_fs_info *fs_info = inode->root->fs_info;
> 904 struct list_head *workspace;
> 905 struct compressed_bio *cb;
> 906 int ret;
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
2026-03-05 2:56 ` David Sterba
2026-03-05 3:09 ` David Sterba
@ 2026-03-05 17:46 ` Boris Burkov
2026-03-05 22:43 ` Qu Wenruo
1 sibling, 1 reply; 7+ messages in thread
From: Boris Burkov @ 2026-03-05 17:46 UTC (permalink / raw)
To: David Sterba; +Cc: Qu Wenruo, linux-btrfs
On Thu, Mar 05, 2026 at 03:56:11AM +0100, David Sterba wrote:
> On Mon, Mar 02, 2026 at 06:30:30PM +1030, Qu Wenruo wrote:
> > [GLOBAL POOL]
> > Btrfs has maintained a global (per-module) pool of folios for compressed
> > reads and writes.
> >
> > That pool is maintained as an LRU list of cached folios, with a shrinker
> > to free all cached folios when needed.
> >
> > btrfs_alloc_compr_folio() first tries to grab an existing folio from
> > that LRU list, and falls back to regular folio allocation if the list is
> > empty.
> >
> > And btrfs_free_compr_folio() puts the folio back into the LRU list if
> > the current cache level is below the threshold (256 pages).
> >
> > [EXISTING LIMITS]
> > Since the global pool is per-module, there is no way to support different
> > folio sizes. This limitation is already affecting bs > ps support, so
> > bs > ps cases just bypass the global pool completely.
> >
> > Another limitation is for bs < ps cases, which are more common.
> > In that case we always allocate a full page (which can be as large as
> > 64K) for the compressed bio; this can be very wasteful if the on-disk
> > extent is pretty small.
> >
> > [POTENTIAL BUGS]
> > Recently David has been reporting bugs related to compression:
> >
> > - List corruption on that LRU list
> >
> > - Folio refcount reaching zero in btrfs_free_compr_folio()
> >
> > Although I haven't yet found a concrete chain of events for how this
> > happened, I suspect the semantics of folio->lru differ from page->lru.
> > In most cases folio members are 1:1 mapped to page members, but I have
> > not seen any driver/fs using folio->lru for its internal bookkeeping.
> > (There are users of page->lru though.)
>
> The pool worked well for pages and it is possible that the ->lru has
> different semantics that do not apply the same for folios. With the
> support for large folios and bs < ps the pool lost its appeal.
>
> > [REMOVAL]
> > Personally I'm not a huge fan of maintaining a local folio list just for
> > compression.
> >
> > I believe MM has seen much more pressure and can still properly give us
> > folios/pages.
> > I do not think we are doing it better than MM, so let's just use regular
> > folio allocation instead.
>
> While in principle we should not try to be smarter than the allocator,
> there's still a difference: with the pool we don't have to touch the
> allocator at all under memory pressure that wants to write out data.
> The allocation is a place where writeout can stall or increase the
> pressure on the rest of the system.
>
> When we cycle through a few pages in the pool we can write lots of
> data. Returning pages to the allocator does not guarantee we'll get
> them back. Adding the pool just to the compression path was an
> exception because there should ideally be no allocation required on
> the writeout path. As there's no generic layer that would do that
> transparently, it's btrfs being nice at the small cost of a few pages.
>
> So I stand by the reason of the pool but with the evolution of folios
> and bs < ps the cost could be too high. I can still see the pool for a
> subset of the combinations for some common scenario like 4K block size
> on 64K page host (e.g. ARM).
>
FWIW, we do already see issues with allocation in compressed IO in
production, so I am hesitant to support this change.
On the one hand, we see these issues even with the pool, so it is not
completely saving us anyway. On the other hand, removing it seems likely
to make things worse in the way that you predict, Dave.
If we do move forward with this, I will try to watch such errors closely
on the first release of a kernel without the pool.
Thanks,
Boris
> > And hopefully this will address David's recent crash (as usual I'm not
> > able to reproduce locally).
>
> I'll run the test with this patch.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
2026-03-05 17:46 ` Boris Burkov
@ 2026-03-05 22:43 ` Qu Wenruo
2026-03-05 22:45 ` Boris Burkov
0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2026-03-05 22:43 UTC (permalink / raw)
To: Boris Burkov, David Sterba; +Cc: Qu Wenruo, linux-btrfs
On 2026/3/6 04:16, Boris Burkov wrote:
[...]
>> So I stand by the reason of the pool but with the evolution of folios
>> and bs < ps the cost could be too high. I can still see the pool for a
>> subset of the combinations for some common scenario like 4K block size
>> on 64K page host (e.g. ARM).
>>
>
> FWIW, we do already see issues with allocation in compressed IO in
> production, so I am hesitant to support this change.
May I ask how frequent the problem is, and what the CPU arch is?
Finally, is it only happening after commit 6f706f34fc4c ("btrfs: switch
to btrfs_compress_bio() interface for compressed writes"), i.e. the v7.0
kernel?
I tried my best locally to introduce extra ASSERT()s to make sure every
folio inside a compressed bio has a refcount of 1, but none has
triggered yet inside zlib_compress_bio().
Thanks,
Qu
>
> On the one hand, we have the issues so the pool is not completely saving
> us anyway. On the other hand, removing it seems likely to make this worse
> in the way that you predict, Dave.
>
> If we do move forward with this, I will try to watch such errors closely
> on the first release of a kernel without the pool.
>
> Thanks,
> Boris
>
>>> And hopefully this will address David's recent crash (as usual I'm not
>>> able to reproduce locally).
>>
>> I'll run the test with this patch.
>
* Re: [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio()
2026-03-05 22:43 ` Qu Wenruo
@ 2026-03-05 22:45 ` Boris Burkov
0 siblings, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2026-03-05 22:45 UTC (permalink / raw)
To: Qu Wenruo; +Cc: David Sterba, Qu Wenruo, linux-btrfs
On Fri, Mar 06, 2026 at 09:13:23AM +1030, Qu Wenruo wrote:
>
>
> On 2026/3/6 04:16, Boris Burkov wrote:
> [...]
> > > So I stand by the reason of the pool but with the evolution of folios
> > > and bs < ps the cost could be too high. I can still see the pool for a
> > > subset of the combinations for some common scenario like 4K block size
> > > on 64K page host (e.g. ARM).
> > >
> >
> > FWIW, we do already see issues with allocation in compressed IO in
> > production, so I am hesitant to support this change.
>
> May I ask how frequent the problem is, and what the CPU arch is?
> Finally, is it only happening after commit 6f706f34fc4c ("btrfs: switch to
> btrfs_compress_bio() interface for compressed writes"), i.e. the v7.0 kernel?
>
> I tried my best locally to introduce extra ASSERT()s to make sure every
> folio inside a compressed bio has a refcount of 1, but none has triggered
> yet inside zlib_compress_bio().
>
> Thanks,
> Qu
>
I'm sorry, I didn't mean to give you the wrong impression.
I have not seen the bug you and Dave have been hunting.
I meant generic allocation issues in more complex paths involving
writeback/reclaim/page faults, etc., which Dave was warning about with
regard to removing the compr pool.
Thanks,
Boris
> >
> > On the one hand, we have the issues so the pool is not completely saving
> > us anyway. On the other hand, removing it seems likely to make this worse
> > in the way that you predict, Dave.
> >
> > If we do move forward with this, I will try to watch such errors closely
> > on the first release of a kernel without the pool.
> >
> > Thanks,
> > Boris
> >
> > > > And hopefully this will address David's recent crash (as usual I'm not
> > > > able to reproduce locally).
> > >
> > > I'll run the test with this patch.
> >
>
end of thread, other threads:[~2026-03-05 22:45 UTC | newest]
Thread overview: 7+ messages
2026-03-02 8:00 [PATCH RFC] btrfs: get rid of btrfs_(alloc|free)_compr_folio() Qu Wenruo
2026-03-05 2:56 ` David Sterba
2026-03-05 3:09 ` David Sterba
2026-03-05 4:33 ` Qu Wenruo
2026-03-05 17:46 ` Boris Burkov
2026-03-05 22:43 ` Qu Wenruo
2026-03-05 22:45 ` Boris Burkov