public inbox for linux-kernel@vger.kernel.org
* erofs pointer corruption and kernel crash
@ 2026-04-10  8:13 Arseniy Krasnov
  2026-04-10  8:31 ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10  8:13 UTC
  To: Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li,
	Sheng Yong
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel

Hi,

We found unexpected behaviour in erofs:

The erofs function 'erofs_onlinefolio_end()' takes a pointer to
'struct folio' as its first argument and contains a loop that updates
the 'private' field of that folio:

  do {
          orig = atomic_read((atomic_t *)&folio->private);
          DBG_BUGON(orig <= 0);
          v = dirty << EROFS_ONLINEFOLIO_DIRTY;
          v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
  } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);

Now we see that in some rare cases this function processes a folio whose
'private' field holds a pointer, so the loop modifies some bits of that
pointer. Later the kernel dereferences the modified pointer and crashes.

To catch this, we used the following small debug patch (i.e. we check
whether the 'private' field looks like a pointer):

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 33cb0a7330d2..b1d8deffec4d 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
 {
     int orig, v;
 
+    if (((uintptr_t)folio->private) & 0xffff000000000000) {
+        pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE BEFORE %px\n", __func__, __LINE__, folio, folio->private);
+        dump_stack();
+    }
+
     do {
         orig = atomic_read((atomic_t *)&folio->private);
         DBG_BUGON(orig <= 0);
@@ -245,6 +250,9 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
         v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
     } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
 
+    if (((uintptr_t)folio->private) & 0xffff000000000000)
+        pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE SET %px\n", __func__, __LINE__, folio, folio->private);
+
     if (v & (BIT(EROFS_ONLINEFOLIO_DIRTY) - 1))
         return;
     folio->private = 0;


It produces the following output:

[][  T639] [foliodbg] erofs_onlinefolio_end:242 EROFS FOLIO fffffdffc0030440 PRIVATE BEFORE ffff000002b32468
[][  T639] CPU: 0 UID: 0 PID: 639 Comm: kworker/0:6H Tainted: G O 6.15.11-sdkernel #1 PREEMPT
[][  T639] Tainted: [O]=OOT_MODULE
[][  T639] Workqueue: kverityd verity_work
[][  T639] Call trace:
[][  T639]  show_stack+0x18/0x30 (C)
[][  T639]  dump_stack_lvl+0x60/0x80
[][  T639]  dump_stack+0x18/0x24
[][  T639]  erofs_onlinefolio_end+0x124/0x130
[][  T639]  z_erofs_decompress_queue+0x4b0/0x8c0
[][  T639]  z_erofs_decompress_kickoff+0x88/0x150
[][  T639]  z_erofs_endio+0x144/0x250
[][  T639]  bio_endio+0x138/0x150
[][  T639]  __dm_io_complete+0x1e0/0x2b0
[][  T639]  clone_endio+0xd0/0x270
[][  T639]  bio_endio+0x138/0x150
[][  T639]  verity_finish_io+0x64/0xf0
[][  T639]  verity_work+0x30/0x40
[][  T639]  process_one_work+0x180/0x2e0
[][  T639]  worker_thread+0x2c4/0x3f0
[][  T639]  kthread+0x12c/0x210
[][  T639]  ret_from_fork+0x10/0x20
[][  T639]
[][  T639] [foliodbg] erofs_onlinefolio_end:254 EROFS FOLIO fffffdffc0030440 PRIVATE SET ffff000022b32467
[][   T39] Unable to handle kernel paging request at virtual address ffff000022b32467
[][   T39] Mem abort info:
[][   T39]   ESR = 0x0000000096000006
[][   T39]   EC = 0x25: DABT (current EL), IL = 32 bits
[][   T39]   SET = 0, FnV = 0
[][   T39]   EA = 0, S1PTW = 0
[][   T39]   FSC = 0x06: level 2 translation fault
[][   T39] Data abort info:
[][   T39]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[][   T39]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[][   T39]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[][   T39] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000001e36000
[][   T39] [ffff000022b32467] pgd=1800000007fff403, p4d=1800000007fff403, pud=1800000007ffe403, pmd=0000000000000000
[][   T39] Internal error: Oops: 0000000096000006 [#1]  SMP
[][   T39] Modules linked in: vlsicomm(O)
[][   T39] CPU: 1 UID: 0 PID: 39 Comm: kswapd0 Tainted: G O 6.15.11-sdkernel #1 PREEMPT
[][   T39] Tainted: [O]=OOT_MODULE
[][   T39] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[][   T39] pc : drop_buffers.constprop.0+0x34/0x120
[][   T39] lr : try_to_free_buffers+0xd0/0x100
[][   T39] sp : ffff80008105b780
[][   T39] x29: ffff80008105b780 x28: 0000000000000000 x27: fffffdffc0030448
[][   T39] x26: ffff80008105b8a0 x25: ffff80008105b868 x24: 0000000000000001
[][   T39] x23: fffffdffc0030440 x22: ffff80008105b7b0 x21: fffffdffc0030440
[][   T39] x20: ffff000022b32467 x19: ffff000022b32467 x18: 0000000000000000
[][   T39] x17: 0000000000000000 x16: 0000000000000000 x15: 00000000d69f4cc0
[][   T39] x14: ffff0000000c5dc0 x13: 0000000000000000 x12: ffff800080d59b58
[][   T39] x11: 00000000000000c0 x10: 0000000000000000 x9 : 0000000000000000
[][   T39] x8 : ffff80008105b7d0 x7 : 0000000000000000 x6 : 000000000000003f
[][   T39] x5 : 0000000000000000 x4 : fffffdffc0030440 x3 : 1ff0000000004001
[][   T39] x2 : 1ff0000000004001 x1 : ffff80008105b7b0 x0 : fffffdffc0030440
[][   T39] Call trace:
[][   T39]  drop_buffers.constprop.0+0x34/0x120 (P)
[][   T39]  try_to_free_buffers+0xd0/0x100
[][   T39]  filemap_release_folio+0x94/0xc0
[][   T39]  shrink_folio_list+0x8c8/0xc40
[][   T39]  shrink_lruvec+0x740/0xb80
[][   T39]  shrink_node+0x2b8/0x9a0
[][   T39]  balance_pgdat+0x3b8/0x760
[][   T39]  kswapd+0x220/0x3b0
[][   T39]  kthread+0x12c/0x210
[][   T39]  ret_from_fork+0x10/0x20
[][   T39] Code: 14000004 f9400673 eb13029f 54000180 (f9400262)
[][   T39] ---[ end trace 0000000000000000 ]---
[][   T39] Kernel panic - not syncing: Oops: Fatal exception
[][   T39] SMP: stopping secondary CPUs
[][   T39] Kernel Offset: disabled
[][   T39] CPU features: 0x0000,00000000,01000000,0200420b
[][   T39] Memory Limit: none
[][   T39] Rebooting in 5 seconds..

So 'erofs_onlinefolio_end()' takes a folio whose 'private' field contains
a pointer (0xffff000002b32468) and "corrupts" it: the result is
0xffff000022b32467, i.e. the low 32 bits were decremented by one and
0x20000000, which is (1 << EROFS_ONLINEFOLIO_DIRTY), was ORed in. Then the
kernel crashes. We guess that passing such a folio to
'erofs_onlinefolio_end()' is not a valid case.

We have the following erofs configuration in buildroot:

BR2_TARGET_ROOTFS_EROFS=y
BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
BR2_TARGET_ROOTFS_EROFS_FRAGMENTS=y
BR2_TARGET_ROOTFS_EROFS_PCLUSTERSIZE=65536



Maybe you know how to fix this, or have some ideas? We are new to erofs and
still need to study its source code.

Thanks


* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:13 erofs pointer corruption and kernel crash Arseniy Krasnov
@ 2026-04-10  8:31 ` Gao Xiang
  2026-04-10  8:42   ` Gao Xiang
                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Gao Xiang @ 2026-04-10  8:31 UTC
  To: Arseniy Krasnov
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong

Hi,

On 2026/4/10 16:13, Arseniy Krasnov wrote:
> Hi,
> 
> We found unexpected behaviour of erofs:
> 
> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
> 'struct folio' as first argument, and there is loop inside this function,
> which updates 'private' field of provided folio:
> 
>    do {
>            orig = atomic_read((atomic_t *)&folio->private);
>            DBG_BUGON(orig <= 0);
>            v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>            v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>    } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
> 
> Now, we see that in some rare case, this function processes folio, where
> 'private' is pointer, and thus this loop will update some bits in this
> pointer. Then later kernel dereferences such pointer and crashes.
> 
> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
> 
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 33cb0a7330d2..b1d8deffec4d 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>   {
>       int orig, v;
>   
> +    if (((uintptr_t)folio->private) & 0xffff000000000000) {

No, when erofs_onlinefolio_end() is called, `folio->private`
shouldn't be a pointer; it only holds a counter, and
a pointer stored there is unexpected.

And since the folio is locked, the kernel shouldn't call into
try_to_free_buffers() for it.

Is it easy to reproduce? If so, can you also print other
values such as `folio->mapping` and `folio->index`?

I need more information to find some clues.

Thanks,
Gao Xiang




* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:31 ` Gao Xiang
@ 2026-04-10  8:42   ` Gao Xiang
  2026-04-10  8:51     ` Arseniy Krasnov
  2026-04-10  8:55   ` Arseniy Krasnov
  2026-04-10 11:37   ` Arseniy Krasnov
  2 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-10  8:42 UTC
  To: Arseniy Krasnov
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



On 2026/4/10 16:31, Gao Xiang wrote:
> Hi,
> 
> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>> Hi,
>>
>> We found unexpected behaviour of erofs:
>>
>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>> 'struct folio' as first argument, and there is loop inside this function,
>> which updates 'private' field of provided folio:
>>
>>    do {
>>            orig = atomic_read((atomic_t *)&folio->private);
>>            DBG_BUGON(orig <= 0);
>>            v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>            v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>    } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>
>> Now, we see that in some rare case, this function processes folio, where
>> 'private' is pointer, and thus this loop will update some bits in this
>> pointer. Then later kernel dereferences such pointer and crashes.
>>
>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>
>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>> index 33cb0a7330d2..b1d8deffec4d 100644
>> --- a/fs/erofs/data.c
>> +++ b/fs/erofs/data.c
>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>   {
>>       int orig, v;
>> +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
> 
> No, if erofs_onlinefolio_end() is called, `folio->private`
> shouldn't be a pointer, it's just a counter inside, and
> storing a pointer is unexpected.
> 
> And since the folio is locked, it shouldn't call into
> try_to_free_buffers().
> 
> Is it easy to reproduce? if yes, can you print other
> values like `folio->mapping` and `folio->index` as
> well?
> 
> I need more informations to find some clues.

BTW, is "6.15.11-sdkernel" an unmodified upstream kernel?
So far I have never heard of this being hit by, for example,
Android phone vendors using 6.12 LTS. If it can be reproduced
easily, is it possible to reproduce it with the upstream kernel?

And is "0xffff000002b32468" a valid pointer? What does it point
to? If it is an erofs pointer, the only candidate I can think of
is "struct z_erofs_pcluster". If that's not the case, I think
something else must be wrong if the kernel is modified.




* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:42   ` Gao Xiang
@ 2026-04-10  8:51     ` Arseniy Krasnov
  2026-04-10  8:59       ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10  8:51 UTC
  To: Gao Xiang
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



10.04.2026 11:42, Gao Xiang wrote:
> 
> 
> On 2026/4/10 16:31, Gao Xiang wrote:
>> Hi,
>>
>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>> Hi,
>>>
>>> We found unexpected behaviour of erofs:
>>>
>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>> 'struct folio' as first argument, and there is loop inside this function,
>>> which updates 'private' field of provided folio:
>>>
>>>    do {
>>>            orig = atomic_read((atomic_t *)&folio->private);
>>>            DBG_BUGON(orig <= 0);
>>>            v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>            v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>    } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>
>>> Now, we see that in some rare case, this function processes folio, where
>>> 'private' is pointer, and thus this loop will update some bits in this
>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>
>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>
>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>> --- a/fs/erofs/data.c
>>> +++ b/fs/erofs/data.c
>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>   {
>>>       int orig, v;
>>> +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>
>> No, if erofs_onlinefolio_end() is called, `folio->private`
>> shouldn't be a pointer, it's just a counter inside, and
>> storing a pointer is unexpected.
>>
>> And since the folio is locked, it shouldn't call into
>> try_to_free_buffers().
>>
>> Is it easy to reproduce? if yes, can you print other
>> values like `folio->mapping` and `folio->index` as
>> well?
>>
>> I need more informations to find some clues.
> 
> btw, is that an unmodified upstream kernel "6.15.11-sdkernel"?
> Currently I never heard Android phone vendors using 6.12 LTS
> for example hit this. If it can easily reproduced, is it
> possible to reproduce with the upstream kernel?

Yes, this is just the upstream kernel with no vendor modifications. It is not
Android, just buildroot.

> 
> And is the "0xffff000002b32468" pointer a valid pointer? what
> does it point to? If it looks erofs pointer, the only one I
> can think out is "struct z_erofs_pcluster", if it's not the
> case, I think there should be other thing wrong if the kernel
> is modified.

Yes, it is a valid pointer, but I need to check what it points to. I'll report back here.

Thanks




* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:31 ` Gao Xiang
  2026-04-10  8:42   ` Gao Xiang
@ 2026-04-10  8:55   ` Arseniy Krasnov
  2026-04-10  9:20     ` Gao Xiang
  2026-04-10 11:37   ` Arseniy Krasnov
  2 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10  8:55 UTC (permalink / raw)
  To: Gao Xiang
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



10.04.2026 11:31, Gao Xiang wrote:
> Hi,
> 
> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>> Hi,
>>
>> We found unexpected behaviour of erofs:
>>
>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>> 'struct folio' as first argument, and there is loop inside this function,
>> which updates 'private' field of provided folio:
>>
>>    do {
>>            orig = atomic_read((atomic_t *)&folio->private);
>>            DBG_BUGON(orig <= 0);
>>            v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>            v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>    } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>
>> Now, we see that in some rare case, this function processes folio, where
>> 'private' is pointer, and thus this loop will update some bits in this
>> pointer. Then later kernel dereferences such pointer and crashes.
>>
>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>
>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>> index 33cb0a7330d2..b1d8deffec4d 100644
>> --- a/fs/erofs/data.c
>> +++ b/fs/erofs/data.c
>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>   {
>>       int orig, v;
>>   +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
> 
> No, if erofs_onlinefolio_end() is called, `folio->private`
> shouldn't be a pointer, it's just a counter inside, and
> storing a pointer is unexpected.

I see. Ok.

> 
> And since the folio is locked, it shouldn't call into
> try_to_free_buffers().
> 
> Is it easy to reproduce? if yes, can you print other
> values like `folio->mapping` and `folio->index` as
> well?

The reproduction rate is low (at least in our case). We generate IO activity while memory is low
and, at the same time, run 'echo 3 > /proc/sys/vm/drop_caches' in a loop. After about 1 hour
the kernel crashes with the trace from this mail.
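
A minimal sketch of the drop_caches part of such a loop, written as a C helper rather than a shell one-liner. The helper names and the `path` parameter are illustrative assumptions (they are not taken from any existing script); the path is a parameter so the helper can be tried against an ordinary file without root.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "3" to a drop_caches-style file.
 * On a real system the path would be /proc/sys/vm/drop_caches (needs root).
 * Returns 0 on success, -1 on failure.
 */
static int drop_caches_once(const char *path)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs("3\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Repeat the write `rounds` times; stops at the first failure. */
static int drop_caches_loop(const char *path, int rounds)
{
	for (int i = 0; i < rounds; i++)
		if (drop_caches_once(path) != 0)
			return -1;
	return 0;
}
```

Calling drop_caches_loop("/proc/sys/vm/drop_caches", n) as root while the IO load is running would mirror the shell loop described above.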

Anyway I'll reproduce and print 'folio->mapping' and 'folio->index' for you.

Thanks

> 
> I need more informations to find some clues.

Ok, we can reproduce it and provide information.

Thanks.

> 
> Thanks,
> Gao Xiang
> 
>> +        pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE BEFORE %px\n", __func__, __LINE__, folio, folio->private);
>> +        dump_stack();
>> +    }
>> +
>>       do {
>>           orig = atomic_read((atomic_t *)&folio->private);
>>           DBG_BUGON(orig <= 0);
>> @@ -245,6 +250,9 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>           v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>       } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>   +    if (((uintptr_t)folio->private) & 0xffff000000000000)
>> +        pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE SET %px\n", __func__, __LINE__, folio, folio->private);
>> +
>>       if (v & (BIT(EROFS_ONLINEFOLIO_DIRTY) - 1))
>>           return;
>>       folio->private = 0;
>>
>>
>> And it gives result:
>>
>> [][  T639] [foliodbg] erofs_onlinefolio_end:242 EROFS FOLIO fffffdffc0030440 PRIVATE BEFORE ffff000002b32468
>> [][  T639] CPU: 0 UID: 0 PID: 639 Comm: kworker/0:6H Tainted: G O 6.15.11-sdkernel #1 PREEMPT
>> [][  T639] Tainted: [O]=OOT_MODULE
>> [][  T639] Workqueue: kverityd verity_work
>> [][  T639] Call trace:
>> [][  T639]  show_stack+0x18/0x30 (C)
>> [][  T639]  dump_stack_lvl+0x60/0x80
>> [][  T639]  dump_stack+0x18/0x24
>> [][  T639]  erofs_onlinefolio_end+0x124/0x130
>> [][  T639]  z_erofs_decompress_queue+0x4b0/0x8c0
>> [][  T639]  z_erofs_decompress_kickoff+0x88/0x150
>> [][  T639]  z_erofs_endio+0x144/0x250
>> [][  T639]  bio_endio+0x138/0x150
>> [][  T639]  __dm_io_complete+0x1e0/0x2b0
>> [][  T639]  clone_endio+0xd0/0x270
>> [][  T639]  bio_endio+0x138/0x150
>> [][  T639]  verity_finish_io+0x64/0xf0
>> [][  T639]  verity_work+0x30/0x40
>> [][  T639]  process_one_work+0x180/0x2e0
>> [][  T639]  worker_thread+0x2c4/0x3f0
>> [][  T639]  kthread+0x12c/0x210
>> [][  T639]  ret_from_fork+0x10/0x20
>> [][  T639]
>> [][  T639] [foliodbg] erofs_onlinefolio_end:254 EROFS FOLIO fffffdffc0030440 PRIVATE SET ffff000022b32467
>> [][   T39] Unable to handle kernel paging request at virtual address ffff000022b32467
>> [][   T39] Mem abort info:
>> [][   T39]   ESR = 0x0000000096000006
>> [][   T39]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [][   T39]   SET = 0, FnV = 0
>> [][   T39]   EA = 0, S1PTW = 0
>> [][   T39]   FSC = 0x06: level 2 translation fault
>> [][   T39] Data abort info:
>> [][   T39]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
>> [][   T39]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>> [][   T39]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> [][   T39] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000001e36000
>> [][   T39] [ffff000022b32467] pgd=1800000007fff403, p4d=1800000007fff403, pud=1800000007ffe403, pmd=0000000000000000
>> [][   T39] Internal error: Oops: 0000000096000006 [#1]  SMP
>> [][   T39] Modules linked in: vlsicomm(O)
>> [][   T39] CPU: 1 UID: 0 PID: 39 Comm: kswapd0 Tainted: G O 6.15.11-sdkernel #1 PREEMPT
>> [][   T39] Tainted: [O]=OOT_MODULE
>> [][   T39] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [][   T39] pc : drop_buffers.constprop.0+0x34/0x120
>> [][   T39] lr : try_to_free_buffers+0xd0/0x100
>> [][   T39] sp : ffff80008105b780
>> [][   T39] x29: ffff80008105b780 x28: 0000000000000000 x27: fffffdffc0030448
>> [][   T39] x26: ffff80008105b8a0 x25: ffff80008105b868 x24: 0000000000000001
>> [][   T39] x23: fffffdffc0030440 x22: ffff80008105b7b0 x21: fffffdffc0030440
>> [][   T39] x20: ffff000022b32467 x19: ffff000022b32467 x18: 0000000000000000
>> [][   T39] x17: 0000000000000000 x16: 0000000000000000 x15: 00000000d69f4cc0
>> [][   T39] x14: ffff0000000c5dc0 x13: 0000000000000000 x12: ffff800080d59b58
>> [][   T39] x11: 00000000000000c0 x10: 0000000000000000 x9 : 0000000000000000
>> [][   T39] x8 : ffff80008105b7d0 x7 : 0000000000000000 x6 : 000000000000003f
>> [][   T39] x5 : 0000000000000000 x4 : fffffdffc0030440 x3 : 1ff0000000004001
>> [][   T39] x2 : 1ff0000000004001 x1 : ffff80008105b7b0 x0 : fffffdffc0030440
>> [][   T39] Call trace:
>> [][   T39]  drop_buffers.constprop.0+0x34/0x120 (P)
>> [][   T39]  try_to_free_buffers+0xd0/0x100
>> [][   T39]  filemap_release_folio+0x94/0xc0
>> [][   T39]  shrink_folio_list+0x8c8/0xc40
>> [][   T39]  shrink_lruvec+0x740/0xb80
>> [][   T39]  shrink_node+0x2b8/0x9a0
>> [][   T39]  balance_pgdat+0x3b8/0x760
>> [][   T39]  kswapd+0x220/0x3b0
>> [][   T39]  kthread+0x12c/0x210
>> [][   T39]  ret_from_fork+0x10/0x20
>> [][   T39] Code: 14000004 f9400673 eb13029f 54000180 (f9400262)
>> [][   T39] ---[ end trace 0000000000000000 ]---
>> [][   T39] Kernel panic - not syncing: Oops: Fatal exception
>> [][   T39] SMP: stopping secondary CPUs
>> [][   T39] Kernel Offset: disabled
>> [][   T39] CPU features: 0x0000,00000000,01000000,0200420b
>> [][   T39] Memory Limit: none
>> [][   T39] Rebooting in 5 seconds..
>>
>> So 'erofs_onlinefolio_end()' takes some folio with 'private' field contains
>> some pointer (0xffff000002b32468), "corrupts" this pointer (result will be
>> 0xffff000022b32467 - at least we see that 0x20000000 was ORed to original
>> pointer and this is (1 << EROFS_ONLINEFOLIO_DIRTY)), and then kernel crashes.
>> We guess it is not valid case when such folio is passed as argument to
>> 'erofs_onlinefolio_end()'.
>>
>> We have the following erofs configuration in buildroot:
>>
>> BR2_TARGET_ROOTFS_EROFS=y
>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>> BR2_TARGET_ROOTFS_EROFS_FRAGMENTS=y
>> BR2_TARGET_ROOTFS_EROFS_PCLUSTERSIZE=65536
>>
>>
>>
>> May be You know how to fix it or some ideas? Because we are new at erofs and need to discover and
>> learn its source code.
>>
>> Thanks
> 



* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:51     ` Arseniy Krasnov
@ 2026-04-10  8:59       ` Gao Xiang
  0 siblings, 0 replies; 22+ messages in thread
From: Gao Xiang @ 2026-04-10  8:59 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



On 2026/4/10 16:51, Arseniy Krasnov wrote:
> 
> 
> 10.04.2026 11:42, Gao Xiang wrote:
>>
>>
>> On 2026/4/10 16:31, Gao Xiang wrote:
>>> Hi,
>>>
>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>>> Hi,
>>>>
>>>> We found unexpected behaviour of erofs:
>>>>
>>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>>> 'struct folio' as first argument, and there is loop inside this function,
>>>> which updates 'private' field of provided folio:
>>>>
>>>>     do {
>>>>             orig = atomic_read((atomic_t *)&folio->private);
>>>>             DBG_BUGON(orig <= 0);
>>>>             v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>>             v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>>     } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>>
>>>> Now, we see that in some rare case, this function processes folio, where
>>>> 'private' is pointer, and thus this loop will update some bits in this
>>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>>
>>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>>
>>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>>> --- a/fs/erofs/data.c
>>>> +++ b/fs/erofs/data.c
>>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>>    {
>>>>        int orig, v;
>>>> +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>>
>>> No, if erofs_onlinefolio_end() is called, `folio->private`
>>> shouldn't be a pointer, it's just a counter inside, and
>>> storing a pointer is unexpected.
>>>
>>> And since the folio is locked, it shouldn't call into
>>> try_to_free_buffers().
>>>
>>> Is it easy to reproduce? if yes, can you print other
>>> values like `folio->mapping` and `folio->index` as
>>> well?
>>>
>>> I need more informations to find some clues.
>>
>> btw, is that an unmodified upstream kernel "6.15.11-sdkernel"?
>> Currently I never heard Android phone vendors using 6.12 LTS
>> for example hit this. If it can easily reproduced, is it
>> possible to reproduce with the upstream kernel?
> 
> Yes, this is just upstream kernel, no vendor modifications. It is not android, just
> buildroot.

I know, I mean for buildroot workloads, there should be
less pressure since it's just for embedded use.

> 
>>
>> And is the "0xffff000002b32468" pointer a valid pointer? what
>> does it point to? If it looks erofs pointer, the only one I
>> can think out is "struct z_erofs_pcluster", if it's not the
>> case, I think there should be other thing wrong if the kernel
>> is modified.
> 
> Yes, this is valid pointer, need to check about that pointer. I'll feedback here.

Anyway, if z_erofs_decompress_queue->erofs_onlinefolio_end()
is called:
   - the folio should be locked, and folio->private should not
     be a pointer;

   - it seems `PG_Private` is set on the problematic folio
     (otherwise try_to_free_buffers() won't be called), which
     is unexpected too.
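
To make the first point concrete, the arithmetic of the update loop can be replayed in user space. This is only a sketch under stated assumptions: EROFS_ONLINEFOLIO_DIRTY is taken as bit 29 (consistent with the 0x20000000 seen in the report), EROFS_ONLINEFOLIO_EIO as bit 30 (an assumption), and the (atomic_t *) cast is assumed to cover the low 32 bits of `private`, as on a little-endian 64-bit kernel. The helper names are illustrative.

```c
#include <stdint.h>

#define SIM_ONLINEFOLIO_EIO	30	/* assumed value */
#define SIM_ONLINEFOLIO_DIRTY	29	/* 1 << 29 == 0x20000000, as in the report */

/* Same heuristic as the debug patch: any of the top 16 bits set. */
static int looks_like_pointer(uint64_t private_val)
{
	return (private_val & 0xffff000000000000ULL) != 0;
}

/*
 * Replay one pass of the erofs_onlinefolio_end() loop on a 64-bit
 * 'private' value, touching only its low 32 bits (the atomic_t view).
 */
static uint64_t onlinefolio_end_sim(uint64_t private_val, int err, int dirty)
{
	int32_t orig = (int32_t)(uint32_t)private_val;
	int32_t v = dirty << SIM_ONLINEFOLIO_DIRTY;

	v |= (orig - 1) | (!!err << SIM_ONLINEFOLIO_EIO);
	return (private_val & 0xffffffff00000000ULL) | (uint32_t)v;
}
```

With the value from the log, onlinefolio_end_sim(0xffff000002b32468, 0, 1) gives 0xffff000022b32467, which is the faulting address in the oops.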

So what I need for some further analysis are:

   - the folio structure (folio flags, mapping, index, count, etc.);

   - what does folio->private point to?

Also, could I get a memory dump?
Not quite sure if that's possible in a buildroot environment.

Thanks,
Gao Xiang

> 
> Thanks
> 
>>
>>>
>>> Thanks,
>>> Gao Xiang


* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:55   ` Arseniy Krasnov
@ 2026-04-10  9:20     ` Gao Xiang
  2026-04-10  9:59       ` Arseniy Krasnov
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-10  9:20 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



On 2026/4/10 16:55, Arseniy Krasnov wrote:
> 
> 

...

>>>
>>> BR2_TARGET_ROOTFS_EROFS=y
>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"

btw, may I ask what's the erofs-utils version?
erofs-utils 1.9?

I guess it's related to a relatively new experimental
feature (-E48bit + encoded extents + zstd) introduced in v6.15.

If you don't use this new feature, the issue may not be
reproduced anymore.

Thanks,
Gao Xiang


>>> BR2_TARGET_ROOTFS_EROFS_FRAGMENTS=y
>>> BR2_TARGET_ROOTFS_EROFS_PCLUSTERSIZE=65536
>>>
>>>
>>>
>>> May be You know how to fix it or some ideas? Because we are new at erofs and need to discover and
>>> learn its source code.
>>>
>>> Thanks
>>



* Re: erofs pointer corruption and kernel crash
  2026-04-10  9:20     ` Gao Xiang
@ 2026-04-10  9:59       ` Arseniy Krasnov
  2026-04-10 10:01         ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10  9:59 UTC (permalink / raw)
  To: Gao Xiang
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



10.04.2026 12:20, Gao Xiang wrote:
> 
> 
> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>
>>
> 
> ...
> 
>>>>
>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
> 
> btw, may I ask what's the erofs-utils version?
> erofs-utils 1.9?

We have 1.8.5 erofs-utils

> 
> I guess it's related to a relatively new experimental
> feature (-E48bit + encoded extents + zstd) introduced in v6.15.
> 
> If you don't use this new feature, the issue may not be
> reproduced anymore.
> 
> Thanks,
> Gao Xiang
> 
> 
>>>> BR2_TARGET_ROOTFS_EROFS_FRAGMENTS=y
>>>> BR2_TARGET_ROOTFS_EROFS_PCLUSTERSIZE=65536
>>>>
>>>>
>>>>
>>>> May be You know how to fix it or some ideas? Because we are new at erofs and need to discover and
>>>> learn its source code.
>>>>
>>>> Thanks
>>>
> 



* Re: erofs pointer corruption and kernel crash
  2026-04-10  9:59       ` Arseniy Krasnov
@ 2026-04-10 10:01         ` Gao Xiang
  2026-04-10 10:03           ` Arseniy Krasnov
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-10 10:01 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



On 2026/4/10 17:59, Arseniy Krasnov wrote:
> 
> 
> 10.04.2026 12:20, Gao Xiang wrote:
>>
>>
>> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>>
>>>
>>
>> ...
>>
>>>>>
>>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>
>> btw, may I ask what's the erofs-utils version?
>> erofs-utils 1.9?
> 
> We have 1.8.5 erofs-utils

1.8.5 shouldn't have `-E48bit` support, that is my question.

Thanks,
Gao Xiang


* Re: erofs pointer corruption and kernel crash
  2026-04-10 10:01         ` Gao Xiang
@ 2026-04-10 10:03           ` Arseniy Krasnov
  2026-04-10 10:06             ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10 10:03 UTC (permalink / raw)
  To: Gao Xiang
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



10.04.2026 13:01, Gao Xiang wrote:
> 
> 
> On 2026/4/10 17:59, Arseniy Krasnov wrote:
>>
>>
>> 10.04.2026 12:20, Gao Xiang wrote:
>>>
>>>
>>> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>>>
>>>>
>>>
>>> ...
>>>
>>>>>>
>>>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>>
>>> btw, may I ask what's the erofs-utils version?
>>> erofs-utils 1.9?
>>
>> We have 1.8.5 erofs-utils
> 
> 1.8.5 shouldn't have `-E48bit` support, that is my question.

You mean to try to reproduce with 1.9 utils ?

> 
> Thanks,
> Gao Xiang



* Re: erofs pointer corruption and kernel crash
  2026-04-10 10:03           ` Arseniy Krasnov
@ 2026-04-10 10:06             ` Gao Xiang
  2026-04-10 10:10               ` Arseniy Krasnov
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-10 10:06 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



On 2026/4/10 18:03, Arseniy Krasnov wrote:
> 
> 
> 10.04.2026 13:01, Gao Xiang wrote:
>>
>>
>> On 2026/4/10 17:59, Arseniy Krasnov wrote:
>>>
>>>
>>> 10.04.2026 12:20, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>>>>
>>>>>
>>>>
>>>> ...
>>>>
>>>>>>>
>>>>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>>>
>>>> btw, may I ask what's the erofs-utils version?
>>>> erofs-utils 1.9?
>>>
>>> We have 1.8.5 erofs-utils
>>
>> 1.8.5 shouldn't have `-E48bit` support, that is my question.
> 
> You mean to try to reproduce with 1.9 utils ?

Nope, I meant

BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"

is invalid in erofs-utils 1.8.5.

If you were using erofs-utils 1.9+, that may be due to
the new `-E48bit`;

but you said you are using erofs-utils 1.8.5, that makes
me feel confused.

Thanks,
Gao Xiang


* Re: erofs pointer corruption and kernel crash
  2026-04-10 10:06             ` Gao Xiang
@ 2026-04-10 10:10               ` Arseniy Krasnov
  2026-04-10 10:22                 ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10 10:10 UTC (permalink / raw)
  To: Gao Xiang
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



10.04.2026 13:06, Gao Xiang wrote:
> 
> 
> On 2026/4/10 18:03, Arseniy Krasnov wrote:
>>
>>
>> 10.04.2026 13:01, Gao Xiang wrote:
>>>
>>>
>>> On 2026/4/10 17:59, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> 10.04.2026 12:20, Gao Xiang wrote:
>>>>>
>>>>>
>>>>> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>>>>>
>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>>>>
>>>>>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>>>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>>>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>>>>
>>>>> btw, may I ask what's the erofs-utils version?
>>>>> erofs-utils 1.9?
>>>>
>>>> We have 1.8.5 erofs-utils
>>>
>>> 1.8.5 shouldn't have `-E48bit` support, that is my question.
>>
>> You mean to try to reproduce with 1.9 utils ?
> 
> Nope, I meant
> 
> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
> 
> is invalid in erofs-utils 1.8.5.
> 
> If you were using erofs-utils 1.9+, that may be due to
> the new `-E48bit`;
> 
> but you said you are using erofs-utils 1.8.5, that makes
> me feel confused.
> 

Ah, ok, so need to debug it at kernel side as we talked before.

Thanks

> Thanks,
> Gao Xiang



* Re: erofs pointer corruption and kernel crash
  2026-04-10 10:10               ` Arseniy Krasnov
@ 2026-04-10 10:22                 ` Gao Xiang
  2026-04-10 10:31                   ` Arseniy Krasnov
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-10 10:22 UTC (permalink / raw)
  To: Arseniy Krasnov; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel



On 2026/4/10 18:10, Arseniy Krasnov wrote:
> 
> 
> 10.04.2026 13:06, Gao Xiang wrote:
>>
>>
>> On 2026/4/10 18:03, Arseniy Krasnov wrote:
>>>
>>>
>>> 10.04.2026 13:01, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2026/4/10 17:59, Arseniy Krasnov wrote:
>>>>>
>>>>>
>>>>> 10.04.2026 12:20, Gao Xiang wrote:
>>>>>>
>>>>>>
>>>>>> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>>>
>>>>>>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>>>>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>>>>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>>>>>
>>>>>> btw, may I ask what's the erofs-utils version?
>>>>>> erofs-utils 1.9?
>>>>>
>>>>> We have 1.8.5 erofs-utils
>>>>
>>>> 1.8.5 shouldn't have `-E48bit` support, that is my question.
>>>
>>> You mean to try to reproduce with 1.9 utils ?
>>
>> Nope, I meant
>>
>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>
>> is invalid in erofs-utils 1.8.5.
>>
>> If you were using erofs-utils 1.9+, that may be due to
>> the new `-E48bit`;
>>
>> but you said you are using erofs-utils 1.8.5, that makes
>> me feel confused.
>>
> 
> Ah, ok, so need to debug it at kernel side as we talked before.

Sigh, you don't answer my questions:

1. so are you using `-E48bit` and a newer erofs-utils version?

2. Is "EXPERIMENTAL 48-bit layout support in use. Use at your
    own risk!" shown when mounting?

Thanks,
Gao Xiang


* Re: erofs pointer corruption and kernel crash
  2026-04-10 10:22                 ` Gao Xiang
@ 2026-04-10 10:31                   ` Arseniy Krasnov
  0 siblings, 0 replies; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10 10:31 UTC (permalink / raw)
  To: Gao Xiang; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel



10.04.2026 13:22, Gao Xiang wrote:
> 
> 
> On 2026/4/10 18:10, Arseniy Krasnov wrote:
>>
>>
>> 10.04.2026 13:06, Gao Xiang wrote:
>>>
>>>
>>> On 2026/4/10 18:03, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> 10.04.2026 13:01, Gao Xiang wrote:
>>>>>
>>>>>
>>>>> On 2026/4/10 17:59, Arseniy Krasnov wrote:
>>>>>>
>>>>>>
>>>>>> 10.04.2026 12:20, Gao Xiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2026/4/10 16:55, Arseniy Krasnov wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BR2_TARGET_ROOTFS_EROFS=y
>>>>>>>>>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>>>>>>>>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>>>>>>
>>>>>>> btw, may I ask what's the erofs-utils version?
>>>>>>> erofs-utils 1.9?
>>>>>>
>>>>>> We have 1.8.5 erofs-utils
>>>>>
>>>>> 1.8.5 shouldn't have `-E48bit` support, that is my question.
>>>>
>>>> You mean to try to reproduce with 1.9 utils ?
>>>
>>> Nope, I meant
>>>
>>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>>>
>>> is invalid in erofs-utils 1.8.5.
>>>
>>> If you were using erofs-utils 1.9+, that may be due to
>>> the new `-E48bit`;
>>>
>>> but you said you are using erofs-utils 1.8.5, that makes
>>> me feel confused.
>>>
>>
>> Ah, ok, so need to debug it at kernel side as we talked before.
> 
> Sigh, you don't answer my questions:
> 
> 1. so are you using `-E48bit` and a newer erofs-utils version?
> 

Here is the exact version that is built in buildroot:

> bin/mkfs.erofs
<E> erofs: missing argument: FILE
mkfs.erofs 1.8.10
Try 'bin/mkfs.erofs --help' for more information.

> 2. Is "EXPERIMENTAL 48-bit layout support in use. Use at your
>    own risk!" shown when mounting?

Yes


> dmesg | grep EXP
[    3.003188][    T1] erofs (device dm-0): EXPERIMENTAL 48-bit layout support in use. Use at your own risk!

Thanks

> 
> Thanks,
> Gao Xiang



* Re: erofs pointer corruption and kernel crash
  2026-04-10  8:31 ` Gao Xiang
  2026-04-10  8:42   ` Gao Xiang
  2026-04-10  8:55   ` Arseniy Krasnov
@ 2026-04-10 11:37   ` Arseniy Krasnov
  2026-04-10 12:20     ` Gao Xiang
  2 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10 11:37 UTC (permalink / raw)
  To: Gao Xiang
  Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang, Chao Yu,
	Yue Hu, Jeffle Xu, Sandeep Dhavale, Hongbo Li, Sheng Yong



10.04.2026 11:31, Gao Xiang wrote:
> Hi,
> 
> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>> Hi,
>>
>> We found unexpected behaviour of erofs:
>>
>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>> 'struct folio' as first argument, and there is loop inside this function,
>> which updates 'private' field of provided folio:
>>
>>    do {
>>            orig = atomic_read((atomic_t *)&folio->private);
>>            DBG_BUGON(orig <= 0);
>>            v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>            v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>    } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>
>> Now, we see that in some rare case, this function processes folio, where
>> 'private' is pointer, and thus this loop will update some bits in this
>> pointer. Then later kernel dereferences such pointer and crashes.
>>
>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>
>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>> index 33cb0a7330d2..b1d8deffec4d 100644
>> --- a/fs/erofs/data.c
>> +++ b/fs/erofs/data.c
>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>   {
>>       int orig, v;
>>   +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
> 
> No, if erofs_onlinefolio_end() is called, `folio->private`
> shouldn't be a pointer, it's just a counter inside, and
> storing a pointer is unexpected.
> 
> And since the folio is locked, it shouldn't call into
> try_to_free_buffers().
> 
> Is it easy to reproduce? if yes, can you print other
> values like `folio->mapping` and `folio->index` as
> well?
> 
> I need more informations to find some clues.



So I reproduced it again with the debug patch below, which adds a magic value to 'struct z_erofs_pcluster' and prints 'struct folio'
when a pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short: 'private' points to a 'struct z_erofs_pcluster'.

Patch itself:


diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 33cb0a7330d2..6eb975facce1 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -234,10 +234,132 @@ void erofs_onlinefolio_split(struct folio *folio)
 	atomic_inc((atomic_t *)&folio->private);
 }
 
+static void dump_folio_address_space(const struct address_space *mapping)
+{
+	pr_emerg("[foliodbg] %s struct address_space (%p):\ni_ino=0x%lx\ns_dev=0x%x\n"
+		"i_rdev=0x%x\ngfp_mask=%lx\ni_mmap_writable=%d\nnrpages=%lu\n"
+		"writeback_index=%lu flags=0x%lx\nwb_err=%x\ni_private_data=%px\n"
+		"\n",
+		__func__, mapping,
+		(mapping && mapping->host) ? mapping->host->i_ino : 0,
+		(mapping && mapping->host && mapping->host->i_sb) ? mapping->host->i_sb->s_dev : 0,
+		(mapping && mapping->host) ? mapping->host->i_rdev : 0,
+		mapping ? (unsigned long)mapping->gfp_mask : 0,
+		mapping ? atomic_read(&mapping->i_mmap_writable) : 0,
+		mapping ? mapping->nrpages : 0,
+		mapping ? mapping->writeback_index : 0,
+		mapping ? mapping->flags : 0,
+		mapping ? mapping->wb_err : 0,
+		mapping ? mapping->i_private_data : NULL
+	);
+}
+
+static void dump_folio_page(const struct page *page)
+{
+	pr_emerg("[foliodbg] %s struct page (%p):\nflags=0x%lx\nindex=%lu\n"
+		"privat=0x%lx\npage_type(mapcount)=0x%x\n"
+#ifdef CONFIG_MEMCG
+		"memcg_data=0x%lx\n"
+#elif defined(CONFIG_SLAB_OBJ_EXT)
+		"unused_slab_obj_exts=0x%lx\n"
+#endif
+#if defined(WANT_PAGE_VIRTUAL)
+		"virtual=%p\n"
+#endif
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+		"_last_cpupid=%d\n"
+#endif
+		"refcount=%d\n"
+		"\n",
+		__func__, page, page->flags, page->index, page->private,
+		page->page_type,
+#ifdef CONFIG_MEMCG
+		page->memcg_data,
+#elif defined(CONFIG_SLAB_OBJ_EXT)
+		page->_unused_slab_obj_exts,
+#endif
+#if defined(WANT_PAGE_VIRTUAL)
+		page->virtual,
+#endif
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+		page->_last_cpupid,
+#endif
+		atomic_read(&page->_refcount)
+	);
+}
+
+static void dump_folio(const struct folio *folio)
+{
+	pr_emerg("[foliodbg] %s struct folio (%p):\nflags=0x%lx\nindex=%lu\n"
+		"private=%px\nmapcount=%d\nrefcount=%d\n"
+#ifdef CONFIG_MEMCG
+		"memcg_data=0x%lx\n"
+#elif defined(CONFIG_SLAB_OBJ_EXT)
+		"unused_slab_obj_exts=0x%lx\n"
+#endif
+#if defined(WANT_PAGE_VIRTUAL)
+		"virtual=%p\n"
+#endif
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+		"_last_cpupid=%d\n"
+#endif
+		"flags_1=0x%lx\nhead_1=0x%lx\nlarge_mapcount=%d\nnr_pages_mapped=%d\n"
+#ifdef CONFIG_64BIT
+		"entire_mapcount=%d\npincount=%d\n"
+#endif /* CONFIG_64BIT */
+		"mm_id_mapcount=[%d, %d]\nmapcount_1=%d\nrefcount_1=%d\n"
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+		"nr_pages=%d\n"
+#endif /* NR_PAGES_IN_LARGE_FOLIO */
+		"flags_2=0x%lx\nhead_2=0x%lx\n"
+		"flags_3=0x%lx\nhead_3=0x%lx\n"
+		"\n",
+		__func__, folio,
+		folio->flags, folio->index, folio->private,
+		atomic_read(&folio->_mapcount), atomic_read(&folio->_refcount),
+#ifdef CONFIG_MEMCG
+		folio->memcg_data,
+#elif defined(CONFIG_SLAB_OBJ_EXT)
+		folio->_unused_slab_obj_exts,
+#endif
+#if defined(WANT_PAGE_VIRTUAL)
+		folio->virtual,
+#endif
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+		folio->_last_cpupid,
+#endif
+		folio->_flags_1, folio->_head_1,
+		atomic_read(&folio->_large_mapcount),
+		atomic_read(&folio->_nr_pages_mapped),
+#ifdef CONFIG_64BIT
+		atomic_read(&folio->_entire_mapcount),
+		atomic_read(&folio->_pincount),
+#endif /* CONFIG_64BIT */
+		folio->_mm_id_mapcount[0], folio->_mm_id_mapcount[1],
+		atomic_read(&folio->_mapcount_1),
+		atomic_read(&folio->_refcount_1),
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+		folio->_nr_pages,
+#endif /* NR_PAGES_IN_LARGE_FOLIO */
+		folio->_flags_2, folio->_head_2,
+		folio->_flags_3, folio->_head_3
+	);
+
+	dump_folio_page(&folio->page);
+	dump_folio_address_space(folio->mapping);
+	print_hex_dump(KERN_EMERG, "folio private", DUMP_PREFIX_ADDRESS, 16, 1,
+		       folio->private, 32, true);
+}
 void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
 {
 	int orig, v;
 
+	if (((uintptr_t)folio->private) & 0xffff000000000000) {
+		pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE BEFORE %px\n", __func__, __LINE__, folio, folio->private);
+		dump_folio(folio);
+		dump_stack();
+	}
+
 	do {
 		orig = atomic_read((atomic_t *)&folio->private);
 		DBG_BUGON(orig <= 0);
@@ -245,6 +367,9 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
 		v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
 	} while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
 
+	if (((uintptr_t)folio->private) & 0xffff000000000000)
+		pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE SET %px\n", __func__, __LINE__, folio, folio->private);
+
 	if (v & (BIT(EROFS_ONLINEFOLIO_DIRTY) - 1))
 		return;
 	folio->private = 0;
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index d21ae4802c7f..e8f295e90b05 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -38,6 +38,7 @@ __Z_EROFS_BVSET(z_erofs_bvset_inline, Z_EROFS_INLINE_BVECS);
  * A: Field should be accessed / updated in atomic for parallelized code.
  */
 struct z_erofs_pcluster {
+	u64 magic;
 	struct mutex lock;
 	struct lockref lockref;
 
@@ -262,6 +263,8 @@ static struct z_erofs_pcluster *z_erofs_alloc_pcluster(unsigned int size)
 		pcl = kmem_cache_zalloc(pcs->slab, GFP_KERNEL);
 		if (!pcl)
 			return ERR_PTR(-ENOMEM);
+
+		pcl->magic = 0x435053464F52455AULL;
 		return pcl;
 	}
 	return ERR_PTR(-EINVAL);



Crash result (the tail of the log is slightly interleaved because the problem was caught a second time right after the first):



[  684.634126][  T919] [foliodbg] erofs_onlinefolio_end:358 EROFS FOLIO fffffdffc00420c0 PRIVATE BEFORE ffff000004442780
[  684.642007][  T919] [foliodbg] dump_folio struct folio (00000000cf314425):
[  684.642007][  T919] flags=0x1ff0000000004320
[  684.642007][  T919] index=3392
[  684.642007][  T919] private=ffff000004442780
[  684.642007][  T919] mapcount=-1
[  684.642007][  T919] refcount=3
[  684.642007][  T919] memcg_data=0x0
[  684.642007][  T919] flags_1=0x1ff0000000000000
[  684.642007][  T919] head_1=0x0
[  684.642007][  T919] large_mapcount=290
[  684.642007][  T919] nr_pages_mapped=-559087616
[  684.642007][  T919] entire_mapcount=0
[  684.642007][  T919] pincount=0
[  684.642007][  T919] mm_id_mapcount=[1102, 0]
[  684.642007][  T919] mapcount_1=-1
[  684.642007][  T919] refcount_1=1
[  684.642007][  T919] nr_pages=14937858
[  684.642007][  T919] flags_2=0x1ff0000000000000
[  684.642007][  T919] head_2=0x0
[  684.642007][  T919] flags_3=0x1ff0000000000000
[  684.642007][  T919] head_3=0x0
[  684.642007][  T919]
[  684.744491][  T919] [foliodbg] dump_folio_page struct page (00000000cf314425):
[  684.744491][  T919] flags=0x1ff0000000004320
[  684.744491][  T919] index=3392
[  684.744491][  T919] privat=0xffff000004442780
[  684.744491][  T919] page_type(mapcount)=0xffffffff
[  684.744491][  T919] memcg_data=0x0
[  684.744491][  T919] refcount=3
[  684.744491][  T919]
[  684.779714][  T919] [foliodbg] dump_folio_address_space struct address_space (0000000000000000):
[  684.779714][  T919] i_ino=0x0
[  684.779714][  T919] s_dev=0x0
[  684.779714][  T919] i_rdev=0x0
[  684.779714][  T919] gfp_mask=0
[  684.779714][  T919] i_mmap_writable=0
[  684.779714][  T919] nrpages=0
[  684.779714][  T919] writeback_index=0 flags=0x0
[  684.779714][  T919] wb_err=0
[  684.779714][  T919] i_private_data=0000000000000000
[  684.779714][  T919]
[  684.826573][  T919] folio private000000009ef9d99a: 5a 45 52 4f 46 53 50 43 00 00 00 00 00 00 00 00  ZEROFSPC........
[  684.831621][  T919] folio private000000007d6aa995: 00 00 00 00 00 00 00 00 98 27 44 04 00 00 ff ff  .........'D.....
[  684.842031][  T919] CPU: 0 UID: 0 PID: 919 Comm: kworker/0:14H Tainted: G           O        6.15.11-sdkernel #6 PREEMPT
[  684.842056][  T919] Tainted: [O]=OOT_MODULE
[  684.842076][  T919] Workqueue: kverityd verity_work
[  684.842098][  T919] Call trace:
[  684.842103][  T919]  show_stack+0x18/0x30 (C)
[  684.842123][  T919]  dump_stack_lvl+0x60/0x80
[  684.842138][  T919]  dump_stack+0x18/0x24
[  684.842156][  T919]  erofs_onlinefolio_end+0x264/0x2b0
[  684.842185][  T919]  z_erofs_decompress_queue+0x4c0/0x8e0
[  684.842211][  T919]  z_erofs_decompress_kickoff+0x88/0x150
[  684.842226][  T919]  z_erofs_endio+0x144/0x250
[  684.842246][  T919]  bio_endio+0x138/0x150
[  684.842266][  T919]  __dm_io_complete+0x1e0/0x2b0
[  684.842282][  T919]  clone_endio+0xd0/0x270
[  684.842302][  T919]  bio_endio+0x138/0x150
[  684.842322][  T919]  verity_finish_io+0x64/0xf0
[  684.842345][  T919]  verity_work+0x30/0x40
[  684.842355][  T919]  process_one_work+0x180/0x2e0
[  684.842375][  T919]  worker_thread+0x2c4/0x3f0
[  684.842394][  T919]  kthread+0x12c/0x210
[  684.842414][  T919]  ret_from_fork+0x10/0x20
[  684.842434][  T919]
[  684.842434][  T919] [foliodbg] erofs_onlinefolio_end:371 EROFS FOLIO fffffdffc00420c0 PRIVATE SET ffff00002444277f
[  684.958838][ T2486] ALSA: PCM: [Q] Lost interrupts?: (stream=1, delta=6576, new_hw_ptr=4943792, old_hw_ptr=4937216)
[  684.969204][ T2485] ALSA: PCM: [Q] Lost interrupts?: (stream=1, delta=2352, new_hw_ptr=1646016, old_hw_ptr=1643664)
[  685.015481][   T40] Unable to handle kernel paging request at virtual address ffff00002444277f
[  685.021334][   T40] Mem abort info:
[  685.021989][   T40]   ESR = 0x0000000096000006
[  685.026898][   T40]   EC = 0x25: DABT (current EL), IL = 32 bits
[  685.035918][   T40]   SET = 0, FnV = 0
[  685.035986][   T40]   EA = 0, S1PTW = 0
[  685.039855][   T40]   FSC = 0x06: level 2 translation fault
[  685.045343][   T40] Data abort info:
[  685.049827][   T40]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[  685.060195][   T40]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  685.060700][   T40]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  685.066643][   T40] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000001e36000
[  685.074075][   T40] [ffff00002444277f] pgd=1800000007fff403, p4d=1800000007fff403, pud=1800000007ffe403, pmd=0000000000000000
[  685.086103][   T40] Internal error: Oops: 0000000096000006 [#1]  SMP
[  685.091448][   T40] Modules linked in: vlsicomm(O)
[  685.092319][  T928]
[  685.092319][  T928] [foliodbg] erofs_onlinefolio_end:358 EROFS FOLIO fffffdffc00420c0 PRIVATE BEFORE ffff00002444277f
[  685.096180][   T40] CPU: 0 UID: 0 PID: 40 Comm: kswapd0 Tainted: G           O        6.15.11-sdkernel #6 PREEMPT
[  685.096194][   T40] Tainted: [O]=OOT_MODULE
[  685.096199][   T40] Hardware name: SberDevices SberBoom Mini (DT)
[  685.096205][   T40] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  685.096214][   T40] pc : drop_buffers.constprop.0+0x34/0x120
[  685.096233][   T40] lr : try_to_free_buffers+0xd0/0x100
[  685.096244][   T40] sp : ffff800081063780
[  685.096248][   T40] x29: ffff800081063780 x28: 0000000000000000 x27: fffffdffc00420c8
[  685.096265][   T40] x26: ffff8000810638a0 x25: ffff800081063868 x24: 0000000000000001
[  685.122584][  T928] [foliodbg] dump_folio struct folio (00000000cf314425):
[  685.122584][  T928] flags=0x1ff0000000004201
[  685.122584][  T928] index=3392
[  685.122584][  T928] private=ffff00002444277f
[  685.122584][  T928] mapcount=-1
[  685.122584][  T928] refcount=4
[  685.122584][  T928] memcg_data=0x0
[  685.122584][  T928] flags_1=0x1ff0000000000000
[  685.122584][  T928] head_1=0x0
[  685.122584][  T928] large_mapcount=290
[  685.122584][  T928] nr_pages_mapped=-559087616
[  685.122584][  T928] entire_mapcount=0
[  685.122584][  T928] pincount=0
[  685.122584][  T928] mm_id_mapcount=[1102, 0]
[  685.122584][  T928] mapcount_1=-1
[  685.122584][  T928] refcount_1=1
[  685.122584][  T928] nr_pages=14937858
[  685.122584][  T928] flags_2=0x1ff0000000000000
[  685.122584][  T928] head_2=0x0
[  685.122584][  T928] flags_3=0x1ff0000000000000
[  685.122584][  T928] head_3=0x0
[  685.122584][  T928]
[  685.123256][   T40]
[  685.123259][   T40] x23: fffffdffc00420c0 x22: ffff8000810637b0 x21: fffffdffc00420c0
[  685.123275][   T40] x20: ffff00002444277f x19: ffff00002444277f x18: 0000000000000000
[  685.123290][   T40] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  685.129416][  T928] [foliodbg] dump_folio_page struct page (00000000cf314425):
[  685.129416][  T928] flags=0x1ff0000000004201
[  685.129416][  T928] index=3392
[  685.129416][  T928] privat=0xffff00002444277f
[  685.129416][  T928] page_type(mapcount)=0xffffffff
[  685.129416][  T928] memcg_data=0x0
[  685.129416][  T928] refcount=4
[  685.129416][  T928]
[  685.136882][   T40] x14: ffff7fff87288000 x13: 00589f68e49c5358 x12: ffff800080d59b58
[  685.136897][   T40] x11: 0000000000000002
[  685.142500][  T928] [foliodbg] dump_folio_address_space struct address_space (0000000000000000):
[  685.142500][  T928] i_ino=0x0
[  685.142500][  T928] s_dev=0x0
[  685.142500][  T928] i_rdev=0x0
[  685.142500][  T928] gfp_mask=0
[  685.142500][  T928] i_mmap_writable=0
[  685.142500][  T928] nrpages=0
[  685.142500][  T928] writeback_index=0 flags=0x0
[  685.142500][  T928] wb_err=0
[  685.142500][  T928] i_private_data=0000000000000000
[  685.142500][  T928]
[  685.147659][   T40]  x10: ffff800081063770 x9 : 0000000000000001
[  685.151841][  T928] Unable to handle kernel paging request at virtual address ffff00002444277f
[  685.159394][   T40]
[  685.159399][   T40] x8 : ffff8000810637d0 x7 : 0000000000000000 x6 : 0000000000000000
[  685.167188][  T928] Mem abort info:
[  685.248399][   T40]
[  685.248406][   T40] x5 : 0000000000000000 x4 : fffffdffc00420c0 x3 : 1ff0000000004201
[  685.248421][   T40] x2 : 1ff0000000004201
[  685.252253][  T928]   ESR = 0x0000000096000006
[  685.258322][   T40]  x1 : ffff8000810637b0 x0 : fffffdffc00420c0
[  685.258335][   T40] Call trace:
[  685.258340][   T40]  drop_buffers.constprop.0+0x34/0x120 (P)
[  685.258359][   T40]  try_to_free_buffers+0xd0/0x100
[  685.266380][  T928]   EC = 0x25: DABT (current EL), IL = 32 bits
[  685.273848][   T40]  filemap_release_folio+0x94/0xc0
[  685.273865][   T40]  shrink_folio_list+0x8c8/0xc40
[  685.306455][  T928]   SET = 0, FnV = 0
[  685.313609][   T40]  shrink_lruvec+0x740/0xb80
[  685.313624][   T40]  shrink_node+0x2b8/0x9a0
[  685.318050][  T928]   EA = 0, S1PTW = 0
[  685.359062][   T40]  balance_pgdat+0x3b8/0x760
[  685.359077][   T40]  kswapd+0x220/0x3b0
[  685.359096][   T40]  kthread+0x12c/0x210
[  685.359108][   T40]  ret_from_fork+0x10/0x20
[  685.359127][   T40] Code: 14000004 f9400673 eb13029f 54000180 (f9400262)
[  685.359134][   T40] ---[ end trace 0000000000000000 ]---
[  685.372682][   T40] Kernel panic - not syncing: Oops: Fatal exception
[  685.372694][   T40] SMP: stopping secondary CPUs
[  685.373561][   T40] Kernel Offset: disabled
[  685.373565][   T40] CPU features: 0x0000,00000000,01000000,0200420b
[  685.373576][   T40] Memory Limit: none


Thanks


> 
> Thanks,
> Gao Xiang
> 
>> +        pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE BEFORE %px\n", __func__, __LINE__, folio, folio->private);
>> +        dump_stack();
>> +    }
>> +
>>       do {
>>           orig = atomic_read((atomic_t *)&folio->private);
>>           DBG_BUGON(orig <= 0);
>> @@ -245,6 +250,9 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>           v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>       } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>   +    if (((uintptr_t)folio->private) & 0xffff000000000000)
>> +        pr_emerg("\n[foliodbg] %s:%d EROFS FOLIO %px PRIVATE SET %px\n", __func__, __LINE__, folio, folio->private);
>> +
>>       if (v & (BIT(EROFS_ONLINEFOLIO_DIRTY) - 1))
>>           return;
>>       folio->private = 0;
>>
>>
>> And it gives result:
>>
>> [][  T639] [foliodbg] erofs_onlinefolio_end:242 EROFS FOLIO fffffdffc0030440 PRIVATE BEFORE ffff000002b32468
>> [][  T639] CPU: 0 UID: 0 PID: 639 Comm: kworker/0:6H Tainted: G O 6.15.11-sdkernel #1 PREEMPT
>> [][  T639] Tainted: [O]=OOT_MODULE
>> [][  T639] Workqueue: kverityd verity_work
>> [][  T639] Call trace:
>> [][  T639]  show_stack+0x18/0x30 (C)
>> [][  T639]  dump_stack_lvl+0x60/0x80
>> [][  T639]  dump_stack+0x18/0x24
>> [][  T639]  erofs_onlinefolio_end+0x124/0x130
>> [][  T639]  z_erofs_decompress_queue+0x4b0/0x8c0
>> [][  T639]  z_erofs_decompress_kickoff+0x88/0x150
>> [][  T639]  z_erofs_endio+0x144/0x250
>> [][  T639]  bio_endio+0x138/0x150
>> [][  T639]  __dm_io_complete+0x1e0/0x2b0
>> [][  T639]  clone_endio+0xd0/0x270
>> [][  T639]  bio_endio+0x138/0x150
>> [][  T639]  verity_finish_io+0x64/0xf0
>> [][  T639]  verity_work+0x30/0x40
>> [][  T639]  process_one_work+0x180/0x2e0
>> [][  T639]  worker_thread+0x2c4/0x3f0
>> [][  T639]  kthread+0x12c/0x210
>> [][  T639]  ret_from_fork+0x10/0x20
>> [][  T639]
>> [][  T639] [foliodbg] erofs_onlinefolio_end:254 EROFS FOLIO fffffdffc0030440 PRIVATE SET ffff000022b32467
>> [][   T39] Unable to handle kernel paging request at virtual address ffff000022b32467
>> [][   T39] Mem abort info:
>> [][   T39]   ESR = 0x0000000096000006
>> [][   T39]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [][   T39]   SET = 0, FnV = 0
>> [][   T39]   EA = 0, S1PTW = 0
>> [][   T39]   FSC = 0x06: level 2 translation fault
>> [][   T39] Data abort info:
>> [][   T39]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
>> [][   T39]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>> [][   T39]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> [][   T39] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000001e36000
>> [][   T39] [ffff000022b32467] pgd=1800000007fff403, p4d=1800000007fff403, pud=1800000007ffe403, pmd=0000000000000000
>> [][   T39] Internal error: Oops: 0000000096000006 [#1]  SMP
>> [][   T39] Modules linked in: vlsicomm(O)
>> [][   T39] CPU: 1 UID: 0 PID: 39 Comm: kswapd0 Tainted: G O 6.15.11-sdkernel #1 PREEMPT
>> [][   T39] Tainted: [O]=OOT_MODULE
>> [][   T39] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [][   T39] pc : drop_buffers.constprop.0+0x34/0x120
>> [][   T39] lr : try_to_free_buffers+0xd0/0x100
>> [][   T39] sp : ffff80008105b780
>> [][   T39] x29: ffff80008105b780 x28: 0000000000000000 x27: fffffdffc0030448
>> [][   T39] x26: ffff80008105b8a0 x25: ffff80008105b868 x24: 0000000000000001
>> [][   T39] x23: fffffdffc0030440 x22: ffff80008105b7b0 x21: fffffdffc0030440
>> [][   T39] x20: ffff000022b32467 x19: ffff000022b32467 x18: 0000000000000000
>> [][   T39] x17: 0000000000000000 x16: 0000000000000000 x15: 00000000d69f4cc0
>> [][   T39] x14: ffff0000000c5dc0 x13: 0000000000000000 x12: ffff800080d59b58
>> [][   T39] x11: 00000000000000c0 x10: 0000000000000000 x9 : 0000000000000000
>> [][   T39] x8 : ffff80008105b7d0 x7 : 0000000000000000 x6 : 000000000000003f
>> [][   T39] x5 : 0000000000000000 x4 : fffffdffc0030440 x3 : 1ff0000000004001
>> [][   T39] x2 : 1ff0000000004001 x1 : ffff80008105b7b0 x0 : fffffdffc0030440
>> [][   T39] Call trace:
>> [][   T39]  drop_buffers.constprop.0+0x34/0x120 (P)
>> [][   T39]  try_to_free_buffers+0xd0/0x100
>> [][   T39]  filemap_release_folio+0x94/0xc0
>> [][   T39]  shrink_folio_list+0x8c8/0xc40
>> [][   T39]  shrink_lruvec+0x740/0xb80
>> [][   T39]  shrink_node+0x2b8/0x9a0
>> [][   T39]  balance_pgdat+0x3b8/0x760
>> [][   T39]  kswapd+0x220/0x3b0
>> [][   T39]  kthread+0x12c/0x210
>> [][   T39]  ret_from_fork+0x10/0x20
>> [][   T39] Code: 14000004 f9400673 eb13029f 54000180 (f9400262)
>> [][   T39] ---[ end trace 0000000000000000 ]---
>> [][   T39] Kernel panic - not syncing: Oops: Fatal exception
>> [][   T39] SMP: stopping secondary CPUs
>> [][   T39] Kernel Offset: disabled
>> [][   T39] CPU features: 0x0000,00000000,01000000,0200420b
>> [][   T39] Memory Limit: none
>> [][   T39] Rebooting in 5 seconds..
>>
>> So 'erofs_onlinefolio_end()' takes some folio with 'private' field contains
>> some pointer (0xffff000002b32468), "corrupts" this pointer (result will be
>> 0xffff000022b32467 - at least we see that 0x20000000 was ORed to original
>> pointer and this is (1 << EROFS_ONLINEFOLIO_DIRTY)), and then kernel crashes.
>> We guess it is not valid case when such folio is passed as argument to
>> 'erofs_onlinefolio_end()'.
>>
>> We have the following erofs configuration in buildroot:
>>
>> BR2_TARGET_ROOTFS_EROFS=y
>> BR2_TARGET_ROOTFS_EROFS_CUSTOM_COMPRESSION=y
>> BR2_TARGET_ROOTFS_EROFS_COMPRESSION_ALGORITHMS="zstd,22 --max-extent-bytes 65536 -E48bit"
>> BR2_TARGET_ROOTFS_EROFS_FRAGMENTS=y
>> BR2_TARGET_ROOTFS_EROFS_PCLUSTERSIZE=65536
>>
>>
>>
>> May be You know how to fix it or some ideas? Because we are new at erofs and need to discover and
>> learn its source code.
>>
>> Thanks
> 


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: erofs pointer corruption and kernel crash
  2026-04-10 11:37   ` Arseniy Krasnov
@ 2026-04-10 12:20     ` Gao Xiang
  2026-04-10 13:27       ` Arseniy Krasnov
  2026-04-10 13:35       ` Arseniy Krasnov
  0 siblings, 2 replies; 22+ messages in thread
From: Gao Xiang @ 2026-04-10 12:20 UTC (permalink / raw)
  To: Arseniy Krasnov; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang



On 2026/4/10 19:37, Arseniy Krasnov wrote:

(dropping unrelated folks since they are all subscribed to the erofs mailing list)

> 
> 
> 10.04.2026 11:31, Gao Xiang wrote:
>> Hi,
>>
>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>> Hi,
>>>
>>> We found unexpected behaviour of erofs:
>>>
>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>> 'struct folio' as first argument, and there is loop inside this function,
>>> which updates 'private' field of provided folio:
>>>
>>>     do {
>>>             orig = atomic_read((atomic_t *)&folio->private);
>>>             DBG_BUGON(orig <= 0);
>>>             v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>             v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>     } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>
>>> Now, we see that in some rare case, this function processes folio, where
>>> 'private' is pointer, and thus this loop will update some bits in this
>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>
>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>
>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>> --- a/fs/erofs/data.c
>>> +++ b/fs/erofs/data.c
>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>    {
>>>        int orig, v;
>>>    +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>
>> No, if erofs_onlinefolio_end() is called, `folio->private`
>> shouldn't be a pointer, it's just a counter inside, and
>> storing a pointer is unexpected.
>>
>> And since the folio is locked, it shouldn't call into
>> try_to_free_buffers().
>>
>> Is it easy to reproduce? if yes, can you print other
>> values like `folio->mapping` and `folio->index` as
>> well?
>>
>> I need more informations to find some clues.
> 
> 
> 
> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
First, erofs-utils 1.8.10 doesn't support `-E48bit`:
only erofs-utils 1.9+ ships it as an experimental
feature (see the ChangeLog), so I think you're using
a modified erofs-utils 1.8.10:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog

```
erofs-utils 1.9

  * This release includes the following updates:
    - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
```

Second, I'm pretty sure this issue is related to the
experimental `-E48bit`, and this information is
not enough for me to find the root cause, so I
need to find a way to reproduce it myself. It may
take time; you could debug it yourself, but I don't
think it's an easy task if you aren't familiar
with the EROFS codebase.

Anyway, if you need a quick solution for production,
I really suggest you don't use `-E48bit + zstd` like
this for now: try other options like
`-zzstd -C65536 -Efragments` instead, since those
are common production choices.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: erofs pointer corruption and kernel crash
  2026-04-10 12:20     ` Gao Xiang
@ 2026-04-10 13:27       ` Arseniy Krasnov
  2026-04-10 15:41         ` Gao Xiang
  2026-04-10 13:35       ` Arseniy Krasnov
  1 sibling, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10 13:27 UTC (permalink / raw)
  To: Gao Xiang; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang



10.04.2026 15:20, Gao Xiang wrote:
> 
> 
> On 2026/4/10 19:37, Arseniy Krasnov wrote:
> 
> (dropping unrelated folks since they are all subscribed to the erofs mailing list)
> 
>>
>>
>> 10.04.2026 11:31, Gao Xiang wrote:
>>> Hi,
>>>
>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>>> Hi,
>>>>
>>>> We found unexpected behaviour of erofs:
>>>>
>>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>>> 'struct folio' as first argument, and there is loop inside this function,
>>>> which updates 'private' field of provided folio:
>>>>
>>>>     do {
>>>>             orig = atomic_read((atomic_t *)&folio->private);
>>>>             DBG_BUGON(orig <= 0);
>>>>             v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>>             v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>>     } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>>
>>>> Now, we see that in some rare case, this function processes folio, where
>>>> 'private' is pointer, and thus this loop will update some bits in this
>>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>>
>>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>>
>>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>>> --- a/fs/erofs/data.c
>>>> +++ b/fs/erofs/data.c
>>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>>    {
>>>>        int orig, v;
>>>>    +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>>
>>> No, if erofs_onlinefolio_end() is called, `folio->private`
>>> shouldn't be a pointer, it's just a counter inside, and
>>> storing a pointer is unexpected.
>>>
>>> And since the folio is locked, it shouldn't call into
>>> try_to_free_buffers().
>>>
>>> Is it easy to reproduce? if yes, can you print other
>>> values like `folio->mapping` and `folio->index` as
>>> well?
>>>
>>> I need more informations to find some clues.
>>
>>
>>
>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
> only erofs-utils 1.9+ ships it as an experimental
> feature (see the ChangeLog), so I think you're using
> a modified erofs-utils 1.8.10:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
> 
> ```
> erofs-utils 1.9
> 
>  * This release includes the following updates:
>    - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
> ```
> 
> Second, I'm pretty sure this issue is related to the
> experimental `-E48bit`, and this information is
> not enough for me to find the root cause, so I
> need to find a way to reproduce it myself. It may
> take time; you could debug it yourself, but I don't
> think it's an easy task if you aren't familiar
> with the EROFS codebase.
> 
> Anyway, if you need a quick solution for production,
> I really suggest you don't use `-E48bit + zstd` like
> this for now: try other options like
> `-zzstd -C65536 -Efragments` instead, since those
> are common production choices.

OK, thanks for the advice! One more question: currently we use these options:
"zstd,22 --max-extent-bytes 65536 -E48bit". OK, we will remove "zstd,22" and "-E48bit",
but what about "--max-extent-bytes 65536": is it considered a stable option?
Or is it better to use your version: "-zzstd -C65536 -Efragments"?

Thanks

> 
> Thanks,
> Gao Xiang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: erofs pointer corruption and kernel crash
  2026-04-10 12:20     ` Gao Xiang
  2026-04-10 13:27       ` Arseniy Krasnov
@ 2026-04-10 13:35       ` Arseniy Krasnov
  1 sibling, 0 replies; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-10 13:35 UTC (permalink / raw)
  To: Gao Xiang; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang



10.04.2026 15:20, Gao Xiang wrote:
> 
> 
> On 2026/4/10 19:37, Arseniy Krasnov wrote:
> 
> (dropping unrelated folks since they are all subscribed to the erofs mailing list)
> 
>>
>>
>> 10.04.2026 11:31, Gao Xiang wrote:
>>> Hi,
>>>
>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>>> Hi,
>>>>
>>>> We found unexpected behaviour of erofs:
>>>>
>>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>>> 'struct folio' as first argument, and there is loop inside this function,
>>>> which updates 'private' field of provided folio:
>>>>
>>>>     do {
>>>>             orig = atomic_read((atomic_t *)&folio->private);
>>>>             DBG_BUGON(orig <= 0);
>>>>             v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>>             v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>>     } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>>
>>>> Now, we see that in some rare case, this function processes folio, where
>>>> 'private' is pointer, and thus this loop will update some bits in this
>>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>>
>>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>>
>>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>>> --- a/fs/erofs/data.c
>>>> +++ b/fs/erofs/data.c
>>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>>    {
>>>>        int orig, v;
>>>>    +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>>
>>> No, if erofs_onlinefolio_end() is called, `folio->private`
>>> shouldn't be a pointer, it's just a counter inside, and
>>> storing a pointer is unexpected.
>>>
>>> And since the folio is locked, it shouldn't call into
>>> try_to_free_buffers().
>>>
>>> Is it easy to reproduce? if yes, can you print other
>>> values like `folio->mapping` and `folio->index` as
>>> well?
>>>
>>> I need more informations to find some clues.
>>
>>
>>
>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
> only erofs-utils 1.9+ ships it as an experimental
> feature (see the ChangeLog), so I think you're using
> a modified erofs-utils 1.8.10:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
> 
> ```
> erofs-utils 1.9
> 
>  * This release includes the following updates:
>    - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
> ```
> 
> Second, I'm pretty sure this issue is related to the
> experimental `-E48bit`, and this information is
> not enough for me to find the root cause, so I
> need to find a way to reproduce it myself. It may
> take time; you could debug it yourself, but I don't
> think it's an easy task if you aren't familiar
> with the EROFS codebase.

Also, here is some more information just caught with CONFIG_EROFS_FS_DEBUG enabled. It is the same
problem, but the enabled debug logic BUGed the kernel earlier. It may be useful for you.

Thanks


[  368.587000][  T608] ------------[ cut here ]------------
[  368.587079][  T608] kernel BUG at fs/erofs/zdata.c:1606!
[  368.591977][  T608] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
[  368.593622][ T1214] ------------[ cut here ]------------
[  368.598779][  T608] Modules linked in: vlsicomm(O)
[  368.604040][ T1214] kernel BUG at fs/erofs/zdata.c:1606!
[  368.608787][  T608] CPU: 1 UID: 0 PID: 608 Comm: kworker/1:3H Tainted: G           O        6.15.11-sdkernel #1 PREEMPT
[  368.624876][  T608] Tainted: [O]=OOT_MODULE
[  368.635015][  T608] Workqueue: kverityd verity_work
[  368.639844][  T608] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  368.647428][  T608] pc : z_erofs_endio+0x220/0x270
[  368.652172][  T608] lr : z_erofs_endio+0x23c/0x270
[  368.656920][  T608] sp : ffff80008215bbe0
[  368.660887][  T608] x29: ffff80008215bbe0 x28: ffff0000032feb40 x27: ffff0000032feb80
[  368.668646][  T608] x26: fffffdffc0029280 x25: 0000000000000009 x24: ffff000000be17e0
[  368.676408][  T608] x23: ffff000007e85c00 x22: 0000000000001000 x21: 0000000000001000
[  368.684170][  T608] x20: 0000000000000000 x19: 0000000000001000 x18: 00000000e6fb12fd
[  368.691933][  T608] x17: 00000000c98c11f0 x16: 00000000ac7e39e2 x15: 00000000c3362985
[  368.699696][  T608] x14: 0000000001040820 x13: 00000000a3bddb58 x12: ffff80008215bb68
[  368.707458][  T608] x11: 0000000049a63821 x10: ffff8000809febe0 x9 : 0000000000000000
[  368.715221][  T608] x8 : ffff000003cee8e8 x7 : 0000000000000000 x6 : 459ea227f0118cc9
[  368.722983][  T608] x5 : 0000000000000000 x4 : 1ff0000000004021 x3 : 0000000000000000
[  368.730746][  T608] x2 : 0000000000000000 x1 : ffff0000029f3e00 x0 : fffffdffc0029240
[  368.738513][  T608] Call trace:
[  368.741619][  T608]  z_erofs_endio+0x220/0x270 (P)
[  368.746362][  T608]  bio_endio+0x138/0x150
[  368.750411][  T608]  __dm_io_complete+0x1e0/0x2b0
[  368.755068][  T608]  clone_endio+0xd0/0x270
[  368.759213][  T608]  bio_endio+0x138/0x150
[  368.763262][  T608]  verity_finish_io+0x64/0xf0
[  368.767747][  T608]  verity_work+0x30/0x40
[  368.771800][  T608]  process_one_work+0x180/0x2e0
[  368.776463][  T608]  worker_thread+0x2c4/0x3f0
[  368.780862][  T608]  kthread+0x12c/0x210
[  368.784742][  T608]  ret_from_fork+0x10/0x20
[  368.788979][  T608] Code: 17ffffc8 f9401401 b100103f 54fff5a0 (d4210000)
[  368.795698][  T608] ---[ end trace 0000000000000000 ]---
[  368.813672][  T608] Kernel panic - not syncing: Oops - BUG: Fatal exception
[  368.815015][  T608] SMP: stopping secondary CPUs
[  369.896670][  T608] SMP: failed to stop secondary CPUs 0
[  369.896729][  T608] Kernel Offset: disabled
[  369.900508][  T608] CPU features: 0x0000,00000000,01000000,0200420b
[  369.906718][  T608] Memory Limit: none
[  369.922397][  T608] Rebooting in 5 seconds..



> 
> Anyway, if you need a quick solution for production,
> I really suggest you don't use `-E48bit + zstd` like
> this for now: try other options like
> `-zzstd -C65536 -Efragments` instead, since those
> are common production choices.
> 
> Thanks,
> Gao Xiang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: erofs pointer corruption and kernel crash
  2026-04-10 13:27       ` Arseniy Krasnov
@ 2026-04-10 15:41         ` Gao Xiang
  2026-04-11 15:10           ` Arseniy Krasnov
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-10 15:41 UTC (permalink / raw)
  To: Arseniy Krasnov; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang

Hi Arseniy,

On 2026/4/10 21:27, Arseniy Krasnov wrote:
> 
> 
> 10.04.2026 15:20, Gao Xiang wrote:
>>
>>
>> On 2026/4/10 19:37, Arseniy Krasnov wrote:
>>
>> (dropping unrelated folks since they are all subscribed to the erofs mailing list)
>>
>>>
>>>
>>> 10.04.2026 11:31, Gao Xiang wrote:
>>>> Hi,
>>>>
>>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>>>> Hi,
>>>>>
>>>>> We found unexpected behaviour of erofs:
>>>>>
>>>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>>>> 'struct folio' as first argument, and there is loop inside this function,
>>>>> which updates 'private' field of provided folio:
>>>>>
>>>>>      do {
>>>>>              orig = atomic_read((atomic_t *)&folio->private);
>>>>>              DBG_BUGON(orig <= 0);
>>>>>              v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>>>              v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>>>      } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>>>
>>>>> Now, we see that in some rare case, this function processes folio, where
>>>>> 'private' is pointer, and thus this loop will update some bits in this
>>>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>>>
>>>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>>>
>>>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>>>> --- a/fs/erofs/data.c
>>>>> +++ b/fs/erofs/data.c
>>>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>>>     {
>>>>>         int orig, v;
>>>>>     +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>>>
>>>> No, if erofs_onlinefolio_end() is called, `folio->private`
>>>> shouldn't be a pointer, it's just a counter inside, and
>>>> storing a pointer is unexpected.
>>>>
>>>> And since the folio is locked, it shouldn't call into
>>>> try_to_free_buffers().
>>>>
>>>> Is it easy to reproduce? if yes, can you print other
>>>> values like `folio->mapping` and `folio->index` as
>>>> well?
>>>>
>>>> I need more informations to find some clues.
>>>
>>>
>>>
>>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
>> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
>> only erofs-utils 1.9+ ship it as an experimental
>> feature, see Changelog; so I think you're using
>> modified erofs-utils 1.8.10:
>> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
>>
>> ```
>> erofs-utils 1.9
>>
>>   * This release includes the following updates:
>>     - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
>> ```
>>
>> Second, I'm pretty sure this issue is related to
>> experimenal `-E48bit`, and those information is
>> not enough for me to find the root cause, so I
>> need to find a way to reproduce myself: It may
>> take time; you could debug yourself but I don't
>> think it's an easy task if you don't quite familiar
>> with the EROFS codebase.
>>
>> Anyway I really suggest if you need a rush solution
>> for production, don't use `-E48bit + zstd` like
>> this for now: try to use other options like
>> `-zzstd -C65536 -Efragments` instead since those
>> are common production choices.
> 
> Ok thanks for this advice! One more question: currently we use this options:
> "zstd,22 --max-extent-bytes 65536 -E48bit". Ok we remove "zstd,22" and "E48bit",
> but what about "--max-extent-bytes 65536" - is it considered stable option?
> Or it is better to use your version: "-zzstd -C65536 -Efragments" ?

I'm not sure how you find this
"zstd,22 --max-extent-bytes 65536 -E48bit" combination.

My suggestion based on production is that as long as
you don't use `-zzstd` ++ `-E48bit`, it should be fine.

If you need smaller images, I suggest: `-zlzma,9 -C65536 -Efragments`
Or like Android, they all use `-zlz4hc`,
Or zstd, but don't add `-E48bit`.

As for "--max-extent-bytes 65536", it can be dropped
since if `-E48bit` is not used, it only has negative
impacts.

In short, `-E48bit` + `-zzstd` + `--max-extent-bytes`
enables new unaligned compression for zstd, but it's
a relatively new feature. I still need some time to
stabilize it, but my own time is limited and everything
has to be prioritized.
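
As an illustration of the option advice above, here is a minimal sketch
of a wrapper that refuses the zstd + 48bit combination. It only prints
the command line instead of running it. `check_erofs_opts` is a
hypothetical helper (not part of erofs-utils), the image and directory
names are placeholders, and the matching is deliberately simplistic: it
assumes options are written in the attached form used in this thread
(`-zzstd`, `-E48bit`), so `-z zstd` or `-Efragments,48bit` would not be
caught.

```shell
#!/bin/sh
# Sketch only: prints a mkfs.erofs command line instead of running it.
# check_erofs_opts is a hypothetical helper, not part of erofs-utils.
check_erofs_opts() {
    has_zstd=0
    has_48bit=0
    for opt in "$@"; do
        case "$opt" in
            -zzstd*)  has_zstd=1 ;;   # e.g. -zzstd or -zzstd,22
            -E48bit*) has_48bit=1 ;;  # experimental 48-bit layout
        esac
    done
    if [ "$has_zstd" -eq 1 ] && [ "$has_48bit" -eq 1 ]; then
        echo "refusing: -zzstd together with -E48bit is still experimental" >&2
        return 1
    fi
    # Image name and source directory are placeholders.
    echo "mkfs.erofs $* rootfs.erofs rootfs/"
}

# Combinations suggested in this thread pass the check:
check_erofs_opts -zlzma,9 -C65536 -Efragments
check_erofs_opts -zzstd -C65536 -Efragments
# The combination that triggered the crash is rejected:
check_erofs_opts -zzstd,22 -E48bit --max-extent-bytes 65536 || echo "rejected"
```

A check like this could run in an image build script before the real
mkfs.erofs invocation, until the zstd + 48bit path is stabilized.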

Thanks,
Gao Xiang

> 
> Thanks
> 
>>
>> Thanks,
>> Gao Xiang



* Re: erofs pointer corruption and kernel crash
  2026-04-10 15:41         ` Gao Xiang
@ 2026-04-11 15:10           ` Arseniy Krasnov
  2026-04-13  7:08             ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-11 15:10 UTC (permalink / raw)
  To: Gao Xiang; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang



10.04.2026 18:41, Gao Xiang wrote:
> Hi Arseniy,
> 
> On 2026/4/10 21:27, Arseniy Krasnov wrote:
>>
>>
>> 10.04.2026 15:20, Gao Xiang wrote:
>>>
>>>
>>> On 2026/4/10 19:37, Arseniy Krasnov wrote:
>>>
>>> (drop unrelated folks since they all subscribed erofs mailing list)
>>>
>>>>
>>>>
>>>> 10.04.2026 11:31, Gao Xiang wrote:
>>>>> Hi,
>>>>>
>>>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We found unexpected behaviour of erofs:
>>>>>>
>>>>>> There is function in erofs - 'erofs_onlinefolio_end()'. It has pointer to
>>>>>> 'struct folio' as first argument, and there is loop inside this function,
>>>>>> which updates 'private' field of provided folio:
>>>>>>
>>>>>>      do {
>>>>>>              orig = atomic_read((atomic_t *)&folio->private);
>>>>>>              DBG_BUGON(orig <= 0);
>>>>>>              v = dirty << EROFS_ONLINEFOLIO_DIRTY;
>>>>>>              v |= (orig - 1) | (!!err << EROFS_ONLINEFOLIO_EIO);
>>>>>>      } while (atomic_cmpxchg((atomic_t *)&folio->private, orig, v) != orig);
>>>>>>
>>>>>> Now, we see that in some rare case, this function processes folio, where
>>>>>> 'private' is pointer, and thus this loop will update some bits in this
>>>>>> pointer. Then later kernel dereferences such pointer and crashes.
>>>>>>
>>>>>> To catch this, the following small debug patch was used (e.g. we check that 'private' field is pointer):
>>>>>>
>>>>>> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
>>>>>> index 33cb0a7330d2..b1d8deffec4d 100644
>>>>>> --- a/fs/erofs/data.c
>>>>>> +++ b/fs/erofs/data.c
>>>>>> @@ -238,6 +238,11 @@ void erofs_onlinefolio_end(struct folio *folio, int err, bool dirty)
>>>>>>     {
>>>>>>         int orig, v;
>>>>>>     +    if (((uintptr_t)folio->private) & 0xffff000000000000) {
>>>>>
>>>>> No, if erofs_onlinefolio_end() is called, `folio->private`
>>>>> shouldn't be a pointer, it's just a counter inside, and
>>>>> storing a pointer is unexpected.
>>>>>
>>>>> And since the folio is locked, it shouldn't call into
>>>>> try_to_free_buffers().
>>>>>
>>>>> Is it easy to reproduce? if yes, can you print other
>>>>> values like `folio->mapping` and `folio->index` as
>>>>> well?
>>>>>
>>>>> I need more informations to find some clues.
>>>>
>>>>
>>>>
>>>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>>>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
>>> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
>>> only erofs-utils 1.9+ ship it as an experimental
>>> feature, see Changelog; so I think you're using
>>> modified erofs-utils 1.8.10:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
>>>
>>> ```
>>> erofs-utils 1.9
>>>
>>>   * This release includes the following updates:
>>>     - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
>>> ```
>>>
>>> Second, I'm pretty sure this issue is related to
>>> experimenal `-E48bit`, and those information is
>>> not enough for me to find the root cause, so I
>>> need to find a way to reproduce myself: It may
>>> take time; you could debug yourself but I don't
>>> think it's an easy task if you don't quite familiar
>>> with the EROFS codebase.
>>>
>>> Anyway I really suggest if you need a rush solution
>>> for production, don't use `-E48bit + zstd` like
>>> this for now: try to use other options like
>>> `-zzstd -C65536 -Efragments` instead since those
>>> are common production choices.
>>
>> Ok thanks for this advice! One more question: currently we use this options:
>> "zstd,22 --max-extent-bytes 65536 -E48bit". Ok we remove "zstd,22" and "E48bit",
>> but what about "--max-extent-bytes 65536" - is it considered stable option?
>> Or it is better to use your version: "-zzstd -C65536 -Efragments" ?
> 
> I'm not sure how you find this
> "zstd,22 --max-extent-bytes 65536 -E48bit" combination.
> 
> My suggestion based on production is that as long as
> you don't use `-zzstd` ++ `-E48bit`, it should be fine.
> 
> If you need smaller images, I suggest: `-zlzma,9 -C65536 -Efragments`
> Or like Android, they all use `-zlz4hc`,
> Or zstd, but don't add `-E48bit`.
> 
> As for "--max-extent-bytes 65536", it can be dropped
> since if `-E48bit` is not used, it only has negative
> impacts.
> 
> In short, `-E48bit` + `-zzstd` + `--max-extent-bytes`
> enables new unaligned compression for zstd, but it's
> a relatively new feature. I still need some time to
> stabilize it, but my own time is limited and everything
> has to be prioritized.

Ok, thanks for this advice!

Thanks

> 
> Thanks,
> Gao Xiang
> 
>>
>> Thanks
>>
>>>
>>> Thanks,
>>> Gao Xiang
> 



* Re: erofs pointer corruption and kernel crash
  2026-04-11 15:10           ` Arseniy Krasnov
@ 2026-04-13  7:08             ` Gao Xiang
  2026-04-13  7:20               ` Arseniy Krasnov
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2026-04-13  7:08 UTC (permalink / raw)
  To: Arseniy Krasnov; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang



On 2026/4/11 23:10, Arseniy Krasnov wrote:
> 
> 
> 10.04.2026 18:41, Gao Xiang wrote:
>> Hi Arseniy,
>>
>> On 2026/4/10 21:27, Arseniy Krasnov wrote:
>>>
>>>
>>> 10.04.2026 15:20, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2026/4/10 19:37, Arseniy Krasnov wrote:
>>>>
>>>> (drop unrelated folks since they all subscribed erofs mailing list)
>>>>
>>>>>
>>>>>
>>>>> 10.04.2026 11:31, Gao Xiang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:

...

>>>>>>
>>>>>> I need more informations to find some clues.
>>>>>
>>>>>
>>>>>
>>>>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>>>>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
>>>> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
>>>> only erofs-utils 1.9+ ship it as an experimental
>>>> feature, see Changelog; so I think you're using
>>>> modified erofs-utils 1.8.10:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
>>>>
>>>> ```
>>>> erofs-utils 1.9
>>>>
>>>>    * This release includes the following updates:
>>>>      - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
>>>> ```
>>>>
>>>> Second, I'm pretty sure this issue is related to
>>>> experimenal `-E48bit`, and those information is
>>>> not enough for me to find the root cause, so I
>>>> need to find a way to reproduce myself: It may
>>>> take time; you could debug yourself but I don't
>>>> think it's an easy task if you don't quite familiar
>>>> with the EROFS codebase.
>>>>
>>>> Anyway I really suggest if you need a rush solution
>>>> for production, don't use `-E48bit + zstd` like
>>>> this for now: try to use other options like
>>>> `-zzstd -C65536 -Efragments` instead since those
>>>> are common production choices.
>>>
>>> Ok thanks for this advice! One more question: currently we use this options:
>>> "zstd,22 --max-extent-bytes 65536 -E48bit". Ok we remove "zstd,22" and "E48bit",
>>> but what about "--max-extent-bytes 65536" - is it considered stable option?
>>> Or it is better to use your version: "-zzstd -C65536 -Efragments" ?
>>
>> I'm not sure how you find this
>> "zstd,22 --max-extent-bytes 65536 -E48bit" combination.
>>
>> My suggestion based on production is that as long as
>> you don't use `-zzstd` ++ `-E48bit`, it should be fine.
>>
>> If you need smaller images, I suggest: `-zlzma,9 -C65536 -Efragments`
>> Or like Android, they all use `-zlz4hc`,
>> Or zstd, but don't add `-E48bit`.
>>
>> As for "--max-extent-bytes 65536", it can be dropped
>> since if `-E48bit` is not used, it only has negative
>> impacts.
>>
>> In short, `-E48bit` + `-zzstd` + `--max-extent-bytes`
>> enables new unaligned compression for zstd, but it's
>> a relatively new feature. I still need some time to
>> stabilize it, but my own time is limited and everything
>> has to be prioritized.
> 
> Ok, thanks for this advice!

FYI, I can reproduce this issue locally with `-E48bit`
enabled, within 600s.

I do think it's a `-E48bit` + zstd issue so
non-`-E48bit` won't be impacted and I will find time
to troubleshoot it this week.

Thanks,
Gao Xiang

> 
> Thanks
> 
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> Thanks
>>>
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>



* Re: erofs pointer corruption and kernel crash
  2026-04-13  7:08             ` Gao Xiang
@ 2026-04-13  7:20               ` Arseniy Krasnov
  0 siblings, 0 replies; 22+ messages in thread
From: Arseniy Krasnov @ 2026-04-13  7:20 UTC (permalink / raw)
  To: Gao Xiang; +Cc: oxffffaa, linux-erofs, linux-kernel, kernel, Gao Xiang



13.04.2026 10:08, Gao Xiang wrote:
> 
> 
> On 2026/4/11 23:10, Arseniy Krasnov wrote:
>>
>>
>> 10.04.2026 18:41, Gao Xiang wrote:
>>> Hi Arseniy,
>>>
>>> On 2026/4/10 21:27, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> 10.04.2026 15:20, Gao Xiang wrote:
>>>>>
>>>>>
>>>>> On 2026/4/10 19:37, Arseniy Krasnov wrote:
>>>>>
>>>>> (drop unrelated folks since they all subscribed erofs mailing list)
>>>>>
>>>>>>
>>>>>>
>>>>>> 10.04.2026 11:31, Gao Xiang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
> 
> ...
> 
>>>>>>>
>>>>>>> I need more informations to find some clues.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>>>>>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
>>>>> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
>>>>> only erofs-utils 1.9+ ship it as an experimental
>>>>> feature, see Changelog; so I think you're using
>>>>> modified erofs-utils 1.8.10:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
>>>>>
>>>>> ```
>>>>> erofs-utils 1.9
>>>>>
>>>>>    * This release includes the following updates:
>>>>>      - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
>>>>> ```
>>>>>
>>>>> Second, I'm pretty sure this issue is related to
>>>>> experimenal `-E48bit`, and those information is
>>>>> not enough for me to find the root cause, so I
>>>>> need to find a way to reproduce myself: It may
>>>>> take time; you could debug yourself but I don't
>>>>> think it's an easy task if you don't quite familiar
>>>>> with the EROFS codebase.
>>>>>
>>>>> Anyway I really suggest if you need a rush solution
>>>>> for production, don't use `-E48bit + zstd` like
>>>>> this for now: try to use other options like
>>>>> `-zzstd -C65536 -Efragments` instead since those
>>>>> are common production choices.
>>>>
>>>> Ok thanks for this advice! One more question: currently we use this options:
>>>> "zstd,22 --max-extent-bytes 65536 -E48bit". Ok we remove "zstd,22" and "E48bit",
>>>> but what about "--max-extent-bytes 65536" - is it considered stable option?
>>>> Or it is better to use your version: "-zzstd -C65536 -Efragments" ?
>>>
>>> I'm not sure how you find this
>>> "zstd,22 --max-extent-bytes 65536 -E48bit" combination.
>>>
>>> My suggestion based on production is that as long as
>>> you don't use `-zzstd` ++ `-E48bit`, it should be fine.
>>>
>>> If you need smaller images, I suggest: `-zlzma,9 -C65536 -Efragments`
>>> Or like Android, they all use `-zlz4hc`,
>>> Or zstd, but don't add `-E48bit`.
>>>
>>> As for "--max-extent-bytes 65536", it can be dropped
>>> since if `-E48bit` is not used, it only has negative
>>> impacts.
>>>
>>> In short, `-E48bit` + `-zzstd` + `--max-extent-bytes`
>>> enables new unaligned compression for zstd, but it's
>>> a relatively new feature. I still need some time to
>>> stabilize it, but my own time is limited and everything
>>> has to be prioritized.
>>
>> Ok, thanks for this advice!
> 
> FYI, I can reproduce this issue locally with `-E48bit`
> enabled, within 600s.
> 
> I do think it's a `-E48bit` + zstd issue so
> non-`-E48bit` won't be impacted and I will find time
> to troubleshoot it this week.

Yes, without '-E48bit' we also could not reproduce it over the entire weekend on several boards. No such panics.

Thanks

> 
> Thanks,
> Gao Xiang
> 
>>
>> Thanks
>>
>>>
>>> Thanks,
>>> Gao Xiang
>>>
>>>>
>>>> Thanks
>>>>
>>>>>
>>>>> Thanks,
>>>>> Gao Xiang
>>>
> 



Thread overview: 22+ messages
2026-04-10  8:13 erofs pointer corruption and kernel crash Arseniy Krasnov
2026-04-10  8:31 ` Gao Xiang
2026-04-10  8:42   ` Gao Xiang
2026-04-10  8:51     ` Arseniy Krasnov
2026-04-10  8:59       ` Gao Xiang
2026-04-10  8:55   ` Arseniy Krasnov
2026-04-10  9:20     ` Gao Xiang
2026-04-10  9:59       ` Arseniy Krasnov
2026-04-10 10:01         ` Gao Xiang
2026-04-10 10:03           ` Arseniy Krasnov
2026-04-10 10:06             ` Gao Xiang
2026-04-10 10:10               ` Arseniy Krasnov
2026-04-10 10:22                 ` Gao Xiang
2026-04-10 10:31                   ` Arseniy Krasnov
2026-04-10 11:37   ` Arseniy Krasnov
2026-04-10 12:20     ` Gao Xiang
2026-04-10 13:27       ` Arseniy Krasnov
2026-04-10 15:41         ` Gao Xiang
2026-04-11 15:10           ` Arseniy Krasnov
2026-04-13  7:08             ` Gao Xiang
2026-04-13  7:20               ` Arseniy Krasnov
2026-04-10 13:35       ` Arseniy Krasnov
