* btrfs crashes during routine btrfs-balance-least-used
@ 2024-07-14 16:13 Kai Krakow
From: Kai Krakow @ 2024-07-14 16:13 UTC (permalink / raw)
To: linux-btrfs, Oliver Wien
Hello btrfs list!
(also reported in irc)
Our btrfs pool crashed during a routine btrfs-balance-least-used.
Maybe of interest: bees is also running on this filesystem, and snapper
takes hourly snapshots with a retention policy.
I'm currently still collecting diagnostics; the "btrfs check" log is
already 3 GB and growing.
The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):
[1143841.581968] BTRFS info (device vdc1): balance: start
-dvrange=402046058496..402046058497
[1143841.583434] BTRFS info (device vdc1): relocating block group
402046058496 flags data
[1143852.414459] BTRFS info (device vdc1): found 45428 extents, stage:
move data extents
[1143913.107511] ------------[ cut here ]------------
[1143913.107516] WARNING: CPU: 10 PID: 937 at
fs/btrfs/extent-tree.c:3092 __btrfs_free_extent+0x68e/0x1130
[1143913.107583] CPU: 10 PID: 937 Comm: btrfs-transacti Not tainted
6.6.30-gentoo #1
[1143913.107585] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[1143913.107587] RIP: 0010:__btrfs_free_extent+0x68e/0x1130
[1143913.107590] Code: 58 61 00 00 49 8b 7d 50 49 89 d8 48 89 e9 45 8b
4e 40 48 c7 c6 20 48 06 af 8b 94 24 98 00 00 00 e8 37 f3 0a 00 e9 95
fc ff ff <0f> 0b f0 48 0f ba a8 f8 09 00 00 02 41 b8 00 00 00 00 0f 83
33 03
[1143913.107592] RSP: 0018:ffffbdd081063c78 EFLAGS: 00010246
[1143913.107595] RAX: ffffa0e64547a000 RBX: 0000005dc970d000 RCX:
0000000000004000
[1143913.107597] RDX: 0000000000000011 RSI: ffffa0e682bee2da RDI:
ffffbdd081063c17
[1143913.107598] RBP: 0000000000000001 R08: 0000005dc970e000 R09:
00000000001000a8
[1143913.107599] R10: a80000005dc970e0 R11: 0000000000001000 R12:
0000000004da9000
[1143913.107600] R13: ffffa0e82ed28270 R14: ffffa0e8059e4700 R15:
ffffa0e8742c40e0
[1143913.107601] FS: 0000000000000000(0000) GS:ffffa0f833c80000(0000)
knlGS:0000000000000000
[1143913.107605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1143913.107607] CR2: 000000005665c138 CR3: 00000001060fb000 CR4:
00000000001506e0
[1143913.107608] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1143913.107608] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1143913.107609] Call Trace:
[1143913.107612] <TASK>
[1143913.107613] ? __warn+0x62/0xc0
[1143913.107638] ? __btrfs_free_extent+0x68e/0x1130
[1143913.107640] ? report_bug+0x15e/0x1a0
[1143913.107661] ? handle_bug+0x36/0x70
[1143913.107674] ? exc_invalid_op+0x13/0x60
[1143913.107676] ? asm_exc_invalid_op+0x16/0x20
[1143913.107687] ? __btrfs_free_extent+0x68e/0x1130
[1143913.107689] __btrfs_run_delayed_refs+0x274/0xfc0
[1143913.107691] btrfs_run_delayed_refs+0x50/0x1f0
[1143913.107692] btrfs_commit_transaction+0x65/0xd40
[1143913.107696] ? start_transaction+0xcb/0x570
[1143913.107698] transaction_kthread+0x150/0x1b0
[1143913.107701] ? close_ctree+0x420/0x420
[1143913.107703] kthread+0xc4/0xf0
[1143913.107715] ? kthread_complete_and_exit+0x20/0x20
[1143913.107718] ret_from_fork+0x28/0x40
[1143913.107729] ? kthread_complete_and_exit+0x20/0x20
[1143913.107732] ret_from_fork_asm+0x11/0x20
[1143913.107734] </TASK>
[1143913.107735] ---[ end trace 0000000000000000 ]---
[1143913.107737] ------------[ cut here ]------------
[1143913.107737] BTRFS: Transaction aborted (error -117)
[1143913.107749] WARNING: CPU: 10 PID: 937 at
fs/btrfs/extent-tree.c:3093 __btrfs_free_extent+0xed9/0x1130
[1143913.107752] CPU: 10 PID: 937 Comm: btrfs-transacti Tainted: G
W 6.6.30-gentoo #1
[1143913.107754] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[1143913.107754] RIP: 0010:__btrfs_free_extent+0xed9/0x1130
[1143913.107756] Code: ff be 8b ff ff ff 48 c7 c7 80 3f 06 af e8 8f 5b
c2 ff 0f 0b e9 04 fb ff ff be 8b ff ff ff 48 c7 c7 80 3f 06 af e8 77
5b c2 ff <0f> 0b e9 20 fb ff ff 8b 5c 24 28 89 df e8 95 2f ff ff 84 c0
0f 85
[1143913.107757] RSP: 0018:ffffbdd081063c78 EFLAGS: 00010296
[1143913.107759] RAX: 0000000000000027 RBX: 0000005dc970d000 RCX:
0000000000000027
[1143913.107760] RDX: ffffa0f833c9b448 RSI: 0000000000000001 RDI:
ffffa0f833c9b440
[1143913.107761] RBP: 0000000000000001 R08: 0000000000000001 R09:
00000000ffffdfff
[1143913.107762] R10: 0000000000000000 R11: 0000000000000003 R12:
0000000004da9000
[1143913.107762] R13: ffffa0e82ed28270 R14: ffffa0e8059e4700 R15:
ffffa0e8742c40e0
[1143913.107763] FS: 0000000000000000(0000) GS:ffffa0f833c80000(0000)
knlGS:0000000000000000
[1143913.107766] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1143913.107767] CR2: 000000005665c138 CR3: 00000001060fb000 CR4:
00000000001506e0
[1143913.107768] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1143913.107769] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1143913.107770] Call Trace:
[1143913.107771] <TASK>
[1143913.107771] ? __warn+0x62/0xc0
[1143913.107774] ? __btrfs_free_extent+0xed9/0x1130
[1143913.107775] ? report_bug+0x15e/0x1a0
[1143913.107778] ? handle_bug+0x36/0x70
[1143913.107780] ? exc_invalid_op+0x13/0x60
[1143913.107783] ? asm_exc_invalid_op+0x16/0x20
[1143913.107785] ? __btrfs_free_extent+0xed9/0x1130
[1143913.107786] ? __btrfs_free_extent+0xed9/0x1130
[1143913.107788] __btrfs_run_delayed_refs+0x274/0xfc0
[1143913.107789] btrfs_run_delayed_refs+0x50/0x1f0
[1143913.107791] btrfs_commit_transaction+0x65/0xd40
[1143913.107794] ? start_transaction+0xcb/0x570
[1143913.107797] transaction_kthread+0x150/0x1b0
[1143913.107804] ? close_ctree+0x420/0x420
[1143913.107806] kthread+0xc4/0xf0
[1143913.107809] ? kthread_complete_and_exit+0x20/0x20
[1143913.107812] ret_from_fork+0x28/0x40
[1143913.107814] ? kthread_complete_and_exit+0x20/0x20
[1143913.107817] ret_from_fork_asm+0x11/0x20
[1143913.107818] </TASK>
[1143913.107819] ---[ end trace 0000000000000000 ]---
[1143913.107820] BTRFS: error (device vdc1: state A) in
__btrfs_free_extent:3093: errno=-117 Filesystem corrupted
[1143913.107823] BTRFS info (device vdc1: state EA): forced readonly
[1143913.107829] BTRFS info (device vdc1: state EA): leaf
1581679099904 gen 3933860 total ptrs 260 free space 5156 owner 2
[1143913.107831] item 0 key (402811465728 178 5935621899263475458)
itemoff 16255 itemsize 28
[1143913.107834] extent data backref root 340 objectid 338204534
offset 0 count 1
[1143913.107835] item 1 key (402811465728 178 5935621899272047437)
itemoff 16227 itemsize 28
"btrfs check" can only run in lowmem mode; otherwise it crashes with
"out of memory" (the system has 74 GB of RAM). Here's the beginning of
the log:
[1/7] checking root items
[2/7] checking extents
ERROR: shared extent 15929577472 referencer lost (parent: 1147747794944)
ERROR: shared extent 15929577472 referencer lost (parent: 1148095201280)
ERROR: shared extent 15929577472 referencer lost (parent: 1175758274560)
(repeating thousands of similar lines)
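
For orientation: the "extent data backref" entries in the leaf dump
above and the "referencer lost (parent: ...)" errors here both refer to
btrfs's on-disk backref records. As a sketch (paraphrased from the
kernel's include/uapi/linux/btrfs_tree.h; check the header of your
kernel version for the authoritative layout):

#include <linux/types.h>

/* One "extent data backref root R objectid O offset F count C" entry. */
struct btrfs_extent_data_ref {
	__le64 root;		/* id of the tree holding the referencing inode */
	__le64 objectid;	/* inode number of the referencing file */
	__le64 offset;		/* file offset the extent is referenced at */
	__le32 count;		/* refs from this (root, objectid, offset) tuple */
} __attribute__((__packed__));

/* A "shared data backref parent P count C" entry stores only the count;
 * the parent tree block's bytenr lives in the surrounding key or inline
 * ref offset. The "referencer lost" errors mean the parent block no
 * longer references the extent that claims to be shared from it. */
struct btrfs_shared_data_ref {
	__le32 count;
} __attribute__((__packed__));
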
Last gist: https://gist.tnonline.net/Z4 (meanwhile, this log is over
3 GB; I can upload it somewhere later).
We have backups (daily, stored inside borg on a remote host).
Is there anything we can do? Restoring from backup will probably take
more than 24h (3 TB). The system runs web and mail hosts for more than
100 customers.
We did not try to run "btrfs check --repair" yet, nor
"--init-extent-tree". I'd rather try a quick repair before restoring.
But OTOH, I don't want to make it worse and waste time by trying.
Unfortunately, the btrfs was mounted rw again after the unmount that
followed the incident. This restarted the balance, and it seems it
changed the first error "btrfs check" found. I'll try
"ro,skip_balance" after btrfs check has finished. I think the
filesystem is still fully readable and we can take one last backup.
Also, I'll happily provide the collected logs if a dev wants to look into them.
Thanks in advance
Kai

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Qu Wenruo @ 2024-07-14 21:53 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs, Oliver Wien

On 2024/7/15 01:43, Kai Krakow wrote:
> Hello btrfs list!
>
> (also reported in irc)
>
> Our btrfs pool crashed during a routine btrfs-balance-least-used.
> Maybe of interest: bees is also running on this filesystem, and snapper
> takes hourly snapshots with a retention policy.
>
> I'm currently still collecting diagnostics; the "btrfs check" log is
> already 3 GB and growing.
>
> The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
>
> Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):

Unfortunately the full log is not really full.

There should be an extent leaf dump, and after that dump, the reason
why we believe it's a problem.

Is there any true full dmesg dump?

But overall, most of the errors inside __btrfs_free_extent() would be
extent tree corruption.

> [...]
>
> "btrfs check" can only run in lowmem mode; otherwise it crashes with
> "out of memory" (the system has 74 GB of RAM). Here's the beginning of
> the log:
>
> [1/7] checking root items
> [2/7] checking extents
> ERROR: shared extent 15929577472 referencer lost (parent: 1147747794944)

I believe that's the cause, some extent tree corruption.

> ERROR: shared extent 15929577472 referencer lost (parent: 1148095201280)
> ERROR: shared extent 15929577472 referencer lost (parent: 1175758274560)
> (repeating thousands of similar lines)
>
> Last gist: https://gist.tnonline.net/Z4 (meanwhile, this log is over
> 3 GB; I can upload it somewhere later).
>
> We have backups (daily, stored inside borg on a remote host).
>
> Is there anything we can do? Restoring from backup will probably take
> more than 24h (3 TB). The system runs web and mail hosts for more than
> 100 customers.
>
> We did not try to run "btrfs check --repair" yet, nor
> "--init-extent-tree". I'd rather try a quick repair before restoring.
> But OTOH, I don't want to make it worse and waste time by trying.

Considering the size of the metadata, I do not believe --repair nor
--init-extent-tree is going to fully fix the problem.

>
> Unfortunately, the btrfs was mounted rw again after the unmount that
> followed the incident. This restarted the balance, and it seems it
> changed the first error "btrfs check" found. I'll try
> "ro,skip_balance" after btrfs check has finished. I think the
> filesystem is still fully readable and we can take one last backup.
>
> Also, I'll happily provide the collected logs if a dev wants to look
> into them.

I guess there is no real full dmesg of that incident?

The corrupted extent leaf has 260 items, but the dump only contains 36,
and it's missing the final reason line.

The other thing is: does the server have ECC memory?
It's not uncommon to see bitflips causing various problems (almost
monthly reports).

If the machine doesn't have ECC memory, then a memtest would be
preferable.

Thanks,
Qu

>
> Thanks in advance
> Kai

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Kai Krakow @ 2024-07-15  4:29 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, Oliver Wien

Hello Qu!

Thanks for looking into this. We already started the restore, so we no
longer have any access to the corrupted disk images.

On Sun, Jul 14, 2024 at 11:54 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2024/7/15 01:43, Kai Krakow wrote:
> > Hello btrfs list!
> >
> > (also reported in irc)
> >
> > Our btrfs pool crashed during a routine btrfs-balance-least-used.
> > Maybe of interest: bees is also running on this filesystem, and snapper
> > takes hourly snapshots with a retention policy.
> >
> > I'm currently still collecting diagnostics; the "btrfs check" log is
> > already 3 GB and growing.
> >
> > The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
> >
> > Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):
>
> Unfortunately the full log is not really full.
>
> There should be an extent leaf dump, and after that dump, the reason
> why we believe it's a problem.
>
> Is there any true full dmesg dump?

Yes, sorry. The gist had been truncated - mea culpa. I repasted it:
https://gist.tnonline.net/6Q

> But overall, most of the errors inside __btrfs_free_extent() would be
> extent tree corruption.
>
> > [...]
> >
> > "btrfs check" can only run in lowmem mode; otherwise it crashes with
> > "out of memory" (the system has 74 GB of RAM). Here's the beginning of
> > the log:
> >
> > [1/7] checking root items
> > [2/7] checking extents
> > ERROR: shared extent 15929577472 referencer lost (parent: 1147747794944)
>
> I believe that's the cause, some extent tree corruption.
>
> > ERROR: shared extent 15929577472 referencer lost (parent: 1148095201280)
> > ERROR: shared extent 15929577472 referencer lost (parent: 1175758274560)
> > (repeating thousands of similar lines)
> >
> > Last gist: https://gist.tnonline.net/Z4 (meanwhile, this log is over
> > 3 GB; I can upload it somewhere later).
> >
> > We have backups (daily, stored inside borg on a remote host).
> >
> > Is there anything we can do? Restoring from backup will probably take
> > more than 24h (3 TB). The system runs web and mail hosts for more than
> > 100 customers.
> >
> > We did not try to run "btrfs check --repair" yet, nor
> > "--init-extent-tree". I'd rather try a quick repair before restoring.
> > But OTOH, I don't want to make it worse and waste time by trying.
>
> Considering the size of the metadata, I do not believe --repair nor
> --init-extent-tree is going to fully fix the problem.

Yes, I suspected that and cancelled the btrfs check. I still have the
log (3.8 GB) but it's probably not useful; I cancelled at step 3/7
after seeing it sitting there for hours without output but with 100%
CPU usage on one core.

> > Unfortunately, the btrfs was mounted rw again after the unmount that
> > followed the incident. This restarted the balance, and it seems it
> > changed the first error "btrfs check" found. I'll try
> > "ro,skip_balance" after btrfs check has finished. I think the
> > filesystem is still fully readable and we can take one last backup.
> >
> > Also, I'll happily provide the collected logs if a dev wants to look
> > into them.
>
> I guess there is no real full dmesg of that incident?
>
> The corrupted extent leaf has 260 items, but the dump only contains 36,
> and it's missing the final reason line.

See above, I repasted it.

> The other thing is: does the server have ECC memory?
> It's not uncommon to see bitflips causing various problems (almost
> monthly reports).

I don't know the exact hosting environment; we are inside a QEMU VM.
But I'm pretty sure it is ECC.

The disk images are hosted on DRBD, with two redundant remote block
devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
from within the VM. Because the lower storage layer is redundant, we
are not running a data raid profile in btrfs, but we use multiple block
devices because we are seeing better latency behavior that way.

> If the machine doesn't have ECC memory, then a memtest would be
> preferable.

I'll ask our data center operators about ECC but I'm pretty sure the
answer will be: yes, it's ECC.

We have been using their data centers for 20+ years and have never
seen a bit flip or storage failure.

I wonder if the parallel use of snapper (hourly, with thinning after
24h), bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in
the hosting services) and btrfs-balance-least-used somehow triggered
this. I remember some old reports where bees could trigger corruption
during balance or scrub, and worked around that by pausing itself when
it detected one. I don't know if this is still an issue (kernel 6.6.30
LTS).

Thanks,
Kai

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Qu Wenruo @ 2024-07-15  5:00 UTC (permalink / raw)
To: Kai Krakow, Qu Wenruo; +Cc: linux-btrfs, Oliver Wien

On 2024/7/15 13:59, Kai Krakow wrote:
> Hello Qu!
>
> Thanks for looking into this. We already started the restore, so we no
> longer have any access to the corrupted disk images.
>
> On Sun, Jul 14, 2024 at 11:54 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> On 2024/7/15 01:43, Kai Krakow wrote:
>>> Hello btrfs list!
>>>
>>> (also reported in irc)
>>>
>>> Our btrfs pool crashed during a routine btrfs-balance-least-used.
>>> Maybe of interest: bees is also running on this filesystem, and snapper
>>> takes hourly snapshots with a retention policy.
>>>
>>> I'm currently still collecting diagnostics; the "btrfs check" log is
>>> already 3 GB and growing.
>>>
>>> The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
>>>
>>> Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):
>>
>> Unfortunately the full log is not really full.
>>
>> There should be an extent leaf dump, and after that dump, the reason
>> why we believe it's a problem.
>>
>> Is there any true full dmesg dump?
>
> Yes, sorry. The gist had been truncated - mea culpa. I repasted it:
> https://gist.tnonline.net/6Q

Thanks a lot!

That contains (almost) all the info we need to know.

The offending bytenr is 402811572224, and in the dump we indeed have
the item for it:

[1143913.108184] item 188 key (402811572224 168 4096) itemoff 14598
itemsize 79
[1143913.108185] extent refs 3 gen 3678544 flags 1
[1143913.108186] ref#0: extent data backref root 13835058055282163977
objectid 281473384125923 offset 81432576 count 1

The last line is the problem.

Firstly, we shouldn't even have a root with that high a value.
Secondly, that root number 13835058055282163977 is a very weird hex
value too (0xc000000000000109); the leading '0xc' nibble means it's
definitely not a simple bitflip.

Furthermore, the objectid value is also very weird (0xffffa11315e3).
No wonder the extent tree is not going to handle it correctly.

But I have no idea why this happens; it passes csum, thus I'm assuming
it's runtime corruption.

[...]
>
>> The other thing is: does the server have ECC memory?
>> It's not uncommon to see bitflips causing various problems (almost
>> monthly reports).
>
> I don't know the exact hosting environment; we are inside a QEMU VM.
> But I'm pretty sure it is ECC.

And considering it's some virtualization environment, you do not have
any out-of-tree modules?

>
> The disk images are hosted on DRBD, with two redundant remote block
> devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
> from within the VM. Because the lower storage layer is redundant, we
> are not running a data raid profile in btrfs, but we use multiple block
> devices because we are seeing better latency behavior that way.
>
>> If the machine doesn't have ECC memory, then a memtest would be
>> preferable.
>
> I'll ask our data center operators about ECC but I'm pretty sure the
> answer will be: yes, it's ECC.
>
> We have been using their data centers for 20+ years and have never
> seen a bit flip or storage failure.

Yeah, I do not think it's hardware corruption either now.

> I wonder if the parallel use of snapper (hourly, with thinning after
> 24h), bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in
> the hosting services)

Snapshotting is done at a very specific point in time (at the end of a
transaction commit), thus it should not be related to balance
operations.

> and btrfs-balance-least-used somehow triggered this.
> I remember some old reports where bees could trigger corruption
> during balance or scrub, and worked around that by pausing itself when
> it detected one. I don't know if this is still an issue (kernel 6.6.30
> LTS).

No recent bugs come to mind immediately, but even if there were one,
this corruption looks too special.
It still matches the item size and ref count, but in the middle the
data got corrupted with seeming garbage.

As the higher bits of a u64 are stored at higher addresses in x86_64
memory, the corruption looks to cover the following bytes:

 0                       8                       16
 | le64 root             | le64 objectid         |
 |09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
                       ====================

 16                      24          28
 | le64 offset           | le32 refs |
 |00 90 da 04 00 00 00 00|01 00 00 00|

So far the corruption looks like it starts at byte 7 and ends at
byte 14.

In theory, if you had kept the image, we could spend enough time to
find out the correct values, but so far it really looks like some
garbage filling the range.

For now what I can do is add extra checks (making sure the root number
looks valid), but that won't really catch all randomly corrupted data.

And as my final guess, did this VM experience any migration?
As that's the only non-btrfs thing I can think of that would corrupt
data at runtime.

Thanks,
Qu

>
> Thanks,
> Kai
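
To make the decode above concrete, here is a small stand-alone sketch
(a hypothetical illustration only; it assumes a little-endian host and
uses the corrected byte values from the dump above) that reparses the
28 bytes of ref#0 the same way the kernel would:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	/* ref#0 of item 188, leaf 1581679099904, as dumped above. */
	const uint8_t raw[28] = {
		0x09, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, /* le64 root */
		0xe3, 0x15, 0x13, 0xa1, 0xff, 0xff, 0x00, 0x00, /* le64 objectid */
		0x00, 0x90, 0xda, 0x04, 0x00, 0x00, 0x00, 0x00, /* le64 offset */
		0x01, 0x00, 0x00, 0x00,                         /* le32 count */
	};
	uint64_t root, objectid, offset;
	uint32_t count;

	memcpy(&root, raw, 8);
	memcpy(&objectid, raw + 8, 8);
	memcpy(&offset, raw + 16, 8);
	memcpy(&count, raw + 24, 4);

	/* root prints as 13835058055282163977 (0xc000000000000109); the low
	 * bits still decode to 0x109 == 265, the root id the delayed-ref
	 * error in the full log was looking for, so only byte 7 of the
	 * root field was clobbered. */
	printf("root     = %llu (0x%016llx)\n",
	       (unsigned long long)root, (unsigned long long)root);
	/* objectid prints as 281473384125923 (0xffffa11315e3), far outside
	 * any plausible inode number range. */
	printf("objectid = %llu (0x%016llx)\n",
	       (unsigned long long)objectid, (unsigned long long)objectid);
	/* offset (81432576) and count (1) survived intact and match the
	 * "offset 81432576 ... count 1" in the dump. */
	printf("offset   = %llu, count = %u\n",
	       (unsigned long long)offset, count);
	return 0;
}
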

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Kai Krakow @ 2024-07-15  5:31 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Oliver Wien

On Mon, Jul 15, 2024 at 7:00 AM Qu Wenruo <wqu@suse.com> wrote:
> On 2024/7/15 13:59, Kai Krakow wrote:
> > Hello Qu!
> >
> > Thanks for looking into this. We already started the restore, so we no
> > longer have any access to the corrupted disk images.
> >
> > On Sun, Jul 14, 2024 at 11:54 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >> On 2024/7/15 01:43, Kai Krakow wrote:
> >>> Hello btrfs list!
> >>>
> >>> (also reported in irc)
> >>>
> >>> Our btrfs pool crashed during a routine btrfs-balance-least-used.
> >>> Maybe of interest: bees is also running on this filesystem, and snapper
> >>> takes hourly snapshots with a retention policy.
> >>>
> >>> I'm currently still collecting diagnostics; the "btrfs check" log is
> >>> already 3 GB and growing.
> >>>
> >>> The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
> >>>
> >>> Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):
> >>
> >> Unfortunately the full log is not really full.
> >>
> >> There should be an extent leaf dump, and after that dump, the reason
> >> why we believe it's a problem.
> >>
> >> Is there any true full dmesg dump?
> >
> > Yes, sorry. The gist had been truncated - mea culpa. I repasted it:
> > https://gist.tnonline.net/6Q
>
> Thanks a lot!
>
> That contains (almost) all the info we need to know.
>
> The offending bytenr is 402811572224, and in the dump we indeed have
> the item for it:
>
> [1143913.108184] item 188 key (402811572224 168 4096) itemoff 14598
> itemsize 79
> [1143913.108185] extent refs 3 gen 3678544 flags 1
> [1143913.108186] ref#0: extent data backref root 13835058055282163977
> objectid 281473384125923 offset 81432576 count 1
>
> The last line is the problem.
>
> Firstly, we shouldn't even have a root with that high a value.
> Secondly, that root number 13835058055282163977 is a very weird hex
> value too (0xc000000000000109); the leading '0xc' nibble means it's
> definitely not a simple bitflip.

Oh wow, I didn't even notice that. I skimmed through the logs to
resolve the root numbers to subvolids, hoping for an idea where the
"corrupted extents" are - but they were all over the place in
seemingly unrelated subvols.

This host runs several systemd-nspawn containers with various
generations of PHP, MySQL containers (MySQL data itself is on a
different partition because btrfs and mysql don't play well), a huge
maildir storage, a huge web vhost storage, and some mail filter / mail
services containers. Just to give you an idea of what kind of data and
workload is involved...

> Furthermore, the objectid value is also very weird (0xffffa11315e3).
> No wonder the extent tree is not going to handle it correctly.
>
> But I have no idea why this happens; it passes csum, thus I'm assuming
> it's runtime corruption.

We had some crashes in the past due to OOM; sometimes btrfs was
involved. This was largely solved by disabling huge pages, updating
from kernel 6.1 to 6.6, and running with a bees patch that reduces the
memory used for ref lookups:
https://github.com/Zygo/bees/issues/260

> [...]
> >
> >> The other thing is: does the server have ECC memory?
> >> It's not uncommon to see bitflips causing various problems (almost
> >> monthly reports).
> >
> > I don't know the exact hosting environment; we are inside a QEMU VM.
> > But I'm pretty sure it is ECC.
>
> And considering it's some virtualization environment, you do not have
> any out-of-tree modules?

No, the system is running Gentoo, the kernel is manually configured
and runs without module support. Everything is baked in.

> > The disk images are hosted on DRBD, with two redundant remote block
> > devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
> > from within the VM. Because the lower storage layer is redundant, we
> > are not running a data raid profile in btrfs, but we use multiple block
> > devices because we are seeing better latency behavior that way.
> >
> >> If the machine doesn't have ECC memory, then a memtest would be
> >> preferable.
> >
> > I'll ask our data center operators about ECC but I'm pretty sure the
> > answer will be: yes, it's ECC.
> >
> > We have been using their data centers for 20+ years and have never
> > seen a bit flip or storage failure.
>
> Yeah, I do not think it's hardware corruption either now.

Yes, what you found above looks really messed up - that's not a
bitflip.

> > I wonder if the parallel use of snapper (hourly, with thinning after
> > 24h), bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in
> > the hosting services)
>
> Snapshotting is done at a very specific point in time (at the end of a
> transaction commit), thus it should not be related to balance
> operations.
>
> > and btrfs-balance-least-used somehow triggered this.
> > I remember some old reports where bees could trigger corruption
> > during balance or scrub, and worked around that by pausing itself when
> > it detected one. I don't know if this is still an issue (kernel 6.6.30
> > LTS).
>
> No recent bugs come to mind immediately, but even if there were one,
> this corruption looks too special.
> It still matches the item size and ref count, but in the middle the
> data got corrupted with seeming garbage.

I think Zygo has some notes on it in the bees github:
https://github.com/Zygo/bees/blob/master/docs/btrfs-kernel.md

I think it was about btrfs-send and dedup at the same time... Memory
fades faster as one gets older... ;-)

> As the higher bits of a u64 are stored at higher addresses in x86_64
> memory, the corruption looks to cover the following bytes:
>
>  0                       8                       16
>  | le64 root             | le64 objectid         |
>  |09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
>                        ====================
>
>  16                      24          28
>  | le64 offset           | le32 refs |
>  |00 90 da 04 00 00 00 00|01 00 00 00|
>
> So far the corruption looks like it starts at byte 7 and ends at
> byte 14.
>
> In theory, if you had kept the image, we could spend enough time to
> find out the correct values, but so far it really looks like some
> garbage filling the range.
>
> For now what I can do is add extra checks (making sure the root number
> looks valid), but that won't really catch all randomly corrupted data.

This could be useful. Will this be in btrfs check, or in the kernel?

> And as my final guess, did this VM experience any migration?

No. It has been sitting there for 4 or 5 years. But we are slowly
approaching the capacity of the hardware. What happens then: our data
center operator would shut the VM down, rewire the DRBD to other
hardware, and boot it again. No hot migration, if you meant that.

I'm not sure how reliable DRBD is, but I researched it a little while
ago and it seems to prefer reliability over speed, so it looks very
solid. I don't think anything broke there, and even then, 8 bytes of
garbage looks strange. Well, that's 64 bits of garbage - if that means
anything.

> As that's the only non-btrfs thing I can think of that would corrupt
> data at runtime.

I'm using btrfs in various workloads, even with partially broken
hardware (raid1 on spinning rust with broken sectors). And while it
had its flaws like 10+ years ago, it has been rock solid for me during
the last 5 years at least - even if I handled the hardware very
impatiently (like hitting the reset button or cycling power). Of
course, this system we're talking about has been handled seriously.
;-)

Thanks,
Kai

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Qu Wenruo @ 2024-07-15  5:50 UTC (permalink / raw)
To: Kai Krakow, Qu Wenruo; +Cc: linux-btrfs, Oliver Wien

On 2024/7/15 15:01, Kai Krakow wrote:
> On Mon, Jul 15, 2024 at 7:00 AM Qu Wenruo <wqu@suse.com> wrote:
[...]
>> The last line is the problem.
>>
>> Firstly, we shouldn't even have a root with that high a value.
>> Secondly, that root number 13835058055282163977 is a very weird hex
>> value too (0xc000000000000109); the leading '0xc' nibble means it's
>> definitely not a simple bitflip.
>
> Oh wow, I didn't even notice that. I skimmed through the logs to
> resolve the root numbers to subvolids, hoping for an idea where the
> "corrupted extents" are - but they were all over the place in
> seemingly unrelated subvols.
>
> This host runs several systemd-nspawn containers with various
> generations of PHP, MySQL containers (MySQL data itself is on a
> different partition because btrfs and mysql don't play well), a huge
> maildir storage, a huge web vhost storage, and some mail filter / mail
> services containers. Just to give you an idea of what kind of data and
> workload is involved...

That shouldn't be a problem, as the only thing that can access btrfs
metadata is btrfs itself.

Although in theory, anything that can access the host memory can also
access the kernel memory of the guest...

>
>> Furthermore, the objectid value is also very weird (0xffffa11315e3).
>> No wonder the extent tree is not going to handle it correctly.
>>
>> But I have no idea why this happens; it passes csum, thus I'm assuming
>> it's runtime corruption.
>
> We had some crashes in the past due to OOM; sometimes btrfs was
> involved. This was largely solved by disabling huge pages, updating
> from kernel 6.1 to 6.6, and running with a bees patch that reduces the
> memory used for ref lookups:
> https://github.com/Zygo/bees/issues/260
>
>> [...]
>>>
>>>> The other thing is: does the server have ECC memory?
>>>> It's not uncommon to see bitflips causing various problems (almost
>>>> monthly reports).
>>>
>>> I don't know the exact hosting environment; we are inside a QEMU VM.
>>> But I'm pretty sure it is ECC.
>>
>> And considering it's some virtualization environment, you do not have
>> any out-of-tree modules?
>
> No, the system is running Gentoo, the kernel is manually configured
> and runs without module support. Everything is baked in.
>
>>> The disk images are hosted on DRBD, with two redundant remote block
>>> devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
>>> from within the VM. Because the lower storage layer is redundant, we
>>> are not running a data raid profile in btrfs, but we use multiple block
>>> devices because we are seeing better latency behavior that way.
>>>
>>>> If the machine doesn't have ECC memory, then a memtest would be
>>>> preferable.
>>>
>>> I'll ask our data center operators about ECC but I'm pretty sure the
>>> answer will be: yes, it's ECC.
>>>
>>> We have been using their data centers for 20+ years and have never
>>> seen a bit flip or storage failure.
>>
>> Yeah, I do not think it's hardware corruption either now.
>
> Yes, what you found above looks really messed up - that's not a
> bitflip.
>
>>> I wonder if the parallel use of snapper (hourly, with thinning after
>>> 24h), bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in
>>> the hosting services)
>>
>> Snapshotting is done at a very specific point in time (at the end of a
>> transaction commit), thus it should not be related to balance
>> operations.
>>
>>> and btrfs-balance-least-used somehow triggered this.
>>> I remember some old reports where bees could trigger corruption
>>> during balance or scrub, and worked around that by pausing itself when
>>> it detected one. I don't know if this is still an issue (kernel 6.6.30
>>> LTS).
>>
>> No recent bugs come to mind immediately, but even if there were one,
>> this corruption looks too special.
>> It still matches the item size and ref count, but in the middle the
>> data got corrupted with seeming garbage.
>
> I think Zygo has some notes on it in the bees github:
> https://github.com/Zygo/bees/blob/master/docs/btrfs-kernel.md

Wow, this is the first time I've seen such a well-maintained matrix of
various problems.

>
> I think it was about btrfs-send and dedup at the same time... Memory
> fades faster as one gets older... ;-)

Nope, this is completely a different one.

>
>> As the higher bits of a u64 are stored at higher addresses in x86_64
>> memory, the corruption looks to cover the following bytes:
>>
>>  0                       8                       16
>>  | le64 root             | le64 objectid         |
>>  |09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
>>                        ====================
>>
>>  16                      24          28
>>  | le64 offset           | le32 refs |
>>  |00 90 da 04 00 00 00 00|01 00 00 00|
>>
>> So far the corruption looks like it starts at byte 7 and ends at
>> byte 14.
>>
>> In theory, if you had kept the image, we could spend enough time to
>> find out the correct values, but so far it really looks like some
>> garbage filling the range.
>>
>> For now what I can do is add extra checks (making sure the root number
>> looks valid), but that won't really catch all randomly corrupted data.
>
> This could be useful. Will this be in btrfs check, or in the kernel?

Kernel, and it would catch the error at write time, so that such
obvious corruption would not even reach the disks.
Although it would still make the fs RO, it will not cause any
corruption on-disk.

For btrfs-progs, it already detects such a mismatch, but by then it's
already too late, isn't it?

>
>> And as my final guess, did this VM experience any migration?
>
> No. It has been sitting there for 4 or 5 years. But we are slowly
> approaching the capacity of the hardware. What happens then: our data
> center operator would shut the VM down, rewire the DRBD to other
> hardware, and boot it again. No hot migration, if you meant that.

OK, then definitely something weird happened inside btrfs code.
But I'm out of any clue...

>
> I'm not sure how reliable DRBD is, but I researched it a little while
> ago and it seems to prefer reliability over speed, so it looks very
> solid. I don't think anything broke there, and even then, 8 bytes of
> garbage looks strange. Well, that's 64 bits of garbage - if that means
> anything.

Even if it's really some lower-level storage corruption, it has to pass
the btrfs metadata csum first.
You really need a super lucky random corruption that still matches the
csum.

Then you still need to pass the tree-checker, which doesn't sound
reasonable to me at all.

>
>> As that's the only non-btrfs thing I can think of that would corrupt
>> data at runtime.
>
> I'm using btrfs in various workloads, even with partially broken
> hardware (raid1 on spinning rust with broken sectors). And while it
> had its flaws like 10+ years ago, it has been rock solid for me during
> the last 5 years at least - even if I handled the hardware very
> impatiently (like hitting the reset button or cycling power). Of
> course, this system we're talking about has been handled seriously.
> ;-)

And we're also enhancing our handling of bad hardware (except when it
cheats on FLUSH); the biggest example is the tree-checker.

But I have really run out of ideas for such a huge corruption.
Your setup really rules out almost everything, from out-of-tree
modules (Nv*dia drivers) to a bad hot memory migration implementation.

I'll let you know when the new tree-checker patch comes out, and
hopefully it can catch the root cause.

Thanks,
Qu

>
> Thanks,
> Kai
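
To put a number on the "super lucky" part: btrfs metadata blocks are
checksummed with crc32c by default, so an arbitrary corruption at or
below the storage layer would still verify with a probability of only
about 2^-32. A minimal sketch of the argument (a bitwise reference
crc32c, not the kernel's accelerated helper, and using a plain
standalone seed/inversion convention, so the values won't match
btrfs's stored csums byte-for-byte):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			/* 0x82f63b78 is the reflected Castagnoli polynomial. */
			crc = (crc >> 1) ^ (0x82f63b78u & -(crc & 1u));
	}
	return ~crc;
}

int main(void)
{
	uint8_t block[4096] = { 0 };	/* stand-in for a metadata block */
	uint32_t good, bad;

	memcpy(block, "pretend extent tree leaf", 24);
	good = crc32c(0, block, sizeof(block));

	/* Clobber seven bytes mid-block, as in the corrupted backref. */
	memset(block + 2048, 0xa1, 7);
	bad = crc32c(0, block, sizeof(block));

	/* Any clobber that happens after the csum was computed is caught
	 * at read time; corruption that still verifies almost certainly
	 * happened in memory before the csum was computed at write time. */
	printf("good=0x%08x bad=0x%08x match=%s\n",
	       good, bad, good == bad ? "yes (!)" : "no");
	return 0;
}
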

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Kai Krakow @ 2024-07-16  6:51 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

Hello Qu!

On Mon, Jul 15, 2024 at 7:50 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2024/7/15 15:01, Kai Krakow wrote:
> > On Mon, Jul 15, 2024 at 7:00 AM Qu Wenruo <wqu@suse.com> wrote:
> [...]
> >> The last line is the problem.
> >>
> >> Firstly, we shouldn't even have a root with that high a value.
> >> Secondly, that root number 13835058055282163977 is a very weird hex
> >> value too (0xc000000000000109); the leading '0xc' nibble means it's
> >> definitely not a simple bitflip.
> >
> > Oh wow, I didn't even notice that. I skimmed through the logs to
> > resolve the root numbers to subvolids, hoping for an idea where the
> > "corrupted extents" are - but they were all over the place in
> > seemingly unrelated subvols.
> >
> > This host runs several systemd-nspawn containers with various
> > generations of PHP, MySQL containers (MySQL data itself is on a
> > different partition because btrfs and mysql don't play well), a huge
> > maildir storage, a huge web vhost storage, and some mail filter / mail
> > services containers. Just to give you an idea of what kind of data and
> > workload is involved...
>
> That shouldn't be a problem, as the only thing that can access btrfs
> metadata is btrfs itself.
>
> Although in theory, anything that can access the host memory can also
> access the kernel memory of the guest...
>
> >
> >> Furthermore, the objectid value is also very weird (0xffffa11315e3).
> >> No wonder the extent tree is not going to handle it correctly.
> >>
> >> But I have no idea why this happens; it passes csum, thus I'm assuming
> >> it's runtime corruption.
> >
> > We had some crashes in the past due to OOM; sometimes btrfs was
> > involved. This was largely solved by disabling huge pages, updating
> > from kernel 6.1 to 6.6, and running with a bees patch that reduces the
> > memory used for ref lookups:
> > https://github.com/Zygo/bees/issues/260
> >
> >> [...]
> >>>
> >>>> The other thing is: does the server have ECC memory?
> >>>> It's not uncommon to see bitflips causing various problems (almost
> >>>> monthly reports).
> >>>
> >>> I don't know the exact hosting environment; we are inside a QEMU VM.
> >>> But I'm pretty sure it is ECC.
> >>
> >> And considering it's some virtualization environment, you do not have
> >> any out-of-tree modules?
> >
> > No, the system is running Gentoo, the kernel is manually configured
> > and runs without module support. Everything is baked in.
> >
> >>> The disk images are hosted on DRBD, with two redundant remote block
> >>> devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
> >>> from within the VM. Because the lower storage layer is redundant, we
> >>> are not running a data raid profile in btrfs, but we use multiple block
> >>> devices because we are seeing better latency behavior that way.
> >>>
> >>>> If the machine doesn't have ECC memory, then a memtest would be
> >>>> preferable.
> >>>
> >>> I'll ask our data center operators about ECC but I'm pretty sure the
> >>> answer will be: yes, it's ECC.
> >>>
> >>> We have been using their data centers for 20+ years and have never
> >>> seen a bit flip or storage failure.
> >>
> >> Yeah, I do not think it's hardware corruption either now.
> >
> > Yes, what you found above looks really messed up - that's not a
> > bitflip.
> >
> >>> I wonder if the parallel use of snapper (hourly, with thinning after
> >>> 24h), bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in
> >>> the hosting services)
> >>
> >> Snapshotting is done at a very specific point in time (at the end of a
> >> transaction commit), thus it should not be related to balance
> >> operations.
> >>
> >>> and btrfs-balance-least-used somehow triggered this.
> >>> I remember some old reports where bees could trigger corruption
> >>> during balance or scrub, and worked around that by pausing itself when
> >>> it detected one. I don't know if this is still an issue (kernel 6.6.30
> >>> LTS).
> >>
> >> No recent bugs come to mind immediately, but even if there were one,
> >> this corruption looks too special.
> >> It still matches the item size and ref count, but in the middle the
> >> data got corrupted with seeming garbage.
> >
> > I think Zygo has some notes on it in the bees github:
> > https://github.com/Zygo/bees/blob/master/docs/btrfs-kernel.md
>
> Wow, this is the first time I've seen such a well-maintained matrix of
> various problems.
>
> >
> > I think it was about btrfs-send and dedup at the same time... Memory
> > fades faster as one gets older... ;-)
>
> Nope, this is completely a different one.
>
> >
> >> As the higher bits of a u64 are stored at higher addresses in x86_64
> >> memory, the corruption looks to cover the following bytes:
> >>
> >>  0                       8                       16
> >>  | le64 root             | le64 objectid         |
> >>  |09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
> >>                        ====================
> >>
> >>  16                      24          28
> >>  | le64 offset           | le32 refs |
> >>  |00 90 da 04 00 00 00 00|01 00 00 00|
> >>
> >> So far the corruption looks like it starts at byte 7 and ends at
> >> byte 14.
> >>
> >> In theory, if you had kept the image, we could spend enough time to
> >> find out the correct values, but so far it really looks like some
> >> garbage filling the range.
> >>
> >> For now what I can do is add extra checks (making sure the root number
> >> looks valid), but that won't really catch all randomly corrupted data.
> >
> > This could be useful. Will this be in btrfs check, or in the kernel?
>
> Kernel, and it would catch the error at write time, so that such
> obvious corruption would not even reach the disks.
> Although it would still make the fs RO, it will not cause any
> corruption on-disk.
>
> For btrfs-progs, it already detects such a mismatch, but by then it's
> already too late, isn't it?
>
> >
> >> And as my final guess, did this VM experience any migration?
> >
> > No. It has been sitting there for 4 or 5 years. But we are slowly
> > approaching the capacity of the hardware. What happens then: our data
> > center operator would shut the VM down, rewire the DRBD to other
> > hardware, and boot it again. No hot migration, if you meant that.
>
> OK, then definitely something weird happened inside btrfs code.
> But I'm out of any clue...
>
> >
> > I'm not sure how reliable DRBD is, but I researched it a little while
> > ago and it seems to prefer reliability over speed, so it looks very
> > solid. I don't think anything broke there, and even then, 8 bytes of
> > garbage looks strange. Well, that's 64 bits of garbage - if that means
> > anything.
>
> Even if it's really some lower-level storage corruption, it has to pass
> the btrfs metadata csum first.
> You really need a super lucky random corruption that still matches the
> csum.
>
> Then you still need to pass the tree-checker, which doesn't sound
> reasonable to me at all.
>
> >
> >> As that's the only non-btrfs thing I can think of that would corrupt
> >> data at runtime.
> >
> > I'm using btrfs in various workloads, even with partially broken
> > hardware (raid1 on spinning rust with broken sectors). And while it
> > had its flaws like 10+ years ago, it has been rock solid for me during
> > the last 5 years at least - even if I handled the hardware very
> > impatiently (like hitting the reset button or cycling power). Of
> > course, this system we're talking about has been handled seriously.
> > ;-)
>
> And we're also enhancing our handling of bad hardware (except when it
> cheats on FLUSH); the biggest example is the tree-checker.
>
> But I have really run out of ideas for such a huge corruption.
> Your setup really rules out almost everything, from out-of-tree
> modules (Nv*dia drivers) to a bad hot memory migration implementation.
>
> I'll let you know when the new tree-checker patch comes out, and
> hopefully it can catch the root cause.

I've got your patch from the linux-btrfs git (which you sent to the
list yesterday). It looks like it cannot be applied to 6.6 LTS. Is it
part of a larger patch series and will it be backported?

Is it safe to just cherry-pick all patches saying "tree-checker" to
the 6.6 tree?

Thanks,
Kai

* Re: btrfs crashes during routine btrfs-balance-least-used
From: Qu Wenruo @ 2024-07-16  9:09 UTC (permalink / raw)
To: Kai Krakow; +Cc: Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1210 bytes --]

On 2024/7/16 16:21, Kai Krakow wrote:
> Hello Qu!
>
> [...]
>>
>> And we're also enhancing our handling of bad hardware (except when it
>> cheats on FLUSH); the biggest example is the tree-checker.
>>
>> But I have really run out of ideas for such a huge corruption.
>> Your setup really rules out almost everything, from out-of-tree
>> modules (Nv*dia drivers) to a bad hot memory migration implementation.
>>
>> I'll let you know when the new tree-checker patch comes out, and
>> hopefully it can catch the root cause.
>
> I've got your patch from the linux-btrfs git (which you sent to the
> list yesterday). It looks like it cannot be applied to 6.6 LTS. Is it
> part of a larger patch series and will it be backported?
>
> Is it safe to just cherry-pick all patches saying "tree-checker" to
> the 6.6 tree?

The conflicts are caused by the missing commit 1645c283a87c ("btrfs:
tree-checker: add type and sequence check for inline backrefs") and a
cleanup patch.

So I manually backported those two patches, just give them a try.

Meanwhile the missing commit looks like a good candidate for the
stable 6.6 branch, I can definitely send it to stable later.

Thanks,
Qu

>
> Thanks,
> Kai

[-- Attachment #2: 0002-btrfs-tree-checker-validate-dref-root-and-objectid.patch --]
[-- Type: text/x-patch, Size: 5451 bytes --]

From dbd8f12a0f31489eed4e061116041376ef7ec627 Mon Sep 17 00:00:00 2001
Message-ID: <dbd8f12a0f31489eed4e061116041376ef7ec627.1721120832.git.wqu@suse.com>
In-Reply-To: <cover.1721120832.git.wqu@suse.com>
References: <cover.1721120832.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
Date: Mon, 15 Jul 2024 16:07:07 +0930
Subject: [PATCH 2/2] btrfs: tree-checker: validate dref root and objectid

Not yet upstreamed.

[CORRUPTION]
There is a bug report that btrfs flips RO due to a corruption in the
extent tree, the involved dumps looks like this:

  item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
    extent refs 3 gen 3678544 flags 1
    ref#0: extent data backref root 13835058055282163977
           objectid 281473384125923 offset 81432576 count 1
    ref#1: shared data backref parent 1947073626112 count 1
    ref#2: shared data backref parent 1156030103552 count 1
  BTRFS critical (device vdc1: state EA): unable to find ref byte nr
  402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
  BTRFS error (device vdc1: state EA): failed to run delayed ref for
  logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2

[CAUSE]
The corrupted entry is ref#0 of item 188.
The root number 13835058055282163977 is beyond the upper limit for root
items (the current limit is 1 << 48), and the objectid also looks
suspicious.

Only the offset and count is correct.

[ENHANCEMENT]
Although it's still unknown why we have such many bytes corrupted
randomly, we can still enhance the tree-checker for data backrefs by:

- Validate the root value
  For now there should only be 3 types of roots can have a data backref:
  * subvolume trees
  * data reloc trees
  * root tree
    Only for v1 space cache

- Validate the objectid value
  The objectid should be a valid inode number.

Hopefully we can catch such problem in the future with the new checkers.

Reported-by: Kai Krakow <hurikhan77@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/tree-checker.c | 47 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 5d6cfa618dc4..f14825f3d4e8 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1265,6 +1265,19 @@ static void extent_err(const struct extent_buffer *eb, int slot,
 	va_end(args);
 }
 
+static bool is_valid_dref_root(u64 rootid)
+{
+	/*
+	 * The following tree root objectids are allowed to have a data backref:
+	 * - subvolume trees
+	 * - data reloc tree
+	 * - tree root
+	 *   For v1 space cache
+	 */
+	return is_fstree(rootid) || rootid == BTRFS_DATA_RELOC_TREE_OBJECTID ||
+	       rootid == BTRFS_ROOT_TREE_OBJECTID;
+}
+
 static int check_extent_item(struct extent_buffer *leaf,
 			     struct btrfs_key *key, int slot,
 			     struct btrfs_key *prev_key)
@@ -1417,6 +1430,8 @@ static int check_extent_item(struct extent_buffer *leaf,
 		struct btrfs_extent_data_ref *dref;
 		struct btrfs_shared_data_ref *sref;
 		u64 seq;
+		u64 dref_root;
+		u64 dref_objectid;
 		u64 dref_offset;
 		u64 inline_offset;
 		u8 inline_type;
@@ -1460,11 +1475,26 @@ static int check_extent_item(struct extent_buffer *leaf,
 		 */
 		case BTRFS_EXTENT_DATA_REF_KEY:
 			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
+			dref_root = btrfs_extent_data_ref_root(leaf, dref);
+			dref_objectid = btrfs_extent_data_ref_objectid(leaf, dref);
 			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
 			seq = hash_extent_data_ref(
 					btrfs_extent_data_ref_root(leaf, dref),
 					btrfs_extent_data_ref_objectid(leaf, dref),
 					btrfs_extent_data_ref_offset(leaf, dref));
+			if (unlikely(!is_valid_dref_root(dref_root))) {
+				extent_err(leaf, slot,
+					   "invalid data ref root value %llu",
+					   dref_root);
+				return -EUCLEAN;
+			}
+			if (unlikely(dref_objectid < BTRFS_FIRST_FREE_OBJECTID ||
+				     dref_objectid > BTRFS_LAST_FREE_OBJECTID)) {
+				extent_err(leaf, slot,
+					   "invalid data ref objectid value %llu",
+					   dref_root);
+				return -EUCLEAN;
+			}
 			if (unlikely(!IS_ALIGNED(dref_offset,
 						 fs_info->sectorsize))) {
 				extent_err(leaf, slot,
@@ -1600,6 +1630,8 @@ static int check_extent_data_ref(struct extent_buffer *leaf,
 		return -EUCLEAN;
 	}
 	for (; ptr < end; ptr += sizeof(*dref)) {
+		u64 root;
+		u64 objectid;
 		u64 offset;
 
 		/*
@@ -1607,7 +1639,22 @@ static int check_extent_data_ref(struct extent_buffer *leaf,
 		 * overflow from the leaf due to hash collisions.
 		 */
 		dref = (struct btrfs_extent_data_ref *)ptr;
+		root = btrfs_extent_data_ref_root(leaf, dref);
+		objectid = btrfs_extent_data_ref_objectid(leaf, dref);
 		offset = btrfs_extent_data_ref_offset(leaf, dref);
+		if (unlikely(!is_valid_dref_root(root))) {
+			extent_err(leaf, slot,
+				   "invalid extent data backref root value %llu",
+				   root);
+			return -EUCLEAN;
+		}
+		if (unlikely(objectid < BTRFS_FIRST_FREE_OBJECTID ||
+			     objectid > BTRFS_LAST_FREE_OBJECTID)) {
+			extent_err(leaf, slot,
+				   "invalid extent data backref objectid value %llu",
+				   root);
+			return -EUCLEAN;
+		}
 		if (unlikely(!IS_ALIGNED(offset, leaf->fs_info->sectorsize))) {
 			extent_err(leaf, slot,
 		"invalid extent data backref offset, have %llu expect aligned to %u",
-- 
2.45.2

[-- Attachment #3: 0001-btrfs-tree-checker-add-type-and-sequence-check-for-i.patch --]
[-- Type: text/x-patch, Size: 5404 bytes --]

From 63189f5d922db2bc525f5251be46fe857e00a2d6 Mon Sep 17 00:00:00 2001
Message-ID: <63189f5d922db2bc525f5251be46fe857e00a2d6.1721120832.git.wqu@suse.com>
In-Reply-To: <cover.1721120832.git.wqu@suse.com>
References: <cover.1721120832.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
Date: Tue, 24 Oct 2023 12:41:11 +1030
Subject: [PATCH 1/2] btrfs: tree-checker: add type and sequence check for
 inline backrefs

commit 1645c283a87c61f84b2bffd81f50724df959b11a upstream.

[BUG]
There is a bug report that ntfs2btrfs had a bug that it can lead to
transaction abort and the filesystem flips to read-only.

[CAUSE]
For inline backref items, kernel has a strict requirement for their
ordering, they must follow the following rules:

- All btrfs_extent_inline_ref::type should be in an ascending order

- Within the same type, the items should follow a descending order by
  their sequence number

  For EXTENT_DATA_REF type, the sequence number is the result of
  hash_extent_data_ref().
  For other types, their sequence numbers are
  btrfs_extent_inline_ref::offset.

Thus if there is any code not following above rules, the resulted
inline backrefs can prevent the kernel from locating the needed inline
backref and lead to transaction abort.

[FIX]
Ntfs2btrfs has already fixed the problem, and btrfs-progs has added
the ability to detect such problems.

For kernel, let's be more noisy and be more specific about the order,
so that the next time kernel hits such problem we would reject it in
the first place, without leading to transaction abort.

Link: https://github.com/kdave/btrfs-progs/pull/622
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
[ Fix a conflict due to header cleanup. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/tree-checker.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index cc6bc5985120..5d6cfa618dc4 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -29,6 +29,7 @@
 #include "accessors.h"
 #include "file-item.h"
 #include "inode-item.h"
+#include "extent-tree.h"
 
 /*
  * Error message should follow the following format:
@@ -1274,6 +1275,8 @@ static int check_extent_item(struct extent_buffer *leaf,
 	unsigned long ptr;	/* Current pointer inside inline refs */
 	unsigned long end;	/* Extent item end */
 	const u32 item_size = btrfs_item_size(leaf, slot);
+	u8 last_type = 0;
+	u64 last_seq = U64_MAX;
 	u64 flags;
 	u64 generation;
 	u64 total_refs;		/* Total refs in btrfs_extent_item */
@@ -1320,6 +1323,18 @@ static int check_extent_item(struct extent_buffer *leaf,
 	 * 2.2) Ref type specific data
 	 *      Either using btrfs_extent_inline_ref::offset, or specific
 	 *      data structure.
+	 *
+	 * All above inline items should follow the order:
+	 *
+	 * - All btrfs_extent_inline_ref::type should be in an ascending
+	 *   order
+	 *
+	 * - Within the same type, the items should follow a descending
+	 *   order by their sequence number. The sequence number is
+	 *   determined by:
+	 *   * btrfs_extent_inline_ref::offset for all types other than
+	 *     EXTENT_DATA_REF
+	 *   * hash_extent_data_ref() for EXTENT_DATA_REF
 	 */
 	if (unlikely(item_size < sizeof(*ei))) {
 		extent_err(leaf, slot,
@@ -1401,6 +1416,7 @@ static int check_extent_item(struct extent_buffer *leaf,
 		struct btrfs_extent_inline_ref *iref;
 		struct btrfs_extent_data_ref *dref;
 		struct btrfs_shared_data_ref *sref;
+		u64 seq;
 		u64 dref_offset;
 		u64 inline_offset;
 		u8 inline_type;
@@ -1414,6 +1430,7 @@ static int check_extent_item(struct extent_buffer *leaf,
 		iref = (struct btrfs_extent_inline_ref *)ptr;
 		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
 		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
+		seq = inline_offset;
 		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
 			extent_err(leaf, slot,
 "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
@@ -1444,6 +1461,10 @@ static int check_extent_item(struct extent_buffer *leaf,
 		case BTRFS_EXTENT_DATA_REF_KEY:
 			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
 			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
+			seq = hash_extent_data_ref(
+					btrfs_extent_data_ref_root(leaf, dref),
+					btrfs_extent_data_ref_objectid(leaf, dref),
+					btrfs_extent_data_ref_offset(leaf, dref));
 			if (unlikely(!IS_ALIGNED(dref_offset,
 						 fs_info->sectorsize))) {
 				extent_err(leaf, slot,
@@ -1470,6 +1491,24 @@ static int check_extent_item(struct extent_buffer *leaf,
 				   inline_type);
 			return -EUCLEAN;
 		}
+		if (inline_type < last_type) {
+			extent_err(leaf, slot,
+				   "inline ref out-of-order: has type %u, prev type %u",
+				   inline_type, last_type);
+			return -EUCLEAN;
+		}
+		/* Type changed, allow the sequence starts from U64_MAX again. */
+		if (inline_type > last_type)
+			last_seq = U64_MAX;
+		if (seq > last_seq) {
+			extent_err(leaf, slot,
+"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
+				   inline_type, inline_offset, seq,
+				   last_type, last_seq);
+			return -EUCLEAN;
+		}
+		last_type = inline_type;
+		last_seq = seq;
 		ptr += btrfs_extent_inline_ref_size(inline_type);
 	}
 	/* No padding is allowed */
-- 
2.45.2
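
As a quick sanity check of what the second patch now enforces, here is
a hypothetical user-space restatement of is_valid_dref_root() applied
to the corrupted value from this thread. The constants are taken from
include/uapi/linux/btrfs_tree.h; the is_fstree() helper is simplified
to one older in-kernel variant and is an assumption, not a verbatim
copy of the 6.6 implementation:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define BTRFS_ROOT_TREE_OBJECTID	1ULL
#define BTRFS_FS_TREE_OBJECTID		5ULL
#define BTRFS_DATA_RELOC_TREE_OBJECTID	(-8ULL)	/* 0xfffffffffffffff8 */
#define BTRFS_FIRST_FREE_OBJECTID	256ULL

static bool is_fstree(uint64_t rootid)
{
	/* Simplified: subvolume ids start at 256; huge "roots" like the
	 * corrupted one are negative as s64 and fall out here. */
	return rootid == BTRFS_FS_TREE_OBJECTID ||
	       (int64_t)rootid >= (int64_t)BTRFS_FIRST_FREE_OBJECTID;
}

static bool is_valid_dref_root(uint64_t rootid)
{
	return is_fstree(rootid) || rootid == BTRFS_DATA_RELOC_TREE_OBJECTID ||
	       rootid == BTRFS_ROOT_TREE_OBJECTID;
}

int main(void)
{
	uint64_t bad_root = 13835058055282163977ULL; /* 0xc000000000000109 */
	uint64_t good_root = 265;		     /* what it should be */

	printf("bad root  valid: %d\n", is_valid_dref_root(bad_root));  /* 0 */
	printf("good root valid: %d\n", is_valid_dref_root(good_root)); /* 1 */
	return 0;
}

With this check in the write-time tree-checker, a backref like ref#0
above would flip the fs read-only before the garbage ever reached disk.
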
* Re: btrfs crashes during routine btrfs-balance-least-used
  2024-07-16  9:09 ` Qu Wenruo
@ 2024-07-16 13:25 ` Kai Krakow
  2024-07-16 22:18 ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Krakow @ 2024-07-16 13:25 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

Hello Qu!

On Tue, Jul 16, 2024 at 11:09 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2024/7/16 16:21, Kai Krakow wrote:
> > Hello Qu!
> >
> > [...]
> >>
> >> And we're also enhancing our handling of bad hardware (except when it
> >> cheats on FLUSH); the biggest example is the tree-checker.
> >>
> >> But I've really run out of ideas for such a huge corruption.
> >> Your setup really rules out almost everything, from out-of-tree
> >> modules (Nv*dia drivers) to a bad hot memory migration implementation.
> >>
> >> I'll let you know when the new tree-checker patch comes out, and
> >> hopefully it can catch the root cause.
> >
> > I've got your patch from linux-btrfs git (which you sent to the list
> > yesterday). It looks like it cannot be applied to 6.6 LTS. Is it part
> > of a larger patch series, and will it be backported?
> >
> > Is it safe to just cherry-pick all patches mentioning "tree-checker"
> > into the 6.6 tree?
>
> The conflicts are caused by the missing commit 1645c283a87c ("btrfs:
> tree-checker: add type and sequence check for inline backrefs") and a
> cleanup patch.
>
> So I manually backported those two patches; just give them a try.

Yes, they apply. Thanks. I will be testing this first on two different
machines before applying it to the server.

Also, I wanted to update you about the VM host environment; I got
replies from our datacenter operator:

* 2 x Intel Xeon E5-2680 v4, 2.4 GHz, 14 cores, 35 MB cache, 9.8 GT/s,
  2400 MHz, HT & TB
* 16 x 64 GB ECC reg. DDR4-2400 LRDIMM
  - this hardware exists twice, and VMs can be migrated between the hosts

The qemu host process of our VM has been up since May 26th, 2024. That
is when we shut the VM down to attach a third btrfs pool disk to the
system. No VM migration has been done since then, and there probably
hasn't been a migration for at least a year before that. Our operator
says it has probably never been migrated since it was taken into
production. So a VM migration before May 26th is unlikely but not
impossible, and there has been no migration since.

Some other observations: while the issue existed, I could mount the FS
fine in ro mode and copy data (I did a full borg backup; it only reads
changed files). But if I mounted rw, it would take anywhere from 2 to
30 minutes before the kernel complained again, even if the FS was
completely idle. I had cancelled the previous balance job (which failed
by crashing) before doing anything else with the FS.

Not sure if any of this matters for your analysis in hindsight, I just
wanted to update you.


> Meanwhile the missing commit looks like a good candidate for the
> stable 6.6 branch; I can definitely send it to stable later.
>
> Thanks,
> Qu

Thanks,
Kai
* Re: btrfs crashes during routine btrfs-balance-least-used
  2024-07-16 13:25 ` Kai Krakow
@ 2024-07-16 22:18 ` Qu Wenruo
  2024-07-17  8:09 ` Kai Krakow
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2024-07-16 22:18 UTC (permalink / raw)
To: Kai Krakow; +Cc: Qu Wenruo, linux-btrfs

On 2024/7/16 22:55, Kai Krakow wrote:
> Hello Qu!
>
> On Tue, Jul 16, 2024 at 11:09 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>
> Yes, they apply. Thanks. I will be testing this first on two different
> machines before applying it to the server.

I doubt you will hit anything like that again.

Even if there is some hidden cause of memory corruption, it may not hit
the extent tree again.

But at least enhanced sanity checks are always a good thing.

> Also, I wanted to update you about the VM host environment; I got
> replies from our datacenter operator:
>
> * 2 x Intel Xeon E5-2680 v4, 2.4 GHz, 14 cores, 35 MB cache, 9.8 GT/s,
>   2400 MHz, HT & TB
> * 16 x 64 GB ECC reg. DDR4-2400 LRDIMM
>   - this hardware exists twice, and VMs can be migrated between the hosts
>
> The qemu host process of our VM has been up since May 26th, 2024. That
> is when we shut the VM down to attach a third btrfs pool disk to the
> system. No VM migration has been done since then, and there probably
> hasn't been a migration for at least a year before that. Our operator
> says it has probably never been migrated since it was taken into
> production. So a VM migration before May 26th is unlikely but not
> impossible, and there has been no migration since.
>
> Some other observations: while the issue existed, I could mount the FS
> fine in ro mode and copy data (I did a full borg backup; it only reads
> changed files). But if I mounted rw, it would take anywhere from 2 to
> 30 minutes before the kernel complained again, even if the FS was
> completely idle. I had cancelled the previous balance job (which failed
> by crashing) before doing anything else with the FS.

The problem with that specific corruption (the cause remains mysterious)
is that it won't be detected until the offending data extent at
402811572224 gets some updates.

And if the updates on that extent don't touch the corrupted entry, it
will still be fine.

So the corruption can exist for quite some time before being triggered,
as happened recently.

At least the generation at which the data extent was created is not that
new: the latest generation is 3933860, while the generation of that
extent is 3678544. Considering the indirect ref should always be the
first to be created, I guess the corruption has been there for a while.

With the new tree-checker, your scrub routine should be able to catch
any similar existing problems now.

Thanks,
Qu

> Not sure if any of this matters for your analysis in hindsight, I just
> wanted to update you.
>
>
>> Meanwhile the missing commit looks like a good candidate for the
>> stable 6.6 branch; I can definitely send it to stable later.
>>
>> Thanks,
>> Qu
>
> Thanks,
> Kai
* Re: btrfs crashes during routine btrfs-balance-least-used
  2024-07-16 22:18 ` Qu Wenruo
@ 2024-07-17  8:09 ` Kai Krakow
  0 siblings, 0 replies; 11+ messages in thread
From: Kai Krakow @ 2024-07-17  8:09 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Wed, Jul 17, 2024 at 12:18 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2024/7/16 22:55, Kai Krakow wrote:
> > Hello Qu!
> >
> > On Tue, Jul 16, 2024 at 11:09 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
> >
> > Yes, they apply. Thanks. I will be testing this first on two different
> > machines before applying it to the server.
>
> I doubt you will hit anything like that again.

Yes, but it's not about that. It's to see whether there are any other
side effects before pushing it to production. I don't expect any side
effects, though.

> Even if there is some hidden cause of memory corruption, it may not hit
> the extent tree again.
>
> But at least enhanced sanity checks are always a good thing.
>
> > Also, I wanted to update you about the VM host environment; I got
> > replies from our datacenter operator:
> >
> > * 2 x Intel Xeon E5-2680 v4, 2.4 GHz, 14 cores, 35 MB cache, 9.8 GT/s,
> >   2400 MHz, HT & TB
> > * 16 x 64 GB ECC reg. DDR4-2400 LRDIMM
> >   - this hardware exists twice, and VMs can be migrated between the hosts
> >
> > The qemu host process of our VM has been up since May 26th, 2024. That
> > is when we shut the VM down to attach a third btrfs pool disk to the
> > system. No VM migration has been done since then, and there probably
> > hasn't been a migration for at least a year before that. Our operator
> > says it has probably never been migrated since it was taken into
> > production. So a VM migration before May 26th is unlikely but not
> > impossible, and there has been no migration since.
> >
> > Some other observations: while the issue existed, I could mount the FS
> > fine in ro mode and copy data (I did a full borg backup; it only reads
> > changed files). But if I mounted rw, it would take anywhere from 2 to
> > 30 minutes before the kernel complained again, even if the FS was
> > completely idle. I had cancelled the previous balance job (which failed
> > by crashing) before doing anything else with the FS.
>
> The problem with that specific corruption (the cause remains mysterious)
> is that it won't be detected until the offending data extent at
> 402811572224 gets some updates.
>
> And if the updates on that extent don't touch the corrupted entry, it
> will still be fine.
>
> So the corruption can exist for quite some time before being triggered,
> as happened recently.
>
> At least the generation at which the data extent was created is not that
> new: the latest generation is 3933860, while the generation of that
> extent is 3678544. Considering the indirect ref should always be the
> first to be created, I guess the corruption has been there for a while.
>
> With the new tree-checker, your scrub routine should be able to catch
> any similar existing problems now.
>
> Thanks,
> Qu
>
> > Not sure if any of this matters for your analysis in hindsight, I just
> > wanted to update you.
> >
> >
> >> Meanwhile the missing commit looks like a good candidate for the
> >> stable 6.6 branch; I can definitely send it to stable later.
> >>
> >> Thanks,
> >> Qu
> >
> > Thanks,
> > Kai
end of thread, other threads: [~2024-07-17  8:10 UTC | newest]

Thread overview: 11+ messages
2024-07-14 16:13 btrfs crashes during routine btrfs-balance-least-used Kai Krakow
2024-07-14 21:53 ` Qu Wenruo
2024-07-15  4:29 ` Kai Krakow
2024-07-15  5:00 ` Qu Wenruo
2024-07-15  5:31 ` Kai Krakow
2024-07-15  5:50 ` Qu Wenruo
2024-07-16  6:51 ` Kai Krakow
2024-07-16  9:09 ` Qu Wenruo
2024-07-16 13:25 ` Kai Krakow
2024-07-16 22:18 ` Qu Wenruo
2024-07-17  8:09 ` Kai Krakow