Linux Btrfs filesystem development
* btrfs crashes during routine btrfs-balance-least-used
@ 2024-07-14 16:13 Kai Krakow
  2024-07-14 21:53 ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Krakow @ 2024-07-14 16:13 UTC
  To: linux-btrfs, Oliver Wien

Hello btrfs list!

(also reported on IRC)

Our btrfs pool crashed during a routine btrfs-balance-least-used run.
Possibly of interest: bees is also running on this filesystem, and
snapper takes hourly snapshots with a retention policy.

I'm currently still collecting diagnostics, "btrfs check" log is
already 3 GB and growing.

The filesystem spans three devices, vd{c,e,f}1, with data=single meta=raid1.

Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):

[1143841.581968] BTRFS info (device vdc1): balance: start
-dvrange=402046058496..402046058497
[1143841.583434] BTRFS info (device vdc1): relocating block group
402046058496 flags data
[1143852.414459] BTRFS info (device vdc1): found 45428 extents, stage:
move data extents
[1143913.107511] ------------[ cut here ]------------
[1143913.107516] WARNING: CPU: 10 PID: 937 at
fs/btrfs/extent-tree.c:3092 __btrfs_free_extent+0x68e/0x1130
[1143913.107583] CPU: 10 PID: 937 Comm: btrfs-transacti Not tainted
6.6.30-gentoo #1
[1143913.107585] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[1143913.107587] RIP: 0010:__btrfs_free_extent+0x68e/0x1130
[1143913.107590] Code: 58 61 00 00 49 8b 7d 50 49 89 d8 48 89 e9 45 8b
4e 40 48 c7 c6 20 48 06 af 8b 94 24 98 00 00 00 e8 37 f3 0a 00 e9 95
fc ff ff <0f> 0b f0 48 0f ba a8 f8 09 00 00 02 41 b8 00 00 00 00 0f 83
33 03
[1143913.107592] RSP: 0018:ffffbdd081063c78 EFLAGS: 00010246
[1143913.107595] RAX: ffffa0e64547a000 RBX: 0000005dc970d000 RCX:
0000000000004000
[1143913.107597] RDX: 0000000000000011 RSI: ffffa0e682bee2da RDI:
ffffbdd081063c17
[1143913.107598] RBP: 0000000000000001 R08: 0000005dc970e000 R09:
00000000001000a8
[1143913.107599] R10: a80000005dc970e0 R11: 0000000000001000 R12:
0000000004da9000
[1143913.107600] R13: ffffa0e82ed28270 R14: ffffa0e8059e4700 R15:
ffffa0e8742c40e0
[1143913.107601] FS:  0000000000000000(0000) GS:ffffa0f833c80000(0000)
knlGS:0000000000000000
[1143913.107605] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1143913.107607] CR2: 000000005665c138 CR3: 00000001060fb000 CR4:
00000000001506e0
[1143913.107608] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1143913.107608] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1143913.107609] Call Trace:
[1143913.107612]  <TASK>
[1143913.107613]  ? __warn+0x62/0xc0
[1143913.107638]  ? __btrfs_free_extent+0x68e/0x1130
[1143913.107640]  ? report_bug+0x15e/0x1a0
[1143913.107661]  ? handle_bug+0x36/0x70
[1143913.107674]  ? exc_invalid_op+0x13/0x60
[1143913.107676]  ? asm_exc_invalid_op+0x16/0x20
[1143913.107687]  ? __btrfs_free_extent+0x68e/0x1130
[1143913.107689]  __btrfs_run_delayed_refs+0x274/0xfc0
[1143913.107691]  btrfs_run_delayed_refs+0x50/0x1f0
[1143913.107692]  btrfs_commit_transaction+0x65/0xd40
[1143913.107696]  ? start_transaction+0xcb/0x570
[1143913.107698]  transaction_kthread+0x150/0x1b0
[1143913.107701]  ? close_ctree+0x420/0x420
[1143913.107703]  kthread+0xc4/0xf0
[1143913.107715]  ? kthread_complete_and_exit+0x20/0x20
[1143913.107718]  ret_from_fork+0x28/0x40
[1143913.107729]  ? kthread_complete_and_exit+0x20/0x20
[1143913.107732]  ret_from_fork_asm+0x11/0x20
[1143913.107734]  </TASK>
[1143913.107735] ---[ end trace 0000000000000000 ]---
[1143913.107737] ------------[ cut here ]------------
[1143913.107737] BTRFS: Transaction aborted (error -117)
[1143913.107749] WARNING: CPU: 10 PID: 937 at
fs/btrfs/extent-tree.c:3093 __btrfs_free_extent+0xed9/0x1130
[1143913.107752] CPU: 10 PID: 937 Comm: btrfs-transacti Tainted: G
   W          6.6.30-gentoo #1
[1143913.107754] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[1143913.107754] RIP: 0010:__btrfs_free_extent+0xed9/0x1130
[1143913.107756] Code: ff be 8b ff ff ff 48 c7 c7 80 3f 06 af e8 8f 5b
c2 ff 0f 0b e9 04 fb ff ff be 8b ff ff ff 48 c7 c7 80 3f 06 af e8 77
5b c2 ff <0f> 0b e9 20 fb ff ff 8b 5c 24 28 89 df e8 95 2f ff ff 84 c0
0f 85
[1143913.107757] RSP: 0018:ffffbdd081063c78 EFLAGS: 00010296
[1143913.107759] RAX: 0000000000000027 RBX: 0000005dc970d000 RCX:
0000000000000027
[1143913.107760] RDX: ffffa0f833c9b448 RSI: 0000000000000001 RDI:
ffffa0f833c9b440
[1143913.107761] RBP: 0000000000000001 R08: 0000000000000001 R09:
00000000ffffdfff
[1143913.107762] R10: 0000000000000000 R11: 0000000000000003 R12:
0000000004da9000
[1143913.107762] R13: ffffa0e82ed28270 R14: ffffa0e8059e4700 R15:
ffffa0e8742c40e0
[1143913.107763] FS:  0000000000000000(0000) GS:ffffa0f833c80000(0000)
knlGS:0000000000000000
[1143913.107766] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1143913.107767] CR2: 000000005665c138 CR3: 00000001060fb000 CR4:
00000000001506e0
[1143913.107768] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1143913.107769] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1143913.107770] Call Trace:
[1143913.107771]  <TASK>
[1143913.107771]  ? __warn+0x62/0xc0
[1143913.107774]  ? __btrfs_free_extent+0xed9/0x1130
[1143913.107775]  ? report_bug+0x15e/0x1a0
[1143913.107778]  ? handle_bug+0x36/0x70
[1143913.107780]  ? exc_invalid_op+0x13/0x60
[1143913.107783]  ? asm_exc_invalid_op+0x16/0x20
[1143913.107785]  ? __btrfs_free_extent+0xed9/0x1130
[1143913.107786]  ? __btrfs_free_extent+0xed9/0x1130
[1143913.107788]  __btrfs_run_delayed_refs+0x274/0xfc0
[1143913.107789]  btrfs_run_delayed_refs+0x50/0x1f0
[1143913.107791]  btrfs_commit_transaction+0x65/0xd40
[1143913.107794]  ? start_transaction+0xcb/0x570
[1143913.107797]  transaction_kthread+0x150/0x1b0
[1143913.107804]  ? close_ctree+0x420/0x420
[1143913.107806]  kthread+0xc4/0xf0
[1143913.107809]  ? kthread_complete_and_exit+0x20/0x20
[1143913.107812]  ret_from_fork+0x28/0x40
[1143913.107814]  ? kthread_complete_and_exit+0x20/0x20
[1143913.107817]  ret_from_fork_asm+0x11/0x20
[1143913.107818]  </TASK>
[1143913.107819] ---[ end trace 0000000000000000 ]---
[1143913.107820] BTRFS: error (device vdc1: state A) in
__btrfs_free_extent:3093: errno=-117 Filesystem corrupted
[1143913.107823] BTRFS info (device vdc1: state EA): forced readonly
[1143913.107829] BTRFS info (device vdc1: state EA): leaf
1581679099904 gen 3933860 total ptrs 260 free space 5156 owner 2
[1143913.107831] item 0 key (402811465728 178 5935621899263475458)
itemoff 16255 itemsize 28
[1143913.107834] extent data backref root 340 objectid 338204534
offset 0 count 1
[1143913.107835] item 1 key (402811465728 178 5935621899272047437)
itemoff 16227 itemsize 28
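
For anyone reading the excerpt above, here is a minimal sketch that turns
the numeric codes into names. The constants are copied from the Linux
kernel headers (errno 117 is EUCLEAN, which btrfs reports as "Filesystem
corrupted"; key type 178 is BTRFS_EXTENT_DATA_REF_KEY). The function names
decode_key and decode_errno are hypothetical helpers, not part of any
btrfs tool.

```python
import re

# Values taken from the Linux kernel headers; assumed to apply to 6.6.
KEY_TYPES = {
    168: "EXTENT_ITEM",
    169: "METADATA_ITEM",
    176: "TREE_BLOCK_REF",
    178: "EXTENT_DATA_REF",
    182: "SHARED_BLOCK_REF",
    184: "SHARED_DATA_REF",
}
ERRNO_NAMES = {117: "EUCLEAN"}

KEY_RE = re.compile(r"key \((\d+) (\d+) (\d+)\)")
ERRNO_RE = re.compile(r"errno=-(\d+)")


def decode_key(line):
    """Return the (objectid, type, offset) key found in a leaf dump line."""
    match = KEY_RE.search(line)
    if match is None:
        return None
    objectid, key_type, offset = (int(g) for g in match.groups())
    return {
        "objectid": objectid,
        # Fall back to the raw number for key types not listed above.
        "type": KEY_TYPES.get(key_type, str(key_type)),
        "offset": offset,
    }


def decode_errno(line):
    """Return the symbolic name for an 'errno=-N' field, if known."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    return ERRNO_NAMES.get(number, str(number))


if __name__ == "__main__":
    print(decode_key("item 0 key (402811465728 178 5935621899263475458)"))
    print(decode_errno("__btrfs_free_extent:3093: errno=-117 Filesystem corrupted"))
```

Both dumped items therefore appear to be data backrefs stored in the
extent tree, which matches the "extent data backref" line printed
between them.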

"btrfs check" can only run in lowmem mode; in the default mode it crashes
with "out of memory" (the system has 74G of RAM). Here's the beginning of
the log:

[1/7] checking root items
[2/7] checking extents
ERROR: shared extent 15929577472 referencer lost (parent: 1147747794944)
ERROR: shared extent 15929577472 referencer lost (parent: 1148095201280)
ERROR: shared extent 15929577472 referencer lost (parent: 1175758274560)
(followed by thousands of similar lines)
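
Because the log is several GB, a summary is easier to share than the raw
file. The following is a minimal sketch that streams the log and counts
"referencer lost" errors per extent. It assumes every such line has
exactly the format shown above; the function name summarize and the log
path passed on the command line are hypothetical.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
# ERROR: shared extent 15929577472 referencer lost (parent: 1147747794944)
LOST_RE = re.compile(
    r"ERROR: shared extent (\d+) referencer lost \(parent: (\d+)\)"
)


def summarize(lines):
    """Count 'referencer lost' errors per extent bytenr."""
    per_extent = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            per_extent[int(match.group(1))] += 1
    return per_extent


if __name__ == "__main__":
    # Iterating over the file object keeps memory use flat for large logs.
    with open(sys.argv[1], errors="replace") as log:
        counts = summarize(log)
    for bytenr, count in counts.most_common(20):
        print(bytenr, count)
```

Run on the full log, this prints the 20 extents with the most lost
referencers, which may help show whether the damage is concentrated in a
few extents or spread widely.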

Latest gist: https://gist.tnonline.net/Z4 (the log has meanwhile grown to
over 3 GB; I can upload it somewhere later).

We have backups (daily backups stored inside borg on a remote host).

Is there anything we can do? Restoring from backup will probably take
more than 24h (3 TB). The system runs web and mail hosts for more than
100 customers.

We have not yet tried to run "btrfs check --repair" or
"--init-extent-tree". I'd rather try a quick repair before restoring.
On the other hand, I don't want to make things worse and waste time trying.

Unfortunately, the filesystem was mounted rw again after it had been
unmounted following the incident. This restarted the balance, and it
seems to have changed the first error that "btrfs check" finds. I'll try
"ro,skip_balance" once "btrfs check" has finished. I think the filesystem
is still fully readable and we can take one last backup.

I'm also happy to provide the collected logs if a developer wants to look
into this.


Thanks in advance
Kai


Thread overview: 11+ messages
2024-07-14 16:13 btrfs crashes during routine btrfs-balance-least-used Kai Krakow
2024-07-14 21:53 ` Qu Wenruo
2024-07-15  4:29   ` Kai Krakow
2024-07-15  5:00     ` Qu Wenruo
2024-07-15  5:31       ` Kai Krakow
2024-07-15  5:50         ` Qu Wenruo
2024-07-16  6:51           ` Kai Krakow
2024-07-16  9:09             ` Qu Wenruo
2024-07-16 13:25               ` Kai Krakow
2024-07-16 22:18                 ` Qu Wenruo
2024-07-17  8:09                   ` Kai Krakow
