* (no subject)
@ 2025-06-22 17:31 Kai Krakow
2025-06-22 17:33 ` btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110 Kai Krakow
0 siblings, 1 reply; 6+ messages in thread
From: Kai Krakow @ 2025-06-22 17:31 UTC (permalink / raw)
To: linux-btrfs
Hello btrfs list!
Once in a while, we are seeing the following kernel bug during the
night while the backup is putting some additional load on the system:
Jun 22 04:11:29 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun 22 04:11:29 vch01 kernel: rcu: 10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052
Jun 22 04:11:29 vch01 kernel: rcu: (t=2100 jiffies g=358306033 q=2241752 ncpus=16)
Jun 22 04:11:29 vch01 kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1
Jun 22 04:11:29 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Jun 22 04:11:29 vch01 kernel: RIP: 0010:btrfs_get_64+0x65/0x110
Jun 22 04:11:29 vch01 kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48
Jun 22 04:11:29 vch01 kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
Jun 22 04:11:29 vch01 kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200
Jun 22 04:11:29 vch01 kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20
Jun 22 04:11:29 vch01 kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015
Jun 22 04:11:29 vch01 kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a
Jun 22 04:11:29 vch01 kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800
Jun 22 04:11:29 vch01 kernel: FS: 000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000
Jun 22 04:11:29 vch01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 04:11:29 vch01 kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0
Jun 22 04:11:29 vch01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 04:11:29 vch01 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 22 04:11:29 vch01 kernel: Call Trace:
Jun 22 04:11:29 vch01 kernel: <IRQ>
Jun 22 04:11:29 vch01 kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
Jun 22 04:11:29 vch01 kernel: ? rcu_sched_clock_irq+0x4ff/0x920
Jun 22 04:11:29 vch01 kernel: ? update_process_times+0x6c/0xa0
Jun 22 04:11:29 vch01 kernel: ? tick_nohz_handler+0x82/0x110
Jun 22 04:11:29 vch01 kernel: ? tick_do_update_jiffies64+0xd0/0xd0
Jun 22 04:11:29 vch01 kernel: ? __hrtimer_run_queues+0x10b/0x190
Jun 22 04:11:29 vch01 kernel: ? hrtimer_interrupt+0xf1/0x200
Jun 22 04:11:29 vch01 kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
Jun 22 04:11:29 vch01 kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
Jun 22 04:11:29 vch01 kernel: </IRQ>
Jun 22 04:11:29 vch01 kernel: <TASK>
Jun 22 04:11:29 vch01 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Jun 22 04:11:29 vch01 kernel: ? btrfs_get_64+0x65/0x110
Jun 22 04:11:29 vch01 kernel: find_parent_nodes+0x1b84/0x1dc0
Jun 22 04:11:29 vch01 kernel: btrfs_find_all_leafs+0x31/0xd0
Jun 22 04:11:29 vch01 kernel: ? queued_write_lock_slowpath+0x30/0x70
Jun 22 04:11:29 vch01 kernel: iterate_extent_inodes+0x6f/0x370
Jun 22 04:11:29 vch01 kernel: ? update_share_count+0x60/0x60
Jun 22 04:11:29 vch01 kernel: ? extent_from_logical+0x139/0x190
Jun 22 04:11:29 vch01 kernel: ? release_extent_buffer+0x96/0xb0
Jun 22 04:11:29 vch01 kernel: iterate_inodes_from_logical+0xaa/0xd0
Jun 22 04:11:29 vch01 kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
Jun 22 04:11:29 vch01 kernel: __x64_sys_ioctl+0x84/0xc0
Jun 22 04:11:29 vch01 kernel: do_syscall_64+0x47/0x100
Jun 22 04:11:29 vch01 kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
Jun 22 04:11:29 vch01 kernel: RIP: 0033:0x55d07617eaaf
Jun 22 04:11:29 vch01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jun 22 04:11:29 vch01 kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 22 04:11:29 vch01 kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf
Jun 22 04:11:29 vch01 kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003
Jun 22 04:11:29 vch01 kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000
Jun 22 04:11:29 vch01 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
Jun 22 04:11:29 vch01 kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0
Jun 22 04:11:29 vch01 kernel: </TASK>
Some more information about the environment and incident observations:
The kernel is compiled without module support, everything baked in, no
proprietary modules.
The file system has previously been recreated by restoring from backup,
but we still see this problem once in a while, so I suspect there is no
file system inconsistency involved.
The rootfs itself is on xfs, so I can still log in. But the system is
very slow: it takes 5-10 minutes to log in via SSH, and commands like
"reboot" time out because the system dbus is congested.
This bug cannot be triggered easily. So far it has only happened at
night: the system needs an uptime of multiple days or weeks, some
amount of swap has to be in use, and the backup (borg backup) has to
run. So I think this happens because we get some memory pressure while
memory fragmentation has also been building up for some time.
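In case it helps to confirm that, here is a minimal sketch of what I could
start logging periodically until the next incident (the log path is an
assumption on my side, we don't collect this yet):

# Append a timestamped snapshot of memory and fragmentation state.
# /proc/buddyinfo lists free pages per order; vanishing high-order
# entries are a hint that memory is fragmenting.
{
  date
  grep -E 'MemFree|MemAvailable|SwapFree' /proc/meminfo
  cat /proc/buddyinfo
  echo
} >> /var/log/memfrag.log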
The system is running bees on the btrfs pool because, naturally, this
web and mail hosting system has a lot of duplicate files. MySQL is
running on xfs only. Temporary files during backup are created on xfs
only (the backup can access the btrfs read-only, no writes allowed).
Snapper takes snapshots every hour and cleans out older snapshots over
time.
I've now upgraded to the current Gentoo stable kernel release 6.12.31.
btrfs is running with single data / raid1 metadata on three disks
provided to the VM by KVM via virtio. The KVM images themselves are
replicated over multiple servers by drbd (outside of the VM). The
hardware has no known issues (no memory errors, no storage errors).
Scrubbing finds no checksum or other errors. Neither the VM nor the
storage was migrated at the time of the incident or recently before it.
# uname -a
Linux vch01 6.12.21-gentoo #2 SMP Thu May 15 18:02:52 CEST 2025 x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge) GenuineIntel GNU/Linux
# free -m
               total        used        free      shared  buff/cache   available
Mem:           84476       33122        1941        3344       48183       51353
Swap:          15358        5362        9996
# cat /proc/cpuinfo | egrep 'processor|model|cpu|cache' | sort -u
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
l1tf mds swapgs itlb_multihit srbds mmio_unknown bhi
cache_alignment : 64
cache size : 4096 KB
cpu cores : 1
cpu family : 6
cpuid level : 13
cpu MHz : 2399.998
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
rdrand hypervisor lahf_lm intel_ppin fsgsbase smep erms xsaveopt
model : 58
model name : Intel Xeon E3-12xx v2 (Ivy Bridge)
processor : 0
processor : 1
processor : 10
processor : 11
processor : 12
processor : 13
processor : 14
processor : 15
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
Thanks for looking into it. If you need more information, I may be able
to provide it as long as I don't have to pull it from the machine in its
stalled state: it has been rebooted since the incident.
Regards,
Kai
* btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110
2025-06-22 17:31 Kai Krakow
@ 2025-06-22 17:33 ` Kai Krakow
2025-06-23 8:45 ` Kai Krakow
0 siblings, 1 reply; 6+ messages in thread
From: Kai Krakow @ 2025-06-22 17:33 UTC (permalink / raw)
To: linux-btrfs
Hello btrfs list!
Added a missing subject... Sorry for the extra noise.
On Sun, Jun 22, 2025 at 7:31 PM Kai Krakow <hurikhan77@gmail.com> wrote:
>
> Hello btrfs list!
>
> Once in a while, we are seeing the following kernel bug during the
> night while the backup is putting some additional load on the system:
>
> Jun 22 04:11:29 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> Jun 22 04:11:29 vch01 kernel: rcu: 10-....: (2100 ticks this
> GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187
> fqs=1052
> Jun 22 04:11:29 vch01 kernel: rcu: (t=2100 jiffies g=358306033
> q=2241752 ncpus=16)
> Jun 22 04:11:29 vch01 kernel: CPU: 10 UID: 0 PID: 1524681 Comm:
> map_0x178e45670 Not tainted 6.12.21-gentoo #1
> Jun 22 04:11:29 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> Jun 22 04:11:29 vch01 kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> Jun 22 04:11:29 vch01 kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6
> 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c
> ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d
> 49 08 44 8b 75 0c 48
> Jun 22 04:11:29 vch01 kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> Jun 22 04:11:29 vch01 kernel: RAX: 0000000000001f15 RBX:
> fffff437ea382200 RCX: fffff437cb891200
> Jun 22 04:11:29 vch01 kernel: RDX: 000001922b68df2a RSI:
> 0000000000000000 RDI: ffffa434c3e66d20
> Jun 22 04:11:29 vch01 kernel: RBP: ffffa434c3e66d20 R08:
> 000001922b68c000 R09: 0000000000000015
> Jun 22 04:11:29 vch01 kernel: R10: 6c0000000000000a R11:
> 0000000009fe7000 R12: 0000000000000f2a
> Jun 22 04:11:29 vch01 kernel: R13: 0000000000000001 R14:
> ffffa43192e6d230 R15: ffffa43160c4c800
> Jun 22 04:11:29 vch01 kernel: FS: 000055d07085e6c0(0000)
> GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> Jun 22 04:11:29 vch01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jun 22 04:11:29 vch01 kernel: CR2: 00007fff204ecfc0 CR3:
> 0000000121a0b000 CR4: 00000000001506f0
> Jun 22 04:11:29 vch01 kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> Jun 22 04:11:29 vch01 kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400
> Jun 22 04:11:29 vch01 kernel: Call Trace:
> Jun 22 04:11:29 vch01 kernel: <IRQ>
> Jun 22 04:11:29 vch01 kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> Jun 22 04:11:29 vch01 kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> Jun 22 04:11:29 vch01 kernel: ? update_process_times+0x6c/0xa0
> Jun 22 04:11:29 vch01 kernel: ? tick_nohz_handler+0x82/0x110
> Jun 22 04:11:29 vch01 kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> Jun 22 04:11:29 vch01 kernel: ? __hrtimer_run_queues+0x10b/0x190
> Jun 22 04:11:29 vch01 kernel: ? hrtimer_interrupt+0xf1/0x200
> Jun 22 04:11:29 vch01 kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> Jun 22 04:11:29 vch01 kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> Jun 22 04:11:29 vch01 kernel: </IRQ>
> Jun 22 04:11:29 vch01 kernel: <TASK>
> Jun 22 04:11:29 vch01 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> Jun 22 04:11:29 vch01 kernel: ? btrfs_get_64+0x65/0x110
> Jun 22 04:11:29 vch01 kernel: find_parent_nodes+0x1b84/0x1dc0
> Jun 22 04:11:29 vch01 kernel: btrfs_find_all_leafs+0x31/0xd0
> Jun 22 04:11:29 vch01 kernel: ? queued_write_lock_slowpath+0x30/0x70
> Jun 22 04:11:29 vch01 kernel: iterate_extent_inodes+0x6f/0x370
> Jun 22 04:11:29 vch01 kernel: ? update_share_count+0x60/0x60
> Jun 22 04:11:29 vch01 kernel: ? extent_from_logical+0x139/0x190
> Jun 22 04:11:29 vch01 kernel: ? release_extent_buffer+0x96/0xb0
> Jun 22 04:11:29 vch01 kernel: iterate_inodes_from_logical+0xaa/0xd0
> Jun 22 04:11:29 vch01 kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> Jun 22 04:11:29 vch01 kernel: __x64_sys_ioctl+0x84/0xc0
> Jun 22 04:11:29 vch01 kernel: do_syscall_64+0x47/0x100
> Jun 22 04:11:29 vch01 kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> Jun 22 04:11:29 vch01 kernel: RIP: 0033:0x55d07617eaaf
> Jun 22 04:11:29 vch01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44
> 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24
> 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64
> 48 2b 04 25 28 00 00
> Jun 22 04:11:29 vch01 kernel: RSP: 002b:000055d07085bc20 EFLAGS:
> 00000246 ORIG_RAX: 0000000000000010
> Jun 22 04:11:29 vch01 kernel: RAX: ffffffffffffffda RBX:
> 000055d0402f8550 RCX: 000055d07617eaaf
> Jun 22 04:11:29 vch01 kernel: RDX: 000055d07085bca0 RSI:
> 00000000c038943b RDI: 0000000000000003
> Jun 22 04:11:29 vch01 kernel: RBP: 000055d07085bea0 R08:
> 00007fee46c84080 R09: 0000000000000000
> Jun 22 04:11:29 vch01 kernel: R10: 0000000000000000 R11:
> 0000000000000246 R12: 0000000000000003
> Jun 22 04:11:29 vch01 kernel: R13: 000055d07085bf80 R14:
> 000055d07085bf48 R15: 000055d07085c0b0
> Jun 22 04:11:29 vch01 kernel: </TASK>
>
> Some more information about the environment and incident observations:
>
> The kernel is compiled without module support, everything baked in, no
> proprietary modules.
>
> The file system has previously been recreated by restoring from backup
> but we still see this problem once in a while, so I suspect there's no
> file system inconsistency involved.
>
> The rootfs itself is on xfs, so I can still login. But the system is
> very slow, takes 5-10 minutes to log in via SSH, commands like
> "reboot" time out because the system dbus is congested.
>
> This bug cannot be easily triggered. So far, it only happened at
> night, the system needs uptime of multiple days or weeks, with some
> amount of swap used, and the backup (borg backup) has to run. So I
> think this happens because we get some memory pressure while we also
> have memory fragmentation going on for some time.
>
> The system is running bees on the btrfs pool because naturally this
> web and mail hosting system has a lot of duplicate files. Mysql is
> running on xfs only. Temporary files during backup are created on xfs
> only (the backup can access the btrfs only via r/o, no writes
> allowed). Snapper takes snapshots every hour and cleans out older
> snapshots over time.
>
> I've now upgraded to the current Gentoo stable release 6.21.31 of the kernel.
>
> btrfs is running in single data / raid1 meta on three disks provided
> by KVM via virtio. The KVM images itself are served by drbd redundant
> over multiple servers (outside of the VM). The hardware itself has no
> known hardware issues (no memory errors, no storage errors). Scrubbing
> finds no checksum or other errors. The VM or storage hasn't been
> migrated at time of the incident or recently.
>
> # uname -a
> Linux vch01 6.12.21-gentoo #2 SMP Thu May 15 18:02:52 CEST 2025 x86_64
> Intel Xeon E3-12xx v2 (Ivy Bridge) GenuineIntel GNU/Linux
>
> # free -m
> total used free shared buff/cache available
> Mem: 84476 33122 1941 3344 48183 51353
> Swap: 15358 5362 9996
>
> # cat /proc/cpuinfo | egrep 'processor|model|cpu|cache' | sort -u
> bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> l1tf mds swapgs itlb_multihit srbds mmio_unknown bhi
> cache_alignment : 64
> cache size : 4096 KB
> cpu cores : 1
> cpu family : 6
> cpuid level : 13
> cpu MHz : 2399.998
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand hypervisor lahf_lm intel_ppin fsgsbase smep erms xsaveopt
> model : 58
> model name : Intel Xeon E3-12xx v2 (Ivy Bridge)
> processor : 0
> processor : 1
> processor : 10
> processor : 11
> processor : 12
> processor : 13
> processor : 14
> processor : 15
> processor : 2
> processor : 3
> processor : 4
> processor : 5
> processor : 6
> processor : 7
> processor : 8
> processor : 9
>
> Thanks for looking into it. If you need more information, I may be
> able to provide it as long as I don't have to get it from the still
> running machine: it has been rebooted since the incident.
>
> Regards,
> Kai
* Re: btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110
2025-06-22 17:33 ` btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110 Kai Krakow
@ 2025-06-23 8:45 ` Kai Krakow
2025-06-23 10:34 ` Filipe Manana
0 siblings, 1 reply; 6+ messages in thread
From: Kai Krakow @ 2025-06-23 8:45 UTC (permalink / raw)
To: linux-btrfs
Hello btrfs list!
Some more context:
> > Hello btrfs list!
> >
> > Once in a while, we are seeing the following kernel bug during the
> > night while the backup is putting some additional load on the system:
> >
> > Jun 22 04:11:29 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> > Jun 22 04:11:29 vch01 kernel: rcu: 10-....: (2100 ticks this
> > GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187
> > fqs=1052
> > Jun 22 04:11:29 vch01 kernel: rcu: (t=2100 jiffies g=358306033
> > q=2241752 ncpus=16)
> > Jun 22 04:11:29 vch01 kernel: CPU: 10 UID: 0 PID: 1524681 Comm:
> > map_0x178e45670 Not tainted 6.12.21-gentoo #1
> > Jun 22 04:11:29 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> > Jun 22 04:11:29 vch01 kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> > Jun 22 04:11:29 vch01 kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6
> > 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c
> > ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d
> > 49 08 44 8b 75 0c 48
> > Jun 22 04:11:29 vch01 kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> > Jun 22 04:11:29 vch01 kernel: RAX: 0000000000001f15 RBX:
> > fffff437ea382200 RCX: fffff437cb891200
> > Jun 22 04:11:29 vch01 kernel: RDX: 000001922b68df2a RSI:
> > 0000000000000000 RDI: ffffa434c3e66d20
> > Jun 22 04:11:29 vch01 kernel: RBP: ffffa434c3e66d20 R08:
> > 000001922b68c000 R09: 0000000000000015
> > Jun 22 04:11:29 vch01 kernel: R10: 6c0000000000000a R11:
> > 0000000009fe7000 R12: 0000000000000f2a
> > Jun 22 04:11:29 vch01 kernel: R13: 0000000000000001 R14:
> > ffffa43192e6d230 R15: ffffa43160c4c800
> > Jun 22 04:11:29 vch01 kernel: FS: 000055d07085e6c0(0000)
> > GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> > Jun 22 04:11:29 vch01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jun 22 04:11:29 vch01 kernel: CR2: 00007fff204ecfc0 CR3:
> > 0000000121a0b000 CR4: 00000000001506f0
> > Jun 22 04:11:29 vch01 kernel: DR0: 0000000000000000 DR1:
> > 0000000000000000 DR2: 0000000000000000
> > Jun 22 04:11:29 vch01 kernel: DR3: 0000000000000000 DR6:
> > 00000000fffe0ff0 DR7: 0000000000000400
> > Jun 22 04:11:29 vch01 kernel: Call Trace:
> > Jun 22 04:11:29 vch01 kernel: <IRQ>
> > Jun 22 04:11:29 vch01 kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> > Jun 22 04:11:29 vch01 kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> > Jun 22 04:11:29 vch01 kernel: ? update_process_times+0x6c/0xa0
> > Jun 22 04:11:29 vch01 kernel: ? tick_nohz_handler+0x82/0x110
> > Jun 22 04:11:29 vch01 kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> > Jun 22 04:11:29 vch01 kernel: ? __hrtimer_run_queues+0x10b/0x190
> > Jun 22 04:11:29 vch01 kernel: ? hrtimer_interrupt+0xf1/0x200
> > Jun 22 04:11:29 vch01 kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> > Jun 22 04:11:29 vch01 kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> > Jun 22 04:11:29 vch01 kernel: </IRQ>
> > Jun 22 04:11:29 vch01 kernel: <TASK>
> > Jun 22 04:11:29 vch01 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > Jun 22 04:11:29 vch01 kernel: ? btrfs_get_64+0x65/0x110
> > Jun 22 04:11:29 vch01 kernel: find_parent_nodes+0x1b84/0x1dc0
> > Jun 22 04:11:29 vch01 kernel: btrfs_find_all_leafs+0x31/0xd0
> > Jun 22 04:11:29 vch01 kernel: ? queued_write_lock_slowpath+0x30/0x70
> > Jun 22 04:11:29 vch01 kernel: iterate_extent_inodes+0x6f/0x370
> > Jun 22 04:11:29 vch01 kernel: ? update_share_count+0x60/0x60
> > Jun 22 04:11:29 vch01 kernel: ? extent_from_logical+0x139/0x190
> > Jun 22 04:11:29 vch01 kernel: ? release_extent_buffer+0x96/0xb0
> > Jun 22 04:11:29 vch01 kernel: iterate_inodes_from_logical+0xaa/0xd0
> > Jun 22 04:11:29 vch01 kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> > Jun 22 04:11:29 vch01 kernel: __x64_sys_ioctl+0x84/0xc0
> > Jun 22 04:11:29 vch01 kernel: do_syscall_64+0x47/0x100
> > Jun 22 04:11:29 vch01 kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > Jun 22 04:11:29 vch01 kernel: RIP: 0033:0x55d07617eaaf
> > Jun 22 04:11:29 vch01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44
> > 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24
> > 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64
> > 48 2b 04 25 28 00 00
> > Jun 22 04:11:29 vch01 kernel: RSP: 002b:000055d07085bc20 EFLAGS:
> > 00000246 ORIG_RAX: 0000000000000010
> > Jun 22 04:11:29 vch01 kernel: RAX: ffffffffffffffda RBX:
> > 000055d0402f8550 RCX: 000055d07617eaaf
> > Jun 22 04:11:29 vch01 kernel: RDX: 000055d07085bca0 RSI:
> > 00000000c038943b RDI: 0000000000000003
> > Jun 22 04:11:29 vch01 kernel: RBP: 000055d07085bea0 R08:
> > 00007fee46c84080 R09: 0000000000000000
> > Jun 22 04:11:29 vch01 kernel: R10: 0000000000000000 R11:
> > 0000000000000246 R12: 0000000000000003
> > Jun 22 04:11:29 vch01 kernel: R13: 000055d07085bf80 R14:
> > 000055d07085bf48 R15: 000055d07085c0b0
> > Jun 22 04:11:29 vch01 kernel: </TASK>
> >
> > Some more information about the environment and incident observations:
> >
> > The kernel is compiled without module support, everything baked in, no
> > proprietary modules.
> >
> > The file system has previously been recreated by restoring from backup
> > but we still see this problem once in a while, so I suspect there's no
> > file system inconsistency involved.
> >
> > The rootfs itself is on xfs, so I can still login. But the system is
> > very slow, takes 5-10 minutes to log in via SSH, commands like
> > "reboot" time out because the system dbus is congested.
> >
> > This bug cannot be easily triggered. So far, it only happened at
> > night, the system needs uptime of multiple days or weeks, with some
> > amount of swap used, and the backup (borg backup) has to run. So I
> > think this happens because we get some memory pressure while we also
> > have memory fragmentation going on for some time.
> >
> > The system is running bees on the btrfs pool because naturally this
> > web and mail hosting system has a lot of duplicate files. Mysql is
> > running on xfs only. Temporary files during backup are created on xfs
> > only (the backup can access the btrfs only via r/o, no writes
> > allowed). Snapper takes snapshots every hour and cleans out older
> > snapshots over time.
> >
> > I've now upgraded to the current Gentoo stable release 6.21.31 of the kernel.
> >
> > btrfs is running in single data / raid1 meta on three disks provided
> > by KVM via virtio. The KVM images itself are served by drbd redundant
> > over multiple servers (outside of the VM). The hardware itself has no
> > known hardware issues (no memory errors, no storage errors). Scrubbing
> > finds no checksum or other errors. The VM or storage hasn't been
> > migrated at time of the incident or recently.
> >
> > # uname -a
> > Linux vch01 6.12.21-gentoo #2 SMP Thu May 15 18:02:52 CEST 2025 x86_64
> > Intel Xeon E3-12xx v2 (Ivy Bridge) GenuineIntel GNU/Linux
> >
> > # free -m
> > total used free shared buff/cache available
> > Mem: 84476 33122 1941 3344 48183 51353
> > Swap: 15358 5362 9996
> >
> > # cat /proc/cpuinfo | egrep 'processor|model|cpu|cache' | sort -u
> > bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> > l1tf mds swapgs itlb_multihit srbds mmio_unknown bhi
> > cache_alignment : 64
> > cache size : 4096 KB
> > cpu cores : 1
> > cpu family : 6
> > cpuid level : 13
> > cpu MHz : 2399.998
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > rdrand hypervisor lahf_lm intel_ppin fsgsbase smep erms xsaveopt
> > model : 58
> > model name : Intel Xeon E3-12xx v2 (Ivy Bridge)
> > processor : 0
> > processor : 1
> > processor : 10
> > processor : 11
> > processor : 12
> > processor : 13
> > processor : 14
> > processor : 15
> > processor : 2
> > processor : 3
> > processor : 4
> > processor : 5
> > processor : 6
> > processor : 7
> > processor : 8
> > processor : 9
> >
> > Thanks for looking into it. If you need more information, I may be
> > able to provide it as long as I don't have to get it from the still
> > running machine: it has been rebooted since the incident.
We have reports that the system had already slowed down around 4-5
hours before the kernel log entry; that is, the slowdown was reported
before the backup started at 1 am. So the additional load introduced by
the backup probably only accelerated something that had already gone
wrong. On another note, I think if the system had survived the backup,
the situation would probably have relaxed enough not to trigger the RCU
stall.
Regards,
Kai
* Re: btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110
2025-06-23 8:45 ` Kai Krakow
@ 2025-06-23 10:34 ` Filipe Manana
2025-06-23 11:19 ` Kai Krakow
0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2025-06-23 10:34 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
On Mon, Jun 23, 2025 at 9:45 AM Kai Krakow <hurikhan77@gmail.com> wrote:
>
> Hello btrfs list!
>
> Some more context:
>
> > > Hello btrfs list!
> > >
> > > Once in a while, we are seeing the following kernel bug during the
> > > night while the backup is putting some additional load on the system:
> > >
> > > Jun 22 04:11:29 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> > > Jun 22 04:11:29 vch01 kernel: rcu: 10-....: (2100 ticks this
> > > GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187
> > > fqs=1052
> > > Jun 22 04:11:29 vch01 kernel: rcu: (t=2100 jiffies g=358306033
> > > q=2241752 ncpus=16)
> > > Jun 22 04:11:29 vch01 kernel: CPU: 10 UID: 0 PID: 1524681 Comm:
> > > map_0x178e45670 Not tainted 6.12.21-gentoo #1
> > > Jun 22 04:11:29 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> > > Jun 22 04:11:29 vch01 kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> > > Jun 22 04:11:29 vch01 kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6
> > > 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c
> > > ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d
> > > 49 08 44 8b 75 0c 48
> > > Jun 22 04:11:29 vch01 kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> > > Jun 22 04:11:29 vch01 kernel: RAX: 0000000000001f15 RBX:
> > > fffff437ea382200 RCX: fffff437cb891200
> > > Jun 22 04:11:29 vch01 kernel: RDX: 000001922b68df2a RSI:
> > > 0000000000000000 RDI: ffffa434c3e66d20
> > > Jun 22 04:11:29 vch01 kernel: RBP: ffffa434c3e66d20 R08:
> > > 000001922b68c000 R09: 0000000000000015
> > > Jun 22 04:11:29 vch01 kernel: R10: 6c0000000000000a R11:
> > > 0000000009fe7000 R12: 0000000000000f2a
> > > Jun 22 04:11:29 vch01 kernel: R13: 0000000000000001 R14:
> > > ffffa43192e6d230 R15: ffffa43160c4c800
> > > Jun 22 04:11:29 vch01 kernel: FS: 000055d07085e6c0(0000)
> > > GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> > > Jun 22 04:11:29 vch01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > Jun 22 04:11:29 vch01 kernel: CR2: 00007fff204ecfc0 CR3:
> > > 0000000121a0b000 CR4: 00000000001506f0
> > > Jun 22 04:11:29 vch01 kernel: DR0: 0000000000000000 DR1:
> > > 0000000000000000 DR2: 0000000000000000
> > > Jun 22 04:11:29 vch01 kernel: DR3: 0000000000000000 DR6:
> > > 00000000fffe0ff0 DR7: 0000000000000400
> > > Jun 22 04:11:29 vch01 kernel: Call Trace:
> > > Jun 22 04:11:29 vch01 kernel: <IRQ>
> > > Jun 22 04:11:29 vch01 kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> > > Jun 22 04:11:29 vch01 kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> > > Jun 22 04:11:29 vch01 kernel: ? update_process_times+0x6c/0xa0
> > > Jun 22 04:11:29 vch01 kernel: ? tick_nohz_handler+0x82/0x110
> > > Jun 22 04:11:29 vch01 kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> > > Jun 22 04:11:29 vch01 kernel: ? __hrtimer_run_queues+0x10b/0x190
> > > Jun 22 04:11:29 vch01 kernel: ? hrtimer_interrupt+0xf1/0x200
> > > Jun 22 04:11:29 vch01 kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> > > Jun 22 04:11:29 vch01 kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> > > Jun 22 04:11:29 vch01 kernel: </IRQ>
> > > Jun 22 04:11:29 vch01 kernel: <TASK>
> > > Jun 22 04:11:29 vch01 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > Jun 22 04:11:29 vch01 kernel: ? btrfs_get_64+0x65/0x110
> > > Jun 22 04:11:29 vch01 kernel: find_parent_nodes+0x1b84/0x1dc0
> > > Jun 22 04:11:29 vch01 kernel: btrfs_find_all_leafs+0x31/0xd0
> > > Jun 22 04:11:29 vch01 kernel: ? queued_write_lock_slowpath+0x30/0x70
> > > Jun 22 04:11:29 vch01 kernel: iterate_extent_inodes+0x6f/0x370
> > > Jun 22 04:11:29 vch01 kernel: ? update_share_count+0x60/0x60
> > > Jun 22 04:11:29 vch01 kernel: ? extent_from_logical+0x139/0x190
> > > Jun 22 04:11:29 vch01 kernel: ? release_extent_buffer+0x96/0xb0
> > > Jun 22 04:11:29 vch01 kernel: iterate_inodes_from_logical+0xaa/0xd0
> > > Jun 22 04:11:29 vch01 kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> > > Jun 22 04:11:29 vch01 kernel: __x64_sys_ioctl+0x84/0xc0
> > > Jun 22 04:11:29 vch01 kernel: do_syscall_64+0x47/0x100
> > > Jun 22 04:11:29 vch01 kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > > Jun 22 04:11:29 vch01 kernel: RIP: 0033:0x55d07617eaaf
> > > Jun 22 04:11:29 vch01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44
> > > 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24
> > > 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64
> > > 48 2b 04 25 28 00 00
> > > Jun 22 04:11:29 vch01 kernel: RSP: 002b:000055d07085bc20 EFLAGS:
> > > 00000246 ORIG_RAX: 0000000000000010
> > > Jun 22 04:11:29 vch01 kernel: RAX: ffffffffffffffda RBX:
> > > 000055d0402f8550 RCX: 000055d07617eaaf
> > > Jun 22 04:11:29 vch01 kernel: RDX: 000055d07085bca0 RSI:
> > > 00000000c038943b RDI: 0000000000000003
> > > Jun 22 04:11:29 vch01 kernel: RBP: 000055d07085bea0 R08:
> > > 00007fee46c84080 R09: 0000000000000000
> > > Jun 22 04:11:29 vch01 kernel: R10: 0000000000000000 R11:
> > > 0000000000000246 R12: 0000000000000003
> > > Jun 22 04:11:29 vch01 kernel: R13: 000055d07085bf80 R14:
> > > 000055d07085bf48 R15: 000055d07085c0b0
> > > Jun 22 04:11:29 vch01 kernel: </TASK>
> > >
> > > Some more information about the environment and incident observations:
> > >
> > > The kernel is compiled without module support, everything baked in, no
> > > proprietary modules.
> > >
> > > The file system has previously been recreated by restoring from backup
> > > but we still see this problem once in a while, so I suspect there's no
> > > file system inconsistency involved.
> > >
> > > The rootfs itself is on xfs, so I can still login. But the system is
> > > very slow, takes 5-10 minutes to log in via SSH, commands like
> > > "reboot" time out because the system dbus is congested.
> > >
> > > This bug cannot be easily triggered. So far, it only happened at
> > > night, the system needs uptime of multiple days or weeks, with some
> > > amount of swap used, and the backup (borg backup) has to run. So I
> > > think this happens because we get some memory pressure while we also
> > > have memory fragmentation going on for some time.
> > >
> > > The system is running bees on the btrfs pool because naturally this
> > > web and mail hosting system has a lot of duplicate files. Mysql is
> > > running on xfs only. Temporary files during backup are created on xfs
> > > only (the backup can access the btrfs only via r/o, no writes
> > > allowed). Snapper takes snapshots every hour and cleans out older
> > > snapshots over time.
> > >
> > > I've now upgraded to the current Gentoo stable release 6.21.31 of the kernel.
> > >
> > > btrfs is running in single data / raid1 meta on three disks provided
> > > by KVM via virtio. The KVM images itself are served by drbd redundant
> > > over multiple servers (outside of the VM). The hardware itself has no
> > > known hardware issues (no memory errors, no storage errors). Scrubbing
> > > finds no checksum or other errors. The VM or storage hasn't been
> > > migrated at time of the incident or recently.
> > >
> > > # uname -a
> > > Linux vch01 6.12.21-gentoo #2 SMP Thu May 15 18:02:52 CEST 2025 x86_64
> > > Intel Xeon E3-12xx v2 (Ivy Bridge) GenuineIntel GNU/Linux
> > >
> > > # free -m
> > > total used free shared buff/cache available
> > > Mem: 84476 33122 1941 3344 48183 51353
> > > Swap: 15358 5362 9996
> > >
> > > # cat /proc/cpuinfo | egrep 'processor|model|cpu|cache' | sort -u
> > > bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> > > l1tf mds swapgs itlb_multihit srbds mmio_unknown bhi
> > > cache_alignment : 64
> > > cache size : 4096 KB
> > > cpu cores : 1
> > > cpu family : 6
> > > cpuid level : 13
> > > cpu MHz : 2399.998
> > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > > rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
> > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > > rdrand hypervisor lahf_lm intel_ppin fsgsbase smep erms xsaveopt
> > > model : 58
> > > model name : Intel Xeon E3-12xx v2 (Ivy Bridge)
> > > processor : 0
> > > processor : 1
> > > processor : 10
> > > processor : 11
> > > processor : 12
> > > processor : 13
> > > processor : 14
> > > processor : 15
> > > processor : 2
> > > processor : 3
> > > processor : 4
> > > processor : 5
> > > processor : 6
> > > processor : 7
> > > processor : 8
> > > processor : 9
> > >
> > > Thanks for looking into it. If you need more information, I may be
> > > able to provide it as long as I don't have to get it from the still
> > > running machine: it has been rebooted since the incident.
>
> We have reports that the system already slowed down around 4-5 hours
> before the kernel log. This was reported before the backup started at
> 1 am. So the additional load introduced by the backup probably only
> accelerated what already has gone wrong. On another note, I think if
> it survived the backup, the situation would have probably relaxed
> enough to not trigger the RCU stall.
The RCU stall could be because there's a large number of backrefs for
some extents and we're spending too much time looping over them
without ever yielding the cpu.
Are you able to test the following patch and see if the warning goes away?
https://pastebin.com/raw/aGUPHi93
Thanks.
>
> Regards,
> Kai
>
* Re: btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110
2025-06-23 10:34 ` Filipe Manana
@ 2025-06-23 11:19 ` Kai Krakow
2025-06-23 16:30 ` Filipe Manana
0 siblings, 1 reply; 6+ messages in thread
From: Kai Krakow @ 2025-06-23 11:19 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
Hi Filipe!
On Mon, Jun 23, 2025 at 12:35 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Mon, Jun 23, 2025 at 9:45 AM Kai Krakow <hurikhan77@gmail.com> wrote:
> >
> > Hello btrfs list!
> >
> > Some more context:
> >
> > > > Hello btrfs list!
> > > >
> > > > Once in a while, we are seeing the following kernel bug during the
> > > > night while the backup is putting some additional load on the system:
> > > >
> > > > Jun 22 04:11:29 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> > > > Jun 22 04:11:29 vch01 kernel: rcu: 10-....: (2100 ticks this
> > > > GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187
> > > > fqs=1052
> > > > Jun 22 04:11:29 vch01 kernel: rcu: (t=2100 jiffies g=358306033
> > > > q=2241752 ncpus=16)
> > > > Jun 22 04:11:29 vch01 kernel: CPU: 10 UID: 0 PID: 1524681 Comm:
> > > > map_0x178e45670 Not tainted 6.12.21-gentoo #1
> > > > Jun 22 04:11:29 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> > > > Jun 22 04:11:29 vch01 kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> > > > Jun 22 04:11:29 vch01 kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6
> > > > 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c
> > > > ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d
> > > > 49 08 44 8b 75 0c 48
> > > > Jun 22 04:11:29 vch01 kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> > > > Jun 22 04:11:29 vch01 kernel: RAX: 0000000000001f15 RBX:
> > > > fffff437ea382200 RCX: fffff437cb891200
> > > > Jun 22 04:11:29 vch01 kernel: RDX: 000001922b68df2a RSI:
> > > > 0000000000000000 RDI: ffffa434c3e66d20
> > > > Jun 22 04:11:29 vch01 kernel: RBP: ffffa434c3e66d20 R08:
> > > > 000001922b68c000 R09: 0000000000000015
> > > > Jun 22 04:11:29 vch01 kernel: R10: 6c0000000000000a R11:
> > > > 0000000009fe7000 R12: 0000000000000f2a
> > > > Jun 22 04:11:29 vch01 kernel: R13: 0000000000000001 R14:
> > > > ffffa43192e6d230 R15: ffffa43160c4c800
> > > > Jun 22 04:11:29 vch01 kernel: FS: 000055d07085e6c0(0000)
> > > > GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> > > > Jun 22 04:11:29 vch01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > Jun 22 04:11:29 vch01 kernel: CR2: 00007fff204ecfc0 CR3:
> > > > 0000000121a0b000 CR4: 00000000001506f0
> > > > Jun 22 04:11:29 vch01 kernel: DR0: 0000000000000000 DR1:
> > > > 0000000000000000 DR2: 0000000000000000
> > > > Jun 22 04:11:29 vch01 kernel: DR3: 0000000000000000 DR6:
> > > > 00000000fffe0ff0 DR7: 0000000000000400
> > > > Jun 22 04:11:29 vch01 kernel: Call Trace:
> > > > Jun 22 04:11:29 vch01 kernel: <IRQ>
> > > > Jun 22 04:11:29 vch01 kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> > > > Jun 22 04:11:29 vch01 kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> > > > Jun 22 04:11:29 vch01 kernel: ? update_process_times+0x6c/0xa0
> > > > Jun 22 04:11:29 vch01 kernel: ? tick_nohz_handler+0x82/0x110
> > > > Jun 22 04:11:29 vch01 kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> > > > Jun 22 04:11:29 vch01 kernel: ? __hrtimer_run_queues+0x10b/0x190
> > > > Jun 22 04:11:29 vch01 kernel: ? hrtimer_interrupt+0xf1/0x200
> > > > Jun 22 04:11:29 vch01 kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> > > > Jun 22 04:11:29 vch01 kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> > > > Jun 22 04:11:29 vch01 kernel: </IRQ>
> > > > Jun 22 04:11:29 vch01 kernel: <TASK>
> > > > Jun 22 04:11:29 vch01 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > Jun 22 04:11:29 vch01 kernel: ? btrfs_get_64+0x65/0x110
> > > > Jun 22 04:11:29 vch01 kernel: find_parent_nodes+0x1b84/0x1dc0
> > > > Jun 22 04:11:29 vch01 kernel: btrfs_find_all_leafs+0x31/0xd0
> > > > Jun 22 04:11:29 vch01 kernel: ? queued_write_lock_slowpath+0x30/0x70
> > > > Jun 22 04:11:29 vch01 kernel: iterate_extent_inodes+0x6f/0x370
> > > > Jun 22 04:11:29 vch01 kernel: ? update_share_count+0x60/0x60
> > > > Jun 22 04:11:29 vch01 kernel: ? extent_from_logical+0x139/0x190
> > > > Jun 22 04:11:29 vch01 kernel: ? release_extent_buffer+0x96/0xb0
> > > > Jun 22 04:11:29 vch01 kernel: iterate_inodes_from_logical+0xaa/0xd0
> > > > Jun 22 04:11:29 vch01 kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> > > > Jun 22 04:11:29 vch01 kernel: __x64_sys_ioctl+0x84/0xc0
> > > > Jun 22 04:11:29 vch01 kernel: do_syscall_64+0x47/0x100
> > > > Jun 22 04:11:29 vch01 kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > > > Jun 22 04:11:29 vch01 kernel: RIP: 0033:0x55d07617eaaf
> > > > Jun 22 04:11:29 vch01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44
> > > > 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24
> > > > 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64
> > > > 48 2b 04 25 28 00 00
> > > > Jun 22 04:11:29 vch01 kernel: RSP: 002b:000055d07085bc20 EFLAGS:
> > > > 00000246 ORIG_RAX: 0000000000000010
> > > > Jun 22 04:11:29 vch01 kernel: RAX: ffffffffffffffda RBX:
> > > > 000055d0402f8550 RCX: 000055d07617eaaf
> > > > Jun 22 04:11:29 vch01 kernel: RDX: 000055d07085bca0 RSI:
> > > > 00000000c038943b RDI: 0000000000000003
> > > > Jun 22 04:11:29 vch01 kernel: RBP: 000055d07085bea0 R08:
> > > > 00007fee46c84080 R09: 0000000000000000
> > > > Jun 22 04:11:29 vch01 kernel: R10: 0000000000000000 R11:
> > > > 0000000000000246 R12: 0000000000000003
> > > > Jun 22 04:11:29 vch01 kernel: R13: 000055d07085bf80 R14:
> > > > 000055d07085bf48 R15: 000055d07085c0b0
> > > > Jun 22 04:11:29 vch01 kernel: </TASK>
> > > >
> > > > Some more information about the environment and incident observations:
> > > >
> > > > The kernel is compiled without module support, everything baked in, no
> > > > proprietary modules.
> > > >
> > > > The file system has previously been recreated by restoring from backup
> > > > but we still see this problem once in a while, so I suspect there's no
> > > > file system inconsistency involved.
> > > >
> > > > The rootfs itself is on xfs, so I can still login. But the system is
> > > > very slow, takes 5-10 minutes to log in via SSH, commands like
> > > > "reboot" time out because the system dbus is congested.
> > > >
> > > > This bug cannot be easily triggered. So far, it only happened at
> > > > night, the system needs uptime of multiple days or weeks, with some
> > > > amount of swap used, and the backup (borg backup) has to run. So I
> > > > think this happens because we get some memory pressure while we also
> > > > have memory fragmentation going on for some time.
> > > >
> > > > The system is running bees on the btrfs pool because naturally this
> > > > web and mail hosting system has a lot of duplicate files. Mysql is
> > > > running on xfs only. Temporary files during backup are created on xfs
> > > > only (the backup can access the btrfs only via r/o, no writes
> > > > allowed). Snapper takes snapshots every hour and cleans out older
> > > > snapshots over time.
> > > >
> > > > I've now upgraded to the current Gentoo stable release 6.21.31 of the kernel.
> > > >
> > > > btrfs is running in single data / raid1 meta on three disks provided
> > > > by KVM via virtio. The KVM images itself are served by drbd redundant
> > > > over multiple servers (outside of the VM). The hardware itself has no
> > > > known hardware issues (no memory errors, no storage errors). Scrubbing
> > > > finds no checksum or other errors. The VM or storage hasn't been
> > > > migrated at time of the incident or recently.
> > > >
> > > > # uname -a
> > > > Linux vch01 6.12.21-gentoo #2 SMP Thu May 15 18:02:52 CEST 2025 x86_64
> > > > Intel Xeon E3-12xx v2 (Ivy Bridge) GenuineIntel GNU/Linux
> > > >
> > > > # free -m
> > > > total used free shared buff/cache available
> > > > Mem: 84476 33122 1941 3344 48183 51353
> > > > Swap: 15358 5362 9996
> > > >
> > > > # cat /proc/cpuinfo | egrep 'processor|model|cpu|cache' | sort -u
> > > > bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> > > > l1tf mds swapgs itlb_multihit srbds mmio_unknown bhi
> > > > cache_alignment : 64
> > > > cache size : 4096 KB
> > > > cpu cores : 1
> > > > cpu family : 6
> > > > cpuid level : 13
> > > > cpu MHz : 2399.998
> > > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > > > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > > > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > > > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > > > rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
> > > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > > > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > > > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > > > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > > > rdrand hypervisor lahf_lm intel_ppin fsgsbase smep erms xsaveopt
> > > > model : 58
> > > > model name : Intel Xeon E3-12xx v2 (Ivy Bridge)
> > > > processor : 0
> > > > processor : 1
> > > > processor : 10
> > > > processor : 11
> > > > processor : 12
> > > > processor : 13
> > > > processor : 14
> > > > processor : 15
> > > > processor : 2
> > > > processor : 3
> > > > processor : 4
> > > > processor : 5
> > > > processor : 6
> > > > processor : 7
> > > > processor : 8
> > > > processor : 9
> > > >
> > > > Thanks for looking into it. If you need more information, I may be
> > > > able to provide it as long as I don't have to get it from the still
> > > > running machine: it has been rebooted since the incident.
> >
> > We have reports that the system already slowed down around 4-5 hours
> > before the kernel log. This was reported before the backup started at
> > 1 am. So the additional load introduced by the backup probably only
> > accelerated what already has gone wrong. On another note, I think if
> > it survived the backup, the situation would have probably relaxed
> > enough to not trigger the RCU stall.
>
> The RCU stall could be because there's a large number of backrefs for
> some extents and we're spending too much time looping over them
> without ever yielding the cpu.
>
> Are you able to test the following patch and see if the warning goes away?
>
> https://pastebin.com/raw/aGUPHi93
I will happily test this patch. But since this problem only triggers
every few weeks, it's hard to reproduce, and even harder to say whether
the patch solved anything.

Using ChatGPT, I worked out that I should probably reduce metadata
operations like snapshot cleanups while the backup is running. So I
modified snapper-cleanup.service to not run while the borgbackup lock
file exists. This may already change the outcome of testing the patch;
what do you think?
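For reference, this is roughly how I gate the cleanup (a sketch from
memory; the exact lock file path on the box may differ):

# Skip snapper cleanups while borg holds its lock file.
mkdir -p /etc/systemd/system/snapper-cleanup.service.d
cat > /etc/systemd/system/snapper-cleanup.service.d/skip-during-backup.conf <<'EOF'
[Service]
# ExecCondition= (systemd >= 243): the unit is skipped, not failed,
# when this command exits non-zero.
ExecCondition=/bin/sh -c 'test ! -e /run/lock/borgbackup.lock'
EOF
systemctl daemon-reload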
> Thanks.
Regards,
Kai
* Re: btrfs dies in RIP: 0010:btrfs_get_64+0x65/0x110
2025-06-23 11:19 ` Kai Krakow
@ 2025-06-23 16:30 ` Filipe Manana
0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2025-06-23 16:30 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
On Mon, Jun 23, 2025 at 12:20 PM Kai Krakow <hurikhan77@gmail.com> wrote:
>
> Hi Filipe!
>
> On Mon, Jun 23, 2025 at 12:35 PM Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > On Mon, Jun 23, 2025 at 9:45 AM Kai Krakow <hurikhan77@gmail.com> wrote:
> > >
> > > Hello btrfs list!
> > >
> > > Some more context:
> > >
> > > > > Hello btrfs list!
> > > > >
> > > > > Once in a while, we are seeing the following kernel bug during the
> > > > > night while the backup is putting some additional load on the system:
> > > > >
> > > > > Jun 22 04:11:29 vch01 kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> > > > > Jun 22 04:11:29 vch01 kernel: rcu: 10-....: (2100 ticks this
> > > > > GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187
> > > > > fqs=1052
> > > > > Jun 22 04:11:29 vch01 kernel: rcu: (t=2100 jiffies g=358306033
> > > > > q=2241752 ncpus=16)
> > > > > Jun 22 04:11:29 vch01 kernel: CPU: 10 UID: 0 PID: 1524681 Comm:
> > > > > map_0x178e45670 Not tainted 6.12.21-gentoo #1
> > > > > Jun 22 04:11:29 vch01 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> > > > > Jun 22 04:11:29 vch01 kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> > > > > Jun 22 04:11:29 vch01 kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6
> > > > > 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c
> > > > > ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d
> > > > > 49 08 44 8b 75 0c 48
> > > > > Jun 22 04:11:29 vch01 kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> > > > > Jun 22 04:11:29 vch01 kernel: RAX: 0000000000001f15 RBX:
> > > > > fffff437ea382200 RCX: fffff437cb891200
> > > > > Jun 22 04:11:29 vch01 kernel: RDX: 000001922b68df2a RSI:
> > > > > 0000000000000000 RDI: ffffa434c3e66d20
> > > > > Jun 22 04:11:29 vch01 kernel: RBP: ffffa434c3e66d20 R08:
> > > > > 000001922b68c000 R09: 0000000000000015
> > > > > Jun 22 04:11:29 vch01 kernel: R10: 6c0000000000000a R11:
> > > > > 0000000009fe7000 R12: 0000000000000f2a
> > > > > Jun 22 04:11:29 vch01 kernel: R13: 0000000000000001 R14:
> > > > > ffffa43192e6d230 R15: ffffa43160c4c800
> > > > > Jun 22 04:11:29 vch01 kernel: FS: 000055d07085e6c0(0000)
> > > > > GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> > > > > Jun 22 04:11:29 vch01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > Jun 22 04:11:29 vch01 kernel: CR2: 00007fff204ecfc0 CR3:
> > > > > 0000000121a0b000 CR4: 00000000001506f0
> > > > > Jun 22 04:11:29 vch01 kernel: DR0: 0000000000000000 DR1:
> > > > > 0000000000000000 DR2: 0000000000000000
> > > > > Jun 22 04:11:29 vch01 kernel: DR3: 0000000000000000 DR6:
> > > > > 00000000fffe0ff0 DR7: 0000000000000400
> > > > > Jun 22 04:11:29 vch01 kernel: Call Trace:
> > > > > Jun 22 04:11:29 vch01 kernel: <IRQ>
> > > > > Jun 22 04:11:29 vch01 kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> > > > > Jun 22 04:11:29 vch01 kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> > > > > Jun 22 04:11:29 vch01 kernel: ? update_process_times+0x6c/0xa0
> > > > > Jun 22 04:11:29 vch01 kernel: ? tick_nohz_handler+0x82/0x110
> > > > > Jun 22 04:11:29 vch01 kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> > > > > Jun 22 04:11:29 vch01 kernel: ? __hrtimer_run_queues+0x10b/0x190
> > > > > Jun 22 04:11:29 vch01 kernel: ? hrtimer_interrupt+0xf1/0x200
> > > > > Jun 22 04:11:29 vch01 kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> > > > > Jun 22 04:11:29 vch01 kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> > > > > Jun 22 04:11:29 vch01 kernel: </IRQ>
> > > > > Jun 22 04:11:29 vch01 kernel: <TASK>
> > > > > Jun 22 04:11:29 vch01 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > Jun 22 04:11:29 vch01 kernel: ? btrfs_get_64+0x65/0x110
> > > > > Jun 22 04:11:29 vch01 kernel: find_parent_nodes+0x1b84/0x1dc0
> > > > > Jun 22 04:11:29 vch01 kernel: btrfs_find_all_leafs+0x31/0xd0
> > > > > Jun 22 04:11:29 vch01 kernel: ? queued_write_lock_slowpath+0x30/0x70
> > > > > Jun 22 04:11:29 vch01 kernel: iterate_extent_inodes+0x6f/0x370
> > > > > Jun 22 04:11:29 vch01 kernel: ? update_share_count+0x60/0x60
> > > > > Jun 22 04:11:29 vch01 kernel: ? extent_from_logical+0x139/0x190
> > > > > Jun 22 04:11:29 vch01 kernel: ? release_extent_buffer+0x96/0xb0
> > > > > Jun 22 04:11:29 vch01 kernel: iterate_inodes_from_logical+0xaa/0xd0
> > > > > Jun 22 04:11:29 vch01 kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> > > > > Jun 22 04:11:29 vch01 kernel: __x64_sys_ioctl+0x84/0xc0
> > > > > Jun 22 04:11:29 vch01 kernel: do_syscall_64+0x47/0x100
> > > > > Jun 22 04:11:29 vch01 kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > > > > Jun 22 04:11:29 vch01 kernel: RIP: 0033:0x55d07617eaaf
> > > > > Jun 22 04:11:29 vch01 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44
> > > > > 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24
> > > > > 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64
> > > > > 48 2b 04 25 28 00 00
> > > > > Jun 22 04:11:29 vch01 kernel: RSP: 002b:000055d07085bc20 EFLAGS:
> > > > > 00000246 ORIG_RAX: 0000000000000010
> > > > > Jun 22 04:11:29 vch01 kernel: RAX: ffffffffffffffda RBX:
> > > > > 000055d0402f8550 RCX: 000055d07617eaaf
> > > > > Jun 22 04:11:29 vch01 kernel: RDX: 000055d07085bca0 RSI:
> > > > > 00000000c038943b RDI: 0000000000000003
> > > > > Jun 22 04:11:29 vch01 kernel: RBP: 000055d07085bea0 R08:
> > > > > 00007fee46c84080 R09: 0000000000000000
> > > > > Jun 22 04:11:29 vch01 kernel: R10: 0000000000000000 R11:
> > > > > 0000000000000246 R12: 0000000000000003
> > > > > Jun 22 04:11:29 vch01 kernel: R13: 000055d07085bf80 R14:
> > > > > 000055d07085bf48 R15: 000055d07085c0b0
> > > > > Jun 22 04:11:29 vch01 kernel: </TASK>
> > > > >
> > > > > Some more information about the environment and incident observations:
> > > > >
> > > > > The kernel is compiled without module support, everything baked in, no
> > > > > proprietary modules.
> > > > >
> > > > > The file system has previously been recreated by restoring from backup
> > > > > but we still see this problem once in a while, so I suspect there's no
> > > > > file system inconsistency involved.
> > > > >
> > > > > The rootfs itself is on xfs, so I can still login. But the system is
> > > > > very slow, takes 5-10 minutes to log in via SSH, commands like
> > > > > "reboot" time out because the system dbus is congested.
> > > > >
> > > > > This bug cannot be easily triggered. So far, it only happened at
> > > > > night, the system needs uptime of multiple days or weeks, with some
> > > > > amount of swap used, and the backup (borg backup) has to run. So I
> > > > > think this happens because we get some memory pressure while we also
> > > > > have memory fragmentation going on for some time.
> > > > >
> > > > > The system is running bees on the btrfs pool because naturally this
> > > > > web and mail hosting system has a lot of duplicate files. Mysql is
> > > > > running on xfs only. Temporary files during backup are created on xfs
> > > > > only (the backup can access the btrfs only via r/o, no writes
> > > > > allowed). Snapper takes snapshots every hour and cleans out older
> > > > > snapshots over time.
> > > > >
> > > > > I've now upgraded to the current Gentoo stable release 6.21.31 of the kernel.
> > > > >
> > > > > btrfs is running in single data / raid1 meta on three disks provided
> > > > > by KVM via virtio. The KVM images itself are served by drbd redundant
> > > > > over multiple servers (outside of the VM). The hardware itself has no
> > > > > known hardware issues (no memory errors, no storage errors). Scrubbing
> > > > > finds no checksum or other errors. The VM or storage hasn't been
> > > > > migrated at time of the incident or recently.
> > > > >
> > > > > # uname -a
> > > > > Linux vch01 6.12.21-gentoo #2 SMP Thu May 15 18:02:52 CEST 2025 x86_64
> > > > > Intel Xeon E3-12xx v2 (Ivy Bridge) GenuineIntel GNU/Linux
> > > > >
> > > > > # free -m
> > > > > total used free shared buff/cache available
> > > > > Mem: 84476 33122 1941 3344 48183 51353
> > > > > Swap: 15358 5362 9996
> > > > >
> > > > > # cat /proc/cpuinfo | egrep 'processor|model|cpu|cache' | sort -u
> > > > > bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> > > > > l1tf mds swapgs itlb_multihit srbds mmio_unknown bhi
> > > > > cache_alignment : 64
> > > > > cache size : 4096 KB
> > > > > cpu cores : 1
> > > > > cpu family : 6
> > > > > cpuid level : 13
> > > > > cpu MHz : 2399.998
> > > > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > > > > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > > > > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > > > > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > > > > rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
> > > > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > > > > mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm
> > > > > constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3
> > > > > cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> > > > > rdrand hypervisor lahf_lm intel_ppin fsgsbase smep erms xsaveopt
> > > > > model : 58
> > > > > model name : Intel Xeon E3-12xx v2 (Ivy Bridge)
> > > > > processor : 0
> > > > > processor : 1
> > > > > processor : 10
> > > > > processor : 11
> > > > > processor : 12
> > > > > processor : 13
> > > > > processor : 14
> > > > > processor : 15
> > > > > processor : 2
> > > > > processor : 3
> > > > > processor : 4
> > > > > processor : 5
> > > > > processor : 6
> > > > > processor : 7
> > > > > processor : 8
> > > > > processor : 9
> > > > >
> > > > > Thanks for looking into it. If you need more information, I may be
> > > > > able to provide it as long as I don't have to get it from the still
> > > > > running machine: it has been rebooted since the incident.
> > >
> > > We have reports that the system already slowed down around 4-5 hours
> > > before the kernel log. This was reported before the backup started at
> > > 1 am. So the additional load introduced by the backup probably only
> > > accelerated what already has gone wrong. On another note, I think if
> > > it survived the backup, the situation would have probably relaxed
> > > enough to not trigger the RCU stall.
> >
> > The RCU stall could be because there's a large number of backrefs for
> > some extents and we're spending too much time looping over them
> > without ever yielding the cpu.
> >
> > Are you able to test the following patch and see if the warning goes away?
> >
> > https://pastebin.com/raw/aGUPHi93
>
> I will happily test this patch. But since this problem only triggers
> every few weeks, it's hard to reproduce, and even harder to say if it
> solved anything.
>
> Using ChatGPT, I worked out that I should probably reduce meta data
> operations like snapshot cleanups while the backup is running. So I
> modified snapper-cleanup.service to not run while the borgbackup lock
> file exists. This may already modify the outcome of testing the patch,
> what do you think?
There are many variables that can make something more likely or less
likely to happen.
If by snapshot cleanups you mean deleting snapshots, that could help
decrease the number of backrefs and maybe make the problem less likely.

But then there may be other things running in your system that are
creating a lot more backrefs (clone, dedupe, snapshot creation) at a
higher rate. Unlinks also remove backrefs, and so do many other things.

The best way to test the patch is really without making any changes to
the workload.

Judging from your stack trace, we are processing keyed backrefs, and
that means there are at least 33 backrefs for one particular extent.

Maybe you have some extents that are shared hundreds or thousands of
times, and that could make the task running the logical_to_ino ioctl
spend a lot of time iterating over backrefs without yielding the cpu,
therefore triggering the RCU stall detector.
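If you want to check for that, something along these lines should give
a rough idea of how widely a given extent is shared (the logical
address and mount point below are just placeholders, not values taken
from your trace):

# Count how many inode references point at one logical address.
# With -P the tool prints one line per inode reference instead of
# resolving full paths; add -s <bytes> (up to 16 MiB) if the default
# 64 KiB result buffer truncates the list.
LOGICAL=123456789
btrfs inspect-internal logical-resolve -P "$LOGICAL" /mnt/btrfs | wc -l

If some extents come back with counts in the thousands, that would
match the scenario above.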
>
> > Thanks.
>
> Regards,
> Kai