* btrfs freezing on writes
@ 2020-04-07 19:46 kjansen387
2020-04-07 20:11 ` Chris Murphy
2020-04-09 4:32 ` Zygo Blaxell
0 siblings, 2 replies; 13+ messages in thread
From: kjansen387 @ 2020-04-07 19:46 UTC (permalink / raw)
To: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]
Hello,
I'm using btrfs on Fedora 31 running kernel 5.5.10-200.fc31.x86_64.
I've moved my workload from md raid5 to btrfs raid1.
# btrfs filesystem show
Label: none uuid: 8ce9e167-57ea-4cf8-8678-3049ba028c12
Total devices 5 FS bytes used 3.73TiB
devid 1 size 3.64TiB used 2.53TiB path /dev/sde
devid 2 size 3.64TiB used 2.53TiB path /dev/sdf
devid 3 size 1.82TiB used 902.00GiB path /dev/sdb
devid 4 size 1.82TiB used 902.03GiB path /dev/sdc
devid 5 size 1.82TiB used 904.03GiB path /dev/sdd
# btrfs fi df /export
Data, RAID1: total=3.85TiB, used=3.72TiB
System, RAID1: total=32.00MiB, used=608.00KiB
Metadata, RAID1: total=6.00GiB, used=5.16GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
After moving to btrfs, I'm seeing freezes of ~10 seconds very, very
often (multiple times per minute). MariaDB, vim, InfluxDB, etc.
See the attachment for a stack trace of vim, and the dmesg output of 'echo w
> /proc/sysrq-trigger', which also includes other hanging processes.
What's going on? I hope someone can help.
Thanks,
Klaas
[-- Attachment #2: debug.txt --]
[-- Type: text/plain, Size: 15988 bytes --]
[937013.793854] sysrq: Show Blocked State
[937013.793988] task PC stack pid father
[937013.794093] mysqld D 0 10400 1 0x00004000
[937013.794095] Call Trace:
[937013.794100] ? __schedule+0x2c7/0x740
[937013.794101] schedule+0x4a/0xb0
[937013.794102] schedule_preempt_disabled+0xa/0x10
[937013.794103] __mutex_lock.isra.0+0x16b/0x4d0
[937013.794105] ? __switch_to_asm+0x40/0x70
[937013.794144] ? can_overcommit.part.0+0x50/0xa0 [btrfs]
[937013.794156] ? btrfs_reserve_metadata_bytes+0x6dc/0x940 [btrfs]
[937013.794168] btrfs_log_inode_parent+0x149/0xdf0 [btrfs]
[937013.794171] ? prepare_to_wait_event+0x7e/0x150
[937013.794173] ? finish_wait+0x3f/0x80
[937013.794184] ? btrfs_block_rsv_add_bytes+0x20/0x60 [btrfs]
[937013.794192] ? wait_current_trans+0x2f/0x130 [btrfs]
[937013.794193] ? _cond_resched+0x15/0x30
[937013.794202] ? join_transaction+0x24/0x410 [btrfs]
[937013.794213] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[937013.794237] btrfs_sync_file+0x25f/0x400 [btrfs]
[937013.794240] do_fsync+0x38/0x70
[937013.794240] __x64_sys_fsync+0x10/0x20
[937013.794242] do_syscall_64+0x5b/0x1c0
[937013.794244] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794245] RIP: 0033:0x7fd0d48e1deb
[937013.794247] Code: Bad RIP value.
[937013.794247] RSP: 002b:00007fd0a4bcc200 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[937013.794248] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fd0d48e1deb
[937013.794248] RDX: 0000000000000010 RSI: 0000000000000010 RDI: 0000000000000010
[937013.794249] RBP: 00007fd0a4bcc2d0 R08: 0000000000000000 R09: 00000000000002f3
[937013.794249] R10: 0000000000000010 R11: 0000000000000293 R12: 00007fd0a4bcc240
[937013.794250] R13: 0000000000000000 R14: 0005a2b86cf17940 R15: 0000000000056925
[937013.794253] mysqld D 0 10412 1 0x00004000
[937013.794254] Call Trace:
[937013.794255] ? __schedule+0x2c7/0x740
[937013.794256] schedule+0x4a/0xb0
[937013.794264] wait_for_commit+0x58/0x80 [btrfs]
[937013.794266] ? finish_wait+0x80/0x80
[937013.794274] btrfs_commit_transaction+0x87b/0xa20 [btrfs]
[937013.794276] ? dput+0xb6/0x2d0
[937013.794286] ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
[937013.794295] btrfs_sync_file+0x3b2/0x400 [btrfs]
[937013.794297] do_fsync+0x38/0x70
[937013.794298] __x64_sys_fsync+0x10/0x20
[937013.794299] do_syscall_64+0x5b/0x1c0
[937013.794300] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794300] RIP: 0033:0x7fd0d48e1deb
[937013.794301] Code: Bad RIP value.
[937013.794302] RSP: 002b:00007fd0934dfcd0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[937013.794302] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd0d48e1deb
[937013.794303] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000054
[937013.794303] RBP: 00007fd0934dfed0 R08: 0000000000000000 R09: 0000000000000000
[937013.794303] R10: 0000000000000009 R11: 0000000000000293 R12: 00007fd0934dff20
[937013.794304] R13: 0000000000000054 R14: 8f5c28f5c28f5c29 R15: 028f5c28f5c28f5c
[937013.794306] mysqld D 0 10421 1 0x00004000
[937013.794307] Call Trace:
[937013.794308] ? __schedule+0x2c7/0x740
[937013.794309] ? __switch_to_asm+0x34/0x70
[937013.794310] ? __switch_to_asm+0x34/0x70
[937013.794311] schedule+0x4a/0xb0
[937013.794312] schedule_timeout+0x20f/0x300
[937013.794313] ? __schedule+0x2cf/0x740
[937013.794314] io_schedule_timeout+0x19/0x40
[937013.794314] wait_for_completion_io+0x119/0x160
[937013.794316] ? wake_up_q+0xa0/0xa0
[937013.794323] write_all_supers+0x7d2/0x890 [btrfs]
[937013.794332] btrfs_commit_transaction+0x753/0xa20 [btrfs]
[937013.794342] ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
[937013.794351] btrfs_sync_file+0x3b2/0x400 [btrfs]
[937013.794353] do_fsync+0x38/0x70
[937013.794353] __x64_sys_fsync+0x10/0x20
[937013.794355] do_syscall_64+0x5b/0x1c0
[937013.794356] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794356] RIP: 0033:0x7fd0d48e1deb
[937013.794357] Code: Bad RIP value.
[937013.794358] RSP: 002b:00007fd0917fcfd0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[937013.794358] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd0d48e1deb
[937013.794358] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000019
[937013.794359] RBP: 00007fd0917fd1d0 R08: 0000000000000000 R09: 0000000000000000
[937013.794359] R10: 00005601eb73eee0 R11: 0000000000000293 R12: 00007fd0917fd220
[937013.794360] R13: 0000000000000019 R14: 8f5c28f5c28f5c29 R15: 028f5c28f5c28f5c
[937013.794385] postmaster D 0 3216 2409 0x00004000
[937013.794385] Call Trace:
[937013.794387] ? __schedule+0x2c7/0x740
[937013.794388] schedule+0x4a/0xb0
[937013.794388] schedule_preempt_disabled+0xa/0x10
[937013.794390] __mutex_lock.isra.0+0x16b/0x4d0
[937013.794398] ? start_transaction+0xbb/0x4c0 [btrfs]
[937013.794408] btrfs_pin_log_trans+0x19/0x30 [btrfs]
[937013.794417] btrfs_rename2+0x25b/0x1f60 [btrfs]
[937013.794419] ? link_path_walk.part.0+0x74/0x540
[937013.794420] ? path_parentat.isra.0+0x3f/0x80
[937013.794421] ? inode_permission+0xad/0x140
[937013.794422] ? vfs_rename+0x3e1/0x9e0
[937013.794430] ? btrfs_create+0x200/0x200 [btrfs]
[937013.794431] vfs_rename+0x3e1/0x9e0
[937013.794433] ? __d_lookup+0x5e/0x140
[937013.794434] ? lookup_dcache+0x18/0x60
[937013.794435] do_renameat2+0x381/0x530
[937013.794437] __x64_sys_rename+0x1f/0x30
[937013.794438] do_syscall_64+0x5b/0x1c0
[937013.794439] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794439] RIP: 0033:0x7fb3ce3bae8b
[937013.794440] Code: Bad RIP value.
[937013.794441] RSP: 002b:00007ffc432333c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[937013.794441] RAX: ffffffffffffffda RBX: 00007ffc43233410 RCX: 00007fb3ce3bae8b
[937013.794442] RDX: 0000557d55c1bd50 RSI: 00007ffc43233830 RDI: 00007ffc43233430
[937013.794442] RBP: 0000557d55c4b8e0 R08: 0000000000000000 R09: 00000000000001ff
[937013.794442] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000000
[937013.794443] R13: 00007ffc43233430 R14: 00007ffc43233830 R15: 0000000000000001
[937013.794750] mongod D 0 1186970 1186954 0x00000320
[937013.794750] Call Trace:
[937013.794752] ? __schedule+0x2c7/0x740
[937013.794753] schedule+0x4a/0xb0
[937013.794755] rwsem_down_write_slowpath+0x24c/0x510
[937013.794765] btrfs_file_write_iter+0x83/0x570 [btrfs]
[937013.794773] ? btrfs_real_readdir+0x2b0/0x4c0 [btrfs]
[937013.794775] ? __seccomp_filter+0x7b/0x670
[937013.794777] new_sync_write+0x12d/0x1d0
[937013.794778] vfs_write+0xb6/0x1a0
[937013.794779] ksys_pwrite64+0x65/0xa0
[937013.794780] do_syscall_64+0x5b/0x1c0
[937013.794782] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794782] RIP: 0033:0x7f9b6f1c504f
[937013.794783] Code: Bad RIP value.
[937013.794784] RSP: 002b:00007f9b6cdbc560 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[937013.794784] RAX: ffffffffffffffda RBX: 00000000000000b3 RCX: 00007f9b6f1c504f
[937013.794785] RDX: 0000000000004300 RSI: 000055b4f307b000 RDI: 00000000000000b3
[937013.794785] RBP: 000055b4f307b000 R08: 0000000000000000 R09: 0000000000237f80
[937013.794786] R10: 0000000000237f80 R11: 0000000000000293 R12: 0000000000004300
[937013.794786] R13: 0000000000237f80 R14: 000055b4f509f640 R15: 0000000000237f80
[937013.794788] WTJourn.Flusher D 0 1186978 1186954 0x00004320
[937013.794788] Call Trace:
[937013.794789] ? __schedule+0x2c7/0x740
[937013.794790] schedule+0x4a/0xb0
[937013.794791] schedule_preempt_disabled+0xa/0x10
[937013.794792] __mutex_lock.isra.0+0x16b/0x4d0
[937013.794793] ? __switch_to_asm+0x40/0x70
[937013.794804] ? can_overcommit.part.0+0x50/0xa0 [btrfs]
[937013.794830] ? btrfs_reserve_metadata_bytes+0x6dc/0x940 [btrfs]
[937013.794841] btrfs_log_inode_parent+0x149/0xdf0 [btrfs]
[937013.794843] ? prepare_to_wait_event+0x7e/0x150
[937013.794844] ? finish_wait+0x3f/0x80
[937013.794855] ? btrfs_block_rsv_add_bytes+0x20/0x60 [btrfs]
[937013.794863] ? wait_current_trans+0x2f/0x130 [btrfs]
[937013.794864] ? _cond_resched+0x15/0x30
[937013.794872] ? join_transaction+0x24/0x410 [btrfs]
[937013.794883] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[937013.794893] btrfs_sync_file+0x25f/0x400 [btrfs]
[937013.794894] do_fsync+0x38/0x70
[937013.794895] __x64_sys_fdatasync+0x13/0x20
[937013.794896] do_syscall_64+0x5b/0x1c0
[937013.794897] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794898] RIP: 0033:0x7f9b6eed92e7
[937013.794899] Code: Bad RIP value.
[937013.794899] RSP: 002b:00007f9b695b52d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[937013.794900] RAX: ffffffffffffffda RBX: 00000000000000b3 RCX: 00007f9b6eed92e7
[937013.794900] RDX: 0000000000000000 RSI: 00000000000000b3 RDI: 00000000000000b3
[937013.794901] RBP: 00007f9b695b5310 R08: 000055b504668080 R09: 00007f9b695b52e8
[937013.794901] R10: 00007f9b695b52e0 R11: 0000000000000293 R12: 000055b5046680c0
[937013.794902] R13: 000055b4ef32c559 R14: 0000000000000000 R15: 000055b4f0aef0f8
[937013.794903] ftdc D 0 1186980 1186954 0x00000320
[937013.794904] Call Trace:
[937013.794905] ? __schedule+0x2c7/0x740
[937013.794906] schedule+0x4a/0xb0
[937013.794907] schedule_preempt_disabled+0xa/0x10
[937013.794907] __mutex_lock.isra.0+0x16b/0x4d0
[937013.794916] ? start_transaction+0xbb/0x4c0 [btrfs]
[937013.794927] btrfs_pin_log_trans+0x19/0x30 [btrfs]
[937013.794936] btrfs_rename2+0x25b/0x1f60 [btrfs]
[937013.794938] ? link_path_walk.part.0+0x74/0x540
[937013.794939] ? path_parentat.isra.0+0x3f/0x80
[937013.794940] ? inode_permission+0xad/0x140
[937013.794942] ? vfs_rename+0x3e1/0x9e0
[937013.794951] ? btrfs_create+0x200/0x200 [btrfs]
[937013.794952] vfs_rename+0x3e1/0x9e0
[937013.794953] ? __d_lookup+0x5e/0x140
[937013.794955] do_renameat2+0x381/0x530
[937013.794956] __x64_sys_rename+0x1f/0x30
[937013.794957] do_syscall_64+0x5b/0x1c0
[937013.794959] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.794959] RIP: 0033:0x7f9b6ee3dd37
[937013.794960] Code: Bad RIP value.
[937013.794960] RSP: 002b:00007f9b685b3318 EFLAGS: 00000202 ORIG_RAX: 0000000000000052
[937013.794961] RAX: ffffffffffffffda RBX: 000055b4f4762c50 RCX: 00007f9b6ee3dd37
[937013.794961] RDX: 00007f9b685b3360 RSI: 000055b4f43d5ff0 RDI: 000055b4f43d5fa0
[937013.794962] RBP: 00007f9b685b3340 R08: 00007f9b685b4700 R09: 000055b4ef32e2e9
[937013.794962] R10: 00000000000001b6 R11: 0000000000000202 R12: 000055b4f4762c30
[937013.794962] R13: 00007f9b685b3360 R14: 00007f9b685b3720 R15: 000055b4ff007004

(gdb) #0 0x00007fb16c2badc7 in fsync () from /lib64/libpthread.so.0
#1 0x000055d92401053f in mf_sync ()
#2 0x000055d923ec772e in ml_sync_all ()
#3 0x000055d923e96527 in before_blocking ()
#4 0x000055d923fa0fe1 in inchar_loop ()
#5 0x000055d923fa04d4 in ui_inchar ()
#6 0x000055d923e9697f in inchar ()
#7 0x000055d923e9743e in vgetorpeek ()
#8 0x000055d923e98ba7 in vgetc ()
#9 0x000055d923e98d5d in safe_vgetc ()
#10 0x000055d923e3aabd in edit ()
#11 0x000055d923edfa16 in invoke_edit.isra ()
#12 0x000055d923ee7866 in normal_cmd ()
#13 0x000055d92400d7aa in main_loop ()
#14 0x000055d92400e75b in vim_main2 ()
#15 0x000055d923dfd23d in main ()
[937013.795022] influxd D 0 2082782 1 0x00004000
[937013.795023] Call Trace:
[937013.795024] ? __schedule+0x2c7/0x740
[937013.795025] schedule+0x4a/0xb0
[937013.795026] schedule_preempt_disabled+0xa/0x10
[937013.795027] __mutex_lock.isra.0+0x16b/0x4d0
[937013.795028] ? __switch_to_asm+0x40/0x70
[937013.795029] ? __switch_to_asm+0x34/0x70
[937013.795030] ? __switch_to_asm+0x34/0x70
[937013.795031] ? __switch_to_asm+0x40/0x70
[937013.795032] ? __switch_to+0x10d/0x420
[937013.795044] btrfs_log_inode_parent+0x407/0xdf0 [btrfs]
[937013.795046] ? prepare_to_wait_event+0x7e/0x150
[937013.795047] ? finish_wait+0x3f/0x80
[937013.795056] ? wait_current_trans+0xc0/0x130 [btrfs]
[937013.795064] ? join_transaction+0x24/0x410 [btrfs]
[937013.795074] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[937013.795084] btrfs_sync_file+0x25f/0x400 [btrfs]
[937013.795086] do_fsync+0x38/0x70
[937013.795087] __x64_sys_fsync+0x10/0x20
[937013.795088] do_syscall_64+0x5b/0x1c0
[937013.795090] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.795090] RIP: 0033:0x4b1420
[937013.795091] Code: Bad RIP value.
[937013.795092] RSP: 002b:000000c00246e5d8 EFLAGS: 00000212 ORIG_RAX: 000000000000004a
[937013.795092] RAX: ffffffffffffffda RBX: 000000c000058000 RCX: 00000000004b1420
[937013.795093] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000007f4
[937013.795093] RBP: 000000c00246e618 R08: 0000000000000000 R09: 0000000000000000
[937013.795094] R10: 0000000000000000 R11: 0000000000000212 R12: 000000c0490ae120
[937013.795094] R13: 000000c0490ae120 R14: 000000c0436fd020 R15: 0000000000000055
[937013.795098] influxd D 0 2082804 1 0x00004000
[937013.795099] Call Trace:
[937013.795100] ? __schedule+0x2c7/0x740
[937013.795102] schedule+0x4a/0xb0
[937013.795110] wait_for_commit+0x58/0x80 [btrfs]
[937013.795112] ? finish_wait+0x80/0x80
[937013.795120] btrfs_commit_transaction+0x87b/0xa20 [btrfs]
[937013.795121] ? dput+0xb6/0x2d0
[937013.795132] ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
[937013.795141] btrfs_sync_file+0x3b2/0x400 [btrfs]
[937013.795143] do_fsync+0x38/0x70
[937013.795144] __x64_sys_fsync+0x10/0x20
[937013.795145] do_syscall_64+0x5b/0x1c0
[937013.795146] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.795147] RIP: 0033:0x4b1420
[937013.795148] Code: Bad RIP value.
[937013.795148] RSP: 002b:000000c000efa5d8 EFLAGS: 00000212 ORIG_RAX: 000000000000004a
[937013.795149] RAX: ffffffffffffffda RBX: 000000c000053400 RCX: 00000000004b1420
[937013.795149] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000051
[937013.795150] RBP: 000000c000efa618 R08: 0000000000000000 R09: 0000000000000000
[937013.795150] R10: 0000000000000000 R11: 0000000000000212 R12: 000000c04c5ae360
[937013.795151] R13: 000000c04c5ae360 R14: 000000c0008b6240 R15: 0000000000000055
[937013.795191] vim D 0 2228648 2223845 0x00004000
[937013.795191] Call Trace:
[937013.795193] ? __schedule+0x2c7/0x740
[937013.795194] ? __switch_to_asm+0x40/0x70
[937013.795195] ? __switch_to_asm+0x40/0x70
[937013.795196] schedule+0x4a/0xb0
[937013.795197] schedule_preempt_disabled+0xa/0x10
[937013.795198] __mutex_lock.isra.0+0x16b/0x4d0
[937013.795199] ? __switch_to_asm+0x40/0x70
[937013.795210] ? can_overcommit.part.0+0x50/0xa0 [btrfs]
[937013.795221] ? btrfs_reserve_metadata_bytes+0x6dc/0x940 [btrfs]
[937013.795232] btrfs_log_inode_parent+0x149/0xdf0 [btrfs]
[937013.795234] ? prepare_to_wait_event+0x7e/0x150
[937013.795235] ? finish_wait+0x3f/0x80
[937013.795245] ? btrfs_block_rsv_add_bytes+0x20/0x60 [btrfs]
[937013.795254] ? wait_current_trans+0x2f/0x130 [btrfs]
[937013.795256] ? _cond_resched+0x15/0x30
[937013.795264] ? join_transaction+0x24/0x410 [btrfs]
[937013.795274] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[937013.795284] btrfs_sync_file+0x25f/0x400 [btrfs]
[937013.795286] do_fsync+0x38/0x70
[937013.795286] __x64_sys_fsync+0x10/0x20
[937013.795287] do_syscall_64+0x5b/0x1c0
[937013.795289] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[937013.795289] RIP: 0033:0x7fb16c2badc7
[937013.795290] Code: Bad RIP value.
[937013.795291] RSP: 002b:00007ffc7dc31d78 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[937013.795291] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb16c2badc7
[937013.795292] RDX: 0000000000000000 RSI: 000055d92402ad3b RDI: 0000000000000003
[937013.795292] RBP: 000055d9251a9380 R08: 0000000000000001 R09: 0000000000000000
[937013.795292] R10: 000055d9240c0e00 R11: 0000000000000246 R12: 0000000000000001
[937013.795293] R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: btrfs freezing on writes
2020-04-07 19:46 btrfs freezing on writes kjansen387
@ 2020-04-07 20:11 ` Chris Murphy
2020-04-07 20:22 ` Chris Murphy
2020-04-09 4:32 ` Zygo Blaxell
1 sibling, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-04-07 20:11 UTC (permalink / raw)
To: kjansen387; +Cc: Btrfs BTRFS

On Tue, Apr 7, 2020 at 1:46 PM kjansen387 <kjansen387@gmail.com> wrote:
>
> Hello,
>
> I'm using btrfs on Fedora 31 running kernel 5.5.10-200.fc31.x86_64.
>
> I've moved my workload from md raid5 to btrfs raid1.
> # btrfs filesystem show
> Label: none uuid: 8ce9e167-57ea-4cf8-8678-3049ba028c12
> Total devices 5 FS bytes used 3.73TiB
> devid 1 size 3.64TiB used 2.53TiB path /dev/sde
> devid 2 size 3.64TiB used 2.53TiB path /dev/sdf
> devid 3 size 1.82TiB used 902.00GiB path /dev/sdb
> devid 4 size 1.82TiB used 902.03GiB path /dev/sdc
> devid 5 size 1.82TiB used 904.03GiB path /dev/sdd
>
> # btrfs fi df /export
> Data, RAID1: total=3.85TiB, used=3.72TiB
> System, RAID1: total=32.00MiB, used=608.00KiB
> Metadata, RAID1: total=6.00GiB, used=5.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> After moving to btrfs, I'm seeing freezes of ~10 seconds very, very
> often (multiple times per minute). MariaDB, vim, InfluxDB, etc.
>
> See the attachment for a stack trace of vim, and the dmesg output of 'echo w
> > /proc/sysrq-trigger', which also includes other hanging processes.
>
> What's going on? I hope someone can help.

How busy are the databases? What are the mount options for this volume?

I think some kind of write contention is possible if there are heavy
fsync writes, since the tree log is per subvolume? (Maybe a developer
can describe this correctly if I haven't.) A possible workaround is
putting each database in its own subvolume.

-- 
Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: btrfs freezing on writes
2020-04-07 20:11 ` Chris Murphy
@ 2020-04-07 20:22 ` Chris Murphy
2020-04-07 20:39 ` kjansen387
0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-04-07 20:22 UTC (permalink / raw)
To: Chris Murphy; +Cc: kjansen387, Btrfs BTRFS

On Tue, Apr 7, 2020 at 2:11 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Tue, Apr 7, 2020 at 1:46 PM kjansen387 <kjansen387@gmail.com> wrote:
> >
> > Hello,
> >
> > I'm using btrfs on Fedora 31 running kernel 5.5.10-200.fc31.x86_64.
> >
> > I've moved my workload from md raid5 to btrfs raid1.
> > # btrfs filesystem show
> > Label: none uuid: 8ce9e167-57ea-4cf8-8678-3049ba028c12
> > Total devices 5 FS bytes used 3.73TiB
> > devid 1 size 3.64TiB used 2.53TiB path /dev/sde
> > devid 2 size 3.64TiB used 2.53TiB path /dev/sdf
> > devid 3 size 1.82TiB used 902.00GiB path /dev/sdb
> > devid 4 size 1.82TiB used 902.03GiB path /dev/sdc
> > devid 5 size 1.82TiB used 904.03GiB path /dev/sdd
> >
> > # btrfs fi df /export
> > Data, RAID1: total=3.85TiB, used=3.72TiB
> > System, RAID1: total=32.00MiB, used=608.00KiB
> > Metadata, RAID1: total=6.00GiB, used=5.16GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > After moving to btrfs, I'm seeing freezes of ~10 seconds very, very
> > often (multiple times per minute). MariaDB, vim, InfluxDB, etc.
> >
> > See the attachment for a stack trace of vim, and the dmesg output of 'echo w
> > > /proc/sysrq-trigger', which also includes other hanging processes.
> >
> > What's going on? I hope someone can help.
>
> How busy are the databases? What are the mount options for this volume?
>
> I think some kind of write contention is possible if there are heavy
> fsync writes, since the tree log is per subvolume? (Maybe a developer
> can describe this correctly if I haven't.) A possible workaround is
> putting each database in its own subvolume.

You could, of course, start with just the busiest database and see if
it makes any difference. By default on Fedora, only /home and / are on
separate subvolumes. The general idea would be to use a subvolume
instead of a directory, e.g.

##shutdown the database
mv /var/lib/mysql /var/lib/mysqlold
btrfs sub create /var/lib/mysql
restorecon -Rv /var/lib/mysql
cp -r --reflink /var/lib/mysqlold/* /var/lib/mysql/
##resume database operation, and clean up mysqlold whenever

Another thing that might help, depending on the workload, is
space_cache=v2. You can safely switch to v2 via:

mount -o remount,clear_cache,space_cache=v2 /

This is a one-time command. It will set a feature flag for space cache
v2, and it'll be used automatically on subsequent mounts. It may be
busy for a minute or so while the cache is rebuilt.

You might test these separately so you have an idea of their relative
effects on the workload.

-- 
Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
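Whether the v2 space cache took effect can be checked from the active
mount options and the superblock feature flags. A minimal sketch,
assuming the filesystem from this thread mounted at /export with
/dev/sde as one member (recent btrfs-progs decode the flag names;
older versions print only the raw hex value):

$ findmnt -no OPTIONS /export | tr ',' '\n' | grep space_cache
# btrfs inspect-internal dump-super /dev/sde | grep compat_ro_flags

After the remount above, the first command should report
space_cache=v2, and the second should show the FREE_SPACE_TREE bits
once the v2 cache has been built.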
* Re: btrfs freezing on writes
2020-04-07 20:22 ` Chris Murphy
@ 2020-04-07 20:39 ` kjansen387
2020-04-07 22:11 ` Chris Murphy
0 siblings, 1 reply; 13+ messages in thread
From: kjansen387 @ 2020-04-07 20:39 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 3658 bytes --]

Hi Chris,

$ grep /export /etc/fstab
UUID=8ce9e167-57ea-4cf8-8678-3049ba028c12 /export btrfs device=/dev/sde,device=/dev/sdf 0 2

The device section is not complete, but I don't think that matters
because btrfs handles this automatically, right?

$ mount | grep /export
/dev/sde on /export type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)

I've attached sysstat info of my disks. What's obvious is that 2 disks
have the load (one is written to, the other one is the mirror), and 3
are pretty idle. But it's 2.4MB per second - that's not much!

I've just changed the space_cache to v2, but it doesn't seem to help
much. 'sync -f /export/tmp' still takes very long at times (it just
took 22 seconds!)

Is there any way we can find the cause before I move everything into
subvolumes? I'd like to avoid that if possible. It sounds a bit
overkill for 2.4MB/s of writes, and I think most of it is going to one
InfluxDB database anyway.

Thanks!

On 07-Apr-20 22:22, Chris Murphy wrote:
> On Tue, Apr 7, 2020 at 2:11 PM Chris Murphy <lists@colorremedies.com> wrote:
>>
>> On Tue, Apr 7, 2020 at 1:46 PM kjansen387 <kjansen387@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I'm using btrfs on Fedora 31 running kernel 5.5.10-200.fc31.x86_64.
>>>
>>> I've moved my workload from md raid5 to btrfs raid1.
>>> # btrfs filesystem show
>>> Label: none uuid: 8ce9e167-57ea-4cf8-8678-3049ba028c12
>>> Total devices 5 FS bytes used 3.73TiB
>>> devid 1 size 3.64TiB used 2.53TiB path /dev/sde
>>> devid 2 size 3.64TiB used 2.53TiB path /dev/sdf
>>> devid 3 size 1.82TiB used 902.00GiB path /dev/sdb
>>> devid 4 size 1.82TiB used 902.03GiB path /dev/sdc
>>> devid 5 size 1.82TiB used 904.03GiB path /dev/sdd
>>>
>>> # btrfs fi df /export
>>> Data, RAID1: total=3.85TiB, used=3.72TiB
>>> System, RAID1: total=32.00MiB, used=608.00KiB
>>> Metadata, RAID1: total=6.00GiB, used=5.16GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> After moving to btrfs, I'm seeing freezes of ~10 seconds very, very
>>> often (multiple times per minute). MariaDB, vim, InfluxDB, etc.
>>>
>>> See the attachment for a stack trace of vim, and the dmesg output of 'echo w
>>> > /proc/sysrq-trigger', which also includes other hanging processes.
>>>
>>> What's going on? I hope someone can help.
>>
>> How busy are the databases? What are the mount options for this volume?
>>
>> I think some kind of write contention is possible if there are heavy
>> fsync writes, since the tree log is per subvolume? (Maybe a developer
>> can describe this correctly if I haven't.) A possible workaround is
>> putting each database in its own subvolume.
>
> You could, of course, start with just the busiest database and see if
> it makes any difference. By default on Fedora, only /home and / are on
> separate subvolumes. The general idea would be to use a subvolume
> instead of a directory, e.g.
>
> ##shutdown the database
> mv /var/lib/mysql /var/lib/mysqlold
> btrfs sub create /var/lib/mysql
> restorecon -Rv /var/lib/mysql
> cp -r --reflink /var/lib/mysqlold/* /var/lib/mysql/
> ##resume database operation, and clean up mysqlold whenever
>
> Another thing that might help, depending on the workload, is
> space_cache=v2. You can safely switch to v2 via:
>
> mount -o remount,clear_cache,space_cache=v2 /
>
> This is a one-time command. It will set a feature flag for space cache
> v2, and it'll be used automatically on subsequent mounts. It may be
> busy for a minute or so while the cache is rebuilt.
>
> You might test these separately so you have an idea of their relative
> effects on the workload.

[-- Attachment #2: debug_sar.txt --]
[-- Type: text/plain, Size: 31067 bytes --]

$ sar -dp -s 21:00:00 -e 22:00:00 | egrep -e '(tps|sd[bcdef])'
21:00:01 DEV tps rkB/s wkB/s dkB/s areq-sz aqu-sz await %util
21:01:00 sdb 6.21 0.09 101.08 0.00 16.30 0.03 5.52 0.75
21:01:00 sdc 4.15 0.03 10.81 0.00 2.61 0.03 7.05 0.56
21:01:00 sdd 3.95 0.50 9.30 0.00 2.48 0.03 8.18 0.52
21:01:00 sde 44.66 2.08 2178.70 0.00 48.83 8.17 183.46 2.49
21:01:00 sdf 47.07 0.44 2146.21 0.00 45.61 6.53 139.14 2.50
21:02:00 sdb 7.58 0.09 156.24 0.00 20.64 0.05 6.72 0.81
21:02:00 sdc 5.78 0.02 16.38 0.00 2.84 0.05 9.00 0.67
21:02:00 sdd 4.43 0.02 11.72 0.00 2.65 0.04 8.80 0.59
21:02:00 sde 67.12 0.02 2549.52 0.00 37.99 2.39 36.12 3.09
21:02:00 sdf 71.21 0.29 2753.11 0.00 38.66 7.49 105.73 3.19
21:03:00 sdb 6.78 0.02 207.26 0.00 30.59 0.03 5.40 0.58
21:03:00 sdc 2.41 0.02 5.79 0.00 2.41 0.02 7.97 0.33
21:03:00 sdd 2.45 0.09 5.93 0.00 2.46 0.03 11.99 0.32
21:03:00 sde 34.33 0.02 1554.25 0.00 45.27 7.60 221.78 2.11
21:03:00 sdf 36.41 0.02 1740.13 0.00 47.79 8.06 221.85 2.22
21:04:00 sdb 6.73 0.02 265.20 0.00 39.43 0.02 3.77 0.45
21:04:00 sdc 1.65 0.02 4.06 0.00 2.48 0.02 11.16 0.20
21:04:00 sdd 1.71 0.02 4.60 0.00 2.69 0.03 15.21 0.19
21:04:00 sde 39.88 1.16 1927.74 0.00 48.37 10.57 265.53 2.04
21:04:00 sdf 44.11 0.16 2056.74 0.00 46.64 2.24 51.22 2.15
21:05:00 sdb 2.58 0.05 49.74 0.00 19.30 0.02 8.17 0.24
21:05:00 sdc 1.10 0.05 2.20 0.00 2.05 0.01 11.00 0.15
21:05:00 sdd 1.23 0.05 3.46 0.00 2.85 0.04 31.08 0.15
21:05:00 sde 21.69 0.05 1086.47 0.00 50.09 4.37 202.00 1.23
21:05:00 sdf 18.03 0.05 1106.38 0.00 61.37 9.84 546.46 1.22
21:06:00 sdb 1.77 0.02 35.17 0.00 19.94 0.01 7.28 0.16
21:06:00 sdc 0.67 0.02 1.53 0.00 2.34 0.01 10.25 0.08
21:06:00 sdd 0.72 0.02 1.73 0.00 2.45 0.01 20.72 0.08
21:06:00 sde 23.98 1.42 1013.76 0.00 42.33 0.42 17.94 1.25
21:06:00 sdf 20.50 0.29 1055.99 0.00 51.53 15.91 776.58 1.24
21:07:00 sdb 3.66 0.02 91.71 0.00 25.04 0.02 6.30 0.32
21:07:00 sdc 1.35 0.29 3.33 0.00 2.69 0.01 8.98 0.18
21:07:00 sdd 1.43 0.09 4.13 0.00 2.95 0.01 10.57 0.17
21:07:00 sde 29.74 0.49 1274.66 0.00 42.88 3.63 122.68 1.49
21:07:00 sdf 26.39 0.29 1351.98 0.00 51.24 11.08 420.47 1.49
21:08:00 sdb 2.36 4.55 44.61 0.00 20.80 0.02 6.92 0.22
21:08:00 sdc 0.80 0.02 1.86 0.00 2.36 0.01 12.58 0.10
21:08:00 sdd 0.87 0.02 2.33 0.00 2.72 0.01 16.15 0.10
21:08:00 sde 23.97 1.16 1128.15 0.00 47.11 12.84 536.13 1.44
21:08:00 sdf 27.02 0.02 1164.18 0.00 43.09 9.12 338.15 1.52
21:09:00 sdb 6.16 0.02 178.38 0.00 28.96 0.03 5.61 0.50
21:09:00 sdc 2.16 0.02 5.06 0.00 2.35 0.03 13.26 0.28
21:09:00 sdd 2.33 0.09 6.66 0.00 2.90 0.03 11.49 0.28
21:09:00 sde 47.24 0.02 2285.96 0.00 48.39 5.08 108.10 2.19
21:09:00 sdf 46.24 0.29 2438.23 0.00 52.73 9.52 206.36 2.41
21:10:00 sdb 3.70 0.22 113.22 0.00 30.69 0.01 4.37 0.29
21:10:00 sdc 1.12 0.02 2.60 0.00 2.35 0.01 9.06 0.14
21:10:00 sdd 1.42 1.22 3.00 0.00 2.98 0.02 14.45 0.18
21:10:00 sde 29.44 31.86 1348.12 0.00 46.88 0.48 16.75 1.54
21:10:00 sdf 24.08 0.82 1416.18 0.00 58.86 13.47 559.88 1.41
21:11:00 sdb 2.83 0.02 38.36 0.00 13.56 0.02 6.88 0.30
21:11:00 sdc 2.65 0.02 8.92 0.00 3.38 0.02 8.69 0.20
21:11:00 sdd 1.50 0.09 3.66 0.00 2.51 0.02 11.19 0.18
21:11:00 sde 28.97 0.02 1456.10 0.00 50.27 8.24 284.85 1.63
21:11:00 sdf 33.59 0.02 1582.09 0.00 47.09 4.01 119.77 1.71
21:12:00 sdb 5.63 0.02 149.48 0.00 26.56 0.03 5.40 0.52
21:12:00 sdc 2.15 0.02 5.46 0.00 2.55 0.02 7.76 0.30
21:12:00 sdd 2.23 0.02 6.86 0.00 3.09 0.02 11.04 0.28
21:12:00 sde 45.41 0.02 2012.12 0.00 44.31 2.08 46.19 2.18
21:12:00 sdf 43.61 0.02 2108.91 0.00 48.35 12.94 297.30 2.23
21:13:00 sdb 5.71 0.02 82.92 0.00 14.52 0.03 5.87 0.65
21:13:00 sdc 3.51 0.02 7.93 0.00 2.26 0.03 7.89 0.48
21:13:00 sdd 3.60 0.02 8.66 0.00 2.41 0.03 9.21 0.47
21:13:00 sde 47.12 0.02 1945.85 0.00 41.30 1.17 25.40 2.20
21:13:00 sdf 46.29 0.02 2005.59 0.00 43.33 4.54 98.58 2.22
21:14:00 sdb 3.40 5.15 34.37 0.00 11.63 0.02 7.26 0.37
21:14:00 sdc 1.78 0.69 3.86 0.00 2.56 0.01 8.64 0.24
21:14:00 sdd 1.75 0.02 4.20 0.00 2.41 0.02 13.26 0.22
21:14:00 sde 31.34 12.35 1519.93 0.00 48.89 8.77 280.39 1.76
21:14:00 sdf 35.69 6.69 1559.10 0.00 43.88 3.03 85.51 1.71
21:15:00 sdb 6.26 0.06 129.45 0.00 20.69 0.03 5.82 0.58
21:15:00 sdc 3.16 0.06 7.19 0.00 2.29 0.03 9.32 0.42
21:15:00 sdd 3.18 0.06 7.46 0.00 2.36 0.06 19.19 0.40
21:15:00 sde 49.08 0.32 2347.86 0.00 47.85 6.09 124.60 2.56
21:15:00 sdf 51.61 0.46 2328.95 0.00 45.14 4.57 89.12 2.63
21:16:00 sdb 1.90 0.02 45.42 0.00 23.94 0.01 7.19 0.18
21:16:00 sdc 0.98 0.02 2.80 0.00 2.87 0.01 10.24 0.12
21:16:00 sdd 1.07 0.29 2.93 0.00 3.02 0.01 12.55 0.12
21:16:00 sde 16.55 11.88 1008.52 0.00 61.66 15.59 942.31 1.14
21:16:00 sdf 21.99 0.69 1031.37 0.00 46.92 0.27 12.55 1.12
21:17:00 sdb 2.06 0.02 37.44 0.00 18.14 0.01 6.96 0.21
21:17:00 sdc 1.08 0.02 2.73 0.00 2.55 0.01 9.58 0.14
21:17:00 sdd 1.27 0.22 4.00 0.00 3.34 0.02 12.63 0.14
21:17:00 sde 19.22 8.42 1329.36 0.00 69.61 13.69 712.74 1.37
21:17:00 sdf 26.73 2.49 1306.71 0.00 48.98 0.61 23.24 1.34
21:18:01 sdb 6.48 0.02 104.81 0.00 16.19 0.04 7.32 0.75
21:18:01 sdc 4.01 0.02 9.32 0.00 2.33 0.03 7.99 0.56
21:18:01 sdd 4.08 0.02 9.99 0.00 2.46 0.03 7.30 0.53
21:18:01 sde 58.90 1.09 2357.38 0.00 40.04 3.71 63.55 2.75
21:18:01 sdf 59.96 0.16 2372.57 0.00 39.57 1.34 22.83 2.78
21:19:01 sdb 2.30 0.02 20.92 0.00 9.11 0.02 7.54 0.24
21:19:01 sdc 1.37 0.02 3.33 0.00 2.46 0.01 8.41 0.17
21:19:01 sdd 1.42 0.02 3.93 0.00 2.79 0.02 11.29 0.16
21:19:01 sde 31.59 0.02 1686.53 0.00 53.39 7.34 232.92 1.78
21:19:01 sdf 33.12 0.02 1610.12 0.00 48.61 9.48 286.80 1.86
21:20:01 sdb 2.61 0.29 36.16 0.00 13.95 0.01 5.72 0.27
21:20:01 sdc 1.51 0.02 3.66 0.00 2.43 0.02 12.71 0.20
21:20:01 sdd 1.55 0.02 4.20 0.00 2.73 0.02 10.18 0.19
21:20:01 sde 34.09 0.02 1615.58 0.00 47.39 8.43 247.65 1.90
21:20:01 sdf 33.79 0.02 1670.18 0.00 49.42 10.99 325.83 1.88
21:21:01 sdb 2.88 1.29 17.98 0.00 6.69 0.02 6.71 0.35
21:21:01 sdc 2.95 0.02 9.59 0.00 3.26 0.01 5.07 0.26
21:21:01 sdd 1.76 0.02 4.73 0.00 2.69 0.02 9.55 0.22
21:21:01 sde 35.70 8.75 1959.24 0.00 55.13 11.88 333.36 2.13
21:21:01 sdf 43.74 48.91 1973.09 0.00 46.23 3.60 82.75 2.47
21:22:01 sdb 3.10 0.02 16.92 0.00 5.47 0.01 5.12 0.39
21:22:01 sdc 2.18 0.02 5.53 0.00 2.55 0.02 8.05 0.30
21:22:01 sdd 2.23 0.02 5.93 0.00 2.67 0.02 8.52 0.28
21:22:01 sde 33.57 0.02 1472.57 0.00 43.86 3.44 102.90 2.03
21:22:01 sdf 32.07 0.02 1476.04 0.00 46.02 9.75 304.46 1.90
21:23:01 sdb 2.93 0.02 18.58 0.00 6.35 0.02 6.43 0.33
21:23:01 sdc 1.98 0.02 4.73 0.00 2.40 0.02 12.38 0.27
21:23:01 sdd 2.08 0.09 5.46 0.00 2.67 0.02 10.12 0.25
21:23:01 sde 42.64 0.02 1941.59 0.00 45.53 5.33 125.49 2.17
21:23:01 sdf 41.56 0.02 1942.99 0.00 46.75 10.29 248.03 2.09
21:24:01 sdb 3.95 0.09 19.21 0.00 4.89 0.03 6.79 0.50
21:24:01 sdc 3.15 0.03 7.23 0.00 2.31 0.03 9.10 0.45
21:24:01 sdd 3.21 0.03 7.97 0.00 2.49 0.03 9.65 0.43
21:24:01 sde 43.18 0.03 1820.85 0.00 42.17 5.30 123.18 2.40
21:24:01 sdf 43.68 0.03 1826.74 0.00 41.82 6.16 141.60 2.26
21:25:01 sdb 3.28 0.98 15.64 0.00 5.07 0.02 7.61 0.41
21:25:01 sdc 2.48 0.05 5.57 0.00 2.26 0.02 8.09 0.35
21:25:01 sdd 2.67 0.05 7.29 0.00 2.75 0.05 17.55 0.34
21:25:01 sde 36.86 7.07 1739.47 0.00 47.39 5.13 139.82 1.94
21:25:01 sdf 38.50 3.69 1741.46 0.00 45.33 6.78 176.60 1.92
21:26:01 sdb 0.92 0.09 3.00 0.00 3.37 0.01 9.91 0.11
21:26:01 sdc 0.75 0.02 1.80 0.00 2.43 0.01 11.84 0.10
21:26:01 sdd 0.78 0.02 2.00 0.00 2.59 0.01 14.91 0.09
21:26:01 sde 12.74 0.02 876.46 0.00 68.81 11.13 874.11 0.95
21:26:01 sdf 13.54 0.02 939.79 0.00 69.43 12.19 901.14 0.95
21:27:01 sdb 1.57 0.02 9.99 0.00 6.40 0.01 7.76 0.18
21:27:01 sdc 1.10 0.02 2.86 0.00 2.63 0.01 9.20 0.15
21:27:01 sdd 1.17 0.02 3.33 0.00 2.88 0.01 11.43 0.14
21:27:01 sde 23.73 0.02 1364.46 0.00 57.50 3.53 149.12 1.39
21:27:01 sdf 21.90 0.02 1317.10 0.00 60.15 11.87 542.60 1.31
21:28:01 sdb 1.72 0.03 8.41 0.00 4.91 0.01 7.81 0.23
21:28:01 sdc 1.40 0.03 3.27 0.00 2.35 0.01 9.50 0.20
21:28:01 sdd 1.47 0.03 3.74 0.00 2.56 0.02 13.36 0.19
21:28:01 sde 26.21 0.03 1266.63 0.00 48.33 3.38 129.37 1.39
21:28:01 sdf 24.12 0.03 1244.93 0.00 51.62 5.84 242.51 1.39
21:29:01 sdb 5.42 0.03 26.34 0.00 4.87 0.03 6.85 0.68
21:29:01 sdc 4.36 0.03 10.03 0.00 2.30 0.04 8.70 0.62
21:29:01 sdd 4.56 0.09 12.17 0.00 2.69 0.04 8.22 0.59
21:29:01 sde 60.35 0.09 2264.75 0.00 37.53 2.76 46.25 2.90
21:29:01 sdf 61.66 0.03 2203.58 0.00 35.74 3.33 54.54 2.81
21:30:01 sdb 4.32 0.02 25.15 0.00 5.83 0.03 7.85 0.55
21:30:01 sdc 3.31 0.02 7.54 0.00 2.29 0.03 10.39 0.48
21:30:01 sdd 3.36 0.02 7.94 0.00 2.37 0.02 7.23 0.44
21:30:01 sde 51.75 0.02 2065.78 0.00 39.92 0.96 19.05 2.53
21:30:01 sdf 50.07 0.55 2089.21 0.00 41.74 5.24 105.19 2.51
21:31:01 sdb 4.20 0.02 20.85 0.00 4.97 0.03 6.72 0.54
21:31:01 sdc 4.43 0.02 13.19 0.00 2.98 0.02 5.36 0.47
21:31:01 sdd 3.55 1.09 8.66 0.00 2.75 0.04 10.48 0.45
21:31:01 sde 44.56 1.89 2270.60 0.00 51.00 7.86 176.80 2.61
21:31:01 sdf 46.72 0.22 2325.54 0.00 49.78 11.52 247.11 2.62
21:32:01 sdb 5.24 0.03 24.13 0.00 4.61 0.05 9.07 0.66
21:32:01 sdc 4.41 0.03 10.72 0.00 2.44 0.04 9.17 0.62
21:32:01 sdd 4.51 0.03 12.06 0.00 2.68 0.03 7.95 0.57
21:32:01 sde 58.13 0.03 2335.79 0.00 40.18 2.63 45.75 2.89
21:32:01 sdf 58.45 0.03 2340.62 0.00 40.05 4.05 69.71 2.76
21:33:01 sdb 2.36 0.02 11.41 0.00 4.86 0.02 8.26 0.31
21:33:01 sdc 1.92 0.02 4.45 0.00 2.32 0.02 9.39 0.26
21:33:01 sdd 1.99 0.02 4.91 0.00 2.48 0.02 10.93 0.25
21:33:01 sde 28.98 0.02 1349.54 0.00 46.57 5.67 196.25 1.61
21:33:01 sdf 28.58 0.02 1316.44 0.00 46.06 2.29 80.75 1.53
21:34:01 sdb 2.50 0.03 12.89 0.00 5.17 0.02 7.48 0.29
21:34:01 sdc 1.97 0.03 4.66 0.00 2.37 0.02 10.15 0.27
21:34:01 sdd 2.01 0.03 5.19 0.00 2.60 0.02 11.58 0.25
21:34:01 sde 29.60 0.03 1298.84 0.00 43.88 1.46 49.83 1.65
21:34:01 sdf 28.88 0.03 1338.51 0.00 46.36 8.35 289.83 1.64
21:35:01 sdb 1.56 0.05 6.89 0.00 4.46 0.01 8.19 0.18
21:35:01 sdc 1.27 0.05 2.85 0.00 2.27 0.01 10.86 0.16
21:35:01 sdd 1.29 0.05 3.11 0.00 2.45 0.03 21.23 0.15
21:35:01 sde 23.72 0.05 1340.62 0.00 56.53 10.50 443.37 1.43
21:35:01 sdf 28.02 0.05 1343.13 0.00 47.94 4.09 146.62 1.40
21:36:01 sdb 1.55 17.62 6.53 0.00 15.59 0.01 6.68 0.15
21:36:01 sdc 1.40 17.88 1.53 0.00 13.88 0.01 7.00 0.12
21:36:01 sdd 1.38 17.62 1.53 0.00 13.85 0.02 11.42 0.11
21:36:01 sde 18.27 17.62 902.88 0.00 50.37 12.86 704.01 1.31
21:36:01 sdf 17.76 17.62 940.53 0.00 53.96 13.46 758.66 1.30
21:37:01 sdb 8.10 0.09 144.64 0.00 17.88 0.04 5.59 0.86
21:37:01 sdc 31.19 0.03 517.64 0.00 16.60 0.75 24.56 2.07
21:37:01 sdd 31.07 0.03 518.31 0.00 16.68 0.10 3.74 2.13
21:37:01 sde 102.70 0.29 3837.35 0.00 37.37 2.17 21.66 4.14
21:37:01 sdf 100.55 0.29 3683.64 0.00 36.64 1.11 11.53 4.08
21:38:00 sdb 5.93 0.03 26.00 0.00 4.39 0.04 6.97 0.77
21:38:00 sdc 11.69 0.03 139.47 0.00 11.93 0.12 10.68 0.93
21:38:00 sdd 11.71 0.03 140.14 0.00 11.97 0.06 5.82 0.89
21:38:00 sde 95.35 0.23 3413.91 0.00 35.81 0.93 10.23 3.89
21:38:00 sdf 96.39 0.03 3417.87 0.00 35.46 1.06 11.51 3.85
21:39:00 sdb 51.18 0.02 847.68 0.00 16.56 0.59 11.94 2.62
21:39:00 sdc 5.56 0.02 12.78 0.00 2.30 0.04 7.99 0.78
21:39:00 sdd 5.75 0.09 14.64 0.00 2.56 0.04 6.42 0.75
21:39:00 sde 68.13 0.02 3002.38 0.00 44.07 0.50 7.86 3.25
21:39:00 sdf 111.67 0.02 3827.48 0.00 34.27 2.57 23.51 4.65
21:40:01 sdb 74.28 0.09 1347.79 0.00 18.15 1.08 14.98 3.73
21:40:01 sdc 5.86 0.02 13.57 0.00 2.32 0.04 7.91 0.81
21:40:01 sdd 5.96 0.02 14.43 0.00 2.43 0.04 7.85 0.79
21:40:01 sde 63.27 0.02 3366.73 0.00 53.21 0.70 11.50 3.31
21:40:01 sdf 131.17 0.29 4649.23 0.00 35.45 1.36 10.83 5.69
21:41:01 sdb 41.49 0.09 754.77 0.00 18.19 0.69 17.12 2.51
21:41:01 sdc 8.13 0.03 21.25 0.00 2.61 0.05 6.34 1.00
21:41:01 sdd 7.00 0.03 16.70 0.00 2.39 0.04 6.68 0.92
21:41:01 sde 69.88 0.29 3200.33 0.00 45.80 1.30 19.15 3.80
21:41:01 sdf 99.06 0.56 3681.11 0.00 37.16 1.31 13.73 4.74
21:42:00 sdb 6.65 0.03 31.28 0.00 4.71 0.04 5.82 0.86
21:42:00 sdc 5.41 0.03 12.86 0.00 2.38 0.04 7.27 0.77
21:42:00 sdd 5.58 0.03 14.54 0.00 2.61 0.04 6.61 0.73
21:42:00 sde 74.96 0.03 2923.73 0.00 39.00 1.30 17.85 4.10
21:42:00 sdf 76.60 0.03 2924.33 0.00 38.18 1.47 19.71 4.01
21:43:01 sdb 6.14 0.02 25.17 0.00 4.10 0.04 6.78 0.84
21:43:01 sdc 5.23 0.02 12.02 0.00 2.30 0.04 7.64 0.74
21:43:01 sdd 5.32 0.02 12.75 0.00 2.40 0.03 6.62 0.71
21:43:01 sde 68.64 0.02 2657.53 0.00 38.72 1.27 19.07 3.61
21:43:01 sdf 68.11 0.02 2662.81 0.00 39.09 1.12 16.90 3.60
21:44:00 sdb 6.30 0.03 33.27 0.00 5.29 0.04 7.21 0.82
21:44:00 sdc 5.08 0.03 11.52 0.00 2.27 0.04 9.01 0.71
21:44:00 sdd 5.12 0.03 12.05 0.00 2.36 0.03 6.54 0.69
21:44:00 sde 68.80 0.03 2527.21 0.00 36.73 0.89 13.45 3.71
21:44:00 sdf 69.18 0.03 2542.76 0.00 36.76 1.96 28.81 3.66
21:45:01 sdb 6.40 0.06 25.49 0.00 3.99 0.04 7.03 0.85
21:45:01 sdc 5.55 0.06 12.41 0.00 2.25 0.04 8.07 0.77
21:45:01 sdd 5.60 0.32 12.94 0.00 2.37 0.07 12.99 0.72
21:45:01 sde 72.05 0.06 2806.21 0.00 38.95 0.87 12.63 3.51
21:45:01 sdf 72.91 0.32 2812.28 0.00 38.58 0.91 12.96 3.55
21:46:01 sdb 8.73 0.03 117.82 0.00 13.50 0.05 6.13 1.07
21:46:01 sdc 6.55 0.03 15.85 0.00 2.42 0.05 7.44 0.89
21:46:01 sdd 6.97 1.57 16.19 0.00 2.55 0.05 7.69 0.89
21:46:01 sde 78.81 2.71 3419.65 0.00 43.43 1.97 25.52 4.95
21:46:01 sdf 78.27 1.10 3444.50 0.00 44.02 2.33 30.33 4.72
21:47:00 sdb 5.92 0.03 31.38 0.00 5.31 0.03 6.25 0.76
21:47:00 sdc 4.58 0.03 10.46 0.00 2.29 0.03 7.63 0.65
21:47:00 sdd 4.68 0.09 11.26 0.00 2.43 0.03 6.77 0.62
21:47:00 sde 69.63 0.03 2466.38 0.00 35.42 1.98 28.94 4.25
21:47:00 sdf 69.26 0.03 2468.45 0.00 35.64 1.90 27.87 4.15
21:48:01 sdb 7.07 0.02 29.71 0.00 4.21 0.05 6.80 0.95
21:48:01 sdc 5.79 0.02 13.09 0.00 2.26 0.04 7.82 0.81
21:48:01 sdd 5.82 0.02 13.61 0.00 2.34 0.04 6.46 0.78
21:48:01 sde 76.31 0.02 2922.56 0.00 38.30 2.03 27.05 4.66
21:48:01 sdf 78.32 0.02 2932.11 0.00 37.44 2.39 31.06 4.81
21:49:01 sdb 6.45 0.02 26.88 0.00 4.17 0.04 6.20 0.86
21:49:01 sdc 5.29 0.02 11.84 0.00 2.24 0.04 7.35 0.75
21:49:01 sdd 5.32 0.02 12.38 0.00 2.33 0.03 6.19 0.72
21:49:01 sde 68.41 0.02 2402.66 0.00 35.12 2.41 35.66 4.01
21:49:01 sdf 70.06 0.02 2408.98 0.00 34.38 1.49 21.75 4.44
21:50:01 sdb 4.68 0.03 18.72 0.00 4.01 0.03 6.66 0.62
21:50:01 sdc 4.05 0.03 9.19 0.00 2.27 0.03 8.19 0.55
21:50:01 sdd 4.09 0.03 9.73 0.00 2.39 0.03 7.16 0.54
21:50:01 sde 62.16 0.03 2202.13 0.00 35.43 1.04 17.15 3.20
21:50:01 sdf 60.03 0.03 2205.58 0.00 36.74 3.44 57.78 3.12
21:51:01 sdb 7.19 0.03 29.03 0.00 4.04 0.04 6.20 0.97
21:51:01 sdc 6.96 0.03 18.55 0.00 2.67 0.04 5.75 0.82
21:51:01 sdd 5.81 0.03 13.95 0.00 2.41 0.04 6.64 0.76
21:51:01 sde 79.19 0.03 2975.14 0.00 37.57 2.01 25.83 4.79
21:51:01 sdf 80.73 0.03 2988.55 0.00 37.02 3.07 38.59 5.02
21:52:01 sdb 6.89 0.03 65.27 0.00 9.48 0.04 5.58 0.83
21:52:01 sdc 4.72 0.03 11.14 0.00 2.37 0.04 7.92 0.67
21:52:01 sdd 5.06 0.09 21.29 0.00 4.23 0.04 7.57 0.63
21:52:01 sde 76.41 0.03 2972.24 0.00 38.90 3.84 50.79 5.18
21:52:01 sdf 77.69 0.03 3009.41 0.00 38.73 3.57 46.45 5.20
21:53:01 sdb 6.07 0.02 31.33 0.00 5.17 0.04 6.21 0.77
21:53:01 sdc 4.64 0.02 10.58 0.00 2.28 0.03 7.08 0.65
21:53:01 sdd 5.02 0.02 16.36 0.00 3.26 0.04 8.33 0.65
21:53:01 sde 68.11 0.02 2519.26 0.00 36.99 1.33 20.07 3.90
21:53:01 sdf 68.61 0.02 2523.71 0.00 36.79 2.41 35.69 3.84
21:54:01 sdb 6.29 0.02 29.63 0.00 4.71 0.04 6.08 0.80
21:54:01 sdc 5.08 0.02 11.72 0.00 2.31 0.04 8.11 0.70
21:54:01 sdd 5.26 0.02 14.18 0.00 2.70 0.04 7.26 0.68
21:54:01 sde 69.39 0.02 2613.95 0.00 37.67 1.29 19.08 3.70
21:54:01 sdf 67.85 0.02 2618.41 0.00 38.59 1.56 23.50 3.67
21:55:00 sdb 7.03 0.05 30.49 0.00 4.34 0.06 9.19 0.91
21:55:00 sdc 5.81 0.05 13.36 0.00 2.31 0.05 9.83 0.82
21:55:00 sdd 6.09 0.05 16.59 0.00 2.73 0.07 11.96 0.78
21:55:00 sde 66.03 0.12 2727.33 0.00 41.30 0.99 15.43 3.53
21:55:00 sdf 66.49 0.05 2731.70 0.00 41.09 1.24 19.20 3.52
21:56:01 sdb 5.13 0.02 21.36 0.00 4.17 0.03 7.05 0.66
21:56:01 sdc 4.65 0.02 12.07 0.00 2.60 0.03 7.43 0.58
21:56:01 sdd 4.19 0.02 10.09 0.00 2.42 0.04 8.84 0.56
21:56:01 sde 48.02 0.02 2200.92 0.00 45.83 4.83 101.11 3.05
21:56:01 sdf 50.43 0.02 2208.57 0.00 43.80 5.34 106.47 3.13
21:57:01 sdb 5.87 0.02 24.63 0.00 4.20 0.04 7.44 0.76
21:57:01 sdc 4.98 0.02 11.65 0.00 2.35 0.04 8.58 0.70
21:57:01 sdd 5.09 0.02 12.45 0.00 2.45 0.03 7.14 0.67
21:57:01 sde 59.42 0.56 2439.73 0.00 41.07 2.49 42.36 3.10
21:57:01 sdf 61.40 0.02 2446.59 0.00 39.85 0.79 13.36 3.18
21:58:01 sdb 6.36 0.03 30.14 0.00 4.74 0.04 6.25 0.85
21:58:01 sdc 5.30 0.03 12.04 0.00 2.28 0.04 7.80 0.74
21:58:01 sdd 5.40 0.03 12.85 0.00 2.38 0.04 6.80 0.72
21:58:01 sde 58.39 0.29 2092.45 0.00 35.84 0.56 10.04 2.83
21:58:01 sdf 58.92 0.03 2102.81 0.00 35.69 0.60 10.75 2.87
21:59:00 sdb 7.47 0.03 61.00 0.00 8.16 0.04 6.25 0.96
21:59:00 sdc 5.68 0.03 12.87 0.00 2.27 0.04 7.30 0.81
21:59:00 sdd 5.70 0.03 13.41 0.00 2.36 0.04 7.63 0.76
21:59:00 sde 63.15 0.29 2417.43 0.00 38.29 0.68 11.27 3.29
21:59:00 sdf 64.77 0.03 2459.33 0.00 37.97 1.16 18.47 3.38
Average: sdb 7.33 0.54 101.03 0.00 13.86 0.07 9.65 0.68
Average: sdc 4.07 0.35 19.11 0.00 4.78 0.04 10.41 0.50
Average: sdd 4.02 0.42 19.50 0.00 4.95 0.03 8.29 0.48
Average: sde 49.50 1.92 2101.35 0.00 42.49 4.44 90.20 2.66
Average: sdf 52.62 1.49 2167.92 0.00 41.22 5.22 99.74 2.76

^ permalink raw reply [flat|nested] 13+ messages in thread
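For pinning down the stalls more directly than sar can, one option is
to time the sync-family syscalls of an affected process. A minimal
sketch, assuming strace is installed and <pid> is a placeholder for
the pid of e.g. influxd or mysqld:

$ strace -f -T -e trace=fsync,fdatasync -p <pid>

The -T flag makes strace print the time spent in each syscall, so a
~10 second freeze shows up directly as an fsync() or fdatasync() call
with a ~10s duration attached.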
* Re: btrfs freezing on writes
2020-04-07 20:39 ` kjansen387
@ 2020-04-07 22:11 ` Chris Murphy
2020-04-07 22:30 ` kjansen387
0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-04-07 22:11 UTC (permalink / raw)
To: kjansen387; +Cc: Btrfs BTRFS, Qu Wenruo

On Tue, Apr 7, 2020 at 2:39 PM kjansen387 <kjansen387@gmail.com> wrote:

> $ grep /export /etc/fstab
> UUID=8ce9e167-57ea-4cf8-8678-3049ba028c12 /export btrfs
> device=/dev/sde,device=/dev/sdf 0 2

I'd use only noatime for options. There's no advantage to specifying
devices. And also fs_passno can be 0. fsck.btrfs is a no-op anyway, so
it doesn't hurt anything to leave it at 2.

> I've attached sysstat info of my disks. What's obvious is that 2 disks
> have the load (one is written to, the other one is the mirror), and 3
> are pretty idle. But it's 2.4MB per second - that's not much!

Lots of small file writes maybe? What does iostat show for utilization?
Or vmstat for I/O? Hard drives of course have limited IO per second.

> I've just changed the space_cache to v2, but it doesn't seem to help
> much. 'sync -f /export/tmp' still takes very long at times (it just
> took 22 seconds!)

I just did a strace on this command and it uses syncfs, not fsync. I'm
pretty sure on Btrfs this is a full filesystem sync, which is
expensive: all data and metadata. So if it's very dirty, yeah, it could
take some time to flush everything to disk.

Try it without -f and you'll get fsync.

> Is there any way we can find the cause before I move everything into
> subvolumes? I'd like to avoid that if possible. It sounds a bit
> overkill for 2.4MB/s of writes, and I think most of it is going to one
> InfluxDB database anyway.

From the sysrq...

[937013.794093] mysqld D 0 10400 1 0x00004000
[937013.794240] do_fsync+0x38/0x70

[937013.794253] mysqld D 0 10412 1 0x00004000
[937013.794297] do_fsync+0x38/0x70

[937013.794306] mysqld D 0 10421 1 0x00004000
[937013.794353] do_fsync+0x38/0x70

[937013.794788] WTJourn.Flusher D 0 1186978 1186954 0x00004320
[937013.794894] do_fsync+0x38/0x70

[937013.794903] ftdc D 0 1186980 1186954 0x00000320
[937013.794951] ? btrfs_create+0x200/0x200 [btrfs]

[937013.795022] influxd D 0 2082782 1 0x00004000
[937013.795086] do_fsync+0x38/0x70

[937013.795098] influxd D 0 2082804 1 0x00004000
[937013.795143] do_fsync+0x38/0x70

[937013.795191] vim D 0 2228648 2223845 0x00004000
[937013.795286] do_fsync+0x38/0x70

Quite a lot of fsync all at once.

But I'm not very good at parsing kernel output. Maybe a developer will
have input.

-- 
Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
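The syncfs-versus-fsync difference described above is easy to see
under strace. A small sketch, assuming coreutils sync (which calls
syncfs(2) when given -f and fsync(2) otherwise) and the /export/tmp
path from above:

$ strace -e trace=fsync,syncfs sync -f /export/tmp
$ strace -e trace=fsync,syncfs sync /export/tmp

The first invocation should show a single syncfs() call (a full
filesystem flush), the second an fsync() of just that path.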
* Re: btrfs freezing on writes
2020-04-07 22:11 ` Chris Murphy
@ 2020-04-07 22:30 ` kjansen387
0 siblings, 0 replies; 13+ messages in thread
From: kjansen387 @ 2020-04-07 22:30 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS, Qu Wenruo

Ah, true, sync -f does a syncfs. sync without -f on an existing file is
always fast. But when I strace vim, fsync is slow (sometimes >10
seconds) because it needs to write something.

Hopefully a developer can reply - thanks so far!

On 08-Apr-20 00:11, Chris Murphy wrote:
> On Tue, Apr 7, 2020 at 2:39 PM kjansen387 <kjansen387@gmail.com> wrote:
>
>> $ grep /export /etc/fstab
>> UUID=8ce9e167-57ea-4cf8-8678-3049ba028c12 /export btrfs
>> device=/dev/sde,device=/dev/sdf 0 2
>
> I'd use only noatime for options. There's no advantage to specifying
> devices. And also fs_passno can be 0. fsck.btrfs is a no-op anyway, so
> it doesn't hurt anything to leave it at 2.
>
>> I've attached sysstat info of my disks. What's obvious is that 2 disks
>> have the load (one is written to, the other one is the mirror), and 3
>> are pretty idle. But it's 2.4MB per second - that's not much!
>
> Lots of small file writes maybe? What does iostat show for utilization?
> Or vmstat for I/O? Hard drives of course have limited IO per second.
>
>> I've just changed the space_cache to v2, but it doesn't seem to help
>> much. 'sync -f /export/tmp' still takes very long at times (it just
>> took 22 seconds!)
>
> I just did a strace on this command and it uses syncfs, not fsync. I'm
> pretty sure on Btrfs this is a full filesystem sync, which is
> expensive: all data and metadata. So if it's very dirty, yeah, it could
> take some time to flush everything to disk.
>
> Try it without -f and you'll get fsync.
>
>> Is there any way we can find the cause before I move everything into
>> subvolumes? I'd like to avoid that if possible. It sounds a bit
>> overkill for 2.4MB/s of writes, and I think most of it is going to one
>> InfluxDB database anyway.
>
> From the sysrq...
>
> [937013.794093] mysqld D 0 10400 1 0x00004000
> [937013.794240] do_fsync+0x38/0x70
>
> [937013.794253] mysqld D 0 10412 1 0x00004000
> [937013.794297] do_fsync+0x38/0x70
>
> [937013.794306] mysqld D 0 10421 1 0x00004000
> [937013.794353] do_fsync+0x38/0x70
>
> [937013.794788] WTJourn.Flusher D 0 1186978 1186954 0x00004320
> [937013.794894] do_fsync+0x38/0x70
>
> [937013.794903] ftdc D 0 1186980 1186954 0x00000320
> [937013.794951] ? btrfs_create+0x200/0x200 [btrfs]
>
> [937013.795022] influxd D 0 2082782 1 0x00004000
> [937013.795086] do_fsync+0x38/0x70
>
> [937013.795098] influxd D 0 2082804 1 0x00004000
> [937013.795143] do_fsync+0x38/0x70
>
> [937013.795191] vim D 0 2228648 2223845 0x00004000
> [937013.795286] do_fsync+0x38/0x70
>
> Quite a lot of fsync all at once.
>
> But I'm not very good at parsing kernel output. Maybe a developer will
> have input.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: btrfs freezing on writes
2020-04-07 19:46 btrfs freezing on writes kjansen387
2020-04-07 20:11 ` Chris Murphy
@ 2020-04-09 4:32 ` Zygo Blaxell
2020-04-09 21:53 ` kjansen387
1 sibling, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-04-09 4:32 UTC (permalink / raw)
To: kjansen387; +Cc: linux-btrfs

On Tue, Apr 07, 2020 at 09:46:29PM +0200, kjansen387 wrote:
> Hello,
>
> I'm using btrfs on Fedora 31 running kernel 5.5.10-200.fc31.x86_64.
>
> I've moved my workload from md raid5 to btrfs raid1.
> # btrfs filesystem show
> Label: none uuid: 8ce9e167-57ea-4cf8-8678-3049ba028c12
> Total devices 5 FS bytes used 3.73TiB
> devid 1 size 3.64TiB used 2.53TiB path /dev/sde
> devid 2 size 3.64TiB used 2.53TiB path /dev/sdf
> devid 3 size 1.82TiB used 902.00GiB path /dev/sdb
> devid 4 size 1.82TiB used 902.03GiB path /dev/sdc
> devid 5 size 1.82TiB used 904.03GiB path /dev/sdd
>
> # btrfs fi df /export
> Data, RAID1: total=3.85TiB, used=3.72TiB
> System, RAID1: total=32.00MiB, used=608.00KiB
> Metadata, RAID1: total=6.00GiB, used=5.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

Try 'btrfs fi usage /export'. I'm guessing most or all of your metadata
is on sde and sdf, because that's where the raid1 allocator would
initially put it if you put those drives together in 'mkfs.btrfs'.
'fi usage' will confirm or refute that guess.

If this is true, then the following set of commands may help:

btrfs fi resize 1:-4g /export;
btrfs fi resize 2:-4g /export;
btrfs balance start -mdevid=1 /export;
btrfs fi resize 1:max /export;
btrfs fi resize 2:max /export;

Right now your filesystem has equal space unallocated on each drive.
The first resize commands create a temporary shortage of space on the
2 largest drives, which will cause the allocator to place new metadata
on the 3 smaller ones to equalize free space during the balance. The
last 2 resize commands let you use all the space again--they will be
filled in with later data allocations.

A 4G size reduction is required because you have 5 GB of metadata and
5 disks. 5 GB / 5 disks = 1 GB per disk, and 5 GB (total metadata on
disks 1,2) - 1 GB (desired metadata on each disk) = 4 GB (how much
metadata we want to displace from the bigger disks).

This will probably take 2-3 hours, and the system will be very laggy
while it runs. Best to leave it overnight.

I see that 'space_cache=v2' has already been suggested. That usually
helps a lot, especially on big filesystems. space_cache=v2 does an
amount of work proportional to committed data size during a commit,
while space_cache=v1 does an amount of work proportional to
*filesystem* size (and committed data size too, but that is usually
negligible if your workload is nothing but write and fsync).

> After moving to btrfs, I'm seeing freezes of ~10 seconds very, very often
> (multiple times per minute). MariaDB, vim, InfluxDB, etc.
>
> See the attachment for a stack trace of vim, and the dmesg output of 'echo w
> > /proc/sysrq-trigger', which also includes other hanging processes.
>
> What's going on? I hope someone can help.
>
> Thanks,
> Klaas

> [937013.794253] mysqld D 0 10412 1 0x00004000
> [937013.794254] Call Trace:
> [937013.794255] ? __schedule+0x2c7/0x740
> [937013.794256] schedule+0x4a/0xb0
> [937013.794264] wait_for_commit+0x58/0x80 [btrfs]
> [937013.794266] ? finish_wait+0x80/0x80
> [937013.794274] btrfs_commit_transaction+0x87b/0xa20 [btrfs]
> [937013.794276] ? dput+0xb6/0x2d0
> [937013.794286] ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
> [937013.794295] btrfs_sync_file+0x3b2/0x400 [btrfs]
> [937013.794297] do_fsync+0x38/0x70
> [937013.794298] __x64_sys_fsync+0x10/0x20
> [937013.794299] do_syscall_64+0x5b/0x1c0
> [937013.794300] entry_SYSCALL_64_after_hwframe+0x44/0xa9

If fsync() can't do everything it wants to do with the (very fast) log
tree, it switches to a (not fast at all) full filesystem commit. When
that happens, the process calling fsync is stuck waiting for another
process to complete a transaction so it can start its own.

> [937013.795098] influxd D 0 2082804 1 0x00004000
> [937013.795099] Call Trace:
> [937013.795100] ? __schedule+0x2c7/0x740
> [937013.795102] schedule+0x4a/0xb0
> [937013.795110] wait_for_commit+0x58/0x80 [btrfs]
> [937013.795112] ? finish_wait+0x80/0x80
> [937013.795120] btrfs_commit_transaction+0x87b/0xa20 [btrfs]
> [937013.795121] ? dput+0xb6/0x2d0
> [937013.795132] ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
> [937013.795141] btrfs_sync_file+0x3b2/0x400 [btrfs]
> [937013.795143] do_fsync+0x38/0x70
> [937013.795144] __x64_sys_fsync+0x10/0x20
> [937013.795145] do_syscall_64+0x5b/0x1c0
> [937013.795146] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This one started the transaction that mysqld's fsync is waiting for.

You also have this:

> [937013.794385] postmaster D 0 3216 2409 0x00004000
> [937013.794385] Call Trace:
> [937013.794387] ? __schedule+0x2c7/0x740
> [937013.794388] schedule+0x4a/0xb0
> [937013.794388] schedule_preempt_disabled+0xa/0x10
> [937013.794390] __mutex_lock.isra.0+0x16b/0x4d0
> [937013.794398] ? start_transaction+0xbb/0x4c0 [btrfs]
> [937013.794408] btrfs_pin_log_trans+0x19/0x30 [btrfs]
> [937013.794417] btrfs_rename2+0x25b/0x1f60 [btrfs]
> [937013.794419] ? link_path_walk.part.0+0x74/0x540
> [937013.794420] ? path_parentat.isra.0+0x3f/0x80
> [937013.794421] ? inode_permission+0xad/0x140
> [937013.794422] ? vfs_rename+0x3e1/0x9e0
> [937013.794430] ? btrfs_create+0x200/0x200 [btrfs]
> [937013.794431] vfs_rename+0x3e1/0x9e0
> [937013.794433] ? __d_lookup+0x5e/0x140
> [937013.794434] ? lookup_dcache+0x18/0x60
> [937013.794435] do_renameat2+0x381/0x530
> [937013.794437] __x64_sys_rename+0x1f/0x30
> [937013.794438] do_syscall_64+0x5b/0x1c0
> [937013.794439] entry_SYSCALL_64_after_hwframe+0x44/0xa9

I don't know if this is renaming over an existing file, but if you do
delete or truncate files (including renames over existing files) a lot,
you might be accumulating delayed refs (increases or decreases in
reference counts are queued up and performed in batches on btrfs
because there are so many). Try 'slabtop' and see if there are a lot
of 'btrfs_delayed_ref_head' slabs ("a lot" is more than 10,000).

It's not likely you will hit this problem unless you are deleting a lot
of files, deleting a snapshot, or running a balance; however, if you
are doing those things, they certainly won't help with commit latency.
If you're hitting that bug, there are fixes in development that may
help (probably one or two kernel releases out though).

^ permalink raw reply [flat|nested] 13+ messages in thread
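The delayed-ref count that slabtop reports can also be read straight
from /proc/slabinfo. A sketch, assuming root and that the kernel has
not merged the btrfs_delayed_ref_head slab cache into another one (in
which case the name won't appear):

# grep btrfs_delayed_ref_head /proc/slabinfo | awk '{print $1, $2, $3}'

The second and third fields are active and total object counts;
sustained values above the ~10,000 mentioned above would point at the
delayed-ref backlog described here.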
* Re: btrfs freezing on writes
2020-04-09 4:32 ` Zygo Blaxell
@ 2020-04-09 21:53 ` kjansen387
2020-04-09 23:07 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: kjansen387 @ 2020-04-09 21:53 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

Hi Zygo,

Thanks for your reply!

> Try 'btrfs fi usage /export'.

I'm removing /dev/sdd at the moment; my usage now (metadata usage was
the same a few weeks back):

# btrfs fi usage /export
Overall:
    Device size:                  10.92TiB
    Device allocated:              7.58TiB
    Device unallocated:            3.34TiB
    Device missing:                  0.00B
    Used:                          7.45TiB
    Free (estimated):              1.73TiB      (min: 1.73TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:3.78TiB, Used:3.72TiB (98.35%)
    /dev/sde        2.53TiB
    /dev/sdf        2.52TiB
    /dev/sdb      869.00GiB
    /dev/sdc      868.00GiB
    /dev/sdd      838.00GiB

Metadata,RAID1: Size:6.00GiB, Used:5.18GiB (86.40%)
    /dev/sde        4.00GiB
    /dev/sdf        5.00GiB
    /dev/sdb        1.00GiB
    /dev/sdc        1.00GiB
    /dev/sdd        1.00GiB

System,RAID1: Size:32.00MiB, Used:592.00KiB (1.81%)
    /dev/sdc       32.00MiB
    /dev/sdd       32.00MiB

Unallocated:
    /dev/sde        1.11TiB
    /dev/sdf        1.11TiB
    /dev/sdb      993.02GiB
    /dev/sdc      993.99GiB
    /dev/sdd     -839.03GiB

What this suggests to me is that I don't have 5GB of metadata - I have
12GB?

> btrfs fi resize 1:-4g /export;
> btrfs fi resize 2:-4g /export;
> btrfs balance start -mdevid=1 /export;
> btrfs fi resize 1:max /export;
> btrfs fi resize 2:max /export;

I'm moving from 5 to 4 disks for the time being; assuming metadata
stays the same, I guess I'd have to aim for 3GB of metadata per disk.
Do I have to change the commands like this?

btrfs fi resize 1:-1g /export; # Assuming 4GB metadata
btrfs fi resize 2:-2g /export; # Assuming 5GB metadata
btrfs balance start -mdevid=1 /export; # Why only devid 1, and not 2?
btrfs fi resize 1:max /export;
btrfs fi resize 2:max /export;

Thanks!

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: btrfs freezing on writes 2020-04-09 21:53 ` kjansen387 @ 2020-04-09 23:07 ` Zygo Blaxell 2020-04-11 19:46 ` kjansen387 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-04-09 23:07 UTC (permalink / raw) To: kjansen387; +Cc: linux-btrfs On Thu, Apr 09, 2020 at 11:53:00PM +0200, kjansen387 wrote: > Hi Zygo, > > Thanks for your reply! > > > Try 'btrfs fi usage /export'. > I'm removing /dev/sdd at the moment, my usage now (metadata usage was the > same a few weeks back): > > # btrfs fi usage /export > Overall: > Device size: 10.92TiB > Device allocated: 7.58TiB > Device unallocated: 3.34TiB > Device missing: 0.00B > Used: 7.45TiB > Free (estimated): 1.73TiB (min: 1.73TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,RAID1: Size:3.78TiB, Used:3.72TiB (98.35%) > /dev/sde 2.53TiB > /dev/sdf 2.52TiB > /dev/sdb 869.00GiB > /dev/sdc 868.00GiB > /dev/sdd 838.00GiB > > Metadata,RAID1: Size:6.00GiB, Used:5.18GiB (86.40%) > /dev/sde 4.00GiB > /dev/sdf 5.00GiB > /dev/sdb 1.00GiB > /dev/sdc 1.00GiB > /dev/sdd 1.00GiB > > System,RAID1: Size:32.00MiB, Used:592.00KiB (1.81%) > /dev/sdc 32.00MiB > /dev/sdd 32.00MiB > > Unallocated: > /dev/sde 1.11TiB > /dev/sdf 1.11TiB > /dev/sdb 993.02GiB > /dev/sdc 993.99GiB > /dev/sdd -839.03GiB > > > What this suggests to me is that I don't have 5GB metadata - I have 12GB ? It's RAID1, so there are 2 sets of 6x 1 GB block groups. Each block group consists of a mirrored pair of 1GB chunks on 2 different disks. They are not all on sde and sdf as I had guessed before. The worst case would be that 4GB are mirrored between sde and sdf, then 1GB between sdc and sdd, and 1GB between sdf and sdb (i.e. sdf has 4GB mirrored with sde and 1GB mirrored with sdb for a total of 5GB on sdf), but certainly other combinations are possible to arrive at those numbers. "Metadata,RAID1: Size:6.00GiB, Used:5.18GiB" is referring to logical sizes--you have 6 logical GB allocted for 5.18 logical GB of metadata--while the lines that follow are referring to physical sizes on each disk (a total of 12 GB allocated on all disks). You'll notice system and data follow the same pattern. > > btrfs fi resize 1:-4g /export; > > btrfs fi resize 2:-4g /export; > > btrfs balance start -mdevid=1 /export; > > btrfs fi resize 1:max /export; > > btrfs fi resize 2:max /export; > > I'm moving from 5 to 4 disks for the time being, assuming metadata stays the > same I guess I'd have to aim for 3GB metadata per disk. Do I have to change > the commands like this ? > > btrfs fi resize 1:-1g /export; # Assuming 4GB metadata > btrfs fi resize 2:-2g /export; # Assuming 5GB metadata Based on current data, yes; however, it's possible that the device remove you are already running might balance the metadata as a side-effect. Redo the math with the values you get after the device remove is done. You may not need to balance anything. > btrfs balance start -mdevid=1 /export; # Why only devid 1, and not 2 ? We want balance to relocate metadata block groups that are on both devids 1 and 2, i.e. the BG has a chunk on both drives at the same time. Balance filters only allow one devid to be specified, but in this case 'devid=1' or 'devid=2' is close enough. All we want to do here is filter out block groups where one mirror chunk is already on devid 3, 4, or 5, since that would just place the metadata somewhere else on the same disks. > btrfs fi resize 1:max /export; > btrfs fi resize 2:max /export; > > Thanks! 
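To make the arithmetic explicit, using only the numbers from the usage
output above: the profile header reports logical sizes, and multiplying
by the ratio gives the physical total spread across the devices.

    logical allocated:   6.00GiB   (the Metadata,RAID1 "Size" value)
    metadata ratio:      2.00
    physical allocated:  6.00GiB x 2.00 = 12.00GiB

which matches the per-device sum: 4 + 5 + 1 + 1 + 1 = 12GiB.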
* Re: btrfs freezing on writes
  2020-04-09 23:07 ` Zygo Blaxell
@ 2020-04-11 19:46 ` kjansen387
  2020-04-11 19:59 ` Chris Murphy
  2020-04-11 20:21 ` Zygo Blaxell
  0 siblings, 2 replies; 13+ messages in thread
From: kjansen387 @ 2020-04-11 19:46 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

I have tried to rebalance the metadata.

Starting point:

# btrfs fi usage /storage
Overall:
    Device size:                  10.92TiB
    Device allocated:              7.45TiB
    Device unallocated:            3.47TiB
    Device missing:                  0.00B
    Used:                          7.35TiB
    Free (estimated):              1.78TiB      (min: 1.78TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:3.72TiB, Used:3.67TiB (98.74%)
   /dev/sdc        2.81TiB
   /dev/sdb        2.81TiB
   /dev/sda     1017.00GiB
   /dev/sdd      840.00GiB

Metadata,RAID1: Size:6.00GiB, Used:5.09GiB (84.86%)
   /dev/sdc        3.00GiB
   /dev/sdb        3.00GiB
   /dev/sda        1.00GiB
   /dev/sdd        5.00GiB

System,RAID1: Size:32.00MiB, Used:608.00KiB (1.86%)
   /dev/sdb       32.00MiB
   /dev/sdd       32.00MiB

Unallocated:
   /dev/sdc      845.02GiB
   /dev/sdb      845.99GiB
   /dev/sda      845.02GiB
   /dev/sdd     1017.99GiB

I did:

# btrfs fi resize 4:-2g /storage/
# btrfs balance start -mdevid=4 /storage
# btrfs fi resize 4:max /storage/

but the distribution of metadata ended up the same as before.

I also tried (to match the free space of the other disks):

# btrfs fi resize 4:-172g /storage/
# btrfs balance start -mdevid=4 /storage
# btrfs fi resize 4:max /storage/

Again, the distribution of metadata ended up the same as before.

Any other tips to rebalance metadata?

On 10-Apr-20 01:07, Zygo Blaxell wrote:
> On Thu, Apr 09, 2020 at 11:53:00PM +0200, kjansen387 wrote:
>> btrfs fi resize 1:-1g /export;  # Assuming 4GB metadata
>> btrfs fi resize 2:-2g /export;  # Assuming 5GB metadata
>
> Based on current data, yes; however, it's possible that the device
> remove you are already running will balance the metadata as a side
> effect. Redo the math with the values you get after the device remove
> is done. You may not need to balance anything.
>
>> btrfs balance start -mdevid=1 /export;  # Why only devid 1, and not 2?
>
> We want balance to relocate metadata block groups that are on both
> devids 1 and 2, i.e. where the block group has a chunk on both drives
> at the same time. Balance filters only allow one devid to be specified,
> but in this case 'devid=1' or 'devid=2' is close enough. All we want to
> do here is filter out block groups where one mirror chunk is already on
> devid 3, 4, or 5, since balancing those would just place the metadata
> somewhere else on the same disks.
>
>> btrfs fi resize 1:max /export;
>> btrfs fi resize 2:max /export;
>>
>> Thanks!
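One way to verify where the metadata chunks actually landed after each
attempt is the per-device report - a sketch using standard btrfs-progs
subcommands, which break allocation down by device and by profile:

# Per-device breakdown (Data/Metadata/System allocation on each disk)
btrfs device usage /storage

# Filesystem-wide view, as used above
btrfs fi usage /storage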
* Re: btrfs freezing on writes
  2020-04-11 19:46 ` kjansen387
@ 2020-04-11 19:59 ` Chris Murphy
  2020-04-11 20:21 ` Zygo Blaxell
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2020-04-11 19:59 UTC (permalink / raw)
To: kjansen387; +Cc: Zygo Blaxell, Btrfs BTRFS

I don't know how much the size of the metadata chunks matters with
this, but I see 256M, 512M, and 1G metadata chunks on my file systems.

Try this for the length only, which is in bytes:

$ sudo btrfs insp dump-t -t chunk /dev/nvme0n1p7 | grep METADATA

This will show the entire item; -A5 might be needed to show both
dev_uuids, since you're using raid1 and might want to know on which
device each chunk is located (the devuuid is found in the superblock):

$ sudo btrfs insp dump-t -t chunk /dev/nvme0n1p7 | grep -A4 -B1 METADATA

Chris Murphy
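Building on Chris's command, the chunk lengths can also be totalled per
profile with a short pipeline - a sketch, assuming the dump-tree output
format of recent btrfs-progs, where each chunk's 'length' and 'type'
fields appear on the same line:

# Total logical size of all metadata chunks, in GiB
$ sudo btrfs insp dump-t -t chunk /dev/nvme0n1p7 | \
      awk '/type METADATA/ { sum += $2 } END { print sum / 2^30, "GiB" }'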
* Re: btrfs freezing on writes
  2020-04-11 19:46 ` kjansen387
  2020-04-11 19:59 ` Chris Murphy
@ 2020-04-11 20:21 ` Zygo Blaxell
  2020-04-15 11:14 ` kjansen387
  1 sibling, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-04-11 20:21 UTC (permalink / raw)
To: kjansen387; +Cc: linux-btrfs

On Sat, Apr 11, 2020 at 09:46:43PM +0200, kjansen387 wrote:
> I have tried to rebalance the metadata.
>
> Starting point:
>
> # btrfs fi usage /storage
> Overall:
>     Device size:                  10.92TiB
>     Device allocated:              7.45TiB
>     Device unallocated:            3.47TiB
>     Device missing:                  0.00B
>     Used:                          7.35TiB
>     Free (estimated):              1.78TiB      (min: 1.78TiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,RAID1: Size:3.72TiB, Used:3.67TiB (98.74%)
>    /dev/sdc        2.81TiB
>    /dev/sdb        2.81TiB
>    /dev/sda     1017.00GiB
>    /dev/sdd      840.00GiB
>
> Metadata,RAID1: Size:6.00GiB, Used:5.09GiB (84.86%)
>    /dev/sdc        3.00GiB
>    /dev/sdb        3.00GiB
>    /dev/sda        1.00GiB
>    /dev/sdd        5.00GiB
>
> System,RAID1: Size:32.00MiB, Used:608.00KiB (1.86%)
>    /dev/sdb       32.00MiB
>    /dev/sdd       32.00MiB
>
> Unallocated:
>    /dev/sdc      845.02GiB
>    /dev/sdb      845.99GiB
>    /dev/sda      845.02GiB
>    /dev/sdd     1017.99GiB
>
> I did:
>
> # btrfs fi resize 4:-2g /storage/
> # btrfs balance start -mdevid=4 /storage
> # btrfs fi resize 4:max /storage/
>
> but the distribution of metadata ended up the same as before.
>
> I also tried (to match the free space of the other disks):
>
> # btrfs fi resize 4:-172g /storage/
> # btrfs balance start -mdevid=4 /storage
> # btrfs fi resize 4:max /storage/
>
> Again, the distribution of metadata ended up the same as before.
>
> Any other tips to rebalance metadata?

The purpose of resize -2g was to make a little less unallocated space
on one drive compared to all the others, starting with all the drives
having equal unallocated space. The purpose of resize -172g was to
make the extra unallocated space on sdd go away, so it would be equal
to the other 3 drives. You have to do _both_ of those before the
balance. Or just add the two numbers, i.e. resize -174g.

If you really want to be sure, resize by -200g (far more than
necessary), then balance start -mlimit=4,devid=4. The balance says
"I know there are exactly 5 block groups now, and I want to leave
exactly one behind," and the resize says "I want no possibility of new
block groups on sdd for some time."

> On 10-Apr-20 01:07, Zygo Blaxell wrote:
> > On Thu, Apr 09, 2020 at 11:53:00PM +0200, kjansen387 wrote:
> > > btrfs fi resize 1:-1g /export;  # Assuming 4GB metadata
> > > btrfs fi resize 2:-2g /export;  # Assuming 5GB metadata
> >
> > Based on current data, yes; however, it's possible that the device
> > remove you are already running will balance the metadata as a side
> > effect. Redo the math with the values you get after the device remove
> > is done. You may not need to balance anything.
> >
> > > btrfs balance start -mdevid=1 /export;  # Why only devid 1, and not 2?
> >
> > We want balance to relocate metadata block groups that are on both
> > devids 1 and 2, i.e. where the block group has a chunk on both drives
> > at the same time. Balance filters only allow one devid to be specified,
> > but in this case 'devid=1' or 'devid=2' is close enough. All we want to
> > do here is filter out block groups where one mirror chunk is already on
> > devid 3, 4, or 5, since balancing those would just place the metadata
> > somewhere else on the same disks.
> >
> > > btrfs fi resize 1:max /export;
> > > btrfs fi resize 2:max /export;
> > >
> > > Thanks!
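Spelled out as commands, the cautious variant Zygo describes would look
like this for devid 4 (sdd) on this filesystem - a sketch assembled
from his description, with the deliberately generous -200g shrink:

# 1. Shrink sdd far below the other drives' unallocated space, so the
#    allocator cannot place any new chunk on it during the balance
btrfs fi resize 4:-200g /storage

# 2. Relocate at most 4 of the 5 metadata block groups touching devid 4;
#    their replacement chunks must land on the other three drives
btrfs balance start -mlimit=4,devid=4 /storage

# 3. Grow sdd back to its full size
btrfs fi resize 4:max /storage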
* Re: btrfs freezing on writes
  2020-04-11 20:21 ` Zygo Blaxell
@ 2020-04-15 11:14 ` kjansen387
  0 siblings, 0 replies; 13+ messages in thread
From: kjansen387 @ 2020-04-15 11:14 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

OK, thanks. I did try 2g on top of 172g and that didn't work, but after
playing with 'limit' it did; all disks now have 3GB of metadata.

On 11-Apr-20 22:21, Zygo Blaxell wrote:
> The purpose of resize -2g was to make a little less unallocated space
> on one drive compared to all the others, starting with all the drives
> having equal unallocated space. The purpose of resize -172g was to
> make the extra unallocated space on sdd go away, so it would be equal
> to the other 3 drives. You have to do _both_ of those before the
> balance. Or just add the two numbers, i.e. resize -174g.
>
> If you really want to be sure, resize by -200g (far more than
> necessary), then balance start -mlimit=4,devid=4. The balance says
> "I know there are exactly 5 block groups now, and I want to leave
> exactly one behind," and the resize says "I want no possibility of new
> block groups on sdd for some time."
end of thread, other threads:[~2020-04-15 11:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-07 19:46 btrfs freezing on writes kjansen387
2020-04-07 20:11 ` Chris Murphy
2020-04-07 20:22   ` Chris Murphy
2020-04-07 20:39     ` kjansen387
2020-04-07 22:11       ` Chris Murphy
2020-04-07 22:30         ` kjansen387
2020-04-09  4:32 ` Zygo Blaxell
2020-04-09 21:53   ` kjansen387
2020-04-09 23:07     ` Zygo Blaxell
2020-04-11 19:46       ` kjansen387
2020-04-11 19:59         ` Chris Murphy
2020-04-11 20:21         ` Zygo Blaxell
2020-04-15 11:14           ` kjansen387