* Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
@ 2024-06-05 18:40 Zack Weinberg
2024-06-05 23:05 ` Dave Chinner
2024-06-06 12:17 ` Roger Heflin
0 siblings, 2 replies; 5+ messages in thread
From: Zack Weinberg @ 2024-06-05 18:40 UTC (permalink / raw)
To: dm-devel, linux-xfs
[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]
I am experimenting with the use of dm-integrity underneath dm-raid,
to get around the problem where, if a RAID 1 or RAID 5 array is
inconsistent, you may not know which copy is the good one. I have found
a reproducible hard lockup involving XFS, RAID 5 and dm-integrity.
My test array consists of three spinning HDDs (each 2 decimal
terabytes), each with dm-integrity laid directly onto the disk
(no partition table), using SHA-256 checksums. On top of this is
an MD-RAID array (raid5), and on top of *that* is an ordinary XFS
filesystem.
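For reference, a stack like the one described can be assembled roughly as follows. This is a sketch, not my exact commands: device names and all parameters are assumptions, and every step here is destructive to the disks involved.

```shell
# DESTRUCTIVE: wipes /dev/sda, /dev/sdb, /dev/sdc (names assumed).
for d in sda sdb sdc; do
    # Standalone dm-integrity directly on each whole disk, SHA-256 tags
    # (32 bytes per 4k sector), then open it as integ-$d.
    integritysetup format /dev/$d --integrity sha256 --sector-size 4096
    integritysetup open   /dev/$d integ-$d --integrity sha256
done

# RAID 5 across the three integrity devices...
mdadm --create /dev/md/sdr5 --level=5 --raid-devices=3 \
      /dev/mapper/integ-sda /dev/mapper/integ-sdb /dev/mapper/integ-sdc

# ...and an ordinary XFS filesystem on top (here on a single GPT
# partition, as in the attached configuration).
mkfs.xfs /dev/md/sdr5p1
```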
Extracting a large tar archive (970 G) into the filesystem causes a hard
lockup -- the entire system becomes unresponsive -- after some tens of
gigabytes have been extracted. I have reproduced the lockup using
kernel versions 6.6.21 and 6.9.3. No error messages make it to the
console, but with 6.9.3 I was able to extract almost all of a lockdep
report from pstore. I don't fully understand lockdep reports, but it
*looks* like it might be a livelock rather than a deadlock, with all
available kworker threads so bogged down with dm-integrity chores that
an XFS log flush is blocked long enough to hit the hung task timeout.
Attached are:
- what I have of the lockdep report (kernel 6.9.3) (only a couple
of lines at the very top are missing)
- kernel .config (6.9.3, lockdep enabled)
- dmesg up till userspace starts (6.6.21, lockdep not enabled)
- details of the test array configuration
Please advise if there is any more information you need. I am happy to
test patches. I'm not subscribed to either dm-devel or linux-xfs.
zw
p.s. Incidentally, why doesn't the dm-integrity superblock record the
checksum algorithm in use?
[-- Attachment #2: array-configuration.txt --]
[-- Type: text/plain, Size: 3247 bytes --]
# xfs_info /dev/md/sdr5p1
meta-data=/dev/md/sdr5p1 isize=512 agcount=32, agsize=30283904 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=969084928, imaxpct=5
= sunit=128 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=473186, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# gdisk -l /dev/md/sdr5
GPT fdisk (gdisk) version 1.0.9
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/md/sdr5: 969086976 sectors, 3.6 TiB
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 28F28613-3AAD-46F9-AABA-9CC7E7EFFC3D
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 969086970
Partitions will be aligned on 256-sector boundaries
Total free space is 501 sectors (2.0 MiB)
Number Start (sector) End (sector) Size Code Name
1 256 969086719 3.6 TiB 8300 test_array
# mdadm --detail /dev/md/sdr5
/dev/md/sdr5:
Version : 1.2
Creation Time : Fri May 10 21:13:54 2024
Raid Level : raid5
Array Size : 3876347904 (3.61 TiB 3.97 TB)
Used Dev Size : 1938173952 (1848.39 GiB 1984.69 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Jun 5 14:09:18 2024
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : moxana:sdr5 (local to host moxana)
UUID : 395e1626:1483f9f8:39d6e78c:af21beb1
Events : 13781
Number Major Minor RaidDevice State
0 253 3 0 active sync /dev/dm-3
1 253 4 1 active sync /dev/dm-4
3 253 5 2 active sync /dev/dm-5
# for d in sda sdb sdc; do integritysetup dump /dev/$d; done
Info for integrity device /dev/sda.
superblock_version 5
log2_interleave_sectors 15
integrity_tag_size 32
journal_sections 496
provided_data_sectors 3876612136
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_padding fix_hmac
Info for integrity device /dev/sdb.
superblock_version 5
log2_interleave_sectors 15
integrity_tag_size 32
journal_sections 496
provided_data_sectors 3876612136
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_padding fix_hmac
Info for integrity device /dev/sdc.
superblock_version 5
log2_interleave_sectors 15
integrity_tag_size 32
journal_sections 496
provided_data_sectors 3876612136
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_padding fix_hmac
[-- Attachment #3: xfs-mdraid-integrity-deadlock.txt --]
[-- Type: text/plain, Size: 9677 bytes --]
[ 2213.559141] Not tainted 6.9.3-gentoo-lockdep #2
[ 2213.559146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2213.559149] task:kworker/25:3 state:D stack:0 pid:13498 tgid:13498 ppid:2 flags:0x00004000
[ 2213.559160] Workqueue: xfs-sync/md126p1 xfs_log_worker
[ 2213.559169] Call Trace:
[ 2213.559172] <TASK>
[ 2213.559177] __schedule+0x49a/0x1900
[ 2213.559183] ? find_held_lock+0x32/0x90
[ 2213.559190] ? srso_return_thunk+0x5/0x5f
[ 2213.559198] schedule+0x31/0x130
[ 2213.559204] schedule_timeout+0x1cd/0x1e0
[ 2213.559212] __wait_for_common+0xbc/0x1e0
[ 2213.559218] ? usleep_range_state+0xc0/0xc0
[ 2213.559226] __flush_workqueue+0x15f/0x470
[ 2213.559235] ? __wait_for_common+0x4d/0x1e0
[ 2213.559242] xlog_cil_push_now.isra.0+0x59/0xa0
[ 2213.559249] xlog_cil_force_seq+0x7d/0x290
[ 2213.559257] xfs_log_force+0x86/0x2d0
[ 2213.559263] xfs_log_worker+0x36/0xd0
[ 2213.559270] process_one_work+0x210/0x640
[ 2213.559279] worker_thread+0x1c7/0x3c0
[ 2213.559287] ? wq_sysfs_prep_attrs+0xa0/0xa0
[ 2213.559294] kthread+0xd2/0x100
[ 2213.559301] ? kthread_complete_and_exit+0x20/0x20
[ 2213.559309] ret_from_fork+0x2b/0x40
[ 2213.559317] ? kthread_complete_and_exit+0x20/0x20
[ 2213.559324] ret_from_fork_asm+0x11/0x20
[ 2213.559332] </TASK>
[ 2213.559361] Showing all locks held in the system:
[ 2213.559390] 2 locks held by kworker/u131:0/208:
[ 2213.559395] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559421] #1: ffffb848c08dbe58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559446] 3 locks held by kworker/u130:13/223:
[ 2213.559451] #0: ffff9aa7cc1f8158 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559474] #1: ffffb848c0953e58 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559497] #2: ffff9aa0c25400e8 (&type->s_umount_key#32){++++}-{3:3}, at: super_trylock_shared+0x11/0x50
[ 2213.559522] 1 lock held by khungtaskd/230:
[ 2213.559526] #0: ffffffff89ec2e20 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x2c/0x1d0
[ 2213.559557] 1 lock held by usb-storage/414:
[ 2213.559561] #0: ffff9aa0cb15ace8 (&us_interface_key[i]){+.+.}-{3:3}, at: usb_stor_control_thread+0x43/0x2d0
[ 2213.559591] 1 lock held by in:imklog/1997:
[ 2213.559595] #0: ffff9aa0db2258d8 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x84/0xd0
[ 2213.559620] 2 locks held by kworker/u131:3/3226:
[ 2213.559624] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559664] #1: ffffb848c47a7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559706] 2 locks held by tar/5845:
[ 2213.559710] #0: ffff9aa0c2540420 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0x6c/0xf0
[ 2213.559732] #1: ffff9aa0e16c3f58 (&sb->s_type->i_mutex_key#8){++++}-{3:3}, at: xfs_ilock+0x144/0x180
[ 2213.559789] 2 locks held by kworker/14:28/6524:
[ 2213.559793] #0: ffff9aa0da45e758 ((wq_completion)dm-integrity-writer#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559815] #1: ffffb848c64e7e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559882] 2 locks held by kworker/12:45/8171:
[ 2213.559886] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559908] #1: ffffb848d6583e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559949] 2 locks held by kworker/12:81/8479:
[ 2213.559953] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559979] #1: ffffb848d6ea3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560006] 2 locks held by kworker/12:98/8496:
[ 2213.560010] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560036] #1: ffffb848d6f2be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560062] 2 locks held by kworker/12:101/8499:
[ 2213.560067] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560093] #1: ffffb848d6f43e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560118] 2 locks held by kworker/12:110/8508:
[ 2213.560122] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560149] #1: ffffb848d6f8be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560175] 2 locks held by kworker/12:111/8509:
[ 2213.560180] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560206] #1: ffffb848d6f93e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560230] 2 locks held by kworker/12:112/8510:
[ 2213.560235] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560261] #1: ffffb848d6f9be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560307] 2 locks held by kworker/u131:5/9163:
[ 2213.560312] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560335] #1: ffffb848d8803e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560359] 2 locks held by kworker/u131:6/9166:
[ 2213.560364] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560387] #1: ffffb848c44c7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560429] 2 locks held by kworker/30:236/9664:
[ 2213.560433] #0: ffff9aa0e43c0b58 ((wq_completion)dm-integrity-writer#3){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560455] #1: ffffb848da42be58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560540] 2 locks held by kworker/12:128/11574:
[ 2213.560544] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560564] #1: ffffb848de48be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.648428] 2 locks held by kworker/12:175/11621:
[ 2213.648431] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.648443] #1: ffffb848de603e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651134] 2 locks held by kworker/12:177/11623:
[ 2213.651136] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651147] #1: ffffb848c4c47e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651158] 2 locks held by kworker/12:179/11625:
[ 2213.651159] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651170] #1: ffffb848de613e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651181] 2 locks held by kworker/12:180/11626:
[ 2213.651183] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651193] #1: ffffb848de61be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651205] 2 locks held by kworker/12:182/11628:
[ 2213.651206] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651217] #1: ffffb848de62be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651228] 2 locks held by kworker/12:184/11630:
[ 2213.651230] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651240] #1: ffffb848d4793e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651257] 2 locks held by kworker/12:236/11682:
[ 2213.651259] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651270] #1: ffffb848de7cbe58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651280] 2 locks held by kworker/12:239/11685:
[ 2213.651282] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651293] #1: ffffb848de7e3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651341] 2 locks held by kworker/25:121/12751:
[ 2213.651343] #0: ffff9aa0c8122f58 ((wq_completion)dm-integrity-writer#4){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651353] #1: ffffb848e0c13e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651425] 2 locks held by kworker/25:3/13498:
[ 2213.651426] #0: ffff9aa0c7bfe758 ((wq_completion)xfs-sync/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651436] #1: ffffb848e259be58 ((work_completion)(&(&log->l_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651465] =============================================
[ 2213.651467] Kernel panic - not syncing: hung_task: blocked tasks
[ 2213.652654] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[-- Attachment #4: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 17538 bytes --]
[-- Attachment #5: kconfig-6.9.x-lockdep.gz --]
[-- Type: application/gzip, Size: 28814 bytes --]
^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
  2024-06-05 18:40 Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD Zack Weinberg
@ 2024-06-05 23:05 ` Dave Chinner
  2024-06-06 15:48   ` Zack Weinberg
  2024-06-06 12:17 ` Roger Heflin
  1 sibling, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2024-06-05 23:05 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: dm-devel, linux-xfs

On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
> I am experimenting with the use of dm-integrity underneath dm-raid,
> to get around the problem where, if a RAID 1 or RAID 5 array is
> inconsistent, you may not know which copy is the good one. I have found
> a reproducible hard lockup involving XFS, RAID 5 and dm-integrity.
>
> My test array consists of three spinning HDDs (each 2 decimal
> terabytes), each with dm-integrity laid directly onto the disk
> (no partition table), using SHA-256 checksums. On top of this is
> an MD-RAID array (raid5), and on top of *that* is an ordinary XFS
> filesystem.
>
> Extracting a large tar archive (970 G) into the filesystem causes a hard
> lockup -- the entire system becomes unresponsive -- after some tens of
> gigabytes have been extracted. I have reproduced the lockup using
> kernel versions 6.6.21 and 6.9.3. No error messages make it to the
> console, but with 6.9.3 I was able to extract almost all of a lockdep
> report from pstore. I don't fully understand lockdep reports, but it
> *looks* like it might be a livelock rather than a deadlock, with all
> available kworker threads so bogged down with dm-integrity chores that
> an XFS log flush is blocked long enough to hit the hung task timeout.
>
> Attached are:
>
>  - what I have of the lockdep report (kernel 6.9.3) (only a couple
>    of lines at the very top are missing)
>  - kernel .config (6.9.3, lockdep enabled)
>  - dmesg up till userspace starts (6.6.21, lockdep not enabled)
>  - details of the test array configuration
>
> Please advise if there is any more information you need. I am happy to
> test patches. I'm not subscribed to either dm-devel or linux-xfs.

I don't think there's any lockup or kernel bug here - this just looks
to be a case of having a really, really slow storage setup and
everything waiting for a huge amount of IO to complete to make
forwards progress.

RAID 5 writes are slow with spinning disks. dm-integrity makes writes
even slower. If your storage array can sustain more than 50 random 4kB
writes a second, I'd be very surprised. It's going to be -very slow-.

With this in mind, untarring a TB sized tarball is going to drive lots
of data IO and create lots of metadata. XFS will buffer most of that
metadata in memory until the journal fills, but once the journal fills
(a few seconds to a few minutes into the untar) it will throttle
metadata performance to the small random write speed of the underlying
storage....

The high speed collision with that brick wall looks like this:
userspace stalls on writes because there are too many dirty pages in
RAM. It throttles all incoming writes, waiting for background
writeback to clean dirty pages. Data writeback requires block
allocation, which requires metadata modification. Metadata
modification requires journal space reservations, which block waiting
for metadata writeback IO to complete. There are hours of metadata
writeback needed to free journal space, so everything pauses waiting
for metadata IO completion.

> p.s. Incidentally, why doesn't the dm-integrity superblock record the
> checksum algorithm in use?
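The backlog described here is visible from userspace while the untar is running, using only standard Linux interfaces (nothing specific to this setup):

```shell
# Watch the dirty/writeback page backlog grow and then plateau once
# writeback throttling kicks in (values are in kB).
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Raw per-device IO counters; sampling this twice and diffing the
# writes-completed field gives the sustained write IOPS per disk.
cat /proc/diskstats
```

During the stall one would expect Dirty to sit pinned near its limit while each array member completes only a few dozen writes per second.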
> # xfs_info /dev/md/sdr5p1
> meta-data=/dev/md/sdr5p1         isize=512    agcount=32, agsize=30283904 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=969084928, imaxpct=5
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=473186, version=2

So a 1.6GB journal can buffer hundreds of thousands of dirty 4kB
metadata blocks with writeback pending. Once the journal is full,
however, the filesystem has to start writing them back to make space
in the journal for new incoming changes. At that point, the filesystem
will throttle incoming metadata modifications to the rate at which it
can remove dirty metadata from the journal, i.e. it will throttle
incoming modifications to the sustained random 4kB write rate of your
storage hardware.

With at least a quarter of a million random 4kB writes pending in the
journal when it starts throttling, I'd suggest that you're looking at
several hours of waiting just to flush the journal, let alone complete
the untar process, which will be generating new metadata all the
time....

> [ 2213.559141] Not tainted 6.9.3-gentoo-lockdep #2
> [ 2213.559146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2213.559149] task:kworker/25:3 state:D stack:0 pid:13498 tgid:13498 ppid:2 flags:0x00004000
> [ 2213.559160] Workqueue: xfs-sync/md126p1 xfs_log_worker
> [ 2213.559169] Call Trace:
> [ 2213.559172] <TASK>
> [ 2213.559177] __schedule+0x49a/0x1900
> [ 2213.559183] ? find_held_lock+0x32/0x90
> [ 2213.559190] ? srso_return_thunk+0x5/0x5f
> [ 2213.559198] schedule+0x31/0x130
> [ 2213.559204] schedule_timeout+0x1cd/0x1e0
> [ 2213.559212] __wait_for_common+0xbc/0x1e0
> [ 2213.559218] ? usleep_range_state+0xc0/0xc0
> [ 2213.559226] __flush_workqueue+0x15f/0x470
> [ 2213.559235] ? __wait_for_common+0x4d/0x1e0
> [ 2213.559242] xlog_cil_push_now.isra.0+0x59/0xa0
> [ 2213.559249] xlog_cil_force_seq+0x7d/0x290
> [ 2213.559257] xfs_log_force+0x86/0x2d0
> [ 2213.559263] xfs_log_worker+0x36/0xd0
> [ 2213.559270] process_one_work+0x210/0x640
> [ 2213.559279] worker_thread+0x1c7/0x3c0

Yup, so that's the periodic log force that is run to help the
filesystem move to idle when nothing is happening (runs every 30s).
It's waiting on a CIL push, which writes all the current in-memory
changes to the journal.

> [ 2213.559287] ? wq_sysfs_prep_attrs+0xa0/0xa0
> [ 2213.559294] kthread+0xd2/0x100
> [ 2213.559301] ? kthread_complete_and_exit+0x20/0x20
> [ 2213.559309] ret_from_fork+0x2b/0x40
> [ 2213.559317] ? kthread_complete_and_exit+0x20/0x20
> [ 2213.559324] ret_from_fork_asm+0x11/0x20
> [ 2213.559332] </TASK>
> [ 2213.559361] Showing all locks held in the system:
> [ 2213.559390] 2 locks held by kworker/u131:0/208:
> [ 2213.559395] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559421] #1: ffffb848c08dbe58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640

That's a CIL push worker, and it's likely waiting on journal space to
write the new changes into the journal. i.e. it's waiting for metadata
writeback IO completion to occur so it can make progress.
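As a back-of-the-envelope check of the numbers above, using the log geometry from the xfs_info output and the assumed ~50 random 4kB writes/s:

```shell
# Internal log: 473186 blocks of 4096 bytes (from xfs_info).
log_blocks=473186
block_size=4096

echo "journal bytes: $(( log_blocks * block_size ))"   # 1938169856, ~1.9 GB

# Draining a journal's worth of dirty 4k metadata blocks at ~50
# random writes/s (a pessimistic but plausible rate for RAID 5 on
# HDD under dm-integrity):
echo "drain time (s): $(( log_blocks / 50 ))"          # 9463 s, ~2.6 hours
```

The write rate is an assumption, but even at several times that speed the drain time stays in the hours range, consistent with the estimate above.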
> [ 2213.559446] 3 locks held by kworker/u130:13/223:
> [ 2213.559451] #0: ffff9aa7cc1f8158 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559474] #1: ffffb848c0953e58 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.559497] #2: ffff9aa0c25400e8 (&type->s_umount_key#32){++++}-{3:3}, at: super_trylock_shared+0x11/0x50
> [ 2213.559522] 1 lock held by khungtaskd/230:
> [ 2213.559526] #0: ffffffff89ec2e20 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x2c/0x1d0
> [ 2213.559557] 1 lock held by usb-storage/414:
> [ 2213.559561] #0: ffff9aa0cb15ace8 (&us_interface_key[i]){+.+.}-{3:3}, at: usb_stor_control_thread+0x43/0x2d0
> [ 2213.559591] 1 lock held by in:imklog/1997:
> [ 2213.559595] #0: ffff9aa0db2258d8 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x84/0xd0
> [ 2213.559620] 2 locks held by kworker/u131:3/3226:
> [ 2213.559624] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559664] #1: ffffb848c47a7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640

There's another CIL push worker - we can have 4 pushes in flight at
once.

> [ 2213.559706] 2 locks held by tar/5845:
> [ 2213.559710] #0: ffff9aa0c2540420 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0x6c/0xf0
> [ 2213.559732] #1: ffff9aa0e16c3f58 (&sb->s_type->i_mutex_key#8){++++}-{3:3}, at: xfs_ilock+0x144/0x180

There's the tar process, looking to be blocked during data IO, likely
on something like a timestamp update being blocked waiting for journal
space...
> [ 2213.559789] 2 locks held by kworker/14:28/6524:
> [ 2213.559793] #0: ffff9aa0da45e758 ((wq_completion)dm-integrity-writer#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559815] #1: ffffb848c64e7e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.559882] 2 locks held by kworker/12:45/8171:
> [ 2213.559886] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559908] #1: ffffb848d6583e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.559949] 2 locks held by kworker/12:81/8479:
> [ 2213.559953] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559979] #1: ffffb848d6ea3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560006] 2 locks held by kworker/12:98/8496:
> [ 2213.560010] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560036] #1: ffffb848d6f2be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560062] 2 locks held by kworker/12:101/8499:
> [ 2213.560067] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560093] #1: ffffb848d6f43e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560118] 2 locks held by kworker/12:110/8508:
> [ 2213.560122] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560149] #1: ffffb848d6f8be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560175] 2 locks held by kworker/12:111/8509:
> [ 2213.560180] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560206] #1: ffffb848d6f93e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560230] 2 locks held by kworker/12:112/8510:
> [ 2213.560235] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560261] #1: ffffb848d6f9be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640

A bunch of dm-integrity workers doing something.

> [ 2213.560307] 2 locks held by kworker/u131:5/9163:
> [ 2213.560312] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560335] #1: ffffb848d8803e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560359] 2 locks held by kworker/u131:6/9166:
> [ 2213.560364] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560387] #1: ffffb848c44c7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640

And there's the other two CIL push workers with pushes in progress. So
the CIL is clearly blocked waiting for space to write new commits into
the journal. Hence the system is clearly IO bound, and that's why it
suddenly goes slow.
> [ 2213.560429] 2 locks held by kworker/30:236/9664:
> [ 2213.560433] #0: ffff9aa0e43c0b58 ((wq_completion)dm-integrity-writer#3){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560455] #1: ffffb848da42be58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560540] 2 locks held by kworker/12:128/11574:
> [ 2213.560544] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560564] #1: ffffb848de48be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.648428] 2 locks held by kworker/12:175/11621:
> [ 2213.648431] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.648443] #1: ffffb848de603e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651134] 2 locks held by kworker/12:177/11623:
> [ 2213.651136] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651147] #1: ffffb848c4c47e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651158] 2 locks held by kworker/12:179/11625:
> [ 2213.651159] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651170] #1: ffffb848de613e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651181] 2 locks held by kworker/12:180/11626:
> [ 2213.651183] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651193] #1: ffffb848de61be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651205] 2 locks held by kworker/12:182/11628:
> [ 2213.651206] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651217] #1: ffffb848de62be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651228] 2 locks held by kworker/12:184/11630:
> [ 2213.651230] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651240] #1: ffffb848d4793e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651257] 2 locks held by kworker/12:236/11682:
> [ 2213.651259] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651270] #1: ffffb848de7cbe58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651280] 2 locks held by kworker/12:239/11685:
> [ 2213.651282] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651293] #1: ffffb848de7e3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651341] 2 locks held by kworker/25:121/12751:
> [ 2213.651343] #0: ffff9aa0c8122f58 ((wq_completion)dm-integrity-writer#4){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651353] #1: ffffb848e0c13e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640

Another large bunch of dm-integrity workers doing something. It looks
to me like dm-integrity is trying to do a lot at this point in time...

> [ 2213.651425] 2 locks held by kworker/25:3/13498:
> [ 2213.651426] #0: ffff9aa0c7bfe758 ((wq_completion)xfs-sync/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651436] #1: ffffb848e259be58 ((work_completion)(&(&log->l_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640

And that's the periodic log worker that generated the log force which
triggered the hung task timer.
> [ 2213.651465] =============================================
> [ 2213.651467] Kernel panic - not syncing: hung_task: blocked tasks
> [ 2213.652654] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

So the system has been up for just under an hour, and it's configured
to panic on warnings and/or hung tasks. If you want the untar to
complete, turn off the hung task timer or push it out so far that it
doesn't trigger (i.e. 12-24 hours). Don't expect it to finish quickly,
and when it does there's probably still hours of metadata writeback
pending which will block unmount until it's done.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
  2024-06-05 23:05 ` Dave Chinner
@ 2024-06-06 15:48   ` Zack Weinberg
  2024-06-07  1:17     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Zack Weinberg @ 2024-06-06 15:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: dm-devel, linux-xfs

On Wed, Jun 5, 2024, at 7:05 PM, Dave Chinner wrote:
> On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
>> I am experimenting with the use of dm-integrity underneath dm-raid,
>> to get around the problem where, if a RAID 1 or RAID 5 array is
>> inconsistent, you may not know which copy is the good one. I have
>> found a reproducible hard lockup involving XFS, RAID 5 and
>> dm-integrity.
>
> I don't think there's any lockup or kernel bug here - this just looks
> to be a case of having a really, really slow storage setup and
> everything waiting for a huge amount of IO to complete to make
> forwards progress.
...
> Userspace stalls on writes because there are too many dirty pages
> in RAM. It throttles all incoming writes, waiting for background
> writeback to clean dirty pages. Data writeback requires block
> allocation which requires metadata modification. Metadata modification
> requires journal space reservations which block waiting for metadata
> writeback IO to complete. There are hours of metadata writeback needed
> to free journal space, so everything pauses waiting for metadata IO
> completion.

This makes a lot of sense.

> RAID 5 writes are slow with spinning disks. dm-integrity makes writes
> even slower. If your storage array can sustain more than 50 random 4kB
> writes a second, I'd be very surprised. It's going to be -very slow-.

I wiped the contents of the filesystem and ran bonnie++ on it in direct
I/O mode with 4k block writes, skipping the one-character write and
small file creation tests.
This is what I got:

Version 2.00    ------Sequential Output------ --Sequential Input- --Random-
                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc   /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
64G:4k::65536             15.8m  19 60.5m  26           218m  31  279.1  13
Latency                   659ms     517ms               61146us     3052ms

I think this is doing seek-and-read, not seek-and-write, but 300 random
reads per second is still really damn slow compared to the sequential
performance. And it didn't lock up (with an unchanged hung task timeout
of two minutes), so that also tends to confirm your hypothesis --
direct I/O means no write backlog.

(Do you know of a good way to benchmark seek-and-write performance,
ideally directly on a block device instead of having a filesystem
present?)

I don't actually care how slow it is to write things to this array,
because (if I can ever get it working) it's meant to be archival
storage, written to only rarely. But I do need to get this tarball
unpacked; I'd prefer it if the runtime of 'tar' corresponded closely
to the actual time required to get the data all the way to stable
storage, and disabling the hung task timeout seems like a kludge.

...
> So a 1.6GB journal can buffer hundreds of thousands of dirty 4kB
> metadata blocks with writeback pending. Once the journal is full,
> however, the filesystem has to start writing them back to make space
> in the journal for new incoming changes. At this point, the filesystem
> will throttle incoming metadata modifications to the rate at which it
> can remove dirty metadata from the journal. i.e. it will throttle
> incoming modifications to the sustained random 4kB write rate of your
> storage hardware.
>
> With at least a quarter of a million random 4kB writes pending in the
> journal when it starts throttling, I'd suggest that you're looking at
> several hours of waiting just to flush the journal, let alone complete
> the untar process which will be generating new metadata all the
> time....
This reminds me of the 'bufferbloat' phenomenon over in networking
land. Would it help to reduce the size of the journal to something like
6MB, which (assuming 50 random writes per second) would take only 30s
to flush? Is a journal that small, for a filesystem this large, likely
to cause other problems?

Are there any other tuning knobs you can think of that might restrict
the rate of incoming metadata modifications from 'tar' to a sustainable
level from the get-go, instead of barging ahead and then hitting a
wall? I'm inclined to doubt that VM-level writeback controls (as
suggested elsethread) will help much, since they would not change how
much data can pile up in the filesystem's journal, but I could be
wrong.

Thanks for your help so far.
zw

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
  2024-06-06 15:48 ` Zack Weinberg
@ 2024-06-07  1:17   ` Dave Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2024-06-07 1:17 UTC (permalink / raw)
To: Zack Weinberg; +Cc: dm-devel, linux-xfs

On Thu, Jun 06, 2024 at 11:48:57AM -0400, Zack Weinberg wrote:
> On Wed, Jun 5, 2024, at 7:05 PM, Dave Chinner wrote:
> > On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
> >> I am experimenting with the use of dm-integrity underneath dm-raid,
> >> to get around the problem where, if a RAID 1 or RAID 5 array is
> >> inconsistent, you may not know which copy is the good one. I have
> >> found a reproducible hard lockup involving XFS, RAID 5 and
> >> dm-integrity.
> >
> > I don't think there's any lockup or kernel bug here - this just
> > looks to be a case of having a really, really slow storage setup and
> > everything waiting for a huge amount of IO to complete to make
> > forwards progress.
> ...
> > Userspace stalls on writes because there are too many dirty pages
> > in RAM. It throttles all incoming writes, waiting for background
> > writeback to clean dirty pages. Data writeback requires block
> > allocation which requires metadata modification. Metadata
> > modification requires journal space reservations which block waiting
> > for metadata writeback IO to complete. There are hours of metadata
> > writeback needed to free journal space, so everything pauses waiting
> > for metadata IO completion.
>
> This makes a lot of sense.
>
> > RAID 5 writes are slow with spinning disks. dm-integrity makes
> > writes even slower. If your storage array can sustain more than 50
> > random 4kB writes a second, I'd be very surprised. It's going to be
> > -very slow-.
> I wiped the contents of the filesystem and ran bonnie++ on it in direct

Wow, that is sooooo 2000's :)

> I/O mode with 4k block writes, skipping the one-character write and
> small file creation tests. This is what I got:
>
> Version 2.00    ------Sequential Output------ --Sequential Input- --Random-
>                 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Name:Size etc   /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> 64G:4k::65536             15.8m  19 60.5m  26           218m  31  279.1  13
> Latency                   659ms     517ms               61146us     3052ms
>
> I think this is doing seek-and-read, not seek-and-write, but 300
> random reads per second is still really damn slow compared to the
> sequential performance. And it didn't lock up (with unchanged hung
> task timeout of two minutes) so that also tends to confirm your
> hypothesis -- direct I/O means no write backlog.
>
> (Do you know of a good way to benchmark seek-and-write
> performance, ideally directly on a block device instead of having
> a filesystem present?)

fio. Use it with direct=1, bs=4k, rw=randwrite and you can point it at
either a file or a block device.

> I don't actually care how slow it is to write things to this array,
> because (if I can ever get it working) it's meant to be archival
> storage, written to only rarely. But I do need to get this tarball
> unpacked, I'd prefer it if the runtime of 'tar' would correspond
> closely to the actual time required to get the data all the way to
> stable storage, and disabling the hung task timeout seems like a
> kludge.

The hung task timeout is intended to capture deadlocks that are
forever, not something that is blocking because it has to wait for a
hundred thousand IOs to complete at 50 IOPS. When you have storage this
slow and data sets this big, you have to tune these detectors so they
don't report false positives. What you are doing is so far out of the
"normal operation" window that it's no surprise that you're getting
false positive hang detections like this.

> ...
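A fio job file along the lines Dave describes might look like this -- the device path, iodepth, and runtime here are assumptions for illustration, not values from the thread:

```ini
; randwrite.fio -- hedged sketch; /dev/md0 is a placeholder for
; the test array's md device
[seek-and-write]
filename=/dev/md0
direct=1
bs=4k
rw=randwrite
ioengine=libaio
iodepth=16
runtime=60
time_based=1
```

Run as root with `fio randwrite.fio`. Note that pointing it at the raw block device destroys whatever is on it, so this is only safe before the array holds real data.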
> > So a 1.6GB journal can buffer hundreds of thousands of dirty 4kB
> > metadata blocks with writeback pending. Once the journal is full,
> > however, the filesystem has to start writing them back to make
> > space in the journal for new incoming changes. At this point, the
> > filesystem will throttle incoming metadata modifications to the
> > rate at which it can remove dirty metadata from the journal. i.e.
> > it will throttle incoming modifications to the sustained random 4kB
> > write rate of your storage hardware.
> >
> > With at least a quarter of a million random 4kB writes pending in
> > the journal when it starts throttling, I'd suggest that you're
> > looking at several hours of waiting just to flush the journal, let
> > alone complete the untar process which will be generating new
> > metadata all the time....
>
> This reminds me of the 'bufferbloat' phenomenon over in networking
> land.

Yes, exactly. Storage has a bandwidth delay product, just like
networks, and when you put huge buffers in front of something with low
bandwidth and long round trip latencies to try to maintain high
throughput, it generally goes really bad the moment interactivity is
required.

> Would it help to reduce the size of the journal to something like
> 6MB, which (assuming 50 random writes per second) would take only 30s
> to flush?

That's taking journal sizes to the other extreme.

> Is a journal that small, for a filesystem this large, likely to
> cause other problems?

Definitely. e.g. not having enough journal space to allow aggregation
of changes to the same structures in memory before they are written to
disk. This alone will increase the required journal bandwidth for any
given workload by 1-2 orders of magnitude. It will also increase the
amount of metadata writeback by similar amounts because the window for
relogging already dirty objects is now tiny compared to the dataset you
are creating.
IOWs, when you have low bandwidth, seek limited storage, making the
journal too small can be much worse for performance than having a
really large journal and hitting the problems you are seeing.

A current mkfs.xfs defaults to a minimum log size of 64MB - that's
probably a good choice for this particular storage setup, as it's large
enough to soak up short bursts of metadata activity, but not so large
that it pins GBs of dirty metadata in RAM that stuff will stall on.

> Are there any other tuning knobs you can think of that might restrict
> the rate of incoming metadata modifications from 'tar' to a
> sustainable level from the get-go, instead of barging ahead and then
> hitting a wall? I'm inclined to doubt that VM-level writeback
> controls (as suggested elsethread) will help much, since they would
> not change how much data can pile up in the filesystem's journal, but
> I could be wrong.

No matter what you do you are going to have the workload throttled to
disk speed -somewhere-. Reducing the dirty limits on the page cache
will help in the same way that reducing the journal size will, and that
should help improve interactivity a bit. But, fundamentally, the data
set is much larger than RAM and so it will get written at disk speed,
and that means worst case latencies for anything the kernel does can be
very high.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread
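To put rough numbers on the journal-size tradeoff discussed above -- assuming Dave's 50 sustained random 4kB writes per second, and the worst case of the whole journal being dirty -- a quick back-of-envelope comparison of Zack's 6MB proposal, the 64MB mkfs.xfs minimum, and the existing ~1.6GB journal:

```shell
# Worst-case drain time for several journal sizes, assuming ~50
# sustained random 4kB writes/sec (Dave's estimate for this array).
for MB in 6 64 1600; do
    blocks=$(( MB * 1024 * 1024 / 4096 ))   # 4kB metadata blocks
    secs=$(( blocks / 50 ))                 # seconds at 50 IOPS
    echo "${MB}MB journal: ${blocks} blocks, ~${secs}s to drain"
done
```

This reproduces the figures in the thread: ~30s for 6MB, a few hundred seconds for 64MB, and a couple of hours for 1.6GB.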
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
  2024-06-05 18:40 Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD Zack Weinberg
  2024-06-05 23:05 ` Dave Chinner
@ 2024-06-06 12:17 ` Roger Heflin
  1 sibling, 0 replies; 5+ messages in thread
From: Roger Heflin @ 2024-06-06 12:17 UTC (permalink / raw)
To: Zack Weinberg; +Cc: dm-devel, linux-xfs

grep -E 'Dirty|Write' /proc/meminfo

The above shows the amount of outstanding writes that need to be
processed; I would expect it to be significant.

sysctl -a 2>/dev/null | grep -iE 'dirty_ratio|dirty_bytes|dirty_background'
vm.dirty_background_bytes = 3000000
vm.dirty_background_ratio = 0
vm.dirty_bytes = 5000000
vm.dirty_ratio = 0

By default you will be using ratio, and ratio is a % of RAM.
dirty_(ratio|bytes) is the high water mark (background is the low water
mark), and the OS will stop writes when the high water mark is hit
until you get back down to the low water mark.

With spinning disks, a tar file, and the default settings, that will be
GBs of data requiring a significant number of seeks, and draining from
the high water mark to the low water mark could take a really long time
(many minutes or even hours if you have enough RAM).

In my experience, about all these settings do is make it appear that IO
finished when it really did not (I believe the original intent of these
defaults may have been to improve benchmark results when the benchmark
did not force a sync). The downsides are that at any given time a much
higher amount of IO/data could be in RAM only and not on disk, and if a
crash (hardware/software/power) were to happen, a lot of data could get
lost.

Because of that I set these values much lower (see above). It appears
that you do need a write IO cache, but really it does not need to be
huge. The settings above will stop writes when 5MB is hit and restart
them at 3MB (likely a pause of well under a second, rather than locking
up the machine for a really long time).
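Roger's lower watermarks can be made persistent across reboots with a sysctl.d fragment -- the filename here is a hypothetical example; the values are the ones he quotes above:

```ini
# /etc/sysctl.d/90-writeback.conf -- hypothetical filename
# Stop new writes at ~5MB of dirty data; resume them once background
# writeback has drained it to ~3MB. Setting the *_bytes knobs makes
# the kernel ignore the corresponding *_ratio knobs.
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
```

The same values can be tested without a reboot via `sysctl -w vm.dirty_bytes=5000000` and friends.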
On Wed, Jun 5, 2024 at 1:41 PM Zack Weinberg <zack@owlfolio.org> wrote:
>
> I am experimenting with the use of dm-integrity underneath dm-raid,
> to get around the problem where, if a RAID 1 or RAID 5 array is
> inconsistent, you may not know which copy is the good one. I have found
> a reproducible hard lockup involving XFS, RAID 5 and dm-integrity.
>
> My test array consists of three spinning HDDs (each 2 decimal
> terabytes), each with dm-integrity laid directly onto the disk
> (no partition table), using SHA-256 checksums. On top of this is
> an MD-RAID array (raid5), and on top of *that* is an ordinary XFS
> filesystem.
>
> Extracting a large tar archive (970 G) into the filesystem causes a hard
> lockup -- the entire system becomes unresponsive -- after some tens of
> gigabytes have been extracted. I have reproduced the lockup using
> kernel versions 6.6.21 and 6.9.3. No error messages make it to the
> console, but with 6.9.3 I was able to extract almost all of a lockdep
> report from pstore. I don't fully understand lockdep reports, but it
> *looks* like it might be a livelock rather than a deadlock, with all
> available kworker threads so bogged down with dm-integrity chores that
> an XFS log flush is blocked long enough to hit the hung task timeout.
>
> Attached are:
>
> - what I have of the lockdep report (kernel 6.9.3) (only a couple
>   of lines at the very top are missing)
> - kernel .config (6.9.3, lockdep enabled)
> - dmesg up till userspace starts (6.6.21, lockdep not enabled)
> - details of the test array configuration
>
> Please advise if there is any more information you need. I am happy to
> test patches. I'm not subscribed to either dm-devel or linux-xfs.
>
> zw
>
> p.s. Incidentally, why doesn't the dm-integrity superblock record the
> checksum algorithm in use?

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-06-07  1:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-05 18:40 Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD Zack Weinberg
2024-06-05 23:05 ` Dave Chinner
2024-06-06 15:48   ` Zack Weinberg
2024-06-07  1:17     ` Dave Chinner
2024-06-06 12:17 ` Roger Heflin