* Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
@ 2024-06-05 18:40 Zack Weinberg
2024-06-05 23:05 ` Dave Chinner
2024-06-06 12:17 ` Roger Heflin
0 siblings, 2 replies; 5+ messages in thread
From: Zack Weinberg @ 2024-06-05 18:40 UTC (permalink / raw)
To: dm-devel, linux-xfs
[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]
I am experimenting with the use of dm-integrity underneath dm-raid,
to get around the problem where, if a RAID 1 or RAID 5 array is
inconsistent, you may not know which copy is the good one. I have found
a reproducible hard lockup involving XFS, RAID 5 and dm-integrity.
My test array consists of three spinning HDDs (each 2 decimal
terabytes), each with dm-integrity laid directly onto the disk
(no partition table), using SHA-256 checksums. On top of this is
an MD-RAID array (raid5), and on top of *that* is an ordinary XFS
filesystem.
Extracting a large tar archive (970 G) into the filesystem causes a hard
lockup -- the entire system becomes unresponsive -- after some tens of
gigabytes have been extracted. I have reproduced the lockup using
kernel versions 6.6.21 and 6.9.3. No error messages make it to the
console, but with 6.9.3 I was able to extract almost all of a lockdep
report from pstore. I don't fully understand lockdep reports, but it
*looks* like it might be a livelock rather than a deadlock, with all
available kworker threads so bogged down with dm-integrity chores that
an XFS log flush is blocked long enough to hit the hung task timeout.
Attached are:
- what I have of the lockdep report (kernel 6.9.3) (only a couple
of lines at the very top are missing)
- kernel .config (6.9.3, lockdep enabled)
- dmesg up till userspace starts (6.6.21, lockdep not enabled)
- details of the test array configuration
Please advise if there is any more information you need. I am happy to
test patches. I'm not subscribed to either dm-devel or linux-xfs.
zw
p.s. Incidentally, why doesn't the dm-integrity superblock record the
checksum algorithm in use?
[-- Attachment #2: array-configuration.txt --]
[-- Type: text/plain, Size: 3247 bytes --]
# xfs_info /dev/md/sdr5p1
meta-data=/dev/md/sdr5p1 isize=512 agcount=32, agsize=30283904 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=969084928, imaxpct=5
= sunit=128 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=473186, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# gdisk -l /dev/md/sdr5
GPT fdisk (gdisk) version 1.0.9
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/md/sdr5: 969086976 sectors, 3.6 TiB
Sector size (logical/physical): 4096/4096 bytes
Disk identifier (GUID): 28F28613-3AAD-46F9-AABA-9CC7E7EFFC3D
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 5
First usable sector is 6, last usable sector is 969086970
Partitions will be aligned on 256-sector boundaries
Total free space is 501 sectors (2.0 MiB)
Number Start (sector) End (sector) Size Code Name
1 256 969086719 3.6 TiB 8300 test_array
# mdadm --detail /dev/md/sdr5
/dev/md/sdr5:
Version : 1.2
Creation Time : Fri May 10 21:13:54 2024
Raid Level : raid5
Array Size : 3876347904 (3.61 TiB 3.97 TB)
Used Dev Size : 1938173952 (1848.39 GiB 1984.69 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Jun 5 14:09:18 2024
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : moxana:sdr5 (local to host moxana)
UUID : 395e1626:1483f9f8:39d6e78c:af21beb1
Events : 13781
Number Major Minor RaidDevice State
0 253 3 0 active sync /dev/dm-3
1 253 4 1 active sync /dev/dm-4
3 253 5 2 active sync /dev/dm-5
# for d in sda sdb sdc; do integritysetup dump /dev/$d; done
Info for integrity device /dev/sda.
superblock_version 5
log2_interleave_sectors 15
integrity_tag_size 32
journal_sections 496
provided_data_sectors 3876612136
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_padding fix_hmac
Info for integrity device /dev/sdb.
superblock_version 5
log2_interleave_sectors 15
integrity_tag_size 32
journal_sections 496
provided_data_sectors 3876612136
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_padding fix_hmac
Info for integrity device /dev/sdc.
superblock_version 5
log2_interleave_sectors 15
integrity_tag_size 32
journal_sections 496
provided_data_sectors 3876612136
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_padding fix_hmac
[-- Attachment #3: xfs-mdraid-integrity-deadlock.txt --]
[-- Type: text/plain, Size: 9677 bytes --]
[ 2213.559141] Not tainted 6.9.3-gentoo-lockdep #2
[ 2213.559146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2213.559149] task:kworker/25:3 state:D stack:0 pid:13498 tgid:13498 ppid:2 flags:0x00004000
[ 2213.559160] Workqueue: xfs-sync/md126p1 xfs_log_worker
[ 2213.559169] Call Trace:
[ 2213.559172] <TASK>
[ 2213.559177] __schedule+0x49a/0x1900
[ 2213.559183] ? find_held_lock+0x32/0x90
[ 2213.559190] ? srso_return_thunk+0x5/0x5f
[ 2213.559198] schedule+0x31/0x130
[ 2213.559204] schedule_timeout+0x1cd/0x1e0
[ 2213.559212] __wait_for_common+0xbc/0x1e0
[ 2213.559218] ? usleep_range_state+0xc0/0xc0
[ 2213.559226] __flush_workqueue+0x15f/0x470
[ 2213.559235] ? __wait_for_common+0x4d/0x1e0
[ 2213.559242] xlog_cil_push_now.isra.0+0x59/0xa0
[ 2213.559249] xlog_cil_force_seq+0x7d/0x290
[ 2213.559257] xfs_log_force+0x86/0x2d0
[ 2213.559263] xfs_log_worker+0x36/0xd0
[ 2213.559270] process_one_work+0x210/0x640
[ 2213.559279] worker_thread+0x1c7/0x3c0
[ 2213.559287] ? wq_sysfs_prep_attrs+0xa0/0xa0
[ 2213.559294] kthread+0xd2/0x100
[ 2213.559301] ? kthread_complete_and_exit+0x20/0x20
[ 2213.559309] ret_from_fork+0x2b/0x40
[ 2213.559317] ? kthread_complete_and_exit+0x20/0x20
[ 2213.559324] ret_from_fork_asm+0x11/0x20
[ 2213.559332] </TASK>
[ 2213.559361] Showing all locks held in the system:
[ 2213.559390] 2 locks held by kworker/u131:0/208:
[ 2213.559395] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559421] #1: ffffb848c08dbe58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559446] 3 locks held by kworker/u130:13/223:
[ 2213.559451] #0: ffff9aa7cc1f8158 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559474] #1: ffffb848c0953e58 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559497] #2: ffff9aa0c25400e8 (&type->s_umount_key#32){++++}-{3:3}, at: super_trylock_shared+0x11/0x50
[ 2213.559522] 1 lock held by khungtaskd/230:
[ 2213.559526] #0: ffffffff89ec2e20 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x2c/0x1d0
[ 2213.559557] 1 lock held by usb-storage/414:
[ 2213.559561] #0: ffff9aa0cb15ace8 (&us_interface_key[i]){+.+.}-{3:3}, at: usb_stor_control_thread+0x43/0x2d0
[ 2213.559591] 1 lock held by in:imklog/1997:
[ 2213.559595] #0: ffff9aa0db2258d8 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x84/0xd0
[ 2213.559620] 2 locks held by kworker/u131:3/3226:
[ 2213.559624] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559664] #1: ffffb848c47a7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559706] 2 locks held by tar/5845:
[ 2213.559710] #0: ffff9aa0c2540420 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0x6c/0xf0
[ 2213.559732] #1: ffff9aa0e16c3f58 (&sb->s_type->i_mutex_key#8){++++}-{3:3}, at: xfs_ilock+0x144/0x180
[ 2213.559789] 2 locks held by kworker/14:28/6524:
[ 2213.559793] #0: ffff9aa0da45e758 ((wq_completion)dm-integrity-writer#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559815] #1: ffffb848c64e7e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559882] 2 locks held by kworker/12:45/8171:
[ 2213.559886] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559908] #1: ffffb848d6583e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.559949] 2 locks held by kworker/12:81/8479:
[ 2213.559953] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.559979] #1: ffffb848d6ea3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560006] 2 locks held by kworker/12:98/8496:
[ 2213.560010] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560036] #1: ffffb848d6f2be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560062] 2 locks held by kworker/12:101/8499:
[ 2213.560067] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560093] #1: ffffb848d6f43e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560118] 2 locks held by kworker/12:110/8508:
[ 2213.560122] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560149] #1: ffffb848d6f8be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560175] 2 locks held by kworker/12:111/8509:
[ 2213.560180] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560206] #1: ffffb848d6f93e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560230] 2 locks held by kworker/12:112/8510:
[ 2213.560235] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560261] #1: ffffb848d6f9be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560307] 2 locks held by kworker/u131:5/9163:
[ 2213.560312] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560335] #1: ffffb848d8803e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560359] 2 locks held by kworker/u131:6/9166:
[ 2213.560364] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560387] #1: ffffb848c44c7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560429] 2 locks held by kworker/30:236/9664:
[ 2213.560433] #0: ffff9aa0e43c0b58 ((wq_completion)dm-integrity-writer#3){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560455] #1: ffffb848da42be58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.560540] 2 locks held by kworker/12:128/11574:
[ 2213.560544] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.560564] #1: ffffb848de48be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.648428] 2 locks held by kworker/12:175/11621:
[ 2213.648431] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.648443] #1: ffffb848de603e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651134] 2 locks held by kworker/12:177/11623:
[ 2213.651136] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651147] #1: ffffb848c4c47e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651158] 2 locks held by kworker/12:179/11625:
[ 2213.651159] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651170] #1: ffffb848de613e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651181] 2 locks held by kworker/12:180/11626:
[ 2213.651183] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651193] #1: ffffb848de61be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651205] 2 locks held by kworker/12:182/11628:
[ 2213.651206] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651217] #1: ffffb848de62be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651228] 2 locks held by kworker/12:184/11630:
[ 2213.651230] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651240] #1: ffffb848d4793e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651257] 2 locks held by kworker/12:236/11682:
[ 2213.651259] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651270] #1: ffffb848de7cbe58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651280] 2 locks held by kworker/12:239/11685:
[ 2213.651282] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651293] #1: ffffb848de7e3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651341] 2 locks held by kworker/25:121/12751:
[ 2213.651343] #0: ffff9aa0c8122f58 ((wq_completion)dm-integrity-writer#4){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651353] #1: ffffb848e0c13e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651425] 2 locks held by kworker/25:3/13498:
[ 2213.651426] #0: ffff9aa0c7bfe758 ((wq_completion)xfs-sync/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
[ 2213.651436] #1: ffffb848e259be58 ((work_completion)(&(&log->l_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
[ 2213.651465] =============================================
[ 2213.651467] Kernel panic - not syncing: hung_task: blocked tasks
[ 2213.652654] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[-- Attachment #4: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 17538 bytes --]
[-- Attachment #5: kconfig-6.9.x-lockdep.gz --]
[-- Type: application/gzip, Size: 28814 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
2024-06-05 18:40 Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD Zack Weinberg
@ 2024-06-05 23:05 ` Dave Chinner
2024-06-06 15:48 ` Zack Weinberg
2024-06-06 12:17 ` Roger Heflin
1 sibling, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2024-06-05 23:05 UTC (permalink / raw)
To: Zack Weinberg; +Cc: dm-devel, linux-xfs
On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
> I am experimenting with the use of dm-integrity underneath dm-raid,
> to get around the problem where, if a RAID 1 or RAID 5 array is
> inconsistent, you may not know which copy is the good one. I have found
> a reproducible hard lockup involving XFS, RAID 5 and dm-integrity.
>
> My test array consists of three spinning HDDs (each 2 decimal
> terabytes), each with dm-integrity laid directly onto the disk
> (no partition table), using SHA-256 checksums. On top of this is
> an MD-RAID array (raid5), and on top of *that* is an ordinary XFS
> filesystem.
>
> Extracting a large tar archive (970 G) into the filesystem causes a hard
> lockup -- the entire system becomes unresponsive -- after some tens of
> gigabytes have been extracted. I have reproduced the lockup using
> kernel versions 6.6.21 and 6.9.3. No error messages make it to the
> console, but with 6.9.3 I was able to extract almost all of a lockdep
> report from pstore. I don't fully understand lockdep reports, but it
> *looks* like it might be a livelock rather than a deadlock, with all
> available kworker threads so bogged down with dm-integrity chores that
> an XFS log flush is blocked long enough to hit the hung task timeout.
>
> Attached are:
>
> - what I have of the lockdep report (kernel 6.9.3) (only a couple
> of lines at the very top are missing)
> - kernel .config (6.9.3, lockdep enabled)
> - dmesg up till userspace starts (6.6.21, lockdep not enabled)
> - details of the test array configuration
>
> Please advise if there is any more information you need. I am happy to
> test patches. I'm not subscribed to either dm-devel or linux-xfs.
I don't think there's any lockup or kernel bug here - this just
looks to be a case of having a really, really slow storage setup and
everything waiting for a huge amount of IO to complete to make
forwards progress.
RAID 5 writes are slow with spinning disks. dm-integrity makes
writes even slower. If your storage array can sustain more than 50
random 4kB writes a second, I'd be very surprised. It's going to be
-very slow-.
With this in mind, untarring a TB-sized tarball is going to drive
lots of data IO and create lots of metadata. XFS will buffer most of
that metadata in memory until the journal fills, but once the
journal fills (a few seconds to a few minutes into the untar) it
will throttle metadata performance to the small random write speed
of the underlying storage....
The high speed collision with that brick wall looks like this:
Userspace stalls on writes because there are too many dirty pages
in RAM. It throttles all incoming writes, waiting for background
writeback to clean dirty pages. Data writeback requires block
allocation which requires metadata modification. Metadata
modification requires journal space reservations which block
waiting for metadata writeback IO to complete. There are hours of
metadata writeback needed to free journal space, so everything
pauses waiting for metadata IO completion.
> p.s. Incidentally, why doesn't the dm-integrity superblock record the
> checksum algorithm in use?
> # xfs_info /dev/md/sdr5p1
> meta-data=/dev/md/sdr5p1 isize=512 agcount=32, agsize=30283904 blks
> = sectsz=4096 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=1
> = reflink=1 bigtime=1 inobtcount=1 nrext64=1
> data = bsize=4096 blocks=969084928, imaxpct=5
> = sunit=128 swidth=256 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=473186, version=2
So a roughly 1.9GB journal (473186 4kB blocks) can buffer hundreds
of thousands of dirty 4kB metadata blocks with writeback pending.
Once the journal is full, however, the filesystem has to start
writing them back to make space in the journal for new incoming
changes. At this point, the filesystem will throttle incoming
metadata modifications to the rate at which it can remove dirty
metadata from the journal. i.e. it will throttle incoming
modifications to the sustained random 4kB write rate of your
storage hardware.
With at least a quarter of a million random 4kB writes pending in
the journal when it starts throttling, I'd suggest that you're
looking at several hours of waiting just to flush the journal, let
alone complete the untar process which will be generating new
metadata all the time....
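As a rough sketch of that arithmetic (the block count and IOPS figure
are the assumed numbers from above, not measurements):

```shell
# Back-of-envelope estimate of journal drain time: ~250,000 dirty 4kB
# metadata blocks pending, at an assumed ~50 random 4kB writes/sec.
blocks=250000
iops=50
secs=$(( blocks / iops ))
echo "~${secs} seconds (~$(( secs / 3600 ))h) just to drain the journal"
# ...and RAID5 read-modify-write amplification on each small write
# only makes the real figure worse.
```

That ignores the new metadata the untar keeps generating, so the real
wait is longer still.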
> [ 2213.559141] Not tainted 6.9.3-gentoo-lockdep #2
> [ 2213.559146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2213.559149] task:kworker/25:3 state:D stack:0 pid:13498 tgid:13498 ppid:2 flags:0x00004000
> [ 2213.559160] Workqueue: xfs-sync/md126p1 xfs_log_worker
> [ 2213.559169] Call Trace:
> [ 2213.559172] <TASK>
> [ 2213.559177] __schedule+0x49a/0x1900
> [ 2213.559183] ? find_held_lock+0x32/0x90
> [ 2213.559190] ? srso_return_thunk+0x5/0x5f
> [ 2213.559198] schedule+0x31/0x130
> [ 2213.559204] schedule_timeout+0x1cd/0x1e0
> [ 2213.559212] __wait_for_common+0xbc/0x1e0
> [ 2213.559218] ? usleep_range_state+0xc0/0xc0
> [ 2213.559226] __flush_workqueue+0x15f/0x470
> [ 2213.559235] ? __wait_for_common+0x4d/0x1e0
> [ 2213.559242] xlog_cil_push_now.isra.0+0x59/0xa0
> [ 2213.559249] xlog_cil_force_seq+0x7d/0x290
> [ 2213.559257] xfs_log_force+0x86/0x2d0
> [ 2213.559263] xfs_log_worker+0x36/0xd0
> [ 2213.559270] process_one_work+0x210/0x640
> [ 2213.559279] worker_thread+0x1c7/0x3c0
Yup, so that's the periodic log force that is run to help the
filesystem move to idle when nothing is happening (runs every 30s).
It's waiting on a CIL push, which writes all the current in-memory
changes to the journal.
> [ 2213.559287] ? wq_sysfs_prep_attrs+0xa0/0xa0
> [ 2213.559294] kthread+0xd2/0x100
> [ 2213.559301] ? kthread_complete_and_exit+0x20/0x20
> [ 2213.559309] ret_from_fork+0x2b/0x40
> [ 2213.559317] ? kthread_complete_and_exit+0x20/0x20
> [ 2213.559324] ret_from_fork_asm+0x11/0x20
> [ 2213.559332] </TASK>
> [ 2213.559361] Showing all locks held in the system:
> [ 2213.559390] 2 locks held by kworker/u131:0/208:
> [ 2213.559395] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559421] #1: ffffb848c08dbe58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
That's a CIL push worker, and it's likely waiting on journal space
to write the new changes into the journal. i.e. it's waiting for
metadata writeback IO completion to occur so it can make progress.
> [ 2213.559446] 3 locks held by kworker/u130:13/223:
> [ 2213.559451] #0: ffff9aa7cc1f8158 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559474] #1: ffffb848c0953e58 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.559497] #2: ffff9aa0c25400e8 (&type->s_umount_key#32){++++}-{3:3}, at: super_trylock_shared+0x11/0x50
> [ 2213.559522] 1 lock held by khungtaskd/230:
> [ 2213.559526] #0: ffffffff89ec2e20 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x2c/0x1d0
> [ 2213.559557] 1 lock held by usb-storage/414:
> [ 2213.559561] #0: ffff9aa0cb15ace8 (&us_interface_key[i]){+.+.}-{3:3}, at: usb_stor_control_thread+0x43/0x2d0
> [ 2213.559591] 1 lock held by in:imklog/1997:
> [ 2213.559595] #0: ffff9aa0db2258d8 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x84/0xd0
> [ 2213.559620] 2 locks held by kworker/u131:3/3226:
> [ 2213.559624] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559664] #1: ffffb848c47a7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
There's another CIL push worker - we can have 4 pushes in flight at
once.
> [ 2213.559706] 2 locks held by tar/5845:
> [ 2213.559710] #0: ffff9aa0c2540420 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0x6c/0xf0
> [ 2213.559732] #1: ffff9aa0e16c3f58 (&sb->s_type->i_mutex_key#8){++++}-{3:3}, at: xfs_ilock+0x144/0x180
There's the tar process, looking to be blocked during data IO,
likely on something like a timestamp update that is waiting for
journal space...
> [ 2213.559789] 2 locks held by kworker/14:28/6524:
> [ 2213.559793] #0: ffff9aa0da45e758 ((wq_completion)dm-integrity-writer#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559815] #1: ffffb848c64e7e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.559882] 2 locks held by kworker/12:45/8171:
> [ 2213.559886] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559908] #1: ffffb848d6583e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.559949] 2 locks held by kworker/12:81/8479:
> [ 2213.559953] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.559979] #1: ffffb848d6ea3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560006] 2 locks held by kworker/12:98/8496:
> [ 2213.560010] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560036] #1: ffffb848d6f2be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560062] 2 locks held by kworker/12:101/8499:
> [ 2213.560067] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560093] #1: ffffb848d6f43e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560118] 2 locks held by kworker/12:110/8508:
> [ 2213.560122] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560149] #1: ffffb848d6f8be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560175] 2 locks held by kworker/12:111/8509:
> [ 2213.560180] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560206] #1: ffffb848d6f93e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560230] 2 locks held by kworker/12:112/8510:
> [ 2213.560235] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560261] #1: ffffb848d6f9be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
A bunch of dm-integrity workers doing something.
> [ 2213.560307] 2 locks held by kworker/u131:5/9163:
> [ 2213.560312] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560335] #1: ffffb848d8803e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560359] 2 locks held by kworker/u131:6/9166:
> [ 2213.560364] #0: ffff9aa10ffe2d58 ((wq_completion)xfs-cil/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560387] #1: ffffb848c44c7e58 ((work_completion)(&ctx->push_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
And there's the other two CIL push workers with pushes in progress.
So everything is clearly backed up waiting for space to write new
commits into the journal. Hence the system is IO bound, and that's
why it suddenly goes slow.
> [ 2213.560429] 2 locks held by kworker/30:236/9664:
> [ 2213.560433] #0: ffff9aa0e43c0b58 ((wq_completion)dm-integrity-writer#3){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560455] #1: ffffb848da42be58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.560540] 2 locks held by kworker/12:128/11574:
> [ 2213.560544] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.560564] #1: ffffb848de48be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.648428] 2 locks held by kworker/12:175/11621:
> [ 2213.648431] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.648443] #1: ffffb848de603e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651134] 2 locks held by kworker/12:177/11623:
> [ 2213.651136] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651147] #1: ffffb848c4c47e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651158] 2 locks held by kworker/12:179/11625:
> [ 2213.651159] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651170] #1: ffffb848de613e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651181] 2 locks held by kworker/12:180/11626:
> [ 2213.651183] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651193] #1: ffffb848de61be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651205] 2 locks held by kworker/12:182/11628:
> [ 2213.651206] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651217] #1: ffffb848de62be58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651228] 2 locks held by kworker/12:184/11630:
> [ 2213.651230] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651240] #1: ffffb848d4793e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651257] 2 locks held by kworker/12:236/11682:
> [ 2213.651259] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651270] #1: ffffb848de7cbe58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651280] 2 locks held by kworker/12:239/11685:
> [ 2213.651282] #0: ffff9aa0da420358 ((wq_completion)dm-integrity-offload#5){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651293] #1: ffffb848de7e3e58 ((work_completion)(&dio->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
> [ 2213.651341] 2 locks held by kworker/25:121/12751:
> [ 2213.651343] #0: ffff9aa0c8122f58 ((wq_completion)dm-integrity-writer#4){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651353] #1: ffffb848e0c13e58 ((work_completion)(&ic->writer_work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
Another large bunch of dm-integrity workers doing something. It
looks to me like dm-integrity is trying to do a lot at this point in
time...
> [ 2213.651425] 2 locks held by kworker/25:3/13498:
> [ 2213.651426] #0: ffff9aa0c7bfe758 ((wq_completion)xfs-sync/md126p1){+.+.}-{0:0}, at: process_one_work+0x3cc/0x640
> [ 2213.651436] #1: ffffb848e259be58 ((work_completion)(&(&log->l_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1ca/0x640
And that's the periodic log worker that generated the log force
which triggered the hung task timeout.
> [ 2213.651465] =============================================
> [ 2213.651467] Kernel panic - not syncing: hung_task: blocked tasks
> [ 2213.652654] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
So the system has been up for just under an hour, and it's
configured to panic on warnings and/or hung tasks. If you want the
untar to complete, turn off the hung task timer or push it out so
far that it doesn't trigger (i.e. 12-24 hours). Don't expect it to
finish quickly, and when it does there's probably still hours of
metadata writeback pending which will block unmount until it's done.
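Put as commands, Dave's suggestion looks something like this (the
timeout value is an example, not a prescription; both need root):

```shell
# Either push the hung-task timeout out far enough that slow but
# progressing IO can't trip it (here: 24 hours)...
sysctl -w kernel.hung_task_timeout_secs=86400
# ...or stop a hung-task detection from panicking the box at all,
# since this kernel is configured to panic on hung tasks:
sysctl -w kernel.hung_task_panic=0
```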
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
2024-06-05 18:40 Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD Zack Weinberg
2024-06-05 23:05 ` Dave Chinner
@ 2024-06-06 12:17 ` Roger Heflin
1 sibling, 0 replies; 5+ messages in thread
From: Roger Heflin @ 2024-06-06 12:17 UTC (permalink / raw)
To: Zack Weinberg; +Cc: dm-devel, linux-xfs
grep -E 'Dirty|Write' /proc/meminfo
The above shows the amount of outstanding write data still waiting
to be processed; I would expect it to be significant.
sysctl -a 2>/dev/null | grep -iE 'dirty_ratio|dirty_bytes|dirty_background'
vm.dirty_background_bytes = 3000000
vm.dirty_background_ratio = 0
vm.dirty_bytes = 5000000
vm.dirty_ratio = 0
By default you will be using the ratio settings, and the ratio is a
percentage of RAM.
dirty_(ratio|bytes) is the high water mark (the background values
are the low water mark): the OS stops writes when the high water
mark is hit and resumes them once dirty data drops back to the low
water mark. With spinning disks, a tar file, and the default
settings, that will be GBs of dirty data requiring a significant
number of seeks, so draining from the high water mark to the low
water mark could take a really long time (many minutes, or even
hours if you have enough RAM).
In my experience, about all the default settings do is make it appear
that IO finished when it really did not (I believe the original
intent may have been to improve benchmark results when the benchmark
did not force a sync).
The downside is that at any given time a much larger amount of data
could be in RAM only and not on disk, so if a crash (hardware,
software, or power) were to happen, a lot of data could be lost.
Because of that I set these values much lower (see above). It appears
that you need a write IO cache, but it really does not need to be
huge.
The settings above will stop writes when 5MB is hit and restart at 3MB
(likely a pause of well under a second, rather than locking up the
machine for a really long time).
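The settings above can be applied at runtime for testing; a sketch
(same values as above; runtime-only, so not persisted across reboot):

```shell
# Apply the low write-cache limits suggested above (drop them into
# /etc/sysctl.d/ to make them permanent).
sysctl -w vm.dirty_background_bytes=3000000   # resume writers at ~3 MB
sysctl -w vm.dirty_bytes=5000000              # throttle writers at ~5 MB

# Watch the outstanding dirty/writeback data during the untar:
grep -E 'Dirty|Writeback' /proc/meminfo
```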
On Wed, Jun 5, 2024 at 1:41 PM Zack Weinberg <zack@owlfolio.org> wrote:
>
> I am experimenting with the use of dm-integrity underneath dm-raid,
> to get around the problem where, if a RAID 1 or RAID 5 array is
> inconsistent, you may not know which copy is the good one. I have found
> a reproducible hard lockup involving XFS, RAID 5 and dm-integrity.
>
> My test array consists of three spinning HDDs (each 2 decimal
> terabytes), each with dm-integrity laid directly onto the disk
> (no partition table), using SHA-256 checksums. On top of this is
> an MD-RAID array (raid5), and on top of *that* is an ordinary XFS
> filesystem.
>
> Extracting a large tar archive (970 G) into the filesystem causes a hard
> lockup -- the entire system becomes unresponsive -- after some tens of
> gigabytes have been extracted. I have reproduced the lockup using
> kernel versions 6.6.21 and 6.9.3. No error messages make it to the
> console, but with 6.9.3 I was able to extract almost all of a lockdep
> report from pstore. I don't fully understand lockdep reports, but it
> *looks* like it might be a livelock rather than a deadlock, with all
> available kworker threads so bogged down with dm-integrity chores that
> an XFS log flush is blocked long enough to hit the hung task timeout.
>
> Attached are:
>
> - what I have of the lockdep report (kernel 6.9.3) (only a couple
> of lines at the very top are missing)
> - kernel .config (6.9.3, lockdep enabled)
> - dmesg up till userspace starts (6.6.21, lockdep not enabled)
> - details of the test array configuration
>
> Please advise if there is any more information you need. I am happy to
> test patches. I'm not subscribed to either dm-devel or linux-xfs.
>
> zw
>
> p.s. Incidentally, why doesn't the dm-integrity superblock record the
> checksum algorithm in use?
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
2024-06-05 23:05 ` Dave Chinner
@ 2024-06-06 15:48 ` Zack Weinberg
2024-06-07 1:17 ` Dave Chinner
0 siblings, 1 reply; 5+ messages in thread
From: Zack Weinberg @ 2024-06-06 15:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: dm-devel, linux-xfs
On Wed, Jun 5, 2024, at 7:05 PM, Dave Chinner wrote:
> On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
>> I am experimenting with the use of dm-integrity underneath dm-raid,
>> to get around the problem where, if a RAID 1 or RAID 5 array is
>> inconsistent, you may not know which copy is the good one. I have
>> found a reproducible hard lockup involving XFS, RAID 5 and dm-
>> integrity.
>
> I don't think there's any lockup or kernel bug here - this just looks
> to be a case of having a really, really slow storage setup and
> everything waiting for a huge amount of IO to complete to make
> forwards progress.
...
> Userspace stalls on writes because there are too many dirty pages
> in RAM. It throttles all incoming writes, waiting for background
> writeback to clean dirty pages. Data writeback requires block
> allocation which requires metadata modification. Metadata modification
> requires journal space reservations which block waiting for metadata
> writeback IO to complete. There are hours of metadata writeback needed
> to free journal space, so everything pauses waiting for metadata IO
> completion.
This makes a lot of sense.
> RAID 5 writes are slow with spinning disks. dm-integrity makes writes
> even slower. If your storage array can sustain more than 50 random 4kB
> writes a second, I'd be very surprised. It's going to be -very slow-.
I wiped the contents of the filesystem and ran bonnie++ on it in direct
I/O mode with 4k block writes, skipping the one-character write and
small file creation tests. This is what I got:
Version 2.00 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
64G:4k::65536 15.8m 19 60.5m 26 218m 31 279.1 13
Latency 659ms 517ms 61146us 3052ms
I think this is doing seek-and-read, not seek-and-write, but 300 random
reads per second is still really damn slow compared to the sequential
performance. And it didn't lock up (with unchanged hung task timeout of
two minutes) so that also tends to confirm your hypothesis -- direct I/O
means no write backlog.
(Do you know of a good way to benchmark seek-and-write
performance, ideally directly on a block device instead of having
a filesystem present?)
I don't actually care how slow it is to write things to this array,
because (if I can ever get it working) it's meant to be archival
storage, written to only rarely. But I do need to get this tarball
unpacked, I'd prefer it if the runtime of 'tar' would correspond closely
to the actual time required to get the data all the way to stable
storage, and disabling the hung task timeout seems like a kludge.
...
> So a 1.6GB journal can buffer hundreds of thousands of dirty 4kb
> metadata blocks with writeback pending. Once the journal is full,
> however, the filesystem has to start writing them back to make space
> in the journal for new incoming changes. At this point, the filesystem
> will throttle incoming metadata modifications to the rate at which it
> can remove dirty metadata from the journal. i.e. it will throttle
> incoming modifications to the sustained random 4kB write rate of your
> storage hardware.
>
> With at least a quarter of a million random 4kB writes pending in the
> journal when it starts throttling, I'd suggest that you're looking at
> several hours of waiting just to flush the journal, let alone complete
> the untar process which will be generating new metadata all the
> time....
This reminds me of the 'bufferbloat' phenomenon over in networking land.
Would it help to reduce the size of the journal to something like 6MB,
which (assuming 50 random writes per second) would take only 30s to
flush? Is a journal that small, for a filesystem this large, likely to
cause other problems? Are there any other tuning knobs you can think of
that might restrict the rate of incoming metadata modifications from
'tar' to a sustainable level from the get-go, instead of barging ahead
and then hitting a wall? I'm inclined to doubt that VM-level writeback
controls (as suggested elsethread) will help much, since they would not
change how much data can pile up in the filesystem's journal, but I
could be wrong.
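The 30-second figure follows from simple arithmetic; a back-of-envelope
sketch assuming 4 kB metadata blocks and the 50 IOPS estimate from
earlier in the thread:

```shell
# Estimated time to flush a full journal at a given random-write rate.
journal_bytes=$((6 * 1024 * 1024))   # proposed 6 MB journal
block_bytes=4096                     # 4 kB metadata blocks
iops=50                              # sustained random 4 kB writes/sec
blocks=$((journal_bytes / block_bytes))
echo "$blocks blocks, ~$((blocks / iops)) s to flush"
# -> 1536 blocks, ~30 s to flush
```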
Thanks for your help so far.
zw
* Re: Reproducible system lockup, extracting files into XFS on dm-raid5 on dm-integrity on HDD
2024-06-06 15:48 ` Zack Weinberg
@ 2024-06-07 1:17 ` Dave Chinner
0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2024-06-07 1:17 UTC (permalink / raw)
To: Zack Weinberg; +Cc: dm-devel, linux-xfs
On Thu, Jun 06, 2024 at 11:48:57AM -0400, Zack Weinberg wrote:
> On Wed, Jun 5, 2024, at 7:05 PM, Dave Chinner wrote:
> > On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
> >> I am experimenting with the use of dm-integrity underneath dm-raid,
> >> to get around the problem where, if a RAID 1 or RAID 5 array is
> >> inconsistent, you may not know which copy is the good one. I have
> >> found a reproducible hard lockup involving XFS, RAID 5 and dm-
> >> integrity.
> >
> > I don't think there's any lockup or kernel bug here - this just looks
> > to be a case of having a really, really slow storage setup and
> > everything waiting for a huge amount of IO to complete to make
> > forwards progress.
> ...
> > Userspace stalls on writes because there are too many dirty pages
> > in RAM. It throttles all incoming writes, waiting for background
> > writeback to clean dirty pages. Data writeback requires block
> > allocation which requires metadata modification. Metadata modification
> > requires journal space reservations which block waiting for metadata
> > writeback IO to complete. There are hours of metadata writeback needed
> > to free journal space, so everything pauses waiting for metadata IO
> > completion.
>
> This makes a lot of sense.
>
> > RAID 5 writes are slow with spinning disks. dm-integrity makes writes
> > even slower. If your storage array can sustain more than 50 random 4kB
> > writes a second, I'd be very surprised. It's going to be -very slow-.
>
> I wiped the contents of the filesystem and ran bonnie++ on it in direct
Wow, that's sooooo 2000's :)
> I/O mode with 4k block writes, skipping the one-character write and
> small file creation tests. This is what I got:
>
> Version 2.00 ------Sequential Output------ --Sequential Input- --Random-
> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 64G:4k::65536 15.8m 19 60.5m 26 218m 31 279.1 13
> Latency 659ms 517ms 61146us 3052ms
>
> I think this is doing seek-and-read, not seek-and-write, but 300 random
> reads per second is still really damn slow compared to the sequential
> performance. And it didn't lock up (with unchanged hung task timeout of
> two minutes) so that also tends to confirm your hypothesis -- direct I/O
> means no write backlog.
>
> (Do you know of a good way to benchmark seek-and-write
> performance, ideally directly on a block device instead of having
> a filesystem present?)
fio.
Use it with direct=1, bs=4k, rw=randwrite and you can point it at
either a file or a block device.
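A minimal fio invocation along those lines (the device name is
illustrative, and writing to it destroys data, so only point this at a
scratch device or a throwaway file):

```shell
# Measure sustained random 4 kB write IOPS with direct IO,
# bypassing the page cache entirely.
fio --name=randwrite-test \
    --filename=/dev/md0 \
    --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=16 \
    --runtime=60 --time_based \
    --group_reporting
```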
> I don't actually care how slow it is to write things to this array,
> because (if I can ever get it working) it's meant to be archival
> storage, written to only rarely. But I do need to get this tarball
> unpacked, I'd prefer it if the runtime of 'tar' would correspond closely
> to the actual time required to get the data all the way to stable
> storage, and disabling the hung task timeout seems like a kludge.
The hung task timeout is intended to capture deadlocks that are
forever, not something that is blocking because it has to wait for
a hundred thousand IOs to complete at 50 IOPS. When you have storage
this slow and data sets this big, you have to tune these detectors
so they don't report false positives. What you are doing is so far
out of the "normal operation" window that it's no surprise that
you're getting false positive hang detections like this.
> ...
> > So a 1.6GB journal can buffer hundreds of thousands of dirty 4kb
> > metadata blocks with writeback pending. Once the journal is full,
> > however, the filesystem has to start writing them back to make space
> > in the journal for new incoming changes. At this point, the filesystem
> > will throttle incoming metadata modifications to the rate at which it
> > can remove dirty metadata from the journal. i.e. it will throttle
> > incoming modifications to the sustained random 4kB write rate of your
> > storage hardware.
> >
> > With at least a quarter of a million random 4kB writes pending in the
> > journal when it starts throttling, I'd suggest that you're looking at
> > several hours of waiting just to flush the journal, let alone complete
> > the untar process which will be generating new metadata all the
> > time....
>
> This reminds me of the 'bufferbloat' phenomenon over in networking land.
Yes, exactly.
Storage has a bandwidth delay product, just like networks, and when
you put huge buffers in front of something with low bandwidth and
long round trip latencies to try to maintain high throughput, it
generally goes really bad the moment interactivity is required.
> Would it help to reduce the size of the journal to something like 6MB,
> which (assuming 50 random writes per second) would take only 30s to
> flush?
That's taking journal sizes to the other extreme.
> Is a journal that small, for a filesystem this large, likely to
> cause other problems?
Definitely. For example, not having enough journal space to allow
aggregation of changes to the same structures in memory before they
are written to disk. This alone will increase the required journal
bandwidth for any given workload by 1-2 orders of magnitude. It will
also increase the amount of metadata writeback by a similar amount,
because the window for relogging already-dirty objects is now tiny
compared to the dataset you are creating.
IOWs, when you have low bandwidth, seek limited storage, making the
journal too small can be much worse for performance than having a
really large journal and hitting the problems you are seeing.
A current mkfs.xfs defaults to a minimum log size of 64MB - that's
probably a good choice for this particular storage setup, as it's
large enough to soak up short bursts of metadata activity, but not so
large that it pins GBs of dirty metadata in RAM that stuff will stall on.
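If the filesystem is being recreated anyway, the log size can be set
explicitly at mkfs time; a sketch (the device name is illustrative,
and mkfs.xfs destroys any existing data on the target):

```shell
# Recreate the filesystem with a modest 64 MB log, large enough to
# absorb bursts without pinning GBs of dirty metadata in RAM.
mkfs.xfs -f -l size=64m /dev/md0
```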
> Are there any other tuning knobs you can think of
> that might restrict the rate of incoming metadata modifications from
> 'tar' to a sustainable level from the get-go, instead of barging ahead
> and then hitting a wall?
> I'm inclined to doubt that VM-level writeback
> controls (as suggested elsethread) will help much, since they would not
> change how much data can pile up in the filesystem's journal, but I
> could be wrong.
No matter what you do you are going to have the workload throttled
to disk speed -somewhere-. Reducing the dirty limits on the page
cache will help in the same way that reducing the journal size will,
and that should help improve interactivity a bit. But,
fundamentally, the data set is much larger than RAM and so it
will get written at disk speed and that means worst case latencies
for anything the kernel does can be very high.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com