Stuck IOs with dm-integrity + md raid1 + dm-thin

All of lore.kernel.org
 help / color / mirror / Atom feed

* Stuck IOs with dm-integrity + md raid1 + dm-thin
@ 2023-11-29  1:55 Eric Wheeler
       [not found] ` <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>
  2024-01-11 12:34 ` Xiao Ni
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Wheeler @ 2023-11-29  1:55 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

Hi Joe,

I'm not sure who else to CC on this issue, feel free to add others.

Recently we tried putting dm-integrity on NVMe's under MD RAID1 with 
dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7).  It worked 
fine for ~12 hours and then, at ~3:30 am in the morning when we do a 
reserve_metadata_snap and thin_dump's, IO's began to hang.

For reference, we have been using the same block device stack for years, 
and it is quite stable.  We've been running Linux v6.5.7 for about 6 weeks 
without issue until we tried dm-integrity under dm-thin.  The stack 
_without_ dm-integrity looks like this:

    nvme{0,1} =>
        raid mirror /dev/md0 =>
            LVM PV for VG "data" =>
                tmeta for dm-thin

    /dev/sda (LSI-based Hardware RAID)
        bcache =>
            LVM PV for VG "data" =>
                tdata for dm-thin "pool0" =>
                    single thin volume from the pool =>
                        drbd to another host over the network

After adding dm-integrity, the stack is the same as above except that
the RAID mirror is above dm-integrity:

    nvme{0,1} =>
        dm-integrity =>
            raid mirror /dev/md0 =>
                LVM PV for VG "data" =>
                    tmeta for dm-thin

The first time we had the thindump+IO hang problem we thought that an
underlying NVMe drive was bad, so we replaced both: We rebooted and did
a `pvmove` of the tmeta volume to different NVMe+md-raid1+dm-integrity
on disks that were brand new, but had the same issue that night during
reserve_metadata_snap+thin_dump. 

Since we had the same problem on a different set of new NVMe disks, we 
rebooted, did a `pvmove` back to the original MD RAID1 mirror stack 
_without_ dm-integrity, and everything was back to normal.  Thus, I do not 
think it is a disk or tmeta corruption issue, but an obscure race or 
deadlock bug.

We tried to reproduce this in a virtual environment, but so far it has not 
triggered. In about a week when we get downtime for this system, I will 
take a copy of its thin metadata to inspect further. For now, I have not 
yet been enable to reproduce problem outside our semi-production testing 
environment.

So, this is what we see so far:

We get _thousands_ of these lines in the kernel logs when IO hangs. The 
`dm-175` device is one of the the dm-integrity targets being mirrored. 
Note the "12924 callbacks suppressed" message:

	[  +0.584816] raid1_end_read_request: 12924 callbacks suppressed
	[  +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216
	[  +3.594380] integrity_metadata: 13095 callbacks suppressed
	[  +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
	[  +0.668866] raid1_read_request: 12747 callbacks suppressed
	[  +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
	[  +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175

Here are two traces of hung tasks that the kernel reported in dmesg. The
decoded line numbers are at the bottom of this email.

It looks like dmeventd is checking metadata transaction IDs, and then
hangs. We believe that is snapshot was being made at the moment that IOs
hung. This trace only appears once:

	INFO: task dmeventd:3718459 blocked for more than 122 seconds.
	     Tainted: G            E      6.5.7 #23
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:dmeventd        state:D stack:0     pid:3718459 ppid:1      flags:0x00000002
	Call Trace:
	<TASK>
	__schedule+0x2d7/0x800
	schedule+0x5e/0xe0
	schedule_preempt_disabled+0x15/0x30
	rwsem_down_read_slowpath+0x2c4/0x5f0
	? lock_acquire+0x10b/0x120
	down_read+0xff/0x260
	dm_pool_get_metadata_transaction_id+0x24/0x60 [dm_thin_pool]
	pool_status+0x133/0x630 [dm_thin_pool]
	retrieve_status+0xc4/0x210 [dm_mod]
	table_status+0x9c/0x150 [dm_mod]
	ctl_ioctl+0x248/0x370 [dm_mod]
	? __pfx_table_status+0x10/0x10 [dm_mod]
	dm_ctl_ioctl+0xe/0x20 [dm_mod]
	__x64_sys_ioctl+0x99/0xe0
	do_syscall_64+0x3f/0xa0
	entry_SYSCALL_64_after_hwframe+0x6e/0xd8

In the next trace below, you can see that DRBD is trying to write to it's 
thin volume backing device, and thin metadata is being manipulated it by 
dm-thin (which is above md raid1 with dm-integrity), but hangs during 
`dm_thin_find_block`:

This trace appears several times for different DRBD volumes that are
backed by thin volumes in the same pool.  Except that the thin volume
differs, the traces are the same:

	INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
	     Tainted: G            E      6.5.7 #23
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:drbd_r_secServe state:D stack:0     pid:1072212 ppid:2      flags:0x00004000
	Call Trace:
	<TASK>
	__schedule+0x2d7/0x800
	schedule+0x5e/0xe0
	schedule_preempt_disabled+0x15/0x30
	rwsem_down_read_slowpath+0x2c4/0x5f0
	? dm_thin_find_block+0x32/0x90 [dm_thin_pool]
	down_read+0xff/0x260
	dm_thin_find_block+0x32/0x90 [dm_thin_pool]
	thin_bio_map.isra.0+0x169/0x3a0 [dm_thin_pool]
	__map_bio+0x49/0x1d0 [dm_mod]
	dm_split_and_process_bio+0x21c/0x580 [dm_mod]
	dm_submit_bio+0x64/0x170 [dm_mod]
	__submit_bio+0x7e/0xd0
	__submit_bio_noacct+0x90/0x200
	drbd_submit_peer_request+0x138/0x2d0 [drbd]
	receive_DataRequest+0x244/0x710 [drbd]
	? __pfx_receive_DataRequest+0x10/0x10 [drbd]
	drbdd+0x174/0x280 [drbd]
	? __pfx_drbd_thread_setup+0x10/0x10 [drbd]
	drbd_receiver+0x51/0xd0 [drbd]
	drbd_thread_setup+0xa3/0x1b0 [drbd]
	kthread+0xe9/0x120
	? __pfx_kthread+0x10/0x10
	ret_from_fork+0x34/0x50
	? __pfx_kthread+0x10/0x10
	ret_from_fork_asm+0x1b/0x30
	</TASK>

This is the call stack of the thin_dump command, its just waiting:

	# thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
	...

	# cat /proc/4016388/stack 
	[<0>] submit_bio_wait+0x7a/0xc0
	[<0>] __blkdev_direct_IO_simple+0x130/0x280
	[<0>] blkdev_read_iter+0xb8/0x150
	[<0>] do_iter_readv_writev+0x14e/0x160
	[<0>] do_iter_read+0x125/0x1c0
	[<0>] vfs_readv+0xa8/0xf0
	[<0>] __x64_sys_preadv+0xc0/0x100
	[<0>] do_syscall_64+0x3f/0xa0
	[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8



These are the line number decodes for the above traces:

trace 1:
	kernel: INFO: task dmeventd:3718459 blocked for more than 122 seconds.
	__schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710) 
	schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1)) 
	schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846) 
	rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073) 
	? lock_acquire (./include/trace/events/lock.h:24 kernel/locking/lockdep.c:5724) 
	down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522) 
	dm_pool_get_metadata_transaction_id (drivers/md/dm-thin-metadata.c:1323) dm_thin_pool
	pool_status (drivers/md/dm-thin.c:3944) dm_thin_pool
	retrieve_status (./include/linux/fortify-string.h:214 drivers/md/dm-ioctl.c:1316) dm_mod
	table_status (drivers/md/dm-ioctl.c:1707) dm_mod
	ctl_ioctl (drivers/md/dm-ioctl.c:2083) dm_mod
	? __pfx_table_status (drivers/md/dm-ioctl.c:1693) dm_mod
	dm_ctl_ioctl (drivers/md/dm-ioctl.c:2103) dm_mod
	__x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856) 
	do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) 


trace 2:
	kernel: INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
	__schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710) 
	schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1)) 
	schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846) 
	rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073) 
	? dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
	down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522) 
	dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
	thin_bio_map.isra.0 (drivers/md/dm-thin.c:2764) dm_thin_pool
	__map_bio (drivers/md/dm.c:1427) dm_mod
	dm_split_and_process_bio (drivers/md/dm.c:1729 drivers/md/dm.c:1782) dm_mod
	dm_submit_bio (drivers/md/dm.c:1835) dm_mod
	__submit_bio (block/blk-core.c:373 block/blk-core.c:603) 
	__submit_bio_noacct (./include/linux/bio.h:624 block/blk-core.c:653) 
	drbd_submit_peer_request (drivers/block/drbd/drbd_receiver.c:1723) drbd
	receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2993) drbd
	? __pfx_receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2803) drbd
	drbdd (drivers/block/drbd/drbd_receiver.c:5092) drbd
	? __pfx_drbd_thread_setup (drivers/block/drbd/drbd_main.c:310) drbd
	drbd_receiver (drivers/block/drbd/drbd_receiver.c:5561) drbd
	drbd_thread_setup (drivers/block/drbd/drbd_main.c:325) drbd
	kthread (kernel/kthread.c:389) 
	? __pfx_kthread (kernel/kthread.c:342) 
	ret_from_fork (arch/x86/kernel/process.c:151) 
	? __pfx_kthread (kernel/kthread.c:342) 
	ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 

trace 3:
	# thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
	<hung>

	# cat /proc/4016388/stack
	submit_bio_wait (block/bio.c:1370 (discriminator 1)) 
	__blkdev_direct_IO_simple (./include/linux/bio.h:495 block/fops.c:99) 
	blkdev_read_iter (block/fops.c:608) 
	do_iter_readv_writev (./include/linux/fs.h:1871) 
	do_iter_read (fs/read_write.c:795) 
	vfs_readv (fs/read_write.c:916) 
	__x64_sys_preadv (fs/read_write.c:1007 fs/read_write.c:1057 fs/read_write.c:1052 fs/read_write.c:1052) 
	do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) 



--
Eric Wheeler

^ permalink raw reply	[flat|nested] 11+ messages in thread

[parent not found: <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>]

[parent not found: <CAJ0trDbP4BFjb0KQ+t4Lh1JdVjP9sbMiwJEOXPkGEV_R95rFoQ@mail.gmail.com>]

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
       [not found]   ` <CAJ0trDbP4BFjb0KQ+t4Lh1JdVjP9sbMiwJEOXPkGEV_R95rFoQ@mail.gmail.com>
@ 2023-11-29 19:51     ` Eric Wheeler
       [not found]       ` <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-11-29 19:51 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

[-- Attachment #1: Type: text/plain, Size: 1330 bytes --]

On Wed, 29 Nov 2023, Joe Thornber wrote:
> On Wed, Nov 29, 2023 at 8:24 AM Joe Thornber <thornber@redhat.com> wrote:
>       Hi Eric,
>
> Since integrity is below thin I would say this is likely a real issue 
> with your hardware; probably causing thinp to head down the rarely used 
> error path.

While I suppose this is possible, the problem presented itself on two 
completely different NVMe mirror pairs, two NVMes of which are brand new. 
The system survived a 24 hour memory test and we have not seen any machine 
check exceptions or ECC errors, so is far as I can tell the hardware is 
working as expected.

> It looks like something is holding the rw_semaphore that protects 
> accesses to the thin_metadata. This could either be a process that is 
> blocked holding it (in which case you should be able to see it).  Or a 
> rarely used error path has omitted to release it.

Is any kernel tooling around finding what is holding the lock, or some 
kind of tracing mechanism that I could trigger while rw_semaphore is 
locked?

--
Eric Wheeler



> 
> 
> Looking at the code I think it's really unlikely that we've accidentally left it unlocked.  So could you check
> all processes to see if any are holding it please, and check
> the kernel logs to see if any processes were killed for some reason.
> 
> - Joe
> 
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

[parent not found: <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>]

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
       [not found]       ` <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>
@ 2023-11-30 19:34         ` Eric Wheeler
  2023-11-30 21:08           ` Eric Wheeler
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-11-30 19:34 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

On Thu, 30 Nov 2023, Joe Thornber wrote:
> On Wed, Nov 29, 2023 at 7:51 PM Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>       On Wed, 29 Nov 2023, Joe Thornber wrote:
>       > On Wed, Nov 29, 2023 at 8:24 AM Joe Thornber <thornber@redhat.com> wrote:
>       >       Hi Eric,
>       >
>       > Since integrity is below thin I would say this is likely a real issue
>       > with your hardware; probably causing thinp to head down the rarely used
>       > error path.
> 
>       While I suppose this is possible, the problem presented itself on two
>       completely different NVMe mirror pairs, two NVMes of which are brand new.
>       The system survived a 24 hour memory test and we have not seen any machine
>       check exceptions or ECC errors, so is far as I can tell the hardware is
>       working as expected.
> 
> 
> Have you performed any long running tests with just dm-integrity and not dm-thin?

It ran fine for about 12 hours and then IOs hung shortly after we did a 
reserve_metadata_snap.


--
Eric Wheeler

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-11-30 19:34         ` Eric Wheeler
@ 2023-11-30 21:08           ` Eric Wheeler
  2023-12-01  0:40             ` Eric Wheeler
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-11-30 21:08 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]

On Thu, 30 Nov 2023, Eric Wheeler wrote:
> On Thu, 30 Nov 2023, Joe Thornber wrote:
> > On Wed, Nov 29, 2023 at 7:51 PM Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> >       On Wed, 29 Nov 2023, Joe Thornber wrote:
> >       > On Wed, Nov 29, 2023 at 8:24 AM Joe Thornber <thornber@redhat.com> wrote:
> >       >       Hi Eric,
> >       >
> >       > Since integrity is below thin I would say this is likely a real issue
> >       > with your hardware; probably causing thinp to head down the rarely used
> >       > error path.
> > 
> >       While I suppose this is possible, the problem presented itself on two
> >       completely different NVMe mirror pairs, two NVMes of which are brand new.
> >       The system survived a 24 hour memory test and we have not seen any machine
> >       check exceptions or ECC errors, so is far as I can tell the hardware is
> >       working as expected.
> > 
> > 
> > Have you performed any long running tests with just dm-integrity and not dm-thin?
> 
> It ran fine for about 12 hours and then IOs hung shortly after we did a 
> reserve_metadata_snap.

More info:

Not only did it run for 12 hours just fine before IOs hung at 3:30am when 
we do metadata dumps, but this happened twice, on consecutive days at the 
same time of day (~3:30am), when cron does a reserve_metadata_snap for
metadata thin_dumps of inactive snapshots.

In each case we reformated dm-integrity the afternoon before, and ran an 
`integritysetup --integrity-recalculate-reset` to make sure it was clean. 
The first day we tried was on an older set of NVMe disks, but the second 
day we tried with brand new NVMe's that had never been used.

--
Eric Wheeler

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-11-30 21:08           ` Eric Wheeler
@ 2023-12-01  0:40             ` Eric Wheeler
  2023-12-01  7:37               ` Joe Thornber
  2023-12-01 18:26               ` David Teigland
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Wheeler @ 2023-12-01  0:40 UTC (permalink / raw)
  To: Joe Thornber; +Cc: dm-devel

[-- Attachment #1: Type: text/plain, Size: 4159 bytes --]

On Thu, 30 Nov 2023, Eric Wheeler wrote:
> On Thu, 30 Nov 2023, Eric Wheeler wrote:
> > On Thu, 30 Nov 2023, Joe Thornber wrote:
> > > On Wed, Nov 29, 2023 at 7:51 PM Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > > 
> > > Have you performed any long running tests with just dm-integrity and not dm-thin?
> > 
> > It ran fine for about 12 hours and then IOs hung shortly after we did a 
> > reserve_metadata_snap.
> 
> More info:
> 
> Not only did it run for 12 hours just fine before IOs hung at 3:30am when 
> we do metadata dumps, but this happened twice, on consecutive days at the 
> same time of day (~3:30am), when cron does a reserve_metadata_snap for
> metadata thin_dumps of inactive snapshots.
 
Here is a reproducer using ram disks, it hangs quickly on my test VM
with all the kernels I tested.  Here are a few example versions:

 - 6.5.7 vanilla
 - 5.14.0-362.8.1.el9_3
 - 5.14.0-162.12.1.el9_1

Watch dmesg and you'll eventually see errors like this:

	device-mapper: integrity: dm-6: Checksum failed at sector 0x17d8
	md/raid1:md9: dm-6: rescheduling sector 3912
	md/raid1:md9: redirecting sector 3912 to other mirror: dm-6
	raid1_read_request: 18950 callbacks suppressed

---------------------------------------------------------------
#!/bin/bash

# Notice: /dev/ram0 and /dev/ram1 will be wiped unconditionally.

# Configure these if you need to:
MD_DEV=/dev/md9
VGNAME=testvg
LVNAME=thin
LVSIZE=$((10 * 1024*1024*1024/512))

echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
echo "Press enter twice to continue or CTRL-C to abort."

read
read

set -x

# append disks into a linear target
linear_add()
{
	name=$1
	shift

	prevsize=0
	for vol in "$@"; do
		size=`blockdev --getsize $vol`
		echo "$prevsize $size linear $vol 0"
		prevsize=$size
	done \
		| dmsetup create $name

	echo /dev/mapper/$name
}

lvthin_add()
{
	id=$1
	lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
	echo /dev/$VGNAME/$LVNAME$id
}

lvthin_snapshot()
{
	origin=$1
	id=$2

	lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
	echo /dev/$VGNAME/$LVNAME$id
}

fio()
{
	dev=$1
	/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
}

do_reset()
{
	killall -9 fio
	lvchange -an $VGNAME
	rmdir /dev/$VGNAME
	dmsetup remove ssd
	dmsetup remove pv
	lvmdevices --deldev /dev/mapper/ssd
	lvmdevices --deldev /dev/mapper/pv
	losetup -d /dev/loop?
	mdadm --stop $MD_DEV
	integritysetup close /dev/mapper/dm-integrity0
	integritysetup close /dev/mapper/dm-integrity1
	rmmod brd
	rmmod null_blk
	echo ==== reset done
	sleep 1
}

do_init()
{
	modprobe null_blk gb=30000 bs=512

	ramsize_gb=1
	modprobe brd rd_size=$(($ramsize_gb * 1024*1024)) rd_nr=2

	idev0=/dev/ram0
	idev1=/dev/ram1

	integritysetup format $idev0 --integrity xxhash64 --batch-mode
	integritysetup format $idev1 --integrity xxhash64 --batch-mode

	integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
	integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1

	mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]

	# 1. This should be enough to trigger it:
	SSD_DEV=$MD_DEV

	# 2. If not, then wrap /dev/md9 in a linear target:
	#linear_add ssd $MD_DEV
	#SSD_DEV=/dev/mapper/ssd

	# Create a writable header for the PV meta:
	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
	loop=`losetup -f --show /tmp/pvheader`
	linear_add pv $loop /dev/nullb0

	# Create the VG
	lvmdevices --adddev $SSD_DEV
	lvmdevices --adddev /dev/mapper/pv
	vgcreate $VGNAME /dev/mapper/pv $SSD_DEV

	# Create the pool:
	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
	lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV

	# Make sure the meta volume is on the SSD (it should be already from above):
	pvmove -n meta0 /dev/mapper/pv
	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
}


while true; do
	do_reset
	do_init

	thin1=`lvthin_add 1`
	fio $thin1 &

	thin2=`lvthin_snapshot 1 2`
	fio $thin2 &

	wait

done

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-12-01  0:40             ` Eric Wheeler
@ 2023-12-01  7:37               ` Joe Thornber
  2023-12-01 18:26               ` David Teigland
  1 sibling, 0 replies; 11+ messages in thread
From: Joe Thornber @ 2023-12-01  7:37 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: dm-devel

On Fri, Dec 1, 2023 at 12:40 AM Eric Wheeler
<dm-devel@lists.ewheeler.net> wrote:
>
>
> Here is a reproducer using ram disks, it hangs quickly on my test VM
> with all the kernels I tested.  Here are a few example versions:
>
>  - 6.5.7 vanilla
>  - 5.14.0-362.8.1.el9_3
>  - 5.14.0-162.12.1.el9_1
>
> Watch dmesg and you'll eventually see errors like this:
>
>         device-mapper: integrity: dm-6: Checksum failed at sector 0x17d8
>         md/raid1:md9: dm-6: rescheduling sector 3912
>         md/raid1:md9: redirecting sector 3912 to other mirror: dm-6
>         raid1_read_request: 18950 callbacks suppressed
>

Excellent, thanks.
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-12-01  0:40             ` Eric Wheeler
  2023-12-01  7:37               ` Joe Thornber
@ 2023-12-01 18:26               ` David Teigland
  2023-12-02  2:18                 ` Eric Wheeler
  1 sibling, 1 reply; 11+ messages in thread
From: David Teigland @ 2023-12-01 18:26 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Joe Thornber, dm-devel

On Thu, Nov 30, 2023 at 04:40:00PM -0800, Eric Wheeler wrote:
> 	integritysetup format $idev0 --integrity xxhash64 --batch-mode
> 	integritysetup format $idev1 --integrity xxhash64 --batch-mode
> 
> 	integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
> 	integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
> 
> 	mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
> 
> 	# 1. This should be enough to trigger it:
> 	SSD_DEV=$MD_DEV
> 
> 	# 2. If not, then wrap /dev/md9 in a linear target:
> 	#linear_add ssd $MD_DEV
> 	#SSD_DEV=/dev/mapper/ssd

Interesting that just adding a linear layer there would have some effect.

> 	# Create a writable header for the PV meta:
> 	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> 	loop=`losetup -f --show /tmp/pvheader`
> 	linear_add pv $loop /dev/nullb0
> 
> 	# Create the VG
> 	lvmdevices --adddev $SSD_DEV
> 	lvmdevices --adddev /dev/mapper/pv
> 	vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
> 
> 	# Create the pool:
> 	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> 	lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
> 
> 	# Make sure the meta volume is on the SSD (it should be already from above):
> 	pvmove -n meta0 /dev/mapper/pv

I'd omit that pvmove if possible just in case it makes some unexpected
change.  You have more than enough layers to complicate things as it is
without pvmove adding dm-mirror to the mix.

> 	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0

It's not a bad idea to use mdraid over dm-integrity, but it would be
interesting to know if doing raid+integrity in lvm would have the same
problems. e.g.

lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $vg /dev/ram[01]
lvcreate -n pool0 -L 1T $vg /dev/nullb0
lvconvert --type thin-pool --poolmetadata meta0 $vg/pool0

Dave


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-12-01 18:26               ` David Teigland
@ 2023-12-02  2:18                 ` Eric Wheeler
  2023-12-15 23:26                   ` Eric Wheeler
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-12-02  2:18 UTC (permalink / raw)
  To: David Teigland; +Cc: Joe Thornber, dm-devel

On Fri, 1 Dec 2023, David Teigland wrote:
> On Thu, Nov 30, 2023 at 04:40:00PM -0800, Eric Wheeler wrote:
> > 	integritysetup format $idev0 --integrity xxhash64 --batch-mode
> > 	integritysetup format $idev1 --integrity xxhash64 --batch-mode
> > 
> > 	integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
> > 	integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
> > 
> > 	mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
> > 
> > 	# 1. This should be enough to trigger it:
> > 	SSD_DEV=$MD_DEV
> > 
> > 	# 2. If not, then wrap /dev/md9 in a linear target:
> > 	#linear_add ssd $MD_DEV
> > 	#SSD_DEV=/dev/mapper/ssd
> 
> Interesting that just adding a linear layer there would have some effect.

Its seemed (sometimes) to be the difference between triggering on the 
first instead of the second iteration.  Without the linear wrap, it would 
sometimes not crash until the second loop, typically on `lvchange -an 
testvg`.  Maybe the extra linear hop delayed the IO just enough to help 
with the race, but that is only speculation.

> > 	# Create a writable header for the PV meta:
> > 	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> > 	loop=`losetup -f --show /tmp/pvheader`
> > 	linear_add pv $loop /dev/nullb0
> > 
> > 	# Create the VG
> > 	lvmdevices --adddev $SSD_DEV
> > 	lvmdevices --adddev /dev/mapper/pv
> > 	vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
> > 
> > 	# Create the pool:
> > 	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> > 	lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
> > 
> > 	# Make sure the meta volume is on the SSD (it should be already from above):
> > 	pvmove -n meta0 /dev/mapper/pv
> 
> I'd omit that pvmove if possible just in case it makes some unexpected
> change.  You have more than enough layers to complicate things as it is
> without pvmove adding dm-mirror to the mix.

Good point. Since we specify the PV with `lvcreate ... $SSD_DEV`, the
pvmove isn't necessary at all.  (Specifying the PV on lvcreate was added
later, pvmove was leftover from earlier testing.)

> 
> > 	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
> 
> It's not a bad idea to use mdraid over dm-integrity, but it would be
> interesting to know if doing raid+integrity in lvm would have the same
> problems. e.g.
>
> 
> lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $vg /dev/ram[01]
> lvcreate -n pool0 -L 1T $vg /dev/nullb0
> lvconvert --type thin-pool --poolmetadata meta0 $vg/pool0

Good idea!  As it turns out, that crashes just as easily.  Here is the
script with LVM based RAID1, which is somewhat simpler:

--------------------------------------------------------------------
#!/bin/bash

# Notice: /dev/ram0 and /dev/ram1 will be wiped unconditionally.

# Configure these if you need to:
VGNAME=testvg
LVNAME=thin
LVSIZE=$((10 * 1024*1024*1024/512))

echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
echo "Press enter twice to continue or CTRL-C to abort."

read
read

set -x


# append disks into a linear target
linear_add()
{
	name=$1
	shift

	prevsize=0
	for vol in "$@"; do
		size=`blockdev --getsize $vol`
		echo "$prevsize $size linear $vol 0"
		prevsize=$size
	done \
		| dmsetup create $name

	echo /dev/mapper/$name
}

lvthin_add()
{
	id=$1
	lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
	echo /dev/$VGNAME/$LVNAME$id
}

lvthin_snapshot()
{
	origin=$1
	id=$2

	lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
	echo /dev/$VGNAME/$LVNAME$id
}

fio()
{
	dev=$1
	/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
}

do_reset()
{
	killall -9 fio
	lvchange -an $VGNAME
	rmdir /dev/$VGNAME
	dmsetup remove pv
	lvmdevices --deldev /dev/mapper/pv
	losetup -d /dev/loop?
	rmmod brd
	rmmod null_blk
	echo ==== reset done
	sleep 1
}

do_init()
{
	modprobe null_blk gb=30000 bs=512

	ramsize_gb=1
	modprobe brd rd_size=$(($ramsize_gb * 1024*1024)) rd_nr=2

	# Create a writable header for the PV meta:
	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
	loop=`losetup -f --show /tmp/pvheader`
	linear_add pv $loop /dev/nullb0

	# Create the VG
	lvmdevices --adddev /dev/ram0
	lvmdevices --adddev /dev/ram1
	lvmdevices --adddev /dev/mapper/pv
	vgcreate $VGNAME /dev/mapper/pv /dev/ram[01]

	# Create the pool:
	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
	lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $VGNAME /dev/ram[01]

	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
}


while true; do
	do_reset
	do_init

	thin1=`lvthin_add 1`
	fio $thin1 &

	thin2=`lvthin_snapshot 1 2`
	fio $thin2 &

	wait

done


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-12-02  2:18                 ` Eric Wheeler
@ 2023-12-15 23:26                   ` Eric Wheeler
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wheeler @ 2023-12-15 23:26 UTC (permalink / raw)
  To: Heinz Mauelshagen; +Cc: David Teigland, Mikulas Patocka, Joe Thornber, dm-devel

[ +cc Heinz, Mikulas ]

> > On Tue, 5 Dec 2023, Joe Thornber wrote:
> > > Hi Eric,
> > > 
> > > I just released v1.0.8 of the thinp tools:
> > > 
> > > https://github.com/jthornber/thin-provisioning-tools
> >
> > Interesting, so do these tools play a role in the kernel somehow (ie,
> > kernel rust), or is this just userspace testing?
> 
> The tools issue IO.  LVM runs thin_check during one of the lvcreate 
> calls in your reproducer script.  The way the tools were issuing io was 
> triggering the dm-integrity error. The fact that it then hung we 
> currently believe is an md kernel bug. Mikulas and Heinz are looking 
> into it.

Hi Mikulas and Heinz, 

Joe mentioned that you were looking into this.  Have you been able to 
figure out what might be causing the strange md<=>dm-integrity interaction 
in the kernel?

-Eric

> On Fri, 1 Dec 2023, David Teigland wrote:
> > On Thu, Nov 30, 2023 at 04:40:00PM -0800, Eric Wheeler wrote:
> > > 	integritysetup format $idev0 --integrity xxhash64 --batch-mode
> > > 	integritysetup format $idev1 --integrity xxhash64 --batch-mode
> > > 
> > > 	integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
> > > 	integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
> > > 
> > > 	mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
> > > 
> > > 	# 1. This should be enough to trigger it:
> > > 	SSD_DEV=$MD_DEV
> > > 
> > > 	# 2. If not, then wrap /dev/md9 in a linear target:
> > > 	#linear_add ssd $MD_DEV
> > > 	#SSD_DEV=/dev/mapper/ssd
> > 
> > Interesting that just adding a linear layer there would have some effect.
> 
> Its seemed (sometimes) to be the difference between triggering on the 
> first instead of the second iteration.  Without the linear wrap, it would 
> sometimes not crash until the second loop, typically on `lvchange -an 
> testvg`.  Maybe the extra linear hop delayed the IO just enough to help 
> with the race, but that is only speculation.
> 
> > > 	# Create a writable header for the PV meta:
> > > 	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> > > 	loop=`losetup -f --show /tmp/pvheader`
> > > 	linear_add pv $loop /dev/nullb0
> > > 
> > > 	# Create the VG
> > > 	lvmdevices --adddev $SSD_DEV
> > > 	lvmdevices --adddev /dev/mapper/pv
> > > 	vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
> > > 
> > > 	# Create the pool:
> > > 	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> > > 	lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
> > > 
> > > 	# Make sure the meta volume is on the SSD (it should be already from above):
> > > 	pvmove -n meta0 /dev/mapper/pv
> > 
> > I'd omit that pvmove if possible just in case it makes some unexpected
> > change.  You have more than enough layers to complicate things as it is
> > without pvmove adding dm-mirror to the mix.
> 
> Good point. Since we specify the PV with `lvcreate ... $SSD_DEV`, the
> pvmove isn't necessary at all.  (Specifying the PV on lvcreate was added
> later, pvmove was leftover from earlier testing.)
> 
> > 
> > > 	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
> > 
> > It's not a bad idea to use mdraid over dm-integrity, but it would be
> > interesting to know if doing raid+integrity in lvm would have the same
> > problems. e.g.
> >
> > 
> > lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $vg /dev/ram[01]
> > lvcreate -n pool0 -L 1T $vg /dev/nullb0
> > lvconvert --type thin-pool --poolmetadata meta0 $vg/pool0
> 
> Good idea!  As it turns out, that crashes just as easily.  Here is the
> script with LVM based RAID1, which is somewhat simpler:
> 
> --------------------------------------------------------------------
> #!/bin/bash
> 
> # Notice: /dev/ram0 and /dev/ram1 will be wiped unconditionally.
> 
> # Configure these if you need to:
> VGNAME=testvg
> LVNAME=thin
> LVSIZE=$((10 * 1024*1024*1024/512))
> 
> echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
> echo "Press enter twice to continue or CTRL-C to abort."
> 
> read
> read
> 
> set -x
> 
> 
> # append disks into a linear target
> linear_add()
> {
> 	name=$1
> 	shift
> 
> 	prevsize=0
> 	for vol in "$@"; do
> 		size=`blockdev --getsize $vol`
> 		echo "$prevsize $size linear $vol 0"
> 		prevsize=$size
> 	done \
> 		| dmsetup create $name
> 
> 	echo /dev/mapper/$name
> }
> 
> lvthin_add()
> {
> 	id=$1
> 	lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
> 	echo /dev/$VGNAME/$LVNAME$id
> }
> 
> lvthin_snapshot()
> {
> 	origin=$1
> 	id=$2
> 
> 	lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
> 	echo /dev/$VGNAME/$LVNAME$id
> }
> 
> fio()
> {
> 	dev=$1
> 	/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
> }
> 
> do_reset()
> {
> 	killall -9 fio
> 	lvchange -an $VGNAME
> 	rmdir /dev/$VGNAME
> 	dmsetup remove pv
> 	lvmdevices --deldev /dev/mapper/pv
> 	losetup -d /dev/loop?
> 	rmmod brd
> 	rmmod null_blk
> 	echo ==== reset done
> 	sleep 1
> }
> 
> do_init()
> {
> 	modprobe null_blk gb=30000 bs=512
> 
> 	ramsize_gb=1
> 	modprobe brd rd_size=$(($ramsize_gb * 1024*1024)) rd_nr=2
> 
> 	# Create a writable header for the PV meta:
> 	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> 	loop=`losetup -f --show /tmp/pvheader`
> 	linear_add pv $loop /dev/nullb0
> 
> 	# Create the VG
> 	lvmdevices --adddev /dev/ram0
> 	lvmdevices --adddev /dev/ram1
> 	lvmdevices --adddev /dev/mapper/pv
> 	vgcreate $VGNAME /dev/mapper/pv /dev/ram[01]
> 
> 	# Create the pool:
> 	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> 	lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $VGNAME /dev/ram[01]
> 
> 	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
> }
> 
> 
> while true; do
> 	do_reset
> 	do_init
> 
> 	thin1=`lvthin_add 1`
> 	fio $thin1 &
> 
> 	thin2=`lvthin_snapshot 1 2`
> 	fio $thin2 &
> 
> 	wait
> 
> done
> 
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2023-11-29  1:55 Stuck IOs with dm-integrity + md raid1 + dm-thin Eric Wheeler
       [not found] ` <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>
@ 2024-01-11 12:34 ` Xiao Ni
  2024-01-11 20:44   ` Eric Wheeler
  1 sibling, 1 reply; 11+ messages in thread
From: Xiao Ni @ 2024-01-11 12:34 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Joe Thornber, dm-devel, Heinz Mauelshagen, Mikulas Patocka,
	Snitzer, Mike, Jonathan Brassow

On Wed, Nov 29, 2023 at 10:10 AM Eric Wheeler
<dm-devel@lists.ewheeler.net> wrote:
>
> Hi Joe,
>
> I'm not sure who else to CC on this issue, feel free to add others.
>
> Recently we tried putting dm-integrity on NVMe's under MD RAID1 with
> dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7).  It worked
> fine for ~12 hours and then, at ~3:30 am in the morning when we do a
> reserve_metadata_snap and thin_dump's, IO's began to hang.
>
> For reference, we have been using the same block device stack for years,
> and it is quite stable.  We've been running Linux v6.5.7 for about 6 weeks
> without issue until we tried dm-integrity under dm-thin.  The stack
> _without_ dm-integrity looks like this:
>
>     nvme{0,1} =>
>         raid mirror /dev/md0 =>
>             LVM PV for VG "data" =>
>                 tmeta for dm-thin
>
>     /dev/sda (LSI-based Hardware RAID)
>         bcache =>
>             LVM PV for VG "data" =>
>                 tdata for dm-thin "pool0" =>
>                     single thin volume from the pool =>
>                         drbd to another host over the network
>
> After adding dm-integrity, the stack is the same as above except that
> the RAID mirror is above dm-integrity:
>
>     nvme{0,1} =>
>         dm-integrity =>
>             raid mirror /dev/md0 =>
>                 LVM PV for VG "data" =>
>                     tmeta for dm-thin
>
> The first time we had the thindump+IO hang problem we thought that an
> underlying NVMe drive was bad, so we replaced both: We rebooted and did
> a `pvmove` of the tmeta volume to different NVMe+md-raid1+dm-integrity
> on disks that were brand new, but had the same issue that night during
> reserve_metadata_snap+thin_dump.
>
> Since we had the same problem on a different set of new NVMe disks, we
> rebooted, did a `pvmove` back to the original MD RAID1 mirror stack
> _without_ dm-integrity, and everything was back to normal.  Thus, I do not
> think it is a disk or tmeta corruption issue, but an obscure race or
> deadlock bug.
>
> We tried to reproduce this in a virtual environment, but so far it has not
> triggered. In about a week when we get downtime for this system, I will
> take a copy of its thin metadata to inspect further. For now, I have not
> yet been enable to reproduce problem outside our semi-production testing
> environment.
>
> So, this is what we see so far:
>
> We get _thousands_ of these lines in the kernel logs when IO hangs. The
> `dm-175` device is one of the the dm-integrity targets being mirrored.
> Note the "12924 callbacks suppressed" message:
>
>         [  +0.584816] raid1_end_read_request: 12924 callbacks suppressed
>         [  +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216
>         [  +3.594380] integrity_metadata: 13095 callbacks suppressed
>         [  +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
>         [  +0.668866] raid1_read_request: 12747 callbacks suppressed
>         [  +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>         [  +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175

Hi all

The process is as follows and I add some logs
1. raid1_read_request submits bio to read data
2. raid1_end_read finds the read request fails and try to fix the read error
3. handle_read_error>fix_read_error tries to read data in 4K unit in
another disk, write to the failed disk in 4K unit and read from the
failed disk in 4K unit. fix_read_error finishes successfully.
4. handle_read_error re-read data from the failed disk. And the read
request fails again.
[ 1067.149829] device-mapper: integrity: dm-3: Checksum failed at sector 0x17d8
[ 1067.149842] md/raid1:md9: dm-3: rescheduling sector 3912, sectors 248
[ 1067.150437] md/raid1:md9: dm-3: fix read error 3912, sectors 248
[ 1067.150439] md/raid1:md9: dm-3: re read 3912, sectors 248

It looks like the bio size is suspicious. Because it reads
successfully when bio size is 4K.

Best Regards
Xiao

>
> Here are two traces of hung tasks that the kernel reported in dmesg. The
> decoded line numbers are at the bottom of this email.
>
> It looks like dmeventd is checking metadata transaction IDs, and then
> hangs. We believe that is snapshot was being made at the moment that IOs
> hung. This trace only appears once:
>
>         INFO: task dmeventd:3718459 blocked for more than 122 seconds.
>              Tainted: G            E      6.5.7 #23
>         "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>         task:dmeventd        state:D stack:0     pid:3718459 ppid:1      flags:0x00000002
>         Call Trace:
>         <TASK>
>         __schedule+0x2d7/0x800
>         schedule+0x5e/0xe0
>         schedule_preempt_disabled+0x15/0x30
>         rwsem_down_read_slowpath+0x2c4/0x5f0
>         ? lock_acquire+0x10b/0x120
>         down_read+0xff/0x260
>         dm_pool_get_metadata_transaction_id+0x24/0x60 [dm_thin_pool]
>         pool_status+0x133/0x630 [dm_thin_pool]
>         retrieve_status+0xc4/0x210 [dm_mod]
>         table_status+0x9c/0x150 [dm_mod]
>         ctl_ioctl+0x248/0x370 [dm_mod]
>         ? __pfx_table_status+0x10/0x10 [dm_mod]
>         dm_ctl_ioctl+0xe/0x20 [dm_mod]
>         __x64_sys_ioctl+0x99/0xe0
>         do_syscall_64+0x3f/0xa0
>         entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> In the next trace below, you can see that DRBD is trying to write to it's
> thin volume backing device, and thin metadata is being manipulated it by
> dm-thin (which is above md raid1 with dm-integrity), but hangs during
> `dm_thin_find_block`:
>
> This trace appears several times for different DRBD volumes that are
> backed by thin volumes in the same pool.  Except that the thin volume
> differs, the traces are the same:
>
>         INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
>              Tainted: G            E      6.5.7 #23
>         "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>         task:drbd_r_secServe state:D stack:0     pid:1072212 ppid:2      flags:0x00004000
>         Call Trace:
>         <TASK>
>         __schedule+0x2d7/0x800
>         schedule+0x5e/0xe0
>         schedule_preempt_disabled+0x15/0x30
>         rwsem_down_read_slowpath+0x2c4/0x5f0
>         ? dm_thin_find_block+0x32/0x90 [dm_thin_pool]
>         down_read+0xff/0x260
>         dm_thin_find_block+0x32/0x90 [dm_thin_pool]
>         thin_bio_map.isra.0+0x169/0x3a0 [dm_thin_pool]
>         __map_bio+0x49/0x1d0 [dm_mod]
>         dm_split_and_process_bio+0x21c/0x580 [dm_mod]
>         dm_submit_bio+0x64/0x170 [dm_mod]
>         __submit_bio+0x7e/0xd0
>         __submit_bio_noacct+0x90/0x200
>         drbd_submit_peer_request+0x138/0x2d0 [drbd]
>         receive_DataRequest+0x244/0x710 [drbd]
>         ? __pfx_receive_DataRequest+0x10/0x10 [drbd]
>         drbdd+0x174/0x280 [drbd]
>         ? __pfx_drbd_thread_setup+0x10/0x10 [drbd]
>         drbd_receiver+0x51/0xd0 [drbd]
>         drbd_thread_setup+0xa3/0x1b0 [drbd]
>         kthread+0xe9/0x120
>         ? __pfx_kthread+0x10/0x10
>         ret_from_fork+0x34/0x50
>         ? __pfx_kthread+0x10/0x10
>         ret_from_fork_asm+0x1b/0x30
>         </TASK>
>
> This is the call stack of the thin_dump command, its just waiting:
>
>         # thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
>         ...
>
>         # cat /proc/4016388/stack
>         [<0>] submit_bio_wait+0x7a/0xc0
>         [<0>] __blkdev_direct_IO_simple+0x130/0x280
>         [<0>] blkdev_read_iter+0xb8/0x150
>         [<0>] do_iter_readv_writev+0x14e/0x160
>         [<0>] do_iter_read+0x125/0x1c0
>         [<0>] vfs_readv+0xa8/0xf0
>         [<0>] __x64_sys_preadv+0xc0/0x100
>         [<0>] do_syscall_64+0x3f/0xa0
>         [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
>
>
> These are the line number decodes for the above traces:
>
> trace 1:
>         kernel: INFO: task dmeventd:3718459 blocked for more than 122 seconds.
>         __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
>         schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
>         schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
>         rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
>         ? lock_acquire (./include/trace/events/lock.h:24 kernel/locking/lockdep.c:5724)
>         down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>         dm_pool_get_metadata_transaction_id (drivers/md/dm-thin-metadata.c:1323) dm_thin_pool
>         pool_status (drivers/md/dm-thin.c:3944) dm_thin_pool
>         retrieve_status (./include/linux/fortify-string.h:214 drivers/md/dm-ioctl.c:1316) dm_mod
>         table_status (drivers/md/dm-ioctl.c:1707) dm_mod
>         ctl_ioctl (drivers/md/dm-ioctl.c:2083) dm_mod
>         ? __pfx_table_status (drivers/md/dm-ioctl.c:1693) dm_mod
>         dm_ctl_ioctl (drivers/md/dm-ioctl.c:2103) dm_mod
>         __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856)
>         do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>         entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
>
> trace 2:
>         kernel: INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
>         __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
>         schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
>         schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
>         rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
>         ? dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
>         down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>         dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
>         thin_bio_map.isra.0 (drivers/md/dm-thin.c:2764) dm_thin_pool
>         __map_bio (drivers/md/dm.c:1427) dm_mod
>         dm_split_and_process_bio (drivers/md/dm.c:1729 drivers/md/dm.c:1782) dm_mod
>         dm_submit_bio (drivers/md/dm.c:1835) dm_mod
>         __submit_bio (block/blk-core.c:373 block/blk-core.c:603)
>         __submit_bio_noacct (./include/linux/bio.h:624 block/blk-core.c:653)
>         drbd_submit_peer_request (drivers/block/drbd/drbd_receiver.c:1723) drbd
>         receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2993) drbd
>         ? __pfx_receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2803) drbd
>         drbdd (drivers/block/drbd/drbd_receiver.c:5092) drbd
>         ? __pfx_drbd_thread_setup (drivers/block/drbd/drbd_main.c:310) drbd
>         drbd_receiver (drivers/block/drbd/drbd_receiver.c:5561) drbd
>         drbd_thread_setup (drivers/block/drbd/drbd_main.c:325) drbd
>         kthread (kernel/kthread.c:389)
>         ? __pfx_kthread (kernel/kthread.c:342)
>         ret_from_fork (arch/x86/kernel/process.c:151)
>         ? __pfx_kthread (kernel/kthread.c:342)
>         ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>
> trace 3:
>         # thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
>         <hung>
>
>         # cat /proc/4016388/stack
>         submit_bio_wait (block/bio.c:1370 (discriminator 1))
>         __blkdev_direct_IO_simple (./include/linux/bio.h:495 block/fops.c:99)
>         blkdev_read_iter (block/fops.c:608)
>         do_iter_readv_writev (./include/linux/fs.h:1871)
>         do_iter_read (fs/read_write.c:795)
>         vfs_readv (fs/read_write.c:916)
>         __x64_sys_preadv (fs/read_write.c:1007 fs/read_write.c:1057 fs/read_write.c:1052 fs/read_write.c:1052)
>         do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>         entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
>
>
> --
> Eric Wheeler
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
  2024-01-11 12:34 ` Xiao Ni
@ 2024-01-11 20:44   ` Eric Wheeler
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wheeler @ 2024-01-11 20:44 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Joe Thornber, dm-devel, Heinz Mauelshagen, Mikulas Patocka,
	Snitzer, Mike, Jonathan Brassow

[-- Attachment #1: Type: text/plain, Size: 9391 bytes --]

On Thu, 11 Jan 2024, Xiao Ni wrote:
> On Wed, Nov 29, 2023 at 10:10 AM Eric Wheeler
> <dm-devel@lists.ewheeler.net> wrote:
> >
> > Hi Joe,
> >
> > I'm not sure who else to CC on this issue, feel free to add others.
> >
> > Recently we tried putting dm-integrity on NVMe's under MD RAID1 with
> > dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7).  It worked
> > fine for ~12 hours and then, at ~3:30 am in the morning when we do a
> > reserve_metadata_snap and thin_dump's, IO's began to hang.
> >
> > For reference, we have been using the same block device stack for years,
> > and it is quite stable.  We've been running Linux v6.5.7 for about 6 weeks
> > without issue until we tried dm-integrity under dm-thin.  The stack
> > _without_ dm-integrity looks like this:
> >
> >     nvme{0,1} =>
> >         raid mirror /dev/md0 =>
> >             LVM PV for VG "data" =>
> >                 tmeta for dm-thin
> >
> >     /dev/sda (LSI-based Hardware RAID)
> >         bcache =>
> >             LVM PV for VG "data" =>
> >                 tdata for dm-thin "pool0" =>
> >                     single thin volume from the pool =>
> >                         drbd to another host over the network
> >
> > After adding dm-integrity, the stack is the same as above except that
> > the RAID mirror is above dm-integrity:
> >
> >     nvme{0,1} =>
> >         dm-integrity =>
> >             raid mirror /dev/md0 =>
> >                 LVM PV for VG "data" =>
> >                     tmeta for dm-thin
> >
> > The first time we had the thindump+IO hang problem we thought that an
> > underlying NVMe drive was bad, so we replaced both: We rebooted and did
> > a `pvmove` of the tmeta volume to different NVMe+md-raid1+dm-integrity
> > on disks that were brand new, but had the same issue that night during
> > reserve_metadata_snap+thin_dump.
> >
> > Since we had the same problem on a different set of new NVMe disks, we
> > rebooted, did a `pvmove` back to the original MD RAID1 mirror stack
> > _without_ dm-integrity, and everything was back to normal.  Thus, I do not
> > think it is a disk or tmeta corruption issue, but an obscure race or
> > deadlock bug.
> >
> > We tried to reproduce this in a virtual environment, but so far it has not
> > triggered. In about a week when we get downtime for this system, I will
> > take a copy of its thin metadata to inspect further. For now, I have not
> > yet been enable to reproduce problem outside our semi-production testing
> > environment.
> >
> > So, this is what we see so far:
> >
> > We get _thousands_ of these lines in the kernel logs when IO hangs. The
> > `dm-175` device is one of the the dm-integrity targets being mirrored.
> > Note the "12924 callbacks suppressed" message:
> >
> >         [  +0.584816] raid1_end_read_request: 12924 callbacks suppressed
> >         [  +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216
> >         [  +3.594380] integrity_metadata: 13095 callbacks suppressed
> >         [  +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> >         [  +0.668866] raid1_read_request: 12747 callbacks suppressed
> >         [  +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> >         [  +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> 
> Hi all
> 
> The process is as follows and I add some logs
> 1. raid1_read_request submits bio to read data
> 2. raid1_end_read finds the read request fails and try to fix the read error
> 3. handle_read_error>fix_read_error tries to read data in 4K unit in
> another disk, write to the failed disk in 4K unit and read from the
> failed disk in 4K unit. fix_read_error finishes successfully.
> 4. handle_read_error re-read data from the failed disk. And the read
> request fails again.
> [ 1067.149829] device-mapper: integrity: dm-3: Checksum failed at sector 0x17d8
> [ 1067.149842] md/raid1:md9: dm-3: rescheduling sector 3912, sectors 248
> [ 1067.150437] md/raid1:md9: dm-3: fix read error 3912, sectors 248
> [ 1067.150439] md/raid1:md9: dm-3: re read 3912, sectors 248
> 
> It looks like the bio size is suspicious. Because it reads
> successfully when bio size is 4K.

Hi Xiao, thanks for looking into this.  

It looks like the problem can be reproduced using 4k blocks as well, but
I suppose its still possible that some 512-byte (4k-unaligned) IO is
happening somewhere in the device stack.

The script below changes the null_blk modprobe to set bs=4096 and uses
zram instead of brd because zram only supports 4k blocks:

# blockdev --getss /dev/mapper/pv /dev/zram[01]
4096
4096
4096

The script below has been added to the Jira ticket:
	https://issues.redhat.com/browse/RHEL-20912

--
Eric Wheeler

Script to trigger the bug with 4k blocks:

# cat dm-integrity-thin-lvraid-zram-4k
#!/bin/bash

# Notice: /dev/zram0 and /dev/zram1 will be wiped unconditionally.

# Configure these if you need to:
VGNAME=testvg
LVNAME=thin
LVSIZE=$((10 * 1024*1024*1024/512))

echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
echo "Press enter twice to continue or CTRL-C to abort."

read
read

set -x


# append disks into a linear target
linear_add()
{
	name=$1
	shift

	prevsize=0
	for vol in "$@"; do
		size=`blockdev --getsize $vol`
		echo "$prevsize $size linear $vol 0"
		prevsize=$size
	done \
		| dmsetup create $name

	echo /dev/mapper/$name
}

lvthin_add()
{
	id=$1
	lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
	echo /dev/$VGNAME/$LVNAME$id
}

lvthin_snapshot()
{
	origin=$1
	id=$2

	lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
	echo /dev/$VGNAME/$LVNAME$id
}

fio()
{
	dev=$1
	/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
}

do_reset()
{
	killall -9 fio
	lvchange -an $VGNAME
	rmdir /dev/$VGNAME
	dmsetup remove pv
	dmsetup remove zram0
	dmsetup remove zram1
	lvmdevices --deldev /dev/mapper/pv
	losetup -d /dev/loop?
	rmmod zram
	rmmod null_blk
	echo ==== reset done
	sleep 1
}

do_init()
{
	modprobe null_blk gb=30000 bs=4096

	modprobe zram num_devices=2
	echo 1g > /sys/block/zram0/disksize
	echo 1g > /sys/block/zram1/disksize

	# Create a writable header for the PV meta:
	dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
	loop=`losetup -f --show /tmp/pvheader`
	linear_add pv $loop /dev/nullb0

	# Create the VG
	sleep 1
	#linear_add zram0 /dev/zram0
	#linear_add zram1 /dev/zram1
	lvmdevices --adddev /dev/zram0
	lvmdevices --adddev /dev/zram1
	lvmdevices --adddev /dev/mapper/pv
	vgcreate $VGNAME /dev/mapper/pv /dev/zram[01]

	# Create the pool:
	lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
	lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $VGNAME /dev/zram[01]

	lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
}


while true; do
	do_reset
	do_init

	thin1=`lvthin_add 1`
	fio $thin1 &

	thin2=`lvthin_snapshot 1 2`
	fio $thin2 &

	wait

done

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-01-11 20:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-29  1:55 Stuck IOs with dm-integrity + md raid1 + dm-thin Eric Wheeler
     [not found] ` <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>
     [not found]   ` <CAJ0trDbP4BFjb0KQ+t4Lh1JdVjP9sbMiwJEOXPkGEV_R95rFoQ@mail.gmail.com>
2023-11-29 19:51     ` Eric Wheeler
     [not found]       ` <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>
2023-11-30 19:34         ` Eric Wheeler
2023-11-30 21:08           ` Eric Wheeler
2023-12-01  0:40             ` Eric Wheeler
2023-12-01  7:37               ` Joe Thornber
2023-12-01 18:26               ` David Teigland
2023-12-02  2:18                 ` Eric Wheeler
2023-12-15 23:26                   ` Eric Wheeler
2024-01-11 12:34 ` Xiao Ni
2024-01-11 20:44   ` Eric Wheeler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.