* Stuck IOs with dm-integrity + md raid1 + dm-thin
@ 2023-11-29 1:55 Eric Wheeler
[not found] ` <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>
2024-01-11 12:34 ` Xiao Ni
0 siblings, 2 replies; 11+ messages in thread
From: Eric Wheeler @ 2023-11-29 1:55 UTC (permalink / raw)
To: Joe Thornber; +Cc: dm-devel
Hi Joe,
I'm not sure who else to CC on this issue, feel free to add others.
Recently we tried putting dm-integrity on NVMe's under MD RAID1 with
dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7). It worked
fine for ~12 hours and then, at ~3:30 am in the morning when we do a
reserve_metadata_snap and thin_dump's, IO's began to hang.
For reference, we have been using the same block device stack for years,
and it is quite stable. We've been running Linux v6.5.7 for about 6 weeks
without issue until we tried dm-integrity under dm-thin. The stack
_without_ dm-integrity looks like this:
nvme{0,1} =>
raid mirror /dev/md0 =>
LVM PV for VG "data" =>
tmeta for dm-thin
/dev/sda (LSI-based Hardware RAID)
bcache =>
LVM PV for VG "data" =>
tdata for dm-thin "pool0" =>
single thin volume from the pool =>
drbd to another host over the network
After adding dm-integrity, the stack is the same as above except that
the RAID mirror is above dm-integrity:
nvme{0,1} =>
dm-integrity =>
raid mirror /dev/md0 =>
LVM PV for VG "data" =>
tmeta for dm-thin
The first time we had the thindump+IO hang problem we thought that an
underlying NVMe drive was bad, so we replaced both: We rebooted and did
a `pvmove` of the tmeta volume to different NVMe+md-raid1+dm-integrity
on disks that were brand new, but had the same issue that night during
reserve_metadata_snap+thin_dump.
Since we had the same problem on a different set of new NVMe disks, we
rebooted, did a `pvmove` back to the original MD RAID1 mirror stack
_without_ dm-integrity, and everything was back to normal. Thus, I do not
think it is a disk or tmeta corruption issue, but an obscure race or
deadlock bug.
We tried to reproduce this in a virtual environment, but so far it has not
triggered. In about a week when we get downtime for this system, I will
take a copy of its thin metadata to inspect further. For now, I have not
yet been enable to reproduce problem outside our semi-production testing
environment.
So, this is what we see so far:
We get _thousands_ of these lines in the kernel logs when IO hangs. The
`dm-175` device is one of the the dm-integrity targets being mirrored.
Note the "12924 callbacks suppressed" message:
[ +0.584816] raid1_end_read_request: 12924 callbacks suppressed
[ +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216
[ +3.594380] integrity_metadata: 13095 callbacks suppressed
[ +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
[ +0.668866] raid1_read_request: 12747 callbacks suppressed
[ +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
[ +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
Here are two traces of hung tasks that the kernel reported in dmesg. The
decoded line numbers are at the bottom of this email.
It looks like dmeventd is checking metadata transaction IDs, and then
hangs. We believe that is snapshot was being made at the moment that IOs
hung. This trace only appears once:
INFO: task dmeventd:3718459 blocked for more than 122 seconds.
Tainted: G E 6.5.7 #23
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:dmeventd state:D stack:0 pid:3718459 ppid:1 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x2d7/0x800
schedule+0x5e/0xe0
schedule_preempt_disabled+0x15/0x30
rwsem_down_read_slowpath+0x2c4/0x5f0
? lock_acquire+0x10b/0x120
down_read+0xff/0x260
dm_pool_get_metadata_transaction_id+0x24/0x60 [dm_thin_pool]
pool_status+0x133/0x630 [dm_thin_pool]
retrieve_status+0xc4/0x210 [dm_mod]
table_status+0x9c/0x150 [dm_mod]
ctl_ioctl+0x248/0x370 [dm_mod]
? __pfx_table_status+0x10/0x10 [dm_mod]
dm_ctl_ioctl+0xe/0x20 [dm_mod]
__x64_sys_ioctl+0x99/0xe0
do_syscall_64+0x3f/0xa0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
In the next trace below, you can see that DRBD is trying to write to it's
thin volume backing device, and thin metadata is being manipulated it by
dm-thin (which is above md raid1 with dm-integrity), but hangs during
`dm_thin_find_block`:
This trace appears several times for different DRBD volumes that are
backed by thin volumes in the same pool. Except that the thin volume
differs, the traces are the same:
INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
Tainted: G E 6.5.7 #23
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:drbd_r_secServe state:D stack:0 pid:1072212 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2d7/0x800
schedule+0x5e/0xe0
schedule_preempt_disabled+0x15/0x30
rwsem_down_read_slowpath+0x2c4/0x5f0
? dm_thin_find_block+0x32/0x90 [dm_thin_pool]
down_read+0xff/0x260
dm_thin_find_block+0x32/0x90 [dm_thin_pool]
thin_bio_map.isra.0+0x169/0x3a0 [dm_thin_pool]
__map_bio+0x49/0x1d0 [dm_mod]
dm_split_and_process_bio+0x21c/0x580 [dm_mod]
dm_submit_bio+0x64/0x170 [dm_mod]
__submit_bio+0x7e/0xd0
__submit_bio_noacct+0x90/0x200
drbd_submit_peer_request+0x138/0x2d0 [drbd]
receive_DataRequest+0x244/0x710 [drbd]
? __pfx_receive_DataRequest+0x10/0x10 [drbd]
drbdd+0x174/0x280 [drbd]
? __pfx_drbd_thread_setup+0x10/0x10 [drbd]
drbd_receiver+0x51/0xd0 [drbd]
drbd_thread_setup+0xa3/0x1b0 [drbd]
kthread+0xe9/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This is the call stack of the thin_dump command, its just waiting:
# thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
...
# cat /proc/4016388/stack
[<0>] submit_bio_wait+0x7a/0xc0
[<0>] __blkdev_direct_IO_simple+0x130/0x280
[<0>] blkdev_read_iter+0xb8/0x150
[<0>] do_iter_readv_writev+0x14e/0x160
[<0>] do_iter_read+0x125/0x1c0
[<0>] vfs_readv+0xa8/0xf0
[<0>] __x64_sys_preadv+0xc0/0x100
[<0>] do_syscall_64+0x3f/0xa0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
These are the line number decodes for the above traces:
trace 1:
kernel: INFO: task dmeventd:3718459 blocked for more than 122 seconds.
__schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
? lock_acquire (./include/trace/events/lock.h:24 kernel/locking/lockdep.c:5724)
down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
dm_pool_get_metadata_transaction_id (drivers/md/dm-thin-metadata.c:1323) dm_thin_pool
pool_status (drivers/md/dm-thin.c:3944) dm_thin_pool
retrieve_status (./include/linux/fortify-string.h:214 drivers/md/dm-ioctl.c:1316) dm_mod
table_status (drivers/md/dm-ioctl.c:1707) dm_mod
ctl_ioctl (drivers/md/dm-ioctl.c:2083) dm_mod
? __pfx_table_status (drivers/md/dm-ioctl.c:1693) dm_mod
dm_ctl_ioctl (drivers/md/dm-ioctl.c:2103) dm_mod
__x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
trace 2:
kernel: INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
__schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
? dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
thin_bio_map.isra.0 (drivers/md/dm-thin.c:2764) dm_thin_pool
__map_bio (drivers/md/dm.c:1427) dm_mod
dm_split_and_process_bio (drivers/md/dm.c:1729 drivers/md/dm.c:1782) dm_mod
dm_submit_bio (drivers/md/dm.c:1835) dm_mod
__submit_bio (block/blk-core.c:373 block/blk-core.c:603)
__submit_bio_noacct (./include/linux/bio.h:624 block/blk-core.c:653)
drbd_submit_peer_request (drivers/block/drbd/drbd_receiver.c:1723) drbd
receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2993) drbd
? __pfx_receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2803) drbd
drbdd (drivers/block/drbd/drbd_receiver.c:5092) drbd
? __pfx_drbd_thread_setup (drivers/block/drbd/drbd_main.c:310) drbd
drbd_receiver (drivers/block/drbd/drbd_receiver.c:5561) drbd
drbd_thread_setup (drivers/block/drbd/drbd_main.c:325) drbd
kthread (kernel/kthread.c:389)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork (arch/x86/kernel/process.c:151)
? __pfx_kthread (kernel/kthread.c:342)
ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
trace 3:
# thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
<hung>
# cat /proc/4016388/stack
submit_bio_wait (block/bio.c:1370 (discriminator 1))
__blkdev_direct_IO_simple (./include/linux/bio.h:495 block/fops.c:99)
blkdev_read_iter (block/fops.c:608)
do_iter_readv_writev (./include/linux/fs.h:1871)
do_iter_read (fs/read_write.c:795)
vfs_readv (fs/read_write.c:916)
__x64_sys_preadv (fs/read_write.c:1007 fs/read_write.c:1057 fs/read_write.c:1052 fs/read_write.c:1052)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
--
Eric Wheeler
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
[not found] ` <CAJ0trDbP4BFjb0KQ+t4Lh1JdVjP9sbMiwJEOXPkGEV_R95rFoQ@mail.gmail.com>
@ 2023-11-29 19:51 ` Eric Wheeler
[not found] ` <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>
0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-11-29 19:51 UTC (permalink / raw)
To: Joe Thornber; +Cc: dm-devel
[-- Attachment #1: Type: text/plain, Size: 1330 bytes --]
On Wed, 29 Nov 2023, Joe Thornber wrote:
> On Wed, Nov 29, 2023 at 8:24 AM Joe Thornber <thornber@redhat.com> wrote:
> Hi Eric,
>
> Since integrity is below thin I would say this is likely a real issue
> with your hardware; probably causing thinp to head down the rarely used
> error path.
While I suppose this is possible, the problem presented itself on two
completely different NVMe mirror pairs, two NVMes of which are brand new.
The system survived a 24 hour memory test and we have not seen any machine
check exceptions or ECC errors, so is far as I can tell the hardware is
working as expected.
> It looks like something is holding the rw_semaphore that protects
> accesses to the thin_metadata. This could either be a process that is
> blocked holding it (in which case you should be able to see it). Or a
> rarely used error path has omitted to release it.
Is any kernel tooling around finding what is holding the lock, or some
kind of tracing mechanism that I could trigger while rw_semaphore is
locked?
--
Eric Wheeler
>
>
> Looking at the code I think it's really unlikely that we've accidentally left it unlocked. So could you check
> all processes to see if any are holding it please, and check
> the kernel logs to see if any processes were killed for some reason.
>
> - Joe
>
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
[not found] ` <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>
@ 2023-11-30 19:34 ` Eric Wheeler
2023-11-30 21:08 ` Eric Wheeler
0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-11-30 19:34 UTC (permalink / raw)
To: Joe Thornber; +Cc: dm-devel
[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]
On Thu, 30 Nov 2023, Joe Thornber wrote:
> On Wed, Nov 29, 2023 at 7:51 PM Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> On Wed, 29 Nov 2023, Joe Thornber wrote:
> > On Wed, Nov 29, 2023 at 8:24 AM Joe Thornber <thornber@redhat.com> wrote:
> > Hi Eric,
> >
> > Since integrity is below thin I would say this is likely a real issue
> > with your hardware; probably causing thinp to head down the rarely used
> > error path.
>
> While I suppose this is possible, the problem presented itself on two
> completely different NVMe mirror pairs, two NVMes of which are brand new.
> The system survived a 24 hour memory test and we have not seen any machine
> check exceptions or ECC errors, so is far as I can tell the hardware is
> working as expected.
>
>
> Have you performed any long running tests with just dm-integrity and not dm-thin?
It ran fine for about 12 hours and then IOs hung shortly after we did a
reserve_metadata_snap.
--
Eric Wheeler
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-11-30 19:34 ` Eric Wheeler
@ 2023-11-30 21:08 ` Eric Wheeler
2023-12-01 0:40 ` Eric Wheeler
0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-11-30 21:08 UTC (permalink / raw)
To: Joe Thornber; +Cc: dm-devel
[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]
On Thu, 30 Nov 2023, Eric Wheeler wrote:
> On Thu, 30 Nov 2023, Joe Thornber wrote:
> > On Wed, Nov 29, 2023 at 7:51 PM Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > On Wed, 29 Nov 2023, Joe Thornber wrote:
> > > On Wed, Nov 29, 2023 at 8:24 AM Joe Thornber <thornber@redhat.com> wrote:
> > > Hi Eric,
> > >
> > > Since integrity is below thin I would say this is likely a real issue
> > > with your hardware; probably causing thinp to head down the rarely used
> > > error path.
> >
> > While I suppose this is possible, the problem presented itself on two
> > completely different NVMe mirror pairs, two NVMes of which are brand new.
> > The system survived a 24 hour memory test and we have not seen any machine
> > check exceptions or ECC errors, so is far as I can tell the hardware is
> > working as expected.
> >
> >
> > Have you performed any long running tests with just dm-integrity and not dm-thin?
>
> It ran fine for about 12 hours and then IOs hung shortly after we did a
> reserve_metadata_snap.
More info:
Not only did it run for 12 hours just fine before IOs hung at 3:30am when
we do metadata dumps, but this happened twice, on consecutive days at the
same time of day (~3:30am), when cron does a reserve_metadata_snap for
metadata thin_dumps of inactive snapshots.
In each case we reformated dm-integrity the afternoon before, and ran an
`integritysetup --integrity-recalculate-reset` to make sure it was clean.
The first day we tried was on an older set of NVMe disks, but the second
day we tried with brand new NVMe's that had never been used.
--
Eric Wheeler
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-11-30 21:08 ` Eric Wheeler
@ 2023-12-01 0:40 ` Eric Wheeler
2023-12-01 7:37 ` Joe Thornber
2023-12-01 18:26 ` David Teigland
0 siblings, 2 replies; 11+ messages in thread
From: Eric Wheeler @ 2023-12-01 0:40 UTC (permalink / raw)
To: Joe Thornber; +Cc: dm-devel
[-- Attachment #1: Type: text/plain, Size: 4159 bytes --]
On Thu, 30 Nov 2023, Eric Wheeler wrote:
> On Thu, 30 Nov 2023, Eric Wheeler wrote:
> > On Thu, 30 Nov 2023, Joe Thornber wrote:
> > > On Wed, Nov 29, 2023 at 7:51 PM Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > >
> > > Have you performed any long running tests with just dm-integrity and not dm-thin?
> >
> > It ran fine for about 12 hours and then IOs hung shortly after we did a
> > reserve_metadata_snap.
>
> More info:
>
> Not only did it run for 12 hours just fine before IOs hung at 3:30am when
> we do metadata dumps, but this happened twice, on consecutive days at the
> same time of day (~3:30am), when cron does a reserve_metadata_snap for
> metadata thin_dumps of inactive snapshots.
Here is a reproducer using ram disks, it hangs quickly on my test VM
with all the kernels I tested. Here are a few example versions:
- 6.5.7 vanilla
- 5.14.0-362.8.1.el9_3
- 5.14.0-162.12.1.el9_1
Watch dmesg and you'll eventually see errors like this:
device-mapper: integrity: dm-6: Checksum failed at sector 0x17d8
md/raid1:md9: dm-6: rescheduling sector 3912
md/raid1:md9: redirecting sector 3912 to other mirror: dm-6
raid1_read_request: 18950 callbacks suppressed
---------------------------------------------------------------
#!/bin/bash
# Notice: /dev/ram0 and /dev/ram1 will be wiped unconditionally.
# Configure these if you need to:
MD_DEV=/dev/md9
VGNAME=testvg
LVNAME=thin
LVSIZE=$((10 * 1024*1024*1024/512))
echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
echo "Press enter twice to continue or CTRL-C to abort."
read
read
set -x
# append disks into a linear target
linear_add()
{
name=$1
shift
prevsize=0
for vol in "$@"; do
size=`blockdev --getsize $vol`
echo "$prevsize $size linear $vol 0"
prevsize=$size
done \
| dmsetup create $name
echo /dev/mapper/$name
}
lvthin_add()
{
id=$1
lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
echo /dev/$VGNAME/$LVNAME$id
}
lvthin_snapshot()
{
origin=$1
id=$2
lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
echo /dev/$VGNAME/$LVNAME$id
}
fio()
{
dev=$1
/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
}
do_reset()
{
killall -9 fio
lvchange -an $VGNAME
rmdir /dev/$VGNAME
dmsetup remove ssd
dmsetup remove pv
lvmdevices --deldev /dev/mapper/ssd
lvmdevices --deldev /dev/mapper/pv
losetup -d /dev/loop?
mdadm --stop $MD_DEV
integritysetup close /dev/mapper/dm-integrity0
integritysetup close /dev/mapper/dm-integrity1
rmmod brd
rmmod null_blk
echo ==== reset done
sleep 1
}
do_init()
{
modprobe null_blk gb=30000 bs=512
ramsize_gb=1
modprobe brd rd_size=$(($ramsize_gb * 1024*1024)) rd_nr=2
idev0=/dev/ram0
idev1=/dev/ram1
integritysetup format $idev0 --integrity xxhash64 --batch-mode
integritysetup format $idev1 --integrity xxhash64 --batch-mode
integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
# 1. This should be enough to trigger it:
SSD_DEV=$MD_DEV
# 2. If not, then wrap /dev/md9 in a linear target:
#linear_add ssd $MD_DEV
#SSD_DEV=/dev/mapper/ssd
# Create a writable header for the PV meta:
dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
loop=`losetup -f --show /tmp/pvheader`
linear_add pv $loop /dev/nullb0
# Create the VG
lvmdevices --adddev $SSD_DEV
lvmdevices --adddev /dev/mapper/pv
vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
# Create the pool:
lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
# Make sure the meta volume is on the SSD (it should be already from above):
pvmove -n meta0 /dev/mapper/pv
lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
}
while true; do
do_reset
do_init
thin1=`lvthin_add 1`
fio $thin1 &
thin2=`lvthin_snapshot 1 2`
fio $thin2 &
wait
done
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-12-01 0:40 ` Eric Wheeler
@ 2023-12-01 7:37 ` Joe Thornber
2023-12-01 18:26 ` David Teigland
1 sibling, 0 replies; 11+ messages in thread
From: Joe Thornber @ 2023-12-01 7:37 UTC (permalink / raw)
To: Eric Wheeler; +Cc: dm-devel
On Fri, Dec 1, 2023 at 12:40 AM Eric Wheeler
<dm-devel@lists.ewheeler.net> wrote:
>
>
> Here is a reproducer using ram disks, it hangs quickly on my test VM
> with all the kernels I tested. Here are a few example versions:
>
> - 6.5.7 vanilla
> - 5.14.0-362.8.1.el9_3
> - 5.14.0-162.12.1.el9_1
>
> Watch dmesg and you'll eventually see errors like this:
>
> device-mapper: integrity: dm-6: Checksum failed at sector 0x17d8
> md/raid1:md9: dm-6: rescheduling sector 3912
> md/raid1:md9: redirecting sector 3912 to other mirror: dm-6
> raid1_read_request: 18950 callbacks suppressed
>
Excellent, thanks.
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-12-01 0:40 ` Eric Wheeler
2023-12-01 7:37 ` Joe Thornber
@ 2023-12-01 18:26 ` David Teigland
2023-12-02 2:18 ` Eric Wheeler
1 sibling, 1 reply; 11+ messages in thread
From: David Teigland @ 2023-12-01 18:26 UTC (permalink / raw)
To: Eric Wheeler; +Cc: Joe Thornber, dm-devel
On Thu, Nov 30, 2023 at 04:40:00PM -0800, Eric Wheeler wrote:
> integritysetup format $idev0 --integrity xxhash64 --batch-mode
> integritysetup format $idev1 --integrity xxhash64 --batch-mode
>
> integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
> integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
>
> mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
>
> # 1. This should be enough to trigger it:
> SSD_DEV=$MD_DEV
>
> # 2. If not, then wrap /dev/md9 in a linear target:
> #linear_add ssd $MD_DEV
> #SSD_DEV=/dev/mapper/ssd
Interesting that just adding a linear layer there would have some effect.
> # Create a writable header for the PV meta:
> dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> loop=`losetup -f --show /tmp/pvheader`
> linear_add pv $loop /dev/nullb0
>
> # Create the VG
> lvmdevices --adddev $SSD_DEV
> lvmdevices --adddev /dev/mapper/pv
> vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
>
> # Create the pool:
> lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
>
> # Make sure the meta volume is on the SSD (it should be already from above):
> pvmove -n meta0 /dev/mapper/pv
I'd omit that pvmove if possible just in case it makes some unexpected
change. You have more than enough layers to complicate things as it is
without pvmove adding dm-mirror to the mix.
> lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
It's not a bad idea to use mdraid over dm-integrity, but it would be
interesting to know if doing raid+integrity in lvm would have the same
problems. e.g.
lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $vg /dev/ram[01]
lvcreate -n pool0 -L 1T $vg /dev/nullb0
lvconvert --type thin-pool --poolmetadata meta0 $vg/pool0
Dave
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-12-01 18:26 ` David Teigland
@ 2023-12-02 2:18 ` Eric Wheeler
2023-12-15 23:26 ` Eric Wheeler
0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2023-12-02 2:18 UTC (permalink / raw)
To: David Teigland; +Cc: Joe Thornber, dm-devel
On Fri, 1 Dec 2023, David Teigland wrote:
> On Thu, Nov 30, 2023 at 04:40:00PM -0800, Eric Wheeler wrote:
> > integritysetup format $idev0 --integrity xxhash64 --batch-mode
> > integritysetup format $idev1 --integrity xxhash64 --batch-mode
> >
> > integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
> > integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
> >
> > mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
> >
> > # 1. This should be enough to trigger it:
> > SSD_DEV=$MD_DEV
> >
> > # 2. If not, then wrap /dev/md9 in a linear target:
> > #linear_add ssd $MD_DEV
> > #SSD_DEV=/dev/mapper/ssd
>
> Interesting that just adding a linear layer there would have some effect.
Its seemed (sometimes) to be the difference between triggering on the
first instead of the second iteration. Without the linear wrap, it would
sometimes not crash until the second loop, typically on `lvchange -an
testvg`. Maybe the extra linear hop delayed the IO just enough to help
with the race, but that is only speculation.
> > # Create a writable header for the PV meta:
> > dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> > loop=`losetup -f --show /tmp/pvheader`
> > linear_add pv $loop /dev/nullb0
> >
> > # Create the VG
> > lvmdevices --adddev $SSD_DEV
> > lvmdevices --adddev /dev/mapper/pv
> > vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
> >
> > # Create the pool:
> > lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> > lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
> >
> > # Make sure the meta volume is on the SSD (it should be already from above):
> > pvmove -n meta0 /dev/mapper/pv
>
> I'd omit that pvmove if possible just in case it makes some unexpected
> change. You have more than enough layers to complicate things as it is
> without pvmove adding dm-mirror to the mix.
Good point. Since we specify the PV with `lvcreate ... $SSD_DEV`, the
pvmove isn't necessary at all. (Specifying the PV on lvcreate was added
later, pvmove was leftover from earlier testing.)
>
> > lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
>
> It's not a bad idea to use mdraid over dm-integrity, but it would be
> interesting to know if doing raid+integrity in lvm would have the same
> problems. e.g.
>
>
> lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $vg /dev/ram[01]
> lvcreate -n pool0 -L 1T $vg /dev/nullb0
> lvconvert --type thin-pool --poolmetadata meta0 $vg/pool0
Good idea! As it turns out, that crashes just as easily. Here is the
script with LVM based RAID1, which is somewhat simpler:
--------------------------------------------------------------------
#!/bin/bash
# Notice: /dev/ram0 and /dev/ram1 will be wiped unconditionally.
# Configure these if you need to:
VGNAME=testvg
LVNAME=thin
LVSIZE=$((10 * 1024*1024*1024/512))
echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
echo "Press enter twice to continue or CTRL-C to abort."
read
read
set -x
# append disks into a linear target
linear_add()
{
name=$1
shift
prevsize=0
for vol in "$@"; do
size=`blockdev --getsize $vol`
echo "$prevsize $size linear $vol 0"
prevsize=$size
done \
| dmsetup create $name
echo /dev/mapper/$name
}
lvthin_add()
{
id=$1
lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
echo /dev/$VGNAME/$LVNAME$id
}
lvthin_snapshot()
{
origin=$1
id=$2
lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
echo /dev/$VGNAME/$LVNAME$id
}
fio()
{
dev=$1
/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
}
do_reset()
{
killall -9 fio
lvchange -an $VGNAME
rmdir /dev/$VGNAME
dmsetup remove pv
lvmdevices --deldev /dev/mapper/pv
losetup -d /dev/loop?
rmmod brd
rmmod null_blk
echo ==== reset done
sleep 1
}
do_init()
{
modprobe null_blk gb=30000 bs=512
ramsize_gb=1
modprobe brd rd_size=$(($ramsize_gb * 1024*1024)) rd_nr=2
# Create a writable header for the PV meta:
dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
loop=`losetup -f --show /tmp/pvheader`
linear_add pv $loop /dev/nullb0
# Create the VG
lvmdevices --adddev /dev/ram0
lvmdevices --adddev /dev/ram1
lvmdevices --adddev /dev/mapper/pv
vgcreate $VGNAME /dev/mapper/pv /dev/ram[01]
# Create the pool:
lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $VGNAME /dev/ram[01]
lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
}
while true; do
do_reset
do_init
thin1=`lvthin_add 1`
fio $thin1 &
thin2=`lvthin_snapshot 1 2`
fio $thin2 &
wait
done
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-12-02 2:18 ` Eric Wheeler
@ 2023-12-15 23:26 ` Eric Wheeler
0 siblings, 0 replies; 11+ messages in thread
From: Eric Wheeler @ 2023-12-15 23:26 UTC (permalink / raw)
To: Heinz Mauelshagen; +Cc: David Teigland, Mikulas Patocka, Joe Thornber, dm-devel
[ +cc Heinz, Mikulas ]
> > On Tue, 5 Dec 2023, Joe Thornber wrote:
> > > Hi Eric,
> > >
> > > I just released v1.0.8 of the thinp tools:
> > >
> > > https://github.com/jthornber/thin-provisioning-tools
> >
> > Interesting, so do these tools play a role in the kernel somehow (ie,
> > kernel rust), or is this just userspace testing?
>
> The tools issue IO. LVM runs thin_check during one of the lvcreate
> calls in your reproducer script. The way the tools were issuing io was
> triggering the dm-integrity error. The fact that it then hung we
> currently believe is an md kernel bug. Mikulas and Heinz are looking
> into it.
Hi Mikulas and Heinz,
Joe mentioned that you were looking into this. Have you been able to
figure out what might be causing the strange md<=>dm-integrity interaction
in the kernel?
-Eric
> On Fri, 1 Dec 2023, David Teigland wrote:
> > On Thu, Nov 30, 2023 at 04:40:00PM -0800, Eric Wheeler wrote:
> > > integritysetup format $idev0 --integrity xxhash64 --batch-mode
> > > integritysetup format $idev1 --integrity xxhash64 --batch-mode
> > >
> > > integritysetup open --integrity xxhash64 --allow-discards $idev0 dm-integrity0
> > > integritysetup open --integrity xxhash64 --allow-discards $idev1 dm-integrity1
> > >
> > > mdadm --create $MD_DEV --metadata=1.2 --assume-clean --level=1 --raid-devices=2 /dev/mapper/dm-integrity[01]
> > >
> > > # 1. This should be enough to trigger it:
> > > SSD_DEV=$MD_DEV
> > >
> > > # 2. If not, then wrap /dev/md9 in a linear target:
> > > #linear_add ssd $MD_DEV
> > > #SSD_DEV=/dev/mapper/ssd
> >
> > Interesting that just adding a linear layer there would have some effect.
>
> Its seemed (sometimes) to be the difference between triggering on the
> first instead of the second iteration. Without the linear wrap, it would
> sometimes not crash until the second loop, typically on `lvchange -an
> testvg`. Maybe the extra linear hop delayed the IO just enough to help
> with the race, but that is only speculation.
>
> > > # Create a writable header for the PV meta:
> > > dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> > > loop=`losetup -f --show /tmp/pvheader`
> > > linear_add pv $loop /dev/nullb0
> > >
> > > # Create the VG
> > > lvmdevices --adddev $SSD_DEV
> > > lvmdevices --adddev /dev/mapper/pv
> > > vgcreate $VGNAME /dev/mapper/pv $SSD_DEV
> > >
> > > # Create the pool:
> > > lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> > > lvcreate -n meta0 -L 512m $VGNAME $SSD_DEV
> > >
> > > # Make sure the meta volume is on the SSD (it should be already from above):
> > > pvmove -n meta0 /dev/mapper/pv
> >
> > I'd omit that pvmove if possible just in case it makes some unexpected
> > change. You have more than enough layers to complicate things as it is
> > without pvmove adding dm-mirror to the mix.
>
> Good point. Since we specify the PV with `lvcreate ... $SSD_DEV`, the
> pvmove isn't necessary at all. (Specifying the PV on lvcreate was added
> later, pvmove was leftover from earlier testing.)
>
> >
> > > lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
> >
> > It's not a bad idea to use mdraid over dm-integrity, but it would be
> > interesting to know if doing raid+integrity in lvm would have the same
> > problems. e.g.
> >
> >
> > lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $vg /dev/ram[01]
> > lvcreate -n pool0 -L 1T $vg /dev/nullb0
> > lvconvert --type thin-pool --poolmetadata meta0 $vg/pool0
>
> Good idea! As it turns out, that crashes just as easily. Here is the
> script with LVM based RAID1, which is somewhat simpler:
>
> --------------------------------------------------------------------
> #!/bin/bash
>
> # Notice: /dev/ram0 and /dev/ram1 will be wiped unconditionally.
>
> # Configure these if you need to:
> VGNAME=testvg
> LVNAME=thin
> LVSIZE=$((10 * 1024*1024*1024/512))
>
> echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
> echo "Press enter twice to continue or CTRL-C to abort."
>
> read
> read
>
> set -x
>
>
> # append disks into a linear target
> linear_add()
> {
> name=$1
> shift
>
> prevsize=0
> for vol in "$@"; do
> size=`blockdev --getsize $vol`
> echo "$prevsize $size linear $vol 0"
> prevsize=$size
> done \
> | dmsetup create $name
>
> echo /dev/mapper/$name
> }
>
> lvthin_add()
> {
> id=$1
> lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
> echo /dev/$VGNAME/$LVNAME$id
> }
>
> lvthin_snapshot()
> {
> origin=$1
> id=$2
>
> lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
> echo /dev/$VGNAME/$LVNAME$id
> }
>
> fio()
> {
> dev=$1
> /bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
> }
>
> do_reset()
> {
> killall -9 fio
> lvchange -an $VGNAME
> rmdir /dev/$VGNAME
> dmsetup remove pv
> lvmdevices --deldev /dev/mapper/pv
> losetup -d /dev/loop?
> rmmod brd
> rmmod null_blk
> echo ==== reset done
> sleep 1
> }
>
> do_init()
> {
> modprobe null_blk gb=30000 bs=512
>
> ramsize_gb=1
> modprobe brd rd_size=$(($ramsize_gb * 1024*1024)) rd_nr=2
>
> # Create a writable header for the PV meta:
> dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
> loop=`losetup -f --show /tmp/pvheader`
> linear_add pv $loop /dev/nullb0
>
> # Create the VG
> lvmdevices --adddev /dev/ram0
> lvmdevices --adddev /dev/ram1
> lvmdevices --adddev /dev/mapper/pv
> vgcreate $VGNAME /dev/mapper/pv /dev/ram[01]
>
> # Create the pool:
> lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
> lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $VGNAME /dev/ram[01]
>
> lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
> }
>
>
> while true; do
> do_reset
> do_init
>
> thin1=`lvthin_add 1`
> fio $thin1 &
>
> thin2=`lvthin_snapshot 1 2`
> fio $thin2 &
>
> wait
>
> done
>
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2023-11-29 1:55 Stuck IOs with dm-integrity + md raid1 + dm-thin Eric Wheeler
[not found] ` <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>
@ 2024-01-11 12:34 ` Xiao Ni
2024-01-11 20:44 ` Eric Wheeler
1 sibling, 1 reply; 11+ messages in thread
From: Xiao Ni @ 2024-01-11 12:34 UTC (permalink / raw)
To: Eric Wheeler
Cc: Joe Thornber, dm-devel, Heinz Mauelshagen, Mikulas Patocka,
Snitzer, Mike, Jonathan Brassow
On Wed, Nov 29, 2023 at 10:10 AM Eric Wheeler
<dm-devel@lists.ewheeler.net> wrote:
>
> Hi Joe,
>
> I'm not sure who else to CC on this issue, feel free to add others.
>
> Recently we tried putting dm-integrity on NVMe's under MD RAID1 with
> dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7). It worked
> fine for ~12 hours and then, at ~3:30 am in the morning when we do a
> reserve_metadata_snap and thin_dump's, IO's began to hang.
>
> For reference, we have been using the same block device stack for years,
> and it is quite stable. We've been running Linux v6.5.7 for about 6 weeks
> without issue until we tried dm-integrity under dm-thin. The stack
> _without_ dm-integrity looks like this:
>
> nvme{0,1} =>
> raid mirror /dev/md0 =>
> LVM PV for VG "data" =>
> tmeta for dm-thin
>
> /dev/sda (LSI-based Hardware RAID)
> bcache =>
> LVM PV for VG "data" =>
> tdata for dm-thin "pool0" =>
> single thin volume from the pool =>
> drbd to another host over the network
>
> After adding dm-integrity, the stack is the same as above except that
> the RAID mirror is above dm-integrity:
>
> nvme{0,1} =>
> dm-integrity =>
> raid mirror /dev/md0 =>
> LVM PV for VG "data" =>
> tmeta for dm-thin
>
> The first time we had the thindump+IO hang problem we thought that an
> underlying NVMe drive was bad, so we replaced both: We rebooted and did
> a `pvmove` of the tmeta volume to different NVMe+md-raid1+dm-integrity
> on disks that were brand new, but had the same issue that night during
> reserve_metadata_snap+thin_dump.
>
> Since we had the same problem on a different set of new NVMe disks, we
> rebooted, did a `pvmove` back to the original MD RAID1 mirror stack
> _without_ dm-integrity, and everything was back to normal. Thus, I do not
> think it is a disk or tmeta corruption issue, but an obscure race or
> deadlock bug.
>
> We tried to reproduce this in a virtual environment, but so far it has not
> triggered. In about a week when we get downtime for this system, I will
> take a copy of its thin metadata to inspect further. For now, I have not
> yet been enable to reproduce problem outside our semi-production testing
> environment.
>
> So, this is what we see so far:
>
> We get _thousands_ of these lines in the kernel logs when IO hangs. The
> `dm-175` device is one of the the dm-integrity targets being mirrored.
> Note the "12924 callbacks suppressed" message:
>
> [ +0.584816] raid1_end_read_request: 12924 callbacks suppressed
> [ +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216
> [ +3.594380] integrity_metadata: 13095 callbacks suppressed
> [ +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> [ +0.668866] raid1_read_request: 12747 callbacks suppressed
> [ +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> [ +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
Hi all
The process is as follows and I add some logs
1. raid1_read_request submits bio to read data
2. raid1_end_read finds the read request fails and try to fix the read error
3. handle_read_error>fix_read_error tries to read data in 4K unit in
another disk, write to the failed disk in 4K unit and read from the
failed disk in 4K unit. fix_read_error finishes successfully.
4. handle_read_error re-read data from the failed disk. And the read
request fails again.
[ 1067.149829] device-mapper: integrity: dm-3: Checksum failed at sector 0x17d8
[ 1067.149842] md/raid1:md9: dm-3: rescheduling sector 3912, sectors 248
[ 1067.150437] md/raid1:md9: dm-3: fix read error 3912, sectors 248
[ 1067.150439] md/raid1:md9: dm-3: re read 3912, sectors 248
It looks like the bio size is suspicious. Because it reads
successfully when bio size is 4K.
Best Regards
Xiao
>
> Here are two traces of hung tasks that the kernel reported in dmesg. The
> decoded line numbers are at the bottom of this email.
>
> It looks like dmeventd is checking metadata transaction IDs, and then
> hangs. We believe that is snapshot was being made at the moment that IOs
> hung. This trace only appears once:
>
> INFO: task dmeventd:3718459 blocked for more than 122 seconds.
> Tainted: G E 6.5.7 #23
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:dmeventd state:D stack:0 pid:3718459 ppid:1 flags:0x00000002
> Call Trace:
> <TASK>
> __schedule+0x2d7/0x800
> schedule+0x5e/0xe0
> schedule_preempt_disabled+0x15/0x30
> rwsem_down_read_slowpath+0x2c4/0x5f0
> ? lock_acquire+0x10b/0x120
> down_read+0xff/0x260
> dm_pool_get_metadata_transaction_id+0x24/0x60 [dm_thin_pool]
> pool_status+0x133/0x630 [dm_thin_pool]
> retrieve_status+0xc4/0x210 [dm_mod]
> table_status+0x9c/0x150 [dm_mod]
> ctl_ioctl+0x248/0x370 [dm_mod]
> ? __pfx_table_status+0x10/0x10 [dm_mod]
> dm_ctl_ioctl+0xe/0x20 [dm_mod]
> __x64_sys_ioctl+0x99/0xe0
> do_syscall_64+0x3f/0xa0
> entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> In the next trace below, you can see that DRBD is trying to write to it's
> thin volume backing device, and thin metadata is being manipulated it by
> dm-thin (which is above md raid1 with dm-integrity), but hangs during
> `dm_thin_find_block`:
>
> This trace appears several times for different DRBD volumes that are
> backed by thin volumes in the same pool. Except that the thin volume
> differs, the traces are the same:
>
> INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
> Tainted: G E 6.5.7 #23
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:drbd_r_secServe state:D stack:0 pid:1072212 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x2d7/0x800
> schedule+0x5e/0xe0
> schedule_preempt_disabled+0x15/0x30
> rwsem_down_read_slowpath+0x2c4/0x5f0
> ? dm_thin_find_block+0x32/0x90 [dm_thin_pool]
> down_read+0xff/0x260
> dm_thin_find_block+0x32/0x90 [dm_thin_pool]
> thin_bio_map.isra.0+0x169/0x3a0 [dm_thin_pool]
> __map_bio+0x49/0x1d0 [dm_mod]
> dm_split_and_process_bio+0x21c/0x580 [dm_mod]
> dm_submit_bio+0x64/0x170 [dm_mod]
> __submit_bio+0x7e/0xd0
> __submit_bio_noacct+0x90/0x200
> drbd_submit_peer_request+0x138/0x2d0 [drbd]
> receive_DataRequest+0x244/0x710 [drbd]
> ? __pfx_receive_DataRequest+0x10/0x10 [drbd]
> drbdd+0x174/0x280 [drbd]
> ? __pfx_drbd_thread_setup+0x10/0x10 [drbd]
> drbd_receiver+0x51/0xd0 [drbd]
> drbd_thread_setup+0xa3/0x1b0 [drbd]
> kthread+0xe9/0x120
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x34/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
>
> This is the call stack of the thin_dump command, its just waiting:
>
> # thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
> ...
>
> # cat /proc/4016388/stack
> [<0>] submit_bio_wait+0x7a/0xc0
> [<0>] __blkdev_direct_IO_simple+0x130/0x280
> [<0>] blkdev_read_iter+0xb8/0x150
> [<0>] do_iter_readv_writev+0x14e/0x160
> [<0>] do_iter_read+0x125/0x1c0
> [<0>] vfs_readv+0xa8/0xf0
> [<0>] __x64_sys_preadv+0xc0/0x100
> [<0>] do_syscall_64+0x3f/0xa0
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
>
>
> These are the line number decodes for the above traces:
>
> trace 1:
> kernel: INFO: task dmeventd:3718459 blocked for more than 122 seconds.
> __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
> ? lock_acquire (./include/trace/events/lock.h:24 kernel/locking/lockdep.c:5724)
> down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
> dm_pool_get_metadata_transaction_id (drivers/md/dm-thin-metadata.c:1323) dm_thin_pool
> pool_status (drivers/md/dm-thin.c:3944) dm_thin_pool
> retrieve_status (./include/linux/fortify-string.h:214 drivers/md/dm-ioctl.c:1316) dm_mod
> table_status (drivers/md/dm-ioctl.c:1707) dm_mod
> ctl_ioctl (drivers/md/dm-ioctl.c:2083) dm_mod
> ? __pfx_table_status (drivers/md/dm-ioctl.c:1693) dm_mod
> dm_ctl_ioctl (drivers/md/dm-ioctl.c:2103) dm_mod
> __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
>
> trace 2:
> kernel: INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
> __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
> ? dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
> down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
> dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
> thin_bio_map.isra.0 (drivers/md/dm-thin.c:2764) dm_thin_pool
> __map_bio (drivers/md/dm.c:1427) dm_mod
> dm_split_and_process_bio (drivers/md/dm.c:1729 drivers/md/dm.c:1782) dm_mod
> dm_submit_bio (drivers/md/dm.c:1835) dm_mod
> __submit_bio (block/blk-core.c:373 block/blk-core.c:603)
> __submit_bio_noacct (./include/linux/bio.h:624 block/blk-core.c:653)
> drbd_submit_peer_request (drivers/block/drbd/drbd_receiver.c:1723) drbd
> receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2993) drbd
> ? __pfx_receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2803) drbd
> drbdd (drivers/block/drbd/drbd_receiver.c:5092) drbd
> ? __pfx_drbd_thread_setup (drivers/block/drbd/drbd_main.c:310) drbd
> drbd_receiver (drivers/block/drbd/drbd_receiver.c:5561) drbd
> drbd_thread_setup (drivers/block/drbd/drbd_main.c:325) drbd
> kthread (kernel/kthread.c:389)
> ? __pfx_kthread (kernel/kthread.c:342)
> ret_from_fork (arch/x86/kernel/process.c:151)
> ? __pfx_kthread (kernel/kthread.c:342)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>
> trace 3:
> # thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
> <hung>
>
> # cat /proc/4016388/stack
> submit_bio_wait (block/bio.c:1370 (discriminator 1))
> __blkdev_direct_IO_simple (./include/linux/bio.h:495 block/fops.c:99)
> blkdev_read_iter (block/fops.c:608)
> do_iter_readv_writev (./include/linux/fs.h:1871)
> do_iter_read (fs/read_write.c:795)
> vfs_readv (fs/read_write.c:916)
> __x64_sys_preadv (fs/read_write.c:1007 fs/read_write.c:1057 fs/read_write.c:1052 fs/read_write.c:1052)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
>
>
> --
> Eric Wheeler
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Stuck IOs with dm-integrity + md raid1 + dm-thin
2024-01-11 12:34 ` Xiao Ni
@ 2024-01-11 20:44 ` Eric Wheeler
0 siblings, 0 replies; 11+ messages in thread
From: Eric Wheeler @ 2024-01-11 20:44 UTC (permalink / raw)
To: Xiao Ni
Cc: Joe Thornber, dm-devel, Heinz Mauelshagen, Mikulas Patocka,
Snitzer, Mike, Jonathan Brassow
[-- Attachment #1: Type: text/plain, Size: 9391 bytes --]
On Thu, 11 Jan 2024, Xiao Ni wrote:
> On Wed, Nov 29, 2023 at 10:10 AM Eric Wheeler
> <dm-devel@lists.ewheeler.net> wrote:
> >
> > Hi Joe,
> >
> > I'm not sure who else to CC on this issue, feel free to add others.
> >
> > Recently we tried putting dm-integrity on NVMe's under MD RAID1 with
> > dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7). It worked
> > fine for ~12 hours and then, at ~3:30 am in the morning when we do a
> > reserve_metadata_snap and thin_dump's, IO's began to hang.
> >
> > For reference, we have been using the same block device stack for years,
> > and it is quite stable. We've been running Linux v6.5.7 for about 6 weeks
> > without issue until we tried dm-integrity under dm-thin. The stack
> > _without_ dm-integrity looks like this:
> >
> > nvme{0,1} =>
> > raid mirror /dev/md0 =>
> > LVM PV for VG "data" =>
> > tmeta for dm-thin
> >
> > /dev/sda (LSI-based Hardware RAID)
> > bcache =>
> > LVM PV for VG "data" =>
> > tdata for dm-thin "pool0" =>
> > single thin volume from the pool =>
> > drbd to another host over the network
> >
> > After adding dm-integrity, the stack is the same as above except that
> > the RAID mirror is above dm-integrity:
> >
> > nvme{0,1} =>
> > dm-integrity =>
> > raid mirror /dev/md0 =>
> > LVM PV for VG "data" =>
> > tmeta for dm-thin
> >
> > The first time we had the thindump+IO hang problem we thought that an
> > underlying NVMe drive was bad, so we replaced both: We rebooted and did
> > a `pvmove` of the tmeta volume to different NVMe+md-raid1+dm-integrity
> > on disks that were brand new, but had the same issue that night during
> > reserve_metadata_snap+thin_dump.
> >
> > Since we had the same problem on a different set of new NVMe disks, we
> > rebooted, did a `pvmove` back to the original MD RAID1 mirror stack
> > _without_ dm-integrity, and everything was back to normal. Thus, I do not
> > think it is a disk or tmeta corruption issue, but an obscure race or
> > deadlock bug.
> >
> > We tried to reproduce this in a virtual environment, but so far it has not
> > triggered. In about a week when we get downtime for this system, I will
> > take a copy of its thin metadata to inspect further. For now, I have not
> > yet been enable to reproduce problem outside our semi-production testing
> > environment.
> >
> > So, this is what we see so far:
> >
> > We get _thousands_ of these lines in the kernel logs when IO hangs. The
> > `dm-175` device is one of the the dm-integrity targets being mirrored.
> > Note the "12924 callbacks suppressed" message:
> >
> > [ +0.584816] raid1_end_read_request: 12924 callbacks suppressed
> > [ +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216
> > [ +3.594380] integrity_metadata: 13095 callbacks suppressed
> > [ +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0
> > [ +0.668866] raid1_read_request: 12747 callbacks suppressed
> > [ +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
> > [ +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175
>
> Hi all
>
> The process is as follows and I add some logs
> 1. raid1_read_request submits bio to read data
> 2. raid1_end_read finds the read request fails and try to fix the read error
> 3. handle_read_error>fix_read_error tries to read data in 4K unit in
> another disk, write to the failed disk in 4K unit and read from the
> failed disk in 4K unit. fix_read_error finishes successfully.
> 4. handle_read_error re-read data from the failed disk. And the read
> request fails again.
> [ 1067.149829] device-mapper: integrity: dm-3: Checksum failed at sector 0x17d8
> [ 1067.149842] md/raid1:md9: dm-3: rescheduling sector 3912, sectors 248
> [ 1067.150437] md/raid1:md9: dm-3: fix read error 3912, sectors 248
> [ 1067.150439] md/raid1:md9: dm-3: re read 3912, sectors 248
>
> It looks like the bio size is suspicious. Because it reads
> successfully when bio size is 4K.
Hi Xiao, thanks for looking into this.
It looks like the problem can be reproduced using 4k blocks as well, but
I suppose its still possible that some 512-byte (4k-unaligned) IO is
happening somewhere in the device stack.
The script below changes the null_blk modprobe to set bs=4096 and uses
zram instead of brd because zram only supports 4k blocks:
# blockdev --getss /dev/mapper/pv /dev/zram[01]
4096
4096
4096
The script below has been added to the Jira ticket:
https://issues.redhat.com/browse/RHEL-20912
--
Eric Wheeler
Script to trigger the bug with 4k blocks:
# cat dm-integrity-thin-lvraid-zram-4k
#!/bin/bash
# Notice: /dev/zram0 and /dev/zram1 will be wiped unconditionally.
# Configure these if you need to:
VGNAME=testvg
LVNAME=thin
LVSIZE=$((10 * 1024*1024*1024/512))
echo "NOTICE: THIS MAY BE UNSAFE. ONLY RUN THIS IN A TEST ENVIRONMENT!"
echo "Press enter twice to continue or CTRL-C to abort."
read
read
set -x
# append disks into a linear target
linear_add()
{
name=$1
shift
prevsize=0
for vol in "$@"; do
size=`blockdev --getsize $vol`
echo "$prevsize $size linear $vol 0"
prevsize=$size
done \
| dmsetup create $name
echo /dev/mapper/$name
}
lvthin_add()
{
id=$1
lvcreate -An -V $LVSIZE -n $LVNAME$id --thinpool pool0 $VGNAME >&2
echo /dev/$VGNAME/$LVNAME$id
}
lvthin_snapshot()
{
origin=$1
id=$2
lvcreate -An -s $VGNAME/$LVNAME$origin -n $LVNAME$id >&2
echo /dev/$VGNAME/$LVNAME$id
}
fio()
{
dev=$1
/bin/fio --name=$dev --rw=randrw --direct=1 --bs=512 --numjobs=1 --filename=$dev --time_based --runtime=$FIOTIME --ioengine=libaio --iodepth=1 &> /dev/null
}
do_reset()
{
killall -9 fio
lvchange -an $VGNAME
rmdir /dev/$VGNAME
dmsetup remove pv
dmsetup remove zram0
dmsetup remove zram1
lvmdevices --deldev /dev/mapper/pv
losetup -d /dev/loop?
rmmod zram
rmmod null_blk
echo ==== reset done
sleep 1
}
do_init()
{
modprobe null_blk gb=30000 bs=4096
modprobe zram num_devices=2
echo 1g > /sys/block/zram0/disksize
echo 1g > /sys/block/zram1/disksize
# Create a writable header for the PV meta:
dd if=/dev/zero bs=1M count=16 oflag=direct of=/tmp/pvheader
loop=`losetup -f --show /tmp/pvheader`
linear_add pv $loop /dev/nullb0
# Create the VG
sleep 1
#linear_add zram0 /dev/zram0
#linear_add zram1 /dev/zram1
lvmdevices --adddev /dev/zram0
lvmdevices --adddev /dev/zram1
lvmdevices --adddev /dev/mapper/pv
vgcreate $VGNAME /dev/mapper/pv /dev/zram[01]
# Create the pool:
lvcreate -n pool0 -L 1T $VGNAME /dev/mapper/pv
lvcreate --type raid1 --raidintegrity y -m1 -L 512m -n meta0 $VGNAME /dev/zram[01]
lvconvert -y --force --force --chunksize 64k --type thin-pool --poolmetadata $VGNAME/meta0 $VGNAME/pool0
}
while true; do
do_reset
do_init
thin1=`lvthin_add 1`
fio $thin1 &
thin2=`lvthin_snapshot 1 2`
fio $thin2 &
wait
done
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-01-11 20:51 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-29 1:55 Stuck IOs with dm-integrity + md raid1 + dm-thin Eric Wheeler
[not found] ` <CAJ0trDaipa_80xqVpDkJdm-SSCaN7g_O-m9-StG10P_DUxmtEA@mail.gmail.com>
[not found] ` <CAJ0trDbP4BFjb0KQ+t4Lh1JdVjP9sbMiwJEOXPkGEV_R95rFoQ@mail.gmail.com>
2023-11-29 19:51 ` Eric Wheeler
[not found] ` <CAJ0trDaRQaDicO9ho8Hokj68vEopaJwM8YKwPEjzYkWBXE0dWg@mail.gmail.com>
2023-11-30 19:34 ` Eric Wheeler
2023-11-30 21:08 ` Eric Wheeler
2023-12-01 0:40 ` Eric Wheeler
2023-12-01 7:37 ` Joe Thornber
2023-12-01 18:26 ` David Teigland
2023-12-02 2:18 ` Eric Wheeler
2023-12-15 23:26 ` Eric Wheeler
2024-01-11 12:34 ` Xiao Ni
2024-01-11 20:44 ` Eric Wheeler
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.