public inbox for linux-raid@vger.kernel.org
From: Yu Kuai <yukuai1@huaweicloud.com>
To: Mikulas Patocka <mpatocka@redhat.com>, Song Liu <song@kernel.org>,
	David Jeffery <djeffery@redhat.com>, Li Nan <linan122@huawei.com>
Cc: dm-devel@lists.linux.dev, linux-raid@vger.kernel.org,
	Mike Snitzer <msnitzer@redhat.com>,
	Heinz Mauelshagen <heinzm@redhat.com>,
	Benjamin Marzinski <bmarzins@redhat.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH 0/7] MD fixes for the LVM2 testsuite
Date: Sat, 27 Jan 2024 15:57:59 +0800
Message-ID: <e52bbc20-353d-6984-e2cb-662d4676b99a@huaweicloud.com>
In-Reply-To: <e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com>

Hi, Mikulas

On 2024/01/18 2:16, Mikulas Patocka wrote:
> Hi
> 
> Here I'm sending MD patches that fix the LVM2 testsuite for the kernels
> 6.7 and 6.8. The testsuite was broken in the 6.6 -> 6.7 window, there are
> multiple tests that deadlock.
> 
> I fixed some of the bugs. And I reverted some patches that claim to be
> fixing bugs but they break the testsuite.
> 
> I'd like to ask you - please, next time when you are going to commit
> something into the MD subsystem, download the LVM2 package from
> git://sourceware.org/git/lvm2.git and run the testsuite ("./configure &&
> make && make check") to make sure that your bugfix doesn't introduce
> another bug. You can run a specific test with the "T" parameter - for
> example "make check T=shell/integrity-caching.sh"

I tried to find the broken tests by myself, but I have to ask now. While
verifying my fixes[1], I ran into problems with tests other than the ones
you mentioned in this patchset:

shell/integrity-caching.sh
shell/lvconvert-raid-reshape-linear_to_raid6-single-type.sh
shell/lvconvert-raid-reshape.sh

I verified in my VM that, before my fixes, they hang or fail easily, and
that they are indeed related to the recent md/raid changes. However, I
still have some problems on which I can't make progress for now.
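
A minimal sketch of how the three tests above can be rerun in a loop to
catch intermittent failures. It assumes it is started from the top of an
already configured and built LVM2 tree, and that "make check" exits
non-zero when a test fails. RUNS and DRY_RUN are hypothetical variables of
this sketch, not options of the testsuite; by default it only prints the
commands it would run.

```shell
#!/bin/sh
# Sketch only: rerun the tests named in this thread several times.
# RUNS and DRY_RUN are hypothetical knobs of this script, not LVM2 options.
RUNS="${RUNS:-3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

TESTS="shell/integrity-caching.sh
shell/lvconvert-raid-reshape-linear_to_raid6-single-type.sh
shell/lvconvert-raid-reshape.sh"

run_tests() {
    for t in $TESTS; do
        i=1
        while [ "$i" -le "$RUNS" ]; do
            if [ "$DRY_RUN" = "1" ]; then
                # Show the command instead of executing it.
                echo "make check T=$t"
            elif ! make check T="$t"; then
                # Assumption: a failed test makes "make check" return non-zero.
                echo "FAILED: $t (run $i)"
            fi
            i=$((i + 1))
        done
    done
}

run_tests
```

Setting DRY_RUN=0 runs the tests for real. A test that deadlocks will
block the loop, so a real run probably needs an external timeout around
each "make check" invocation.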

Are there other tests that you know are broken in the v6.6->v6.7 window?
And are there known broken tests in v6.6?

I ask because I found that some tests fail occasionally in v6.8-rc1 with
my fixes, and they fail in v6.6 as well. For example:

shell/lvchange-raid1-writemostly.sh
## ERROR: The test started dmeventd (2064) unexpectedly.

shell/select-report.sh
    >>> NUMBER OF ITEMS EXPECTED: 6 vol1 vol2 abc abc orig xyz
#select-report.sh:67+ echo '  >>> NUMBER OF ITEMS FOUND: 7 (  vol1 vol2 abc abc orig snap xyz )'

I also hit the following BUG during testing:

[12504.959682] BUG bio-296 (Not tainted): Object already free
[12504.960239] -----------------------------------------------------------------------------
[12504.960239]
[12504.961209] Allocated in mempool_alloc+0xe8/0x270 age=30 cpu=1 pid=203288
[12504.961905]  kmem_cache_alloc+0x36a/0x3b0
[12504.962324]  mempool_alloc+0xe8/0x270
[12504.962712]  bio_alloc_bioset+0x3b5/0x920
[12504.963129]  bio_alloc_clone+0x3e/0x160
[12504.963533]  alloc_io+0x3d/0x1f0
[12504.963876]  dm_submit_bio+0x12f/0xa30
[12504.964267]  __submit_bio+0x9c/0xe0
[12504.964639]  submit_bio_noacct_nocheck+0x25a/0x570
[12504.965136]  submit_bio_wait+0xc2/0x160
[12504.965535]  blkdev_issue_zeroout+0x19b/0x2e0
[12504.965991]  ext4_init_inode_table+0x246/0x560
[12504.966462]  ext4_lazyinit_thread+0x750/0xbe0
[12504.966922]  kthread+0x1b4/0x1f0

And a lockdep warning:

[ 1229.452306] ============================================
[ 1229.452838] WARNING: possible recursive locking detected
[ 1229.453344] 6.8.0-rc1+ #941 Not tainted
[ 1229.453711] --------------------------------------------
[ 1229.454242] lvm/18080 is trying to acquire lock:
[ 1229.454687] ffff888112abc1d0 (&pmd->root_lock){++++}-{3:3}, at: dm_thin_find_block+0x9f/0x0
[ 1229.455543]
[ 1229.455543] but task is already holding lock:
[ 1229.456122] ffff8881058bf1d0 (&pmd->root_lock){++++}-{3:3}, at: dm_pool_commit_metadata+0x0
[ 1229.456992]
[ 1229.456992] other info that might help us debug this:
[ 1229.457628]  Possible unsafe locking scenario:
[ 1229.457628]
[ 1229.458218]        CPU0
[ 1229.458469]        ----
[ 1229.458726]   lock(&pmd->root_lock);
[ 1229.459093]   lock(&pmd->root_lock);
[ 1229.459455]
[ 1229.459455]  *** DEADLOCK ***
[ 1229.459455]
[ 1229.460045]  May be due to missing lock nesting notation
[ 1229.460045]
[ 1229.460697] 3 locks held by lvm/18080:
[ 1229.461074]  #0: ffff888153306870 (&md->suspend_lock/1){+.+.}-{3:3}, at: dm_resume+0x24/0x0
[ 1229.461935]  #1: ffff8881058bf1d0 (&pmd->root_lock){++++}-{3:3}, at: dm_pool_commit_metada0
[ 1229.462857]  #2: ffff88810f3abf10 (&md->io_barrier){.+.+}-{0:0}, at: dm_get_live_table+0x50
[ 1229.463731]
[ 1229.463731]
[ 1229.463731] stack backtrace:
[ 1229.464165] CPU: 3 PID: 18080 Comm: lvm Not tainted 6.8.0-rc1+ #941
[ 1229.464780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/04
[ 1229.465618] Call Trace:
[ 1229.465884]  <TASK>
[ 1229.466110]  dump_stack_lvl+0x4a/0x80
[ 1229.466491]  __lock_acquire+0x1ad4/0x3540
[ 1229.467369]  lock_acquire+0x16a/0x400
[ 1229.470276]  down_read+0xa3/0x380
[ 1229.471877]  dm_thin_find_block+0x9f/0x1e0
[ 1229.474901]  thin_map+0x28b/0x5f0
[ 1229.476116]  __map_bio+0x237/0x260
[ 1229.476469]  dm_submit_bio+0x321/0xa30
[ 1229.478546]  __submit_bio+0x9c/0xe0
[ 1229.478913]  submit_bio_noacct_nocheck+0x25a/0x570
[ 1229.480807]  __flush_write_list+0x115/0x1a0
[ 1229.481725]  dm_bufio_write_dirty_buffers+0xb9/0x600
[ 1229.483642]  __commit_transaction+0x2f3/0x4e0
[ 1229.486185]  dm_pool_commit_metadata+0x3c/0x70
[ 1229.486636]  commit+0x8c/0x1b0
[ 1229.487757]  pool_preresume+0x235/0x550
[ 1229.489010]  dm_table_resume_targets+0xa6/0x1b0
[ 1229.489467]  dm_resume+0x120/0x210
[ 1229.489820]  dev_suspend+0x269/0x3e0
[ 1229.490187]  ctl_ioctl+0x447/0x740
[ 1229.492539]  dm_ctl_ioctl+0xe/0x20
[ 1229.492895]  __x64_sys_ioctl+0xc9/0x100
[ 1229.493284]  do_syscall_64+0x7d/0x1a0
[ 1229.493667]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 1229.494171] RIP: 0033:0x7f86667400ab
[ 1229.494537] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 8
[ 1229.496340] RSP: 002b:00007fff312bc1f8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 1229.497084] RAX: ffffffffffffffda RBX: 0000561323f78320 RCX: 00007f86667400ab
[ 1229.497796] RDX: 000056132506a8e0 RSI: 00000000c138fd06 RDI: 0000000000000004
[ 1229.498494] RBP: 000056132402614e R08: 0000000000000000 R09: 00000000000f4240
[ 1229.499195] R10: 0000000000000001 R11: 0000000000000206 R12: 000056132506a990
[ 1229.499901] R13: 0000000000000001 R14: 0000561325060460 R15: 000056132506a8e0
[ 1229.500615]  </TASK>
[ 1230.525934] device-mapper: thin: Data device (dm-8) discard unsupported: Disabling discard.

Currently, I'm not sure if there are still new regressions related to
md/raid changes. Do you have any suggestions? Please let me know what
you think, I really need some help here. :(

BTW, as Xiao Ni replied in the other thread[2], he also tested my fixes
and reported 72 failed tests in total, which is hard for me to believe.
We must dig deeper to find the root cause...

[1] https://lore.kernel.org/all/20240127074754.2380890-1-yukuai1@huaweicloud.com/
[2] https://lore.kernel.org/all/CALTww2_f_orkTXPDtA4AJsbX-UmwhAb-AF_tujH4Gw3cX3ObWg@mail.gmail.com/

Thanks,
Kuai

> 
> Mikulas
> 
> .
> 


