From: Blazej Kucman <blazej.kucman@linux.intel.com>
To: Dan Moulding <dan@danm.net>
Cc: carlos@fisica.ufpr.br, gregkh@linuxfoundation.org,
junxiao.bi@oracle.com, linux-kernel@vger.kernel.org,
linux-raid@vger.kernel.org, regressions@lists.linux.dev,
song@kernel.org, stable@vger.kernel.org, yukuai1@huaweicloud.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
Date: Tue, 30 Jan 2024 17:26:59 +0100
Message-ID: <20240130172524.0000417b@linux.intel.com>
In-Reply-To: <20240126154610.24755-1-dan@danm.net>
Hi,
On Fri, 26 Jan 2024 08:46:10 -0700
Dan Moulding <dan@danm.net> wrote:
>
> That's a good suggestion, so I switched it to use XFS. It can still
> reproduce the hang. Sounds like this is probably a different problem
> than the known ext4 one.
>
Our daily tests covering mdadm/md also detected a problem with symptoms
identical to those described in this thread.
The issue was detected with IMSM metadata, but it also reproduces with
native metadata.
NVMe disks behind a VMD controller were used.
Scenario:
1. Create raid10:
mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128 \
      --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1 \
      --size=7864320 --run
2. Create a filesystem:
mkfs.ext4 /dev/md/r10d4s128-15_A
3. Mark one raid member as faulty:
mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
4. Stop the raid devices:
mdadm -Ss
Expected result:
The raid stops without kernel hangs or errors.
Actual result:
The "mdadm -Ss" command hangs, and hung_task reports appear in the
kernel log (excerpt below).
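For anyone reproducing this, the reports below can also be provoked
sooner, and the blocked tasks dumped on demand, via standard kernel
interfaces (a sketch, not part of our automated test flow):

  # Report blocked tasks after 30s instead of the default 120s.
  echo 30 > /proc/sys/kernel/hung_task_timeout_secs
  # Dump all uninterruptible (D-state) tasks to the kernel log now.
  echo w > /proc/sysrq-trigger
  dmesg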
[   62.770472] md: resync of RAID array md127
[  140.893329] md: md127: resync done.
[  204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling device.
               md/raid10:md127: Operation continuing on 3 devices.
[  244.625393] INFO: task kworker/48:1:755 blocked for more than 30 seconds.
[  244.632294]       Tainted: G S          6.8.0-rc1-20240129.intel.13479453+ #1
[  244.640157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.648105] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755   ppid:2      flags:0x00004000
[  244.657552] Workqueue: md_misc md_start_sync [md_mod]
[  244.662688] Call Trace:
[  244.665176]  <TASK>
[  244.667316]  __schedule+0x2f0/0x9c0
[  244.670868]  ? sched_clock+0x10/0x20
[  244.674510]  schedule+0x28/0x90
[  244.677703]  mddev_suspend+0x11d/0x1e0 [md_mod]
[  244.682313]  ? __update_idle_core+0x29/0xc0
[  244.686574]  ? swake_up_all+0xe0/0xe0
[  244.690302]  md_start_sync+0x3c/0x280 [md_mod]
[  244.694825]  process_scheduled_works+0x87/0x320
[  244.699427]  worker_thread+0x147/0x2a0
[  244.703237]  ? rescuer_thread+0x2d0/0x2d0
[  244.707313]  kthread+0xe5/0x120
[  244.710504]  ? kthread_complete_and_exit+0x20/0x20
[  244.715370]  ret_from_fork+0x31/0x40
[  244.719007]  ? kthread_complete_and_exit+0x20/0x20
[  244.723879]  ret_from_fork_asm+0x11/0x20
[  244.727872]  </TASK>
[  244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
[  244.736486]       Tainted: G S          6.8.0-rc1-20240129.intel.13479453+ #1
[  244.744345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.752293] task:mdadm           state:D stack:13512 pid:8457  tgid:8457  ppid:8276  flags:0x00000000
[  244.761736] Call Trace:
[  244.764241]  <TASK>
[  244.766389]  __schedule+0x2f0/0x9c0
[  244.773224]  schedule+0x28/0x90
[  244.779690]  stop_sync_thread+0xfa/0x170 [md_mod]
[  244.787737]  ? swake_up_all+0xe0/0xe0
[  244.794705]  do_md_stop+0x51/0x4c0 [md_mod]
[  244.802166]  md_ioctl+0x59d/0x10a0 [md_mod]
[  244.809567]  blkdev_ioctl+0x1bb/0x270
[  244.816417]  __x64_sys_ioctl+0x7a/0xb0
[  244.823720]  do_syscall_64+0x4e/0x110
[  244.830481]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[  244.838700] RIP: 0033:0x7f2c540c97cb
[  244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  244.856265] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2c540c97cb
[  244.866659] RDX: 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003
[  244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09: 00007fff4ad6a4c5
[  244.887382] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff4ad6a9c0
[  244.897723] R13: 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780
[  244.908018]  </TASK>
[  275.345375] INFO: task kworker/48:1:755 blocked for more than 60 seconds.
[  275.355363]       Tainted: G S          6.8.0-rc1-20240129.intel.13479453+ #1
[  275.366306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  275.377334] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755   ppid:2      flags:0x00004000
[  275.389863] Workqueue: md_misc md_start_sync [md_mod]
[  275.398102] Call Trace:
[  275.403673]  <TASK>
The issue also reproduces with an XFS filesystem; it does not reproduce
when there is no filesystem on the RAID.
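For convenience, the whole scenario in script form (a sketch; the
device names and array name are from our setup and must be adjusted
for the local system):

  #!/bin/sh
  set -x
  # 1. Create the raid10 array.
  mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128 \
        --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1 \
        --size=7864320 --run
  # In our runs the initial resync had completed before the next step;
  # wait for it explicitly here.
  mdadm --wait /dev/md/r10d4s128-15_A
  # 2. Create a filesystem (mkfs.xfs reproduces the hang as well).
  mkfs.ext4 /dev/md/r10d4s128-15_A
  # 3. Mark one raid member as faulty.
  mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
  # 4. Stop the raid devices; this is the command that hangs.
  mdadm -Ss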
Repository used for testing:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
Branch: master
Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
size check of V1 server-list header")
I see one merge commit touching md after the above one:
01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
git://git.kernel.dk/linux")
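In case it is useful, the window could be narrowed with a standard
bisect between those two commits (a sketch; it assumes the hang is
present at that merge and that the reproduction above is run as the
test at each step):

  # Bisect between the last known-good build and the md-touching merge.
  git bisect start
  git bisect good acc657692aed   # last working build
  git bisect bad 01d550f0fcc0    # only merge touching md after the good commit
  # At each step: build and boot the kernel, run the reproduction,
  # then mark the result with "git bisect good" or "git bisect bad".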
I hope these additional logs will help find the cause.
Thanks,
Blazej