From: Colgate Minuette <rabbit@minuette.net>
To: linux-raid@vger.kernel.org, Yu Kuai <yukuai1@huaweicloud.com>,
Yu Kuai <yukuai3@huawei.com>
Cc: "yangerkun@huawei.com" <yangerkun@huawei.com>
Subject: Re: General Protection Fault in md raid10
Date: Mon, 29 Apr 2024 00:06:18 -0700
Message-ID: <12425339.O9o76ZdvQC@sparkler>
In-Reply-To: <2932875.e9J7NaK4W3@sparkler>
On Sunday, April 28, 2024 11:39:21 PM PDT Colgate Minuette wrote:
> On Sunday, April 28, 2024 11:06:51 PM PDT Yu Kuai wrote:
> > Hi,
> >
> > On 2024/04/29 12:30, Colgate Minuette wrote:
> > > On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> > >> Hi,
> > >>
> > >> On 2024/04/29 10:18, Colgate Minuette wrote:
> > >>> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 2024/04/29 3:41, Colgate Minuette wrote:
> > >>>>> Hello all,
> > >>>>>
> > >>>>> I am trying to set up an md raid-10 array spanning 8 disks using the
> > >>>>> following command:
> > >>>>>
> > >>>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> > >>>>>
> > >>>>> The raid is created successfully, but the moment the newly created
> > >>>>> raid starts its initial sync, a general protection fault is issued.
> > >>>>> This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm
> > >>>>> version 4.3. The raid is then completely unusable. After the fault,
> > >>>>> if I try to stop the raid using
> > >>>>>
> > >>>>>> mdadm --stop /dev/md64
> > >>>>>
> > >>>>> mdadm hangs indefinitely.
> > >>>>>
> > >>>>> I have tried raid levels 0 and 6, and both work as expected without
> > >>>>> any errors on these same 8 drives. I also have a working md raid-10
> > >>>>> with 4 disks already on the system (not related to this 8-disk array).
> > >>>>>
> > >>>>> Other things I have tried include creating/syncing the raid from a
> > >>>>> Debian live environment and using the near/far/offset layouts, but
> > >>>>> all of these attempts ended in the same protection fault. I also ran
> > >>>>> a memory test on the computer, which reported no errors after 10
> > >>>>> passes.
> > >>>>>
> > >>>>> Below is the output from the general protection fault. Let me know
> > >>>>> if there is anything else to try, or any log information that would
> > >>>>> help diagnose this.
> > >>>>>
> > >>>>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> > >>>>> [ 10.965593] md: resync of RAID array md64
> > >>>>> [ 10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > >>>>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > >>>>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > >>>>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > >>>>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> > >>>>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> > >>>>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> > >>>>> [ 11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> > >>>>> [ 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> > >>>>> [ 11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> > >>>>> [ 11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> > >>>>> [ 11.003825] FS:  0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:0000000000000000
> > >>>>> [ 11.004126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>>>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> > >>>>> [ 11.004737] PKRU: 55555554
> > >>>>> [ 11.005040] Call Trace:
> > >>>>> [ 11.005342] <TASK>
> > >>>>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> > >>>>> [ 11.005951] ? die_addr+0x3c/0x60
> > >>>>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> > >>>>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> > >>>>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> > >>>>> [ 11.007169] bio_copy_data+0x5c/0x80
> > >>>>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10 1721e6c9d579361bf112b0ce400eec9240452da1]
> > >>>>
> > >>>> Can you try to use addr2line or gdb to locate which code line
> > >>>> this corresponds to?
> > >>>>
> > >>>> I have never seen a problem like this before... And it would be great
> > >>>> if you could bisect this, since you can reproduce the problem easily.
> > >>>>
> > >>>> Thanks,
> > >>>> Kuai
> > >>>
> > >>> Can you provide guidance on how to do this? I have never debugged
> > >>> kernel code before. I'm assuming this would be in the raid10.ko
> > >>> module, but I don't know where to go from there.
> > >>
> > >> For the source line, you can run gdb on raid10.ko, then:
> > >>
> > >> list *(raid10d+0xcad)
> > >>
> > >> and run gdb on vmlinux:
> > >>
> > >> list *(bio_copy_data_iter+0x187)
> > >>
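> > >> For example, a session might look like this (an illustrative sketch
> > >> only; the module and vmlinux paths below are hypothetical and vary by
> > >> distro, and the kernel debug symbols must be installed):
> > >>
> > >> # hypothetical debuginfo paths, shown only to illustrate the workflow
> > >> gdb /usr/lib/debug/lib/modules/$(uname -r)/kernel/drivers/md/raid10.ko
> > >> (gdb) list *(raid10d+0xcad)
> > >> gdb /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
> > >> (gdb) list *(bio_copy_data_iter+0x187)
> > >>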
> > >> For git bisect, you must find a good kernel version, then:
> > >>
> > >> git bisect start
> > >> git bisect bad v6.1
> > >> git bisect good xxx
> > >>
> > >> Then git will show you how many steps are needed and choose a commit
> > >> for you; after compiling and testing that kernel, run:
> > >>
> > >> git bisect good/bad
> > >>
> > >> Then git will continue the bisection based on your test results, and
> > >> at the end you will get a blamed commit.
> > >>
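> > >> Put together, one full round looks roughly like this (v5.15 as the
> > >> good version is only an example; substitute whatever version you have
> > >> actually verified as good):
> > >>
> > >> git bisect start
> > >> git bisect bad v6.1
> > >> git bisect good v5.15   # example; use a verified-good version
> > >> # build, install, and boot the commit git checks out, test raid10, then:
> > >> git bisect good         # or: git bisect bad, depending on the result
> > >> # repeat until git prints the first bad commit, then clean up:
> > >> git bisect reset
> > >>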
> > >> Thanks,
> > >> Kuai
> > >
> > > I don't know of any kernel version that works for this; every setup
> > > I've tried has had the same issue.
> >
> > This is really weird. Is this the first time you have ever used raid10?
> > Did you try some older kernel, like v5.10 or v4.19?
>
> I have been using md raid10 on this system for about 10 years with a
> different set of disks, with no issues. The existing raid10 uses SATA
> drives, but I have also created and tested a raid10 with different SAS
> drives on this system and had no issues with that test.
>
> These Samsung SSDs are a new addition to the system. I'll try the raid10 on
> 4.19.307 and 5.10.211 as well, since those are in my distro's repos.
>
> -Colgate
>
Following up, the raid10 with the Samsung drives builds and starts syncing
correctly on 4.19.307, 5.10.211, and 5.15.154, but does not work on 6.1.85 or
newer.
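Given that, I assume the next step is the bisect you described, now between
those two releases (treating mainline v5.15 as good because 5.15.154 works,
though a fix backported to stable could in principle invalidate that
assumption), roughly:

git bisect start
git bisect bad v6.1
git bisect good v5.15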
-Colgate