From: Colgate Minuette <rabbit@minuette.net>
To: linux-raid@vger.kernel.org, Yu Kuai <yukuai1@huaweicloud.com>,
	Yu Kuai <yukuai3@huawei.com>
Cc: "yangerkun@huawei.com" <yangerkun@huawei.com>
Subject: Re: General Protection Fault in md raid10
Date: Mon, 29 Apr 2024 00:06:18 -0700
Message-ID: <12425339.O9o76ZdvQC@sparkler>
In-Reply-To: <2932875.e9J7NaK4W3@sparkler>

On Sunday, April 28, 2024 11:39:21 PM PDT Colgate Minuette wrote:
> On Sunday, April 28, 2024 11:06:51 PM PDT Yu Kuai wrote:
> > Hi,
> > 
> > On 2024/04/29 12:30, Colgate Minuette wrote:
> > > On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> > >> Hi,
> > >> 
> > >> On 2024/04/29 10:18, Colgate Minuette wrote:
> > >>> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> > >>>> Hi,
> > >>>> 
> > >>>> On 2024/04/29 3:41, Colgate Minuette wrote:
> > >>>>> Hello all,
> > >>>>> 
> > >>>>> I am trying to set up an md raid-10 array spanning 8 disks using the
> > >>>>> following command:
> > >>>>> 
> > >>>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> > >>>>> 
> > >>>>> The raid is created successfully, but the moment the newly created
> > >>>>> raid starts its initial sync, a general protection fault is raised.
> > >>>>> This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm
> > >>>>> version 4.3. The raid is then completely unusable. After the fault,
> > >>>>> if I try to stop the raid using
> > >>>>> 
> > >>>>>> mdadm --stop /dev/md64
> > >>>>> 
> > >>>>> mdadm hangs indefinitely.
> > >>>>> 
> > >>>>> I have tried raid levels 0 and 6, and both work as expected without
> > >>>>> any errors on these same 8 drives. I also have a working md raid-10
> > >>>>> on the system already with 4 disks (not related to this 8-disk array).
> > >>>>> 
> > >>>>> Other things I have tried include creating/syncing the raid from a
> > >>>>> Debian live environment, and using near/far/offset layouts, but both
> > >>>>> approaches came back with the same protection fault. I also ran a
> > >>>>> memory test on the computer, which showed no errors after 10 passes.
> > >>>>> 
> > >>>>> Below is the output from the general protection fault. Let me know
> > >>>>> of anything else to try, or any log information that would help
> > >>>>> diagnose this.
> > >>>>> 
> > >>>>> [   10.965542] md64: detected capacity change from 0 to 120021483520
> > >>>>> [   10.965593] md: resync of RAID array md64
> > >>>>> [   10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > >>>>> [   11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > >>>>> [   11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > >>>>> [   11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > >>>>> [   11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> > >>>>> [   11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> > >>>>> [   11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> > >>>>> [   11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> > >>>>> [   11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> > >>>>> [   11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> > >>>>> [   11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> > >>>>> [   11.003825] FS:  0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:0000000000000000
> > >>>>> [   11.004126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>>>> [   11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> > >>>>> [   11.004737] PKRU: 55555554
> > >>>>> [   11.005040] Call Trace:
> > >>>>> [   11.005342]  <TASK>
> > >>>>> [   11.005645]  ? __die_body.cold+0x1a/0x1f
> > >>>>> [   11.005951]  ? die_addr+0x3c/0x60
> > >>>>> [   11.006256]  ? exc_general_protection+0x1c1/0x380
> > >>>>> [   11.006562]  ? asm_exc_general_protection+0x26/0x30
> > >>>>> [   11.006865]  ? bio_copy_data_iter+0x187/0x260
> > >>>>> [   11.007169]  bio_copy_data+0x5c/0x80
> > >>>>> [   11.007474]  raid10d+0xcad/0x1c00 [raid10 1721e6c9d579361bf112b0ce400eec9240452da1]
> > >>>> 
> > >>>> Can you try to use addr2line or gdb to locate which line of code
> > >>>> this corresponds to?
> > >>>> 
> > >>>> I have never seen a problem like this before... And it would be great
> > >>>> if you could bisect this, since you can reproduce the problem easily.
> > >>>> 
> > >>>> Thanks,
> > >>>> Kuai
> > >>> 
> > >>> Can you provide guidance on how to do this? I haven't ever debugged
> > >>> kernel code before. I'm assuming this would be in the raid10.ko module,
> > >>> but I don't know where to go from there.
> > >> 
> > >> For addr2line, you can run gdb on raid10.ko, then:
> > >> 
> > >> list *(raid10d+0xcad)
> > >> 
> > >> and gdb vmlinux:
> > >> 
> > >> list *(bio_copy_data_iter+0x187)
> > >> 
> > >> For git bisect, you must first find a good kernel version, then:
> > >> 
> > >> git bisect start
> > >> git bisect bad v6.1
> > >> git bisect good xxx
> > >> 
> > >> Git will then show you how many steps are needed and choose a commit
> > >> for you. After you compile and test that kernel, run one of:
> > >> 
> > >> git bisect good/bad
> > >> 
> > >> Git will then continue the bisection based on your test result, and in
> > >> the end you will get the commit to blame.
> > >> 
> > >> Thanks,
> > >> Kuai
> > > 
> > > I don't know of any kernel that is working for this; every setup I've
> > > tried has had the same issue.
> > 
> > This is really weird. Is this the first time you have ever used raid10?
> > Did you try an older kernel like v5.10 or v4.19?
> 
> I have been using md raid10 on this system for about 10 years with a
> different set of disks with no issues. The other raid10 that is in place
> uses SATA drives, but I have also created and tested a raid10 with different
> SAS drives on this system, and had no issues with that test.
> 
> These Samsung SSDs are a new addition to the system. I'll try the raid10 on
> 4.19.307 and 5.10.211 as well, since those are in my distro's repos.
> 
> -Colgate
> 

Following up, the raid10 with the Samsung drives builds and starts syncing 
correctly on 4.19.307, 5.10.211, and 5.15.154, but does not work on 6.1.85 or 
newer.
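
With a working range now known, the two steps suggested earlier in the thread
(the gdb line lookup and the git bisect) could be sketched roughly as below.
This is only a sketch under assumptions: the helper names trace_to_gdb and
bisect_start are made up for illustration, ~/src/linux stands in for wherever
a kernel git tree is cloned, raid10.ko and vmlinux need to be built with debug
info, and it assumes the mainline tags v5.15 and v6.1 behave like the
5.15.154 and 6.1.85 kernels tested above.

```shell
#!/bin/sh
# Sketch only: helper names, paths and default tags are assumptions.

# Turn a call-trace entry such as "raid10d+0xcad/0x1c00" into the gdb
# command suggested earlier in the thread: "list *(raid10d+0xcad)".
trace_to_gdb() {
    printf 'list *(%s)\n' "${1%%/*}"
}

# Start a bisection in the kernel tree at $1 between a good and a bad tag.
# Defaults assume mainline v5.15 is good and mainline v6.1 is bad.
bisect_start() {
    tree=$1 good=${2:-v5.15} bad=${3:-v6.1}
    git -C "$tree" bisect start &&
    git -C "$tree" bisect bad "$bad" &&
    git -C "$tree" bisect good "$good"
}

# Example use:
#   gdb -batch -ex "$(trace_to_gdb 'raid10d+0xcad/0x1c00')" raid10.ko
#   gdb -batch -ex "$(trace_to_gdb 'bio_copy_data_iter+0x187/0x260')" vmlinux
#   bisect_start ~/src/linux
#   # build and boot the chosen commit, create the array, then report:
#   #   git bisect good     (array syncs)   or   git bisect bad   (GPF)
#   # when finished: git bisect reset
```

Each bisect step means building and booting the commit git checks out, then
re-running the mdadm --create command from the original report to see whether
the resync faults.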

-Colgate


