Re: General Protection Fault in md raid10

From: Colgate Minuette <rabbit@minuette.net>
To: Paul E Luse <paul.e.luse@linux.intel.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: General Protection Fault in md raid10
Date: Sun, 28 Apr 2024 15:16:27 -0700
Message-ID: <2322142.ElGaqSPkdT@sparkler>
In-Reply-To: <20240427112219.1bf00101@peluse-desk5>

On Saturday, April 27, 2024 11:22:19 AM PDT Paul E Luse wrote:
> On Sun, 28 Apr 2024 13:07:49 -0700
> 
> Colgate Minuette <rabbit@minuette.net> wrote:
> > On Saturday, April 27, 2024 9:21:19 AM PDT Paul E Luse wrote:
> > > On Sun, 28 Apr 2024 12:41:13 -0700
> > > 
> > > Colgate Minuette <rabbit@minuette.net> wrote:
> > > > Hello all,
> > > > 
> > > > I am trying to set up an md raid-10 array spanning 8 disks using
> > > > the following command
> > > > 
> > > > >mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> > > > 
> > > > The raid is created successfully, but the moment that the newly
> > > > created raid starts initial sync, a general protection fault is
> > > > issued. This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5
> > > > using mdadm version 4.3. The raid is then completely unusable.
> > > > After the fault, if I try to stop the raid using
> > > > 
> > > > >mdadm --stop /dev/md64
> > > > 
> > > > mdadm hangs indefinitely.
> > > > 
> > > > I have tried raid levels 0 and 6, and both work as expected
> > > > without any errors on these same 8 drives. I also have a working
> > > > md raid-10 on the system already with 4 disks (not related to this
> > > > 8 disk array).
> > > > 
> > > > Other things I have tried include trying to create/sync the raid
> > > > from a debian live environment, and using near/far/offset
> > > > layouts, but both methods came back with the same protection
> > > > fault. Also ran a memory test on the computer, but did not have
> > > > any errors after 10 passes.
> > > > 
> > > > Below is the output from the general protection fault. Let me
> > > > know of anything else to try or log information that would be
> > > > helpful to diagnose.
> > > > 
> > > > [   10.965542] md64: detected capacity change from 0 to 120021483520
> > > > [   10.965593] md: resync of RAID array md64
> > > > [   10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > > > [   11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > > > [   11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > > > [   11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > > > [   11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> > > > [   11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> > > > [   11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> > > > [   11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> > > > [   11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> > > > [   11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> > > > [   11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> > > > [   11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
> > > > [   11.004126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [   11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> > > > [   11.004737] PKRU: 55555554
> > > > [   11.005040] Call Trace:
> > > > [   11.005342]  <TASK>
> > > > [   11.005645]  ? __die_body.cold+0x1a/0x1f
> > > > [   11.005951]  ? die_addr+0x3c/0x60
> > > > [   11.006256]  ? exc_general_protection+0x1c1/0x380
> > > > [   11.006562]  ? asm_exc_general_protection+0x26/0x30
> > > > [   11.006865]  ? bio_copy_data_iter+0x187/0x260
> > > > [   11.007169]  bio_copy_data+0x5c/0x80
> > > > [   11.007474]  raid10d+0xcad/0x1c00 [raid10 1721e6c9d579361bf112b0ce400eec9240452da1]
> > > > [   11.007788]  ? srso_alias_return_thunk+0x5/0x7f
> > > > [   11.008099]  ? srso_alias_return_thunk+0x5/0x7f
> > > > [   11.008408]  ? prepare_to_wait_event+0x60/0x180
> > > > [   11.008720]  ? unregister_md_personality+0x70/0x70 [md_mod 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > > > [   11.009039]  md_thread+0xab/0x190 [md_mod 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > > > [   11.009359]  ? sched_energy_aware_handler+0xb0/0xb0
> > > > [   11.009681]  kthread+0xdb/0x110
> > > > [   11.009996]  ? kthread_complete_and_exit+0x20/0x20
> > > > [   11.010319]  ret_from_fork+0x1f/0x30
> > > > [   11.010325]  </TASK>
> > > > [   11.010326] Modules linked in: platform_profile libarc4
> > > > snd_hda_core snd_hwdep i8042 realtek kvm cfg80211 snd_pcm
> > > > sp5100_tco mdio_devres serio snd_timer raid10 irqbypass wmi_bmof
> > > > pcspkr k10temp i2c_piix4 rapl rfkill libphy snd soundcore md_mod
> > > > gpio_amdpt acpi_cpufreq gpio_generic mac_hid uinput i2c_dev sg
> > > > crypto_user fuse loop nfnetlink bpf_preload ip_tables x_tables
> > > > ext4 crc32c_generic crc16 mbcache jbd2 usbhid dm_crypt cbc
> > > > encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul
> > > > crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
> > > > gf128mul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3
> > > > nvme aesni_intel crypto_simd mpt3sas nvme_core cryptd ccp
> > > > nvme_common xhci_pci raid_class xhci_pci_renesas
> > > > scsi_transport_sas amdgpu drm_ttm_helper ttm video wmi gpu_sched
> > > > drm_buddy drm_display_helper cec
> > > > [   11.012188] ---[ end trace 0000000000000000 ]---
> > > 
> > > I wish I had some ideas for you; I'm sure others will soon.  Two
> > > quick questions though:
> > > 
> > > 1) what is the manuf/model of the 8 drives?
> > > 2) have you tried creating a 4 disk RAID10 out of those drives?
> > > (just curious since you have a 4 disk RAID10 working there)
> > > 
> > > -Paul
> > 
> > 1. Samsung MZILS15THMLS-0G5, "1633a"
> > 2. I tried making a 4 disk and a 3 disk RAID10, both immediately had
> > the same protection fault upon initial sync.
> > 
> > -Colgate
> 
> So just to test real quick I have PM 1743 here (NVMe not SAS) and tried
> a quick 4 disk RAID10 on 6.9.0.rc2+ and although it worked (created and
> did some dd writes) I did get this in dmesg. Anything in any of your
> logs?
> 
> Is it safe to say that you tried other disks as well? I realize
> these disks work with other RAID levels, just trying to help complete
> the triage info for others, I'm still learning to debug mdraid :)
> 
> [   86.703241] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [   86.703251] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
> [   86.703254] {1}[Hardware Error]: event severity: corrected
> [   86.703257] {1}[Hardware Error]:  Error 0, type: corrected
> [   86.703261] {1}[Hardware Error]:   section_type: PCIe error
> [   86.703263] {1}[Hardware Error]:   port_type: 0, PCIe end point
> [   86.703265] {1}[Hardware Error]:   version: 3.0
> [   86.703267] {1}[Hardware Error]:   command: 0x0546, status: 0x0011
> [   86.703271] {1}[Hardware Error]:   device_id: 0000:cf:00.0
> [   86.703275] {1}[Hardware Error]:   slot: 0
> [   86.703277] {1}[Hardware Error]:   secondary_bus: 0x00
> [   86.703279] {1}[Hardware Error]:   vendor_id: 0x144d, device_id: 0xa826
> [   86.703282] {1}[Hardware Error]:   class_code: 010802
> 
> 
> -Paul

I'm not seeing any log entries similar to that, or any other errors in dmesg/
journalctl besides the protection fault.

I just tried RAID10 on the same HBA/cables with 4 Seagate 4TB SAS HDDs, and it
is functioning correctly. Syncing correctly and able to write/read from the md 
device.

-Colgate
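
For readers following the trace: the "non-canonical address" reported in
the fault can be checked by hand against the register dump. The sketch
below is only an illustration, assuming 48-bit virtual addresses (4-level
paging) on x86-64; the helper name is_canonical is hypothetical and not
part of any kernel tooling. It tests whether bits 63..47 of an address are
all equal, which is what the CPU requires before it will dereference a
pointer.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit address is canonical for va_bits-wide
    virtual addresses, i.e. bits 63..(va_bits-1) are all 0 or all 1.

    va_bits=48 is an assumption (4-level paging), not read from the log.
    """
    top = addr >> (va_bits - 1)                 # keep bits 63..47
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits for va_bits=48
    return top == 0 or top == all_ones


# Values copied from the register dump quoted in this thread.
rsi = 0x000D071E7FFF89BE   # source pointer at the faulting instruction
rcx = 0xFFFF89BE8656A000   # destination pointer

print("RSI canonical:", is_canonical(rsi))
print("RCX canonical:", is_canonical(rcx))
```

Under that assumption RSI fails the check while RCX passes, which is
consistent with the fault address 0xd071e7fff89be in the log being the
value held in RSI.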
