* Re: General Protection Fault in md raid10
2024-04-28 19:41 General Protection Fault in md raid10 Colgate Minuette
@ 2024-04-27 16:21 ` Paul E Luse
2024-04-28 20:07 ` Colgate Minuette
2024-04-29 1:02 ` Yu Kuai
1 sibling, 1 reply; 15+ messages in thread
From: Paul E Luse @ 2024-04-27 16:21 UTC (permalink / raw)
To: Colgate Minuette; +Cc: linux-raid
On Sun, 28 Apr 2024 12:41:13 -0700
Colgate Minuette <rabbit@minuette.net> wrote:
> Hello all,
>
> I am trying to set up an md raid-10 array spanning 8 disks using the
> following command
>
> >mdadm --create /dev/md64 --level=10 --layout=o2 -n 8
> >/dev/sd[efghijkl]1
>
> The raid is created successfully, but the moment that the newly
> created raid starts initial sync, a general protection fault is
> issued. This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using
> mdadm version 4.3. The raid is then completely unusable. After the
> fault, if I try to stop the raid using
>
> >mdadm --stop /dev/md64
>
> mdadm hangs indefinitely.
>
> I have tried raid levels 0 and 6, and both work as expected without
> any errors on these same 8 drives. I also have a working md raid-10
> on the system already with 4 disks(not related to this 8 disk array).
>
> Other things I have tried include trying to create/sync the raid from
> a debian live environment, and using near/far/offset layouts, but
> both methods came back with the same protection fault. Also ran a
> memory test on the computer, but did not have any errors after 10
> passes.
>
> Below is the output from the general protection fault. Let me know of
> anything else to try or log information that would be helpful to
> diagnose.
>
> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> [ 10.965593] md: resync of RAID array md64
> [ 10.999289] general protection fault, probably for non-canonical
> address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
> 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
> X670E-PLUS WIFI, BIOS 1618 05/18/2023
> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1
> 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0
> fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48
> 8d 79 08 [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX:
> ffff89be8656a000 [ 11.002628] RDX: 0000000000000642 RSI:
> 000d071e7fff89be RDI: ffff89beb4039df8 [ 11.002922] RBP:
> ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60 [
> 11.003217] R10: 00000000000009be R11: 0000000000002000 R12:
> ffff89be8bbff400 [ 11.003522] R13: ffff89beb4039a00 R14:
> ffffca0a80000000 R15: 0000000000001000 [ 11.003825] FS:
> 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:
> 0000000000000000 [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033 [ 11.004429] CR2: 0000563308baac38 CR3:
> 000000012e900000 CR4: 0000000000750ee0 [ 11.004737] PKRU: 55555554
> [ 11.005040] Call Trace:
> [ 11.005342] <TASK>
> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> [ 11.005951] ? die_addr+0x3c/0x60
> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> [ 11.007169] bio_copy_data+0x5c/0x80
> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> 1721e6c9d579361bf112b0ce400eec9240452da1]
> [ 11.007788] ? srso_alias_return_thunk+0x5/0x7f
> [ 11.008099] ? srso_alias_return_thunk+0x5/0x7f
> [ 11.008408] ? prepare_to_wait_event+0x60/0x180
> [ 11.008720] ? unregister_md_personality+0x70/0x70 [md_mod
> 64c55bfe07bb9f714eafd175176a02873a443cb7]
> [ 11.009039] md_thread+0xab/0x190 [md_mod
> 64c55bfe07bb9f714eafd175176a02873a443cb7]
> [ 11.009359] ? sched_energy_aware_handler+0xb0/0xb0
> [ 11.009681] kthread+0xdb/0x110
> [ 11.009996] ? kthread_complete_and_exit+0x20/0x20
> [ 11.010319] ret_from_fork+0x1f/0x30
> [ 11.010325] </TASK>
> [ 11.010326] Modules linked in: platform_profile libarc4
> snd_hda_core snd_hwdep i8042 realtek kvm cfg80211 snd_pcm sp5100_tco
> mdio_devres serio snd_timer raid10 irqbypass wmi_bmof pcspkr k10temp
> i2c_piix4 rapl rfkill libphy snd soundcore md_mod gpio_amdpt
> acpi_cpufreq gpio_generic mac_hid uinput i2c_dev sg crypto_user fuse
> loop nfnetlink bpf_preload ip_tables x_tables ext4 crc32c_generic
> crc16 mbcache jbd2 usbhid dm_crypt cbc encrypted_keys trusted
> asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel
> polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel
> sha512_ssse3 sha256_ssse3 sha1_ssse3 nvme aesni_intel crypto_simd
> mpt3sas nvme_core cryptd ccp nvme_common xhci_pci raid_class
> xhci_pci_renesas scsi_transport_sas amdgpu drm_ttm_helper ttm video
> wmi gpu_sched drm_buddy drm_display_helper cec [ 11.012188] ---[
> end trace 0000000000000000 ]---
>
>
>
I wish I had some ideas for you; I'm sure others will soon. Two
quick questions though:
1) what is the manuf/model of the 8 drives?
2) have you tried creating a 4 disk RAID10 out of those drives? (just
curious since you have a 4 disk RAID10 working there)
-Paul
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-28 20:07 ` Colgate Minuette
@ 2024-04-27 18:22 ` Paul E Luse
2024-04-28 22:16 ` Colgate Minuette
0 siblings, 1 reply; 15+ messages in thread
From: Paul E Luse @ 2024-04-27 18:22 UTC (permalink / raw)
To: Colgate Minuette; +Cc: linux-raid
On Sun, 28 Apr 2024 13:07:49 -0700
Colgate Minuette <rabbit@minuette.net> wrote:
> On Saturday, April 27, 2024 9:21:19 AM PDT Paul E Luse wrote:
> > On Sun, 28 Apr 2024 12:41:13 -0700
> >
> > Colgate Minuette <rabbit@minuette.net> wrote:
> > > Hello all,
> > >
> > > I am trying to set up an md raid-10 array spanning 8 disks using
> > > the following command
> > >
> > > >mdadm --create /dev/md64 --level=10 --layout=o2 -n 8
> > > >/dev/sd[efghijkl]1
> > >
> > > The raid is created successfully, but the moment that the newly
> > > created raid starts initial sync, a general protection fault is
> > > issued. This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5
> > > using mdadm version 4.3. The raid is then completely unusable.
> > > After the fault, if I try to stop the raid using
> > >
> > > >mdadm --stop /dev/md64
> > >
> > > mdadm hangs indefinitely.
> > >
> > > I have tried raid levels 0 and 6, and both work as expected
> > > without any errors on these same 8 drives. I also have a working
> > > md raid-10 on the system already with 4 disks(not related to this
> > > 8 disk array).
> > >
> > > Other things I have tried include trying to create/sync the raid
> > > from a debian live environment, and using near/far/offset
> > > layouts, but both methods came back with the same protection
> > > fault. Also ran a memory test on the computer, but did not have
> > > any errors after 10 passes.
> > >
> > > Below is the output from the general protection fault. Let me
> > > know of anything else to try or log information that would be
> > > helpful to diagnose.
> > >
> > > [ 10.965542] md64: detected capacity change from 0 to
> > > 120021483520 [ 10.965593] md: resync of RAID array md64
> > > [ 10.999289] general protection fault, probably for
> > > non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > > [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
> > > 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > > [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
> > > X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > > [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > > [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1
> > > e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f
> > > 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c
> > > 01 f8 48 8d 79 08 [ 11.002045] RSP: 0018:ffffa838124ffd28
> > > EFLAGS: 00010216 [ 11.002336] RAX: ffffca0a84195a80 RBX:
> > > 0000000000000000 RCX: ffff89be8656a000 [ 11.002628] RDX:
> > > 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8 [
> > > 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09:
> > > ffffa838124ffd60 [ 11.003217] R10: 00000000000009be R11:
> > > 0000000000002000 R12: ffff89be8bbff400 [ 11.003522] R13:
> > > ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000 [
> > > 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000)
> > > knlGS: 0000000000000000 [ 11.004126] CS: 0010 DS: 0000 ES:
> > > 0000 CR0: 0000000080050033 [ 11.004429] CR2: 0000563308baac38
> > > CR3: 000000012e900000 CR4: 0000000000750ee0 [ 11.004737] PKRU:
> > > 55555554 [ 11.005040] Call Trace:
> > > [ 11.005342] <TASK>
> > > [ 11.005645] ? __die_body.cold+0x1a/0x1f
> > > [ 11.005951] ? die_addr+0x3c/0x60
> > > [ 11.006256] ? exc_general_protection+0x1c1/0x380
> > > [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> > > [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> > > [ 11.007169] bio_copy_data+0x5c/0x80
> > > [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> > > 1721e6c9d579361bf112b0ce400eec9240452da1]
> > > [ 11.007788] ? srso_alias_return_thunk+0x5/0x7f
> > > [ 11.008099] ? srso_alias_return_thunk+0x5/0x7f
> > > [ 11.008408] ? prepare_to_wait_event+0x60/0x180
> > > [ 11.008720] ? unregister_md_personality+0x70/0x70 [md_mod
> > > 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > > [ 11.009039] md_thread+0xab/0x190 [md_mod
> > > 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > > [ 11.009359] ? sched_energy_aware_handler+0xb0/0xb0
> > > [ 11.009681] kthread+0xdb/0x110
> > > [ 11.009996] ? kthread_complete_and_exit+0x20/0x20
> > > [ 11.010319] ret_from_fork+0x1f/0x30
> > > [ 11.010325] </TASK>
> > > [ 11.010326] Modules linked in: platform_profile libarc4
> > > snd_hda_core snd_hwdep i8042 realtek kvm cfg80211 snd_pcm
> > > sp5100_tco mdio_devres serio snd_timer raid10 irqbypass wmi_bmof
> > > pcspkr k10temp i2c_piix4 rapl rfkill libphy snd soundcore md_mod
> > > gpio_amdpt acpi_cpufreq gpio_generic mac_hid uinput i2c_dev sg
> > > crypto_user fuse loop nfnetlink bpf_preload ip_tables x_tables
> > > ext4 crc32c_generic crc16 mbcache jbd2 usbhid dm_crypt cbc
> > > encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul
> > > crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
> > > gf128mul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3
> > > nvme aesni_intel crypto_simd mpt3sas nvme_core cryptd ccp
> > > nvme_common xhci_pci raid_class xhci_pci_renesas
> > > scsi_transport_sas amdgpu drm_ttm_helper ttm video wmi gpu_sched
> > > drm_buddy drm_display_helper cec [ 11.012188] ---[ end trace
> > > 0000000000000000 ]---
> >
> > I wish I had some ideas for you; I'm sure others will soon. Two
> > quick questions though:
> >
> > 1) what is the manuf/model of the 8 drives?
> > 2) have you tried creating a 4 disk RAID10 out of those drives?
> > (just curious since you have a 4 disk RAID10 working there)
> >
> > -Paul
>
> 1. Samsung MZILS15THMLS-0G5, "1633a"
> 2. I tried making a 4 disk and a 3 disk RAID10, both immediately had
> the same protection fault upon initial sync.
>
> -Colgate
So, just to test real quick: I have PM 1743 drives here (NVMe, not SAS) and
tried a quick 4-disk RAID10 on 6.9.0-rc2+. Although it worked (created the
array and did some dd writes), I did get the messages below in dmesg. Anything
similar in any of your logs?
Is it safe to say that you tried other disks as well? I realize these disks
work with other RAID levels; I'm just trying to help complete the triage info
for others, as I'm still learning to debug mdraid :)
[ 86.703241] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 86.703251] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 86.703254] {1}[Hardware Error]: event severity: corrected
[ 86.703257] {1}[Hardware Error]: Error 0, type: corrected
[ 86.703261] {1}[Hardware Error]: section_type: PCIe error
[ 86.703263] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 86.703265] {1}[Hardware Error]: version: 3.0
[ 86.703267] {1}[Hardware Error]: command: 0x0546, status: 0x0011
[ 86.703271] {1}[Hardware Error]: device_id: 0000:cf:00.0
[ 86.703275] {1}[Hardware Error]: slot: 0
[ 86.703277] {1}[Hardware Error]: secondary_bus: 0x00
[ 86.703279] {1}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa826
[ 86.703282] {1}[Hardware Error]: class_code: 010802
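For anyone who wants to repeat that kind of quick check, a rough sketch with
placeholder device names and sizes (not the exact commands used here):
  # create a throwaway 4-disk RAID10 from spare devices (placeholders)
  mdadm --create /dev/md99 --level=10 -n 4 /dev/nvme[0-3]n1
  # watch the initial resync
  cat /proc/mdstat
  # push some direct writes through the array, then check dmesg again
  dd if=/dev/zero of=/dev/md99 bs=1M count=4096 oflag=direct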
-Paul
>
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* General Protection Fault in md raid10
@ 2024-04-28 19:41 Colgate Minuette
2024-04-27 16:21 ` Paul E Luse
2024-04-29 1:02 ` Yu Kuai
0 siblings, 2 replies; 15+ messages in thread
From: Colgate Minuette @ 2024-04-28 19:41 UTC (permalink / raw)
To: linux-raid
Hello all,
I am trying to set up an md raid-10 array spanning 8 disks using the following
command
>mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
The raid is created successfully, but the moment that the newly created raid
starts initial sync, a general protection fault is issued. This fault happens
on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version 4.3. The raid is then
completely unusable. After the fault, if I try to stop the raid using
>mdadm --stop /dev/md64
mdadm hangs indefinitely.
I have tried raid levels 0 and 6, and both work as expected without any errors
on these same 8 drives. I also have a working md raid-10 on the system already
with 4 disks (not related to this 8-disk array).
Other things I have tried include creating/syncing the raid from a Debian live
environment and using the near/far/offset layouts, but both attempts came back
with the same protection fault. I also ran a memory test on the computer and
did not get any errors after 10 passes.
Below is the output from the general protection fault. Let me know if there is
anything else to try, or any log information that would help diagnose this.
[ 10.965542] md64: detected capacity change from 0 to 120021483520
[ 10.965593] md: resync of RAID array md64
[ 10.999289] general protection fault, probably for non-canonical address
0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
[ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO
#1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
[ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS
WIFI, BIOS 1618 05/18/2023
[ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
[ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1
e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06
48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
[ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
[ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
[ 11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
[ 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
[ 11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
[ 11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
[ 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:
0000000000000000
[ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4:
0000000000750ee0
[ 11.004737] PKRU: 55555554
[ 11.005040] Call Trace:
[ 11.005342] <TASK>
[ 11.005645] ? __die_body.cold+0x1a/0x1f
[ 11.005951] ? die_addr+0x3c/0x60
[ 11.006256] ? exc_general_protection+0x1c1/0x380
[ 11.006562] ? asm_exc_general_protection+0x26/0x30
[ 11.006865] ? bio_copy_data_iter+0x187/0x260
[ 11.007169] bio_copy_data+0x5c/0x80
[ 11.007474] raid10d+0xcad/0x1c00 [raid10
1721e6c9d579361bf112b0ce400eec9240452da1]
[ 11.007788] ? srso_alias_return_thunk+0x5/0x7f
[ 11.008099] ? srso_alias_return_thunk+0x5/0x7f
[ 11.008408] ? prepare_to_wait_event+0x60/0x180
[ 11.008720] ? unregister_md_personality+0x70/0x70 [md_mod
64c55bfe07bb9f714eafd175176a02873a443cb7]
[ 11.009039] md_thread+0xab/0x190 [md_mod
64c55bfe07bb9f714eafd175176a02873a443cb7]
[ 11.009359] ? sched_energy_aware_handler+0xb0/0xb0
[ 11.009681] kthread+0xdb/0x110
[ 11.009996] ? kthread_complete_and_exit+0x20/0x20
[ 11.010319] ret_from_fork+0x1f/0x30
[ 11.010325] </TASK>
[ 11.010326] Modules linked in: platform_profile libarc4 snd_hda_core
snd_hwdep i8042 realtek kvm cfg80211 snd_pcm sp5100_tco mdio_devres serio
snd_timer raid10 irqbypass wmi_bmof pcspkr k10temp i2c_piix4 rapl rfkill
libphy snd soundcore md_mod gpio_amdpt acpi_cpufreq gpio_generic mac_hid
uinput i2c_dev sg crypto_user fuse loop nfnetlink bpf_preload ip_tables
x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid dm_crypt cbc
encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul
crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel
sha512_ssse3 sha256_ssse3 sha1_ssse3 nvme aesni_intel crypto_simd mpt3sas
nvme_core cryptd ccp nvme_common xhci_pci raid_class xhci_pci_renesas
scsi_transport_sas amdgpu drm_ttm_helper ttm video wmi gpu_sched drm_buddy
drm_display_helper cec
[ 11.012188] ---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-27 16:21 ` Paul E Luse
@ 2024-04-28 20:07 ` Colgate Minuette
2024-04-27 18:22 ` Paul E Luse
0 siblings, 1 reply; 15+ messages in thread
From: Colgate Minuette @ 2024-04-28 20:07 UTC (permalink / raw)
To: Paul E Luse; +Cc: linux-raid
On Saturday, April 27, 2024 9:21:19 AM PDT Paul E Luse wrote:
> On Sun, 28 Apr 2024 12:41:13 -0700
>
> Colgate Minuette <rabbit@minuette.net> wrote:
> > Hello all,
> >
> > I am trying to set up an md raid-10 array spanning 8 disks using the
> > following command
> >
> > >mdadm --create /dev/md64 --level=10 --layout=o2 -n 8
> > >/dev/sd[efghijkl]1
> >
> > The raid is created successfully, but the moment that the newly
> > created raid starts initial sync, a general protection fault is
> > issued. This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using
> > mdadm version 4.3. The raid is then completely unusable. After the
> > fault, if I try to stop the raid using
> >
> > >mdadm --stop /dev/md64
> >
> > mdadm hangs indefinitely.
> >
> > I have tried raid levels 0 and 6, and both work as expected without
> > any errors on these same 8 drives. I also have a working md raid-10
> > on the system already with 4 disks(not related to this 8 disk array).
> >
> > Other things I have tried include trying to create/sync the raid from
> > a debian live environment, and using near/far/offset layouts, but
> > both methods came back with the same protection fault. Also ran a
> > memory test on the computer, but did not have any errors after 10
> > passes.
> >
> > Below is the output from the general protection fault. Let me know of
> > anything else to try or log information that would be helpful to
> > diagnose.
> >
> > [ 10.965542] md64: detected capacity change from 0 to 120021483520
> > [ 10.965593] md: resync of RAID array md64
> > [ 10.999289] general protection fault, probably for non-canonical
> > address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
> > 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
> > X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1
> > 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0
> > fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48
> > 8d 79 08 [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> > [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX:
> > ffff89be8656a000 [ 11.002628] RDX: 0000000000000642 RSI:
> > 000d071e7fff89be RDI: ffff89beb4039df8 [ 11.002922] RBP:
> > ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60 [
> > 11.003217] R10: 00000000000009be R11: 0000000000002000 R12:
> > ffff89be8bbff400 [ 11.003522] R13: ffff89beb4039a00 R14:
> > ffffca0a80000000 R15: 0000000000001000 [ 11.003825] FS:
> > 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:
> > 0000000000000000 [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0:
> > 0000000080050033 [ 11.004429] CR2: 0000563308baac38 CR3:
> > 000000012e900000 CR4: 0000000000750ee0 [ 11.004737] PKRU: 55555554
> > [ 11.005040] Call Trace:
> > [ 11.005342] <TASK>
> > [ 11.005645] ? __die_body.cold+0x1a/0x1f
> > [ 11.005951] ? die_addr+0x3c/0x60
> > [ 11.006256] ? exc_general_protection+0x1c1/0x380
> > [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> > [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> > [ 11.007169] bio_copy_data+0x5c/0x80
> > [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> > 1721e6c9d579361bf112b0ce400eec9240452da1]
> > [ 11.007788] ? srso_alias_return_thunk+0x5/0x7f
> > [ 11.008099] ? srso_alias_return_thunk+0x5/0x7f
> > [ 11.008408] ? prepare_to_wait_event+0x60/0x180
> > [ 11.008720] ? unregister_md_personality+0x70/0x70 [md_mod
> > 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > [ 11.009039] md_thread+0xab/0x190 [md_mod
> > 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > [ 11.009359] ? sched_energy_aware_handler+0xb0/0xb0
> > [ 11.009681] kthread+0xdb/0x110
> > [ 11.009996] ? kthread_complete_and_exit+0x20/0x20
> > [ 11.010319] ret_from_fork+0x1f/0x30
> > [ 11.010325] </TASK>
> > [ 11.010326] Modules linked in: platform_profile libarc4
> > snd_hda_core snd_hwdep i8042 realtek kvm cfg80211 snd_pcm sp5100_tco
> > mdio_devres serio snd_timer raid10 irqbypass wmi_bmof pcspkr k10temp
> > i2c_piix4 rapl rfkill libphy snd soundcore md_mod gpio_amdpt
> > acpi_cpufreq gpio_generic mac_hid uinput i2c_dev sg crypto_user fuse
> > loop nfnetlink bpf_preload ip_tables x_tables ext4 crc32c_generic
> > crc16 mbcache jbd2 usbhid dm_crypt cbc encrypted_keys trusted
> > asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel
> > polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel
> > sha512_ssse3 sha256_ssse3 sha1_ssse3 nvme aesni_intel crypto_simd
> > mpt3sas nvme_core cryptd ccp nvme_common xhci_pci raid_class
> > xhci_pci_renesas scsi_transport_sas amdgpu drm_ttm_helper ttm video
> > wmi gpu_sched drm_buddy drm_display_helper cec [ 11.012188] ---[
> > end trace 0000000000000000 ]---
>
> I wish I had some ideas for you; I'm sure others will soon. Two
> quick questions though:
>
> 1) what is the manuf/model of the 8 drives?
> 2) have you tried creating a 4 disk RAID10 out of those drives? (just
> curious since you have a 4 disk RAID10 working there)
>
> -Paul
1. Samsung MZILS15THMLS-0G5, "1633a"
2. I tried making a 4 disk and a 3 disk RAID10, both immediately had the same
protection fault upon initial sync.
-Colgate
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-27 18:22 ` Paul E Luse
@ 2024-04-28 22:16 ` Colgate Minuette
2024-04-28 22:25 ` Roman Mamedov
0 siblings, 1 reply; 15+ messages in thread
From: Colgate Minuette @ 2024-04-28 22:16 UTC (permalink / raw)
To: Paul E Luse; +Cc: linux-raid
On Saturday, April 27, 2024 11:22:19 AM PDT Paul E Luse wrote:
> On Sun, 28 Apr 2024 13:07:49 -0700
>
> Colgate Minuette <rabbit@minuette.net> wrote:
> > On Saturday, April 27, 2024 9:21:19 AM PDT Paul E Luse wrote:
> > > On Sun, 28 Apr 2024 12:41:13 -0700
> > >
> > > Colgate Minuette <rabbit@minuette.net> wrote:
> > > > Hello all,
> > > >
> > > > I am trying to set up an md raid-10 array spanning 8 disks using
> > > > the following command
> > > >
> > > > >mdadm --create /dev/md64 --level=10 --layout=o2 -n 8
> > > > >/dev/sd[efghijkl]1
> > > >
> > > > The raid is created successfully, but the moment that the newly
> > > > created raid starts initial sync, a general protection fault is
> > > > issued. This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5
> > > > using mdadm version 4.3. The raid is then completely unusable.
> > > > After the fault, if I try to stop the raid using
> > > >
> > > > >mdadm --stop /dev/md64
> > > >
> > > > mdadm hangs indefinitely.
> > > >
> > > > I have tried raid levels 0 and 6, and both work as expected
> > > > without any errors on these same 8 drives. I also have a working
> > > > md raid-10 on the system already with 4 disks(not related to this
> > > > 8 disk array).
> > > >
> > > > Other things I have tried include trying to create/sync the raid
> > > > from a debian live environment, and using near/far/offset
> > > > layouts, but both methods came back with the same protection
> > > > fault. Also ran a memory test on the computer, but did not have
> > > > any errors after 10 passes.
> > > >
> > > > Below is the output from the general protection fault. Let me
> > > > know of anything else to try or log information that would be
> > > > helpful to diagnose.
> > > >
> > > > [ 10.965542] md64: detected capacity change from 0 to
> > > > 120021483520 [ 10.965593] md: resync of RAID array md64
> > > > [ 10.999289] general protection fault, probably for
> > > > non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > > > [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
> > > > 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > > > [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
> > > > X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > > > [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > > > [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1
> > > > e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f
> > > > 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c
> > > > 01 f8 48 8d 79 08 [ 11.002045] RSP: 0018:ffffa838124ffd28
> > > > EFLAGS: 00010216 [ 11.002336] RAX: ffffca0a84195a80 RBX:
> > > > 0000000000000000 RCX: ffff89be8656a000 [ 11.002628] RDX:
> > > > 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8 [
> > > > 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09:
> > > > ffffa838124ffd60 [ 11.003217] R10: 00000000000009be R11:
> > > > 0000000000002000 R12: ffff89be8bbff400 [ 11.003522] R13:
> > > > ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000 [
> > > > 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000)
> > > > knlGS: 0000000000000000 [ 11.004126] CS: 0010 DS: 0000 ES:
> > > > 0000 CR0: 0000000080050033 [ 11.004429] CR2: 0000563308baac38
> > > > CR3: 000000012e900000 CR4: 0000000000750ee0 [ 11.004737] PKRU:
> > > > 55555554 [ 11.005040] Call Trace:
> > > > [ 11.005342] <TASK>
> > > > [ 11.005645] ? __die_body.cold+0x1a/0x1f
> > > > [ 11.005951] ? die_addr+0x3c/0x60
> > > > [ 11.006256] ? exc_general_protection+0x1c1/0x380
> > > > [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> > > > [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> > > > [ 11.007169] bio_copy_data+0x5c/0x80
> > > > [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> > > > 1721e6c9d579361bf112b0ce400eec9240452da1]
> > > > [ 11.007788] ? srso_alias_return_thunk+0x5/0x7f
> > > > [ 11.008099] ? srso_alias_return_thunk+0x5/0x7f
> > > > [ 11.008408] ? prepare_to_wait_event+0x60/0x180
> > > > [ 11.008720] ? unregister_md_personality+0x70/0x70 [md_mod
> > > > 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > > > [ 11.009039] md_thread+0xab/0x190 [md_mod
> > > > 64c55bfe07bb9f714eafd175176a02873a443cb7]
> > > > [ 11.009359] ? sched_energy_aware_handler+0xb0/0xb0
> > > > [ 11.009681] kthread+0xdb/0x110
> > > > [ 11.009996] ? kthread_complete_and_exit+0x20/0x20
> > > > [ 11.010319] ret_from_fork+0x1f/0x30
> > > > [ 11.010325] </TASK>
> > > > [ 11.010326] Modules linked in: platform_profile libarc4
> > > > snd_hda_core snd_hwdep i8042 realtek kvm cfg80211 snd_pcm
> > > > sp5100_tco mdio_devres serio snd_timer raid10 irqbypass wmi_bmof
> > > > pcspkr k10temp i2c_piix4 rapl rfkill libphy snd soundcore md_mod
> > > > gpio_amdpt acpi_cpufreq gpio_generic mac_hid uinput i2c_dev sg
> > > > crypto_user fuse loop nfnetlink bpf_preload ip_tables x_tables
> > > > ext4 crc32c_generic crc16 mbcache jbd2 usbhid dm_crypt cbc
> > > > encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul
> > > > crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
> > > > gf128mul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3
> > > > nvme aesni_intel crypto_simd mpt3sas nvme_core cryptd ccp
> > > > nvme_common xhci_pci raid_class xhci_pci_renesas
> > > > scsi_transport_sas amdgpu drm_ttm_helper ttm video wmi gpu_sched
> > > > drm_buddy drm_display_helper cec [ 11.012188] ---[ end trace
> > > > 0000000000000000 ]---
> > >
> > > I wish I had some ideas for you; I'm sure others will soon. Two
> > > quick questions though:
> > >
> > > 1) what is the manuf/model of the 8 drives?
> > > 2) have you tried creating a 4 disk RAID10 out of those drives?
> > > (just curious since you have a 4 disk RAID10 working there)
> > >
> > > -Paul
> >
> > 1. Samsung MZILS15THMLS-0G5, "1633a"
> > 2. I tried making a 4 disk and a 3 disk RAID10, both immediately had
> > the same protection fault upon initial sync.
> >
> > -Colgate
>
> So just to test real quick I have PM 1743 here (NVMe not SAS) and tried
> a quick 4 disk RAID10 on 6.9.0.rc2+ and although it worked (created and
> did some dd writes) I did get this in dmesg. Anything in any of your
> logs?
>
> Is it safe to say that you tried other disks as well? I realize
> these disks work with other RAID levels; I'm just trying to help complete
> the triage info for others, as I'm still learning to debug mdraid :)
>
> [ 86.703241] {1}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0 [ 86.703251] {1}[Hardware Error]: It has
> been corrected by h/w and requires no further action [ 86.703254]
> {1}[Hardware Error]: event severity: corrected [ 86.703257]
> {1}[Hardware Error]: Error 0, type: corrected [ 86.703261]
> {1}[Hardware Error]: section_type: PCIe error [ 86.703263]
> {1}[Hardware Error]: port_type: 0, PCIe end point [ 86.703265]
> {1}[Hardware Error]: version: 3.0 [ 86.703267] {1}[Hardware Error]:
> command: 0x0546, status: 0x0011 [ 86.703271] {1}[Hardware Error]:
> device_id: 0000:cf:00.0 [ 86.703275] {1}[Hardware Error]: slot: 0
> [ 86.703277] {1}[Hardware Error]: secondary_bus: 0x00
> [ 86.703279] {1}[Hardware Error]: vendor_id: 0x144d, device_id:
> 0xa826 [ 86.703282] {1}[Hardware Error]: class_code: 010802
>
>
> -Paul
I'm not seeing any log entries similar to that, or any other errors in dmesg/
journalctl besides the protection fault.
I just tried RAID10 on the same HBA/cables with 4 Seagate 4TB SAS HDDs, and it
is functioning correctly: it syncs correctly and I am able to write to and read
from the md device.
-Colgate
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-28 22:16 ` Colgate Minuette
@ 2024-04-28 22:25 ` Roman Mamedov
2024-04-28 22:38 ` Colgate Minuette
0 siblings, 1 reply; 15+ messages in thread
From: Roman Mamedov @ 2024-04-28 22:25 UTC (permalink / raw)
To: Colgate Minuette; +Cc: Paul E Luse, linux-raid
On Sun, 28 Apr 2024 15:16:27 -0700
Colgate Minuette <rabbit@minuette.net> wrote:
> I just tried RAID10 on the same HBA/cables with 4 seagate 4TB SAS HDDs, and it
> is functioning correctly. Syncing correctly and able to write/read from the md
> device.
With those 15 TB SSDs, maybe something is wonky with the large size?
Have you tried creating a smaller partition on each drive? Maybe start with
just 4 drives, so you don't have to redo all of them, since you say 4 also
reproduces the issue. Test with 4 TB or even 1 TB partitions as the RAID
members.
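A rough sketch of that kind of test, with placeholder device names (sgdisk is
just one example of a partitioning tool; this wipes the chosen drives):
  # re-partition four of the SSDs with a small 1 TB test partition
  for d in /dev/sde /dev/sdf /dev/sdg /dev/sdh; do
      sgdisk --zap-all "$d"
      sgdisk --new=1:0:+1T "$d"
  done
  # build a RAID10 from just the small partitions
  mdadm --create /dev/md99 --level=10 --layout=o2 -n 4 /dev/sd[efgh]1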
--
With respect,
Roman
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-28 22:25 ` Roman Mamedov
@ 2024-04-28 22:38 ` Colgate Minuette
0 siblings, 0 replies; 15+ messages in thread
From: Colgate Minuette @ 2024-04-28 22:38 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Paul E Luse, linux-raid
On Sunday, April 28, 2024 3:25:29 PM PDT Roman Mamedov wrote:
> On Sun, 28 Apr 2024 15:16:27 -0700
>
> Colgate Minuette <rabbit@minuette.net> wrote:
> > I just tried RAID10 on the same HBA/cables with 4 seagate 4TB SAS HDDs,
> > and it is functioning correctly. Syncing correctly and able to write/read
> > from the md device.
>
> With those 15 TB SSDs, maybe something wonky with the large size?
>
> Did you try creating a smaller partition on each, maybe just start with 4,
> to not redo all of them, since you say 4 also repros the issue. Test with a
> 4 TB or even 1TB partitions as RAID members.
Created a 2TB partition on 4 of the drives, created md RAID10 on the 2TB
partitions, and got another protection fault shortly after starting the array.
[ 515.504412] md/raid10:md51: not clean -- starting background reconstruction
[ 515.504414] md/raid10:md51: active with 4 out of 4 devices
[ 515.530362] md51: detected capacity change from 0 to 8388079616
[ 515.530445] md: resync of RAID array md51
[ 524.083652] general protection fault, probably for non-canonical address
0xb1c8a7fff899a: 0000 [#1] PREEMPT SMP NOPTI
-Colgate
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-28 19:41 General Protection Fault in md raid10 Colgate Minuette
2024-04-27 16:21 ` Paul E Luse
@ 2024-04-29 1:02 ` Yu Kuai
2024-04-29 2:18 ` Colgate Minuette
1 sibling, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2024-04-29 1:02 UTC (permalink / raw)
To: Colgate Minuette, linux-raid; +Cc: yukuai (C), yangerkun@huawei.com
Hi,
在 2024/04/29 3:41, Colgate Minuette 写道:
> Hello all,
>
> I am trying to set up an md raid-10 array spanning 8 disks using the following
> command
>
>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
>
> The raid is created successfully, but the moment that the newly created raid
> starts initial sync, a general protection fault is issued. This fault happens
> on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version 4.3. The raid is then
> completely unusable. After the fault, if I try to stop the raid using
>
>> mdadm --stop /dev/md64
>
> mdadm hangs indefinitely.
>
> I have tried raid levels 0 and 6, and both work as expected without any errors
> on these same 8 drives. I also have a working md raid-10 on the system already
> with 4 disks(not related to this 8 disk array).
>
> Other things I have tried include trying to create/sync the raid from a debian
> live environment, and using near/far/offset layouts, but both methods came back
> with the same protection fault. Also ran a memory test on the computer, but
> did not have any errors after 10 passes.
>
> Below is the output from the general protection fault. Let me know of anything
> else to try or log information that would be helpful to diagnose.
>
> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> [ 10.965593] md: resync of RAID array md64
> [ 10.999289] general protection fault, probably for non-canonical address
> 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO
> #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS
> WIFI, BIOS 1618 05/18/2023
> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1
> e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06
> 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> [ 11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> [ 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> [ 11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> [ 11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> [ 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:
> 0000000000000000
> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4:
> 0000000000750ee0
> [ 11.004737] PKRU: 55555554
> [ 11.005040] Call Trace:
> [ 11.005342] <TASK>
> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> [ 11.005951] ? die_addr+0x3c/0x60
> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> [ 11.007169] bio_copy_data+0x5c/0x80
> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> 1721e6c9d579361bf112b0ce400eec9240452da1]
Can you try to use addr2line or gdb to locate which line of code this
corresponds to?
I have never seen a problem like this before... And it would be great if you
could bisect this, since you can reproduce the problem easily.
Thanks,
Kuai
> [ 11.007788] ? srso_alias_return_thunk+0x5/0x7f
> [ 11.008099] ? srso_alias_return_thunk+0x5/0x7f
> [ 11.008408] ? prepare_to_wait_event+0x60/0x180
> [ 11.008720] ? unregister_md_personality+0x70/0x70 [md_mod
> 64c55bfe07bb9f714eafd175176a02873a443cb7]
> [ 11.009039] md_thread+0xab/0x190 [md_mod
> 64c55bfe07bb9f714eafd175176a02873a443cb7]
> [ 11.009359] ? sched_energy_aware_handler+0xb0/0xb0
> [ 11.009681] kthread+0xdb/0x110
> [ 11.009996] ? kthread_complete_and_exit+0x20/0x20
> [ 11.010319] ret_from_fork+0x1f/0x30
> [ 11.010325] </TASK>
> [ 11.010326] Modules linked in: platform_profile libarc4 snd_hda_core
> snd_hwdep i8042 realtek kvm cfg80211 snd_pcm sp5100_tco mdio_devres serio
> snd_timer raid10 irqbypass wmi_bmof pcspkr k10temp i2c_piix4 rapl rfkill
> libphy snd soundcore md_mod gpio_amdpt acpi_cpufreq gpio_generic mac_hid
> uinput i2c_dev sg crypto_user fuse loop nfnetlink bpf_preload ip_tables
> x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid dm_crypt cbc
> encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul
> crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel
> sha512_ssse3 sha256_ssse3 sha1_ssse3 nvme aesni_intel crypto_simd mpt3sas
> nvme_core cryptd ccp nvme_common xhci_pci raid_class xhci_pci_renesas
> scsi_transport_sas amdgpu drm_ttm_helper ttm video wmi gpu_sched drm_buddy
> drm_display_helper cec
> [ 11.012188] ---[ end trace 0000000000000000 ]---
>
>
>
> .
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 1:02 ` Yu Kuai
@ 2024-04-29 2:18 ` Colgate Minuette
2024-04-29 3:12 ` Yu Kuai
0 siblings, 1 reply; 15+ messages in thread
From: Colgate Minuette @ 2024-04-29 2:18 UTC (permalink / raw)
To: linux-raid, Yu Kuai; +Cc: yukuai (C), yangerkun@huawei.com
On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> Hi,
>
> 在 2024/04/29 3:41, Colgate Minuette 写道:
> > Hello all,
> >
> > I am trying to set up an md raid-10 array spanning 8 disks using the
> > following command
> >
> >> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> >
> > The raid is created successfully, but the moment that the newly created
> > raid starts initial sync, a general protection fault is issued. This
> > fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
> > 4.3. The raid is then completely unusable. After the fault, if I try to
> > stop the raid using>
> >> mdadm --stop /dev/md64
> >
> > mdadm hangs indefinitely.
> >
> > I have tried raid levels 0 and 6, and both work as expected without any
> > errors on these same 8 drives. I also have a working md raid-10 on the
> > system already with 4 disks(not related to this 8 disk array).
> >
> > Other things I have tried include trying to create/sync the raid from a
> > debian live environment, and using near/far/offset layouts, but both
> > methods came back with the same protection fault. Also ran a memory test
> > on the computer, but did not have any errors after 10 passes.
> >
> > Below is the output from the general protection fault. Let me know of
> > anything else to try or log information that would be helpful to
> > diagnose.
> >
> > [ 10.965542] md64: detected capacity change from 0 to 120021483520
> > [ 10.965593] md: resync of RAID array md64
> > [ 10.999289] general protection fault, probably for non-canonical
> > address
> > 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
> > 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
> > X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48
> > c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff
> > <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> > [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> > [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX:
> > ffff89be8656a000 [ 11.002628] RDX: 0000000000000642 RSI:
> > 000d071e7fff89be RDI: ffff89beb4039df8 [ 11.002922] RBP:
> > ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60 [
> > 11.003217] R10: 00000000000009be R11: 0000000000002000 R12:
> > ffff89be8bbff400 [ 11.003522] R13: ffff89beb4039a00 R14:
> > ffffca0a80000000 R15: 0000000000001000 [ 11.003825] FS:
> > 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
> > [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4:
> > 0000000000750ee0
> > [ 11.004737] PKRU: 55555554
> > [ 11.005040] Call Trace:
> > [ 11.005342] <TASK>
> > [ 11.005645] ? __die_body.cold+0x1a/0x1f
> > [ 11.005951] ? die_addr+0x3c/0x60
> > [ 11.006256] ? exc_general_protection+0x1c1/0x380
> > [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> > [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> > [ 11.007169] bio_copy_data+0x5c/0x80
> > [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> > 1721e6c9d579361bf112b0ce400eec9240452da1]
>
> Can you try to use addr2line or gdb to locate which line of code this
> corresponds to?
>
> I have never seen a problem like this before... And it would be great if
> you could bisect this, since you can reproduce the problem easily.
>
> Thanks,
> Kuai
>
Can you provide guidance on how to do this? I haven't ever debugged kernel
code before. I'm assuming this would be in the raid10.ko module, but don't
know where to go from there.
-Colgate
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 2:18 ` Colgate Minuette
@ 2024-04-29 3:12 ` Yu Kuai
2024-04-29 4:30 ` Colgate Minuette
0 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2024-04-29 3:12 UTC (permalink / raw)
To: Colgate Minuette, linux-raid, Yu Kuai; +Cc: yangerkun@huawei.com, yukuai (C)
Hi,
在 2024/04/29 10:18, Colgate Minuette 写道:
> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/04/29 3:41, Colgate Minuette 写道:
>>> Hello all,
>>>
>>> I am trying to set up an md raid-10 array spanning 8 disks using the
>>> following command
>>>
>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
>>>
>>> The raid is created successfully, but the moment that the newly created
>>> raid starts initial sync, a general protection fault is issued. This
>>> fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
>>> 4.3. The raid is then completely unusable. After the fault, if I try to
>>> stop the raid using>
>>>> mdadm --stop /dev/md64
>>>
>>> mdadm hangs indefinitely.
>>>
>>> I have tried raid levels 0 and 6, and both work as expected without any
>>> errors on these same 8 drives. I also have a working md raid-10 on the
>>> system already with 4 disks(not related to this 8 disk array).
>>>
>>> Other things I have tried include trying to create/sync the raid from a
>>> debian live environment, and using near/far/offset layouts, but both
>>> methods came back with the same protection fault. Also ran a memory test
>>> on the computer, but did not have any errors after 10 passes.
>>>
>>> Below is the output from the general protection fault. Let me know of
>>> anything else to try or log information that would be helpful to
>>> diagnose.
>>>
>>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
>>> [ 10.965593] md: resync of RAID array md64
>>> [ 10.999289] general protection fault, probably for non-canonical
>>> address
>>> 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
>>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
>>> 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
>>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
>>> X670E-PLUS WIFI, BIOS 1618 05/18/2023
>>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
>>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48
>>> c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff
>>> <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
>>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
>>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX:
>>> ffff89be8656a000 [ 11.002628] RDX: 0000000000000642 RSI:
>>> 000d071e7fff89be RDI: ffff89beb4039df8 [ 11.002922] RBP:
>>> ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60 [
>>> 11.003217] R10: 00000000000009be R11: 0000000000002000 R12:
>>> ffff89be8bbff400 [ 11.003522] R13: ffff89beb4039a00 R14:
>>> ffffca0a80000000 R15: 0000000000001000 [ 11.003825] FS:
>>> 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
>>> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4:
>>> 0000000000750ee0
>>> [ 11.004737] PKRU: 55555554
>>> [ 11.005040] Call Trace:
>>> [ 11.005342] <TASK>
>>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
>>> [ 11.005951] ? die_addr+0x3c/0x60
>>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
>>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
>>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
>>> [ 11.007169] bio_copy_data+0x5c/0x80
>>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
>>> 1721e6c9d579361bf112b0ce400eec9240452da1]
>>
>> Can you try to use addr2line or gdb to locate which line of code this
>> corresponds to?
>>
>> I have never seen a problem like this before... And it would be great if
>> you could bisect this, since you can reproduce the problem easily.
>>
>> Thanks,
>> Kuai
>>
>
> Can you provide guidance on how to do this? I haven't ever debugged kernel
> code before. I'm assuming this would be in the raid10.ko module, but don't
> know where to go from there.
For addr2line, you can gdb raid10.ko, then:
list *(raid10d+0xcad)
and gdb vmlinux:
list *(bio_copy_data_iter+0x187)
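Run non-interactively, that lookup might look roughly like this (a sketch; it
assumes debug symbols for the exact running kernel and module are available,
and the paths are placeholders):
  gdb -batch -ex 'list *(raid10d+0xcad)' /path/to/raid10.ko
  gdb -batch -ex 'list *(bio_copy_data_iter+0x187)' /path/to/vmlinux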
For git bisect, you must find a good kernel version, then:
git bisect start
git bisect bad v6.1
git bisect good xxx
Then git will show you how many steps are needed and choose a commit for you.
After you compile and test that kernel, run:
git bisect good/bad
Then git will continue the bisection based on your test result, and at the end
you will get the blamed commit.
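One round of that loop might look roughly like this (a sketch only; the
build/install commands are generic examples and will vary by distro):
  git bisect start
  git bisect bad v6.1
  git bisect good v5.10        # example: a version believed to be good
  # build and install the kernel git checked out, then reboot into it
  make olddefconfig && make -j"$(nproc)"
  sudo make modules_install install
  # after rebooting, re-run the mdadm --create test, then report the result
  git bisect good              # or: git bisect bad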
Thanks,
Kuai
>
> -Colgate
>
>
> .
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 3:12 ` Yu Kuai
@ 2024-04-29 4:30 ` Colgate Minuette
2024-04-29 6:06 ` Yu Kuai
0 siblings, 1 reply; 15+ messages in thread
From: Colgate Minuette @ 2024-04-29 4:30 UTC (permalink / raw)
To: linux-raid, Yu Kuai, Yu Kuai; +Cc: yangerkun@huawei.com, yukuai (C)
On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> Hi,
>
> 在 2024/04/29 10:18, Colgate Minuette 写道:
> > On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> >> Hi,
> >>
> >> 在 2024/04/29 3:41, Colgate Minuette 写道:
> >>> Hello all,
> >>>
> >>> I am trying to set up an md raid-10 array spanning 8 disks using the
> >>> following command
> >>>
> >>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> >>>
> >>> The raid is created successfully, but the moment that the newly created
> >>> raid starts initial sync, a general protection fault is issued. This
> >>> fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
> >>> 4.3. The raid is then completely unusable. After the fault, if I try to
> >>> stop the raid using>
> >>>
> >>>> mdadm --stop /dev/md64
> >>>
> >>> mdadm hangs indefinitely.
> >>>
> >>> I have tried raid levels 0 and 6, and both work as expected without any
> >>> errors on these same 8 drives. I also have a working md raid-10 on the
> >>> system already with 4 disks(not related to this 8 disk array).
> >>>
> >>> Other things I have tried include trying to create/sync the raid from a
> >>> debian live environment, and using near/far/offset layouts, but both
> >>> methods came back with the same protection fault. Also ran a memory test
> >>> on the computer, but did not have any errors after 10 passes.
> >>>
> >>> Below is the output from the general protection fault. Let me know of
> >>> anything else to try or log information that would be helpful to
> >>> diagnose.
> >>>
> >>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> >>> [ 10.965593] md: resync of RAID array md64
> >>> [ 10.999289] general protection fault, probably for non-canonical
> >>> address
> >>> 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> >>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
> >>> 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> >>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
> >>> X670E-PLUS WIFI, BIOS 1618 05/18/2023
> >>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> >>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c
> >>> 48
> >>> c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff
> >>> <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> >>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> >>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX:
> >>> ffff89be8656a000 [ 11.002628] RDX: 0000000000000642 RSI:
> >>> 000d071e7fff89be RDI: ffff89beb4039df8 [ 11.002922] RBP:
> >>> ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60 [
> >>> 11.003217] R10: 00000000000009be R11: 0000000000002000 R12:
> >>> ffff89be8bbff400 [ 11.003522] R13: ffff89beb4039a00 R14:
> >>> ffffca0a80000000 R15: 0000000000001000 [ 11.003825] FS:
> >>> 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
> >>> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4:
> >>> 0000000000750ee0
> >>> [ 11.004737] PKRU: 55555554
> >>> [ 11.005040] Call Trace:
> >>> [ 11.005342] <TASK>
> >>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> >>> [ 11.005951] ? die_addr+0x3c/0x60
> >>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> >>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> >>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> >>> [ 11.007169] bio_copy_data+0x5c/0x80
> >>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> >>> 1721e6c9d579361bf112b0ce400eec9240452da1]
> >>
> >> Can you try to use addr2line or gdb to locate which line of code this
> >> corresponds to?
> >>
> >> I have never seen a problem like this before... And it would be great if
> >> you could bisect this, since you can reproduce the problem easily.
> >>
> >> Thanks,
> >> Kuai
> >
> > Can you provide guidance on how to do this? I haven't ever debugged kernel
> > code before. I'm assuming this would be in the raid10.ko module, but don't
> > know where to go from there.
>
> For addr2line, you can gdb raid10.ko, then:
>
> list *(raid10d+0xcad)
>
> and gdb vmlinux:
>
> list *(bio_copy_data_iter+0x187)
>
> For git bisect, you must find a good kernel version, then:
>
> git bisect start
> git bisect bad v6.1
> git bisect good xxx
>
> Then git will show you how many steps are needed and choose a commit for
> you, after compile and test the kernel:
>
> git bisect good/bad
>
> Then git will do the bisection based on your test result, at last
> you will get a blamed commit.
>
> Thanks,
> Kuai
>
I don't know of any kernel version that works for this; every setup I've tried
has had the same issue.
(gdb) list *(raid10d+0xa52)
0x6692 is in raid10d (drivers/md/raid10.c:2480).
2475 in drivers/md/raid10.c
(gdb) list *(bio_copy_data_iter+0x187)
0xffffffff814c3a77 is in bio_copy_data_iter (block/bio.c:1357).
1352 in block/bio.c
uname -a
Linux debian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1
(2024-02-01) x86_64 GNU/Linux
-Colgate
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 4:30 ` Colgate Minuette
@ 2024-04-29 6:06 ` Yu Kuai
2024-04-29 6:39 ` Colgate Minuette
0 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2024-04-29 6:06 UTC (permalink / raw)
To: Colgate Minuette, linux-raid, Yu Kuai; +Cc: yangerkun@huawei.com
Hi,
在 2024/04/29 12:30, Colgate Minuette 写道:
> On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/04/29 10:18, Colgate Minuette 写道:
>>> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> 在 2024/04/29 3:41, Colgate Minuette 写道:
>>>>> Hello all,
>>>>>
>>>>> I am trying to set up an md raid-10 array spanning 8 disks using the
>>>>> following command
>>>>>
>>>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
>>>>>
>>>>> The raid is created successfully, but the moment that the newly created
>>>>> raid starts initial sync, a general protection fault is issued. This
>>>>> fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
>>>>> 4.3. The raid is then completely unusable. After the fault, if I try to
>>>>> stop the raid using>
>>>>>
>>>>>> mdadm --stop /dev/md64
>>>>>
>>>>> mdadm hangs indefinitely.
>>>>>
>>>>> I have tried raid levels 0 and 6, and both work as expected without any
>>>>> errors on these same 8 drives. I also have a working md raid-10 on the
>>>>> system already with 4 disks(not related to this 8 disk array).
>>>>>
>>>>> Other things I have tried include trying to create/sync the raid from a
>>>>> debian live environment, and using near/far/offset layouts, but both
>>>>> methods came back with the same protection fault. Also ran a memory test
>>>>> on the computer, but did not have any errors after 10 passes.
>>>>>
>>>>> Below is the output from the general protection fault. Let me know of
>>>>> anything else to try or log information that would be helpful to
>>>>> diagnose.
>>>>>
>>>>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
>>>>> [ 10.965593] md: resync of RAID array md64
>>>>> [ 10.999289] general protection fault, probably for non-canonical
>>>>> address
>>>>> 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
>>>>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted
>>>>> 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
>>>>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING
>>>>> X670E-PLUS WIFI, BIOS 1618 05/18/2023
>>>>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
>>>>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c
>>>>> 48
>>>>> c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff
>>>>> <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
>>>>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
>>>>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX:
>>>>> ffff89be8656a000 [ 11.002628] RDX: 0000000000000642 RSI:
>>>>> 000d071e7fff89be RDI: ffff89beb4039df8 [ 11.002922] RBP:
>>>>> ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60 [
>>>>> 11.003217] R10: 00000000000009be R11: 0000000000002000 R12:
>>>>> ffff89be8bbff400 [ 11.003522] R13: ffff89beb4039a00 R14:
>>>>> ffffca0a80000000 R15: 0000000000001000 [ 11.003825] FS:
>>>>> 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
>>>>> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4:
>>>>> 0000000000750ee0
>>>>> [ 11.004737] PKRU: 55555554
>>>>> [ 11.005040] Call Trace:
>>>>> [ 11.005342] <TASK>
>>>>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
>>>>> [ 11.005951] ? die_addr+0x3c/0x60
>>>>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
>>>>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
>>>>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
>>>>> [ 11.007169] bio_copy_data+0x5c/0x80
>>>>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
>>>>> 1721e6c9d579361bf112b0ce400eec9240452da1]
>>>>
>>>> Can you try to use addr2line or gdb to locate which line of code this
>>>> corresponds to?
>>>>
>>>> I have never seen a problem like this before... And it would be great if
>>>> you could bisect this, since you can reproduce the problem easily.
>>>>
>>>> Thanks,
>>>> Kuai
>>>
>>> Can you provide guidance on how to do this? I haven't ever debugged kernel
>>> code before. I'm assuming this would be in the raid10.ko module, but don't
>>> know where to go from there.
>>
>> For addr2line, you can gdb raid10.ko, then:
>>
>> list *(raid10d+0xcad)
>>
>> and gdb vmlinux:
>>
>> list *(bio_copy_data_iter+0x187)
>>
>> For git bisect, you must find a good kernel version, then:
>>
>> git bisect start
>> git bisect bad v6.1
>> git bisect good xxx
>>
>> Then git will show you how many steps are needed and choose a commit for
>> you; after you compile and test that kernel, run:
>>
>> git bisect good/bad
>>
>> Then git will continue the bisection based on your test result, and at
>> the end you will get a blamed commit.
>>
>> Thanks,
>> Kuai
>>
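
A minimal sketch of the symbol-resolution step quoted above, assuming a
kernel built with debug info and commands run from the top of the matching
source tree (the paths to vmlinux and raid10.ko are illustrative):

  # module-side frame from the oops
  gdb drivers/md/raid10.ko
  (gdb) list *(raid10d+0xcad)

  # core-kernel frame
  gdb vmlinux
  (gdb) list *(bio_copy_data_iter+0x187)

  # scripts/faddr2line accepts the func+offset/size form printed in the oops
  ./scripts/faddr2line vmlinux bio_copy_data_iter+0x187/0x260
  ./scripts/faddr2line drivers/md/raid10.ko raid10d+0xcad/0x1c00
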
>
> I don't know of any kernel that is working for this, every setup I've tried
> has had the same issue.
This is really weird; is this the first time you have ever used raid10? Did
you try some older kernel like v5.10 or v4.19?
>
> (gdb) list *(raid10d+0xa52)
> 0x6692 is in raid10d (drivers/md/raid10.c:2480).
> 2475 in drivers/md/raid10.c
>
> (gdb) list *(bio_copy_data_iter+0x187)
> 0xffffffff814c3a77 is in bio_copy_data_iter (block/bio.c:1357).
> 1352 in block/bio.c
Thanks for this, I'll try to take a look at related code.
Kuai
>
> uname -a
> Linux debian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1
> (2024-02-01) x86_64 GNU/Linux
>
> -Colgate
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 6:06 ` Yu Kuai
@ 2024-04-29 6:39 ` Colgate Minuette
2024-04-29 7:06 ` Colgate Minuette
0 siblings, 1 reply; 15+ messages in thread
From: Colgate Minuette @ 2024-04-29 6:39 UTC (permalink / raw)
To: linux-raid, Yu Kuai, Yu Kuai; +Cc: yangerkun@huawei.com
On Sunday, April 28, 2024 11:06:51 PM PDT Yu Kuai wrote:
> Hi,
>
> On 2024/04/29 12:30, Colgate Minuette wrote:
> > On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> >> Hi,
> >>
> >> On 2024/04/29 10:18, Colgate Minuette wrote:
> >>> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> >>>> Hi,
> >>>>
> >>>> On 2024/04/29 3:41, Colgate Minuette wrote:
> >>>>> Hello all,
> >>>>>
> >>>>> I am trying to set up an md raid-10 array spanning 8 disks using the
> >>>>> following command
> >>>>>
> >>>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8
> >>>>>> /dev/sd[efghijkl]1
> >>>>>
> >>>>> The raid is created successfully, but the moment that the newly
> >>>>> created
> >>>>> raid starts initial sync, a general protection fault is issued. This
> >>>>> fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
> >>>>> 4.3. The raid is then completely unusable. After the fault, if I try
> >>>>> to stop the raid using
> >>>>>
> >>>>>> mdadm --stop /dev/md64
> >>>>>
> >>>>> mdadm hangs indefinitely.
> >>>>>
> >>>>> I have tried raid levels 0 and 6, and both work as expected without any
> >>>>> errors on these same 8 drives. I also have a working md raid-10 on the
> >>>>> system already with 4 disks (not related to this 8-disk array).
> >>>>>
> >>>>> Other things I have tried include creating/syncing the raid from a
> >>>>> Debian live environment, and using near/far/offset layouts, but both
> >>>>> methods came back with the same protection fault. I also ran a memory
> >>>>> test on the computer, but did not have any errors after 10 passes.
> >>>>>
> >>>>> Below is the output from the general protection fault. Let me know of
> >>>>> anything else to try or log information that would be helpful to
> >>>>> diagnose.
> >>>>>
> >>>>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> >>>>> [ 10.965593] md: resync of RAID array md64
> >>>>> [ 10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> >>>>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> >>>>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> >>>>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> >>>>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> >>>>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> >>>>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> >>>>> [ 11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> >>>>> [ 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> >>>>> [ 11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> >>>>> [ 11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> >>>>> [ 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
> >>>>> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> >>>>> [ 11.004737] PKRU: 55555554
> >>>>> [ 11.005040] Call Trace:
> >>>>> [ 11.005342] <TASK>
> >>>>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> >>>>> [ 11.005951] ? die_addr+0x3c/0x60
> >>>>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> >>>>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> >>>>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> >>>>> [ 11.007169] bio_copy_data+0x5c/0x80
> >>>>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> >>>>> 1721e6c9d579361bf112b0ce400eec9240452da1]
> >>>>
> >>>> Can you try to use addr2line or gdb to locate which code line
> >>>> this corresponds to?
> >>>>
> >>>> I have never seen a problem like this before... And it'll be great if
> >>>> you can bisect this, since you can reproduce this problem easily.
> >>>>
> >>>> Thanks,
> >>>> Kuai
> >>>
> >>> Can you provide guidance on how to do this? I haven't ever debugged
> >>> kernel
> >>> code before. I'm assuming this would be in the raid10.ko module, but
> >>> don't
> >>> know where to go from there.
> >>
> >> For addr2line, you can gdb raid10.ko, then:
> >>
> >> list *(raid10d+0xcad)
> >>
> >> and gdb vmlinux:
> >>
> >> list *(bio_copy_data_iter+0x187)
> >>
> >> For git bisect, you must find a good kernel version, then:
> >>
> >> git bisect start
> >> git bisect bad v6.1
> >> git bisect good xxx
> >>
> >> Then git will show you how many steps are needed and choose a commit for
> >> you; after you compile and test that kernel, run:
> >>
> >> git bisect good/bad
> >>
> >> Then git will continue the bisection based on your test result, and at
> >> the end you will get a blamed commit.
> >>
> >> Thanks,
> >> Kuai
> >
> > I don't know of any kernel that is working for this, every setup I've
> > tried
> > has had the same issue.
>
> This is really weird; is this the first time you have ever used raid10? Did
> you try some older kernel like v5.10 or v4.19?
>
I have been using md raid10 on this system for about 10 years with a different
set of disks, with no issues. The other raid10 that is in place uses SATA
drives, but I have also created and tested a raid10 with different SAS drives
on this system, and had no issues with that test.

These Samsung SSDs are a new addition to the system. I'll try the raid10 on
4.19.307 and 5.10.211 as well, since those are in my distro's repos.
-Colgate
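
For anyone retracing the test on those older kernels, a rough sketch of the
reproduction sequence already described in this thread (device names and the
md node are the reporter's; the superblock wipe between runs is an assumption
about how the array gets recreated for each kernel):

  # create the 8-disk raid10 with the offset layout that triggers the fault
  mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1

  # on affected kernels the GPF appears as soon as the initial resync starts
  dmesg -w | grep -i -e raid10 -e 'general protection'

  # tear down between kernel tests (this is the step that hangs after a fault)
  mdadm --stop /dev/md64
  mdadm --zero-superblock /dev/sd[efghijkl]1
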
> > (gdb) list *(raid10d+0xa52)
> > 0x6692 is in raid10d (drivers/md/raid10.c:2480).
> > 2475 in drivers/md/raid10.c
> >
> > (gdb) list *(bio_copy_data_iter+0x187)
> > 0xffffffff814c3a77 is in bio_copy_data_iter (block/bio.c:1357).
> > 1352 in block/bio.c
>
> Thanks for this, I'll try to take a look at related code.
>
> Kuai
>
> > uname -a
> > Linux debian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1
> > (2024-02-01) x86_64 GNU/Linux
> >
> > -Colgate
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 6:39 ` Colgate Minuette
@ 2024-04-29 7:06 ` Colgate Minuette
2024-04-29 7:52 ` Yu Kuai
0 siblings, 1 reply; 15+ messages in thread
From: Colgate Minuette @ 2024-04-29 7:06 UTC (permalink / raw)
To: linux-raid, Yu Kuai, Yu Kuai; +Cc: yangerkun@huawei.com
On Sunday, April 28, 2024 11:39:21 PM PDT Colgate Minuette wrote:
> On Sunday, April 28, 2024 11:06:51 PM PDT Yu Kuai wrote:
> > Hi,
> >
> > On 2024/04/29 12:30, Colgate Minuette wrote:
> > > On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> > >> Hi,
> > >>
> > >> On 2024/04/29 10:18, Colgate Minuette wrote:
> > >>> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 2024/04/29 3:41, Colgate Minuette wrote:
> > >>>>> Hello all,
> > >>>>>
> > >>>>> I am trying to set up an md raid-10 array spanning 8 disks using the
> > >>>>> following command
> > >>>>>
> > >>>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8
> > >>>>>> /dev/sd[efghijkl]1
> > >>>>>
> > >>>>> The raid is created successfully, but the moment that the newly
> > >>>>> created
> > >>>>> raid starts initial sync, a general protection fault is issued. This
> > >>>>> fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm
> > >>>>> version
> > >>>>> 4.3. The raid is then completely unusable. After the fault, if I
> > >>>>> try to stop the raid using
> > >>>>>
> > >>>>>> mdadm --stop /dev/md64
> > >>>>>
> > >>>>> mdadm hangs indefinitely.
> > >>>>>
> > >>>>> I have tried raid levels 0 and 6, and both work as expected without
> > >>>>> any errors on these same 8 drives. I also have a working md raid-10
> > >>>>> on the system already with 4 disks (not related to this 8-disk array).
> > >>>>>
> > >>>>> Other things I have tried include creating/syncing the raid from a
> > >>>>> Debian live environment, and using near/far/offset layouts, but both
> > >>>>> methods came back with the same protection fault. I also ran a memory
> > >>>>> test on the computer, but did not have any errors after 10 passes.
> > >>>>>
> > >>>>> Below is the output from the general protection fault. Let me know
> > >>>>> of
> > >>>>> anything else to try or log information that would be helpful to
> > >>>>> diagnose.
> > >>>>>
> > >>>>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> > >>>>> [ 10.965593] md: resync of RAID array md64
> > >>>>> [ 10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> > >>>>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> > >>>>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> > >>>>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> > >>>>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> > >>>>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> > >>>>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> > >>>>> [ 11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> > >>>>> [ 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> > >>>>> [ 11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> > >>>>> [ 11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> > >>>>> [ 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS: 0000000000000000
> > >>>>> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>>>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> > >>>>> [ 11.004737] PKRU: 55555554
> > >>>>> [ 11.005040] Call Trace:
> > >>>>> [ 11.005342] <TASK>
> > >>>>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> > >>>>> [ 11.005951] ? die_addr+0x3c/0x60
> > >>>>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> > >>>>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> > >>>>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> > >>>>> [ 11.007169] bio_copy_data+0x5c/0x80
> > >>>>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10
> > >>>>> 1721e6c9d579361bf112b0ce400eec9240452da1]
> > >>>>
> > >>>> Can you try to use addr2line or gdb to locate which code line
> > >>>> this corresponds to?
> > >>>>
> > >>>> I have never seen a problem like this before... And it'll be great if
> > >>>> you can bisect this, since you can reproduce this problem easily.
> > >>>>
> > >>>> Thanks,
> > >>>> Kuai
> > >>>
> > >>> Can you provide guidance on how to do this? I haven't ever debugged
> > >>> kernel
> > >>> code before. I'm assuming this would be in the raid10.ko module, but
> > >>> don't
> > >>> know where to go from there.
> > >>
> > >> For addr2line, you can gdb raid10.ko, then:
> > >>
> > >> list *(raid10d+0xcad)
> > >>
> > >> and gdb vmlinux:
> > >>
> > >> list *(bio_copy_data_iter+0x187)
> > >>
> > >> For git bisect, you must find a good kernel version, then:
> > >>
> > >> git bisect start
> > >> git bisect bad v6.1
> > >> git bisect good xxx
> > >>
> > >> Then git will show you how many steps are needed and choose a commit
> > >> for you; after you compile and test that kernel, run:
> > >>
> > >> git bisect good/bad
> > >>
> > >> Then git will continue the bisection based on your test result, and
> > >> at the end you will get a blamed commit.
> > >>
> > >> Thanks,
> > >> Kuai
> > >
> > > I don't know of any kernel that is working for this, every setup I've
> > > tried
> > > has had the same issue.
> >
> > This is really weird; is this the first time you have ever used raid10?
> > Did you try some older kernel like v5.10 or v4.19?
>
> I have been using md raid10 on this system for about 10 years with a
> different set of disks, with no issues. The other raid10 that is in place
> uses SATA drives, but I have also created and tested a raid10 with different
> SAS drives on this system, and had no issues with that test.
>
> These Samsung SSDs are a new addition to the system. I'll try the raid10 on
> 4.19.307 and 5.10.211 as well, since those are in my distro's repos.
>
> -Colgate
>
Following up, the raid10 with the Samsung drives builds and starts syncing
correctly on 4.19.307, 5.10.211, and 5.15.154, but does not work on 6.1.85 or
newer.
-Colgate
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: General Protection Fault in md raid10
2024-04-29 7:06 ` Colgate Minuette
@ 2024-04-29 7:52 ` Yu Kuai
0 siblings, 0 replies; 15+ messages in thread
From: Yu Kuai @ 2024-04-29 7:52 UTC (permalink / raw)
To: Colgate Minuette, linux-raid, Yu Kuai; +Cc: yangerkun@huawei.com, yukuai (C)
Hi!
On 2024/04/29 15:06, Colgate Minuette wrote:
> Following up, the raid10 with the Samsung drives builds and starts syncing
> correctly on 4.19.307, 5.10.211, and 5.15.154, but does not work on 6.1.85 or
> newer.
That's good news. Perhaps you can bisect between 5.15 and 6.1?
Thanks,
Kuai
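
A rough sketch of the bisect session being suggested here, assuming a
mainline tree (the reporter tested stable releases, so the nearest mainline
tags are used as endpoints; kernel config/build steps are omitted):

  git bisect start
  git bisect bad v6.1
  git bisect good v5.15
  # git checks out a candidate commit and reports how many steps remain;
  # build and boot it, then retry the mdadm --create from the report
  git bisect good    # resync started cleanly on this commit
  git bisect bad     # the GPF reproduced on this commit
  # repeat until git prints the first bad commit, then clean up:
  git bisect reset
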
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread
Thread overview: 15+ messages
-- links below jump to the message on this page --
2024-04-28 19:41 General Protection Fault in md raid10 Colgate Minuette
2024-04-27 16:21 ` Paul E Luse
2024-04-28 20:07 ` Colgate Minuette
2024-04-27 18:22 ` Paul E Luse
2024-04-28 22:16 ` Colgate Minuette
2024-04-28 22:25 ` Roman Mamedov
2024-04-28 22:38 ` Colgate Minuette
2024-04-29 1:02 ` Yu Kuai
2024-04-29 2:18 ` Colgate Minuette
2024-04-29 3:12 ` Yu Kuai
2024-04-29 4:30 ` Colgate Minuette
2024-04-29 6:06 ` Yu Kuai
2024-04-29 6:39 ` Colgate Minuette
2024-04-29 7:06 ` Colgate Minuette
2024-04-29 7:52 ` Yu Kuai