* md raid6 oops in 6.6.4 stable
@ 2023-12-07 13:10 Genes Lists
2023-12-07 13:30 ` Bagas Sanjaya
2023-12-07 16:15 ` Xiao Ni
0 siblings, 2 replies; 10+ messages in thread
From: Genes Lists @ 2023-12-07 13:10 UTC (permalink / raw)
To: snitzer, song, yukuai3, axboe, mpatocka, heinzm, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 469 bytes --]
I have not had a chance to git bisect this, but since it happened in
stable I thought it was important to share sooner rather than later.
One possibly relevant commit between 6.6.3 and 6.6.4 could be:
commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
Author: Song Liu <song@kernel.org>
Date: Fri Nov 17 15:56:30 2023 -0800
md: fix bi_status reporting in md_end_clone_io
The attached log shows a page_fault_oops.
The machine was up for 3 days before the crash happened.
gene
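
(Aside: the commit window Gene scanned can be enumerated from a
linux-stable checkout with something like the below - a sketch; the
pathspec is an assumption about where the relevant code lives,
drivers/md plus the block layer:

    # list stable commits between v6.6.3 and v6.6.4 touching md or block
    git log --oneline v6.6.3..v6.6.4 -- drivers/md block
)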
[-- Attachment #2: raid6-crash --]
[-- Type: text/plain, Size: 4134 bytes --]
Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: 0010:update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS: 0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel: <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 [md_mod b6ca17ee4ae6c03e518ad33b70ddd658bdb0c03a]
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 [md_mod b6ca17ee4ae6c03e518ad33b70ddd658bdb0c03a]
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 [md_mod b6ca17ee4ae6c03e518ad33b70ddd658bdb0c03a]
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30
Dec 06 19:20:54 s6 kernel: </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel: snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:10 md raid6 oops in 6.6.4 stable Genes Lists
@ 2023-12-07 13:30 ` Bagas Sanjaya
2023-12-07 13:55 ` Genes Lists
2023-12-07 13:58 ` Thorsten Leemhuis
2023-12-07 16:15 ` Xiao Ni
1 sibling, 2 replies; 10+ messages in thread
From: Bagas Sanjaya @ 2023-12-07 13:30 UTC (permalink / raw)
To: Genes Lists, snitzer, song, yukuai3, axboe, mpatocka, heinzm,
  Linux Kernel Mailing List, Linux RAID, Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> I have not had a chance to git bisect this, but since it happened in
> stable I thought it was important to share sooner rather than later.
>
> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>
> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> Author: Song Liu <song@kernel.org>
> Date: Fri Nov 17 15:56:30 2023 -0800
>
> md: fix bi_status reporting in md_end_clone_io
>
> The attached log shows a page_fault_oops.
> The machine was up for 3 days before the crash happened.
>

Can you confirm that culprit by bisection?

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:30 ` Bagas Sanjaya
@ 2023-12-07 13:55 ` Genes Lists
2023-12-07 14:42 ` Guoqing Jiang
2023-12-07 13:58 ` Thorsten Leemhuis
1 sibling, 1 reply; 10+ messages in thread
From: Genes Lists @ 2023-12-07 13:55 UTC (permalink / raw)
To: Bagas Sanjaya, snitzer, song, yukuai3, axboe, mpatocka, heinzm,
  Linux Kernel Mailing List, Linux RAID, Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

On 12/7/23 08:30, Bagas Sanjaya wrote:
> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>> I have not had a chance to git bisect this, but since it happened in
>> stable I thought it was important to share sooner rather than later.
>>
>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>
>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>> Author: Song Liu <song@kernel.org>
>> Date: Fri Nov 17 15:56:30 2023 -0800
>>
>> md: fix bi_status reporting in md_end_clone_io
>>
>> The attached log shows a page_fault_oops.
>> The machine was up for 3 days before the crash happened.
>>
>
> Can you confirm that culprit by bisection?
>

That's the plan - however, turnaround could be horribly slow if the
average wait time to crash is on the order of a few days between each
bisect step. Also, the machine is currently in use, so I will need to
deal with that as well.

Will do my best. Fingers crossed someone might just spot something in
the meantime.

The commit mentioned above ensures underlying errors are not hidden,
so it may simply have revealed some underlying issue and not be the
actual 'culprit'.

thanks

gene
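
(Aside: the bisect plan described here would run against the stable
tree roughly as below; each step needs a kernel build, a reboot, and a
multi-day soak under the rsync workload, which is what makes the
turnaround slow:

    git bisect start
    git bisect bad v6.6.4
    git bisect good v6.6.3
    # build and boot the commit git suggests, run the workload for a
    # few days, then record the outcome and repeat:
    git bisect good    # or: git bisect bad
)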
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:55 ` Genes Lists
@ 2023-12-07 14:42 ` Guoqing Jiang
2023-12-07 15:58 ` Genes Lists
0 siblings, 1 reply; 10+ messages in thread
From: Guoqing Jiang @ 2023-12-07 14:42 UTC (permalink / raw)
To: Genes Lists, Bagas Sanjaya, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

Hi,

On 12/7/23 21:55, Genes Lists wrote:
> On 12/7/23 08:30, Bagas Sanjaya wrote:
>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>> I have not had a chance to git bisect this, but since it happened
>>> in stable I thought it was important to share sooner rather than
>>> later.
>>>
>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>
>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>> Author: Song Liu <song@kernel.org>
>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>
>>> md: fix bi_status reporting in md_end_clone_io
>>>
>>> The attached log shows a page_fault_oops.
>>> The machine was up for 3 days before the crash happened.

Could you decode the oops ([1])? (I can't find it in lore for some
reason.) And can it be reproduced reliably? If so, please share the
reproduction steps.

[1]. https://lwn.net/Articles/592724/

Thanks,
Guoqing
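
(Aside: the decode being asked for is typically done with the script
shipped in the kernel tree, roughly as below - a sketch, since the
exact arguments vary between kernel versions, and it assumes a vmlinux
and modules directory matching the crashing 6.6.4-stable-1 build:

    ./scripts/decode_stacktrace.sh vmlinux auto /lib/modules/6.6.4-stable-1 < oops.txt
)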
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 14:42 ` Guoqing Jiang
@ 2023-12-07 15:58 ` Genes Lists
2023-12-07 17:37 ` Song Liu
0 siblings, 1 reply; 10+ messages in thread
From: Genes Lists @ 2023-12-07 15:58 UTC (permalink / raw)
To: Guoqing Jiang, Bagas Sanjaya, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 1649 bytes --]

On 12/7/23 09:42, Guoqing Jiang wrote:
> Hi,
>
> On 12/7/23 21:55, Genes Lists wrote:
>> On 12/7/23 08:30, Bagas Sanjaya wrote:
>>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>>> I have not had a chance to git bisect this, but since it happened
>>>> in stable I thought it was important to share sooner rather than
>>>> later.
>>>>
>>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>>
>>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>>> Author: Song Liu <song@kernel.org>
>>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>>
>>>> md: fix bi_status reporting in md_end_clone_io
>>>>
>>>> The attached log shows a page_fault_oops.
>>>> The machine was up for 3 days before the crash happened.
>
> Could you decode the oops ([1])? (I can't find it in lore for some
> reason.) And can it be reproduced reliably? If so, please share the
> reproduction steps.
>
> [1]. https://lwn.net/Articles/592724/
>
> Thanks,
> Guoqing

- reproducing

An rsync runs twice a day. It copies to this server from another. The
copy is from a (large) top-level directory. On the 3rd day after
booting 6.6.4, the second of these rsyncs triggered the oops. I need
to do more testing to see if I can reliably reproduce it. I have not
seen this oops on earlier stable kernels.

- decoding the oops with scripts/decode_stacktrace.sh had errors:

readelf: Error: Not an ELF file - it has the wrong magic bytes at the start

It appears that the decode script doesn't handle compressed modules.
I changed the readelf line to decompress first (a sketch of that tweak
follows the attached decode below). This fixes the above script
complaint and the result is attached.

gene

[-- Attachment #2: raid6-stacktrace --]
[-- Type: text/plain, Size: 5283 bytes --]

Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
All code
========
   0:   1f                      (bad)
   1:   00 0f                   add    %cl,(%rdi)
   3:   1f                      (bad)
   4:   44 00 00                add    %r8b,(%rax)
   7:   48 8b 4f 28             mov    0x28(%rdi),%rcx
   b:   48 39 f1                cmp    %rsi,%rcx
   e:   78 17                   js     0x27
  10:   80 7f 31 00             cmpb   $0x0,0x31(%rdi)
  14:   74 3b                   je     0x51
  16:   48 8b 47 10             mov    0x10(%rdi),%rax
  1a:   48 8b 78 40             mov    0x40(%rax),%rdi
  1e:   48 8b 4f 28             mov    0x28(%rdi),%rcx
  22:   48 39 f1                cmp    %rsi,%rcx
  25:   79 e9                   jns    0x10
  27:   48 89 c8                mov    %rcx,%rax
  2a:*  f0 48 0f b1 77 28       lock cmpxchg %rsi,0x28(%rdi)    <-- trapping instruction
  30:   75 de                   jne    0x10
  32:   48 89 f0                mov    %rsi,%rax
  35:   48 29 c8                sub    %rcx,%rax
  38:   84 d2                   test   %dl,%dl
  3a:   b9                      .byte 0xb9
  3b:   01 00                   add    %eax,(%rax)
        ...

Code starting with the faulting instruction
===========================================
   0:   f0 48 0f b1 77 28       lock cmpxchg %rsi,0x28(%rdi)
   6:   75 de                   jne    0xffffffffffffffe6
   8:   48 89 f0                mov    %rsi,%rax
   b:   48 29 c8                sub    %rcx,%rax
   e:   84 d2                   test   %dl,%dl
  10:   b9                      .byte 0xb9
  11:   01 00                   add    %eax,(%rax)
        ...

Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS: 0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel: <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 md_mod
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 raid456
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 raid456
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 raid456
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 raid456
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 md_mod
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 md_mod
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30
Dec 06 19:20:54 s6 kernel: </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel: snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
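
(Aside: Gene's decode_stacktrace.sh tweak was not posted; one way to do
it is sketched below. This is a hypothetical helper, not the actual
change - the idea is to hand readelf a decompressed copy whenever the
module file is compressed:

    # hypothetical: decompress .ko.zst/.ko.xz/.ko.gz before readelf sees it
    decompress_module() {
        local obj="$1" tmp
        case "$obj" in
        *.zst) tmp=$(mktemp) && zstd -dqc "$obj" > "$tmp" && echo "$tmp" ;;
        *.xz)  tmp=$(mktemp) && xz -dqc  "$obj" > "$tmp" && echo "$tmp" ;;
        *.gz)  tmp=$(mktemp) && gzip -dc "$obj" > "$tmp" && echo "$tmp" ;;
        *)     echo "$obj" ;;
        esac
    }
    # then, wherever the script runs readelf on $objfile:
    objfile=$(decompress_module "$objfile")
)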
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 15:58 ` Genes Lists
@ 2023-12-07 17:37 ` Song Liu
2023-12-07 19:27 ` Genes Lists
0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2023-12-07 17:37 UTC (permalink / raw)
To: Genes Lists
Cc: Guoqing Jiang, Bagas Sanjaya, snitzer, yukuai3, axboe, mpatocka,
  heinzm, Linux Kernel Mailing List, Linux RAID, Linux Regressions,
  Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

On Thu, Dec 7, 2023 at 7:58 AM Genes Lists <lists@sapience.com> wrote:
>
> On 12/7/23 09:42, Guoqing Jiang wrote:
> > Hi,
> >
> > On 12/7/23 21:55, Genes Lists wrote:
> >> On 12/7/23 08:30, Bagas Sanjaya wrote:
> >>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> >>>> I have not had a chance to git bisect this, but since it happened
> >>>> in stable I thought it was important to share sooner rather than
> >>>> later.
> >>>>
> >>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
> >>>>
> >>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> >>>> Author: Song Liu <song@kernel.org>
> >>>> Date: Fri Nov 17 15:56:30 2023 -0800
> >>>>
> >>>> md: fix bi_status reporting in md_end_clone_io
> >>>>
> >>>> The attached log shows a page_fault_oops.
> >>>> The machine was up for 3 days before the crash happened.
> >
> > Could you decode the oops ([1])? (I can't find it in lore for some
> > reason.) And can it be reproduced reliably? If so, please share the
> > reproduction steps.
> >
> > [1]. https://lwn.net/Articles/592724/
> >
> > Thanks,
> > Guoqing
>
> - reproducing
>
> An rsync runs twice a day. It copies to this server from another. The
> copy is from a (large) top-level directory. On the 3rd day after
> booting 6.6.4, the second of these rsyncs triggered the oops. I need
> to do more testing to see if I can reliably reproduce it. I have not
> seen this oops on earlier stable kernels.
>
> - decoding the oops with scripts/decode_stacktrace.sh had errors:
>
> readelf: Error: Not an ELF file - it has the wrong magic bytes at the start
>
> It appears that the decode script doesn't handle compressed modules.
> I changed the readelf line to decompress first. This fixes the above
> script complaint and the result is attached.

I probably missed something, but I really don't think the commit
(2c975b0b8b11f1ffb1ed538609e2c89d8abf800e) could trigger this issue.

From the trace:

kernel: RIP: 0010:update_io_ticks+0x2c/0x60
=>  2a:*  f0 48 0f b1 77 28   lock cmpxchg %rsi,0x28(%rdi)  << trapped here

[...]

kernel: Call Trace:
kernel: <TASK>
kernel: ? __die+0x23/0x70
kernel: ? page_fault_oops+0x171/0x4e0
kernel: ? exc_page_fault+0x175/0x180
kernel: ? asm_exc_page_fault+0x26/0x30
kernel: ? update_io_ticks+0x2c/0x60
kernel: bdev_end_io_acct+0x63/0x160
kernel: md_end_clone_io+0x75/0xa0    <<< change in md_end_clone_io

The commit only changes how we update bi_status. But bi_status was not
used/checked at all between md_end_clone_io and the trap (lock
cmpxchg). Did I miss something?

Given the issue takes very long to reproduce, maybe we had the issue
before 6.6.4?

Thanks,
Song
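
(Aside: for readers without the tree at hand, the change in the commit
Song refers to is roughly the hunk below - quoted from memory of the
6.6-era drivers/md/md.c, so treat it as a sketch and see the commit
itself for the authoritative diff:

    static void md_end_clone_io(struct bio *bio)
    {
        ...
    -   orig_bio->bi_status = bio->bi_status;
    +   /* only propagate a clone's error if the original has none yet */
    +   if (bio->bi_status && !orig_bio->bi_status)
    +       orig_bio->bi_status = bio->bi_status;
        ...
    }

This matches Gene's observation that the commit "ensures underlying
errors are not hidden": it changes only what ends up in
orig_bio->bi_status, not any pointer the trapping code dereferences.)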
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 17:37 ` Song Liu
@ 2023-12-07 19:27 ` Genes Lists
0 siblings, 0 replies; 10+ messages in thread
From: Genes Lists @ 2023-12-07 19:27 UTC (permalink / raw)
To: Song Liu
Cc: Guoqing Jiang, Bagas Sanjaya, snitzer, yukuai3, axboe, mpatocka,
  heinzm, Linux Kernel Mailing List, Linux RAID, Linux Regressions,
  Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

On 12/7/23 12:37, Song Liu wrote:
...
> kernel: md_end_clone_io+0x75/0xa0    <<< change in md_end_clone_io
>
> The commit only changes how we update bi_status. But bi_status was not
> used/checked at all between md_end_clone_io and the trap (lock
> cmpxchg). Did I miss something?
>
> Given the issue takes very long to reproduce, maybe we had the issue
> before 6.6.4?
>
> Thanks,
> Song

Thanks for clarifying that point.

In the meantime I rebooted the server (shutdown was a struggle) -
finally I fsck'd the filesystem (ext4) sitting on the raid6 - and
manually ran the triggering rsync. This of course completed normally.
That's either good or bad depending on your perspective :)

If I can get it to crash again, I will either start a git bisect (from
6.6.3) or see if 6.7-rc4 shows the same issue.

thanks,

gene
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:30 ` Bagas Sanjaya
2023-12-07 13:55 ` Genes Lists
@ 2023-12-07 13:58 ` Thorsten Leemhuis
2023-12-08 2:05 ` Bagas Sanjaya
1 sibling, 1 reply; 10+ messages in thread
From: Thorsten Leemhuis @ 2023-12-07 13:58 UTC (permalink / raw)
To: Bagas Sanjaya, Genes Lists, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

On 07.12.23 14:30, Bagas Sanjaya wrote:
> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>> I have not had a chance to git bisect this, but since it happened in
>> stable I thought it was important to share sooner rather than later.
>>
>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>
>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>> Author: Song Liu <song@kernel.org>
>> Date: Fri Nov 17 15:56:30 2023 -0800
>>
>> md: fix bi_status reporting in md_end_clone_io
>>
>> The attached log shows a page_fault_oops.
>> The machine was up for 3 days before the crash happened.
>
> Can you confirm that culprit by bisection?

Bagas, I know you are trying to help, but sorry, I'd say this is not
helpful at all -- and maybe even harmful.

From the quoted text it's pretty clear that the reporter knows a
bisection would be helpful, but is currently unable to perform one --
and even states reasons for reporting the issue without having bisected
it. So your message afaics doesn't bring anything new to the table; and
I might be wrong about that, but I fear some people in a situation like
this might even be offended by a reply like that, as it states
something already obvious.

Ciao, Thorsten
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:58 ` Thorsten Leemhuis
@ 2023-12-08 2:05 ` Bagas Sanjaya
0 siblings, 0 replies; 10+ messages in thread
From: Bagas Sanjaya @ 2023-12-08 2:05 UTC (permalink / raw)
To: Thorsten Leemhuis, Genes Lists, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

On 12/7/23 20:58, Thorsten Leemhuis wrote:
> On 07.12.23 14:30, Bagas Sanjaya wrote:
>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>> I have not had a chance to git bisect this, but since it happened in
>>> stable I thought it was important to share sooner rather than later.
>>>
>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>
>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>> Author: Song Liu <song@kernel.org>
>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>
>>> md: fix bi_status reporting in md_end_clone_io
>>>
>>> The attached log shows a page_fault_oops.
>>> The machine was up for 3 days before the crash happened.
>>
>> Can you confirm that culprit by bisection?
>
> Bagas, I know you are trying to help, but sorry, I'd say this is not
> helpful at all -- and maybe even harmful.
>
> From the quoted text it's pretty clear that the reporter knows a
> bisection would be helpful, but is currently unable to perform one --
> and even states reasons for reporting the issue without having
> bisected it. So your message afaics doesn't bring anything new to the
> table; and I might be wrong about that, but I fear some people in a
> situation like this might even be offended by a reply like that, as
> it states something already obvious.

Oops, I didn't fully understand the context. Thanks anyway.

-- 
An old man doll... just what I always wanted! - Clara
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:10 md raid6 oops in 6.6.4 stable Genes Lists
2023-12-07 13:30 ` Bagas Sanjaya
@ 2023-12-07 16:15 ` Xiao Ni
1 sibling, 0 replies; 10+ messages in thread
From: Xiao Ni @ 2023-12-07 16:15 UTC (permalink / raw)
To: Genes Lists; +Cc: snitzer, song, yukuai3, axboe, mpatocka, heinzm, linux-kernel

On Thu, Dec 7, 2023 at 9:12 PM Genes Lists <lists@sapience.com> wrote:
>
> I have not had a chance to git bisect this, but since it happened in
> stable I thought it was important to share sooner rather than later.
>
> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>
> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> Author: Song Liu <song@kernel.org>
> Date: Fri Nov 17 15:56:30 2023 -0800
>
> md: fix bi_status reporting in md_end_clone_io
>
> The attached log shows a page_fault_oops.
> The machine was up for 3 days before the crash happened.
>
> gene

Hi all

I'm following the register values in the crash dump to try to find
some hints. RDI is ffff8881019312c0, which should be the address of
the struct block_device *part. And CR2 is ffff8881019312e8. So the
panic happens when it wants to update part->bd_stamp.

Hope it's helpful, if the addresses are right.

Best Regards
Xiao
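
(Aside: to make that concrete, update_io_ticks() in 6.6 looks roughly
like the below - paraphrased from block/blk-core.c, so treat it as a
sketch. The trapping lock cmpxchg is the try_cmpxchg() on
part->bd_stamp, and CR2 (ffff8881019312e8) is exactly RDI
(ffff8881019312c0) + 0x28, consistent with bd_stamp sitting at offset
0x28 of struct block_device in this build:

    void update_io_ticks(struct block_device *part, unsigned long now, bool end)
    {
        unsigned long stamp;
    again:
        stamp = READ_ONCE(part->bd_stamp);
        if (unlikely(time_after(now, stamp))) {
            /* write to part + 0x28 - where this oops trapped */
            if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
                __part_stat_add(part, io_ticks, end ? now - stamp : 1);
        }
        if (part->bd_partno) {
            /* account against the whole device as well */
            part = bdev_whole(part);
            goto again;
        }
    }
)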