* md raid6 oops in 6.6.4 stable
@ 2023-12-07 13:10 Genes Lists
2023-12-07 13:30 ` Bagas Sanjaya
2023-12-07 16:15 ` Xiao Ni
0 siblings, 2 replies; 10+ messages in thread
From: Genes Lists @ 2023-12-07 13:10 UTC (permalink / raw)
To: snitzer, song, yukuai3, axboe, mpatocka, heinzm, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 469 bytes --]
I have not had a chance to git bisect this, but since it happened in
stable I thought it was important to share sooner rather than later.
One possibly relevant commit between 6.6.3 and 6.6.4 could be:
commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
Author: Song Liu <song@kernel.org>
Date: Fri Nov 17 15:56:30 2023 -0800
md: fix bi_status reporting in md_end_clone_io
The attached log shows a page_fault_oops.
The machine was up for 3 days before the crash happened.
gene
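
(Aside: the commit window Gene scanned can be enumerated from a
linux-stable checkout with something like the below - a sketch; the
pathspec is an assumption about where the relevant code lives,
drivers/md plus the block layer:

    # list stable commits between v6.6.3 and v6.6.4 touching md or block
    git log --oneline v6.6.3..v6.6.4 -- drivers/md block
)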
[-- Attachment #2: raid6-crash --]
[-- Type: text/plain, Size: 4134 bytes --]
Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: 0010:update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS: 0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel: <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 [md_mod b6ca17ee4ae6c03e518ad33b70ddd658bdb0c03a]
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 [raid456 ca9a49662bf54a9ebef65a8016b05e6c30248d77]
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 [md_mod b6ca17ee4ae6c03e518ad33b70ddd658bdb0c03a]
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 [md_mod b6ca17ee4ae6c03e518ad33b70ddd658bdb0c03a]
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30
Dec 06 19:20:54 s6 kernel: </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel: snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:10 md raid6 oops in 6.6.4 stable Genes Lists
@ 2023-12-07 13:30 ` Bagas Sanjaya
2023-12-07 13:55 ` Genes Lists
2023-12-07 13:58 ` Thorsten Leemhuis
2023-12-07 16:15 ` Xiao Ni
1 sibling, 2 replies; 10+ messages in thread
From: Bagas Sanjaya @ 2023-12-07 13:30 UTC (permalink / raw)
To: Genes Lists, snitzer, song, yukuai3, axboe, mpatocka, heinzm,
  Linux Kernel Mailing List, Linux RAID, Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> I have not had a chance to git bisect this, but since it happened in
> stable I thought it was important to share sooner rather than later.
>
> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>
> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> Author: Song Liu <song@kernel.org>
> Date: Fri Nov 17 15:56:30 2023 -0800
>
> md: fix bi_status reporting in md_end_clone_io
>
> The attached log shows a page_fault_oops.
> The machine was up for 3 days before the crash happened.
>

Can you confirm that culprit by bisection?

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:30 ` Bagas Sanjaya
@ 2023-12-07 13:55 ` Genes Lists
2023-12-07 14:42 ` Guoqing Jiang
2023-12-07 13:58 ` Thorsten Leemhuis
1 sibling, 1 reply; 10+ messages in thread
From: Genes Lists @ 2023-12-07 13:55 UTC (permalink / raw)
To: Bagas Sanjaya, snitzer, song, yukuai3, axboe, mpatocka, heinzm,
  Linux Kernel Mailing List, Linux RAID, Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

On 12/7/23 08:30, Bagas Sanjaya wrote:
> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>> I have not had a chance to git bisect this, but since it happened in
>> stable I thought it was important to share sooner rather than later.
>>
>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>
>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>> Author: Song Liu <song@kernel.org>
>> Date: Fri Nov 17 15:56:30 2023 -0800
>>
>> md: fix bi_status reporting in md_end_clone_io
>>
>> The attached log shows a page_fault_oops.
>> The machine was up for 3 days before the crash happened.
>>
>
> Can you confirm that culprit by bisection?
>

That's the plan - however, turnaround could be horribly slow if the
average wait time to crash is on the order of a few days between each
bisect step. Also, the machine is currently in use, so I will need to
deal with that as well.

Will do my best. Fingers crossed someone might just spot something in
the meantime.

The commit mentioned above ensures underlying errors are not hidden,
so it may simply have revealed some underlying issue and not be the
actual 'culprit'.

thanks

gene
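
(Aside: the bisect plan described here would run against the stable
tree roughly as below; each step needs a kernel build, a reboot, and a
multi-day soak under the rsync workload, which is what makes the
turnaround slow:

    git bisect start
    git bisect bad v6.6.4
    git bisect good v6.6.3
    # build and boot the commit git suggests, run the workload for a
    # few days, then record the outcome and repeat:
    git bisect good    # or: git bisect bad
)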
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:55 ` Genes Lists
@ 2023-12-07 14:42 ` Guoqing Jiang
2023-12-07 15:58 ` Genes Lists
0 siblings, 1 reply; 10+ messages in thread
From: Guoqing Jiang @ 2023-12-07 14:42 UTC (permalink / raw)
To: Genes Lists, Bagas Sanjaya, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

Hi,

On 12/7/23 21:55, Genes Lists wrote:
> On 12/7/23 08:30, Bagas Sanjaya wrote:
>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>> I have not had a chance to git bisect this, but since it happened
>>> in stable I thought it was important to share sooner rather than
>>> later.
>>>
>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>
>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>> Author: Song Liu <song@kernel.org>
>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>
>>> md: fix bi_status reporting in md_end_clone_io
>>>
>>> The attached log shows a page_fault_oops.
>>> The machine was up for 3 days before the crash happened.

Could you decode the oops ([1])? (I can't find it in lore for some
reason.) And can it be reproduced reliably? If so, please share the
reproduction steps.

[1]. https://lwn.net/Articles/592724/

Thanks,
Guoqing
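
(Aside: the decode being asked for is typically done with the script
shipped in the kernel tree, roughly as below - a sketch, since the
exact arguments vary between kernel versions, and it assumes a vmlinux
and modules directory matching the crashing 6.6.4-stable-1 build:

    ./scripts/decode_stacktrace.sh vmlinux auto /lib/modules/6.6.4-stable-1 < oops.txt
)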
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 14:42 ` Guoqing Jiang
@ 2023-12-07 15:58 ` Genes Lists
2023-12-07 17:37 ` Song Liu
0 siblings, 1 reply; 10+ messages in thread
From: Genes Lists @ 2023-12-07 15:58 UTC (permalink / raw)
To: Guoqing Jiang, Bagas Sanjaya, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 1649 bytes --]

On 12/7/23 09:42, Guoqing Jiang wrote:
> Hi,
>
> On 12/7/23 21:55, Genes Lists wrote:
>> On 12/7/23 08:30, Bagas Sanjaya wrote:
>>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>>> I have not had a chance to git bisect this, but since it happened
>>>> in stable I thought it was important to share sooner rather than
>>>> later.
>>>>
>>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>>
>>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>>> Author: Song Liu <song@kernel.org>
>>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>>
>>>> md: fix bi_status reporting in md_end_clone_io
>>>>
>>>> The attached log shows a page_fault_oops.
>>>> The machine was up for 3 days before the crash happened.
>
> Could you decode the oops ([1])? (I can't find it in lore for some
> reason.) And can it be reproduced reliably? If so, please share the
> reproduction steps.
>
> [1]. https://lwn.net/Articles/592724/
>
> Thanks,
> Guoqing

- reproducing

An rsync runs twice a day. It copies to this server from another. The
copy is from a (large) top-level directory. On the 3rd day after
booting 6.6.4, the second of these rsyncs triggered the oops. I need
to do more testing to see if I can reliably reproduce it. I have not
seen this oops on earlier stable kernels.

- decoding the oops with scripts/decode_stacktrace.sh had errors:

readelf: Error: Not an ELF file - it has the wrong magic bytes at the start

It appears that the decode script doesn't handle compressed modules.
I changed the readelf line to decompress first (a sketch of that tweak
follows the attached decode below). This fixes the above script
complaint and the result is attached.

gene

[-- Attachment #2: raid6-stacktrace --]
[-- Type: text/plain, Size: 5283 bytes --]

Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
All code
========
   0:   1f                      (bad)
   1:   00 0f                   add    %cl,(%rdi)
   3:   1f                      (bad)
   4:   44 00 00                add    %r8b,(%rax)
   7:   48 8b 4f 28             mov    0x28(%rdi),%rcx
   b:   48 39 f1                cmp    %rsi,%rcx
   e:   78 17                   js     0x27
  10:   80 7f 31 00             cmpb   $0x0,0x31(%rdi)
  14:   74 3b                   je     0x51
  16:   48 8b 47 10             mov    0x10(%rdi),%rax
  1a:   48 8b 78 40             mov    0x40(%rax),%rdi
  1e:   48 8b 4f 28             mov    0x28(%rdi),%rcx
  22:   48 39 f1                cmp    %rsi,%rcx
  25:   79 e9                   jns    0x10
  27:   48 89 c8                mov    %rcx,%rax
  2a:*  f0 48 0f b1 77 28       lock cmpxchg %rsi,0x28(%rdi)    <-- trapping instruction
  30:   75 de                   jne    0x10
  32:   48 89 f0                mov    %rsi,%rax
  35:   48 29 c8                sub    %rcx,%rax
  38:   84 d2                   test   %dl,%dl
  3a:   b9                      .byte 0xb9
  3b:   01 00                   add    %eax,(%rax)
        ...

Code starting with the faulting instruction
===========================================
   0:   f0 48 0f b1 77 28       lock cmpxchg %rsi,0x28(%rdi)
   6:   75 de                   jne    0xffffffffffffffe6
   8:   48 89 f0                mov    %rsi,%rax
   b:   48 29 c8                sub    %rcx,%rax
   e:   84 d2                   test   %dl,%dl
  10:   b9                      .byte 0xb9
  11:   01 00                   add    %eax,(%rax)
        ...

Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS: 0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel: <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 md_mod
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 raid456
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 raid456
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 raid456
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 raid456
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 md_mod
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 md_mod
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30
Dec 06 19:20:54 s6 kernel: </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel: snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
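
(Aside: Gene's decode_stacktrace.sh tweak was not posted; one way to do
it is sketched below. This is a hypothetical helper, not the actual
change - the idea is to hand readelf a decompressed copy whenever the
module file is compressed:

    # hypothetical: decompress .ko.zst/.ko.xz/.ko.gz before readelf sees it
    decompress_module() {
        local obj="$1" tmp
        case "$obj" in
        *.zst) tmp=$(mktemp) && zstd -dqc "$obj" > "$tmp" && echo "$tmp" ;;
        *.xz)  tmp=$(mktemp) && xz -dqc  "$obj" > "$tmp" && echo "$tmp" ;;
        *.gz)  tmp=$(mktemp) && gzip -dc "$obj" > "$tmp" && echo "$tmp" ;;
        *)     echo "$obj" ;;
        esac
    }
    # then, wherever the script runs readelf on $objfile:
    objfile=$(decompress_module "$objfile")
)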
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 15:58 ` Genes Lists
@ 2023-12-07 17:37 ` Song Liu
2023-12-07 19:27 ` Genes Lists
0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2023-12-07 17:37 UTC (permalink / raw)
To: Genes Lists
Cc: Guoqing Jiang, Bagas Sanjaya, snitzer, yukuai3, axboe, mpatocka,
  heinzm, Linux Kernel Mailing List, Linux RAID, Linux Regressions,
  Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

On Thu, Dec 7, 2023 at 7:58 AM Genes Lists <lists@sapience.com> wrote:
>
> On 12/7/23 09:42, Guoqing Jiang wrote:
> > Hi,
> >
> > On 12/7/23 21:55, Genes Lists wrote:
> >> On 12/7/23 08:30, Bagas Sanjaya wrote:
> >>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> >>>> I have not had a chance to git bisect this, but since it happened
> >>>> in stable I thought it was important to share sooner rather than
> >>>> later.
> >>>>
> >>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
> >>>>
> >>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> >>>> Author: Song Liu <song@kernel.org>
> >>>> Date: Fri Nov 17 15:56:30 2023 -0800
> >>>>
> >>>> md: fix bi_status reporting in md_end_clone_io
> >>>>
> >>>> The attached log shows a page_fault_oops.
> >>>> The machine was up for 3 days before the crash happened.
> >
> > Could you decode the oops ([1])? (I can't find it in lore for some
> > reason.) And can it be reproduced reliably? If so, please share the
> > reproduction steps.
> >
> > [1]. https://lwn.net/Articles/592724/
> >
> > Thanks,
> > Guoqing
>
> - reproducing
>
> An rsync runs twice a day. It copies to this server from another. The
> copy is from a (large) top-level directory. On the 3rd day after
> booting 6.6.4, the second of these rsyncs triggered the oops. I need
> to do more testing to see if I can reliably reproduce it. I have not
> seen this oops on earlier stable kernels.
>
> - decoding the oops with scripts/decode_stacktrace.sh had errors:
>
> readelf: Error: Not an ELF file - it has the wrong magic bytes at the start
>
> It appears that the decode script doesn't handle compressed modules.
> I changed the readelf line to decompress first. This fixes the above
> script complaint and the result is attached.

I probably missed something, but I really don't think the commit
(2c975b0b8b11f1ffb1ed538609e2c89d8abf800e) could trigger this issue.

From the trace:

kernel: RIP: 0010:update_io_ticks+0x2c/0x60
=>  2a:*  f0 48 0f b1 77 28   lock cmpxchg %rsi,0x28(%rdi)  << trapped here

[...]

kernel: Call Trace:
kernel: <TASK>
kernel: ? __die+0x23/0x70
kernel: ? page_fault_oops+0x171/0x4e0
kernel: ? exc_page_fault+0x175/0x180
kernel: ? asm_exc_page_fault+0x26/0x30
kernel: ? update_io_ticks+0x2c/0x60
kernel: bdev_end_io_acct+0x63/0x160
kernel: md_end_clone_io+0x75/0xa0    <<< change in md_end_clone_io

The commit only changes how we update bi_status. But bi_status was not
used/checked at all between md_end_clone_io and the trap (lock
cmpxchg). Did I miss something?

Given the issue takes very long to reproduce, maybe we had the issue
before 6.6.4?

Thanks,
Song
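
(Aside: for readers without the tree at hand, the change in the commit
Song refers to is roughly the hunk below - quoted from memory of the
6.6-era drivers/md/md.c, so treat it as a sketch and see the commit
itself for the authoritative diff:

    static void md_end_clone_io(struct bio *bio)
    {
        ...
    -   orig_bio->bi_status = bio->bi_status;
    +   /* only propagate a clone's error if the original has none yet */
    +   if (bio->bi_status && !orig_bio->bi_status)
    +       orig_bio->bi_status = bio->bi_status;
        ...
    }

This matches Gene's observation that the commit "ensures underlying
errors are not hidden": it changes only what ends up in
orig_bio->bi_status, not any pointer the trapping code dereferences.)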
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 17:37 ` Song Liu
@ 2023-12-07 19:27 ` Genes Lists
0 siblings, 0 replies; 10+ messages in thread
From: Genes Lists @ 2023-12-07 19:27 UTC (permalink / raw)
To: Song Liu
Cc: Guoqing Jiang, Bagas Sanjaya, snitzer, yukuai3, axboe, mpatocka,
  heinzm, Linux Kernel Mailing List, Linux RAID, Linux Regressions,
  Bhanu Victor DiCara, Xiao Ni, Greg Kroah-Hartman

On 12/7/23 12:37, Song Liu wrote:
...
> kernel: md_end_clone_io+0x75/0xa0    <<< change in md_end_clone_io
>
> The commit only changes how we update bi_status. But bi_status was not
> used/checked at all between md_end_clone_io and the trap (lock
> cmpxchg). Did I miss something?
>
> Given the issue takes very long to reproduce, maybe we had the issue
> before 6.6.4?
>
> Thanks,
> Song

Thanks for clarifying that point.

In the meantime I rebooted the server (shutdown was a struggle) -
finally I fsck'd the filesystem (ext4) sitting on the raid6 - and
manually ran the triggering rsync. This of course completed normally.
That's either good or bad depending on your perspective :)

If I can get it to crash again, I will either start a git bisect (from
6.6.3) or see if 6.7-rc4 shows the same issue.

thanks,

gene
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:30 ` Bagas Sanjaya
2023-12-07 13:55 ` Genes Lists
@ 2023-12-07 13:58 ` Thorsten Leemhuis
2023-12-08 2:05 ` Bagas Sanjaya
1 sibling, 1 reply; 10+ messages in thread
From: Thorsten Leemhuis @ 2023-12-07 13:58 UTC (permalink / raw)
To: Bagas Sanjaya, Genes Lists, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

On 07.12.23 14:30, Bagas Sanjaya wrote:
> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>> I have not had a chance to git bisect this, but since it happened in
>> stable I thought it was important to share sooner rather than later.
>>
>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>
>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>> Author: Song Liu <song@kernel.org>
>> Date: Fri Nov 17 15:56:30 2023 -0800
>>
>> md: fix bi_status reporting in md_end_clone_io
>>
>> The attached log shows a page_fault_oops.
>> The machine was up for 3 days before the crash happened.
>
> Can you confirm that culprit by bisection?

Bagas, I know you are trying to help, but sorry, I'd say this is not
helpful at all -- and maybe even harmful.

From the quoted text it's pretty clear that the reporter knows a
bisection would be helpful, but is currently unable to perform one --
and even states reasons for reporting the issue without having bisected
it. So your message afaics doesn't bring anything new to the table; and
I might be wrong about that, but I fear some people in a situation like
this might even be offended by a reply like that, as it states
something already obvious.

Ciao, Thorsten
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:58 ` Thorsten Leemhuis
@ 2023-12-08 2:05 ` Bagas Sanjaya
0 siblings, 0 replies; 10+ messages in thread
From: Bagas Sanjaya @ 2023-12-08 2:05 UTC (permalink / raw)
To: Thorsten Leemhuis, Genes Lists, snitzer, song, yukuai3, axboe,
  mpatocka, heinzm, Linux Kernel Mailing List, Linux RAID,
  Linux Regressions
Cc: Bhanu Victor DiCara, Xiao Ni, Guoqing Jiang, Greg Kroah-Hartman

On 12/7/23 20:58, Thorsten Leemhuis wrote:
> On 07.12.23 14:30, Bagas Sanjaya wrote:
>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>> I have not had a chance to git bisect this, but since it happened in
>>> stable I thought it was important to share sooner rather than later.
>>>
>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>
>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>> Author: Song Liu <song@kernel.org>
>>> Date: Fri Nov 17 15:56:30 2023 -0800
>>>
>>> md: fix bi_status reporting in md_end_clone_io
>>>
>>> The attached log shows a page_fault_oops.
>>> The machine was up for 3 days before the crash happened.
>>
>> Can you confirm that culprit by bisection?
>
> Bagas, I know you are trying to help, but sorry, I'd say this is not
> helpful at all -- and maybe even harmful.
>
> From the quoted text it's pretty clear that the reporter knows a
> bisection would be helpful, but is currently unable to perform one --
> and even states reasons for reporting the issue without having
> bisected it. So your message afaics doesn't bring anything new to the
> table; and I might be wrong about that, but I fear some people in a
> situation like this might even be offended by a reply like that, as
> it states something already obvious.

Oops, I didn't fully understand the context. Thanks anyway.

-- 
An old man doll... just what I always wanted! - Clara
* Re: md raid6 oops in 6.6.4 stable
2023-12-07 13:10 md raid6 oops in 6.6.4 stable Genes Lists
2023-12-07 13:30 ` Bagas Sanjaya
@ 2023-12-07 16:15 ` Xiao Ni
1 sibling, 0 replies; 10+ messages in thread
From: Xiao Ni @ 2023-12-07 16:15 UTC (permalink / raw)
To: Genes Lists; +Cc: snitzer, song, yukuai3, axboe, mpatocka, heinzm, linux-kernel

On Thu, Dec 7, 2023 at 9:12 PM Genes Lists <lists@sapience.com> wrote:
>
> I have not had a chance to git bisect this, but since it happened in
> stable I thought it was important to share sooner rather than later.
>
> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>
> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> Author: Song Liu <song@kernel.org>
> Date: Fri Nov 17 15:56:30 2023 -0800
>
> md: fix bi_status reporting in md_end_clone_io
>
> The attached log shows a page_fault_oops.
> The machine was up for 3 days before the crash happened.
>
> gene

Hi all

I'm following the register values in the crash dump to try to find
some hints. RDI is ffff8881019312c0, which should be the address of
the struct block_device *part. And CR2 is ffff8881019312e8. So the
panic happens when it wants to update part->bd_stamp.

Hope it's helpful, if the addresses are right.

Best Regards
Xiao
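
(Aside: to make that concrete, update_io_ticks() in 6.6 looks roughly
like the below - paraphrased from block/blk-core.c, so treat it as a
sketch. The trapping lock cmpxchg is the try_cmpxchg() on
part->bd_stamp, and CR2 (ffff8881019312e8) is exactly RDI
(ffff8881019312c0) + 0x28, consistent with bd_stamp sitting at offset
0x28 of struct block_device in this build:

    void update_io_ticks(struct block_device *part, unsigned long now, bool end)
    {
        unsigned long stamp;
    again:
        stamp = READ_ONCE(part->bd_stamp);
        if (unlikely(time_after(now, stamp))) {
            /* write to part + 0x28 - where this oops trapped */
            if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
                __part_stat_add(part, io_ticks, end ? now - stamp : 1);
        }
        if (part->bd_partno) {
            /* account against the whole device as well */
            part = bdev_whole(part);
            goto again;
        }
    }
)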