public inbox for linux-kernel@vger.kernel.org
From: Genes Lists <lists@sapience.com>
To: Guoqing Jiang <guoqing.jiang@linux.dev>,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	snitzer@kernel.org, song@kernel.org, yukuai3@huawei.com,
	axboe@kernel.dk, mpatocka@redhat.com, heinzm@redhat.com,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux RAID <linux-raid@vger.kernel.org>,
	Linux Regressions <regressions@lists.linux.dev>
Cc: Bhanu Victor DiCara <00bvd0+linux@gmail.com>,
	Xiao Ni <xni@redhat.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: md raid6 oops in 6.6.4 stable
Date: Thu, 7 Dec 2023 10:58:04 -0500
Message-ID: <c866bcfa-85cc-44fb-9b54-bb4840f588e6@sapience.com>
In-Reply-To: <714b22c7-b8dd-008d-a1ea-a184dc8ec1cf@linux.dev>

[-- Attachment #1: Type: text/plain, Size: 1649 bytes --]

On 12/7/23 09:42, Guoqing Jiang wrote:
> Hi,
> 
> On 12/7/23 21:55, Genes Lists wrote:
>> On 12/7/23 08:30, Bagas Sanjaya wrote:
>>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
>>>> I have not had a chance to git bisect this, but since it happened in
>>>> stable I thought it was important to share sooner rather than later.
>>>>
>>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
>>>>
>>>>    commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
>>>>    Author: Song Liu <song@kernel.org>
>>>>    Date:   Fri Nov 17 15:56:30 2023 -0800
>>>>
>>>>      md: fix bi_status reporting in md_end_clone_io
>>>>
>>>> log attached shows page_fault_oops.
>>>> Machine was up for 3 days before crash happened.
> 
> Could you decode the oops ([1])? I can't find it in lore for some reason.
> And can it be reproduced reliably? If so, please share the steps to
> reproduce it.
> 
> [1]. https://lwn.net/Articles/592724/
> 
> Thanks,
> Guoqing

   - reproducing
     An rsync job runs twice a day, copying a (large) top-level directory
     from another machine to this server. On the 3rd day after booting
     6.6.4, the second of these rsync runs triggered the oops. I need to do
     more testing to see whether I can reproduce it reliably. I have not
     seen this oops on earlier stable kernels.

   - decoding the oops
     Running scripts/decode_stacktrace.sh initially failed with:

       readelf: Error: Not an ELF file - it has the wrong magic bytes at the start

     It appears that the decode script doesn't handle compressed modules.
     I changed the readelf line to decompress the module first. That fixes
     the error above, and the result is attached.
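
For illustration only (this is not the change that was made to
decode_stacktrace.sh), here is a minimal Python sketch of the "decompress
first" idea. It inspects the leading magic bytes, undoes gzip, xz or zstd
compression, and then checks for the ELF magic that readelf expects. The
helper name to_elf is hypothetical, and the zstd branch assumes the zstd
command-line tool is installed.

```python
import gzip
import lzma
import subprocess

ELF_MAGIC = b"\x7fELF"
GZIP_MAGIC = b"\x1f\x8b"
XZ_MAGIC = b"\xfd7zXZ\x00"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def to_elf(data: bytes) -> bytes:
    """Return ELF bytes for a module image, decompressing it if needed.

    Hypothetical helper: `data` is the content of a .ko, .ko.gz, .ko.xz
    or .ko.zst file.
    """
    if data.startswith(GZIP_MAGIC):
        data = gzip.decompress(data)
    elif data.startswith(XZ_MAGIC):
        data = lzma.decompress(data)
    elif data.startswith(ZSTD_MAGIC):
        # Assumption: the zstd CLI is available.
        # -d decompresses, -c writes the result to stdout.
        result = subprocess.run(
            ["zstd", "-dc"], input=data, check=True, capture_output=True
        )
        data = result.stdout
    if not data.startswith(ELF_MAGIC):
        raise ValueError("not an ELF file: wrong magic bytes at the start")
    return data
```

The returned bytes could be written to a temporary file and handed to
readelf in place of the compressed module.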

gene






[-- Attachment #2: raid6-stacktrace --]
[-- Type: text/plain, Size: 5283 bytes --]

Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: update_io_ticks+0x2c/0x60 
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
All code
========
   0:	1f                   	(bad)
   1:	00 0f                	add    %cl,(%rdi)
   3:	1f                   	(bad)
   4:	44 00 00             	add    %r8b,(%rax)
   7:	48 8b 4f 28          	mov    0x28(%rdi),%rcx
   b:	48 39 f1             	cmp    %rsi,%rcx
   e:	78 17                	js     0x27
  10:	80 7f 31 00          	cmpb   $0x0,0x31(%rdi)
  14:	74 3b                	je     0x51
  16:	48 8b 47 10          	mov    0x10(%rdi),%rax
  1a:	48 8b 78 40          	mov    0x40(%rax),%rdi
  1e:	48 8b 4f 28          	mov    0x28(%rdi),%rcx
  22:	48 39 f1             	cmp    %rsi,%rcx
  25:	79 e9                	jns    0x10
  27:	48 89 c8             	mov    %rcx,%rax
  2a:*	f0 48 0f b1 77 28    	lock cmpxchg %rsi,0x28(%rdi)		<-- trapping instruction
  30:	75 de                	jne    0x10
  32:	48 89 f0             	mov    %rsi,%rax
  35:	48 29 c8             	sub    %rcx,%rax
  38:	84 d2                	test   %dl,%dl
  3a:	b9                   	.byte 0xb9
  3b:	01 00                	add    %eax,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	f0 48 0f b1 77 28    	lock cmpxchg %rsi,0x28(%rdi)
   6:	75 de                	jne    0xffffffffffffffe6
   8:	48 89 f0             	mov    %rsi,%rax
   b:	48 29 c8             	sub    %rcx,%rax
   e:	84 d2                	test   %dl,%dl
  10:	b9                   	.byte 0xb9
  11:	01 00                	add    %eax,(%rax)
	...
Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS:  0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel:  <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70 
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0 
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180 
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30 
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60 
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160 
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 md_mod
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 raid456
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 raid456
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 raid456
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 raid456
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80 
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180 
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 md_mod
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 md_mod
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10 
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120 
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10 
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50 
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10 
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30 
Dec 06 19:20:54 s6 kernel:  </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel:  snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
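
As a reading aid for the trace above, the following Python sketch decodes
three values taken from the log: the page-fault error code, the low flag
bits of the PTE, and the operand address of the trapping instruction. The
bit meanings are the x86-64 architectural ones; the helper names are
hypothetical, and the output only restates what the log shows, not the
root cause.

```python
# Values copied from the oops log.
ERROR_CODE = 0x0003
PTE = 0x8000000101931021
RDI = 0xFFFF8881019312C0
CR2 = 0xFFFF8881019312E8

# x86 page-fault error code bits (lowest five).
PF_BITS = [
    (0, "protection violation (page present)"),
    (1, "write access"),
    (2, "user mode"),
    (3, "reserved bit set"),
    (4, "instruction fetch"),
]


def decode_pf_error(code: int) -> list:
    """List the names of the error-code bits that are set."""
    return [name for bit, name in PF_BITS if (code >> bit) & 1]


def decode_pte(pte: int) -> dict:
    """Extract a few x86-64 page-table-entry flag bits."""
    return {
        "present": bool(pte & (1 << 0)),
        "writable": bool(pte & (1 << 1)),
        "user": bool(pte & (1 << 2)),
        "accessed": bool(pte & (1 << 5)),
        "dirty": bool(pte & (1 << 6)),
        "no_execute": bool(pte & (1 << 63)),
    }


if __name__ == "__main__":
    print(decode_pf_error(ERROR_CODE))
    print(decode_pte(PTE))
    # The trapping instruction is "lock cmpxchg %rsi,0x28(%rdi)".
    print(hex(RDI + 0x28), hex(CR2))
```

With the values from this log, the error code decodes to a write access
with a protection violation on a present page, the PTE is present with its
writable bit clear, and RDI + 0x28 equals the faulting address in CR2.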


