From: Chengming Zhou <chengming.zhou@linux.dev>
To: Friedrich Weber <f.weber@proxmox.com>,
	axboe@kernel.dk, ming.lei@redhat.com, hch@lst.de,
	bvanassche@acm.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	zhouchengming@bytedance.com
Subject: Re: [PATCH v4 4/4] blk-flush: reuse rq queuelist in flush state machine
Date: Mon, 27 May 2024 13:09:20 +0800
Message-ID: <984f1f77-288c-441a-a649-5f320249b576@linux.dev>
In-Reply-To: <14b89dfb-505c-49f7-aebb-01c54451db40@proxmox.com>

Hi,

On 2024/5/25 00:07, Friedrich Weber wrote:
> Hi,
> 
> some of our (Proxmox VE) users have been reporting crashes [1] with NULL
> pointer dereferences in blk_flush_complete_seq. AFAICT, all systems
> crashing in blk_flush_complete_seq used software RAID1, and ran our
> downstream kernel based on (Ubuntu) kernel 6.8.
> 
> On mainline kernel 6.9, I can reproduce the crash [1] by:
> 
> * compiling the kernel with CONFIG_FAIL_MAKE_REQUEST enabled
> * setting up root on software RAID1
> * enabling the write-intent bitmap with `mdadm --grow --bitmap=internal
> /dev/md0`. So far, I have not been able to reproduce the crash with the
> write-intent bitmap disabled.
> * injecting write faults via CONFIG_FAIL_MAKE_REQUEST on one of the
> RAID1 disks
> * generating IO in a KVM guest with fio
> * waiting ~5 minutes
> 
> I've used this reproducer for a bisect, which produced
> 
>  81ada09cc25e (blk-flush: reuse rq queuelist in flush state machine)
> 
> as the first commit with which I can reproduce the crashes. I'm not 100%
> sure it is this one because the reproducer is a bit flaky. But it does
> sound plausible, as the commit is included in our 6.8 kernel, and
> touches `queuelist` which is AFAICT where blk_flush_complete_seq
> dereferences the NULL pointer.

OK, it would be better if I could reproduce it locally. I will try later.

> 
> Does anyone have an idea what could be the cause for the crash, or how
> to further debug this? Happy to provide more information if needed, as
> well as the complete reproducer (I'd need to clean it up a little bit
> first).

The BUG shows a fault at address 0000000000000008, but I'm not sure what is
being accessed there. Does it mean rq->queuelist.next == 0 or something? Could
you use addr2line to show the exact source line that panics? With
blk_flush_complete_seq+0x296/0x2e0 I get block/blk-flush.c:190, which is
"fq->flush_data_in_flight++;", and obviously fq can't be NULL there. (I'm using
the v6.9 kernel.)

Thanks.

> 
> Do you think this could also affect setups without software RAID?
> 
> Best,
> 
> Friedrich
> 
> [1]
> 
> [  132.292488] BUG: kernel NULL pointer dereference, address:
> 0000000000000008
> [  132.293594] #PF: supervisor write access in kernel mode
> [  132.294408] #PF: error_code(0x0002) - not-present page
> [  132.295187] PGD 0 P4D 0
> [  132.295609] Oops: 0002 [#1] PREEMPT SMP NOPTI
> [  132.296285] CPU: 0 PID: 776 Comm: kvm Tainted: G            E
> 6.5.0-rc2-bisect11+ #22
> [  132.297556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [  132.299292] RIP: 0010:blk_flush_complete_seq+0x296/0x2e0
> [  132.300122] Code: 0f b6 f6 49 8d 56 01 49 c1 e6 04 4d 01 ee 48 c1 e2
> 04 49 8b 4e 10 4c 01 ea 48 39 ca 74 2b 48 8b 4b 50 48 8b 7b 48 48 8d 73
> 48 <48> 89 4f 08 48 89 39 49 8b 4e 18 49 89 76 18 48 89 53 48 48 89 4b
> [  132.302946] RSP: 0018:ffffa431407a39b8 EFLAGS: 00010046
> [  132.303765] RAX: 0000000000000000 RBX: ffff92d00c2fd000 RCX:
> ffff92d00c2fd048
> [  132.304856] RDX: ffff92d0036dfaa0 RSI: ffff92d00c2fd048 RDI:
> 0000000000000000
> [  132.305948] RBP: ffffa431407a39f8 R08: 0000000000000000 R09:
> 0000000000000000
> [  132.307043] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000801
> [  132.308120] R13: ffff92d0036dfa80 R14: ffff92d0036dfa90 R15:
> ffff92d00ba09e00
> [  132.309216] FS:  0000727d191b86c0(0000) GS:ffff92d137c00000(0000)
> knlGS:0000000000000000
> [  132.310457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  132.311355] CR2: 0000000000000008 CR3: 000000018f088006 CR4:
> 0000000000372ef0
> [  132.312444] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  132.313535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  132.314633] Call Trace:
> [  132.315016]  <TASK>
> [  132.315381]  ? show_regs+0x6d/0x80
> [  132.315921]  ? __die+0x24/0x80
> [  132.316421]  ? page_fault_oops+0x176/0x500
> [  132.317049]  ? do_user_addr_fault+0x31d/0x6a0
> [  132.317744]  ? exc_page_fault+0x83/0x1b0
> [  132.318368]  ? asm_exc_page_fault+0x27/0x30
> [  132.319008]  ? blk_flush_complete_seq+0x296/0x2e0
> [  132.319759]  ? __blk_mq_alloc_requests+0x383/0x3a0
> [  132.320505]  ? wbt_wait+0xb3/0x100
> [  132.321041]  blk_insert_flush+0xd1/0x220
> [  132.321664]  blk_mq_submit_bio+0x564/0x690
> [  132.322294]  __submit_bio+0xb3/0x1c0
> [  132.322859]  submit_bio_noacct_nocheck+0x2b7/0x390
> [  132.323622]  submit_bio_noacct+0x18a/0x6b0
> [  132.324241]  submit_bio+0x6c/0x80
> [  132.324774]  md_super_write+0xd1/0x120
> [  132.325376]  write_page+0x23b/0x3f0
> [  132.325915]  ? md_bitmap_wait_writes+0xda/0xf0
> [  132.326624]  md_bitmap_unplug+0x9f/0x140
> [  132.327251]  flush_bio_list+0x102/0x110 [raid1]
> [  132.327963]  raid1_unplug+0x3c/0xe0 [raid1]
> [  132.328626]  __blk_flush_plug+0xc1/0x130
> [  132.329238]  blk_finish_plug+0x31/0x50
> [  132.329819]  io_submit_sqes+0x53e/0x680
> [  132.330426]  __do_sys_io_uring_enter+0x59a/0xc10
> [  132.331149]  ? vfs_read+0x20a/0x360
> [  132.331699]  __x64_sys_io_uring_enter+0x22/0x40
> [  132.332414]  do_syscall_64+0x58/0x90
> [  132.332979]  ? ksys_read+0xe6/0x100
> [  132.333530]  ? exit_to_user_mode_prepare+0x49/0x220
> [  132.334292]  ? syscall_exit_to_user_mode+0x1b/0x50
> [  132.335032]  ? do_syscall_64+0x67/0x90
> [  132.335627]  ? syscall_exit_to_user_mode+0x1b/0x50
> [  132.336381]  ? do_syscall_64+0x67/0x90
> [  132.336958]  ? syscall_exit_to_user_mode+0x1b/0x50
> [  132.337707]  ? do_syscall_64+0x67/0x90
> [  132.338296]  ? syscall_exit_to_user_mode+0x1b/0x50
> [  132.339045]  ? do_syscall_64+0x67/0x90
> [  132.339646]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  132.340426] RIP: 0033:0x727d26282b95
> [  132.341011] Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87
> d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f
> 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
> [  132.343837] RSP: 002b:0000727d191b2fd8 EFLAGS: 00000246 ORIG_RAX:
> 00000000000001aa
> [  132.344990] RAX: ffffffffffffffda RBX: 0000727d0c0039d0 RCX:
> 0000727d26282b95
> [  132.346079] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
> 0000000000000033
> [  132.347176] RBP: 0000727d0c0039d8 R08: 0000000000000000 R09:
> 0000000000000008
> [  132.348272] R10: 0000000000000000 R11: 0000000000000246 R12:
> 0000727d0c003ac0
> [  132.349376] R13: 0000000000000000 R14: 00005a5a2313bc68 R15:
> 00005a5a25cf0450
> [  132.350472]  </TASK>
> [  132.350828] Modules linked in: veth(E) cmac(E) nls_utf8(E) cifs(E)
> cifs_arc4(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) cifs_md4(E)
> ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E)
> ip6table_filter(E) ip6_tables(E) iptable_filter(E) nf_tables(E)
> sunrpc(E) softdog(E) binfmt_misc(E) bonding(E) tls(E) nfnetlink_log(E)
> nfnetlink(E) intel_rapl_msr(E) intel_rapl_common(E) intel_pmc_core(E)
> kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) polyval_clmulni(E)
> polyval_generic(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E)
> cryptd(E) rapl(E) pcspkr(E) vmgenid(E) joydev(E) input_leds(E)
> mac_hid(E) serio_raw(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
> efi_pstore(E) dmi_sysfs(E) qemu_fw_cfg(E) ip_tables(E) x_tables(E)
> autofs4(E) hid_generic(E) usbhid(E) hid(E) raid10(E) raid456(E)
> async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E)
> async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid0(E) raid1(E)
> crc32_pclmul(E) bochs(E) drm_vram_helper(E) psmouse(E) drm_ttm_helper(E)
> uhci_hcd(E) ttm(E)
> [  132.350919]  ehci_hcd(E) i2c_piix4(E) pata_acpi(E) floppy(E)
> [  132.365204] CR2: 0000000000000008
> [  132.365747] ---[ end trace 0000000000000000 ]---
> [  132.366468] RIP: 0010:blk_flush_complete_seq+0x296/0x2e0
> [  132.367290] Code: 0f b6 f6 49 8d 56 01 49 c1 e6 04 4d 01 ee 48 c1 e2
> 04 49 8b 4e 10 4c 01 ea 48 39 ca 74 2b 48 8b 4b 50 48 8b 7b 48 48 8d 73
> 48 <48> 89 4f 08 48 89 39 49 8b 4e 18 49 89 76 18 48 89 53 48 48 89 4b
> [  132.370119] RSP: 0018:ffffa431407a39b8 EFLAGS: 00010046
> [  132.370935] RAX: 0000000000000000 RBX: ffff92d00c2fd000 RCX:
> ffff92d00c2fd048
> [  132.372015] RDX: ffff92d0036dfaa0 RSI: ffff92d00c2fd048 RDI:
> 0000000000000000
> [  132.373115] RBP: ffffa431407a39f8 R08: 0000000000000000 R09:
> 0000000000000000
> [  132.374193] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000801
> [  132.375286] R13: ffff92d0036dfa80 R14: ffff92d0036dfa90 R15:
> ffff92d00ba09e00
> [  132.376385] FS:  0000727d191b86c0(0000) GS:ffff92d137c00000(0000)
> knlGS:0000000000000000
> [  132.377600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  132.378486] CR2: 0000000000000008 CR3: 000000018f088006 CR4:
> 0000000000372ef0
> [  132.379564] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  132.380638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  132.381738] note: kvm[776] exited with irqs disabled
> [  132.382631] note: kvm[776] exited with preempt_count 1
> [  132.383469] ------------[ cut here ]------------
> [  132.384193] WARNING: CPU: 0 PID: 776 at kernel/exit.c:818
> do_exit+0x8db/0xae0
> [  132.385307] Modules linked in: veth(E) cmac(E) nls_utf8(E) cifs(E)
> cifs_arc4(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) cifs_md4(E)
> ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E)
> ip6table_filter(E) ip6_tables(E) iptable_filter(E) nf_tables(E)
> sunrpc(E) softdog(E) binfmt_misc(E) bonding(E) tls(E) nfnetlink_log(E)
> nfnetlink(E) intel_rapl_msr(E) intel_rapl_common(E) intel_pmc_core(E)
> kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) polyval_clmulni(E)
> polyval_generic(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E)
> cryptd(E) rapl(E) pcspkr(E) vmgenid(E) joydev(E) input_leds(E)
> mac_hid(E) serio_raw(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
> efi_pstore(E) dmi_sysfs(E) qemu_fw_cfg(E) ip_tables(E) x_tables(E)
> autofs4(E) hid_generic(E) usbhid(E) hid(E) raid10(E) raid456(E)
> async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E)
> async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid0(E) raid1(E)
> crc32_pclmul(E) bochs(E) drm_vram_helper(E) psmouse(E) drm_ttm_helper(E)
> uhci_hcd(E) ttm(E)
> [  132.385396]  ehci_hcd(E) i2c_piix4(E) pata_acpi(E) floppy(E)
> [  132.399854] CPU: 0 PID: 776 Comm: kvm Tainted: G      D     E
> 6.5.0-rc2-bisect11+ #22
> [  132.401165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [  132.402987] RIP: 0010:do_exit+0x8db/0xae0
> [  132.403661] Code: e9 44 f8 ff ff 48 8b bb d8 09 00 00 31 f6 e8 fc d7
> ff ff e9 f0 fd ff ff 4c 89 ee bf 05 06 00 00 e8 4a 2e 01 00 e9 70 f8 ff
> ff <0f> 0b e9 9e f7 ff ff 0f 0b e9 57 f7 ff ff 48 89 df e8 bf 7d 13 00
> [  132.406622] RSP: 0018:ffffa431407a3ec8 EFLAGS: 00010282
> [  132.407487] RAX: 0000000000000000 RBX: ffff92d08d53a940 RCX:
> 0000000000000000
> [  132.408637] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 0000000000000000
> [  132.409791] RBP: ffffa431407a3f20 R08: 0000000000000000 R09:
> 0000000000000000
> [  132.410961] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffff92d0031cba80
> [  132.412102] R13: 0000000000000009 R14: ffff92d090d53180 R15:
> 0000000000000000
> [  132.413256] FS:  0000727d191b86c0(0000) GS:ffff92d137c00000(0000)
> knlGS:0000000000000000
> [  132.414582] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  132.415512] CR2: 0000000000000008 CR3: 000000018f088006 CR4:
> 0000000000372ef0
> [  132.416670] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  132.417842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  132.418998] Call Trace:
> [  132.419430]  <TASK>
> [  132.419788]  ? show_regs+0x6d/0x80
> [  132.420352]  ? __warn+0x89/0x160
> [  132.420891]  ? do_exit+0x8db/0xae0
> [  132.421477]  ? report_bug+0x17e/0x1b0
> [  132.422063]  ? handle_bug+0x46/0x90
> [  132.422655]  ? exc_invalid_op+0x18/0x80
> [  132.423299]  ? asm_exc_invalid_op+0x1b/0x20
> [  132.423975]  ? do_exit+0x8db/0xae0
> [  132.424565]  ? do_exit+0x72/0xae0
> [  132.425131]  ? _printk+0x60/0x90
> [  132.425666]  make_task_dead+0x86/0x180
> [  132.426297]  rewind_stack_and_make_dead+0x17/0x20
> [  132.427056] RIP: 0033:0x727d26282b95
> [  132.427681] Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87
> d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f
> 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
> [  132.430663] RSP: 002b:0000727d191b2fd8 EFLAGS: 00000246 ORIG_RAX:
> 00000000000001aa
> [  132.431895] RAX: ffffffffffffffda RBX: 0000727d0c0039d0 RCX:
> 0000727d26282b95
> [  132.433049] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
> 0000000000000033
> [  132.434203] RBP: 0000727d0c0039d8 R08: 0000000000000000 R09:
> 0000000000000008
> [  132.435372] R10: 0000000000000000 R11: 0000000000000246 R12:
> 0000727d0c003ac0
> [  132.436529] R13: 0000000000000000 R14: 00005a5a2313bc68 R15:
> 00005a5a25cf0450
> [  132.437685]  </TASK>
> [  132.438056] ---[ end trace 0000000000000000 ]---
> 
