From: Chengming Zhou <chengming.zhou@linux.dev>
To: Friedrich Weber <f.weber@proxmox.com>,
axboe@kernel.dk, ming.lei@redhat.com, hch@lst.de,
bvanassche@acm.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
zhouchengming@bytedance.com
Subject: Re: [PATCH] block: fix request.queuelist usage in flush
Date: Wed, 5 Jun 2024 18:30:02 +0800
Message-ID: <1344640f-b22d-4791-aed4-68fc62fb6e36@linux.dev>
In-Reply-To: <c9d03ff7-27c5-4ebd-b3f6-5a90d96f35ba@proxmox.com>

On 2024/6/5 16:45, Friedrich Weber wrote:
> Hi,
>
> On 04/06/2024 08:47, Chengming Zhou wrote:
>> Friedrich Weber reported a kernel crash problem and bisected to commit
>> 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine").
>>
>> The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
>> in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
>> the request was popped from plug->cached_rq in
>> __blk_mq_alloc_requests_batch(). Only this first request's queuelist is
>> left uninitialized; the queuelist of every request popped later is
>> initialized.
>>
>> Fix it by using "list_add_tail(&rq->queuelist, pending)" instead, so
>> rq->queuelist doesn't need to be initialized. This is safe because rq
>> can't be on any list during PREFLUSH or POSTFLUSH, so there is no actual
>> move to perform.
>>
>> Please note that commit 81ada09cc25e ("blk-flush: reuse rq queuelist in
>> flush state machine") also introduced another requirement: no driver may
>> touch rq->queuelist after blk_mq_end_request(), since we reuse it to add
>> rq to the post-flush pending list in POSTFLUSH. If this does not hold,
>> IMHO we will have to revert that commit.
>
> Unfortunately, with this patch applied to kernel 6.9, I get a different
> crash [2] at boot on a Debian 12 (virtual) machine with root on LVM (no
> software RAID involved). See [1] for lsblk and findmnt output. addr2line
> says:
Sorry, which commit is your kernel based on? Is it the mainline v6.9 tag or
some other commit? And is it reproducible with the mainline kernel v6.10-rc2?
>
> # addr2line -f -e /usr/lib/debug/vmlinux-6.9.0-patch0604-nodebuglist+ blk_mq_request_bypass_insert+0x20
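I think you should resolve blk_mq_insert_request+0x120 here instead of
blk_mq_request_bypass_insert+0x20: the latter frame is prefixed with "?" in
the call trace, which means the unwinder could not confirm that it is part
of the actual call chain.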
> blk_mq_request_bypass_insert
> [...]/linux/block/blk-mq.c:2456
>
> No crashes seen so far if the root is on LVM on top of software RAID, or
> if the root partition is directly on disk.
OK, I will look into this ASAP. Thank you for the information!
>
> If I can provide any more information, just let me know.
>
> Thanks!
>
> Best,
>
> Friedrich
>
> [1]
>
> # lsblk -o name,fstype,label --ascii
> NAME                          FSTYPE      LABEL
> sda
> |-sda1                        ext2
> |-sda2
> `-sda5                        LVM2_member
>   |-kernel684--deb--vg-root   ext4
>   `-kernel684--deb--vg-swap_1 swap
> sr0                           iso9660     Debian 12.5.0 amd64 n
> # findmnt --ascii
> TARGET                                                  SOURCE                              FSTYPE    OPTIONS
> /                                                       /dev/mapper/kernel684--deb--vg-root ext4      rw,relatime,errors=remount-ro
> |-/sys                                                  sysfs                               sysfs     rw,nosuid,nodev,noexec,relatime
> | |-/sys/kernel/security                                securityfs                          securityf rw,nosuid,nodev,noexec,relatime
> | |-/sys/fs/cgroup                                      cgroup2                             cgroup2   rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursive
> | |-/sys/fs/pstore                                      pstore                              pstore    rw,nosuid,nodev,noexec,relatime
> | |-/sys/fs/bpf                                         bpf                                 bpf       rw,nosuid,nodev,noexec,relatime,mode=700
> | |-/sys/kernel/debug                                   debugfs                             debugfs   rw,nosuid,nodev,noexec,relatime
> | |-/sys/kernel/tracing                                 tracefs                             tracefs   rw,nosuid,nodev,noexec,relatime
> | |-/sys/fs/fuse/connections                            fusectl                             fusectl   rw,nosuid,nodev,noexec,relatime
> | `-/sys/kernel/config                                  configfs                            configfs  rw,nosuid,nodev,noexec,relatime
> |-/proc                                                 proc                                proc      rw,nosuid,nodev,noexec,relatime
> | `-/proc/sys/fs/binfmt_misc                            systemd-1                           autofs    rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,di
> |-/dev                                                  udev                                devtmpfs  rw,nosuid,relatime,size=4040780k,nr_inodes=1010195,mode=755
> | |-/dev/pts                                            devpts                              devpts    rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000
> | |-/dev/shm                                            tmpfs                               tmpfs     rw,nosuid,nodev,inode64
> | |-/dev/hugepages                                      hugetlbfs                           hugetlbfs rw,relatime,pagesize=2M
> | `-/dev/mqueue                                         mqueue                              mqueue    rw,nosuid,nodev,noexec,relatime
> |-/run                                                  tmpfs                               tmpfs     rw,nosuid,nodev,noexec,relatime,size=813456k,mode=755,inode
> | |-/run/lock                                           tmpfs                               tmpfs     rw,nosuid,nodev,noexec,relatime,size=5120k,inode64
> | |-/run/credentials/systemd-sysctl.service             ramfs                               ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
> | |-/run/credentials/systemd-sysusers.service           ramfs                               ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
> | |-/run/credentials/systemd-tmpfiles-setup-dev.service ramfs                               ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
> | |-/run/user/0                                         tmpfs                               tmpfs     rw,nosuid,nodev,relatime,size=813452k,nr_inodes=203363,mode
> | `-/run/credentials/systemd-tmpfiles-setup.service     ramfs                               ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
> `-/boot                                                 /dev/sda1                           ext2      rw,relatime
>
> [2]
> [ 1.137443] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 1.137951] #PF: supervisor write access in kernel mode
> [ 1.138332] #PF: error_code(0x0002) - not-present page
> [ 1.138695] PGD 0 P4D 0
> [ 1.138697] Oops: 0002 [#1] PREEMPT SMP NOPTI
> [ 1.138702] CPU: 1 PID: 27 Comm: kworker/1:0H Tainted: G E 6.9.0-patch0604-nodebuglist+ #35
> [ 1.138703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 1.138705] Workqueue: kblockd blk_mq_requeue_work
> [ 1.141021] RIP: 0010:_raw_spin_lock+0x13/0x60
> [ 1.141336] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 bc 94 cb 69 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
> [ 1.142670] RSP: 0018:ffffa42a40103d78 EFLAGS: 00010246
> [ 1.143032] RAX: 0000000000000000 RBX: ffff91c4c0357c00 RCX: 00000000ffffffe0
> [ 1.143545] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
> [ 1.144037] RBP: ffffa42a40103d98 R08: 0000000000000000 R09: 0000000000000000
> [ 1.144548] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 1.145036] R13: 0000000000000001 R14: ffff91c5f7cc1d80 R15: ffff91c4c153eb54
> [ 1.145542] FS: 0000000000000000(0000) GS:ffff91c5f7c80000(0000) knlGS:0000000000000000
> [ 1.146092] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.146511] CR2: 0000000000000000 CR3: 000000010e514001 CR4: 0000000000370ef0
> [ 1.147003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.147507] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.147997] Call Trace:
> [ 1.148177] <TASK>
> [ 1.148332] ? show_regs+0x6c/0x80
> [ 1.148603] ? __die+0x24/0x80
> [ 1.148824] ? page_fault_oops+0x175/0x5b0
> [ 1.149111] ? do_user_addr_fault+0x311/0x680
> [ 1.149420] ? exc_page_fault+0x82/0x1b0
> [ 1.149718] ? asm_exc_page_fault+0x27/0x30
> [ 1.150013] ? _raw_spin_lock+0x13/0x60
> [ 1.150282] ? blk_mq_request_bypass_insert+0x20/0xe0
> [ 1.150663] blk_mq_insert_request+0x120/0x1e0
> [ 1.150975] blk_mq_requeue_work+0x18f/0x230
> [ 1.151277] process_one_work+0x19b/0x3f0
> [ 1.151562] worker_thread+0x32a/0x500
> [ 1.151847] ? __pfx_worker_thread+0x10/0x10
> [ 1.152148] kthread+0xe1/0x110
> [ 1.152373] ? __pfx_kthread+0x10/0x10
> [ 1.152640] ret_from_fork+0x44/0x70
> [ 1.152906] ? __pfx_kthread+0x10/0x10
> [ 1.153169] ret_from_fork_asm+0x1a/0x30
> [ 1.153449] </TASK>
> [ 1.153608] Modules linked in: efi_pstore(E) dmi_sysfs(E) qemu_fw_cfg(E) ip_tables(E) x_tables(E) autofs4(E) psmouse(E) bochs(E) uhci_hcd(E) crc32_pclmul(E) drm_vram_helper(E) drm_ttm_helper(E) i2c_piix4(E) ttm(E) ehci_hcd(E) pata_acpi(E) floppy(E)
> [ 1.155135] CR2: 0000000000000000
> [ 1.155370] ---[ end trace 0000000000000000 ]---
> [ 1.155694] RIP: 0010:_raw_spin_lock+0x13/0x60
> [ 1.156024] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 bc 94 cb 69 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
> [ 1.157306] RSP: 0018:ffffa42a40103d78 EFLAGS: 00010246
> [ 1.157669] RAX: 0000000000000000 RBX: ffff91c4c0357c00 RCX: 00000000ffffffe0
> [ 1.158172] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
> [ 1.158682] RBP: ffffa42a40103d98 R08: 0000000000000000 R09: 0000000000000000
> [ 1.159311] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 1.159992] R13: 0000000000000001 R14: ffff91c5f7cc1d80 R15: ffff91c4c153eb54
> [ 1.160575] FS: 0000000000000000(0000) GS:ffff91c5f7c80000(0000) knlGS:0000000000000000
> [ 1.161186] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.161618] CR2: 0000000000000000 CR3: 000000010e514001 CR4: 0000000000370ef0
> [ 1.162158] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.162691] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>