* regression on aarch64? panic on boot
@ 2023-01-16 21:57 Klaus Jensen
2023-01-17 5:58 ` Christoph Hellwig
2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
0 siblings, 2 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-16 21:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel
Hi,
I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
guest in roughly 20% of boots on v6.2-rc4. Example panic below.
I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
probe from the rest path").
I'm not seeing this on any other emulated platforms that I'm currently
testing (x86_64, riscv32/64, mips32/64 and sparc64).
nvme nvme0: 1/0/0 default/read/poll queues
NET: Registered PF_VSOCK protocol family
registered taskstats version 1
nvme nvme0: Ignoring bogus Namespace Identifiers
/dev/root: Can't open blockdev
VFS: Cannot open root device "nvme0n1" or unknown-block(0,0): error -6
Please append a correct "root=" boot option; here are the available partitions:
103:00000 61440 nvme0n1
(driver?)
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc4 #22
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xdc/0xf0
show_stack+0x18/0x30
dump_stack_lvl+0x7c/0xa0
dump_stack+0x18/0x34
panic+0x17c/0x328
mount_block_root+0x184/0x234
mount_root+0x178/0x198
prepare_namespace+0x124/0x164
kernel_init_freeable+0x2a0/0x2c8
kernel_init+0x2c/0x130
ret_from_fork+0x10/0x20
Kernel Offset: disabled
CPU features: 0x00000,01800100,0000420b
Memory Limit: none
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---
* Re: regression on aarch64? panic on boot
2023-01-16 21:57 regression on aarch64? panic on boot Klaus Jensen
@ 2023-01-17 5:58 ` Christoph Hellwig
2023-01-17 6:31 ` Klaus Jensen
2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-01-17 5:58 UTC (permalink / raw)
To: Klaus Jensen
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel

On Mon, Jan 16, 2023 at 10:57:11PM +0100, Klaus Jensen wrote:
> Hi,
>
> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> guest in roughly 20% of boots on v6.2-rc4. Example panic below.

This smells like your setup somehow doesn't wait for async driver
probe. Does the hack below work around it?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b13baccedb4a95..f47e19c701d520 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3508,7 +3508,6 @@ static struct pci_driver nvme_driver = {
 	.remove = nvme_remove,
 	.shutdown = nvme_shutdown,
 	.driver = {
-		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
 #ifdef CONFIG_PM_SLEEP
 		.pm = &nvme_dev_pm_ops,
 #endif
* Re: regression on aarch64? panic on boot
2023-01-17 5:58 ` Christoph Hellwig
@ 2023-01-17 6:31 ` Klaus Jensen
2023-01-17 6:37 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Klaus Jensen @ 2023-01-17 6:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel

On Jan 17 06:58, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 10:57:11PM +0100, Klaus Jensen wrote:
> > Hi,
> >
> > I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> > guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>
> This smells like your setup somehow doesn't wait for async driver
> probe. Does the hack below work around it?
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b13baccedb4a95..f47e19c701d520 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3508,7 +3508,6 @@ static struct pci_driver nvme_driver = {
>  	.remove = nvme_remove,
>  	.shutdown = nvme_shutdown,
>  	.driver = {
> -		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
>  #ifdef CONFIG_PM_SLEEP
>  		.pm = &nvme_dev_pm_ops,
>  #endif

Good morning Christoph,

Yep, the above works.

My setup is a buildroot qemu_aarch64_virt_defconfig booting from an
emulated nvme device:

  qemu-system-aarch64 -M "virt" -cpu "cortex-a53" -m 512M \
    -nodefaults -nographic -snapshot -no-reboot \
    -kernel images/Image \
    -append "root=/dev/nvme0n1 console=ttyAMA0,115200" \
    -drive file=images/rootfs.ext2,format=raw,if=none,id=d0 \
    -device nvme,serial=default,drive=d0 \
    -nic user,model=virtio \
    -serial stdio
* Re: regression on aarch64? panic on boot
2023-01-17 6:31 ` Klaus Jensen
@ 2023-01-17 6:37 ` Christoph Hellwig
2023-01-17 6:39 ` Klaus Jensen
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-01-17 6:37 UTC (permalink / raw)
To: Klaus Jensen
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> Good morning Christoph,
>
> Yep, the above works.

Context for the newly added: This is dropping the newly added
PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
other boot tests) to fail. Any idea what could be going wrong there
probably in userspace?
* Re: regression on aarch64? panic on boot
2023-01-17 6:37 ` Christoph Hellwig
@ 2023-01-17 6:39 ` Klaus Jensen
2023-01-17 12:11 ` Martin Wilck
2023-01-19 16:48 ` Keith Busch
2 siblings, 0 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-17 6:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Jan 17 07:37, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> >
> > Yep, the above works.
>
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail. Any idea what could be going wrong there
> probably in userspace?
>

Adding 'rootwait' to the boot parameters does the trick as well.
* Re: regression on aarch64? panic on boot
2023-01-17 6:37 ` Christoph Hellwig
2023-01-17 6:39 ` Klaus Jensen
@ 2023-01-17 12:11 ` Martin Wilck
2023-01-19 8:29 ` Klaus Jensen
2023-01-19 16:48 ` Keith Busch
2 siblings, 1 reply; 11+ messages in thread
From: Martin Wilck @ 2023-01-17 12:11 UTC (permalink / raw)
To: Christoph Hellwig, Klaus Jensen
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Tue, 2023-01-17 at 07:37 +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> >
> > Yep, the above works.
>
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail. Any idea what could be going wrong there
> probably in userspace?

If this is an aarch64 userspace issue, maybe related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 ?

That bug causes segfaults of user space programs if for some reason the
unwind code is invoked. It happens only if libgcc_s.so is compiled with
gcc 13, and the pauth CPU feature is enabled in qemu.

Martin
* Re: regression on aarch64? panic on boot
2023-01-17 12:11 ` Martin Wilck
@ 2023-01-19 8:29 ` Klaus Jensen
0 siblings, 0 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-19 8:29 UTC (permalink / raw)
To: Martin Wilck
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Jan 17 13:11, Martin Wilck wrote:
> On Tue, 2023-01-17 at 07:37 +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > > Good morning Christoph,
> > >
> > > Yep, the above works.
> >
> > Context for the newly added: This is dropping the newly added
> > PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> > other boot tests) to fail. Any idea what could be going wrong there
> > probably in userspace?
>
> If this is an aarch64 userspace issue, maybe related to
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 ?
>
> That bug causes segfaults of user space programs if for some reason the
> unwind code is invoked. It happens only if libgcc_s.so is compiled with
> gcc 13, and the pauth CPU feature is enabled in qemu.
>
> Martin
>

I just observed the same panic on qemu emulated ppc64 as well. It's
pretty rare, maybe 1 in 20. 'rootwait' or removing the prefer
asynchronous probe fixes it as well.
* Re: regression on aarch64? panic on boot
2023-01-17 6:37 ` Christoph Hellwig
2023-01-17 6:39 ` Klaus Jensen
2023-01-17 12:11 ` Martin Wilck
@ 2023-01-19 16:48 ` Keith Busch
2023-01-24 17:11 ` Ville Syrjälä
2 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2023-01-19 16:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Klaus Jensen, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Tue, Jan 17, 2023 at 07:37:35AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> >
> > Yep, the above works.
>
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail. Any idea what could be going wrong there
> probably in userspace?

Prior to 6.2, the driver would do its own async_schedule, and that
async probe function would flush the first scan work.
wait_for_device_probe() was then forced to wait for the scan_work to
complete, which brings up the root device.

We're not flushing the scan_work anymore from our probe, so this should
fix it for 6.2:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b294b41a149a7..ff97426749976 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3046,6 +3046,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	nvme_start_ctrl(&dev->ctrl);
 	nvme_put_ctrl(&dev->ctrl);
+	flush_work(&dev->ctrl.scan_work);
 	return 0;
 
 out_disable:
--
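The ordering problem described in the message above can be sketched in user space. What follows is a minimal illustration under stated assumptions, not the kernel code: `struct toy_ctrl`, `toy_scan_work()`, `toy_flush_scan()`, `toy_probe()` and `toy_root_visible()` are hypothetical names, a pthread stands in for the scan work item, and `pthread_join()` plays the role that `flush_work()` plays in the patch. The 50 ms delay is an arbitrary stand-in for the namespace scan.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the controller; not the kernel's struct nvme_ctrl. */
struct toy_ctrl {
	pthread_t scan_thread;       /* plays the role of ctrl.scan_work */
	bool scan_joined;            /* true once the scan thread has been joined */
	atomic_bool namespace_ready; /* set when the "root device" exists */
};

/* Simulated namespace scan: takes a while, then publishes the device. */
static void *toy_scan_work(void *arg)
{
	struct toy_ctrl *ctrl = arg;

	usleep(50 * 1000); /* pretend the scan takes about 50 ms */
	atomic_store(&ctrl->namespace_ready, true);
	return NULL;
}

/* Wait for the scan to finish; the analogue of flush_work() in this toy. */
static void toy_flush_scan(struct toy_ctrl *ctrl)
{
	if (!ctrl->scan_joined) {
		int rc = pthread_join(ctrl->scan_thread, NULL);

		assert(rc == 0);
		(void)rc; /* keep the compiler quiet when NDEBUG is defined */
		ctrl->scan_joined = true;
	}
}

/*
 * Simulated probe. With flush_scan set, probe does not return until the
 * namespace is visible, so anything that waits for probe also waits for
 * the root device. Returns 0 on success, -1 if the thread cannot start.
 */
static int toy_probe(struct toy_ctrl *ctrl, bool flush_scan)
{
	atomic_init(&ctrl->namespace_ready, false);
	ctrl->scan_joined = false;

	if (pthread_create(&ctrl->scan_thread, NULL, toy_scan_work, ctrl) != 0) {
		ctrl->scan_joined = true; /* nothing to join later */
		return -1;
	}

	if (flush_scan)
		toy_flush_scan(ctrl);
	return 0;
}

/* What a "mount root" step would check right after probing completes. */
static bool toy_root_visible(struct toy_ctrl *ctrl)
{
	return atomic_load(&ctrl->namespace_ready);
}
```

After `toy_probe(&ctrl, true)` returns 0, `toy_root_visible(&ctrl)` is true. After `toy_probe(&ctrl, false)` the result depends on timing and will usually be false if checked immediately, which is the toy counterpart of the intermittent "Cannot open root device" panic; call `toy_flush_scan(&ctrl)` before the structure goes out of scope in that case.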
* Re: regression on aarch64? panic on boot
2023-01-19 16:48 ` Keith Busch
@ 2023-01-24 17:11 ` Ville Syrjälä
0 siblings, 0 replies; 11+ messages in thread
From: Ville Syrjälä @ 2023-01-24 17:11 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Klaus Jensen, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Thu, Jan 19, 2023 at 09:48:56AM -0700, Keith Busch wrote:
> On Tue, Jan 17, 2023 at 07:37:35AM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > > Good morning Christoph,
> > >
> > > Yep, the above works.
> >
> > Context for the newly added: This is dropping the newly added
> > PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> > other boot tests) to fail. Any idea what could be going wrong there
> > probably in userspace?
>
> Prior to 6.2, the driver would do its own async_schedule, and that
> async probe function would flush the first scan work.
> wait_for_device_probe() was then forced to wait for the scan_work to
> complete, which brings up the root device.
>
> We're not flushing the scan_work anymore from our probe, so this should
> fix it for 6.2:

Appears to fix my Tigerlake Thinkpad T14 gen2.

Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b294b41a149a7..ff97426749976 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3046,6 +3046,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>
>  	nvme_start_ctrl(&dev->ctrl);
>  	nvme_put_ctrl(&dev->ctrl);
> +	flush_work(&dev->ctrl.scan_work);
>  	return 0;
>
>  out_disable:
> --
>

--
Ville Syrjälä
Intel
* Re: regression on aarch64? panic on boot
2023-01-16 21:57 regression on aarch64? panic on boot Klaus Jensen
2023-01-17 5:58 ` Christoph Hellwig
@ 2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
2023-01-27 11:11 ` Linux kernel regression tracking (#update)
1 sibling, 1 reply; 11+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-19 13:10 UTC (permalink / raw)
To: Klaus Jensen, Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki, Linux kernel regressions list

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 16.01.23 22:57, Klaus Jensen wrote:
>
> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>
> I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
> probe from the rest path").
>
> I'm not seeing this on any other emulated platforms that I'm currently
> testing (x86_64, riscv32/64, mips32/64 and sparc64).
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced eac3ef262941
#regzbot title nvme: occasional boot problems due to the newly supported
async driver probe
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
* Re: regression on aarch64? panic on boot
2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
@ 2023-01-27 11:11 ` Linux kernel regression tracking (#update)
0 siblings, 0 replies; 11+ messages in thread
From: Linux kernel regression tracking (#update) @ 2023-01-27 11:11 UTC (permalink / raw)
To: Klaus Jensen, Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki, Linux kernel regressions list

[TLDR: there afaics is a fix for the regression discussed in this
thread, but its author did not use a Link: tag to point to the report,
as wanted by Linus and explained in the documentation; this forces me
to write this mail, whose sole purpose is to update the state of this
tracked Linux kernel regression.]

On 19.01.23 14:10, Linux kernel regression tracking (#adding) wrote:
> On 16.01.23 22:57, Klaus Jensen wrote:
>>
>> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
>> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>>
>> I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
>> probe from the rest path").
>>
>> I'm not seeing this on any other emulated platforms that I'm currently
>> testing (x86_64, riscv32/64, mips32/64 and sparc64).
>> [...]
>
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
>
> #regzbot ^introduced eac3ef262941
> #regzbot title nvme: occasional boot problems due to the newly supported
> async driver probe
> #regzbot ignore-activity

#regzbot monitor:
https://lore.kernel.org/all/20230124171738.2311160-1-kbusch@meta.com/
#regzbot fix: nvme-pci: flush initial scan_work for async probe
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.