* regression on aarch64? panic on boot
@ 2023-01-16 21:57 Klaus Jensen
2023-01-17 5:58 ` Christoph Hellwig
2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
0 siblings, 2 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-16 21:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel
Hi,
I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
guest in roughly 20% of boots on v6.2-rc4. Example panic below.
I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
probe from the rest path").
I'm not seeing this on any other emulated platforms that I'm currently
testing (x86_64, riscv32/64, mips32/64 and sparc64).
nvme nvme0: 1/0/0 default/read/poll queues
NET: Registered PF_VSOCK protocol family
registered taskstats version 1
nvme nvme0: Ignoring bogus Namespace Identifiers
/dev/root: Can't open blockdev
VFS: Cannot open root device "nvme0n1" or unknown-block(0,0): error -6
Please append a correct "root=" boot option; here are the available partitions:
103:00000 61440 nvme0n1
(driver?)
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc4 #22
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xdc/0xf0
show_stack+0x18/0x30
dump_stack_lvl+0x7c/0xa0
dump_stack+0x18/0x34
panic+0x17c/0x328
mount_block_root+0x184/0x234
mount_root+0x178/0x198
prepare_namespace+0x124/0x164
kernel_init_freeable+0x2a0/0x2c8
kernel_init+0x2c/0x130
ret_from_fork+0x10/0x20
Kernel Offset: disabled
CPU features: 0x00000,01800100,0000420b
Memory Limit: none
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---
* Re: regression on aarch64? panic on boot
2023-01-16 21:57 regression on aarch64? panic on boot Klaus Jensen
@ 2023-01-17 5:58 ` Christoph Hellwig
2023-01-17 6:31 ` Klaus Jensen
2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-01-17 5:58 UTC (permalink / raw)
To: Klaus Jensen
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
linux-nvme, linux-kernel
On Mon, Jan 16, 2023 at 10:57:11PM +0100, Klaus Jensen wrote:
> Hi,
>
> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
This smells like your setup somehow doesn't wait for async driver
probe. Does the hack below work around it?
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b13baccedb4a95..f47e19c701d520 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3508,7 +3508,6 @@ static struct pci_driver nvme_driver = {
.remove = nvme_remove,
.shutdown = nvme_shutdown,
.driver = {
- .probe_type = PROBE_PREFER_ASYNCHRONOUS,
#ifdef CONFIG_PM_SLEEP
.pm = &nvme_dev_pm_ops,
#endif
* Re: regression on aarch64? panic on boot
2023-01-17 5:58 ` Christoph Hellwig
@ 2023-01-17 6:31 ` Klaus Jensen
2023-01-17 6:37 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Klaus Jensen @ 2023-01-17 6:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel
On Jan 17 06:58, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 10:57:11PM +0100, Klaus Jensen wrote:
> > Hi,
> >
> > I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> > guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>
> This smells like your setup somehow doesn't wait for async driver
> probe. Does the hack below work around it?
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b13baccedb4a95..f47e19c701d520 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3508,7 +3508,6 @@ static struct pci_driver nvme_driver = {
> .remove = nvme_remove,
> .shutdown = nvme_shutdown,
> .driver = {
> - .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> #ifdef CONFIG_PM_SLEEP
> .pm = &nvme_dev_pm_ops,
> #endif
Good morning Christoph,
Yep, the above works.
My setup is a buildroot qemu_aarch64_virt_defconfig booting from an
emulated nvme device:
qemu-system-aarch64 -M "virt" -cpu "cortex-a53" -m 512M \
-nodefaults -nographic -snapshot -no-reboot \
-kernel images/Image \
-append "root=/dev/nvme0n1 console=ttyAMA0,115200" \
-drive file=images/rootfs.ext2,format=raw,if=none,id=d0 \
-device nvme,serial=default,drive=d0 \
-nic user,model=virtio \
-serial stdio
* Re: regression on aarch64? panic on boot
2023-01-17 6:31 ` Klaus Jensen
@ 2023-01-17 6:37 ` Christoph Hellwig
2023-01-17 6:39 ` Klaus Jensen
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-01-17 6:37 UTC (permalink / raw)
To: Klaus Jensen
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki
On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> Good morning Christoph,
>
> Yep, the above works.
Context for the newly added recipients: the hack drops the newly added
PROBE_PREFER_ASYNCHRONOUS from nvme. With the flag set, Klaus' arm64 boot
test (but none of his other boot tests) fails. Any idea what could be
going wrong there, probably in userspace?
* Re: regression on aarch64? panic on boot
2023-01-17 6:37 ` Christoph Hellwig
@ 2023-01-17 6:39 ` Klaus Jensen
2023-01-17 12:11 ` Martin Wilck
2023-01-19 16:48 ` Keith Busch
2 siblings, 0 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-17 6:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
Greg Kroah-Hartman, Rafael J. Wysocki
On Jan 17 07:37, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> >
> > Yep, the above works.
>
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail. Any idea what could be going wrong there
> probably in userspace?
>
Adding 'rootwait' to the boot parameters does the trick as well.
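For readers unfamiliar with the option: 'rootwait' makes the kernel keep
waiting for the root device to show up instead of giving up when it is not
there yet, which is why it hides the race. The following is only a
userspace sketch of that polling idea, not the kernel's implementation.
wait_for_path(), its retry limit and its delay are hypothetical names and
parameters chosen for illustration (the real option waits indefinitely).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: poll until 'path' exists or 'max_tries' attempts
 * have been made, sleeping 'delay_ms' milliseconds between attempts.
 * Returns true if the path showed up, false if we gave up.
 */
static bool wait_for_path(const char *path, int max_tries, long delay_ms)
{
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	struct stat st;

	for (int i = 0; i < max_tries; i++) {
		if (stat(path, &st) == 0)
			return true;	/* the device node is there now */
		nanosleep(&delay, NULL);
	}
	return false;			/* never appeared within the budget */
}
```

For example, wait_for_path("/dev/nvme0n1", 200, 5) would poll every 5 ms
for roughly one second before giving up.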
* Re: regression on aarch64? panic on boot
2023-01-17 6:37 ` Christoph Hellwig
2023-01-17 6:39 ` Klaus Jensen
@ 2023-01-17 12:11 ` Martin Wilck
2023-01-19 8:29 ` Klaus Jensen
2023-01-19 16:48 ` Keith Busch
2 siblings, 1 reply; 11+ messages in thread
From: Martin Wilck @ 2023-01-17 12:11 UTC (permalink / raw)
To: Christoph Hellwig, Klaus Jensen
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
Greg Kroah-Hartman, Rafael J. Wysocki
On Tue, 2023-01-17 at 07:37 +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> >
> > Yep, the above works.
>
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail. Any idea what could be going wrong there
> probably in userspace?
If this is an aarch64 userspace issue, maybe related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 ?
That bug causes segfaults of user space programs if for some reason the
unwind code is invoked. It happens only if libgcc_s.so is compiled with
gcc 13, and the pauth CPU feature is enabled in qemu.
Martin
* Re: regression on aarch64? panic on boot
2023-01-17 12:11 ` Martin Wilck
@ 2023-01-19 8:29 ` Klaus Jensen
0 siblings, 0 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-19 8:29 UTC (permalink / raw)
To: Martin Wilck
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki
On Jan 17 13:11, Martin Wilck wrote:
> On Tue, 2023-01-17 at 07:37 +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > > Good morning Christoph,
> > >
> > > Yep, the above works.
> >
> > Context for the newly added: This is dropping the newly added
> > PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> > other boot tests) to fail. Any idea what could be going wrong there
> > probably in userspace?
>
> If this is an aarch64 userspace issue, maybe related to
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 ?
>
> That bug causes segfaults of user space programs if for some reason the
> unwind code is invoked. It happens only if libgcc_s.so is compiled with
> gcc 13, and the pauth CPU feature is enabled in qemu.
>
> Martin
>
I just observed the same panic on QEMU-emulated ppc64 as well. It's
pretty rare, maybe 1 in 20 boots. Adding 'rootwait' or removing
PROBE_PREFER_ASYNCHRONOUS fixes it there as well.
* Re: regression on aarch64? panic on boot
2023-01-16 21:57 regression on aarch64? panic on boot Klaus Jensen
2023-01-17 5:58 ` Christoph Hellwig
@ 2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
2023-01-27 11:11 ` Linux kernel regression tracking (#update)
1 sibling, 1 reply; 11+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-19 13:10 UTC (permalink / raw)
To: Klaus Jensen, Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
Greg Kroah-Hartman, Rafael J. Wysocki,
Linux kernel regressions list
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]
On 16.01.23 22:57, Klaus Jensen wrote:
>
> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>
> I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
> probe from the rest path").
>
> I'm not seeing this on any other emulated platforms that I'm currently
> testing (x86_64, riscv32/64, mips32/64 and sparc64).
> [...]
Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:
#regzbot ^introduced eac3ef262941
#regzbot title nvme: occasional boot problems due to the newly supported
async driver probe
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.
Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
* Re: regression on aarch64? panic on boot
2023-01-17 6:37 ` Christoph Hellwig
2023-01-17 6:39 ` Klaus Jensen
2023-01-17 12:11 ` Martin Wilck
@ 2023-01-19 16:48 ` Keith Busch
2023-01-24 17:11 ` Ville Syrjälä
2 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2023-01-19 16:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Klaus Jensen, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
Greg Kroah-Hartman, Rafael J. Wysocki
On Tue, Jan 17, 2023 at 07:37:35AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> >
> > Yep, the above works.
>
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail. Any idea what could be going wrong there
> probably in userspace?
Prior to 6.2, the driver would do its own async_schedule, and that
async probe function would flush the first scan work.
wait_for_device_probe() was then forced to wait for the scan_work to
complete, which brings up the root device.
We're not flushing the scan_work anymore from our probe, so this should
fix it for 6.2:
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b294b41a149a7..ff97426749976 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3046,6 +3046,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
nvme_start_ctrl(&dev->ctrl);
nvme_put_ctrl(&dev->ctrl);
+ flush_work(&dev->ctrl.scan_work);
return 0;
out_disable:
--
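To make the ordering problem concrete, here is a small single-threaded toy
model in plain C. It is a sketch under stated assumptions, not driver code:
every toy_* name is hypothetical, and the unlucky case where the root mount
runs before the namespace scan is modelled deterministically by simply
leaving the queued work pending.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a work item: a callback that may still be pending. */
struct toy_work {
	void (*fn)(void *data);
	void *data;
	bool pending;
};

/* Toy stand-in for the controller; only what the sketch needs. */
struct toy_ctrl {
	struct toy_work scan_work;
	bool namespace_registered;	/* models "nvme0n1 exists" */
};

/* Queue the work; in this model it does not run until it is flushed. */
static void toy_queue_work(struct toy_work *w)
{
	w->pending = true;
}

/*
 * Run the work now if it is still pending. Returns true if there was
 * something to wait for, false if the work was already idle.
 */
static bool toy_flush_work(struct toy_work *w)
{
	if (!w->pending)
		return false;
	w->pending = false;
	w->fn(w->data);
	return true;
}

/* The scan is what makes the namespace (the root disk) visible. */
static void toy_scan(void *data)
{
	struct toy_ctrl *ctrl = data;

	ctrl->namespace_registered = true;
}

/* Probe queues the scan; with flush_scan set it also waits for it. */
static int toy_probe(struct toy_ctrl *ctrl, bool flush_scan)
{
	ctrl->namespace_registered = false;
	ctrl->scan_work.fn = toy_scan;
	ctrl->scan_work.data = ctrl;
	toy_queue_work(&ctrl->scan_work);
	if (flush_scan)
		toy_flush_work(&ctrl->scan_work);
	return 0;
}

/* Models the root mount attempt that happens once probe has returned. */
static bool toy_root_mountable(const struct toy_ctrl *ctrl)
{
	return ctrl->namespace_registered;
}
```

In this model, toy_probe(&ctrl, false) returns with the scan still pending,
so toy_root_mountable() is false, which corresponds to the "Cannot open
root device" panic in the report. With toy_probe(&ctrl, true) the scan has
completed by the time probe returns, so anything that waits for probe to
finish also waits for the root disk.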
* Re: regression on aarch64? panic on boot
2023-01-19 16:48 ` Keith Busch
@ 2023-01-24 17:11 ` Ville Syrjälä
0 siblings, 0 replies; 11+ messages in thread
From: Ville Syrjälä @ 2023-01-24 17:11 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Klaus Jensen, Jens Axboe, Sagi Grimberg,
linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki
On Thu, Jan 19, 2023 at 09:48:56AM -0700, Keith Busch wrote:
> On Tue, Jan 17, 2023 at 07:37:35AM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > > Good morning Christoph,
> > >
> > > Yep, the above works.
> >
> > Context for the newly added: This is dropping the newly added
> > PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> > other boot tests) to fail. Any idea what could be going wrong there
> > probably in userspace?
>
> Prior to 6.2, the driver would do its own async_schedule, and that
> async probe function would flush the first scan work.
> wait_for_device_probe() was then forced to wait for the scan_work to
> complete, which brings up the root device.
>
> We're not flushing the scan_work anymore from our probe, so this should
> fix it for 6.2:
Appears to fix my Tigerlake Thinkpad T14 gen2.
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b294b41a149a7..ff97426749976 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3046,6 +3046,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>
> nvme_start_ctrl(&dev->ctrl);
> nvme_put_ctrl(&dev->ctrl);
> + flush_work(&dev->ctrl.scan_work);
> return 0;
>
> out_disable:
> --
>
--
Ville Syrjälä
Intel
* Re: regression on aarch64? panic on boot
2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
@ 2023-01-27 11:11 ` Linux kernel regression tracking (#update)
0 siblings, 0 replies; 11+ messages in thread
From: Linux kernel regression tracking (#update) @ 2023-01-27 11:11 UTC (permalink / raw)
To: Klaus Jensen, Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
Greg Kroah-Hartman, Rafael J. Wysocki,
Linux kernel regressions list
[TLDR: there afaics is a fix for the regression discussed in this
thread, but its author did not use a Link: tag to point to the report,
as wanted by Linus and explained in the documentation; this forces me to
write this mail, whose sole purpose is to update the state of this
tracked Linux kernel regression.]
On 19.01.23 14:10, Linux kernel regression tracking (#adding) wrote:
> On 16.01.23 22:57, Klaus Jensen wrote:
>>
>> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
>> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>>
>> I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
>> probe from the rest path").
>>
>> I'm not seeing this on any other emulated platforms that I'm currently
>> testing (x86_64, riscv32/64, mips32/64 and sparc64).
>> [...]
>
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
>
> #regzbot ^introduced eac3ef262941
> #regzbot title nvme: occasional boot problems due to the newly supported
> async driver probe
> #regzbot ignore-activity
#regzbot monitor:
https://lore.kernel.org/all/20230124171738.2311160-1-kbusch@meta.com/
#regzbot fix: nvme-pci: flush initial scan_work for async probe
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.