public inbox for linux-nvme@lists.infradead.org
* regression on aarch64? panic on boot
@ 2023-01-16 21:57 Klaus Jensen
  2023-01-17  5:58 ` Christoph Hellwig
  2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
  0 siblings, 2 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-16 21:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel

Hi,

I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
guest in roughly 20% of boots on v6.2-rc4. Example panic below.

I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
probe from the rest path").

I'm not seeing this on any other emulated platforms that I'm currently
testing (x86_64, riscv32/64, mips32/64 and sparc64).


nvme nvme0: 1/0/0 default/read/poll queues
NET: Registered PF_VSOCK protocol family
registered taskstats version 1
nvme nvme0: Ignoring bogus Namespace Identifiers
/dev/root: Can't open blockdev
VFS: Cannot open root device "nvme0n1" or unknown-block(0,0): error -6
Please append a correct "root=" boot option; here are the available partitions:
103:00000      61440 nvme0n1
 (driver?)
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc4 #22
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xdc/0xf0
 show_stack+0x18/0x30
 dump_stack_lvl+0x7c/0xa0
 dump_stack+0x18/0x34
 panic+0x17c/0x328
 mount_block_root+0x184/0x234
 mount_root+0x178/0x198
 prepare_namespace+0x124/0x164
 kernel_init_freeable+0x2a0/0x2c8
 kernel_init+0x2c/0x130
 ret_from_fork+0x10/0x20
Kernel Offset: disabled
CPU features: 0x00000,01800100,0000420b
Memory Limit: none
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---


* Re: regression on aarch64? panic on boot
  2023-01-16 21:57 regression on aarch64? panic on boot Klaus Jensen
@ 2023-01-17  5:58 ` Christoph Hellwig
  2023-01-17  6:31   ` Klaus Jensen
  2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-01-17  5:58 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
	linux-nvme, linux-kernel

On Mon, Jan 16, 2023 at 10:57:11PM +0100, Klaus Jensen wrote:
> Hi,
> 
> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> guest in roughly 20% of boots on v6.2-rc4. Example panic below.

This smells like your setup somehow doesn't wait for async driver
probe.  Does the hack below work around it?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b13baccedb4a95..f47e19c701d520 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3508,7 +3508,6 @@ static struct pci_driver nvme_driver = {
 	.remove		= nvme_remove,
 	.shutdown	= nvme_shutdown,
 	.driver		= {
-		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
 #ifdef CONFIG_PM_SLEEP
 		.pm		= &nvme_dev_pm_ops,
 #endif



* Re: regression on aarch64? panic on boot
  2023-01-17  5:58 ` Christoph Hellwig
@ 2023-01-17  6:31   ` Klaus Jensen
  2023-01-17  6:37     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Klaus Jensen @ 2023-01-17  6:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel

On Jan 17 06:58, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 10:57:11PM +0100, Klaus Jensen wrote:
> > Hi,
> > 
> > I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> > guest in roughly 20% of boots on v6.2-rc4. Example panic below.
> 
> This smells like your setup somehow doesn't wait for async driver
> probe.  Does the hack below work around it?
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b13baccedb4a95..f47e19c701d520 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3508,7 +3508,6 @@ static struct pci_driver nvme_driver = {
>  	.remove		= nvme_remove,
>  	.shutdown	= nvme_shutdown,
>  	.driver		= {
> -		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
>  #ifdef CONFIG_PM_SLEEP
>  		.pm		= &nvme_dev_pm_ops,
>  #endif

Good morning Christoph,

Yep, the above works.

My setup is a buildroot qemu_aarch64_virt_defconfig booting from an
emulated nvme device:

  qemu-system-aarch64 -M "virt" -cpu "cortex-a53" -m 512M \
    -nodefaults -nographic -snapshot -no-reboot \
    -kernel images/Image \
    -append "root=/dev/nvme0n1 console=ttyAMA0,115200" \
    -drive file=images/rootfs.ext2,format=raw,if=none,id=d0 \
    -device nvme,serial=default,drive=d0 \
    -nic user,model=virtio \
    -serial stdio


* Re: regression on aarch64? panic on boot
  2023-01-17  6:31   ` Klaus Jensen
@ 2023-01-17  6:37     ` Christoph Hellwig
  2023-01-17  6:39       ` Klaus Jensen
                         ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-01-17  6:37 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
	linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> Good morning Christoph,
> 
> Yep, the above works.

Context for those newly added to Cc: the workaround Klaus tested drops
the newly added PROBE_PREFER_ASYNCHRONOUS from nvme.  With that flag set,
Klaus' arm64 boot test (but not his other boot tests) fails.  Any idea
what could be going wrong there, probably in userspace?




* Re: regression on aarch64? panic on boot
  2023-01-17  6:37     ` Christoph Hellwig
@ 2023-01-17  6:39       ` Klaus Jensen
  2023-01-17 12:11       ` Martin Wilck
  2023-01-19 16:48       ` Keith Busch
  2 siblings, 0 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-17  6:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
	Greg Kroah-Hartman, Rafael J. Wysocki

On Jan 17 07:37, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> > 
> > Yep, the above works.
> 
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail.  Any idea what could be going wrong there
> probably in userspace?
> 

Adding 'rootwait' to the boot parameters does the trick as well.
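
For readers unfamiliar with it: 'rootwait' tells the kernel to wait for
the root device to show up instead of giving up when it is not there
yet, which is why it hides the race. A rough userspace analogy is
sketched below. wait_for_path() is a hypothetical helper, not a kernel
function, and unlike the kernel option it gives up after a timeout so
that it is easy to try out.

```c
/* Userspace analogy of 'rootwait'; all names here are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Poll until `path` exists or roughly `timeout_ms` milliseconds have
 * passed. Returns true if the path showed up in time.
 */
bool wait_for_path(const char *path, int timeout_ms)
{
	const struct timespec step = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	struct stat st;
	int waited_ms = 0;

	assert(path != NULL);

	for (;;) {
		if (stat(path, &st) == 0)
			return true;		/* the "device" has appeared */
		if (waited_ms >= timeout_ms)
			return false;		/* gave up waiting */
		nanosleep(&step, NULL);		/* sleep 10 ms, then retry */
		waited_ms += 10;
	}
}
```

A caller would use something like wait_for_path("/dev/nvme0n1", 5000)
before trying to mount the device; the 5000 ms budget is an arbitrary
value picked for illustration.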


* Re: regression on aarch64? panic on boot
  2023-01-17  6:37     ` Christoph Hellwig
  2023-01-17  6:39       ` Klaus Jensen
@ 2023-01-17 12:11       ` Martin Wilck
  2023-01-19  8:29         ` Klaus Jensen
  2023-01-19 16:48       ` Keith Busch
  2 siblings, 1 reply; 11+ messages in thread
From: Martin Wilck @ 2023-01-17 12:11 UTC (permalink / raw)
  To: Christoph Hellwig, Klaus Jensen
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
	Greg Kroah-Hartman, Rafael J. Wysocki

On Tue, 2023-01-17 at 07:37 +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> > 
> > Yep, the above works.
> 
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail.  Any idea what could be going wrong there
> probably in userspace?

If this is an aarch64 userspace issue, maybe related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 ?

That bug causes segfaults of user space programs if for some reason the
unwind code is invoked. It happens only if libgcc_s.so is compiled with
gcc 13, and the pauth CPU feature is enabled in qemu.

Martin





* Re: regression on aarch64? panic on boot
  2023-01-17 12:11       ` Martin Wilck
@ 2023-01-19  8:29         ` Klaus Jensen
  0 siblings, 0 replies; 11+ messages in thread
From: Klaus Jensen @ 2023-01-19  8:29 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Sagi Grimberg,
	linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Jan 17 13:11, Martin Wilck wrote:
> On Tue, 2023-01-17 at 07:37 +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > > Good morning Christoph,
> > > 
> > > Yep, the above works.
> > 
> > Context for the newly added: This is dropping the newly added
> > PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> > other boot tests) to fail.  Any idea what could be going wrong there
> > probably in userspace?
> 
> If this is an aarch64 userspace issue, maybe related to
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 ?
> 
> That bug causes segfaults of user space programs if for some reason the
> unwind code is invoked. It happens only if libgcc_s.so is compiled with
> gcc 13, and the pauth CPU feature is enabled in qemu.
> 
> Martin
> 

I just observed the same panic on QEMU-emulated ppc64 as well. It's
pretty rare, maybe 1 in 20. Adding 'rootwait' or removing the
prefer-asynchronous probe type fixes it there as well.


* Re: regression on aarch64? panic on boot
  2023-01-16 21:57 regression on aarch64? panic on boot Klaus Jensen
  2023-01-17  5:58 ` Christoph Hellwig
@ 2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
  2023-01-27 11:11   ` Linux kernel regression tracking (#update)
  1 sibling, 1 reply; 11+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-19 13:10 UTC (permalink / raw)
  To: Klaus Jensen, Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
	Greg Kroah-Hartman, Rafael J. Wysocki,
	Linux kernel regressions list

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 16.01.23 22:57, Klaus Jensen wrote:
> 
> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
> 
> I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
> probe from the rest path").
> 
> I'm not seeing this on any other emulated platforms that I'm currently
> testing (x86_64, riscv32/64, mips32/64 and sparc64).
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced eac3ef262941
#regzbot title nvme: occasional boot problems due to the newly supported
async driver probe
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.




* Re: regression on aarch64? panic on boot
  2023-01-17  6:37     ` Christoph Hellwig
  2023-01-17  6:39       ` Klaus Jensen
  2023-01-17 12:11       ` Martin Wilck
@ 2023-01-19 16:48       ` Keith Busch
  2023-01-24 17:11         ` Ville Syrjälä
  2 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2023-01-19 16:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Klaus Jensen, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
	Greg Kroah-Hartman, Rafael J. Wysocki

On Tue, Jan 17, 2023 at 07:37:35AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > Good morning Christoph,
> > 
> > Yep, the above works.
> 
> Context for the newly added: This is dropping the newly added
> PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> other boot tests) to fail.  Any idea what could be going wrong there
> probably in userspace?

Prior to 6.2, the driver would do its own async_schedule, and that
async probe function would flush the first scan work.
wait_for_device_probe() was then forced to wait for the scan_work to
complete, which brings up the root device.

We're not flushing the scan_work anymore from our probe, so this should
fix it for 6.2:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b294b41a149a7..ff97426749976 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3046,6 +3046,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)

        nvme_start_ctrl(&dev->ctrl);
        nvme_put_ctrl(&dev->ctrl);
+       flush_work(&dev->ctrl.scan_work);
        return 0;

 out_disable:
--
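
The ordering problem described above can be modelled in userspace. The
sketch below is only an analogy under stated assumptions: pthreads stand
in for the async probe and for scan_work, pthread_join() stands in for
both flush_work() and wait_for_device_probe(), and every identifier
(fake_ctrl, fake_probe, probe_then_check_disk, ...) is hypothetical
rather than taken from the driver.

```c
/* Userspace model of the probe/scan_work ordering; not kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct fake_ctrl {
	pthread_t scan_thread;	/* stands in for ctrl.scan_work */
	pthread_mutex_t lock;	/* protects disk_visible */
	bool flush_in_probe;	/* models the added flush_work() call */
	bool disk_visible;	/* set once the "namespace scan" is done */
};

void *fake_scan_work(void *arg)
{
	struct fake_ctrl *ctrl = arg;

	usleep(50 * 1000);	/* pretend the scan takes a while */
	pthread_mutex_lock(&ctrl->lock);
	ctrl->disk_visible = true;
	pthread_mutex_unlock(&ctrl->lock);
	return NULL;
}

void *fake_probe(void *arg)
{
	struct fake_ctrl *ctrl = arg;
	int rc;

	/* Queue the scan, as starting the controller would. */
	rc = pthread_create(&ctrl->scan_thread, NULL, fake_scan_work, ctrl);
	assert(rc == 0);
	(void)rc;

	/* With the fix, probe does not return before the scan is done. */
	if (ctrl->flush_in_probe)
		pthread_join(ctrl->scan_thread, NULL);
	return NULL;
}

/* Returns whether the disk was visible right after probe completed. */
bool probe_then_check_disk(bool flush_in_probe)
{
	struct fake_ctrl ctrl = { .flush_in_probe = flush_in_probe };
	pthread_t probe_thread;
	bool seen;
	int rc;

	pthread_mutex_init(&ctrl.lock, NULL);
	rc = pthread_create(&probe_thread, NULL, fake_probe, &ctrl);
	assert(rc == 0);
	(void)rc;

	/* Models wait_for_device_probe(): waits for probe only. */
	pthread_join(probe_thread, NULL);

	pthread_mutex_lock(&ctrl.lock);
	seen = ctrl.disk_visible;
	pthread_mutex_unlock(&ctrl.lock);

	/* Clean up the scan thread if probe did not already join it. */
	if (!flush_in_probe)
		pthread_join(ctrl.scan_thread, NULL);
	pthread_mutex_destroy(&ctrl.lock);
	return seen;
}
```

In this model probe_then_check_disk(true) always reports the disk as
visible, because probe cannot finish before the scan does. With false
the result depends on timing, which mirrors the intermittent nature of
the reported panic.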



* Re: regression on aarch64? panic on boot
  2023-01-19 16:48       ` Keith Busch
@ 2023-01-24 17:11         ` Ville Syrjälä
  0 siblings, 0 replies; 11+ messages in thread
From: Ville Syrjälä @ 2023-01-24 17:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Klaus Jensen, Jens Axboe, Sagi Grimberg,
	linux-nvme, linux-kernel, Greg Kroah-Hartman, Rafael J. Wysocki

On Thu, Jan 19, 2023 at 09:48:56AM -0700, Keith Busch wrote:
> On Tue, Jan 17, 2023 at 07:37:35AM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2023 at 07:31:59AM +0100, Klaus Jensen wrote:
> > > Good morning Christoph,
> > > 
> > > Yep, the above works.
> > 
> > Context for the newly added: This is dropping the newly added
> > PROBE_PREFER_ASYNCHRONOUS in nvme, which causes Klaus' arm64 (but not
> > other boot tests) to fail.  Any idea what could be going wrong there
> > probably in userspace?
> 
> Prior to 6.2, the driver would do its own async_schedule, and that
> async probe function would flush the first scan work.
> wait_for_device_probe() was then forced to wait for the scan_work to
> complete, which brings up the root device.
> 
> We're not flushing the scan_work anymore from our probe, so this should
> fix it for 6.2:

Appears to fix my Tigerlake Thinkpad T14 gen2.

Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b294b41a149a7..ff97426749976 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3046,6 +3046,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> 
>         nvme_start_ctrl(&dev->ctrl);
>         nvme_put_ctrl(&dev->ctrl);
> +       flush_work(&dev->ctrl.scan_work);
>         return 0;
> 
>  out_disable:
> --
> 

-- 
Ville Syrjälä
Intel



* Re: regression on aarch64? panic on boot
  2023-01-19 13:10 ` Linux kernel regression tracking (#adding)
@ 2023-01-27 11:11   ` Linux kernel regression tracking (#update)
  0 siblings, 0 replies; 11+ messages in thread
From: Linux kernel regression tracking (#update) @ 2023-01-27 11:11 UTC (permalink / raw)
  To: Klaus Jensen, Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-kernel,
	Greg Kroah-Hartman, Rafael J. Wysocki,
	Linux kernel regressions list

[TLDR: there afaics is a fix for the regression discussed in this
thread, but its author did not use a Link: tag to point to the report,
as wanted by Linus and explained in the documentation; this forces me to
write this mail, whose sole purpose is to update the state of this
tracked Linux kernel regression.]

On 19.01.23 14:10, Linux kernel regression tracking (#adding) wrote:
> On 16.01.23 22:57, Klaus Jensen wrote:
>>
>> I'm getting panics when booting from a QEMU hw/nvme device on an aarch64
>> guest in roughly 20% of boots on v6.2-rc4. Example panic below.
>>
>> I've bisected it to commit eac3ef262941 ("nvme-pci: split the initial
>> probe from the rest path").
>>
>> I'm not seeing this on any other emulated platforms that I'm currently
>> testing (x86_64, riscv32/64, mips32/64 and sparc64).
>> [...]
> 
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced eac3ef262941
> #regzbot title nvme: occasional boot problems due to the newly supported
> async driver probe
> #regzbot ignore-activity

#regzbot monitor:
https://lore.kernel.org/all/20230124171738.2311160-1-kbusch@meta.com/
#regzbot fix: nvme-pci: flush initial scan_work for async probe
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

