From: Laurence Oberman <loberman@redhat.com>
To: Keith Busch <kbusch@kernel.org>
Cc: "busch, keith" <keith.busch@intel.com>, linux-nvme@lists.infradead.org
Subject: Re: nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot
Date: Mon, 07 Oct 2024 11:56:09 -0400
Message-ID: <5325b263817024d0ca617b114f0a30aab0e0e2bc.camel@redhat.com>
In-Reply-To: <Zv8G8l5cyzDRwLJA@kbusch-mbp>
On Thu, 2024-10-03 at 15:04 -0600, Keith Busch wrote:
> On Thu, Sep 26, 2024 at 05:11:05PM -0400, Laurence Oberman wrote:
> > It was reported to Red Hat that users are seeing issues when using
> > the "nvme subsystem-reset /dev/nvme0" command to test resets.
>
> I really dislike that command. The side effects are overkill for the
> pci transport...
>
> > On multiple servers, I tested two types of NVMe-attached devices.
> > These are not the rootfs devices.
> >
> > 1. The front-slot (hotplug) devices, in a 2.5in form factor,
> > reset and after some time recover (which is expected).
> >
> > Example of one working:
> >
> > Does not trap and end up as a machine check.
>
> <snip>
>
> > 2. Any kernel (latest upstream 6.11, RHEL8, or RHEL9) causes
> > a machine check and panics the box when it is run against an
> > NVMe in a PCIe slot.
> >
> > [ 263.862919] mce: [Hardware Error]: CPU 12: Machine Check
> > Exception: 5 Bank 6: ba00000000000e0b
> > [ 263.862924] mce: [Hardware Error]: RIP !INEXACT!
> > 10:<ffffffff8571dce4> {intel_idle+0x54/0x90}
>
> So this wasn't failing before 6.11? As Nilay mentioned, there are some
> changes on how nvme subsystem reset is handled. The main thing being
> this ioctl doesn't automatically trigger an nvme reset. I expected
> delayed recovery might happen, but machine checks are not expected. If
> this was working before, I can only guess right now that the previous
> behavior was accessing MMIO and config quicker and triggered a
> different error path. If you're successful with the PPC patch
> reverted, I would be interested to hear about it.
>
Hello,

A quick update on this. I went back all the way to 6.8 and this still
happens. I started to think these HPE servers were more susceptible to
machine checks on the PCIe state changes, so I tested on a Lenovo and
still had panics.
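For context, and hedging here since I am going from the spec rather
than from this particular failure: a subsystem reset is initiated by
writing the ASCII value "NVMe" (0x4E564D65) to the NSSR register, and
it takes down every controller in the subsystem, which on a
direct-attach device may bounce the PCIe link itself. As a sketch,
support can be checked first via CAP.NSSRS (the grep pattern below is
an assumption about nvme-cli's human-readable output, which varies by
version):

  # Check whether the controller advertises subsystem reset support
  nvme show-regs /dev/nvme0 -H | grep -i "subsystem reset"

  # Only then issue the reset under test
  nvme subsystem-reset /dev/nvme0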
I do not think this is worth pursuing, given that Keith has already
confirmed this command is not recommended and is far too heavy-handed
on the PCIe path. I have told the reporter not to use this type of
fault injection on directly attached NVMe devices.
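If they still need a reset-path test, a controller-level reset is a
much gentler exercise than a subsystem reset. As a sketch (both
interfaces are standard, but behavior on a given platform should be
verified before wiring this into automated fault injection):

  # Controller reset through nvme-cli (issues NVME_IOCTL_RESET)
  nvme reset /dev/nvme0

  # Equivalent sysfs knob
  echo 1 > /sys/class/nvme/nvme0/reset_controller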
Thanks
Laurence
Thread overview: 8 messages
2024-09-26 21:11 nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot Laurence Oberman
2024-09-27 6:10 ` Nilay Shroff
2024-09-27 12:18 ` Laurence Oberman
2024-09-27 13:06 ` Nilay Shroff
2024-10-03 21:04 ` Keith Busch
2024-10-07 15:56 ` Laurence Oberman [this message]
2024-10-29 16:07 ` Laurence Oberman
2024-10-29 16:42 ` Keith Busch