Linux-NVME Archive on lore.kernel.org
From: Laurence Oberman <loberman@redhat.com>
To: Keith Busch <kbusch@kernel.org>
Cc: "busch, keith" <keith.busch@intel.com>, linux-nvme@lists.infradead.org
Subject: Re: nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot
Date: Mon, 07 Oct 2024 11:56:09 -0400
Message-ID: <5325b263817024d0ca617b114f0a30aab0e0e2bc.camel@redhat.com>
In-Reply-To: <Zv8G8l5cyzDRwLJA@kbusch-mbp>

On Thu, 2024-10-03 at 15:04 -0600, Keith Busch wrote:
> On Thu, Sep 26, 2024 at 05:11:05PM -0400, Laurence Oberman wrote:
> > It was reported to Red Hat, seeing issues with using a
> > "nvme subsystem-reset /dev/nvme0" command to test resets.
> 
> I really dislike that command. The side effects are overkill for the
> PCI transport...
>  
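For anyone following along: per the NVMe spec, a subsystem reset is
triggered by writing the magic value 0x4E564D65 ("NVMe") to the NSSR
register, and it takes down every controller in the subsystem, which is
why it is so much heavier than a plain controller reset. On Linux the
CLI command boils down to a single ioctl on the controller character
device. A minimal sketch of the user-space side (error handling
trimmed; it assumes the controller reports NSSRS support in CAP, and
NVME_IOCTL_SUBSYS_RESET comes from linux/nvme_ioctl.h):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/nvme_ioctl.h>

  int main(void)
  {
      /* Open the controller character device, not a namespace. */
      int fd = open("/dev/nvme0", O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* The kernel side writes 0x4E564D65 to NSSR; since the
       * 6.11-era changes it no longer schedules a controller
       * reset automatically. */
      if (ioctl(fd, NVME_IOCTL_SUBSYS_RESET) < 0)
          perror("NVME_IOCTL_SUBSYS_RESET");
      close(fd);
      return 0;
  }
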
> > On multiple servers I tested on two types of nvme attached devices
> > These are not the rootfs devices
> > 
> > 1. The front-slot (hotplug) devices in a 2.5in form factor
> > reset and, after some time, recover (which is expected).
> > 
> > Example of one working
> > 
> > It does not trap and end up as a machine check.
> 
> <snip>
> 
> > 2. Any kernel (latest upstream 6.11, RHEL8, or RHEL9) causes
> > a machine check and panics the box when it is run against an
> > NVMe device in a PCIe slot.
> > 
> > [  263.862919] mce: [Hardware Error]: CPU 12: Machine Check
> > Exception: 5 Bank 6: ba00000000000e0b
> > [  263.862924] mce: [Hardware Error]: RIP !INEXACT!
> > 10:<ffffffff8571dce4> {intel_idle+0x54/0x90}
> 
> So this wasn't failing before 6.11? As Nilay mentioned, there are
> some changes in how nvme subsystem reset is handled; the main one is
> that this ioctl no longer automatically triggers an nvme reset. I
> expected delayed recovery might happen, but machine checks are not
> expected. If this was working before, I can only guess right now
> that the previous behavior accessed MMIO and config space sooner and
> triggered a different error path. If you're successful with the PPC
> patch reverted, I would be interested to hear about it.
> 

Hello,

A quick update on this: I went back all the way to 6.8, and this
still happens. I started to think that these HPE servers were simply
more susceptible to machine checks on the PCIe state changes.

So I tested on a Lenovo server as well and still hit panics.
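
Keith's point about MMIO timing lines up with this: after the NSSR
write the link can go down, and touching controller registers while
the device is away is exactly the kind of access some platforms
escalate to a fatal machine check rather than completing as all-ones.
A minimal sketch of peeking at CSTS (and its NSSRO, "NVM Subsystem
Reset Occurred", bit) through BAR0, purely illustrative; the
0000:5e:00.0 address is a placeholder for your device:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define NVME_REG_CSTS   0x1c      /* Controller Status */
  #define NVME_CSTS_NSSRO (1 << 4)  /* NVM Subsystem Reset Occurred */

  int main(void)
  {
      int fd = open("/sys/bus/pci/devices/0000:5e:00.0/resource0",
                    O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* Map the first page of BAR0, which holds the controller
       * registers, and read CSTS. */
      volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ,
                                    MAP_SHARED, fd, 0);
      if (bar == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      uint32_t csts = bar[NVME_REG_CSTS / 4];
      printf("CSTS=0x%08x NSSRO=%u\n", csts,
             !!(csts & NVME_CSTS_NSSRO));
      munmap((void *)bar, 4096);
      close(fd);
      return 0;
  }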
I do not think this is worth pursuing, given that Keith has already
confirmed that the command is not recommended and is far too
heavy-handed for the PCIe transport.

I have told the reporter not to use this type of fault injection
against directly attached NVMe devices.
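
A gentler way to test resets on the PCIe transport, and this is only
my suggestion rather than something prescribed in this thread, is the
per-controller reset ("nvme reset /dev/nvme0"), which stays within a
single PCI function. Roughly the same thing can be driven through the
nvme core's sysfs knob; a minimal sketch:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* Writing to reset_controller asks the driver to reset just
       * this controller, not the whole NVM subsystem. */
      int fd = open("/sys/class/nvme/nvme0/reset_controller",
                    O_WRONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      if (write(fd, "1", 1) != 1)
          perror("write");
      close(fd);
      return 0;
  }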

Thanks
Laurence




Thread overview: 8+ messages
2024-09-26 21:11 nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot Laurence Oberman
2024-09-27  6:10 ` Nilay Shroff
2024-09-27 12:18   ` Laurence Oberman
2024-09-27 13:06     ` Nilay Shroff
2024-10-03 21:04 ` Keith Busch
2024-10-07 15:56   ` Laurence Oberman [this message]
2024-10-29 16:07     ` Laurence Oberman
2024-10-29 16:42       ` Keith Busch
