Re: nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot

From: Laurence Oberman <loberman@redhat.com>
To: Keith Busch <kbusch@kernel.org>
Cc: "busch, keith" <keith.busch@intel.com>, linux-nvme@lists.infradead.org
Subject: Re: nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot
Date: Tue, 29 Oct 2024 12:07:26 -0400
Message-ID: <bdd5b88fbd3325ae16ec9791b0719c373bbefed9.camel@redhat.com>
In-Reply-To: <5325b263817024d0ca617b114f0a30aab0e0e2bc.camel@redhat.com>

On Mon, 2024-10-07 at 11:56 -0400, Laurence Oberman wrote:
> On Thu, 2024-10-03 at 15:04 -0600, Keith Busch wrote:
> > On Thu, Sep 26, 2024 at 05:11:05PM -0400, Laurence Oberman wrote:
> > > It was reported to Red Hat, seeing issues with using a
> > > "nvme subsystem-reset /dev/nvme0" command to test resets.
> > 
> > I really dislike that command. The side effects are overkill for the
> > pci transport...
> >  
> > > On multiple servers I tested on two types of nvme attached devices
> > > These are not the rootfs devices
> > > 
> > > 1. The front slot (hotplug) devices in a 2.5in format 
> > > reset and after some time recover (what is expected)
> > > 
> > > Example of one working
> > > 
> > > Does not trap and land up as a machine-check
> > 
> > <snip>
> > 
> > > 2. Any kernel upstream latest 6.11, RHEL8 or RHEL9 causes 
> > > a machine check and panics the box when its against a nvme in a 
> > > PCIE slot
> > > 
> > > [  263.862919] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 6: ba00000000000e0b
> > > [  263.862924] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8571dce4> {intel_idle+0x54/0x90}
> > 
> > So this wasn't failing before 6.11? As Nilay mentioned, there are some
> > changes on how nvme subsystem reset is handled. The main thing being
> > this ioctl doesn't automatically trigger an nvme reset. I expected
> > delayed recovery might happen, but machine checks are not expected. If
> > this was working before, I can only guess right now that the previous
> > behavior was accessing MMIO and config quicker and triggered a different
> > error path. If you're successful with the PPC patch reverted, I would be
> > interested to hear about it.
> > 
> 
> Hello
> 
> Quick update about this.
> I went back all the way to 6.8 and this still happens.
> I started to think that these HPE servers were more susceptible to the
> machine checks on the PCIE state changes.
> 
> So I tested on a Lenovo and still had panics.
> I do not think this is worth pursuing given that Keith already
> confirmed this is not recommended and way too heavy handed on the PCIE
> path.
> 
> I have told the reporter of this that they are not to use this type of
> fault injection on directly attached nvme devices.
> 
> Thanks
> Laurence
> 
Hello

Finishing this thread off, but I have a final question.
The bottom line is that on certain server hardware, the nvme
subsystem-reset command causes a machine check for PCIe slot-attached
NVMe devices, going back quite far in kernel versions, and we panic.

As Keith said, that command is too heavy-handed for the PCI transport.

There is one final, simple question about M.2-connected NVMe devices.
Are these expected to reconnect automatically after an nvme
subsystem-reset is issued?
The complaint is the following:

nvme subsystem-reset /dev/nvme0

The device disconnects as expected, but reconnecting it requires the
following:

echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove
echo 1 > /sys/bus/pci/rescan

Then it is reconnected.
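
For anyone repeating this, the two writes above can be wrapped in a small
helper. This is only a sketch of the manual workaround, not a recommended
recovery path. The function name pci_remove_rescan and the SYSFS_ROOT
override are made up for illustration (SYSFS_ROOT exists only so the helper
can be dry-run against a scratch directory); the sysfs paths are the ones
shown above.

```shell
#!/bin/sh
# Sketch: remove one PCI function and rescan the bus through sysfs.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

pci_remove_rescan() {
    bdf="$1"    # PCI address of the device, e.g. 0000:02:00.0
    dev="$SYSFS_ROOT/bus/pci/devices/$bdf"

    # Refuse to continue if the device is not present in sysfs.
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi

    # Detach the device from the PCI core, then ask for a bus rescan.
    echo 1 > "$dev/remove" || return 1
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}
```

Run as root after the subsystem reset, for example:
pci_remove_rescan 0000:02:00.0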

Thanks
Laurence
