From: keith.busch@linux.intel.com (Keith Busch)
Subject: blktests block/019 lead system hang
Date: Tue, 5 Jun 2018 10:18:53 -0600
Message-ID: <20180605161853.GB16899@localhost.localdomain>
In-Reply-To: <1858098161.4693883.1527665214701.JavaMail.zimbra@redhat.com>

On Wed, May 30, 2018 at 03:26:54AM -0400, Yi Zhang wrote:
> Hi Keith
> I found that blktests block/019 can also hang my NVMe server on 4.17.0-rc7. Let me know if you need more info, thanks.
> 
> Server: Dell R730xd
> NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
> 
> Console log:
> Kernel 4.17.0-rc7 on an x86_64
> 
> storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 03:16:34
> [ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> [ 6049.108478] {1}[Hardware Error]: event severity: fatal
> [ 6049.108479] {1}[Hardware Error]:  Error 0, type: fatal
> [ 6049.108481] {1}[Hardware Error]:   section_type: PCIe error
> [ 6049.108482] {1}[Hardware Error]:   port_type: 6, downstream switch port
> [ 6049.108483] {1}[Hardware Error]:   version: 1.16
> [ 6049.108484] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
> [ 6049.108485] {1}[Hardware Error]:   device_id: 0000:83:05.0
> [ 6049.108486] {1}[Hardware Error]:   slot: 0
> [ 6049.108487] {1}[Hardware Error]:   secondary_bus: 0x85
> [ 6049.108488] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x8734
> [ 6049.108489] {1}[Hardware Error]:   class_code: 000406
> [ 6049.108489] {1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0003
> [ 6049.108491] Kernel panic - not syncing: Fatal hardware error!
> [ 6049.108514] Kernel Offset: 0x25800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

It sounds like your platform fundamentally doesn't support surprise link
down, since it treats the event as a fatal error. That's part of what
this test is meant to catch, so we know which platforms can handle this
and which can't.

The test does check that the slot is hotplug capable before running,
so it's supposed to only run the test on slots that claim to be capable
of handling the event. I just don't know of a good way to query platform
firmware to know what it will do in response to such an event.
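
For reference, a slot advertises hot-plug support through two bits in the
PCIe Slot Capabilities register of the port above the device: Hot-Plug
Surprise (bit 5) and Hot-Plug Capable (bit 6). The sketch below shows what
a capability check of that kind could look like. Whether block/019 tests
exactly these bits is an assumption here, and the function names are
hypothetical. Note that it only reports what the slot claims, not what
platform firmware will do when the link actually drops.

```python
# Slot Capabilities register bits, per the PCI Express Base Specification.
HOTPLUG_SURPRISE = 1 << 5  # slot tolerates removal without prior notice
HOTPLUG_CAPABLE = 1 << 6   # slot supports hot-plug operations


def slot_claims_surprise_hotplug(slot_cap: int) -> bool:
    """Return True if both hot-plug bits are set in a Slot Capabilities value."""
    required = HOTPLUG_SURPRISE | HOTPLUG_CAPABLE
    return (slot_cap & required) == required


def parse_setpci_word(text: str) -> int:
    """Parse a bare hex word such as "0060" (assumed setpci output format)."""
    return int(text.strip(), 16)
```

The register value could be read with something like
`setpci -s <parent port> CAP_EXP+14.w` and passed through
`parse_setpci_word()` before the check; the parent port address is left as
a placeholder.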
