From: Yi Zhang <yi.zhang@redhat.com>
To: Keith Busch <keith.busch@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>,
	linux-block@vger.kernel.org, osandov@osandov.com,
	linux-nvme@lists.infradead.org, ming.lei@redhat.com
Subject: Re: blktests block/019 lead system hang
Date: Wed, 6 Jun 2018 13:42:15 +0800
Message-ID: <1cbee034-d237-104d-bf5a-33e373821301@redhat.com>
In-Reply-To: <20180605172112.GC17057@localhost.localdomain>

Here is the output. I can see "HotPlug+ Surprise+" in SltCap:

# lspci -vvv -s 0000:83:05.0
83:05.0 PCI bridge: PLX Technology, Inc. PEX 8734 32-lane, 8-Port PCI Express Gen 3 (8.0GT/s) Switch (rev ab) (prog-if 00 [Normal decode])
     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
     Latency: 0, Cache Line Size: 32 bytes
     Interrupt: pin A routed to IRQ 40
     NUMA node: 1
     Bus: primary=83, secondary=85, subordinate=85, sec-latency=0
     I/O behind bridge: 00009000-00009fff
     Memory behind bridge: c8600000-c86fffff
     Prefetchable memory behind bridge: 000003c000200000-000003c0003fffff
     Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
     BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
         PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
     Capabilities: [40] Power Management version 3
         Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
     Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
         Address: 00000000fee00118  Data: 0000
         Masking: 000000fe  Pending: 00000000
     Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00
         DevCap:    MaxPayload 512 bytes, PhantFunc 0
             ExtTag- RBE+
         DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
             RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
             MaxPayload 128 bytes, MaxReadReq 128 bytes
         DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
         LnkCap:    Port #5, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s <4us, L1 <4us
             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
         LnkCtl:    ASPM Disabled; Disabled- CommClk-
             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
         LnkSta:    Speed 8GT/s, Width x4, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
         SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
             Slot #181, PowerLimit 25.000W; Interlock- NoCompl-
         SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt+ HPIrq+ LinkChg+
             Control: AttnInd Unknown, PwrInd On, Power- Interlock-
         SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
             Changed: MRL- PresDet- LinkState-
         DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Via message ARIFwd+
         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd+
         LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
              Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
              Compliance De-emphasis: -6dB
         LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
              EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
     Capabilities: [a4] Subsystem: Dell Device 1f84
     Capabilities: [100 v1] Device Serial Number ab-87-00-10-b5-df-0e-00
     Capabilities: [fb4 v1] Advanced Error Reporting
         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol+
         UESvrt:    DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
         CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
         CEMsk:    RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
         AERCap:    First Error Pointer: 1f, GenCap+ CGenEn+ ChkCap+ ChkEn+
     Capabilities: [138 v1] Power Budgeting <?>
     Capabilities: [10c v1] #19
     Capabilities: [148 v1] Virtual Channel
         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
         Arb:    Fixed- WRR32- WRR64- WRR128-
         Ctrl:    ArbSelect=Fixed
         Status:    InProgress-
         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
             Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128- WRR256-
             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
             Status:    NegoPending- InProgress-
     Capabilities: [e00 v1] #12
     Capabilities: [f24 v1] Access Control Services
         ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
         ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
     Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?>
     Kernel driver in use: pcieport
     Kernel modules: shpchp
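
In case it helps to script the same check on other machines, here is a minimal sketch that pulls the SltCap flags out of saved `lspci -vvv` text. The function name `sltcap_flags` and the `sample` string are only illustrative, and this is not the check that blktests itself performs; it only reports what lspci printed.

```python
import re

def sltcap_flags(lspci_text):
    """Return the SltCap flags from `lspci -vvv` text as {name: bool}."""
    # Capture everything between "SltCap:" and the "Slot #" line that follows,
    # so the flags are found even when a mail client wrapped the line.
    m = re.search(r"SltCap:(.*?)Slot #", lspci_text, re.DOTALL)
    if m is None:
        return {}
    # Each flag is printed as a name followed by "+" (set) or "-" (clear).
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", m.group(1))}

# Fragment copied from the output above, including the wrapped line.
sample = """\
         SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+
Surprise+
             Slot #181, PowerLimit 25.000W; Interlock- NoCompl-
"""

flags = sltcap_flags(sample)
print(flags["HotPlug"], flags["Surprise"])  # True True
```

On a live system the text could come from running `lspci -vvv -s 0000:83:05.0` and passing its output to the function.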

Thanks

Yi


On 06/06/2018 01:21 AM, Keith Busch wrote:
> On Tue, Jun 05, 2018 at 10:18:53AM -0600, Keith Busch wrote:
>> On Wed, May 30, 2018 at 03:26:54AM -0400, Yi Zhang wrote:
>>> Hi Keith
>>> I found that blktests block/019 can also hang my NVMe server on 4.17.0-rc7. Let me know if you need more info, thanks.
>>>
>>> Server: Dell R730xd
>>> NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
>>>
>>> Console log:
>>> Kernel 4.17.0-rc7 on an x86_64
>>>
>>> storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 03:16:34
>>> [ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
>>> [ 6049.108478] {1}[Hardware Error]: event severity: fatal
>>> [ 6049.108479] {1}[Hardware Error]:  Error 0, type: fatal
>>> [ 6049.108481] {1}[Hardware Error]:   section_type: PCIe error
>>> [ 6049.108482] {1}[Hardware Error]:   port_type: 6, downstream switch port
>>> [ 6049.108483] {1}[Hardware Error]:   version: 1.16
>>> [ 6049.108484] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
>>> [ 6049.108485] {1}[Hardware Error]:   device_id: 0000:83:05.0
>>> [ 6049.108486] {1}[Hardware Error]:   slot: 0
>>> [ 6049.108487] {1}[Hardware Error]:   secondary_bus: 0x85
>>> [ 6049.108488] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x8734
>>> [ 6049.108489] {1}[Hardware Error]:   class_code: 000406
>>> [ 6049.108489] {1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0003
>>> [ 6049.108491] Kernel panic - not syncing: Fatal hardware error!
>>> [ 6049.108514] Kernel Offset: 0x25800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> Could you attach 'lspci -vvv -s 0000:83:05.0'? Just want to see
> your switch's capabilities to confirm the pre-test checks are really
> sufficient.
