From: Yi Zhang <yi.zhang@redhat.com>
To: Keith Busch <keith.busch@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>,
linux-block@vger.kernel.org, osandov@osandov.com,
linux-nvme@lists.infradead.org, ming.lei@redhat.com
Subject: Re: blktests block/019 lead system hang
Date: Wed, 6 Jun 2018 13:42:15 +0800
Message-ID: <1cbee034-d237-104d-bf5a-33e373821301@redhat.com>
In-Reply-To: <20180605172112.GC17057@localhost.localdomain>
Here is the output. I can see "HotPlug+ Surprise+" in SltCap:
# lspci -vvv -s 0000:83:05.0
83:05.0 PCI bridge: PLX Technology, Inc. PEX 8734 32-lane, 8-Port PCI
Express Gen 3 (8.0GT/s) Switch (rev ab) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 40
NUMA node: 1
Bus: primary=83, secondary=85, subordinate=85, sec-latency=0
I/O behind bridge: 00009000-00009fff
Memory behind bridge: c8600000-c86fffff
Prefetchable memory behind bridge: 000003c000200000-000003c0003fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
Address: 00000000fee00118 Data: 0000
Masking: 000000fe Pending: 00000000
Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr-
TransPend-
LnkCap: Port #5, Speed 8GT/s, Width x4, ASPM L1, Exit
Latency L0s <4us, L1 <4us
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk-
DLActive+ BWMgmt- ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+
Surprise+
Slot #181, PowerLimit 25.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt+
HPIrq+ LinkChg+
Control: AttnInd Unknown, PwrInd On, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+
Interlock-
Changed: MRL- PresDet- LinkState-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+,
OBFF Via message ARIFwd+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
OBFF Disabled ARIFwd+
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-,
Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [a4] Subsystem: Dell Device 1f84
Capabilities: [100 v1] Device Serial Number ab-87-00-10-b5-df-0e-00
Capabilities: [fb4 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol+
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 1f, GenCap+ CGenEn+ ChkCap+ ChkEn+
Capabilities: [138 v1] Power Budgeting <?>
Capabilities: [10c v1] #19
Capabilities: [148 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed+ WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [e00 v1] #12
Capabilities: [f24 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir-
UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0
Len=010 <?>
Kernel driver in use: pcieport
Kernel modules: shpchp
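The SltCap reading above can also be checked mechanically. What follows is a minimal sketch in Python, not the check blktests itself performs: `slot_capabilities` and `SAMPLE` are hypothetical names introduced for illustration, and the parser assumes `lspci -vvv` text in which the SltCap field may wrap across lines, as it does in this mail.

```python
import re

def slot_capabilities(lspci_text):
    """Collect the +/- flags of the SltCap field from `lspci -vvv` text.

    Hypothetical helper: returns a dict such as {"HotPlug": True}.
    Continuation lines are read until the next labelled field (e.g. "SltCtl:").
    """
    flags = {}
    in_sltcap = False
    for line in lspci_text.splitlines():
        text = line.strip()
        if text.startswith("SltCap:"):
            in_sltcap = True
            text = text[len("SltCap:"):]
        elif in_sltcap and re.match(r"\w+:", text):
            # A new "Label:" field ends the SltCap section.
            break
        if in_sltcap:
            # Match tokens like "HotPlug+" or "MRL-" followed by whitespace or end of line.
            for name, sign in re.findall(r"([A-Za-z]\w*)([+-])(?!\S)", text):
                flags[name] = (sign == "+")
    return flags

# Excerpt copied from the output above, including the mail client's line wrapping.
SAMPLE = """\
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+
Surprise+
Slot #181, PowerLimit 25.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt+
"""

caps = slot_capabilities(SAMPLE)
print(caps["HotPlug"], caps["Surprise"])
```

On the excerpt, both `HotPlug` and `Surprise` come back set, which matches the manual reading of the SltCap line.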
Thanks
Yi
On 06/06/2018 01:21 AM, Keith Busch wrote:
> On Tue, Jun 05, 2018 at 10:18:53AM -0600, Keith Busch wrote:
>> On Wed, May 30, 2018 at 03:26:54AM -0400, Yi Zhang wrote:
>>> Hi Keith
>>> I found that blktests block/019 can also hang my NVMe server on 4.17.0-rc7. Let me know if you need more info, thanks.
>>>
>>> Server: Dell R730xd
>>> NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
>>>
>>> Console log:
>>> Kernel 4.17.0-rc7 on an x86_64
>>>
>>> storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 03:16:34
>>> [ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
>>> [ 6049.108478] {1}[Hardware Error]: event severity: fatal
>>> [ 6049.108479] {1}[Hardware Error]: Error 0, type: fatal
>>> [ 6049.108481] {1}[Hardware Error]: section_type: PCIe error
>>> [ 6049.108482] {1}[Hardware Error]: port_type: 6, downstream switch port
>>> [ 6049.108483] {1}[Hardware Error]: version: 1.16
>>> [ 6049.108484] {1}[Hardware Error]: command: 0x0407, status: 0x0010
>>> [ 6049.108485] {1}[Hardware Error]: device_id: 0000:83:05.0
>>> [ 6049.108486] {1}[Hardware Error]: slot: 0
>>> [ 6049.108487] {1}[Hardware Error]: secondary_bus: 0x85
>>> [ 6049.108488] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x8734
>>> [ 6049.108489] {1}[Hardware Error]: class_code: 000406
>>> [ 6049.108489] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
>>> [ 6049.108491] Kernel panic - not syncing: Fatal hardware error!
>>> [ 6049.108514] Kernel Offset: 0x25800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> Could you attach 'lspci -vvv -s 0000:83:05.0'? Just want to see
> your switch's capabilities to confirm the pre-test checks are really
> sufficient.