* PCI device hot insert is not detected
@ 2023-12-12 10:34 Ashutosh Sharma
2023-12-12 10:59 ` Lukas Wunner
0 siblings, 1 reply; 6+ messages in thread
From: Ashutosh Sharma @ 2023-12-12 10:34 UTC (permalink / raw)
To: linux-pci
Cc: alex.williamson, helgaas, dwmw2, yi.l.liu, majosaheb, cohuck,
zhenzhong.duan
Hi,
I have a system running Ubuntu 22.04.3 LTS with kernel version
5.15.0-89-generic. There are 10 NVMe drives connected to this
system, attached to the "vfio-pci" driver.
I removed one NVMe drive (PCI address 0000:83:00.0). It was unbound
successfully from the "vfio-pci" driver, but the following error
appeared in the syslog:
can't change power state from D0 to D3hot (config space inaccessible)
About 2 minutes 30 seconds later, I re-inserted the same drive into the
same PCI slot, but the drive was not detected.
Kernel log:
Dec 11 23:52:05 node-4 kernel: [183519.565000] pcieport 0000:80:03.2:
pciehp: Slot(14): Link Down
Dec 11 23:52:05 node-4 kernel: [183519.565020] vfio-pci 0000:83:00.0:
Relaying device request to user (#0)
Dec 11 23:52:05 node-4 kernel: [183519.567467] vfio-pci 0000:83:00.0:
vfio_bar_restore: reset recovery - restoring BARs
Dec 11 23:52:06 node-4 kernel: [183519.629302] vfio-pci 0000:83:00.0:
can't change power state from D0 to D3hot (config space inaccessible)
Dec 11 23:52:06 node-4 kernel: [183519.639070] pci 0000:83:00.0:
Removing from iommu group 41
Dec 11 23:54:39 node-4 kernel: [183672.630191] pcieport 0000:80:03.2:
pciehp: Slot(14): Attention button pressed
Dec 11 23:54:39 node-4 kernel: [183672.630195] pcieport 0000:80:03.2:
pciehp: Slot(14) Powering on due to button press
Dec 11 23:54:44 node-4 kernel: [183677.671931] pcieport 0000:80:03.2:
pciehp: Slot(14): Card present
Dec 11 23:54:46 node-4 kernel: [183679.783922] pcieport 0000:80:03.2:
pciehp: Slot(14): No link
Dec 12 00:09:17 node-4 kernel: [184550.808980] pcieport 0000:80:03.2:
pciehp: Slot(14): Attention button pressed
Dec 12 00:09:17 node-4 kernel: [184550.808991] pcieport 0000:80:03.2:
pciehp: Slot(14) Powering on due to button press
Dec 12 00:09:22 node-4 kernel: [184556.025139] pcieport 0000:80:03.2:
pciehp: Slot(14): Card present
Dec 12 00:09:24 node-4 kernel: [184558.189151] pcieport 0000:80:03.2:
pciehp: Slot(14): No link
lspci output:
+-[0000:80]-+-00.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse Root Complex
| +-00.2 Advanced Micro Devices, Inc. [AMD] Milan IOMMU
| +-01.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-01.1-[81]--+-00.0 Mellanox Technologies MT28908 Family
[ConnectX-6]
| | \-00.1 Mellanox Technologies MT28908 Family
[ConnectX-6]
| +-02.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-03.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-03.1-[82]----00.0 Samsung Electronics Co Ltd NVMe SSD
Controller PM9A1/PM9A3/980PRO
| +-03.2-[83]--
| +-03.3-[84]--
| +-03.4-[85]--
| +-04.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-05.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-07.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-07.1-[86]--+-00.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Function
| | \-00.2 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PTDMA
| +-08.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| \-08.1-[87]--+-00.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse Reserved SPP
| \-00.2 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PTDMA
+-[0000:40]-+-00.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse Root Complex
Because the slot 14 link was down, the value of "power" in sysfs
remained 0 even though the drive was physically present. Writing 1 to
"power" did not work either.
admin@node-4:/sys/bus/pci/slots/14$ cat address
0000:83:00
admin@node-4:~$
admin@node-4:/sys/bus/pci/slots/14$ cat power
0
admin@node-4:/sys/bus/pci/slots/14$
admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
echo: write error: Operation not permitted
admin@node-4:/sys/bus/pci/slots/14$
Dec 12 09:18:09 node-4 kernel: [217484.101870] pcieport 0000:80:03.2:
pciehp: Slot(14): Card present
Dec 12 09:18:12 node-4 kernel: [217486.272077] pcieport 0000:80:03.2:
pciehp: Slot(14): No link
After a system reboot, however, the drive was detected successfully.
Can I get some insight into why the drive was not detected before the
reboot?
Here are some system details:
lspci tree:
admin@node-4:~$ sudo lspci -t -vvv
+-[0000:80]-+-00.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse Root Complex
| +-00.2 Advanced Micro Devices, Inc. [AMD] Milan IOMMU
| +-01.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-01.1-[81]--+-00.0 Mellanox Technologies MT28908 Family
[ConnectX-6]
| | \-00.1 Mellanox Technologies MT28908 Family
[ConnectX-6]
| +-02.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-03.0 Advanced Micro Devices, Inc. [AMD]
Starship/Matisse PCIe Dummy Host Bridge
| +-03.1-[82]----00.0 Samsung Electronics Co Ltd NVMe SSD
Controller PM9A1/PM9A3/980PRO
| +-03.2-[83]----00.0 Samsung Electronics Co Ltd NVMe SSD
Controller PM9A1/PM9A3/980PRO
lspci output of the said drive:
83:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd General DC NVMe PM9A3
Physical Slot: 14
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 469
NUMA node: 2
IOMMU group: 41
Region 0: Memory at f2410000 (64-bit, non-prefetchable) [size=16K]
Expansion ROM at f2400000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+
FLReset+ SlotPowerLimit 75.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq- OBFF Not
Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported,
EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LTR+ OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink-
Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB,
EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+
LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: Upstream Port
Capabilities: [b0] MSI-X: Enable+ Count=130 Masked-
Vector table: BAR=0 offset=00003000
PBA: BAR=0 offset=00002000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt-
UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+
ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [1bc v1] Lane Margining at the Receiver <?>
Capabilities: [3a0 v1] Data Link Feature <?>
Kernel driver in use: vfio-pci
Kernel modules: nvme
lspci output of the pci port:
80:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 41
NUMA node: 2
IOMMU group: 29
Bus: primary=80, secondary=83, subordinate=83, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled]
Memory behind bridge: f2400000-f24fffff [size=1M]
Prefetchable memory behind bridge:
00000180a0400000-00000180a05fffff [size=2M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
AuxPwr- TransPend-
LnkCap: Port #2, Speed 16GT/s, Width x4, ASPM L1, Exit
Latency L1 <64us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn+ PwrCtrl+ MRL- AttnInd+ PwrInd+
HotPlug+ Surprise+
Slot #14, PowerLimit 75.000W; Interlock+ NoCompl-
SltCtl: Enable: AttnBtn+ PwrFlt- MRL- PresDet-
CmdCplt+ HPIrq+ LinkChg+
Control: AttnInd Off, PwrInd On, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
PresDet+ Interlock-
Changed: MRL- PresDet- LinkState-
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not
Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported,
EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp+
ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms,
TimeoutDis- LTR+ OBFF Disabled, ARIFwd+
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink-
Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB,
EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+
LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 0000
Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc.
[AMD] Starship/Matisse GPP Bridge
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100 v1] Vendor Specific Information: ID=0001
Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt-
UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol+
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+
ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn- NFERptEn- FERptEn-
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
UpstreamFwd+ EgressCtrl- DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+
UpstreamFwd+ EgressCtrl- DirectTrans-
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2-
ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
L1SubCtl2:
Capabilities: [380 v1] Downstream Port Containment
DpcCap: INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP
PIO Log 6, DL_ActiveErr+
DpcCtl: Trigger:0 Cmpl- INT- ErrCor- PoisonedTLP-
SwTrigger- DL_ActiveErr-
DpcSta: Trigger- Reason:00 INT- RPBusy- TriggerExt:00
RP PIO ErrPtr:1f
Source: 0000
Capabilities: [400 v1] Data Link Feature <?>
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [440 v1] Lane Margining at the Receiver <?>
Kernel driver in use: pcieport
admin@node-4:~$ sudo inxi -c 42
CPU: 64-core AMD EPYC 7713P (-MCP-) speed/min/max: 2000/1500/3721 MHz
Kernel: 5.15.0-89-generic x86_64 Up: 12m Mem: 565419.6/1019777.2 MiB (55.4%)
Regards,
Ashutosh
* Re: PCI device hot insert is not detected
From: Lukas Wunner @ 2023-12-12 10:59 UTC (permalink / raw)
To: Ashutosh Sharma
Cc: linux-pci, alex.williamson, helgaas, dwmw2, yi.l.liu, majosaheb,
cohuck, zhenzhong.duan, Mario Limonciello, Smita Koralahalli,
Yazen Ghannam
On Tue, Dec 12, 2023 at 04:04:41PM +0530, Ashutosh Sharma wrote:
> Removed one NVMe drive (pci address 0000:83:00.0), it got unbound
> successfully from "vfio-pci" driver but saw below error in the syslog.
>
> can't change power state from D0 to D3hot (config space inaccessible)
This is normal, the drive's config space is inaccessible after removal.
> Then after 2:30 min approx, re-inserted the same drive to the same PCI
> slot. But the drive was not detected.
>
> Dec 11 23:54:39 node-4 kernel: [183672.630191] pcieport 0000:80:03.2:
> pciehp: Slot(14): Attention button pressed
> Dec 11 23:54:39 node-4 kernel: [183672.630195] pcieport 0000:80:03.2:
> pciehp: Slot(14) Powering on due to button press
> Dec 11 23:54:44 node-4 kernel: [183677.671931] pcieport 0000:80:03.2:
> pciehp: Slot(14): Card present
> Dec 11 23:54:46 node-4 kernel: [183679.783922] pcieport 0000:80:03.2:
> pciehp: Slot(14): No link
The link doesn't come up, so the kernel gives up on the slot.
I don't know what the reason is, could be a hardware issue or
protocol incompatibility. This doesn't look like a kernel issue.
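As a rough sketch only, the "Card present" followed by "No link" pattern can be picked out of a saved kernel log like this. It assumes the log has one message per line (unlike the wrapped lines quoted in this mail), and the function name is made up for illustration:

```shell
# Hypothetical helper: list slots whose most recent pciehp event is "No link".
# Reads kernel log text on stdin, one message per line.
slots_without_link() {
    awk '
        /pciehp: Slot\([^)]*\)/ {
            # Extract the slot name between "Slot(" and ")".
            s = $0
            sub(/.*pciehp: Slot\(/, "", s)
            sub(/\).*/, "", s)
            # Keep the text after the slot name as the latest event for that slot.
            ev = $0
            sub(/.*pciehp: Slot\([^)]*\):? */, "", ev)
            last[s] = ev
        }
        END {
            for (s in last)
                if (last[s] == "No link")
                    print s
        }
    '
}
```

Typical use would be something like `journalctl -k | slots_without_link`; a slot listed here is one where the kernel saw presence but the link never became active.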
> | +-03.0 Advanced Micro Devices, Inc. [AMD]
> Starship/Matisse PCIe Dummy Host Bridge
> | +-03.1-[82]----00.0 Samsung Electronics Co Ltd NVMe SSD
> Controller PM9A1/PM9A3/980PRO
> | +-03.2-[83]--
Adding Mario, Smita, Yazen from AMD to cc, maybe they have an idea
what the issue is or how to get diagnostics on this Epyc platform.
Start of thread:
https://lore.kernel.org/linux-pci/CADOvten7jG7KjW6W1MRd7i8_E18L0xCCaCzmZOY_vvgJhdfOSw@mail.gmail.com/
> admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
> echo: write error: Operation not permitted
This doesn't work, try "echo 1 | sudo tee power" instead.
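To spell out the difference: in `sudo echo 1 > power` the redirection is set up by the calling shell, so the file is opened without root privileges and only `echo` runs under sudo. Piping into `sudo tee` lets the privileged process open the file. A minimal sketch, where the helper name and the SUDO override are illustrative assumptions:

```shell
# Hypothetical helper: write a value to a (sysfs) attribute with root privileges.
# $1 = attribute path, $2 = value.
# SUDO may be set to an empty string to skip sudo, e.g. when already root.
write_attr() {
    # tee, not the calling shell, opens the target file.
    printf '%s\n' "$2" | ${SUDO-sudo} tee "$1" >/dev/null
}
```

For example, `write_attr /sys/bus/pci/slots/14/power 1`. Note that a message of the form "-bash: power: Permission denied" means the shell could not open the file, whereas "write error" means the file was opened and the write itself was rejected.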
> lspci output of the pci port:
> 80:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
> Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
[...]
> LnkSta: Speed 16GT/s (ok), Width x4 (ok)
> TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
This is from a "Link up" situation (DLActive+), it would be more
interesting to get lspci output of the port in a "No link" situation.
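A small sketch for capturing that state next time, assuming `lspci -vv` output for the port (not the endpoint) is piped in; the function name is made up:

```shell
# Hypothetical helper: read `lspci -vv` output for one port on stdin and print
# "up" if LnkSta shows DLActive+, "down" if DLActive-, "unknown" if absent.
link_state() {
    awk '
        /DLActive\+/ { state = "up" }
        /DLActive-/  { state = "down" }
        END { print (state ? state : "unknown") }
    '
}
```

For the port in this report that would be `sudo lspci -vv -s 80:03.2 | link_state`, ideally run while the slot reports "No link" and with the full lspci output saved alongside.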
Thanks,
Lukas
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: PCI device hot insert is not detected
From: Ashutosh Sharma @ 2023-12-12 11:32 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, alex.williamson, helgaas, dwmw2, yi.l.liu, majosaheb,
cohuck, zhenzhong.duan, Mario Limonciello, Smita Koralahalli,
Yazen Ghannam
> This doesn't work, try "echo 1 | sudo tee power" instead.
This was not a permission issue; I had already given the file read/write permission.
admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
-bash: power: Permission denied
admin@node-4:/sys/bus/pci/slots/14$ sudo chmod 0666 power
admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
echo: write error: Operation not permitted
admin@node-4:/sys/bus/pci/slots/14$
> This is from a "Link up" situation (DLActive+), it would be more
> interesting to get lspci output of the port in a "No link" situation.
Unfortunately, I did not collect that output before system reboot.
* Re: PCI device hot insert is not detected
From: Mario Limonciello @ 2023-12-12 18:29 UTC (permalink / raw)
To: Ashutosh Sharma, Lukas Wunner
Cc: linux-pci, alex.williamson, helgaas, dwmw2, yi.l.liu, majosaheb,
cohuck, zhenzhong.duan, Smita Koralahalli, Yazen Ghannam
On 12/12/2023 05:32, Ashutosh Sharma wrote:
>> This doesn't work, try "echo 1 | sudo tee power" instead.
>
> This was not a permission issue, I already gave it read/write permission.
>
> admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
> -bash: power: Permission denied
> admin@node-4:/sys/bus/pci/slots/14$ sudo chmod 0666 power
> admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
> echo: write error: Operation not permitted
> admin@node-4:/sys/bus/pci/slots/14$
>
>> This is from a "Link up" situation (DLActive+), it would be more
>> interesting to get lspci output of the port in a "No link" situation.
>
> Unfortunately, I did not collect that output before system reboot.
>
> On Tue, 12 Dec 2023 at 16:29, Lukas Wunner <lukas@wunner.de> wrote:
>>
>> On Tue, Dec 12, 2023 at 04:04:41PM +0530, Ashutosh Sharma wrote:
>>> Removed one NVMe drive (pci address 0000:83:00.0), it got unbound
>>> successfully from "vfio-pci" driver but saw below error in the syslog.
>>>
>>> can't change power state from D0 to D3hot (config space inaccessible)
>>
>> This is normal, the drive's config space is inaccessible after removal.
>>
Was the removal a "surprise" removal? Or you mean it was by using
'remove' sysfs file?
IIRC surprise removal will need platform firmware support to handle it
properly.
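For comparison, an orderly removal through sysfs might look roughly like the sketch below. It only prints the commands instead of running them. The function name is made up, and it assumes that the vfio user releases the device when asked and that the slot exposes a "power" attribute as shown in the original report:

```shell
# Hypothetical helper: print the commands for an orderly (non-surprise) removal.
# $1 = device address (e.g. 0000:83:00.0), $2 = slot number (e.g. 14).
orderly_remove_cmds() {
    # Step 1: ask the kernel to remove the device so the driver can unbind cleanly.
    printf 'echo 1 | sudo tee /sys/bus/pci/devices/%s/remove\n' "$1"
    # Step 2: power the slot off before physically pulling the drive.
    printf 'echo 0 | sudo tee /sys/bus/pci/slots/%s/power\n' "$2"
}
```

Running `orderly_remove_cmds 0000:83:00.0 14` prints the two commands for review before anything is written to sysfs.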
* Re: PCI device hot insert is not detected
From: Alex Williamson @ 2023-12-12 19:07 UTC (permalink / raw)
To: Mario Limonciello
Cc: Ashutosh Sharma, Lukas Wunner, linux-pci, helgaas, dwmw2,
yi.l.liu, majosaheb, cohuck, zhenzhong.duan, Smita Koralahalli,
Yazen Ghannam
On Tue, 12 Dec 2023 12:29:13 -0600
Mario Limonciello <mario.limonciello@amd.com> wrote:
> On 12/12/2023 05:32, Ashutosh Sharma wrote:
> >> This doesn't work, try "echo 1 | sudo tee power" instead.
> >
> > This was not a permission issue, I already gave it read/write permission.
> >
> > admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
> > -bash: power: Permission denied
> > admin@node-4:/sys/bus/pci/slots/14$ sudo chmod 0666 power
> > admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
> > echo: write error: Operation not permitted
> > admin@node-4:/sys/bus/pci/slots/14$
> >
> >> This is from a "Link up" situation (DLActive+), it would be more
> >> interesting to get lspci output of the port in a "No link" situation.
> >
> > Unfortunately, I did not collect that output before system reboot.
> >
> > On Tue, 12 Dec 2023 at 16:29, Lukas Wunner <lukas@wunner.de> wrote:
> >>
> >> On Tue, Dec 12, 2023 at 04:04:41PM +0530, Ashutosh Sharma wrote:
> >>> Removed one NVMe drive (pci address 0000:83:00.0), it got unbound
> >>> successfully from "vfio-pci" driver but saw below error in the syslog.
> >>>
> >>> can't change power state from D0 to D3hot (config space inaccessible)
> >>
> >> This is normal, the drive's config space is inaccessible after removal.
> >>
>
> Was the removal a "surprise" removal? Or you mean it was by using
> 'remove' sysfs file?
>
> IIRC surprise removal will need platform firmware support to handle it
> properly.
The vfio-pci driver also makes zero claims about supporting surprise
removal, you'll likely end up in an inconsistent state. Thanks,
Alex
* Re: PCI device hot insert is not detected
From: Ashutosh Sharma @ 2023-12-13 4:57 UTC (permalink / raw)
To: Alex Williamson
Cc: Mario Limonciello, Lukas Wunner, linux-pci, helgaas, dwmw2,
yi.l.liu, majosaheb, cohuck, zhenzhong.duan, Smita Koralahalli,
Yazen Ghannam
> > Was the removal a "surprise" removal? Or you mean it was by using
> > 'remove' sysfs file?
> >
> > IIRC surprise removal will need platform firmware support to handle it
> > properly.
Yes, it was a surprise removal.