From: Bjorn Helgaas <helgaas@kernel.org>
To: Changhui Zhong <czhong@redhat.com>
Cc: linux-pci@vger.kernel.org
Subject: Re: [bug report] WARNING: CPU: 0 PID: 226 at drivers/pci/pci.c:2236 pci_disable_device+0xf4/0x100
Date: Tue, 19 Mar 2024 11:23:34 -0500
Message-ID: <20240319162334.GA1230451@bhelgaas>
In-Reply-To: <CAGVVp+WyM-ce=c1L4p2EZfvLyxYZSHFkxKLad1TXXyNdVn1KYg@mail.gmail.com>
On Tue, Mar 19, 2024 at 03:34:56PM +0800, Changhui Zhong wrote:
> Hello,
>
> repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> branch: master
> commit HEAD:b3603fcb79b1036acae10602bffc4855a4b9af80
Where's the rest of this? I don't see "WARNING: CPU: 0 PID: 226 at
drivers/pci/pci.c:2236" in the snippet below. Please include or post
the complete dmesg log.

Is this reproducible? If so, how? And is it a regression?
> dmesg log:
> Rebooting.
> [ 292.644951] {1}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 5
> [ 292.644955] {1}[Hardware Error]: event severity: fatal
> [ 292.644958] {1}[Hardware Error]: Error 0, type: fatal
> [ 292.644959] {1}[Hardware Error]: section_type: PCIe error
> [ 292.644960] {1}[Hardware Error]: port_type: 0, PCIe end point
> [ 292.644962] {1}[Hardware Error]: version: 3.0
> [ 292.644963] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> [ 292.644964] {1}[Hardware Error]: device_id: 0000:01:00.1
> [ 292.644966] {1}[Hardware Error]: slot: 0
> [ 292.644967] {1}[Hardware Error]: secondary_bus: 0x00
> [ 292.644968] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
> [ 292.644969] {1}[Hardware Error]: class_code: 020000
> [ 292.644971] {1}[Hardware Error]: aer_uncor_status: 0x00100000,
> aer_uncor_mask: 0x00010000
> [ 292.644972] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
> [ 292.644973] {1}[Hardware Error]: TLP Header: 40000001 0000020f
> 90028090 00000000
aer_uncor_status 0x00100000 looks like bit 20, Unsupported Request.
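
As a cross-check on the bit-20 reading, here is a minimal sketch that maps set bits in an AER uncorrectable status, mask, or severity value to names. The bit positions are the commonly documented ones from the AER Uncorrectable Error Status register layout; the table is deliberately partial, and the names `AER_UNCOR_BITS` and `decode_aer_uncor` are made up for this illustration.

```python
# Partial bit-to-name table for the AER Uncorrectable Error Status,
# Mask, and Severity registers (they share one layout).
# Illustrative only: newer or less common bits are omitted.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_uncor(value):
    """Return the names of the known bits set in value, lowest bit first."""
    return [
        name
        for bit, name in sorted(AER_UNCOR_BITS.items())
        if value & (1 << bit)
    ]


# Values taken from the quoted log.
print(decode_aer_uncor(0x00100000))  # aer_uncor_status
print(decode_aer_uncor(0x00010000))  # aer_uncor_mask
```

With the values from the log, the status decodes to Unsupported Request only, and the mask decodes to Unexpected Completion, which agrees with the "UnxCmplt+" flag on the UEMsk line of the lspci output.
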

If I decoded it correctly, the TLP log says:

  40000001: 0100 ... 0001
    Fmt           010      3 DW header with data (PCIe r6.0, sec 2.2.1.1)
    Type          0 0000   Memory Write
    Length        1        1 DW

  0000020f (sec 2.2.7.1)
    Requester ID  0000
    Tag           2
    First DW BE   f        32-bit write

  90028090
    Address       90028090

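
The same decode can be written as a short script. This is only a sketch: it assumes a 3 DW memory request header with a 32-bit address, with the header log dwords in the order the kernel prints them (header byte 0 in the top byte of the first dword). The function name `decode_mem_tlp_header` is hypothetical.

```python
def decode_mem_tlp_header(dw0, dw1, dw2):
    """Decode a 3 DW memory request header (32-bit addressing assumed)."""
    length = dw0 & 0x3FF
    return {
        "fmt": (dw0 >> 29) & 0x7,        # 0b010 = 3 DW header with data
        "type": (dw0 >> 24) & 0x1F,      # 0b00000 = memory request
        "length_dw": length or 1024,     # a Length field of 0 means 1024 DW
        "requester_id": (dw1 >> 16) & 0xFFFF,
        "tag": (dw1 >> 8) & 0xFF,
        "last_dw_be": (dw1 >> 4) & 0xF,
        "first_dw_be": dw1 & 0xF,
        "address": dw2 & 0xFFFFFFFC,     # low two bits are not address bits
    }


# Header log from the report: 40000001 0000020f 90028090 00000000
hdr = decode_mem_tlp_header(0x40000001, 0x0000020F, 0x90028090)
for field, value in hdr.items():
    print(f"{field:13s} {value:#x}")
```

Fmt 0b010 together with Type 0b00000 is a Memory Write, Length is 1 DW, and First DW BE 0xf with Last DW BE 0 means all four bytes of that single dword are written.
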
I don't see 0x90028090 as a BAR value in the lspci output below,
although we don't have any information about possible address
translation (that would be in the dmesg log or "lspci -b" output).

But it *looks* like an MMIO write that got routed to 01:00.1 (the
bridge window configuration in the dmesg log would confirm this),
and 01:00.1 said "I don't know about this address" because it
doesn't match any of its BARs, and logged a UR error.
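
To make the BAR comparison concrete, here is a small sketch that tests the logged address against the regions lspci reports for 01:00.1. It assumes the lspci addresses are directly comparable to the address in the TLP (no address translation between CPU and bus addresses), which the report does not confirm. `REGIONS` and `matching_regions` are names invented for this example.

```python
# Regions copied from the "lspci -vvv -s 01:00.1" output in the report.
# The expansion ROM is listed as [disabled], so it would not claim
# the write even if the address fell inside it.
REGIONS = {
    "Region 0": (0x92900000, 64 * 1024),
    "Region 2": (0x92910000, 64 * 1024),
    "Region 4": (0x92920000, 64 * 1024),
    "Expansion ROM (disabled)": (0x90040000, 256 * 1024),
}


def matching_regions(addr, regions):
    """Return the names of the regions whose [base, base + size) range holds addr."""
    return [
        name
        for name, (base, size) in regions.items()
        if base <= addr < base + size
    ]


print(matching_regions(0x90028090, REGIONS))  # prints []
```

The empty result matches the observation that 0x90028090 lies in none of the device's BARs, so a device receiving that write would be expected to treat it as an Unsupported Request.
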
> [ 292.644976] Kernel panic - not syncing: Fatal hardware error!
> [ 292.644978] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0+ #1
> [ 292.644981] Hardware name: Dell Inc. PowerEdge R640/0X45NX, BIOS
> 2.19.1 06/04/2023
> [ 292.644982] Call Trace:
> [ 292.644984] <NMI>
> [ 292.644985] panic+0x32b/0x350
> [ 292.644995] __ghes_panic+0x69/0x70
> [ 292.645000] ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2b0
> [ 292.645005] ghes_notify_nmi+0x59/0xd0
> [ 292.645007] nmi_handle+0x5b/0x150
> [ 292.645014] default_do_nmi+0x40/0x100
> [ 292.645017] exc_nmi+0x100/0x180
> [ 292.645019] end_repeat_nmi+0xf/0x53
> [ 292.645023] RIP: 0010:intel_idle+0x59/0xa0
> [ 292.645028] Code: d2 48 89 d1 65 48 8b 05 55 21 73 70 0f 01 c8 48
> 8b 00 a8 08 75 14 66 90 0f 00 2d 2e 00 43 00 b9 01 00 00 00 48 89 f0
> 0f 01 c9 <65> 48 8b 05 2f 21 73 70 f0 80 60 02 df f0 83 44 24 fc 00 48
> 8b 00
> [ 292.645030] RSP: 0018:ffffffff90403e48 EFLAGS: 00000046
> [ 292.645032] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
> [ 292.645034] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff93d22fa3ffa0
> [ 292.645035] RBP: ffff93d22fa3ffa0 R08: 0000000000000002 R09: 00000000fffffffd
> [ 292.645036] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff908bbf60
> [ 292.645037] R13: ffffffff908bc048 R14: 0000000000000002 R15: 0000000000000000
> [ 292.645040] ? intel_idle+0x59/0xa0
> [ 292.645043] ? intel_idle+0x59/0xa0
> [ 292.645046] </NMI>
> [ 292.645046] <TASK>
> [ 292.645047] cpuidle_enter_state+0x7d/0x410
> [ 292.645050] cpuidle_enter+0x29/0x40
> [ 292.645054] cpuidle_idle_call+0xf8/0x160
> [ 292.645060] do_idle+0x7a/0xe0
> [ 292.645062] cpu_startup_entry+0x25/0x30
> [ 292.645065] rest_init+0xcc/0xd0
> [ 292.645068] start_kernel+0x325/0x400
> [ 292.645072] x86_64_start_reservations+0x14/0x30
> [ 292.645076] x86_64_start_kernel+0xed/0xf0
> [ 292.645079] common_startup_64+0x13e/0x141
> [ 292.645084] </TASK>
> [ 292.645101] Kernel Offset: 0xdc00000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
>
> # lspci -nn -s 01:00.1
> 01:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries
> NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
>
> # lspci -vvv -s 01:00.1
> 01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> BCM5720 Gigabit Ethernet PCIe
> DeviceName: NIC4
> Subsystem: Broadcom Inc. and subsidiaries Device 4160
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin B routed to IRQ 17
> NUMA node: 0
> Region 0: Memory at 92900000 (64-bit, prefetchable) [size=64K]
> Region 2: Memory at 92910000 (64-bit, prefetchable) [size=64K]
> Region 4: Memory at 92920000 (64-bit, prefetchable) [size=64K]
> Expansion ROM at 90040000 [disabled] [size=256K]
> Capabilities: [48] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] Vital Product Data
> Product Name: Broadcom NetXtreme Gigabit Ethernet
> Read-only fields:
> [PN] Part number: BCM95720
> [MN] Manufacture ID: 1028
> [V0] Vendor specific: FFV22.61.8
> [V1] Vendor specific: DSV1028VPDR.VER1.0
> [V2] Vendor specific: NPY2
> [V3] Vendor specific: PMT1
> [V4] Vendor specific: NMVBroadcom Corp
> [V5] Vendor specific: DTINIC
> [V6] Vendor specific: DCM3001008d454101008d45
> [RV] Reserved: checksum good, 233 byte(s) reserved
> End
> Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
> Vector table: BAR=4 offset=00000000
> PBA: BAR=4 offset=00001000
> Capabilities: [ac] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
> <4us, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+
> FLReset+ SlotPowerLimit 25.000W
> DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop- FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> AuxPwr+ TransPend-
> LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM not supported
> ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 5GT/s (ok), Width x2 (ok)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
> NROPrPrP- LTR-
> 10BitTagComp- 10BitTagReq- OBFF Not
> Supported, ExtFmt- EETLPPrefix-
> EmergencyPowerReduction Not Supported,
> EmergencyPowerReductionInit-
> FRS- TPHComp- ExtTPHComp-
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 65ms to 210ms,
> TimeoutDis- LTR- OBFF Disabled,
> AtomicOpsCtl: ReqEn-
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete- EqualizationPhase1-
> EqualizationPhase2- EqualizationPhase3-
> LinkEqualizationRequest-
> Retimer- 2Retimers- CrosslinkRes: unsupported
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
> UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+
> UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> AdvNonFatalErr+
> CEMsk: RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+
> AdvNonFatalErr+
> AERCap: First Error Pointer: 00, ECRCGenCap+
> ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 40000001 0000020f 90028090 00000000
> Capabilities: [13c v1] Device Serial Number 00-00-e4-3d-1a-3c-8b-bb
> Capabilities: [150 v1] Power Budgeting <?>
> Capabilities: [160 v1] Virtual Channel
> Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> Arb: Fixed- WRR32- WRR64- WRR128-
> Ctrl: ArbSelect=Fixed
> Status: InProgress-
> VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
> Status: NegoPending- InProgress-
> Kernel driver in use: tg3
> Kernel modules: tg3
>
> Thanks,
>