All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: Jinpu Wang <jinpu.wang@ionos.com>
Cc: netdev <netdev@vger.kernel.org>,
	Michael Chan <michael.chan@broadcom.com>
Subject: Re: [RFC] bnxt_en TX timeout detected, starting reset task, flapping link after
Date: Wed, 16 Aug 2023 20:01:20 -0700	[thread overview]
Message-ID: <20230816200120.603cc65a@kernel.org> (raw)
In-Reply-To: <CAMGffEm-OUf0UqeqCbcWM8j1q1EaSObEUGFPZqFsH4sKGkis8g@mail.gmail.com>

On Wed, 16 Aug 2023 20:51:25 +0200 Jinpu Wang wrote:
> Hi Michael, and folks on the list.

It seems you meant to CC Michael.. adding him now.
I don't recall anything like this. Could be a bad system...

> We hit two case on two server but same kind of configuration with
> following symptom,
> error started with:
> kern.err: Aug 15 12:21:39 ps502b-104 kernel: [325978.631877] bnxt_en
> 0000:45:00.0 eth0: TX timeout detected, starting reset task!
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251006] bnxt_en
> 0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251009] bnxt_en
> 0000:45:00.0 eth0: [0]: rx{fw_ring: 1 prod: 135} rx_agg{fw_ring: 9
> agg_prod: 31f sw_agg_prod: 31f}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251010] bnxt_en
> 0000:45:00.0 eth0: [0]: cp{fw_ring: 0 raw_cons: 5a498f}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251012] bnxt_en
> 0000:45:00.0 eth0: [1]: tx{fw_ring: 1 prod: 190 cons: 190}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251013] bnxt_en
> 0000:45:00.0 eth0: [1]: rx{fw_ring: 2 prod: ae} rx_agg{fw_ring: 10
> agg_prod: cb sw_agg_prod: cb}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251014] bnxt_en
> 0000:45:00.0 eth0: [1]: cp{fw_ring: 16 raw_cons: 644dda}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251015] bnxt_en
> 0000:45:00.0 eth0: [2]: tx{fw_ring: 2 prod: af cons: 9b}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251016] bnxt_en
> 0000:45:00.0 eth0: [2]: rx{fw_ring: 3 prod: 1b2} rx_agg{fw_ring: 11
> agg_prod: 41c sw_agg_prod: 41c}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251017] bnxt_en
> 0000:45:00.0 eth0: [2]: cp{fw_ring: 17 raw_cons: 517b28}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251018] bnxt_en
> 0000:45:00.0 eth0: [3]: tx{fw_ring: 3 prod: 5e cons: 5e}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251020] bnxt_en
> 0000:45:00.0 eth0: [3]: rx{fw_ring: 4 prod: f8} rx_agg{fw_ring: 12
> agg_prod: 19d sw_agg_prod: 19d}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251021] bnxt_en
> 0000:45:00.0 eth0: [3]: cp{fw_ring: 18 raw_cons: 5283a5}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251022] bnxt_en
> 0000:45:00.0 eth0: [4]: tx{fw_ring: 4 prod: d4 cons: d2}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251023] bnxt_en
> 0000:45:00.0 eth0: [4]: rx{fw_ring: 5 prod: 185} rx_agg{fw_ring: 13
> agg_prod: 34f sw_agg_prod: 34f}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251024] bnxt_en
> 0000:45:00.0 eth0: [4]: cp{fw_ring: 19 raw_cons: 4dc622}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251024] bnxt_en
> 0000:45:00.0 eth0: [5]: tx{fw_ring: 5 prod: fd cons: fd}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251026] bnxt_en
> 0000:45:00.0 eth0: [5]: rx{fw_ring: 6 prod: 177} rx_agg{fw_ring: 14
> agg_prod: 47 sw_agg_prod: 47}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251026] bnxt_en
> 0000:45:00.0 eth0: [5]: cp{fw_ring: 20 raw_cons: 770efd}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251027] bnxt_en
> 0000:45:00.0 eth0: [6]: tx{fw_ring: 6 prod: 63 cons: 120}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251028] bnxt_en
> 0000:45:00.0 eth0: [6]: rx{fw_ring: 7 prod: 77} rx_agg{fw_ring: 15
> agg_prod: 7aa sw_agg_prod: 7aa}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251029] bnxt_en
> 0000:45:00.0 eth0: [6]: cp{fw_ring: 21 raw_cons: 1a42064}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251030] bnxt_en
> 0000:45:00.0 eth0: [7]: tx{fw_ring: 7 prod: 179 cons: 179}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251031] bnxt_en
> 0000:45:00.0 eth0: [7]: rx{fw_ring: 8 prod: 1f8} rx_agg{fw_ring: 16
> agg_prod: 785 sw_agg_prod: 786}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251032] bnxt_en
> 0000:45:00.0 eth0: [7]: cp{fw_ring: 22 raw_cons: 9fd6e8}
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326018.695007] bnxt_en
> 0000:45:00.0 eth0: TX timeout detected, starting reset task!
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.874938] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.884991] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326020.749461] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326020.759434] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326021.623141] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326021.633040] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326022.495977] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326022.505635] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.368700] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.378155] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.423938] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44628
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.433163] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44636
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.442137] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44634
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.450759] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44632
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.459092] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44630
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326024.323820] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326024.331783] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326025.194675] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326025.202706] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.066879] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.074397] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.937134] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.944341] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326027.806175] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326027.813006] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326028.676645] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326028.683506] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.595581] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44644
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.602540] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44642
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.609230] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44650
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.615772] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44640
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.622094] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44648
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.628320] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44646
> 
> it repeated for hours, until hard reboot the machine.
> In another cases, even after reboot once there is traffic the error
> occured again. we have to disable offload with ethtool, and then the
> system stabilizes.
> 
> sudo ethtool -K eth0 rx off tx off
> Actual changes:
> tx-checksum-ipv4: off
> tx-checksum-ipv6: off
> tx-tcp-segmentation: off [not requested]
> tx-tcp6-segmentation: off [not requested]
> rx-checksum: off
> rx-gro-hw: off [not requested]
> 
> Our env:
> sudo ethtool -i eth0
> driver: bnxt_en
> version: 5.10.136-pserver
> firmware-version: 218.0.153.0/pkg 218.0.169.0
> expansion-rom-version:
> bus-info: 0000:45:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
> 
> sudo lspci  -vvv -s 45:00.0
> 45:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416
> NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
> DeviceName: Broadcom 10G Ethernet #1
> Subsystem: Super Micro Computer Inc BCM57416 NetXtreme-E Dual-Media
> 10G RDMA Ethernet Controller
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 64
> Region 0: Memory at 380d0110000 (64-bit, prefetchable) [size=64K]
> Region 2: Memory at 380d0000000 (64-bit, prefetchable) [size=1M]
> Region 4: Memory at 380d0122000 (64-bit, prefetchable) [size=8K]
> Expansion ROM at b3b40000 [disabled] [size=256K]
> Capabilities: [48] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
> Address: 0000000000000000  Data: 0000
> Capabilities: [a0] MSI-X: Enable+ Count=255 Masked-
> Vector table: BAR=4 offset=00000000
> PBA: BAR=4 offset=00000ff0
> Capabilities: [ac] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
> MaxPayload 512 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
> ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s (ok), Width x8 (ok)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
> 10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> FRS- TPHComp- ExtTPHComp-
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
> AtomicOpsCtl: ReqEn-
> LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+
> EqualizationPhase1+
> EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
> Retimer- 2Retimers- CrosslinkRes: unsupported
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> MalfTLP- ECRC- UnsupReq+ ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> MalfTLP- ECRC- UnsupReq+ ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+
> MalfTLP+ ECRC+ UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 04008001 4000200f 45020000 00000000
> Capabilities: [13c v1] Device Serial Number 3c-ec-ef-ff-fe-91-5c-92
> Capabilities: [150 v1] Power Budgeting <?>
> Capabilities: [160 v1] Virtual Channel
> Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> Arb: Fixed- WRR32- WRR64- WRR128-
> Ctrl: ArbSelect=Fixed
> Status: InProgress-
> VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> Status: NegoPending- InProgress-
> Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
> Capabilities: [1b0 v1] Latency Tolerance Reporting
> Max snoop latency: 1048576ns
> Max no snoop latency: 1048576ns
> Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 1
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [230 v1] Transaction Processing Hints
> Interrupt vector mode supported
> Device specific mode supported
> Steering table in MSI-X table
> Capabilities: [300 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> LaneErrStat: 0
> Capabilities: [200 v1] Precision Time Measurement
> PTMCap: Requester:+ Responder:- Root:-
> PTMClockGranularity: Unimplemented
> PTMControl: Enabled:- RootSelected:-
> PTMEffectiveGranularity: Unknown
> Kernel driver in use: bnxt_en
> Kernel modules: bnxt_en
> 
> 
> 
> 
> I checked git history, but can't find any bugfix related to it. The
> internet tells me it could be a
> firmware bug, but I can't find firmware from Broadcom site or supermicro site.
> 
> Can you please give me some suggestions?
> 
> Thx!
> Jinpu Wang @ IONOS Cloud


  reply	other threads:[~2023-08-17  3:01 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-16 18:51 [RFC] bnxt_en TX timeout detected, starting reset task, flapping link after Jinpu Wang
2023-08-17  3:01 ` Jakub Kicinski [this message]
2023-08-17  7:08   ` Michael Chan
2023-08-17  7:25     ` Jinpu Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230816200120.603cc65a@kernel.org \
    --to=kuba@kernel.org \
    --cc=jinpu.wang@ionos.com \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.