* megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
@ 2014-07-12 11:56 Robin H. Johnson
  2014-07-12 17:29 ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Robin H. Johnson @ 2014-07-12 11:56 UTC (permalink / raw)
  To: aradford, megaraidlinux; +Cc: linux-scsi, arkadiusz.bubala, bhelgaas

[-- Attachment #1: Type: text/plain, Size: 8870 bytes --]

TL;DR LSI2208 card faults out and does not bring up drives in Linux. In BIOS works fine.
Driver has no debug interfaces visible in code for early startup.

Hardware: Supermicro SSG-6027R-E1R12T
http://www.supermicro.com/products/system/2U/6027/SSG-6027R-E1R12T.cfm
Motherboard is X9DRH-7TF
Contains an LSI2208 controller (megaraid_sas), which is this bug.

I also have an LSI2008 (mpt2sas) card in a PCIe slot for accessing an external
tape library, that works fine [it's in CPU2-SLOT6, PCIe v3 x8].

01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
(full lspci output further down)

Whenever the megaraid_sas module loads, it fails out :-(.
[   14.188561] megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014
[   14.188577] megasas: 0x1000:0x005b:0x15d9:0x0690: bus 1:slot 0:func 0
[   14.188584] megaraid_sas 0000:01:00.0: enabling device (0000 -> 0002)
[   14.188735] megasas: Waiting for FW to come to ready state
[   14.193999] megasas: FW in FAULT state!!
[   14.194003] megaraid_sas 0000:01:00.0: megasas: FW restarted successfully from megasas_init_fw!
[   44.210482] megasas: Waiting for FW to come to ready state
[   44.210484] megasas: FW in FAULT state!!

During boots of the system, it DOES cleanly probe the drives (6x ST32000641AS),
and has them assembled into RAID6.

The problem occurs in all of these kernels:
Ubuntu 3.13.11.2 (3.13.0-30.55-generic)
Vanilla 3.14.5
Ubuntu 3.16.0-rc4 (3.16.0-3.8~14.10-generic sic) from ppa:canonical-kernel-team/ppa
(quite willing to build custom kernels for testing, I just had these on hand
for quick reboots).

If you Google around for the problem, there were claims that it's related to
bug BKO63661 (https://bugzilla.kernel.org/show_bug.cgi?id=63661), amongst other things, suggesting the following workarounds:
pci=conf1
pcie_aspm=off
disable_msi=1
None of which have any effect.

# lspci -nn -d 1000: -vvxxx
01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
    Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB [15d9:0690]
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 16
    Region 0: I/O ports at 8000 [disabled] [size=256]
    Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
    Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
    Expansion ROM at dfe40000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
    Capabilities: [d0] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
        Not readable
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
        Vector table: BAR=1 offset=00002000
        PBA: BAR=1 offset=00003000
00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
    Subsystem: Dell 6Gbps SAS HBA Adapter [1028:1f1c]
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 11
    Region 0: I/O ports at f000 [disabled] [size=256]
    Region 1: Memory at fbe40000 (64-bit, non-prefetchable) [disabled] [size=64K]
    Region 3: Memory at fbe00000 (64-bit, non-prefetchable) [disabled] [size=256K]
    Expansion ROM at fbd00000 [disabled] [size=1M]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [d0] Vital Product Data
        Unknown small resource type 00, will not decode more.
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [c0] MSI-X: Enable- Count=15 Masked-
        Vector table: BAR=1 offset=0000e000
        PBA: BAR=1 offset=0000f800
00: 00 10 72 00 00 00 10 00 03 00 07 01 10 00 00 00
10: 01 f0 00 00 04 00 e4 fb 00 00 00 00 04 00 e0 fb
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 1c 1f
30: 00 00 d0 fb 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 82 00 00 10 d0 02 00 25 80 00 10
70: 20 28 09 00 82 04 00 00 40 00 82 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0e 00 01 e0 00 00 01 f8 00 00 00 00 00 00
d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 460 bytes --]

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2014-07-12 11:56 megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661] Robin H. Johnson
@ 2014-07-12 17:29 ` Bjorn Helgaas
  2014-07-13  1:35   ` Robin H. Johnson
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2014-07-12 17:29 UTC (permalink / raw)
  To: Robin H. Johnson
  Cc: Adam Radford, Neela Syam Kolli, linux-scsi@vger.kernel.org,
      arkadiusz.bubala, Matthew Garrett

[+cc Matthew]

On Sat, Jul 12, 2014 at 5:56 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> TL;DR LSI2208 card faults out and does not bring up drives in Linux. In BIOS works fine.
> Driver has no debug interfaces visible in code for early startup.
>
> Hardware: Supermicro SSG-6027R-E1R12T
> http://www.supermicro.com/products/system/2U/6027/SSG-6027R-E1R12T.cfm
> Motherboard is X9DRH-7TF
> Contains an LSI2208 controller (megaraid_sas), which is this bug.
>
> I also have an LSI2008 (mpt2sas) card in a PCIe slot for accessing an external
> tape library, that works fine [it's in CPU2-SLOT6, PCIe v3 x8].
>
> 01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
> 82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
> (full lspci output further down)
>
> Whenever the megaraid_sas module loads, it fails out :-(.
> [   14.188561] megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014
> [   14.188577] megasas: 0x1000:0x005b:0x15d9:0x0690: bus 1:slot 0:func 0
> [   14.188584] megaraid_sas 0000:01:00.0: enabling device (0000 -> 0002)
> [   14.188735] megasas: Waiting for FW to come to ready state
> [   14.193999] megasas: FW in FAULT state!!
> [   14.194003] megaraid_sas 0000:01:00.0: megasas: FW restarted successfully from megasas_init_fw!
> [   44.210482] megasas: Waiting for FW to come to ready state
> [   44.210484] megasas: FW in FAULT state!!
>
> During boots of the system, it DOES cleanly probe the drives (6x ST32000641AS),
> and has them assembled into RAID6.
>
> The problem occurs in all of these kernels:
> Ubuntu 3.13.11.2 (3.13.0-30.55-generic)
> Vanilla 3.14.5
> Ubuntu 3.16.0-rc4 (3.16.0-3.8~14.10-generic sic) from ppa:canonical-kernel-team/ppa
> (quite willing to build custom kernels for testing, I just had these on hand
> for quick reboots).
>
> If you Google around for the problem, there were claims that it's related to
> bug BKO63661 (https://bugzilla.kernel.org/show_bug.cgi?id=63661), amongst other things, suggesting the following workarounds:
> pci=conf1
> pcie_aspm=off
> disable_msi=1
> None of which have any effect.

Thanks for the report, Robin.

https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the problem
to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in
v3.3. For starters, can you verify that, e.g., by building
69166fbf02c7 (the parent of 3c076351c402) to make sure that it works,
and building 3c076351c402 itself to make sure it fails?

Assuming that's the case, please attach the complete dmesg and "lspci
-vvxxx" output for both kernels to the bugzilla. ASPM is a feature
that is configured on both ends of a PCIe link, so I want to see the
lspci info for the whole system, not just the SAS adapters.

It's not practical to revert 3c076351c402 now, so I'd also like to see
the same information for the newest possible kernel (if this is
possible; I'm not clear on whether you can boot your system or not) so
we can figure out what needs to be changed.

Bjorn

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2014-07-12 17:29 ` Bjorn Helgaas
@ 2014-07-13  1:35   ` Robin H. Johnson
  2015-04-29 17:28     ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Robin H. Johnson @ 2014-07-13 1:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin H. Johnson, Adam Radford, Neela Syam Kolli,
      linux-scsi@vger.kernel.org, arkadiusz.bubala, Matthew Garrett

[-- Attachment #1: Type: text/plain, Size: 13262 bytes --]

On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
> Thanks for the report, Robin.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the problem
> to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in
> v3.3. For starters, can you verify that, e.g., by building
> 69166fbf02c7 (the parent of 3c076351c402) to make sure that it works,
> and building 3c076351c402 itself to make sure it fails?
>
> Assuming that's the case, please attach the complete dmesg and "lspci
> -vvxxx" output for both kernels to the bugzilla. ASPM is a feature
> that is configured on both ends of a PCIe link, so I want to see the
> lspci info for the whole system, not just the SAS adapters.
>
> It's not practical to revert 3c076351c402 now, so I'd also like to see
> the same information for the newest possible kernel (if this is
> possible; I'm not clear on whether you can boot your system or not) so
> we can figure out what needs to be changed.
TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it fails to
start; Commit 3c076351c402 did make it worse, but I think we're right that the
bug lies in the SAS code.

Ok, I have done more testing on it (40+ boots), and I think we can show the
problem is somewhere in how the BIOS/EFI/ROM brings up the card in FastBoot
mode, or how it leaves the card.

Full boot of the system was difficult on the 3.2 kernels; they didn't make it
to userspace because other parts of the system were too new for them.
For testing, I compiled CONFIG_MEGARAID_SAS=y on 3.2, and =m on 3.16-rc4; that
way when the initramfs & userspace failed, the megaraid load was captured over
IPMI serial. I've done a lot of the analysis below while capturing.

I was going to be booting many times, so I flipped the 'Fast Boot' option back
to Disabled, so I could more easily get to the BIOS settings to change options
while testing. When I did so, an accidental boot on a kernel that previously
failed suddenly worked, leading me to raise an eyebrow, and this expanded my
test matrix more.

3 kernels, 6 different BIOS config combinations (2x3) = 18 test cases
Each configuration was booted at least twice; if the result of two boots was
not identical, I booted a third time and took the majority result.
All kernels had no boot params involving PCI specified (none of pci=, pcie*=,
disable_msi*).

Kernels:
K.1: Ubuntu's 3.16-rc4
K.2: 3.2-rc4 3c076351c402 - aspm merged
K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8

BIOS: Boot -> FastBoot:
B1.1 Off
B1.2 On (CMOS reset default)

BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
B2.1 Force L0s
B2.2 BIOS (CMOS reset default)
B2.3 Disabled

Reduced Karnaugh map of results:
Kernels,B1,B2:    Result
*,   B1.1, *      PASS
*,   B1.2, B2.1   VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
K.1, B1.2, B2.2   FAIL
K.1, B1.2, B2.3   FAIL
K.2, B1.2, B2.2   FAIL
K.2, B1.2, B2.3   FAIL
K.3, B1.2, B2.2   PASS
K.3, B1.2, B2.3   PASS

Here's the DMI info:
Motherboard: X9DRH-7TF/7F/iTF/iF
Version: 3.0b
Release Date: 04/28/2014

Recall also I said I had two LSI cards in here?
SAS2008 (in a slot) and SAS2208 (onboard)
Regardless of the BIOS settings, the SAS2008 card continues to work; even when
its IO region0 is marked as disabled.

So is there some other initialization work needed on the SAS2208 card so that
it works in all cases?
The case of FastBoot=on, ASPM=ForceL0s is the interesting one, and the lspci outputs compare nicely; The only trimming to the diff below is to remove the context of other devices (no changes). This does also look functionally identical between 3c076351c402 and 69166fbf02c7. Full lspci & dmesg for the working+broken 3.16-rc4 boots attaches. -lspci.1405201451.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, working +lspci.1405201693.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, broken # diff -Nar lspci.1405201451.ASPM=L0s.FastBoot.no.kparams lspci.1405201693.ASPM=L0s.FastBoot.no.kparams -I '^[0-9a-f][0-9a-f]:' -F rev -U15 --- lspci.1405201451.ASPM=L0s.FastBoot.no.kparams 2014-07-12 21:44:11.243897367 +0000 +++ lspci.1405201693.ASPM=L0s.FastBoot.no.kparams 2014-07-12 21:48:13.866860888 +0000 @@ -1157,95 +1157,93 @@ 00:1f.6 Signal processing controller [11 (trim other device, no changes) 01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05) Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB [15d9:0690] - Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ + Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- - Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 - Region 0: I/O ports at 8000 [size=256] + Region 0: I/O ports at 8000 [disabled] [size=256] Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K] Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K] Expansion ROM at dfe40000 [disabled] [size=128K] Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 4096 bytes, PhantFunc 0, 
Latency L0s <64ns, L1 <1us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+ Capabilities: [d0] Vital Product Data - Unknown small resource type 00, will not decode more. 
+ Not readable Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 - Capabilities: [c0] MSI-X: Enable+ Count=16 Masked- + Capabilities: [c0] MSI-X: Enable- Count=16 Masked- Vector table: BAR=1 offset=00002000 PBA: BAR=1 offset=00003000 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [1e0 v1] #19 Capabilities: [1c0 v1] Power Budgeting <?> Capabilities: [190 v1] #16 Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI) ARICap: MFVC- ACS-, Next Function: 0 ARICtl: MFVC- ACS-, Function Group: 0 - Kernel driver in use: megaraid_sas -00: 00 10 5b 00 07 04 10 00 05 00 04 01 10 00 00 00 +00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00 10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df 20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06 30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00 60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10 70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00 80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00 90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 -c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00 -d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00 +c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00 +d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 (trim other device, no changes) @@ -3049,35 +3047,35 @@ 80:05.4 PIC [0800]: Intel Corporation Xe (trim other device, no changes) 82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03) Subsystem: Dell 6Gbps SAS HBA Adapter [1028:1f1c] - Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ + Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 56 - Region 0: I/O ports at f000 [size=256] + Region 0: I/O ports at f000 [disabled] [size=256] Region 1: Memory at fbe40000 (64-bit, non-prefetchable) [size=64K] Region 3: Memory at fbe00000 (64-bit, non-prefetchable) [size=256K] Expansion ROM at fbd00000 [disabled] [size=1M] Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, 
OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [d0] Vital Product Data
		Unknown small resource type 00, will not decode more.
	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
		Vector table: BAR=1 offset=0000e000
		PBA: BAR=1 offset=0000f800
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [138 v1] Power Budgeting <?>
	Kernel driver in use: mpt2sas
-00: 00 10 72 00 07 04 10 00 03 00 07 01 10 00 00 00
+00: 00 10 72 00 06 04 10 00 03 00 07 01 10 00 00 00
 10: 01 f0 00 00 04 00 e4 fb 00 00 00 00 04 00 e0 fb
 20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 1c 1f
 30: 00 00 d0 fb 50 00 00 00 00 00 00 00 0b 01 00 00
 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
 60: 00 00 00 00 00 82 00 00 10 d0 02 00 25 80 00 10
 70: 2f 28 09 00 82 04 00 00 40 00 82 10 00 00 00 00
 80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
 90: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
 a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 c0: 11 00 0e 80 01 e0 00 00 01 f8 00 00 00 00 00 00
 d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

(trim other device, no changes)

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

[-- Attachment #2: dmesg.1405201451.ASPM=L0s.FastBoot.no.kparams.WORKING.gz --]
[-- Type: application/octet-stream, Size: 22330 bytes --]

[-- Attachment #3: dmesg.1405201693.ASPM=L0s.FastBoot.no.kparams.BROKEN.gz --]
[-- Type: application/octet-stream, Size: 21816 bytes --]

[-- Attachment #4: lspci.1405201451.ASPM=L0s.FastBoot.no.kparams.WORKING.gz --]
[-- Type: application/octet-stream, Size: 16359 bytes --]

[-- Attachment #5: lspci.1405201693.ASPM=L0s.FastBoot.no.kparams.BROKEN.gz --]
[-- Type: application/octet-stream, Size: 16319 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread
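Since the thread revolves around ASPM state, it may help to see how the ASPM fields can be read straight out of a config-space dump like the one above. The following is a minimal sketch, assuming the standard PCIe capability layout (Link Capabilities at capability offset 0x0C, Link Control at 0x10); with the Express capability at 0x68 (row 60 shows `10 d0 02 00` there), those registers land at 0x74 and 0x78 in row 70 of the mpt2sas dump. The function names are illustrative only and are not kernel or pciutils APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row 0x70 of the mpt2sas config-space dump (offsets 0x70..0x7f). */
const uint8_t row70[16] = {
	0x2f, 0x28, 0x09, 0x00, 0x82, 0x04, 0x00, 0x00,
	0x40, 0x00, 0x82, 0x10, 0x00, 0x00, 0x00, 0x00,
};

/* Little-endian 16-bit read; off is relative to the start of the row. */
uint16_t rd16(const uint8_t *row, size_t off)
{
	assert(off + 2 <= 16);
	return (uint16_t)(row[off] | (row[off + 1] << 8));
}

/* Little-endian 32-bit read built from two 16-bit halves. */
uint32_t rd32(const uint8_t *row, size_t off)
{
	return (uint32_t)rd16(row, off) | ((uint32_t)rd16(row, off + 2) << 16);
}

/* Link Capabilities bits 11:10: ASPM states the device supports. */
unsigned lnkcap_aspm_support(uint32_t lnkcap)
{
	return (lnkcap >> 10) & 0x3;
}

/* Link Control bits 1:0: ASPM states currently enabled. */
unsigned lnkctl_aspm_control(uint16_t lnkctl)
{
	return lnkctl & 0x3;
}

/* Both fields share one encoding: bit 0 = L0s, bit 1 = L1. */
const char *aspm_name(unsigned field)
{
	static const char *const names[] = { "none", "L0s", "L1", "L0s L1" };

	assert(field < 4);
	return names[field];
}
```

Calling `lnkcap_aspm_support(rd32(row70, 4))` decodes the Link Capabilities dword at 0x74, and `lnkctl_aspm_control(rd16(row70, 8))` decodes the Link Control word at 0x78. A zero control field corresponds to what lspci prints as "ASPM Disabled".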
* Re: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2014-07-13  1:35   ` Robin H. Johnson
@ 2015-04-29 17:28     ` Bjorn Helgaas
  2015-05-28 12:24       ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2015-04-29 17:28 UTC (permalink / raw)
  To: Robin H. Johnson
  Cc: Adam Radford, Neela Syam Kolli, linux-scsi@vger.kernel.org, arkadiusz.bubala, Matthew Garrett, Kashyap Desai, Sumit Saxena, Uday Lingala, megaraidlinux.pdl, linux-pci, linux-kernel

[+cc linux-pci, linux-kernel, Kashyap, Sumit, Uday, megaraidlinux.pdl]

On Sun, Jul 13, 2014 at 01:35:51AM +0000, Robin H. Johnson wrote:
> On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
> > Thanks for the report, Robin.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the problem
> > to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in
> > v3.3.  For starters, can you verify that, e.g., by building
> > 69166fbf02c7 (the parent of 3c076351c402) to make sure that it works,
> > and building 3c076351c402 itself to make sure it fails?
> >
> > Assuming that's the case, please attach the complete dmesg and "lspci
> > -vvxxx" output for both kernels to the bugzilla.  ASPM is a feature
> > that is configured on both ends of a PCIe link, so I want to see the
> > lspci info for the whole system, not just the SAS adapters.
> >
> > It's not practical to revert 3c076351c402 now, so I'd also like to see
> > the same information for the newest possible kernel (if this is
> > possible; I'm not clear on whether you can boot your system or not) so
> > we can figure out what needs to be changed.
> TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it fails to
> start; Commit 3c076351c402 did make it worse, but I think we're right that the
> bug lies in the SAS code.
>
> Ok, I have done more testing on it (40+ boots), and I think we can show the
> problem is somewhere in how the BIOS/EFI/ROM brings up the card in FastBoot
> mode, or how it leaves the card.

I attached your dmesg and lspci logs to
https://bugzilla.kernel.org/show_bug.cgi?id=63661, thank you!  You did
a huge amount of excellent testing and analysis, and I'm sorry that we
haven't made progress using the results.

I still think this is a megaraid_sas driver bug, but I don't have
enough evidence to really point fingers.

Based on your testing, before 3c076351c402 ("PCI: Rework ASPM disable
code"), megaraid_sas worked reliably.  After 3c076351c402,
megaraid_sas does not work reliably when BIOS Fast Boot is enabled.

Fast Boot probably means we don't run the option ROM on the device.
Your dmesg logs show that in the working case, BIOS has enabled the
device.  In the failing case it has not.  They also show that when
Fast Boot is enabled, there's a little less MTRR write-protect space,
which I'm guessing is space that wasn't needed for shadowing option
ROMs.

I suspect megaraid_sas depends on something done by the option ROM,
and that prior to 3c076351c402, Linux did something to ASPM that was
enough to make megaraid_sas work.

I attached a couple debug patches to
https://bugzilla.kernel.org/show_bug.cgi?id=63661 that log all the
ASPM configuration the PCI core does.  One applies to 69166fbf02c7
(the pre-3c076351c402 commit), and the other applies to v4.1-rc1.
Could you boot both of those with "pci=earlydump" and attach the dmesg
logs to the bugzilla?  If you boot with the BIOS CMOS reset settings
(Fast Boot enabled and ASPM set to "BIOS"), I expect the
69166fbf02c7-based kernel to work, and the v4.1-rc1-based one to fail.

> Full boot of the system was difficult on the 3.2 kernels, they didn't make it
> to userspace for other stuff being too new. For testing, I compiled
> CONFIG_MEGARAID_SAS=y on 3.2, and =m on 3.16-rc4; that way when the initramfs &
> userspace failed, the megaraid load was captured over IPMI serial.
>
> I've done a lot of the analysis below while capturing.
>
> I was going to be booting many times, so I flipped the 'Fast Boot'
> option back to Disabled, so I could more easily get to the BIOS settings
> to change options while testing. When I did so, an accidental boot on a
> kernel that previously failed suddenly worked, leading me to raise an
> eyebrow, and this expanded my test matrix more.
>
> 3 kernels, 6 different BIOS config combinations (2x3) = 18 test cases
> Each configuration was booted at least twice; if the result of two boots was
> not identical, I booted a third time and took the majority result.
>
> All kernels had no boot params involving PCI specified (none of pci=, pcie*=,
> disable_msi*).
>
> Kernels:
> K.1: Ubuntu's 3.16-rc4
> K.2: 3.2-rc4 3c076351c402 - aspm merged
> K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
> Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8
>
> BIOS: Boot -> FastBoot:
> B1.1 Off
> B1.2 On (CMOS reset default)
>
> BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
> B2.1 Force L0s
> B2.2 BIOS (CMOS reset default)
> B2.3 Disabled
>
> Reduced Karnaugh Map of results:
> Kernels,B1,B2: Result
> *,   B1.1, *     PASS
> *,   B1.2, B2.1  VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
> K.1, B1.2, B2.2  FAIL
> K.1, B1.2, B2.3  FAIL
> K.2, B1.2, B2.2  FAIL
> K.2, B1.2, B2.3  FAIL
> K.3, B1.2, B2.2  PASS
> K.3, B1.2, B2.3  PASS

I'm not very practiced with Karnaugh maps, so correct me if my
understanding is wrong:

  - Fast Boot disabled: all kernels always passed

  - Fast Boot enabled, ASPM set to Force L0s enabled: variable; no
    consistency of results

  - Fast Boot enabled, ASPM set to BIOS or Disabled: pre-3c076351c402
    always passed, post-3c076351c402 always failed

Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread
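The reduced result map and the three-point summary above can be checked against each other mechanically. The sketch below is only an illustration: it encodes the reported outcome for each of the 18 (kernel, Fast Boot, ASPM) combinations exactly as listed in the map. The enum and function names are made up for this example, and the function returns what was reported in the thread, not anything measured by the code.

```c
#include <assert.h>

/* Labels follow the K.x / B1.x / B2.x names used in the result map. */
enum kernel   { K1, K2, K3, NUM_KERNELS };  /* 3.16-rc4, 3c076351c402, 69166fbf02c7 */
enum fastboot { B1_1_OFF, B1_2_ON, NUM_FASTBOOT };
enum aspm     { B2_1_FORCE_L0S, B2_2_BIOS, B2_3_DISABLED, NUM_ASPM };
enum result   { PASS, FAIL, VARIABLE };

/* Outcome reported in the thread for one test configuration. */
enum result reported_result(enum kernel k, enum fastboot fb, enum aspm a)
{
	assert(k < NUM_KERNELS && fb < NUM_FASTBOOT && a < NUM_ASPM);

	if (fb == B1_1_OFF)
		return PASS;		/* "*, B1.1, *  PASS" */
	if (a == B2_1_FORCE_L0S)
		return VARIABLE;	/* "*, B1.2, B2.1  VARIABLE" */
	/* Fast Boot on, ASPM "BIOS" or "Disabled": only the pre-rework kernel passed. */
	return (k == K3) ? PASS : FAIL;
}

/* Count how many of the 18 configurations have the given outcome. */
int count_results(enum result want)
{
	int n = 0;

	for (int k = 0; k < NUM_KERNELS; k++)
		for (int fb = 0; fb < NUM_FASTBOOT; fb++)
			for (int a = 0; a < NUM_ASPM; a++)
				if (reported_result((enum kernel)k,
						    (enum fastboot)fb,
						    (enum aspm)a) == want)
					n++;
	return n;
}
```

Written this way, the kernel only matters in the branch where Fast Boot is on and ASPM is not forced to L0s, which is the same conclusion the prose summary reaches.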
* Re: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-04-29 17:28     ` Bjorn Helgaas
@ 2015-05-28 12:24       ` Bjorn Helgaas
  2015-05-28 13:35         ` Kashyap Desai
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2015-05-28 12:24 UTC (permalink / raw)
  To: Robin H. Johnson
  Cc: Adam Radford, Neela Syam Kolli, linux-scsi@vger.kernel.org, arkadiusz.bubala, Matthew Garrett, Kashyap Desai, Sumit Saxena, Uday Lingala, megaraidlinux.pdl, linux-pci, linux-kernel, Jean Delvare, Myron Stowe

[+cc Jean, Myron]

Hello megaraid maintainers,

Have you been able to take a look at this at all?  People have been
reporting this issue since 2012 on upstream, Debian, and Ubuntu, and
now we're getting reports on SLES.

My theory is that the Linux driver relies on some MegaRAID
initialization done by the option ROM, and the bug happens when the
BIOS doesn't execute the option ROM.

If that's correct, you should be able to reproduce it on any system by
booting Linux (v3.3 or later) without running the MegaRAID SAS 2208
option ROM (either by enabling a BIOS "fast boot" switch, or modifying
the BIOS to skip it).  If the Linux driver doesn't rely on the option
ROM, you might even be able to reproduce it by physically removing the
option ROM from the MegaRAID.

Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-05-28 12:24       ` Bjorn Helgaas
@ 2015-05-28 13:35         ` Kashyap Desai
  2015-06-29 13:25           ` Jean Delvare
  2015-07-07  8:44           ` Jean Delvare
  0 siblings, 2 replies; 14+ messages in thread
From: Kashyap Desai @ 2015-05-28 13:35 UTC (permalink / raw)
  To: Bjorn Helgaas, Robin H. Johnson
  Cc: Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Jean Delvare, Myron Stowe

[-- Attachment #1: Type: text/plain, Size: 8035 bytes --]

Bjorn/Robin,

Apologies for the delay. Here is one quick suggestion, as we have seen a
similar issue (not exactly the same, but with a high probability of being
the same issue) when the controller is configured on a VM as pass-through
and the VM reboots abruptly.
In that particular issue, the driver interacts with FW, which may require
a chip reset to bring the controller to an operational state.

A relevant patch was submitted only for older controllers, as the issue
was only seen on a few MegaRAID controllers. The patch below already tries
to do a chip reset, but only for a limited set of controllers. I have
attached one more patch which does a chip reset at driver load time for
Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt
controller, so the attached patch is required.)

http://www.spinics.net/lists/linux-scsi/msg67288.html

Please post the result with the attached patch.

Thanks, Kashyap

[-- Attachment #2: fastboot_1.patch --]
[-- Type: application/octet-stream, Size: 6724 bytes --]

diff -aurp megaraid/megaraid_sas.h megaraid_u1/megaraid_sas.h
--- megaraid/megaraid_sas.h	2015-05-28 23:56:14.611768156 +0530
+++ megaraid_u1/megaraid_sas.h	2015-05-29 00:06:01.259768156 +0530
@@ -33,7 +33,7 @@
 /*
  * MegaRAID SAS Driver meta data
  */
-#define MEGASAS_VERSION				"06.803.01.00-rc1"
+#define MEGASAS_VERSION				"06.803.01.11-rc1"
 #define MEGASAS_RELDATE				"Mar. 10, 2014"
 #define MEGASAS_EXT_VERSION			"Mon. Mar. 10 17:00:00 PDT 2014"
 
Only in megaraid_u1/: megaraid_sas.mod.c
diff -aurp megaraid/megaraid_sas_fusion.c megaraid_u1/megaraid_sas_fusion.c
--- megaraid/megaraid_sas_fusion.c	2015-05-28 23:56:14.619768156 +0530
+++ megaraid_u1/megaraid_sas_fusion.c	2015-05-29 00:16:02.179768156 +0530
@@ -2208,6 +2208,89 @@ static int
 megasas_adp_reset_fusion(struct megasas_instance *instance,
 			 struct megasas_register_set __iomem *regs)
 {
+	u32 host_diag, abs_state, retry;
+
+
+	dev_info(&instance->pdev->dev, "Entered into %s %d \n", __func__, __LINE__);
+
+	writel(MPI2_WRSEQ_FLUSH_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+	writel(MPI2_WRSEQ_1ST_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+	writel(MPI2_WRSEQ_2ND_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+	writel(MPI2_WRSEQ_3RD_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+	writel(MPI2_WRSEQ_4TH_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+	writel(MPI2_WRSEQ_5TH_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+	writel(MPI2_WRSEQ_6TH_KEY_VALUE,
+		&instance->reg_set->fusion_seq_offset);
+
+	/* Check that the diag write enable (DRWE) bit is on */
+	host_diag = readl(&instance->reg_set->fusion_host_diag);
+	retry = 0;
+	while (!(host_diag & HOST_DIAG_WRITE_ENABLE)) {
+		msleep(100);
+		host_diag =
+			readl(&instance->reg_set->fusion_host_diag);
+		if (retry++ == 100) {
+			printk(KERN_WARNING "megaraid_sas: "
+				"Host diag unlock failed! "
+				"for scsi%d\n",
+				instance->host->host_no);
+			break;
+		}
+	}
+	if (!(host_diag & HOST_DIAG_WRITE_ENABLE))
+		return -1;
+
+	/* Send chip reset command */
+	writel(host_diag | HOST_DIAG_RESET_ADAPTER,
+		&instance->reg_set->fusion_host_diag);
+	msleep(3000);
+
+	/* Make sure reset adapter bit is cleared */
+	host_diag = readl(&instance->reg_set->fusion_host_diag);
+	retry = 0;
+	while (host_diag & HOST_DIAG_RESET_ADAPTER) {
+		msleep(100);
+		host_diag =
+			readl(&instance->reg_set->fusion_host_diag);
+		if (retry++ == 1000) {
+			printk(KERN_WARNING "megaraid_sas: "
+				"Diag reset adapter never "
+				"cleared for scsi%d!\n",
+				instance->host->host_no);
+			break;
+		}
+	}
+	if (host_diag & HOST_DIAG_RESET_ADAPTER)
+		return -1;
+
+	abs_state =
+		instance->instancet->read_fw_status_reg(
+			instance->reg_set) & MFI_STATE_MASK;
+	retry = 0;
+
+	while ((abs_state <= MFI_STATE_FW_INIT) &&
+		(retry++ < 1000)) {
+		msleep(100);
+		abs_state =
+			instance->instancet->read_fw_status_reg(
+				instance->reg_set) & MFI_STATE_MASK;
+	}
+	if (abs_state <= MFI_STATE_FW_INIT) {
+		printk(KERN_WARNING "megaraid_sas: firmware "
+			"state < MFI_STATE_FW_INIT, state = "
+			"0x%x for scsi%d\n", abs_state,
+			instance->host->host_no);
+		return -1;
+	}
+
+	dev_info(&instance->pdev->dev, "Exit from %s %d \n", __func__, __LINE__);
+
 	return 0;
 }
 
@@ -2336,13 +2419,13 @@ out:
 /* Core fusion reset function */
 int megasas_reset_fusion(struct Scsi_Host *shost, int iotimeout)
 {
-	int retval = SUCCESS, i, j, retry = 0, convert = 0;
+	int retval = SUCCESS, i, j, convert = 0;
 	struct megasas_instance *instance;
 	struct megasas_cmd_fusion *cmd_fusion;
 	struct fusion_context *fusion;
 	struct megasas_cmd *cmd_mfi;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
-	u32 host_diag, abs_state, status_reg, reset_adapter;
+	u32 abs_state, status_reg, reset_adapter;
 
 	instance = (struct megasas_instance *)shost->hostdata;
 	fusion = instance->ctrl_context;
@@ -2457,81 +2540,10 @@ int megasas_reset_fusion(struct Scsi_Hos
 
 		/* Now try to reset the chip */
 		for (i = 0; i < MEGASAS_FUSION_MAX_RESET_TRIES; i++) {
-			writel(MPI2_WRSEQ_FLUSH_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-			writel(MPI2_WRSEQ_1ST_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-			writel(MPI2_WRSEQ_2ND_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-			writel(MPI2_WRSEQ_3RD_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-			writel(MPI2_WRSEQ_4TH_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-			writel(MPI2_WRSEQ_5TH_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-			writel(MPI2_WRSEQ_6TH_KEY_VALUE,
-			       &instance->reg_set->fusion_seq_offset);
-
-			/* Check that the diag write enable (DRWE) bit is on */
-			host_diag = readl(&instance->reg_set->fusion_host_diag);
-			retry = 0;
-			while (!(host_diag & HOST_DIAG_WRITE_ENABLE)) {
-				msleep(100);
-				host_diag =
-				readl(&instance->reg_set->fusion_host_diag);
-				if (retry++ == 100) {
-					printk(KERN_WARNING "megaraid_sas: "
-					       "Host diag unlock failed! "
-					       "for scsi%d\n",
-						instance->host->host_no);
-					break;
-				}
-			}
-			if (!(host_diag & HOST_DIAG_WRITE_ENABLE))
-				continue;
-
-			/* Send chip reset command */
-			writel(host_diag | HOST_DIAG_RESET_ADAPTER,
-			       &instance->reg_set->fusion_host_diag);
-			msleep(3000);
-
-			/* Make sure reset adapter bit is cleared */
-			host_diag = readl(&instance->reg_set->fusion_host_diag);
-			retry = 0;
-			while (host_diag & HOST_DIAG_RESET_ADAPTER) {
-				msleep(100);
-				host_diag =
-				readl(&instance->reg_set->fusion_host_diag);
-				if (retry++ == 1000) {
-					printk(KERN_WARNING "megaraid_sas: "
-					       "Diag reset adapter never "
-					       "cleared for scsi%d!\n",
-						instance->host->host_no);
-					break;
-				}
-			}
-			if (host_diag & HOST_DIAG_RESET_ADAPTER)
-				continue;
-			abs_state =
-				instance->instancet->read_fw_status_reg(
-					instance->reg_set) & MFI_STATE_MASK;
-			retry = 0;
-
-			while ((abs_state <= MFI_STATE_FW_INIT) &&
-			       (retry++ < 1000)) {
-				msleep(100);
-				abs_state =
-				instance->instancet->read_fw_status_reg(
-					instance->reg_set) & MFI_STATE_MASK;
-			}
-			if (abs_state <= MFI_STATE_FW_INIT) {
-				printk(KERN_WARNING "megaraid_sas: firmware "
-				       "state < MFI_STATE_FW_INIT, state = "
-				       "0x%x for scsi%d\n", abs_state,
-					instance->host->host_no);
-				continue;
-			}
+			if (instance->instancet->adp_reset
+				(instance, instance->reg_set))
+				continue;
 
 			/* Wait for FW to become ready */
 			if (megasas_transition_to_ready(instance, 1)) {

^ permalink raw reply	[flat|nested] 14+ messages in thread
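The reset sequence in the patch above waits on hardware state with the same idiom several times: read a register, test a condition, sleep, and give up after a fixed number of tries. Below is a minimal user-space sketch of that bounded-poll pattern for the "wait until a bit is set" case only. It is not driver code: `poll_until_set`, `reg_read_fn`, `struct fake_reg` and `fake_read` are hypothetical names invented for this illustration, and the register is simulated so the example can run without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback that returns the current value of some status register. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Poll until any bit in mask is set, reading at most max_polls times.
 * Returns 0 once the bit is seen and -1 if the poll budget runs out,
 * mirroring the 0 / -1 convention of the reset function in the patch.
 */
int poll_until_set(reg_read_fn read_reg, void *ctx, uint32_t mask,
		   unsigned max_polls)
{
	assert(read_reg != NULL);

	for (unsigned i = 0; i < max_polls; i++) {
		if (read_reg(ctx) & mask)
			return 0;
		/* A driver would sleep between polls here, e.g. msleep(100). */
	}
	return -1;
}

/* Simulated register: reads as 0 until it has been read ready_after times. */
struct fake_reg {
	unsigned ready_after;	/* read count at which value becomes visible */
	uint32_t value;		/* value reported once ready */
	unsigned reads;		/* number of reads performed so far */
};

uint32_t fake_read(void *ctx)
{
	struct fake_reg *r = ctx;

	r->reads++;
	return (r->reads >= r->ready_after) ? r->value : 0;
}
```

Passing the read function as a callback keeps the retry budget in one place and lets the loop be exercised against a fake register. The "wait until a bit clears" and "wait until the firmware state exceeds a threshold" loops in the patch would need a different predicate than the simple mask test used here.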
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-05-28 13:35         ` Kashyap Desai
@ 2015-06-29 13:25           ` Jean Delvare
  2015-06-30 10:33             ` Kashyap Desai
  2015-07-07  8:44           ` Jean Delvare
  1 sibling, 1 reply; 14+ messages in thread
From: Jean Delvare @ 2015-06-29 13:25 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

Hi Kashyap,

Thanks for the patch. May I ask what tree it was based on? Linus'
latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel
trees (based on kernel v3.0 + a bunch of backports and v3.12
respectively) but your patch fails to apply in both cases. I'll try
harder but I don't know anything about the megaraid_sas code so I
really don't know where I'm going.

Does your patch depend on any other that may not be present in the SLES
11 SP3 and SLES 12 kernels?

Thanks,
Jean

On Thursday 28 May 2015 at 19:05 +0530, Kashyap Desai wrote:
> Bjorn/Robin,
>
> Apologies for the delay. Here is one quick suggestion, as we have seen a
> similar issue (not exactly the same, but with a high probability of being
> the same issue) when the controller is configured on a VM as pass-through
> and the VM reboots abruptly.
> In that particular issue, the driver interacts with FW, which may require
> a chip reset to bring the controller to an operational state.
>
> A relevant patch was submitted only for older controllers, as the issue
> was only seen on a few MegaRAID controllers. The patch below already tries
> to do a chip reset, but only for a limited set of controllers. I have
> attached one more patch which does a chip reset at driver load time for
> Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt
> controller, so the attached patch is required.)
>
> http://www.spinics.net/lists/linux-scsi/msg67288.html
>
> Please post the result with the attached patch.
>
> Thanks, Kashyap

^ permalink raw reply	[flat|nested] 14+ messages in thread
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-06-29 13:25           ` Jean Delvare
@ 2015-06-30 10:33             ` Kashyap Desai
  2015-07-01  9:20               ` Jean Delvare
  0 siblings, 1 reply; 14+ messages in thread
From: Kashyap Desai @ 2015-06-30 10:33 UTC (permalink / raw)
  To: Jean Delvare
  Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

Jean,

The patch is available in the repo below:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git -b for-next

Commit id:
6431f5d7c6025f8b007af06ea090de308f7e6881

If you share the megaraid_sas driver code from your tree, I can provide
a patch for you.

Kashyap

> -----Original Message-----
> From: Jean Delvare [mailto:jdelvare@suse.de]
> Sent: Monday, June 29, 2015 6:55 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> linux-scsi@vger.kernel.org; arkadiusz.bubala@open-e.com; Matthew Garrett;
> Sumit Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-pci@vger.kernel.org;
> linux-kernel@vger.kernel.org; Myron Stowe
> Subject: RE: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> Thanks for the patch. May I ask what tree it was based on? Linus'
> latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel
> trees (based on kernel v3.0 + a bunch of backports and v3.12
> respectively) but your patch fails to apply in both cases. I'll try
> harder but I don't know anything about the megaraid_sas code so I
> really don't know where I'm going.
>
> Does your patch depend on any other that may not be present in the SLES
> 11 SP3 and SLES 12 kernels?
>
> Thanks,
> Jean
>
> On Thursday 28 May 2015 at 19:05 +0530, Kashyap Desai wrote:
> > Bjorn/Robin,
> >
> > Apologies for the delay. Here is one quick suggestion, as we have seen a
> > similar issue (not exactly the same, but with a high probability of being
> > the same issue) when the controller is configured on a VM as pass-through
> > and the VM reboots abruptly.
> > In that particular issue, the driver interacts with FW, which may require
> > a chip reset to bring the controller to an operational state.
> >
> > A relevant patch was submitted only for older controllers, as the issue
> > was only seen on a few MegaRAID controllers. The patch below already tries
> > to do a chip reset, but only for a limited set of controllers. I have
> > attached one more patch which does a chip reset at driver load time for
> > Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt
> > controller, so the attached patch is required.)
> >
> > http://www.spinics.net/lists/linux-scsi/msg67288.html
> >
> > Please post the result with the attached patch.
> >
> > Thanks, Kashyap
>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-06-30 10:33 ` Kashyap Desai
@ 2015-07-01 9:20 ` Jean Delvare
  0 siblings, 0 replies; 14+ messages in thread
From: Jean Delvare @ 2015-07-01 9:20 UTC
To: Kashyap Desai
Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

Hi Kashyap,

I finally managed to backport your patch to the SLES 12 kernel :-) I'll build a test kernel for the customer and have them test it.

I'll let you know if I need your help later for the SLES 11 SP3 kernel backport - thanks for the offer!

Jean

On Tuesday 30 June 2015 at 16:03 +0530, Kashyap Desai wrote:
> Jean,
>
> Patch is available at below repo -
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git -b for-next
>
> Commit id -
> 6431f5d7c6025f8b007af06ea090de308f7e6881
>
> If you share megaraid_sas driver code of your tree, I can provide patch
> for you.
>
> ~ Kashyap
>
> > -----Original Message-----
> > From: Jean Delvare [mailto:jdelvare@suse.de]
> > Sent: Monday, June 29, 2015 6:55 PM
> > To: Kashyap Desai
> > Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> > linux-scsi@vger.kernel.org; arkadiusz.bubala@open-e.com; Matthew Garrett;
> > Sumit Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-pci@vger.kernel.org;
> > linux-kernel@vger.kernel.org; Myron Stowe
> > Subject: RE: megaraid_sas: "FW in FAULT state!!", how to get more debug
> > output? [BKO63661]
> >
> > Hi Kashyap,
> >
> > Thanks for the patch. May I ask what tree it was based on? Linus'
> > latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel
> > trees (based on kernel v3.0 + a bunch of backports and v3.12
> > respectively) but your patch fails to apply in both cases. I'll try
> > harder but I don't know anything about the megaraid_sas code so I
> > really don't know where I'm going.
> >
> > Does your patch depend on any other that may not be present in the SLES
> > 11 SP3 and SLES 12 kernels?
> >
> > Thanks,
> > Jean
> >
> > On Thursday 28 May 2015 at 19:05 +0530, Kashyap Desai wrote:
> > > Bjorn/Robin,
> > >
> > > Apologies for delay. Here is one quick suggestion as we have seen
> > > similar issue (not exactly similar, but high probably to have same
> > > issue) while controller is configured on VM as pass-through and VM
> > > reboot abruptly.
> > > In that particular issue, driver interact with FW which may require
> > > chip reset to bring controller to operation state.
> > >
> > > Relevant patch was submitted for only Older controller as it was only
> > > seen for few MegaRaid controller. Below patch already try to do chip
> > > reset, but only for limited controllers...I have attached one more
> > > patch which does chip reset from driver load time for
> > > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt
> > > controller, so attached patch is required.)
> > >
> > > http://www.spinics.net/lists/linux-scsi/msg67288.html
> > >
> > > Please post the result with attached patch.
> > >
> > > Thanks, Kashyap
* Re: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-05-28 13:35 ` Kashyap Desai
  2015-06-29 13:25 ` Jean Delvare
@ 2015-07-07 8:44 ` Jean Delvare
  2015-07-07 9:18 ` Kashyap Desai
  1 sibling, 1 reply; 14+ messages in thread
From: Jean Delvare @ 2015-07-07 8:44 UTC
To: Kashyap Desai
Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

Hi Kashyap,

On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote:
> Bjorn/Robin,
>
> Apologies for delay. Here is one quick suggestion as we have seen similar
> issue (not exactly similar, but high probably to have same issue) while
> controller is configured on VM as pass-through and VM reboot abruptly.
> In that particular issue, driver interact with FW which may require chip
> reset to bring controller to operation state.
>
> Relevant patch was submitted for only Older controller as it was only seen
> for few MegaRaid controller. Below patch already try to do chip reset, but
> only for limited controllers...I have attached one more patch which does
> chip reset from driver load time for Thunderbolt/Invader/Fury etc. (In
> your case you have Thunderbolt controller, so attached patch is required.)
>
> http://www.spinics.net/lists/linux-scsi/msg67288.html
>
> Please post the result with attached patch.

Good news! Customer tested your patch and said it fixed the problem :-)

I am now in the process of backporting the patch to the SLES 11 SP3 kernel for further testing. I'll let you know how it goes. Thank you very much for your assistance.

-- 
Jean Delvare
SUSE L3 Support
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-07-07 8:44 ` Jean Delvare
@ 2015-07-07 9:18 ` Kashyap Desai
  2015-07-10 14:05 ` Jean Delvare
  0 siblings, 1 reply; 14+ messages in thread
From: Kashyap Desai @ 2015-07-07 9:18 UTC
To: Jean Delvare
Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

> -----Original Message-----
> From: Jean Delvare [mailto:jdelvare@suse.de]
> Sent: Tuesday, July 07, 2015 2:14 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> linux-scsi@vger.kernel.org; arkadiusz.bubala@open-e.com; Matthew Garrett;
> Sumit Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-pci@vger.kernel.org;
> linux-kernel@vger.kernel.org; Myron Stowe
> Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote:
> > Bjorn/Robin,
> >
> > Apologies for delay. Here is one quick suggestion as we have seen
> > similar issue (not exactly similar, but high probably to have same
> > issue) while controller is configured on VM as pass-through and VM
> > reboot abruptly.
> > In that particular issue, driver interact with FW which may require
> > chip reset to bring controller to operation state.
> >
> > Relevant patch was submitted for only Older controller as it was only
> > seen for few MegaRaid controller. Below patch already try to do chip
> > reset, but only for limited controllers...I have attached one more
> > patch which does chip reset from driver load time for
> > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt
> > controller, so attached patch is required.)
> >
> > http://www.spinics.net/lists/linux-scsi/msg67288.html
> >
> > Please post the result with attached patch.
>
> Good news! Customer tested your patch and said it fixed the problem :-)
>
> I am now in the process of backporting the patch to the SLES 11 SP3
> kernel for further testing. I'll let you know how it goes. Thank you
> very much for your assistance.

Thanks for the confirmation. The patch I sent you is one we added recently (as part of a common interface approach to do chip reset at load time). We will be submitting that patch to mainline soon.

~ Kashyap

>
> --
> Jean Delvare
> SUSE L3 Support
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-07-07 9:18 ` Kashyap Desai
@ 2015-07-10 14:05 ` Jean Delvare
  2015-07-10 14:16 ` Kashyap Desai
  0 siblings, 1 reply; 14+ messages in thread
From: Jean Delvare @ 2015-07-10 14:05 UTC
To: Kashyap Desai
Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

Hi Kashyap,

On Tuesday 07 July 2015 at 14:48 +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Jean Delvare [mailto:jdelvare@suse.de]
> > Sent: Tuesday, July 07, 2015 2:14 PM
> > To: Kashyap Desai
> > Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> > linux-scsi@vger.kernel.org; arkadiusz.bubala@open-e.com; Matthew Garrett;
> > Sumit Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-pci@vger.kernel.org;
> > linux-kernel@vger.kernel.org; Myron Stowe
> > Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> > output? [BKO63661]
> >
> > Hi Kashyap,
> >
> > On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote:
> > > Bjorn/Robin,
> > >
> > > Apologies for delay. Here is one quick suggestion as we have seen
> > > similar issue (not exactly similar, but high probably to have same
> > > issue) while controller is configured on VM as pass-through and VM
> > > reboot abruptly.
> > > In that particular issue, driver interact with FW which may require
> > > chip reset to bring controller to operation state.
> > >
> > > Relevant patch was submitted for only Older controller as it was only
> > > seen for few MegaRaid controller. Below patch already try to do chip
> > > reset, but only for limited controllers...I have attached one more
> > > patch which does chip reset from driver load time for
> > > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt
> > > controller, so attached patch is required.)
> > >
> > > http://www.spinics.net/lists/linux-scsi/msg67288.html
> > >
> > > Please post the result with attached patch.
> >
> > Good news! Customer tested your patch and said it fixed the problem :-)
> >
> > I am now in the process of backporting the patch to the SLES 11 SP3
> > kernel for further testing. I'll let you know how it goes. Thank you
> > very much for your assistance.

For the record, I was able to backport the patch to SLES 11 SP3 by myself; it's currently under testing by the customer.

> Thanks for the confirmation. The patch I sent you is one we added
> recently (as part of a common interface approach to do chip reset at
> load time). We will be submitting that patch to mainline soon.

I am about to commit the patch that was successfully tested by the customer on SLES 12, but I'm a bit confused. The upstream patch you referred to is:

https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881
[SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel

But the patch I used is the one you sent by e-mail on May 28th. It is completely different!

So what am I supposed to do? Use the patch you sent (and that was tested by the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing and the proper way of fixing the problem would be to backport the upstream commit?

Please advise,
-- 
Jean Delvare
SUSE L3 Support
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-07-10 14:05 ` Jean Delvare
@ 2015-07-10 14:16 ` Kashyap Desai
  2015-07-15 7:21 ` Jean Delvare
  0 siblings, 1 reply; 14+ messages in thread
From: Kashyap Desai @ 2015-07-10 14:16 UTC
To: Jean Delvare
Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

>
> I am about to commit the patch that was successfully tested by the
> customer on SLES 12, but I'm a bit confused. The upstream patch you
> referred to is:
>
> https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881
> [SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel
>
> But the patch I used is the one you sent by e-mail on May 28th. It is
> completely different!
>
> So what am I supposed to do? Use the patch you sent (and that was tested
> by the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing
> and the proper way of fixing the problem would be to backport the
> upstream commit?

You can use that patch as a valid candidate for upstream submission. One of the MR maintainers (Sumit Saxena) will send that patch. We are just organizing other patch series.
Since SLES already ported the patch without a commit id, we are fine. I am just noting that the patch which I sent via email will be sent upstream very soon along with the other patch set.

Thanks, Kashyap

>
> Please advise,
> --
> Jean Delvare
> SUSE L3 Support
* RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
  2015-07-10 14:16 ` Kashyap Desai
@ 2015-07-15 7:21 ` Jean Delvare
  0 siblings, 0 replies; 14+ messages in thread
From: Jean Delvare @ 2015-07-15 7:21 UTC
To: Kashyap Desai
Cc: Bjorn Helgaas, Robin H. Johnson, Adam Radford, Neela Syam Kolli, linux-scsi, arkadiusz.bubala, Matthew Garrett, Sumit Saxena, Uday Lingala, PDL,MEGARAIDLINUX, linux-pci, linux-kernel, Myron Stowe

On Friday 10 July 2015 at 19:46 +0530, Kashyap Desai wrote:
> >
> > I am about to commit the patch that was successfully tested by the
> > customer on SLES 12, but I'm a bit confused. The upstream patch you
> > referred to is:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881
> > [SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel
> >
> > But the patch I used is the one you sent by e-mail on May 28th. It is
> > completely different!
> >
> > So what am I supposed to do? Use the patch you sent (and that was tested
> > by the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing
> > and the proper way of fixing the problem would be to backport the
> > upstream commit?
>
> You can use that patch as a valid candidate for upstream submission. One
> of the MR maintainers (Sumit Saxena) will send that patch. We are just
> organizing other patch series.
> Since SLES already ported the patch without a commit id, we are fine. I
> am just noting that the patch which I sent via email will be sent
> upstream very soon along with the other patch set.

OK, thanks for the clarification.

The patched SLES 11 SP3 kernel is currently under testing by the customer; apparently it doesn't work, but I don't have all the details yet. Maybe some more patches need to be backported because that kernel is older.

-- 
Jean Delvare
SUSE L3 Support
end of thread, other threads:[~2015-07-15 7:21 UTC | newest]

Thread overview: 14+ messages
2014-07-12 11:56 megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661] Robin H. Johnson
2014-07-12 17:29 ` Bjorn Helgaas
2014-07-13 1:35 ` Robin H. Johnson
2015-04-29 17:28 ` Bjorn Helgaas
2015-05-28 12:24 ` Bjorn Helgaas
2015-05-28 13:35 ` Kashyap Desai
2015-06-29 13:25 ` Jean Delvare
2015-06-30 10:33 ` Kashyap Desai
2015-07-01 9:20 ` Jean Delvare
2015-07-07 8:44 ` Jean Delvare
2015-07-07 9:18 ` Kashyap Desai
2015-07-10 14:05 ` Jean Delvare
2015-07-10 14:16 ` Kashyap Desai
2015-07-15 7:21 ` Jean Delvare