* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
[not found] <bcbd48b5-1d6e-8fe3-d6a0-cb341e5c34e3@nvidia.com>
@ 2023-03-09 17:53 ` Bjorn Helgaas
2023-03-10 22:39 ` Tushar Dave
0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2023-03-09 17:53 UTC (permalink / raw)
To: Tushar Dave; +Cc: Sagi Grimberg, linux-nvme, kbusch, bhelgaas, linux-pci
[+cc linux-pci]
On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
> On 3/7/23 03:59, Sagi Grimberg wrote:
> > On 3/2/23 02:09, Tushar Dave wrote:
> > > We are observing the NVMe device being disabled due to a reset
> > > failure after injecting a Malformed TLP. DPC/AER recovery succeeds
> > > but NVMe fails. I tried this on 2 different systems and it is 100%
> > > reproducible with the 6.2 kernel.
> > >
> > > On my system, a Samsung NVMe SSD Controller PM173X is directly behind
> > > a Broadcom PCIe Switch Downstream Port. The Malformed TLP is injected
> > > by changing the Max Payload Size (MPS) of the PCIe switch to 128B
> > > (keeping the NVMe device MPS at 512B).
> > >
> > > e.g.
> > > # change MPS of PCIe switch (a9:10.0)
> > > $ setpci -v -s a9:10.0 cap_exp+0x8.w
> > > 0000:a9:10.0 (cap 10 @68) @70 = 0857
> > > $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
> > > 0000:a9:10.0 (cap 10 @68) @70 0817
> > > $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
> > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
> > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> > > MaxPayload 128 bytes, MaxReadReq 128 bytes
> > >
> > > # run some traffic on nvme (ab:00.0)
> > > $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
> > > dd: error reading '/dev/nvme0n1': Input/output error
> > > 2+0 records in
> > > 2+0 records out
> > > 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
> > >
> > > #kernel log:
> > > [ 163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
> > > [ 163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> > > [ 163.049071] pcieport 0000:a9:10.0: DPC: containment event,
> > > status:0x2009 source:0x0000
> > > [ 163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
> > > detected
> > > [ 163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
> > > severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> > > [ 163.078151] pcieport 0000:a9:10.0: device [1000:c030] error
> > > status/mask=00040000/00180000
> > > [ 163.087613] pcieport 0000:a9:10.0: [18] MalfTLP
> > > (First)
> > > [ 163.095281] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
> > > ab0000ff 00000001 d1fd0000
> > > [ 163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> > > [ 163.112095] nvme nvme0: frozen state error detected, reset controller
> > > [ 163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
> > > (sct 0x3 / sc 0x71)
> > > [ 163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
> > > 0x4080700 phys_seg 4 prio class 2
> > > [ 163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> > > [ 163.390895] nvme nvme0: restart after slot reset
> > > [ 163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
> > > (was 0x100, writing 0x1ff)
> > > [ 163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
> > > (was 0x0, writing 0xe0600000)
> > > [ 163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
> > > (was 0x4, writing 0xe0710004)
> > > [ 163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
> > > (was 0x0, writing 0x8)
> > > [ 163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
> > > (was 0x100000, writing 0x100546)
> > > [ 163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
> > > [ 163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
> > > (reading 0xa824144d)
> > > [ 163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
> > > (reading 0x100546)
> > > [ 163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
> > > (reading 0x1080200)
> > > [ 163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
> > > (reading 0x8)
> > > [ 163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
> > > (reading 0xe0710004)
> > > [ 163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
> > > (reading 0x0)
> > > [ 163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
> > > (reading 0x0)
> > > [ 163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
> > > (reading 0x0)
> > > [ 163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
> > > (reading 0x0)
> > > [ 163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
> > > (reading 0x0)
> > > [ 163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
> > > (reading 0x0)
> > > [ 163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
> > > (reading 0xa80a144d)
> > > [ 163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
> > > (reading 0xe0600000)
> > > [ 163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
> > > (reading 0x40)
> > > [ 163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
> > > (reading 0x0)
> > > [ 163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
> > > (reading 0x1ff)
> > > [ 210.089132] block nvme0n1: no usable path - requeuing I/O
> > > [ 223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> > > [ 223.825236] nvme nvme0: Identify Controller failed (-4)
> > > [ 223.832145] nvme nvme0: Disabling device after reset failure: -5
> >
> > At this point the device is not going to recover.
> Yes, I agree.
>
> I looked a little bit more and found that the nvme reset failure and the
> second DPC were both due to nvme_slot_reset() restoring MPS as part of
> pci_restore_state().
>
> AFAICT, after the first DPC event occurs, the nvme device MPS gets changed
> to the _default_ value of 128B (this is likely due to DPC link retraining).
> However, as part of software AER recovery, nvme_slot_reset() restores the
> device state, and that brings the nvme device MPS back to 512B. (The MPS
> of PCIe switch a9:10.0 still remains at 128B.)
>
> At this point, when nvme_reset_ctrl->nvme_reset_work() tries to enable the
> device, a Malformed TLP is generated again; that causes the second DPC and
> makes the NVMe controller reset fail as well.
This sounds like the behavior I expect. IIUC:
- Switch and NVMe MPS are 512B
- NVMe config space saved (including MPS=512B)
- You change Switch MPS to 128B
- NVMe does DMA with payload > 128B
- Switch reports Malformed TLP because TLP is larger than its MPS
- Recovery resets NVMe, which sets MPS to the default of 128B
- nvme_slot_reset() restores NVMe config space (MPS is now 512B)
- Subsequent NVMe DMA with payload > 128B repeats cycle
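For reference, the 512B and 128B values above are exactly what the setpci
example encodes: DevCtl bits 7:5 hold the MPS field and the size is
128 << field. A quick shell sketch with the two register values from the
report:
  $ echo $(( 128 << ((0x0857 >> 5) & 0x7) ))    # before the write
  512
  $ echo $(( 128 << ((0x0817 >> 5) & 0x7) ))    # after the write
  128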
What do you think *should* be happening here? I don't see a PCI
problem here. If you change MPS on the Switch without coordinating
with NVMe, things aren't going to work. Or am I missing something?
Bjorn
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-09 17:53 ` nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery Bjorn Helgaas
@ 2023-03-10 22:39 ` Tushar Dave
2023-03-10 23:53 ` Bjorn Helgaas
0 siblings, 1 reply; 15+ messages in thread
From: Tushar Dave @ 2023-03-10 22:39 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Sagi Grimberg, linux-nvme, kbusch, bhelgaas, linux-pci
On 3/9/23 09:53, Bjorn Helgaas wrote:
> [+cc linux-pci]
>
> On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
>> On 3/7/23 03:59, Sagi Grimberg wrote:
>>> On 3/2/23 02:09, Tushar Dave wrote:
>>>> We are observing the NVMe device being disabled due to a reset
>>>> failure after injecting a Malformed TLP. DPC/AER recovery succeeds
>>>> but NVMe fails. I tried this on 2 different systems and it is 100%
>>>> reproducible with the 6.2 kernel.
>>>>
>>>> On my system, a Samsung NVMe SSD Controller PM173X is directly behind
>>>> a Broadcom PCIe Switch Downstream Port. The Malformed TLP is injected
>>>> by changing the Max Payload Size (MPS) of the PCIe switch to 128B
>>>> (keeping the NVMe device MPS at 512B).
>>>>
>>>> e.g.
>>>> # change MPS of PCIe switch (a9:10.0)
>>>> $ setpci -v -s a9:10.0 cap_exp+0x8.w
>>>> 0000:a9:10.0 (cap 10 @68) @70 = 0857
>>>> $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
>>>> 0000:a9:10.0 (cap 10 @68) @70 0817
>>>> $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
>>>> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
>>>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>> MaxPayload 128 bytes, MaxReadReq 128 bytes
>>>>
>>>> # run some traffic on nvme (ab:00.0)
>>>> $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
>>>> dd: error reading '/dev/nvme0n1': Input/output error
>>>> 2+0 records in
>>>> 2+0 records out
>>>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
>>>>
>>>> #kernel log:
>>>> [ 163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
>>>> [ 163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>>> [ 163.049071] pcieport 0000:a9:10.0: DPC: containment event,
>>>> status:0x2009 source:0x0000
>>>> [ 163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
>>>> detected
>>>> [ 163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
>>>> severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>>> [ 163.078151] pcieport 0000:a9:10.0: device [1000:c030] error
>>>> status/mask=00040000/00180000
>>>> [ 163.087613] pcieport 0000:a9:10.0: [18] MalfTLP
>>>> (First)
>>>> [ 163.095281] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
>>>> ab0000ff 00000001 d1fd0000
>>>> [ 163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>>> [ 163.112095] nvme nvme0: frozen state error detected, reset controller
>>>> [ 163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
>>>> (sct 0x3 / sc 0x71)
>>>> [ 163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
>>>> 0x4080700 phys_seg 4 prio class 2
>>>> [ 163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>>> [ 163.390895] nvme nvme0: restart after slot reset
>>>> [ 163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
>>>> (was 0x100, writing 0x1ff)
>>>> [ 163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
>>>> (was 0x0, writing 0xe0600000)
>>>> [ 163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
>>>> (was 0x4, writing 0xe0710004)
>>>> [ 163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
>>>> (was 0x0, writing 0x8)
>>>> [ 163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
>>>> (was 0x100000, writing 0x100546)
>>>> [ 163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
>>>> [ 163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
>>>> (reading 0xa824144d)
>>>> [ 163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
>>>> (reading 0x100546)
>>>> [ 163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
>>>> (reading 0x1080200)
>>>> [ 163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
>>>> (reading 0x8)
>>>> [ 163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
>>>> (reading 0xe0710004)
>>>> [ 163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
>>>> (reading 0x0)
>>>> [ 163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
>>>> (reading 0x0)
>>>> [ 163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
>>>> (reading 0x0)
>>>> [ 163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
>>>> (reading 0x0)
>>>> [ 163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
>>>> (reading 0x0)
>>>> [ 163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
>>>> (reading 0x0)
>>>> [ 163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
>>>> (reading 0xa80a144d)
>>>> [ 163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
>>>> (reading 0xe0600000)
>>>> [ 163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
>>>> (reading 0x40)
>>>> [ 163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
>>>> (reading 0x0)
>>>> [ 163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
>>>> (reading 0x1ff)
>>>> [ 210.089132] block nvme0n1: no usable path - requeuing I/O
>>>> [ 223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
>>>> [ 223.825236] nvme nvme0: Identify Controller failed (-4)
>>>> [ 223.832145] nvme nvme0: Disabling device after reset failure: -5
>>>
>>> At this point the device is not going to recover.
>> Yes, I agree.
>>
>> I looked a little bit more and found that the nvme reset failure and the
>> second DPC were both due to nvme_slot_reset() restoring MPS as part of
>> pci_restore_state().
>>
>> AFAICT, after the first DPC event occurs, the nvme device MPS gets changed
>> to the _default_ value of 128B (this is likely due to DPC link retraining).
>> However, as part of software AER recovery, nvme_slot_reset() restores the
>> device state, and that brings the nvme device MPS back to 512B. (The MPS
>> of PCIe switch a9:10.0 still remains at 128B.)
>>
>> At this point, when nvme_reset_ctrl->nvme_reset_work() tries to enable the
>> device, a Malformed TLP is generated again; that causes the second DPC and
>> makes the NVMe controller reset fail as well.
>
> This sounds like the behavior I expect. IIUC:
>
> - Switch and NVMe MPS are 512B
> - NVMe config space saved (including MPS=512B)
> - You change Switch MPS to 128B
> - NVMe does DMA with payload > 128B
> - Switch reports Malformed TLP because TLP is larger than its MPS
> - Recovery resets NVMe, which sets MPS to the default of 128B
> - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> - Subsequent NVMe DMA with payload > 128B repeats cycle
>
> What do you think *should* be happening here? I don't see a PCI
> problem here. If you change MPS on the Switch without coordinating
> with NVMe, things aren't going to work. Or am I missing something?
I agree this is expected, but there are instances where I do _not_ see the
issue occurring. That is due to the involvement of pciehp, which adds and
configures the nvme device (coordinating MPS with the PCIe switch; the new
MPS gets saved too, so future tests of this kind won't reproduce this issue,
and that is understood).
IMO though, the result of the test should be consistent:
either pciehp/DPC takes care of device recovery 100% of the time,
or we consider the nvme recovery failure an expected result because the MPS
of the PCIe switch was changed without coordinating with nvme.
What do you think?
e.g. [ when pciehp takes care of things]
[ 216.608538] pcieport 0000:a9:10.0: pciehp: pending interrupts 0x0108 from
Slot Status
[ 216.639954] pcieport 0000:a5:01.0: EDR: EDR event received
[ 216.640429] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[ 216.640438] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009
source:0x0000
[ 216.640442] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
[ 216.640452] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected
(Fatal), type=Transaction Layer, (Receiver ID)
[ 216.652549] pcieport 0000:a9:10.0: device [1000:c030] error
status/mask=00040000/00180000
[ 216.661975] pcieport 0000:a9:10.0: [18] MalfTLP (First)
[ 216.669647] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff
00000102 276fe000
[ 216.678890] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[ 216.678898] nvme nvme0: frozen state error detected, reset controller
[ 216.842570] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct 0x3 /
sc 0x71)
[ 216.851684] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags 0x4080700
phys_seg 4 prio class 2
[ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 217.071217] nvme nvme0: restart after slot reset
[ 217.071228] nvme 0000:ab:00.0: nvme_slot_reset: before pci_restore_state
DEVCTL: 0x2910
[ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up ignored
(recovered by DPC)
[ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
lnk_status = 2044
[ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
[ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device:
domain:bus:dev = 0000:ab:00
[ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was
0x100, writing 0x1ff)
[ 217.071346] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was
0x0, writing 0xe0600000)
[ 217.071373] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was
0x4, writing 0xe0710004)
[ 217.071383] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0,
writing 0x8)
[ 217.071394] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was
0x100000, writing 0x100546)
[ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state,
DEVCTL: 0x5957
[ 217.071464] pcieport 0000:a9:10.0: AER: broadcast resume message
[ 217.071467] nvme 0000:ab:00.0: PME# disabled
[ 217.071513] pcieport 0000:a9:10.0: AER: device recovery successful
[ 217.071522] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
[ 217.071526] nvme 0000:ab:00.0: vgaarb: pci_notify
[ 217.071531] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
[ 217.071614] nvme nvme0: ctrl state 6 is not RESETTING
[ 217.103486] Buffer I/O error on dev nvme0n1, logical block 2, async page read
[ 217.308778] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.308831] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.311299] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.311863] pci 0000:ab:00.0: device released
[ 217.311887] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
lnk_status = 2044
[ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
[ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
[ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status:
lnk_status = 2044
[ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
[ 217.455275] pci 0000:ab:00.0: reg 0x10: [mem 0xe0710000-0xe0717fff 64bit]
[ 217.455362] pci 0000:ab:00.0: reg 0x30: [mem 0xe0600000-0xe060ffff pref]
[ 217.455380] pci 0000:ab:00.0: Max Payload Size set to 128 (was 512, max 512)
[ 217.455726] pci 0000:ab:00.0: reg 0x20c: [mem 0xe0610000-0xe0617fff 64bit]
[ 217.455732] pci 0000:ab:00.0: VF(n) BAR0 space: [mem 0xe0610000-0xe070ffff
64bit] (contains BAR0 for 32 VFs)
[ 217.456307] pci 0000:ab:00.0: vgaarb: pci_notify
[ 217.456404] pcieport 0000:a9:10.0: bridge window [io 0x1000-0x0fff] to [bus
ab] add_size 1000
[ 217.456413] pcieport 0000:a9:10.0: bridge window [mem 0x00100000-0x000fffff
64bit pref] to [bus ab] add_size 200000 add_align 100000
[ 217.456430] pcieport 0000:a9:10.0: BAR 15: no space for [mem size 0x00200000
64bit pref]
[ 217.456436] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
0x00200000 64bit pref]
[ 217.456440] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
[ 217.456444] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
[ 217.456451] pcieport 0000:a9:10.0: BAR 15: no space for [mem size 0x00200000
64bit pref]
[ 217.456457] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
0x00200000 64bit pref]
[ 217.456464] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
[ 217.456470] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
[ 217.456480] pci 0000:ab:00.0: BAR 6: assigned [mem 0xe0600000-0xe060ffff pref]
[ 217.456488] pci 0000:ab:00.0: BAR 0: assigned [mem 0xe0610000-0xe0617fff 64bit]
[ 217.456509] pci 0000:ab:00.0: BAR 7: assigned [mem 0xe0618000-0xe0717fff 64bit]
[ 217.456517] pcieport 0000:a9:10.0: PCI bridge to [bus ab]
[ 217.456526] pcieport 0000:a9:10.0: bridge window [mem 0xe0600000-0xe07fffff]
[ 217.456614] nvme 0000:ab:00.0: vgaarb: pci_notify
[ 217.456624] nvme 0000:ab:00.0: runtime IRQ mapping not provided by arch
[ 217.457452] nvme nvme10: pci function 0000:ab:00.0
[ 217.458154] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading
0xa824144d)
[ 217.458166] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading
0x100546)
[ 217.458173] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading
0x1080200)
[ 217.458179] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
[ 217.458185] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading
0xe0610004)
[ 217.458192] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
[ 217.458198] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
[ 217.458202] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
[ 217.458207] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
[ 217.458212] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
[ 217.458217] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
[ 217.458222] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading
0xa80a144d)
[ 217.458227] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading
0xe0600000)
[ 217.458237] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
[ 217.458242] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
[ 217.458247] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 217.462192] nvme nvme10: Shutdown timeout set to 10 seconds
[ 217.520625] nvme nvme10: 63/0/0 default/read/poll queues
-Tushar
>
> Bjorn
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-10 22:39 ` Tushar Dave
@ 2023-03-10 23:53 ` Bjorn Helgaas
2023-03-11 1:45 ` Tushar Dave
0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2023-03-10 23:53 UTC (permalink / raw)
To: Tushar Dave
Cc: Sagi Grimberg, linux-nvme, kbusch, bhelgaas, linux-pci,
Lukas Wunner
[+cc Lukas, beginning of thread:
https://lore.kernel.org/all/de1b20e5-ded1-0aae-2221-f5d470d91015@nvidia.com/]
On Fri, Mar 10, 2023 at 02:39:19PM -0800, Tushar Dave wrote:
> On 3/9/23 09:53, Bjorn Helgaas wrote:
> > On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
> > > On 3/7/23 03:59, Sagi Grimberg wrote:
> > > > On 3/2/23 02:09, Tushar Dave wrote:
> > > > > We are observing the NVMe device being disabled due to a reset
> > > > > failure after injecting a Malformed TLP. DPC/AER recovery succeeds
> > > > > but NVMe fails. I tried this on 2 different systems and it is 100%
> > > > > reproducible with the 6.2 kernel.
> > > > >
> > > > > On my system, a Samsung NVMe SSD Controller PM173X is directly behind
> > > > > a Broadcom PCIe Switch Downstream Port. The Malformed TLP is injected
> > > > > by changing the Max Payload Size (MPS) of the PCIe switch to 128B
> > > > > (keeping the NVMe device MPS at 512B).
> > > > >
> > > > > e.g.
> > > > > # change MPS of PCIe switch (a9:10.0)
> > > > > $ setpci -v -s a9:10.0 cap_exp+0x8.w
> > > > > 0000:a9:10.0 (cap 10 @68) @70 = 0857
> > > > > $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
> > > > > 0000:a9:10.0 (cap 10 @68) @70 0817
> > > > > $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
> > > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
> > > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> > > > > MaxPayload 128 bytes, MaxReadReq 128 bytes
> > > > >
> > > > > # run some traffic on nvme (ab:00.0)
> > > > > $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
> > > > > dd: error reading '/dev/nvme0n1': Input/output error
> > > > > 2+0 records in
> > > > > 2+0 records out
> > > > > 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
> > > > >
> > > > > #kernel log:
> > > > > [ 163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
> > > > > [ 163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> > > > > [ 163.049071] pcieport 0000:a9:10.0: DPC: containment event,
> > > > > status:0x2009 source:0x0000
> > > > > [ 163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
> > > > > detected
> > > > > [ 163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
> > > > > severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> > > > > [ 163.078151] pcieport 0000:a9:10.0: device [1000:c030] error
> > > > > status/mask=00040000/00180000
> > > > > [ 163.087613] pcieport 0000:a9:10.0: [18] MalfTLP
> > > > > (First)
> > > > > [ 163.095281] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
> > > > > ab0000ff 00000001 d1fd0000
> > > > > [ 163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> > > > > [ 163.112095] nvme nvme0: frozen state error detected, reset controller
> > > > > [ 163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
> > > > > (sct 0x3 / sc 0x71)
> > > > > [ 163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
> > > > > 0x4080700 phys_seg 4 prio class 2
> > > > > [ 163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> > > > > [ 163.390895] nvme nvme0: restart after slot reset
> > > > > [ 163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
> > > > > (was 0x100, writing 0x1ff)
> > > > > [ 163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
> > > > > (was 0x0, writing 0xe0600000)
> > > > > [ 163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
> > > > > (was 0x4, writing 0xe0710004)
> > > > > [ 163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
> > > > > (was 0x0, writing 0x8)
> > > > > [ 163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
> > > > > (was 0x100000, writing 0x100546)
> > > > > [ 163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
> > > > > [ 163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
> > > > > (reading 0xa824144d)
> > > > > [ 163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
> > > > > (reading 0x100546)
> > > > > [ 163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
> > > > > (reading 0x1080200)
> > > > > [ 163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
> > > > > (reading 0x8)
> > > > > [ 163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
> > > > > (reading 0xe0710004)
> > > > > [ 163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
> > > > > (reading 0x0)
> > > > > [ 163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
> > > > > (reading 0x0)
> > > > > [ 163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
> > > > > (reading 0x0)
> > > > > [ 163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
> > > > > (reading 0x0)
> > > > > [ 163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
> > > > > (reading 0x0)
> > > > > [ 163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
> > > > > (reading 0x0)
> > > > > [ 163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
> > > > > (reading 0xa80a144d)
> > > > > [ 163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
> > > > > (reading 0xe0600000)
> > > > > [ 163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
> > > > > (reading 0x40)
> > > > > [ 163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
> > > > > (reading 0x0)
> > > > > [ 163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
> > > > > (reading 0x1ff)
> > > > > [ 210.089132] block nvme0n1: no usable path - requeuing I/O
> > > > > [ 223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> > > > > [ 223.825236] nvme nvme0: Identify Controller failed (-4)
> > > > > [ 223.832145] nvme nvme0: Disabling device after reset failure: -5
> > > >
> > > > At this point the device is not going to recover.
> > >
> > > Yes, I agree.
> > >
> > > I looked a little bit more and found that the nvme reset failure and the
> > > second DPC were both due to nvme_slot_reset() restoring MPS as part of
> > > pci_restore_state().
> > >
> > > AFAICT, after the first DPC event occurs, the nvme device MPS gets changed
> > > to the _default_ value of 128B (this is likely due to DPC link retraining).
> > > However, as part of software AER recovery, nvme_slot_reset() restores the
> > > device state, and that brings the nvme device MPS back to 512B. (The MPS
> > > of PCIe switch a9:10.0 still remains at 128B.)
> > >
> > > At this point, when nvme_reset_ctrl->nvme_reset_work() tries to enable the
> > > device, a Malformed TLP is generated again; that causes the second DPC and
> > > makes the NVMe controller reset fail as well.
> >
> > This sounds like the behavior I expect. IIUC:
> >
> > - Switch and NVMe MPS are 512B
> > - NVMe config space saved (including MPS=512B)
> > - You change Switch MPS to 128B
> > - NVMe does DMA with payload > 128B
> > - Switch reports Malformed TLP because TLP is larger than its MPS
> > - Recovery resets NVMe, which sets MPS to the default of 128B
> > - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> > - Subsequent NVMe DMA with payload > 128B repeats cycle
> >
> > What do you think *should* be happening here? I don't see a PCI
> > problem here. If you change MPS on the Switch without coordinating
> > with NVMe, things aren't going to work. Or am I missing something?
>
> I agree this is expected, but there are instances where I do _not_ see
> the issue occurring. That is due to the involvement of pciehp, which adds
> and configures the nvme device (coordinating MPS with the PCIe switch;
> the new MPS gets saved too, so future tests of this kind won't reproduce
> this issue, and that is understood).
>
> IMO though, the result of the test should be consistent:
> either pciehp/DPC takes care of device recovery 100% of the time,
> or we consider the nvme recovery failure an expected result because the
> MPS of the PCIe switch was changed without coordinating with nvme.
>
> What do you think?
In the log below, pciehp obviously is enabled; should I infer that in
the log above, it is not?
Generally we've avoided handling a device reset as a remove/add event
because upper layers can't deal well with that. But in the log below
it looks like pciehp *did* treat the DPC containment as a remove/add,
which of course involves configuring the "new" device and its MPS
settings.
[ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 217.071217] nvme nvme0: restart after slot reset
[ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up ignored (recovered by DPC)
[ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active: lnk_status = 2044
[ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
[ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:ab:00
[ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
[ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state, DEVCTL: 0x5957
The .slot_reset() method (nvme_slot_reset()) is called *after* the
device has been reset, and the device is supposed to be ready for the
driver to use it. But here it looks like pciehp thinks ab:00.0 is not
present, so it removes it. Later ab:00.0 is present again, so we
re-enumerate it:
[ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
[ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
[ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status: lnk_status = 2044
[ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
What kernel are you testing? 53b54ad074de ("PCI/DPC: Await readiness
of secondary bus after reset") looks like it could be related, but
you'd have to be using v6.3-rc1 or later to get it.
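FWIW, a quick way to check (in an up-to-date mainline clone) which releases
already contain that commit:
  $ git tag --contains 53b54ad074de    # lists v6.3-rc1 and later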
> e.g. [ when pciehp takes care of things]
>
> [ 216.608538] pcieport 0000:a9:10.0: pciehp: pending interrupts 0x0108 from
> Slot Status
> [ 216.639954] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 216.640429] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 216.640438] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009
> source:0x0000
> [ 216.640442] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
> [ 216.640452] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected
> (Fatal), type=Transaction Layer, (Receiver ID)
> [ 216.652549] pcieport 0000:a9:10.0: device [1000:c030] error
> status/mask=00040000/00180000
> [ 216.661975] pcieport 0000:a9:10.0: [18] MalfTLP (First)
> [ 216.669647] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff
> 00000102 276fe000
> [ 216.678890] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 216.678898] nvme nvme0: frozen state error detected, reset controller
> [ 216.842570] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct
> 0x3 / sc 0x71)
> [ 216.851684] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
> 0x4080700 phys_seg 4 prio class 2
> [ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 217.071217] nvme nvme0: restart after slot reset
> [ 217.071228] nvme 0000:ab:00.0: nvme_slot_reset: before pci_restore_state
> DEVCTL: 0x2910
> [ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up
> ignored (recovered by DPC)
> [ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
> lnk_status = 2044
> [ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
> [ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device:
> domain:bus:dev = 0000:ab:00
> [ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was
> 0x100, writing 0x1ff)
> [ 217.071346] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was
> 0x0, writing 0xe0600000)
> [ 217.071373] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was
> 0x4, writing 0xe0710004)
> [ 217.071383] nvme 0000:ab:00.0: restoring config space at offset 0xc (was
> 0x0, writing 0x8)
> [ 217.071394] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was
> 0x100000, writing 0x100546)
> [ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state,
> DEVCTL: 0x5957
> [ 217.071464] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 217.071467] nvme 0000:ab:00.0: PME# disabled
> [ 217.071513] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 217.071522] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 217.071526] nvme 0000:ab:00.0: vgaarb: pci_notify
> [ 217.071531] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
> [ 217.071614] nvme nvme0: ctrl state 6 is not RESETTING
> [ 217.103486] Buffer I/O error on dev nvme0n1, logical block 2, async page read
> [ 217.308778] pci 0000:ab:00.0: vgaarb: pci_notify
> [ 217.308831] pci 0000:ab:00.0: vgaarb: pci_notify
> [ 217.311299] pci 0000:ab:00.0: vgaarb: pci_notify
> [ 217.311863] pci 0000:ab:00.0: device released
> [ 217.311887] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
> lnk_status = 2044
> [ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
> [ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
> [ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status:
> lnk_status = 2044
> [ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
> [ 217.455275] pci 0000:ab:00.0: reg 0x10: [mem 0xe0710000-0xe0717fff 64bit]
> [ 217.455362] pci 0000:ab:00.0: reg 0x30: [mem 0xe0600000-0xe060ffff pref]
> [ 217.455380] pci 0000:ab:00.0: Max Payload Size set to 128 (was 512, max 512)
> [ 217.455726] pci 0000:ab:00.0: reg 0x20c: [mem 0xe0610000-0xe0617fff 64bit]
> [ 217.455732] pci 0000:ab:00.0: VF(n) BAR0 space: [mem
> 0xe0610000-0xe070ffff 64bit] (contains BAR0 for 32 VFs)
> [ 217.456307] pci 0000:ab:00.0: vgaarb: pci_notify
> [ 217.456404] pcieport 0000:a9:10.0: bridge window [io 0x1000-0x0fff] to
> [bus ab] add_size 1000
> [ 217.456413] pcieport 0000:a9:10.0: bridge window [mem
> 0x00100000-0x000fffff 64bit pref] to [bus ab] add_size 200000 add_align
> 100000
> [ 217.456430] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
> 0x00200000 64bit pref]
> [ 217.456436] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
> 0x00200000 64bit pref]
> [ 217.456440] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
> [ 217.456444] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
> [ 217.456451] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
> 0x00200000 64bit pref]
> [ 217.456457] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
> 0x00200000 64bit pref]
> [ 217.456464] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
> [ 217.456470] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
> [ 217.456480] pci 0000:ab:00.0: BAR 6: assigned [mem 0xe0600000-0xe060ffff pref]
> [ 217.456488] pci 0000:ab:00.0: BAR 0: assigned [mem 0xe0610000-0xe0617fff 64bit]
> [ 217.456509] pci 0000:ab:00.0: BAR 7: assigned [mem 0xe0618000-0xe0717fff 64bit]
> [ 217.456517] pcieport 0000:a9:10.0: PCI bridge to [bus ab]
> [ 217.456526] pcieport 0000:a9:10.0: bridge window [mem 0xe0600000-0xe07fffff]
> [ 217.456614] nvme 0000:ab:00.0: vgaarb: pci_notify
> [ 217.456624] nvme 0000:ab:00.0: runtime IRQ mapping not provided by arch
> [ 217.457452] nvme nvme10: pci function 0000:ab:00.0
> [ 217.458154] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading
> 0xa824144d)
> [ 217.458166] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading
> 0x100546)
> [ 217.458173] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading
> 0x1080200)
> [ 217.458179] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
> [ 217.458185] nvme 0000:ab:00.0: saving config space at offset 0x10
> (reading 0xe0610004)
> [ 217.458192] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
> [ 217.458198] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
> [ 217.458202] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
> [ 217.458207] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
> [ 217.458212] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
> [ 217.458217] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
> [ 217.458222] nvme 0000:ab:00.0: saving config space at offset 0x2c
> (reading 0xa80a144d)
> [ 217.458227] nvme 0000:ab:00.0: saving config space at offset 0x30
> (reading 0xe0600000)
> [ 217.458237] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
> [ 217.458242] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 217.458247] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
> [ 217.462192] nvme nvme10: Shutdown timeout set to 10 seconds
> [ 217.520625] nvme nvme10: 63/0/0 default/read/poll queues
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-10 23:53 ` Bjorn Helgaas
@ 2023-03-11 1:45 ` Tushar Dave
2023-03-11 8:22 ` Lukas Wunner
0 siblings, 1 reply; 15+ messages in thread
From: Tushar Dave @ 2023-03-11 1:45 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Sagi Grimberg, linux-nvme, kbusch, bhelgaas, linux-pci,
Lukas Wunner
On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> [+cc Lukas, beginning of thread:
> https://lore.kernel.org/all/de1b20e5-ded1-0aae-2221-f5d470d91015@nvidia.com/]
>
> On Fri, Mar 10, 2023 at 02:39:19PM -0800, Tushar Dave wrote:
>> On 3/9/23 09:53, Bjorn Helgaas wrote:
>>> On Wed, Mar 08, 2023 at 07:34:58PM -0800, Tushar Dave wrote:
>>>> On 3/7/23 03:59, Sagi Grimberg wrote:
>>>>> On 3/2/23 02:09, Tushar Dave wrote:
>>>>>> We are observing the NVMe device being disabled due to a reset
>>>>>> failure after injecting a Malformed TLP. DPC/AER recovery succeeds
>>>>>> but NVMe fails. I tried this on 2 different systems and it is 100%
>>>>>> reproducible with the 6.2 kernel.
>>>>>>
>>>>>> On my system, a Samsung NVMe SSD Controller PM173X is directly behind
>>>>>> a Broadcom PCIe Switch Downstream Port. The Malformed TLP is injected
>>>>>> by changing the Max Payload Size (MPS) of the PCIe switch to 128B
>>>>>> (keeping the NVMe device MPS at 512B).
>>>>>>
>>>>>> e.g.
>>>>>> # change MPS of PCIe switch (a9:10.0)
>>>>>> $ setpci -v -s a9:10.0 cap_exp+0x8.w
>>>>>> 0000:a9:10.0 (cap 10 @68) @70 = 0857
>>>>>> $ setpci -v -s a9:10.0 cap_exp+0x8.w=0x0817
>>>>>> 0000:a9:10.0 (cap 10 @68) @70 0817
>>>>>> $ lspci -s a9:10.0 -vvv | grep -w DevCtl -A 2
>>>>>> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
>>>>>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>>>> MaxPayload 128 bytes, MaxReadReq 128 bytes
>>>>>>
>>>>>> # run some traffic on nvme (ab:00.0)
>>>>>> $ dd if=/dev/nvme0n1 of=/tmp/test bs=4K
>>>>>> dd: error reading '/dev/nvme0n1': Input/output error
>>>>>> 2+0 records in
>>>>>> 2+0 records out
>>>>>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.0115304 s, 710 kB/s
>>>>>>
>>>>>> #kernel log:
>>>>>> [ 163.034889] pcieport 0000:a5:01.0: EDR: EDR event received
>>>>>> [ 163.041671] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>>>>> [ 163.049071] pcieport 0000:a9:10.0: DPC: containment event,
>>>>>> status:0x2009 source:0x0000
>>>>>> [ 163.058014] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error
>>>>>> detected
>>>>>> [ 163.066081] pcieport 0000:a9:10.0: PCIe Bus Error:
>>>>>> severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>>>>> [ 163.078151] pcieport 0000:a9:10.0: device [1000:c030] error
>>>>>> status/mask=00040000/00180000
>>>>>> [ 163.087613] pcieport 0000:a9:10.0: [18] MalfTLP
>>>>>> (First)
>>>>>> [ 163.095281] pcieport 0000:a9:10.0: AER: TLP Header: 60000080
>>>>>> ab0000ff 00000001 d1fd0000
>>>>>> [ 163.104517] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>>>>> [ 163.112095] nvme nvme0: frozen state error detected, reset controller
>>>>>> [ 163.150716] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error
>>>>>> (sct 0x3 / sc 0x71)
>>>>>> [ 163.159802] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
>>>>>> 0x4080700 phys_seg 4 prio class 2
>>>>>> [ 163.383661] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>>>>> [ 163.390895] nvme nvme0: restart after slot reset
>>>>>> [ 163.396230] nvme 0000:ab:00.0: restoring config space at offset 0x3c
>>>>>> (was 0x100, writing 0x1ff)
>>>>>> [ 163.406079] nvme 0000:ab:00.0: restoring config space at offset 0x30
>>>>>> (was 0x0, writing 0xe0600000)
>>>>>> [ 163.416212] nvme 0000:ab:00.0: restoring config space at offset 0x10
>>>>>> (was 0x4, writing 0xe0710004)
>>>>>> [ 163.426326] nvme 0000:ab:00.0: restoring config space at offset 0xc
>>>>>> (was 0x0, writing 0x8)
>>>>>> [ 163.435666] nvme 0000:ab:00.0: restoring config space at offset 0x4
>>>>>> (was 0x100000, writing 0x100546)
>>>>>> [ 163.446026] pcieport 0000:a9:10.0: AER: broadcast resume message
>>>>>> [ 163.468311] nvme 0000:ab:00.0: saving config space at offset 0x0
>>>>>> (reading 0xa824144d)
>>>>>> [ 163.477209] nvme 0000:ab:00.0: saving config space at offset 0x4
>>>>>> (reading 0x100546)
>>>>>> [ 163.485876] nvme 0000:ab:00.0: saving config space at offset 0x8
>>>>>> (reading 0x1080200)
>>>>>> [ 163.495399] nvme 0000:ab:00.0: saving config space at offset 0xc
>>>>>> (reading 0x8)
>>>>>> [ 163.504149] nvme 0000:ab:00.0: saving config space at offset 0x10
>>>>>> (reading 0xe0710004)
>>>>>> [ 163.513596] nvme 0000:ab:00.0: saving config space at offset 0x14
>>>>>> (reading 0x0)
>>>>>> [ 163.522310] nvme 0000:ab:00.0: saving config space at offset 0x18
>>>>>> (reading 0x0)
>>>>>> [ 163.531013] nvme 0000:ab:00.0: saving config space at offset 0x1c
>>>>>> (reading 0x0)
>>>>>> [ 163.539704] nvme 0000:ab:00.0: saving config space at offset 0x20
>>>>>> (reading 0x0)
>>>>>> [ 163.548353] nvme 0000:ab:00.0: saving config space at offset 0x24
>>>>>> (reading 0x0)
>>>>>> [ 163.556983] nvme 0000:ab:00.0: saving config space at offset 0x28
>>>>>> (reading 0x0)
>>>>>> [ 163.565615] nvme 0000:ab:00.0: saving config space at offset 0x2c
>>>>>> (reading 0xa80a144d)
>>>>>> [ 163.574899] nvme 0000:ab:00.0: saving config space at offset 0x30
>>>>>> (reading 0xe0600000)
>>>>>> [ 163.584215] nvme 0000:ab:00.0: saving config space at offset 0x34
>>>>>> (reading 0x40)
>>>>>> [ 163.592941] nvme 0000:ab:00.0: saving config space at offset 0x38
>>>>>> (reading 0x0)
>>>>>> [ 163.601554] nvme 0000:ab:00.0: saving config space at offset 0x3c
>>>>>> (reading 0x1ff)
>>>>>> [ 210.089132] block nvme0n1: no usable path - requeuing I/O
>>>>>> [ 223.776595] nvme nvme0: I/O 18 QID 0 timeout, disable controller
>>>>>> [ 223.825236] nvme nvme0: Identify Controller failed (-4)
>>>>>> [ 223.832145] nvme nvme0: Disabling device after reset failure: -5
>>>>>
>>>>> At this point the device is not going to recover.
>>>>
>>>> Yes, I agree.
>>>>
>>>> I looked a little bit more and found that the nvme reset failure and the
>>>> second DPC were both due to nvme_slot_reset() restoring MPS as part of
>>>> pci_restore_state().
>>>>
>>>> AFAICT, after the first DPC event occurs, the nvme device MPS gets changed
>>>> to the _default_ value of 128B (this is likely due to DPC link retraining).
>>>> However, as part of software AER recovery, nvme_slot_reset() restores the
>>>> device state, and that brings the nvme device MPS back to 512B. (The MPS
>>>> of PCIe switch a9:10.0 still remains at 128B.)
>>>>
>>>> At this point, when nvme_reset_ctrl->nvme_reset_work() tries to enable the
>>>> device, a Malformed TLP is generated again; that causes the second DPC and
>>>> makes the NVMe controller reset fail as well.
>>>
>>> This sounds like the behavior I expect. IIUC:
>>>
>>> - Switch and NVMe MPS are 512B
>>> - NVMe config space saved (including MPS=512B)
>>> - You change Switch MPS to 128B
>>> - NVMe does DMA with payload > 128B
>>> - Switch reports Malformed TLP because TLP is larger than its MPS
>>> - Recovery resets NVMe, which sets MPS to the default of 128B
>>> - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
>>> - Subsequent NVMe DMA with payload > 128B repeats cycle
>>>
>>> What do you think *should* be happening here? I don't see a PCI
>>> problem here. If you change MPS on the Switch without coordinating
>>> with NVMe, things aren't going to work. Or am I missing something?
>>
>> I agree this is expected, but there are instances where I do _not_ see
>> the issue occurring. That is due to the involvement of pciehp, which adds
>> and configures the nvme device (coordinating MPS with the PCIe switch;
>> the new MPS gets saved too, so future tests of this kind won't reproduce
>> this issue, and that is understood).
>>
>> IMO though, the result of the test should be consistent:
>> either pciehp/DPC takes care of device recovery 100% of the time,
>> or we consider the nvme recovery failure an expected result because the
>> MPS of the PCIe switch was changed without coordinating with nvme.
>>
>> What do you think?
>
> In the log below, pciehp obviously is enabled; should I infer that in
> the log above, it is not?
pciehp is enabled all the time, in the log above and below.
I do not have an answer yet as to why pciehp shows up only in some tests
(due to DPC link down/up) and not in others, as you noticed in both logs.
>
> Generally we've avoided handling a device reset as a remove/add event
> because upper layers can't deal well with that. But in the log below
> it looks like pciehp *did* treat the DPC containment as a remove/add,
> which of course involves configuring the "new" device and its MPS
> settings.
Yes, and that puzzled me: why? Especially when "Link Down/Up ignored
(recovered by DPC)" is logged. Do we still have a race somewhere? I am not sure.
>
> [ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 217.071217] nvme nvme0: restart after slot reset
> [ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up ignored (recovered by DPC)
> [ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active: lnk_status = 2044
> [ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
> [ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:ab:00
> [ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
> [ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state, DEVCTL: 0x5957
>
> The .slot_reset() method (nvme_slot_reset()) is called *after* the
> device has been reset, and the device is supposed to be ready for the
> driver to use it. But here it looks like pciehp thinks ab:00.0 is not
> present, so it removes it. Later ab:00.0 is present again, so we
> re-enumerate it:
That is correct.
>
> [ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
> [ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
> [ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status: lnk_status = 2044
> [ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
>
> What kernel are you testing? 53b54ad074de ("PCI/DPC: Await readiness
> of secondary bus after reset") looks like it could be related, but
> you'd have to be using v6.3-rc1 or later to get it.
I am on v6.2, but I will give v6.3-rc1 a try and get back.
>
>> e.g. [ when pciehp takes care of things]
>>
>> [ 216.608538] pcieport 0000:a9:10.0: pciehp: pending interrupts 0x0108 from
>> Slot Status
>> [ 216.639954] pcieport 0000:a5:01.0: EDR: EDR event received
>> [ 216.640429] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>> [ 216.640438] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009
>> source:0x0000
>> [ 216.640442] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>> [ 216.640452] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected
>> (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 216.652549] pcieport 0000:a9:10.0: device [1000:c030] error
>> status/mask=00040000/00180000
>> [ 216.661975] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>> [ 216.669647] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff
>> 00000102 276fe000
>> [ 216.678890] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>> [ 216.678898] nvme nvme0: frozen state error detected, reset controller
>> [ 216.842570] nvme0c0n1: I/O Cmd(0x2) @ LBA 16, 32 blocks, I/O Error (sct
>> 0x3 / sc 0x71)
>> [ 216.851684] I/O error, dev nvme0c0n1, sector 16 op 0x0:(READ) flags
>> 0x4080700 phys_seg 4 prio class 2
>> [ 217.071200] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>> [ 217.071217] nvme nvme0: restart after slot reset
>> [ 217.071228] nvme 0000:ab:00.0: nvme_slot_reset: before pci_restore_state
>> DEVCTL: 0x2910
>> [ 217.071234] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Down/Up
>> ignored (recovered by DPC)
>> [ 217.071250] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
>> lnk_status = 2044
>> [ 217.071259] pcieport 0000:a9:10.0: pciehp: Slot(272): Card not present
>> [ 217.071267] pcieport 0000:a9:10.0: pciehp: pciehp_unconfigure_device:
>> domain:bus:dev = 0000:ab:00
>> [ 217.071320] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was
>> 0x100, writing 0x1ff)
>> [ 217.071346] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was
>> 0x0, writing 0xe0600000)
>> [ 217.071373] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was
>> 0x4, writing 0xe0710004)
>> [ 217.071383] nvme 0000:ab:00.0: restoring config space at offset 0xc (was
>> 0x0, writing 0x8)
>> [ 217.071394] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was
>> 0x100000, writing 0x100546)
>> [ 217.071451] nvme 0000:ab:00.0: nvme_slot_reset: after pci_restore_state,
>> DEVCTL: 0x5957
>> [ 217.071464] pcieport 0000:a9:10.0: AER: broadcast resume message
>> [ 217.071467] nvme 0000:ab:00.0: PME# disabled
>> [ 217.071513] pcieport 0000:a9:10.0: AER: device recovery successful
>> [ 217.071522] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>> [ 217.071526] nvme 0000:ab:00.0: vgaarb: pci_notify
>> [ 217.071531] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>> [ 217.071614] nvme nvme0: ctrl state 6 is not RESETTING
>> [ 217.103486] Buffer I/O error on dev nvme0n1, logical block 2, async page read
>> [ 217.308778] pci 0000:ab:00.0: vgaarb: pci_notify
>> [ 217.308831] pci 0000:ab:00.0: vgaarb: pci_notify
>> [ 217.311299] pci 0000:ab:00.0: vgaarb: pci_notify
>> [ 217.311863] pci 0000:ab:00.0: device released
>> [ 217.311887] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_active:
>> lnk_status = 2044
>> [ 217.311892] pcieport 0000:a9:10.0: pciehp: Slot(272): Card present
>> [ 217.311897] pcieport 0000:a9:10.0: pciehp: Slot(272): Link Up
>> [ 217.455159] pcieport 0000:a9:10.0: pciehp: pciehp_check_link_status:
>> lnk_status = 2044
>> [ 217.455222] pci 0000:ab:00.0: [144d:a824] type 00 class 0x010802
>> [ 217.455275] pci 0000:ab:00.0: reg 0x10: [mem 0xe0710000-0xe0717fff 64bit]
>> [ 217.455362] pci 0000:ab:00.0: reg 0x30: [mem 0xe0600000-0xe060ffff pref]
>> [ 217.455380] pci 0000:ab:00.0: Max Payload Size set to 128 (was 512, max 512)
>> [ 217.455726] pci 0000:ab:00.0: reg 0x20c: [mem 0xe0610000-0xe0617fff 64bit]
>> [ 217.455732] pci 0000:ab:00.0: VF(n) BAR0 space: [mem
>> 0xe0610000-0xe070ffff 64bit] (contains BAR0 for 32 VFs)
>> [ 217.456307] pci 0000:ab:00.0: vgaarb: pci_notify
>> [ 217.456404] pcieport 0000:a9:10.0: bridge window [io 0x1000-0x0fff] to
>> [bus ab] add_size 1000
>> [ 217.456413] pcieport 0000:a9:10.0: bridge window [mem
>> 0x00100000-0x000fffff 64bit pref] to [bus ab] add_size 200000 add_align
>> 100000
>> [ 217.456430] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
>> 0x00200000 64bit pref]
>> [ 217.456436] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
>> 0x00200000 64bit pref]
>> [ 217.456440] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
>> [ 217.456444] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
>> [ 217.456451] pcieport 0000:a9:10.0: BAR 15: no space for [mem size
>> 0x00200000 64bit pref]
>> [ 217.456457] pcieport 0000:a9:10.0: BAR 15: failed to assign [mem size
>> 0x00200000 64bit pref]
>> [ 217.456464] pcieport 0000:a9:10.0: BAR 13: no space for [io size 0x1000]
>> [ 217.456470] pcieport 0000:a9:10.0: BAR 13: failed to assign [io size 0x1000]
>> [ 217.456480] pci 0000:ab:00.0: BAR 6: assigned [mem 0xe0600000-0xe060ffff pref]
>> [ 217.456488] pci 0000:ab:00.0: BAR 0: assigned [mem 0xe0610000-0xe0617fff 64bit]
>> [ 217.456509] pci 0000:ab:00.0: BAR 7: assigned [mem 0xe0618000-0xe0717fff 64bit]
>> [ 217.456517] pcieport 0000:a9:10.0: PCI bridge to [bus ab]
>> [ 217.456526] pcieport 0000:a9:10.0: bridge window [mem 0xe0600000-0xe07fffff]
>> [ 217.456614] nvme 0000:ab:00.0: vgaarb: pci_notify
>> [ 217.456624] nvme 0000:ab:00.0: runtime IRQ mapping not provided by arch
>> [ 217.457452] nvme nvme10: pci function 0000:ab:00.0
>> [ 217.458154] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading
>> 0xa824144d)
>> [ 217.458166] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading
>> 0x100546)
>> [ 217.458173] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading
>> 0x1080200)
>> [ 217.458179] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
>> [ 217.458185] nvme 0000:ab:00.0: saving config space at offset 0x10
>> (reading 0xe0610004)
>> [ 217.458192] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
>> [ 217.458198] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
>> [ 217.458202] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
>> [ 217.458207] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
>> [ 217.458212] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
>> [ 217.458217] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
>> [ 217.458222] nvme 0000:ab:00.0: saving config space at offset 0x2c
>> (reading 0xa80a144d)
>> [ 217.458227] nvme 0000:ab:00.0: saving config space at offset 0x30
>> (reading 0xe0600000)
>> [ 217.458237] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
>> [ 217.458242] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
>> [ 217.458247] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
>> [ 217.462192] nvme nvme10: Shutdown timeout set to 10 seconds
>> [ 217.520625] nvme nvme10: 63/0/0 default/read/poll queues
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-11 1:45 ` Tushar Dave
@ 2023-03-11 8:22 ` Lukas Wunner
2023-03-11 16:46 ` Keith Busch
2023-03-14 0:57 ` Tushar Dave
0 siblings, 2 replies; 15+ messages in thread
From: Lukas Wunner @ 2023-03-11 8:22 UTC (permalink / raw)
To: Tushar Dave, Bjorn Helgaas; +Cc: Sagi Grimberg, linux-nvme, kbusch, linux-pci
On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> > In the log below, pciehp obviously is enabled; should I infer that in
> > the log above, it is not?
>
> pciehp is enabled all the time, in the log above and below.
> I do not have an answer yet as to why pciehp shows up only in some tests
> (due to DPC link down/up) and not in others, as you noticed in both logs.
Maybe some of the switch Downstream Ports are hotplug-capable and
some are not? (Check the Slot Implemented bit in the PCI Express
Capabilities Register as well as the Hot-Plug Capable bit in the
Slot Capabilities Register.)
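For illustration, a minimal kernel-side sketch of that check (a sketch only,
assuming the generic pcie_capability_*() accessors; port_is_hotplug_capable()
is a made-up helper name):

static bool port_is_hotplug_capable(struct pci_dev *bridge)
{
        u16 flags;
        u32 sltcap;

        /* Slot Implemented bit in the PCI Express Capabilities Register */
        pcie_capability_read_word(bridge, PCI_EXP_FLAGS, &flags);
        if (!(flags & PCI_EXP_FLAGS_SLOT))
                return false;

        /* Hot-Plug Capable bit in the Slot Capabilities Register */
        pcie_capability_read_dword(bridge, PCI_EXP_SLTCAP, &sltcap);
        return !!(sltcap & PCI_EXP_SLTCAP_HPC);
}

The setpci equivalents of both reads appear further down in the thread.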
> > Generally we've avoided handling a device reset as a remove/add event
> > because upper layers can't deal well with that. But in the log below
> > it looks like pciehp *did* treat the DPC containment as a remove/add,
> > which of course involves configuring the "new" device and its MPS
> > settings.
>
> yes, and that puzzled me: why? Especially when "Link Down/Up ignored
> (recovered by DPC)" was logged. Do we still have a race somewhere? I am not sure.
You're seeing the expected behavior. pciehp ignores DLLSC events
caused by DPC, but then double-checks that DPC recovery succeeded.
If it didn't, it would be a bug not to bring down the slot.
So pciehp does exactly that. See this code snippet in
pciehp_ignore_dpc_link_change():
/*
* If the link is unexpectedly down after successful recovery,
* the corresponding link change may have been ignored above.
* Synthesize it to ensure that it is acted on.
*/
down_read_nested(&ctrl->reset_lock, ctrl->depth);
if (!pciehp_check_link_active(ctrl))
pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
up_read(&ctrl->reset_lock);
So on hotplug-capable ports, pciehp is able to mop up the mess created
by fiddling with the MPS settings behind the kernel's back.
We don't have that option on non-hotplug-capable ports. If error
recovery fails, we generally let the inaccessible devices remain
in the system and user interaction is necessary to recover, either
through a reboot or by manually removing and rescanning PCI devices
via sysfs after reinstating sane MPS settings.
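For reference, a rough sketch of the kernel-side equivalents of that manual
recovery (a sketch under assumptions, error handling omitted; "dev" stands
for the dead endpoint and "bridge" for its Downstream Port):

/* what "echo 1 > /sys/bus/pci/devices/.../remove" boils down to */
pci_stop_and_remove_bus_device_locked(dev);

/* reinstate a sane MPS on the port (512 bytes here), then rescan,
 * which is what "echo 1 > /sys/bus/pci/rescan" boils down to */
pcie_set_mps(bridge, 512);
pci_lock_rescan_remove();
pci_rescan_bus(bridge->subordinate);
pci_unlock_rescan_remove();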
> - Switch and NVMe MPS are 512B
> - NVMe config space saved (including MPS=512B)
> - You change Switch MPS to 128B
> - NVMe does DMA with payload > 128B
> - Switch reports Malformed TLP because TLP is larger than its MPS
> - Recovery resets NVMe, which sets MPS to the default of 128B
> - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> - Subsequent NVMe DMA with payload > 128B repeats cycle
Forgive my ignorance, but if MPS is restored to 512B by nvme_slot_reset(),
shouldn't the communication with the device just work again from that
point on?
Thanks,
Lukas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-11 8:22 ` Lukas Wunner
@ 2023-03-11 16:46 ` Keith Busch
2023-03-14 0:57 ` Tushar Dave
1 sibling, 0 replies; 15+ messages in thread
From: Keith Busch @ 2023-03-11 16:46 UTC (permalink / raw)
To: Lukas Wunner
Cc: Tushar Dave, Bjorn Helgaas, Sagi Grimberg, linux-nvme, linux-pci
On Sat, Mar 11, 2023 at 09:22:20AM +0100, Lukas Wunner wrote:
> > - Switch and NVMe MPS are 512B
> > - NVMe config space saved (including MPS=512B)
> > - You change Switch MPS to 128B
> > - NVMe does DMA with payload > 128B
> > - Switch reports Malformed TLP because TLP is larger than its MPS
> > - Recovery resets NVMe, which sets MPS to the default of 128B
> > - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> > - Subsequent NVMe DMA with payload > 128B repeats cycle
>
> Forgive my ignorance, but if MPS is restored to 512B by nvme_slot_reset(),
> shouldn't the communication with the device just work again from that
> point on?
The upstream port was tuned down to 128 without coordinating with the kernel, so
restoring the nvme to 512 recreates the mismatch.
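A minimal sketch of that mismatch, in terms of the existing pcie_get_mps()
helper, which returns the configured MPS in bytes (mps_matches_upstream() is
a made-up name):

static bool mps_matches_upstream(struct pci_dev *dev)
{
        struct pci_dev *bridge = pci_upstream_bridge(dev);

        /* after pci_restore_state(), dev may be back at 512B while
         * the port above it is still at 128B */
        return !bridge || pcie_get_mps(bridge) == pcie_get_mps(dev);
}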
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-11 8:22 ` Lukas Wunner
2023-03-11 16:46 ` Keith Busch
@ 2023-03-14 0:57 ` Tushar Dave
2023-03-14 16:11 ` Bjorn Helgaas
1 sibling, 1 reply; 15+ messages in thread
From: Tushar Dave @ 2023-03-14 0:57 UTC (permalink / raw)
To: Lukas Wunner, Bjorn Helgaas; +Cc: Sagi Grimberg, linux-nvme, kbusch, linux-pci
On 3/11/23 00:22, Lukas Wunner wrote:
> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>> In the log below, pciehp obviously is enabled; should I infer that in
>>> the log above, it is not?
>>
>> pciehp is enabled all the time, in the log above and below.
>> I do not have an answer yet as to why pciehp shows up only in some tests (due
>> to DPC link down/up) and not in others, as you noticed in both logs.
>
> Maybe some of the switch Downstream Ports are hotplug-capable and
> some are not? (Check the Slot Implemented bit in the PCI Express
> Capabilities Register as well as the Hot-Plug Capable bit in the
> Slot Capabilities Register.)
All PCIe switch Downstream ports where nvmes are connected are hotplug-capable,
except the one for the boot nvme drive, which I would never use in my test :)
For my test, the PCIe switch Downstream port is a9:10.0, with nvme ab:00.0
connected to it as the endpoint device.
# PCI Express Cap - Slot Implemented bit(8) is set to 1
root$ setpci -v -s a9:10.0 cap_exp+0x02.w
0000:a9:10.0 (cap 10 @68) @6a = 0162
# Slt Cap - Hot-Plug Capable bit(6) is set to 1
root$ setpci -v -s a9:10.0 cap_exp+0x14.L
0000:a9:10.0 (cap 10 @68) @7c = 08800ce0
# Slt Control
root$ setpci -v -s a9:10.0 cap_exp+0x18.w
0000:a9:10.0 (cap 10 @68) @80 = 1028
# Slt Status
root$ setpci -v -s a9:10.0 cap_exp+0x1A.w
0000:a9:10.0 (cap 10 @68) @82 = 0040
FWIW, after the MPS is changed and the issue reproduces:
root$ setpci -v -s a9:10.0 cap_exp+0x014.L
0000:a9:10.0 (cap 10 @68) @7c = 08800ce0
root$ setpci -v -s a9:10.0 cap_exp+0x018.w
0000:a9:10.0 (cap 10 @68) @80 = 1028
root$ setpci -v -s a9:10.0 cap_exp+0x01A.w
0000:a9:10.0 (cap 10 @68) @82 = 0148
>
>
>>> Generally we've avoided handling a device reset as a remove/add event
>>> because upper layers can't deal well with that. But in the log below
>>> it looks like pciehp *did* treat the DPC containment as a remove/add,
>>> which of course involves configuring the "new" device and its MPS
>>> settings.
>>
>> yes, and that puzzled me: why? Especially when "Link Down/Up ignored
>> (recovered by DPC)" was logged. Do we still have a race somewhere? I am not sure.
>
> You're seeing the expected behavior. pciehp ignores DLLSC events
> caused by DPC, but then double-checks that DPC recovery succeeded.
> If it didn't, it would be a bug not to bring down the slot.
> So pciehp does exactly that. See this code snippet in
> pciehp_ignore_dpc_link_change():
>
> /*
> * If the link is unexpectedly down after successful recovery,
> * the corresponding link change may have been ignored above.
> * Synthesize it to ensure that it is acted on.
> */
> down_read_nested(&ctrl->reset_lock, ctrl->depth);
> if (!pciehp_check_link_active(ctrl))
> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> up_read(&ctrl->reset_lock);
>
> So on hotplug-capable ports, pciehp is able to mop up the mess created
> by fiddling with the MPS settings behind the kernel's back.
That's the thing: even on a hotplug-capable slot I do not see pciehp _all_ the
time. Sometimes pciehp gets involved and takes care of things (like I mentioned
in the previous thread) and other times there is no pciehp engagement at all!
>
> We don't have that option on non-hotplug-capable ports. If error
> recovery fails, we generally let the inaccessible devices remain
> in the system and user interaction is necessary to recover, either
> through a reboot or by manually removing and rescanning PCI devices
> via sysfs after reinstating sane MPS settings.
Sure, that is understood for non-hotplug-capable slots.
>
>
>> - Switch and NVMe MPS are 512B
>> - NVMe config space saved (including MPS=512B)
>> - You change Switch MPS to 128B
>> - NVMe does DMA with payload > 128B
>> - Switch reports Malformed TLP because TLP is larger than its MPS
>> - Recovery resets NVMe, which sets MPS to the default of 128B
>> - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
>> - Subsequent NVMe DMA with payload > 128B repeats cycle
>
> Forgive my ignorance, but if MPS is restored to 512B by nvme_slot_reset(),
> shouldn't the communication with the device just work again from that
> point on?
>
> Thanks,
>
> Lukas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-14 0:57 ` Tushar Dave
@ 2023-03-14 16:11 ` Bjorn Helgaas
2023-03-14 17:26 ` Keith Busch
0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2023-03-14 16:11 UTC (permalink / raw)
To: Tushar Dave; +Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, kbusch, linux-pci
On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
> On 3/11/23 00:22, Lukas Wunner wrote:
> > On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
> > > On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> > > > In the log below, pciehp obviously is enabled; should I infer that in
> > > > the log above, it is not?
> > >
> > > pciehp is enabled all the time, in the log above and below.
> > > I do not have an answer yet as to why pciehp shows up only in some tests
> > > (due to DPC link down/up) and not in others, as you noticed in both logs.
> >
> > Maybe some of the switch Downstream Ports are hotplug-capable and
> > some are not? (Check the Slot Implemented bit in the PCI Express
> > Capabilities Register as well as the Hot-Plug Capable bit in the
> > Slot Capabilities Register.)
> > ...
> > > > Generally we've avoided handling a device reset as a
> > > > remove/add event because upper layers can't deal well with
> > > > that. But in the log below it looks like pciehp *did* treat
> > > > the DPC containment as a remove/add, which of course involves
> > > > configuring the "new" device and its MPS settings.
> > >
> > > yes, and that puzzled me: why? Especially when "Link Down/Up
> > > ignored (recovered by DPC)" was logged. Do we still have a race
> > > somewhere? I am not sure.
> >
> > You're seeing the expected behavior. pciehp ignores DLLSC events
> > caused by DPC, but then double-checks that DPC recovery succeeded.
> > If it didn't, it would be a bug not to bring down the slot. So
> > pciehp does exactly that. See this code snippet in
> > pciehp_ignore_dpc_link_change():
> >
> > /*
> > * If the link is unexpectedly down after successful recovery,
> > * the corresponding link change may have been ignored above.
> > * Synthesize it to ensure that it is acted on.
> > */
> > down_read_nested(&ctrl->reset_lock, ctrl->depth);
> > if (!pciehp_check_link_active(ctrl))
> > pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> > up_read(&ctrl->reset_lock);
> >
> > So on hotplug-capable ports, pciehp is able to mop up the mess
> > created by fiddling with the MPS settings behind the kernel's
> > back.
>
> That's the thing: even on a hotplug-capable slot I do not see pciehp
> _all_ the time. Sometimes pciehp gets involved and takes care of things
> (like I mentioned in the previous thread) and other times there is no
> pciehp engagement at all!
Possibly a timing issue, so I'll be interested to see if 53b54ad074de
("PCI/DPC: Await readiness of secondary bus after reset") makes any
difference. Lukas didn't mention that, so maybe it's a red herring,
but I'm still curious since it explicitly mentions the DPC reset case
that you're exercising here.
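Conceptually, what that commit adds after a DPC-triggered reset is something
like the following (a sketch, not the actual patch; pcie_wait_for_link() is
the existing drivers/pci helper):

/* after releasing containment on the Downstream Port ... */
if (!pcie_wait_for_link(bridge, true))
        return PCI_ERS_RESULT_DISCONNECT;       /* link never came back */
msleep(100);    /* settle time before the first config accesses */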
Bjorn
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-14 16:11 ` Bjorn Helgaas
@ 2023-03-14 17:26 ` Keith Busch
2023-03-15 20:01 ` Tushar Dave
0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2023-03-14 17:26 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Tushar Dave, Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
> > On 3/11/23 00:22, Lukas Wunner wrote:
> > > On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
> > > > On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> > > > > In the log below, pciehp obviously is enabled; should I infer that in
> > > > > the log above, it is not?
> > > >
> > > > pciehp is enabled all the time, in the log above and below.
> > > > I do not have an answer yet as to why pciehp shows up only in some tests
> > > > (due to DPC link down/up) and not in others, as you noticed in both logs.
> > >
> > > Maybe some of the switch Downstream Ports are hotplug-capable and
> > > some are not? (Check the Slot Implemented bit in the PCI Express
> > > Capabilities Register as well as the Hot-Plug Capable bit in the
> > > Slot Capabilities Register.)
> > > ...
>
> > > > > Generally we've avoided handling a device reset as a
> > > > > remove/add event because upper layers can't deal well with
> > > > > that. But in the log below it looks like pciehp *did* treat
> > > > > the DPC containment as a remove/add, which of course involves
> > > > > configuring the "new" device and its MPS settings.
> > > >
> > > > yes, and that puzzled me: why? Especially when "Link Down/Up
> > > > ignored (recovered by DPC)" was logged. Do we still have a race
> > > > somewhere? I am not sure.
> > >
> > > You're seeing the expected behavior. pciehp ignores DLLSC events
> > > caused by DPC, but then double-checks that DPC recovery succeeded.
> > > If it didn't, it would be a bug not to bring down the slot. So
> > > pciehp does exactly that. See this code snippet in
> > > pciehp_ignore_dpc_link_change():
> > >
> > > /*
> > > * If the link is unexpectedly down after successful recovery,
> > > * the corresponding link change may have been ignored above.
> > > * Synthesize it to ensure that it is acted on.
> > > */
> > > down_read_nested(&ctrl->reset_lock, ctrl->depth);
> > > if (!pciehp_check_link_active(ctrl))
> > > pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> > > up_read(&ctrl->reset_lock);
> > >
> > > So on hotplug-capable ports, pciehp is able to mop up the mess
> > > created by fiddling with the MPS settings behind the kernel's
> > > back.
> >
> > That's the thing: even on a hotplug-capable slot I do not see pciehp
> > _all_ the time. Sometimes pciehp gets involved and takes care of things
> > (like I mentioned in the previous thread) and other times there is no
> > pciehp engagement at all!
>
> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
> difference. Lukas didn't mention that, so maybe it's a red herring,
> but I'm still curious since it explicitly mentions the DPC reset case
> that you're exercising here.
Catching the PDC event may be timing related. pciehp ignores the link events
during a DPC event, but it always reacts to PDC since it's indistinguishable
from a DPC occurring in response to a surprise removal, and these slots probably
don't have out-of-band presence detection.
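Simplified, the distinction looks something like this in the pciehp interrupt
path (a sketch, not a verbatim quote; dpc_recovery_in_progress stands in for
the actual DPC-triggered check):

/* a DLLSC event that DPC will recover from can be filtered out ... */
if ((events & PCI_EXP_SLTSTA_DLLSC) && dpc_recovery_in_progress)
        events &= ~PCI_EXP_SLTSTA_DLLSC;

/* ... but PDC cannot: it may be a genuine surprise removal */
if (events & PCI_EXP_SLTSTA_PDC)
        pciehp_request(ctrl, events);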
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-14 17:26 ` Keith Busch
@ 2023-03-15 20:01 ` Tushar Dave
2023-03-15 20:43 ` Sathyanarayanan Kuppuswamy
2023-03-18 0:15 ` Tushar Dave
0 siblings, 2 replies; 15+ messages in thread
From: Tushar Dave @ 2023-03-15 20:01 UTC (permalink / raw)
To: Keith Busch, Bjorn Helgaas
Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On 3/14/23 10:26, Keith Busch wrote:
> On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
>> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
>>> On 3/11/23 00:22, Lukas Wunner wrote:
>>>> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>>>>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>>>>> In the log below, pciehp obviously is enabled; should I infer that in
>>>>>> the log above, it is not?
>>>>>
>>>>> pciehp is enabled all the time, in the log above and below.
>>>>> I do not have an answer yet as to why pciehp shows up only in some tests
>>>>> (due to DPC link down/up) and not in others, as you noticed in both logs.
>>>>
>>>> Maybe some of the switch Downstream Ports are hotplug-capable and
>>>> some are not? (Check the Slot Implemented bit in the PCI Express
>>>> Capabilities Register as well as the Hot-Plug Capable bit in the
>>>> Slot Capabilities Register.)
>>>> ...
>>
>>>>>> Generally we've avoided handling a device reset as a
>>>>>> remove/add event because upper layers can't deal well with
>>>>>> that. But in the log below it looks like pciehp *did* treat
>>>>>> the DPC containment as a remove/add, which of course involves
>>>>>> configuring the "new" device and its MPS settings.
>>>>>
>>>>> yes, and that puzzled me: why? Especially when "Link Down/Up
>>>>> ignored (recovered by DPC)" was logged. Do we still have a race
>>>>> somewhere? I am not sure.
>>>>
>>>> You're seeing the expected behavior. pciehp ignores DLLSC events
>>>> caused by DPC, but then double-checks that DPC recovery succeeded.
>>>> If it didn't, it would be a bug not to bring down the slot. So
>>>> pciehp does exactly that. See this code snippet in
>>>> pciehp_ignore_dpc_link_change():
>>>>
>>>> /*
>>>> * If the link is unexpectedly down after successful recovery,
>>>> * the corresponding link change may have been ignored above.
>>>> * Synthesize it to ensure that it is acted on.
>>>> */
>>>> down_read_nested(&ctrl->reset_lock, ctrl->depth);
>>>> if (!pciehp_check_link_active(ctrl))
>>>> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
>>>> up_read(&ctrl->reset_lock);
>>>>
>>>> So on hotplug-capable ports, pciehp is able to mop up the mess
>>>> created by fiddling with the MPS settings behind the kernel's
>>>> back.
>>>
>>> That's the thing: even on a hotplug-capable slot I do not see pciehp
>>> _all_ the time. Sometimes pciehp gets involved and takes care of things
>>> (like I mentioned in the previous thread) and other times there is no
>>> pciehp engagement at all!
>>
>> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
>> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
>> difference. Lukas didn't mention that, so maybe it's a red herring,
>> but I'm still curious since it explicitly mentions the DPC reset case
>> that you're exercising here.
Commit 53b54ad074de ("PCI/DPC: Await readiness of secondary bus after reset")
didn't help.
[ 6265.268757] pcieport 0000:a5:01.0: EDR: EDR event received
[ 6265.276034] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[ 6265.283780] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
[ 6265.292972] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
[ 6265.301284] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[ 6265.313569] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
[ 6265.323208] pcieport 0000:a9:10.0: [18] MalfTLP (First)
[ 6265.331084] pcieport 0000:a9:10.0: AER: TLP Header: 6000007a ab0000ff 00000001 629d4318
[ 6265.340536] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[ 6265.348320] nvme nvme1: frozen state error detected, reset controller
[ 6265.419633] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
[ 6265.627639] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 6265.635289] nvme nvme1: restart after slot reset
[ 6265.641016] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
[ 6265.651248] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
[ 6265.661739] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
[ 6265.672210] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
[ 6265.681897] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
[ 6265.692616] pcieport 0000:a9:10.0: AER: broadcast resume message
[ 6265.716299] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
[ 6265.725614] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
[ 6265.734657] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
[ 6265.743824] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
[ 6265.752348] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
[ 6265.761647] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
[ 6265.770247] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
[ 6265.778857] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
[ 6265.787450] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
[ 6265.796034] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
[ 6265.804620] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
[ 6265.813201] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
[ 6265.822473] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
[ 6265.831816] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
[ 6265.840482] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
[ 6265.849037] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 6275.037534] block nvme1n1: no usable path - requeuing I/O
[ 6326.920009] nvme nvme1: I/O 22 QID 0 timeout, disable controller
[ 6326.988701] nvme nvme1: Identify Controller failed (-4)
[ 6326.995253] nvme nvme1: Disabling device after reset failure: -5
[ 6327.032308] pcieport 0000:a9:10.0: AER: device recovery successful
[ 6327.039781] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
[ 6327.047687] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
[ 6327.083131] pcieport 0000:a5:01.0: EDR: EDR event received
[ 6327.090173] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
[ 6327.097816] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
[ 6327.107009] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
[ 6327.115330] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[ 6327.127640] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
[ 6327.137319] pcieport 0000:a9:10.0: [18] MalfTLP (First)
[ 6327.145236] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff 00000001 5ad65000
[ 6327.154728] pcieport 0000:a9:10.0: AER: broadcast error_detected message
[ 6327.162624] nvme nvme1: frozen state error detected, reset controller
[ 6327.183979] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
[ 6327.387969] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
[ 6327.395596] nvme nvme1: restart after slot reset
[ 6327.401313] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
[ 6327.411517] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
[ 6327.422045] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
[ 6327.432523] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
[ 6327.442212] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
[ 6327.452933] pcieport 0000:a9:10.0: AER: broadcast resume message
[ 6327.460184] pcieport 0000:a9:10.0: AER: device recovery successful
[ 6327.467533] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
[ 6327.475367] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>
> Catching the PDC event may be timing related. pciehp ignores the link events
> during a DPC event, but it always reacts to PDC since it's indistinguishable
> from a DPC occurring in response to a surprise removal, and these slots probably
> don't have out-of-band presence detection.
Yeah, the In-Band PD Disable bit in the Slot Control register of the PCIe
Downstream Switch port is set to '0'; no idea about out-of-band presence detection!
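If it helps, a sketch of checking both bits from the kernel side (assuming
the PCI_EXP_SLTCAP2_IBPD and PCI_EXP_SLTCTL_IBPD_DISABLE definitions;
oob_pd_only() is a made-up helper name):

static bool oob_pd_only(struct pci_dev *bridge)
{
        u32 sltcap2;
        u16 sltctl;

        pcie_capability_read_dword(bridge, PCI_EXP_SLTCAP2, &sltcap2);
        pcie_capability_read_word(bridge, PCI_EXP_SLTCTL, &sltctl);

        /* the port relies on out-of-band PD only if it supports
         * In-Band PD Disable and that bit is actually set */
        return (sltcap2 & PCI_EXP_SLTCAP2_IBPD) &&
               (sltctl & PCI_EXP_SLTCTL_IBPD_DISABLE);
}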
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-15 20:01 ` Tushar Dave
@ 2023-03-15 20:43 ` Sathyanarayanan Kuppuswamy
2023-03-15 22:16 ` Tushar Dave
2023-03-18 0:15 ` Tushar Dave
1 sibling, 1 reply; 15+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-03-15 20:43 UTC (permalink / raw)
To: Tushar Dave, Keith Busch, Bjorn Helgaas
Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On 3/15/23 1:01 PM, Tushar Dave wrote:
>
>
> On 3/14/23 10:26, Keith Busch wrote:
>> On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
>>> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
>>>> On 3/11/23 00:22, Lukas Wunner wrote:
>>>>> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>>>>>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>>>>>> In the log below, pciehp obviously is enabled; should I infer that in
>>>>>>> the log above, it is not?
>>>>>>
>>>>>> pciehp is enabled all the time, in the log above and below.
>>>>>> I do not have an answer yet as to why pciehp shows up only in some tests
>>>>>> (due to DPC link down/up) and not in others, as you noticed in both logs.
>>>>>
>>>>> Maybe some of the switch Downstream Ports are hotplug-capable and
>>>>> some are not? (Check the Slot Implemented bit in the PCI Express
>>>>> Capabilities Register as well as the Hot-Plug Capable bit in the
>>>>> Slot Capabilities Register.)
>>>>> ...
>>>
>>>>>>> Generally we've avoided handling a device reset as a
>>>>>>> remove/add event because upper layers can't deal well with
>>>>>>> that. But in the log below it looks like pciehp *did* treat
>>>>>>> the DPC containment as a remove/add, which of course involves
>>>>>>> configuring the "new" device and its MPS settings.
>>>>>>
>>>>>> yes, and that puzzled me: why? Especially when "Link Down/Up
>>>>>> ignored (recovered by DPC)" was logged. Do we still have a race
>>>>>> somewhere? I am not sure.
>>>>>
>>>>> You're seeing the expected behavior. pciehp ignores DLLSC events
>>>>> caused by DPC, but then double-checks that DPC recovery succeeded.
>>>>> If it didn't, it would be a bug not to bring down the slot. So
>>>>> pciehp does exactly that. See this code snippet in
>>>>> pciehp_ignore_dpc_link_change():
>>>>>
>>>>> /*
>>>>> * If the link is unexpectedly down after successful recovery,
>>>>> * the corresponding link change may have been ignored above.
>>>>> * Synthesize it to ensure that it is acted on.
>>>>> */
>>>>> down_read_nested(&ctrl->reset_lock, ctrl->depth);
>>>>> if (!pciehp_check_link_active(ctrl))
>>>>> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
>>>>> up_read(&ctrl->reset_lock);
>>>>>
>>>>> So on hotplug-capable ports, pciehp is able to mop up the mess
>>>>> created by fiddling with the MPS settings behind the kernel's
>>>>> back.
>>>>
>>>> That's the thing: even on a hotplug-capable slot I do not see pciehp
>>>> _all_ the time. Sometimes pciehp gets involved and takes care of things
>>>> (like I mentioned in the previous thread) and other times there is no
>>>> pciehp engagement at all!
>>>
>>> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
>>> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
>>> difference. Lukas didn't mention that, so maybe it's a red herring,
>>> but I'm still curious since it explicitly mentions the DPC reset case
>>> that you're exercising here.
>
> Commit 53b54ad074de ("PCI/DPC: Await readiness of secondary bus after reset") didn't help.
I did not check the full thread. Since this seems to be in the EDR recovery
path, make sure to include the following patch.
https://lore.kernel.org/lkml/20230215200532.3126937-1-sathyanarayanan.kuppuswamy@linux.intel.com/T/
>
> [ 6265.268757] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 6265.276034] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 6265.283780] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
> [ 6265.292972] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
> [ 6265.301284] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [ 6265.313569] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
> [ 6265.323208] pcieport 0000:a9:10.0: [18] MalfTLP (First)
> [ 6265.331084] pcieport 0000:a9:10.0: AER: TLP Header: 6000007a ab0000ff 00000001 629d4318
> [ 6265.340536] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 6265.348320] nvme nvme1: frozen state error detected, reset controller
> [ 6265.419633] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
> [ 6265.627639] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 6265.635289] nvme nvme1: restart after slot reset
> [ 6265.641016] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
> [ 6265.651248] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
> [ 6265.661739] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
> [ 6265.672210] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
> [ 6265.681897] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
> [ 6265.692616] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 6265.716299] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
> [ 6265.725614] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
> [ 6265.734657] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
> [ 6265.743824] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
> [ 6265.752348] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
> [ 6265.761647] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
> [ 6265.770247] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
> [ 6265.778857] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
> [ 6265.787450] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
> [ 6265.796034] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
> [ 6265.804620] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
> [ 6265.813201] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
> [ 6265.822473] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
> [ 6265.831816] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
> [ 6265.840482] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 6265.849037] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
> [ 6275.037534] block nvme1n1: no usable path - requeuing I/O
> [ 6326.920009] nvme nvme1: I/O 22 QID 0 timeout, disable controller
> [ 6326.988701] nvme nvme1: Identify Controller failed (-4)
> [ 6326.995253] nvme nvme1: Disabling device after reset failure: -5
> [ 6327.032308] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 6327.039781] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 6327.047687] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
> [ 6327.083131] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 6327.090173] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 6327.097816] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
> [ 6327.107009] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
> [ 6327.115330] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [ 6327.127640] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
> [ 6327.137319] pcieport 0000:a9:10.0: [18] MalfTLP (First)
> [ 6327.145236] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff 00000001 5ad65000
> [ 6327.154728] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 6327.162624] nvme nvme1: frozen state error detected, reset controller
> [ 6327.183979] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
> [ 6327.387969] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 6327.395596] nvme nvme1: restart after slot reset
> [ 6327.401313] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
> [ 6327.411517] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
> [ 6327.422045] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
> [ 6327.432523] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
> [ 6327.442212] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
> [ 6327.452933] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 6327.460184] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 6327.467533] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 6327.475367] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>
>>
>> Catching the PDC event may be timing related. pciehp ignores the link events
>> during a DPC event, but it always reacts to PDC since it's indistinguishable
>> from a DPC occurring in response to a surprise removal, and these slots probably
>> don't have out-of-band presence detection.
>
> Yeah, the In-Band PD Disable bit in the Slot Control register of the PCIe Downstream Switch port is set to '0'; no idea about out-of-band presence detection!
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-15 20:43 ` Sathyanarayanan Kuppuswamy
@ 2023-03-15 22:16 ` Tushar Dave
2023-03-15 22:23 ` Sathyanarayanan Kuppuswamy
0 siblings, 1 reply; 15+ messages in thread
From: Tushar Dave @ 2023-03-15 22:16 UTC (permalink / raw)
To: Sathyanarayanan Kuppuswamy, Keith Busch, Bjorn Helgaas
Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On 3/15/23 13:43, Sathyanarayanan Kuppuswamy wrote:
>
>
> On 3/15/23 1:01 PM, Tushar Dave wrote:
>>
>>
>> On 3/14/23 10:26, Keith Busch wrote:
>>> On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
>>>> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
>>>>> On 3/11/23 00:22, Lukas Wunner wrote:
>>>>>> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>>>>>>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>>>>>>> In the log below, pciehp obviously is enabled; should I infer that in
>>>>>>>> the log above, it is not?
>>>>>>>
>>>>>>> pciehp is enabled all the time, in the log above and below.
>>>>>>> I do not have an answer yet as to why pciehp shows up only in some tests
>>>>>>> (due to DPC link down/up) and not in others, as you noticed in both logs.
>>>>>>
>>>>>> Maybe some of the switch Downstream Ports are hotplug-capable and
>>>>>> some are not? (Check the Slot Implemented bit in the PCI Express
>>>>>> Capabilities Register as well as the Hot-Plug Capable bit in the
>>>>>> Slot Capabilities Register.)
>>>>>> ...
>>>>
>>>>>>>> Generally we've avoided handling a device reset as a
>>>>>>>> remove/add event because upper layers can't deal well with
>>>>>>>> that. But in the log below it looks like pciehp *did* treat
>>>>>>>> the DPC containment as a remove/add, which of course involves
>>>>>>>> configuring the "new" device and its MPS settings.
>>>>>>>
>>>>>>> yes, and that puzzled me: why? Especially when "Link Down/Up
>>>>>>> ignored (recovered by DPC)" was logged. Do we still have a race
>>>>>>> somewhere? I am not sure.
>>>>>>
>>>>>> You're seeing the expected behavior. pciehp ignores DLLSC events
>>>>>> caused by DPC, but then double-checks that DPC recovery succeeded.
>>>>>> If it didn't, it would be a bug not to bring down the slot. So
>>>>>> pciehp does exactly that. See this code snippet in
>>>>>> pciehp_ignore_dpc_link_change():
>>>>>>
>>>>>> /*
>>>>>> * If the link is unexpectedly down after successful recovery,
>>>>>> * the corresponding link change may have been ignored above.
>>>>>> * Synthesize it to ensure that it is acted on.
>>>>>> */
>>>>>> down_read_nested(&ctrl->reset_lock, ctrl->depth);
>>>>>> if (!pciehp_check_link_active(ctrl))
>>>>>> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
>>>>>> up_read(&ctrl->reset_lock);
>>>>>>
>>>>>> So on hotplug-capable ports, pciehp is able to mop up the mess
>>>>>> created by fiddling with the MPS settings behind the kernel's
>>>>>> back.
>>>>>
>>>>> That's the thing: even on a hotplug-capable slot I do not see pciehp
>>>>> _all_ the time. Sometimes pciehp gets involved and takes care of things
>>>>> (like I mentioned in the previous thread) and other times there is no
>>>>> pciehp engagement at all!
>>>>
>>>> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
>>>> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
>>>> difference. Lukas didn't mention that, so maybe it's a red herring,
>>>> but I'm still curious since it explicitly mentions the DPC reset case
>>>> that you're exercising here.
>>
>> Commit 53b54ad074de ("PCI/DPC: Await readiness of secondary bus after reset") didn't help.
>
> I did not check the full thread. Since this seems to be in the EDR recovery path, make sure to
> include the following patch.
>
> https://lore.kernel.org/lkml/20230215200532.3126937-1-sathyanarayanan.kuppuswamy@linux.intel.com/T/
Sorry, but in your patch I guess you really want 'pdev' instead of 'dev' in
pcie_clear_device_status(dev);
or am I missing something?
>
>>
>> [ 6265.268757] pcieport 0000:a5:01.0: EDR: EDR event received
>> [ 6265.276034] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>> [ 6265.283780] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
>> [ 6265.292972] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>> [ 6265.301284] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 6265.313569] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
>> [ 6265.323208] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>> [ 6265.331084] pcieport 0000:a9:10.0: AER: TLP Header: 6000007a ab0000ff 00000001 629d4318
>> [ 6265.340536] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>> [ 6265.348320] nvme nvme1: frozen state error detected, reset controller
>> [ 6265.419633] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
>> [ 6265.627639] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>> [ 6265.635289] nvme nvme1: restart after slot reset
>> [ 6265.641016] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>> [ 6265.651248] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
>> [ 6265.661739] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
>> [ 6265.672210] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
>> [ 6265.681897] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
>> [ 6265.692616] pcieport 0000:a9:10.0: AER: broadcast resume message
>> [ 6265.716299] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
>> [ 6265.725614] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
>> [ 6265.734657] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
>> [ 6265.743824] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
>> [ 6265.752348] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
>> [ 6265.761647] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
>> [ 6265.770247] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
>> [ 6265.778857] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
>> [ 6265.787450] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
>> [ 6265.796034] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
>> [ 6265.804620] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
>> [ 6265.813201] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
>> [ 6265.822473] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
>> [ 6265.831816] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
>> [ 6265.840482] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
>> [ 6265.849037] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
>> [ 6275.037534] block nvme1n1: no usable path - requeuing I/O
>> [ 6326.920009] nvme nvme1: I/O 22 QID 0 timeout, disable controller
>> [ 6326.988701] nvme nvme1: Identify Controller failed (-4)
>> [ 6326.995253] nvme nvme1: Disabling device after reset failure: -5
>> [ 6327.032308] pcieport 0000:a9:10.0: AER: device recovery successful
>> [ 6327.039781] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>> [ 6327.047687] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>> [ 6327.083131] pcieport 0000:a5:01.0: EDR: EDR event received
>> [ 6327.090173] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>> [ 6327.097816] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
>> [ 6327.107009] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>> [ 6327.115330] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 6327.127640] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
>> [ 6327.137319] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>> [ 6327.145236] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff 00000001 5ad65000
>> [ 6327.154728] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>> [ 6327.162624] nvme nvme1: frozen state error detected, reset controller
>> [ 6327.183979] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
>> [ 6327.387969] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>> [ 6327.395596] nvme nvme1: restart after slot reset
>> [ 6327.401313] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>> [ 6327.411517] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
>> [ 6327.422045] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
>> [ 6327.432523] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
>> [ 6327.442212] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
>> [ 6327.452933] pcieport 0000:a9:10.0: AER: broadcast resume message
>> [ 6327.460184] pcieport 0000:a9:10.0: AER: device recovery successful
>> [ 6327.467533] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>> [ 6327.475367] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>>
>>>
>>> Catching the PDC event may be timing related. pciehp ignores the link events
>>> during a DPC event, but it always reacts to PDC since it's indistinguishable
>>> from a DPC occurring in response to a surprise removal, and these slots probably
>>> don't have out-of-band presence detection.
>>
>> Yeah, the In-Band PD Disable bit in the Slot Control register of the PCIe Downstream Switch port is set to '0'; no idea about out-of-band presence detection!
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-15 22:16 ` Tushar Dave
@ 2023-03-15 22:23 ` Sathyanarayanan Kuppuswamy
2023-03-15 22:25 ` Sathyanarayanan Kuppuswamy
0 siblings, 1 reply; 15+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-03-15 22:23 UTC (permalink / raw)
To: Tushar Dave, Keith Busch, Bjorn Helgaas
Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On 3/15/23 3:16 PM, Tushar Dave wrote:
>
>
> On 3/15/23 13:43, Sathyanarayanan Kuppuswamy wrote:
>>
>>
>> On 3/15/23 1:01 PM, Tushar Dave wrote:
>>>
>>>
>>> On 3/14/23 10:26, Keith Busch wrote:
>>>> On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
>>>>> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
>>>>>> On 3/11/23 00:22, Lukas Wunner wrote:
>>>>>>> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>>>>>>>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>>>>>>>> In the log below, pciehp obviously is enabled; should I infer that in
>>>>>>>>> the log above, it is not?
>>>>>>>>
>>>>>>>> pciehp is enabled all the time, in the log above and below.
>>>>>>>> I do not have an answer yet as to why pciehp shows up only in some tests
>>>>>>>> (due to DPC link down/up) and not in others, as you noticed in both logs.
>>>>>>>
>>>>>>> Maybe some of the switch Downstream Ports are hotplug-capable and
>>>>>>> some are not? (Check the Slot Implemented bit in the PCI Express
>>>>>>> Capabilities Register as well as the Hot-Plug Capable bit in the
>>>>>>> Slot Capabilities Register.)
>>>>>>> ...
>>>>>
>>>>>>>>> Generally we've avoided handling a device reset as a
>>>>>>>>> remove/add event because upper layers can't deal well with
>>>>>>>>> that. But in the log below it looks like pciehp *did* treat
>>>>>>>>> the DPC containment as a remove/add, which of course involves
>>>>>>>>> configuring the "new" device and its MPS settings.
>>>>>>>>
>>>>>>>> yes, and that puzzled me: why? Especially when "Link Down/Up
>>>>>>>> ignored (recovered by DPC)" was logged. Do we still have a race
>>>>>>>> somewhere? I am not sure.
>>>>>>>
>>>>>>> You're seeing the expected behavior. pciehp ignores DLLSC events
>>>>>>> caused by DPC, but then double-checks that DPC recovery succeeded.
>>>>>>> If it didn't, it would be a bug not to bring down the slot. So
>>>>>>> pciehp does exactly that. See this code snippet in
>>>>>>> pciehp_ignore_dpc_link_change():
>>>>>>>
>>>>>>> /*
>>>>>>> * If the link is unexpectedly down after successful recovery,
>>>>>>> * the corresponding link change may have been ignored above.
>>>>>>> * Synthesize it to ensure that it is acted on.
>>>>>>> */
>>>>>>> down_read_nested(&ctrl->reset_lock, ctrl->depth);
>>>>>>> if (!pciehp_check_link_active(ctrl))
>>>>>>> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
>>>>>>> up_read(&ctrl->reset_lock);
>>>>>>>
>>>>>>> So on hotplug-capable ports, pciehp is able to mop up the mess
>>>>>>> created by fiddling with the MPS settings behind the kernel's
>>>>>>> back.
>>>>>>
>>>>>> That's the thing: even on a hotplug-capable slot I do not see pciehp
>>>>>> _all_ the time. Sometimes pciehp gets involved and takes care of things
>>>>>> (like I mentioned in the previous thread) and other times there is no
>>>>>> pciehp engagement at all!
>>>>>
>>>>> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
>>>>> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
>>>>> difference. Lukas didn't mention that, so maybe it's a red herring,
>>>>> but I'm still curious since it explicitly mentions the DPC reset case
>>>>> that you're exercising here.
>>>
>>> Commit 53b54ad074de ("PCI/DPC: Await readiness of secondary bus after reset") didn't help.
>>
>> I did not check the full thread. Since this seems to be in the EDR recovery path, make sure to
>> include the following patch.
>>
>> https://lore.kernel.org/lkml/20230215200532.3126937-1-sathyanarayanan.kuppuswamy@linux.intel.com/T/
>
> Sorry, but in your patch I guess you really want 'pdev' instead of 'dev' in pcie_clear_device_status(dev);
>
> or am I missing something?
You are right. It is pdev. I will fix this in the next version.
>
>>
>>>
>>> [ 6265.268757] pcieport 0000:a5:01.0: EDR: EDR event received
>>> [ 6265.276034] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>> [ 6265.283780] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
>>> [ 6265.292972] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>>> [ 6265.301284] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>> [ 6265.313569] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
>>> [ 6265.323208] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>>> [ 6265.331084] pcieport 0000:a9:10.0: AER: TLP Header: 6000007a ab0000ff 00000001 629d4318
>>> [ 6265.340536] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>> [ 6265.348320] nvme nvme1: frozen state error detected, reset controller
>>> [ 6265.419633] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
>>> [ 6265.627639] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>> [ 6265.635289] nvme nvme1: restart after slot reset
>>> [ 6265.641016] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>>> [ 6265.651248] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
>>> [ 6265.661739] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
>>> [ 6265.672210] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
>>> [ 6265.681897] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
>>> [ 6265.692616] pcieport 0000:a9:10.0: AER: broadcast resume message
>>> [ 6265.716299] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
>>> [ 6265.725614] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
>>> [ 6265.734657] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
>>> [ 6265.743824] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
>>> [ 6265.752348] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
>>> [ 6265.761647] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
>>> [ 6265.770247] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
>>> [ 6265.778857] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
>>> [ 6265.787450] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
>>> [ 6265.796034] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
>>> [ 6265.804620] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
>>> [ 6265.813201] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
>>> [ 6265.822473] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
>>> [ 6265.831816] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
>>> [ 6265.840482] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
>>> [ 6265.849037] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
>>> [ 6275.037534] block nvme1n1: no usable path - requeuing I/O
>>> [ 6326.920009] nvme nvme1: I/O 22 QID 0 timeout, disable controller
>>> [ 6326.988701] nvme nvme1: Identify Controller failed (-4)
>>> [ 6326.995253] nvme nvme1: Disabling device after reset failure: -5
>>> [ 6327.032308] pcieport 0000:a9:10.0: AER: device recovery successful
>>> [ 6327.039781] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>>> [ 6327.047687] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>>> [ 6327.083131] pcieport 0000:a5:01.0: EDR: EDR event received
>>> [ 6327.090173] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>> [ 6327.097816] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
>>> [ 6327.107009] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>>> [ 6327.115330] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>> [ 6327.127640] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
>>> [ 6327.137319] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>>> [ 6327.145236] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff 00000001 5ad65000
>>> [ 6327.154728] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>> [ 6327.162624] nvme nvme1: frozen state error detected, reset controller
>>> [ 6327.183979] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
>>> [ 6327.387969] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>> [ 6327.395596] nvme nvme1: restart after slot reset
>>> [ 6327.401313] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>>> [ 6327.411517] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
>>> [ 6327.422045] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
>>> [ 6327.432523] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
>>> [ 6327.442212] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
>>> [ 6327.452933] pcieport 0000:a9:10.0: AER: broadcast resume message
>>> [ 6327.460184] pcieport 0000:a9:10.0: AER: device recovery successful
>>> [ 6327.467533] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>>> [ 6327.475367] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>>>
>>>>
>>>> Catching the PDC event may be timing related. pciehp ignores the link events
>>>> during a DPC event, but it always reacts to PDC since it's indistinguishable
>>>> from a DPC occurring in response to a surprise removal, and these slots probably
>>>> don't have out-of-band presence detection.
>>>
>>> Yeah, the In-Band PD Disable bit in the Slot Control register of the PCIe Downstream Switch port is set to '0'; no idea about out-of-band presence detection!
>>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-15 22:23 ` Sathyanarayanan Kuppuswamy
@ 2023-03-15 22:25 ` Sathyanarayanan Kuppuswamy
0 siblings, 0 replies; 15+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-03-15 22:25 UTC (permalink / raw)
To: Tushar Dave, Keith Busch, Bjorn Helgaas
Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On 3/15/23 3:23 PM, Sathyanarayanan Kuppuswamy wrote:
>
>
> On 3/15/23 3:16 PM, Tushar Dave wrote:
>>
>>
>> On 3/15/23 13:43, Sathyanarayanan Kuppuswamy wrote:
>>>
>>>
>>> On 3/15/23 1:01 PM, Tushar Dave wrote:
>>>>
>>>>
>>>> On 3/14/23 10:26, Keith Busch wrote:
>>>>> On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
>>>>>> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
>>>>>>> On 3/11/23 00:22, Lukas Wunner wrote:
>>>>>>>> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>>>>>>>>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>>>>>>>>> In the log below, pciehp obviously is enabled; should I infer that in
>>>>>>>>>> the log above, it is not?
>>>>>>>>>
>>>>>>>>> pciehp is enabled all the time, in the log above and below.
>>>>>>>>> I do not have an answer yet as to why pciehp shows up only in some
>>>>>>>>> tests (due to DPC link down/up) and not in others, as you noticed in both logs.
>>>>>>>>
>>>>>>>> Maybe some of the switch Downstream Ports are hotplug-capable and
>>>>>>>> some are not? (Check the Slot Implemented bit in the PCI Express
>>>>>>>> Capabilities Register as well as the Hot-Plug Capable bit in the
>>>>>>>> Slot Capabilities Register.)
>>>>>>>> ...
>>>>>>
>>>>>>>>>> Generally we've avoided handling a device reset as a
>>>>>>>>>> remove/add event because upper layers can't deal well with
>>>>>>>>>> that. But in the log below it looks like pciehp *did* treat
>>>>>>>>>> the DPC containment as a remove/add, which of course involves
>>>>>>>>>> configuring the "new" device and its MPS settings.
>>>>>>>>>
>>>>>>>>> yes, and that puzzled me: why? Especially when "Link Down/Up
>>>>>>>>> ignored (recovered by DPC)" was logged. Do we still have a race
>>>>>>>>> somewhere? I am not sure.
>>>>>>>>
>>>>>>>> You're seeing the expected behavior. pciehp ignores DLLSC events
>>>>>>>> caused by DPC, but then double-checks that DPC recovery succeeded.
>>>>>>>> If it didn't, it would be a bug not to bring down the slot. So
>>>>>>>> pciehp does exactly that. See this code snippet in
>>>>>>>> pciehp_ignore_dpc_link_change():
>>>>>>>>
>>>>>>>> /*
>>>>>>>> * If the link is unexpectedly down after successful recovery,
>>>>>>>> * the corresponding link change may have been ignored above.
>>>>>>>> * Synthesize it to ensure that it is acted on.
>>>>>>>> */
>>>>>>>> down_read_nested(&ctrl->reset_lock, ctrl->depth);
>>>>>>>> if (!pciehp_check_link_active(ctrl))
>>>>>>>> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
>>>>>>>> up_read(&ctrl->reset_lock);
>>>>>>>>
>>>>>>>> So on hotplug-capable ports, pciehp is able to mop up the mess
>>>>>>>> created by fiddling with the MPS settings behind the kernel's
>>>>>>>> back.
>>>>>>>
>>>>>>> That's the thing: even on a hotplug-capable slot I do not see pciehp
>>>>>>> _all_ the time. Sometimes pciehp gets involved and takes care of things
>>>>>>> (like I mentioned in the previous thread) and other times there is no
>>>>>>> pciehp engagement at all!
>>>>>>
>>>>>> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
>>>>>> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
>>>>>> difference. Lukas didn't mention that, so maybe it's a red herring,
>>>>>> but I'm still curious since it explicitly mentions the DPC reset case
>>>>>> that you're exercising here.
>>>>
>>>> Commit 53b54ad074de ("PCI/DPC: Await readiness of secondary bus after reset") didn't help.
>>>
>>> I did not check the full thread. Since this seems to be in the EDR recovery
>>> path, make sure to include the following patch.
>>>
>>> https://lore.kernel.org/lkml/20230215200532.3126937-1-sathyanarayanan.kuppuswamy@linux.intel.com/T/
>>
>> Sorry, but in your patch I guess you really want 'pdev' instead of 'dev' in
>> pcie_clear_device_status(dev);
>>
>> or am I missing something?
>
> You are right. It is pdev. I will fix this in next version.
Actually, it is edev.
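To make the fix concrete: the change under discussion clears the Device Status
register of the device that reported the EDR event once recovery succeeds. A
minimal sketch of the tail of edr_handle_event() in drivers/pci/pcie/edr.c,
where 'edev' is the EDR-reported device (the surrounding lines are an
assumption, not the actual patch):

	if (estate == PCI_ERS_RESULT_RECOVERED) {
		pci_dbg(edev, "DPC port successfully recovered\n");
		/* clear Device Status on the EDR-reported device,
		 * i.e. 'edev' - not 'dev', and not the port 'pdev' */
		pcie_clear_device_status(edev);
	}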
>
>>
>>>
>>>>
>>>> [ 6265.268757] pcieport 0000:a5:01.0: EDR: EDR event received
>>>> [ 6265.276034] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>>> [ 6265.283780] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
>>>> [ 6265.292972] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>>>> [ 6265.301284] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>>> [ 6265.313569] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
>>>> [ 6265.323208] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>>>> [ 6265.331084] pcieport 0000:a9:10.0: AER: TLP Header: 6000007a ab0000ff 00000001 629d4318
>>>> [ 6265.340536] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>>> [ 6265.348320] nvme nvme1: frozen state error detected, reset controller
>>>> [ 6265.419633] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
>>>> [ 6265.627639] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>>> [ 6265.635289] nvme nvme1: restart after slot reset
>>>> [ 6265.641016] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>>>> [ 6265.651248] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
>>>> [ 6265.661739] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
>>>> [ 6265.672210] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
>>>> [ 6265.681897] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
>>>> [ 6265.692616] pcieport 0000:a9:10.0: AER: broadcast resume message
>>>> [ 6265.716299] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
>>>> [ 6265.725614] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
>>>> [ 6265.734657] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
>>>> [ 6265.743824] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
>>>> [ 6265.752348] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
>>>> [ 6265.761647] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
>>>> [ 6265.770247] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
>>>> [ 6265.778857] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
>>>> [ 6265.787450] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
>>>> [ 6265.796034] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
>>>> [ 6265.804620] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
>>>> [ 6265.813201] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
>>>> [ 6265.822473] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
>>>> [ 6265.831816] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
>>>> [ 6265.840482] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
>>>> [ 6265.849037] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
>>>> [ 6275.037534] block nvme1n1: no usable path - requeuing I/O
>>>> [ 6326.920009] nvme nvme1: I/O 22 QID 0 timeout, disable controller
>>>> [ 6326.988701] nvme nvme1: Identify Controller failed (-4)
>>>> [ 6326.995253] nvme nvme1: Disabling device after reset failure: -5
>>>> [ 6327.032308] pcieport 0000:a9:10.0: AER: device recovery successful
>>>> [ 6327.039781] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>>>> [ 6327.047687] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>>>> [ 6327.083131] pcieport 0000:a5:01.0: EDR: EDR event received
>>>> [ 6327.090173] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
>>>> [ 6327.097816] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
>>>> [ 6327.107009] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
>>>> [ 6327.115330] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>>>> [ 6327.127640] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
>>>> [ 6327.137319] pcieport 0000:a9:10.0: [18] MalfTLP (First)
>>>> [ 6327.145236] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff 00000001 5ad65000
>>>> [ 6327.154728] pcieport 0000:a9:10.0: AER: broadcast error_detected message
>>>> [ 6327.162624] nvme nvme1: frozen state error detected, reset controller
>>>> [ 6327.183979] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
>>>> [ 6327.387969] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
>>>> [ 6327.395596] nvme nvme1: restart after slot reset
>>>> [ 6327.401313] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
>>>> [ 6327.411517] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
>>>> [ 6327.422045] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
>>>> [ 6327.432523] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
>>>> [ 6327.442212] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
>>>> [ 6327.452933] pcieport 0000:a9:10.0: AER: broadcast resume message
>>>> [ 6327.460184] pcieport 0000:a9:10.0: AER: device recovery successful
>>>> [ 6327.467533] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
>>>> [ 6327.475367] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>>>>
>>>>>
>>>>> Catching the PDC event may be timing related. pciehp ignores the link events
>>>>> during a DPC event, but it always reacts to PDC since it's indistinguishable
>>>>> from a DPC occurring in response to a surprise removal, and these slots probably
>>>>> don't have out-of-band presence detection.
>>>>
>>>> yeah, the In-Band PD Disable bit in the Slot Control register of the PCIe Downstream Switch port is set to '0'; no idea about out-of-band presence detection!
>>>
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery
2023-03-15 20:01 ` Tushar Dave
2023-03-15 20:43 ` Sathyanarayanan Kuppuswamy
@ 2023-03-18 0:15 ` Tushar Dave
1 sibling, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2023-03-18 0:15 UTC (permalink / raw)
To: Keith Busch, Bjorn Helgaas
Cc: Lukas Wunner, Sagi Grimberg, linux-nvme, linux-pci
On 3/15/23 13:01, Tushar Dave wrote:
>
>
> On 3/14/23 10:26, Keith Busch wrote:
>> On Tue, Mar 14, 2023 at 11:11:27AM -0500, Bjorn Helgaas wrote:
>>> On Mon, Mar 13, 2023 at 05:57:43PM -0700, Tushar Dave wrote:
>>>> On 3/11/23 00:22, Lukas Wunner wrote:
>>>>> On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
>>>>>> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
>>>>>>> In the log below, pciehp obviously is enabled; should I infer that in
>>>>>>> the log above, it is not?
>>>>>>
>>>>>> pciehp is enabled all the time, in the log above and below.
>>>>>> I do not have an answer yet as to why pciehp shows up only in some tests
>>>>>> (due to DPC link down/up) and not in others, as you noticed in both logs.
>>>>>
>>>>> Maybe some of the switch Downstream Ports are hotplug-capable and
>>>>> some are not? (Check the Slot Implemented bit in the PCI Express
>>>>> Capabilities Register as well as the Hot-Plug Capable bit in the
>>>>> Slot Capabilities Register.)
>>>>> ...
>>>
>>>>>>> Generally we've avoided handling a device reset as a
>>>>>>> remove/add event because upper layers can't deal well with
>>>>>>> that. But in the log below it looks like pciehp *did* treat
>>>>>>> the DPC containment as a remove/add, which of course involves
>>>>>>> configuring the "new" device and its MPS settings.
>>>>>>
>>>>>> yes, and that puzzled me - why? Especially when "Link Down/Up
>>>>>> ignored (recovered by DPC)" is logged. Do we still have a race
>>>>>> somewhere? I am not sure.
>>>>>
>>>>> You're seeing the expected behavior. pciehp ignores DLLSC events
>>>>> caused by DPC, but then double-checks that DPC recovery succeeded.
>>>>> If it didn't, it would be a bug not to bring down the slot. So
>>>>> pciehp does exactly that. See this code snippet in
>>>>> pciehp_ignore_dpc_link_change():
>>>>>
>>>>> /*
>>>>> * If the link is unexpectedly down after successful recovery,
>>>>> * the corresponding link change may have been ignored above.
>>>>> * Synthesize it to ensure that it is acted on.
>>>>> */
>>>>> down_read_nested(&ctrl->reset_lock, ctrl->depth);
>>>>> if (!pciehp_check_link_active(ctrl))
>>>>> pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
>>>>> up_read(&ctrl->reset_lock);
>>>>>
>>>>> So on hotplug-capable ports, pciehp is able to mop up the mess
>>>>> created by fiddling with the MPS settings behind the kernel's
>>>>> back.
>>>>
>>>> That's the thing: even on a hotplug-capable slot I do not see pciehp
>>>> _all_ the time. Sometimes pciehp gets involved and takes care of things
>>>> (like I mentioned in the previous thread), and other times there is no
>>>> pciehp engagement at all!
>>>
>>> Possibly a timing issue, so I'll be interested to see if 53b54ad074de
>>> ("PCI/DPC: Await readiness of secondary bus after reset") makes any
>>> difference. Lukas didn't mention that, so maybe it's a red herring,
>>> but I'm still curious since it explicitly mentions the DPC reset case
>>> that you're exercising here.
>
> Commit 53b54ad074de ("PCI/DPC: Await readiness of secondary bus after reset")
> didn't help.
>
> [ 6265.268757] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 6265.276034] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 6265.283780] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
> [ 6265.292972] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
> [ 6265.301284] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [ 6265.313569] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
> [ 6265.323208] pcieport 0000:a9:10.0: [18] MalfTLP (First)
> [ 6265.331084] pcieport 0000:a9:10.0: AER: TLP Header: 6000007a ab0000ff 00000001 629d4318
> [ 6265.340536] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 6265.348320] nvme nvme1: frozen state error detected, reset controller
> [ 6265.419633] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
> [ 6265.627639] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 6265.635289] nvme nvme1: restart after slot reset
> [ 6265.641016] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
> [ 6265.651248] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
> [ 6265.661739] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
> [ 6265.672210] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
> [ 6265.681897] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
> [ 6265.692616] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 6265.716299] nvme 0000:ab:00.0: saving config space at offset 0x0 (reading 0xa824144d)
> [ 6265.725614] nvme 0000:ab:00.0: saving config space at offset 0x4 (reading 0x100546)
> [ 6265.734657] nvme 0000:ab:00.0: saving config space at offset 0x8 (reading 0x1080200)
> [ 6265.743824] nvme 0000:ab:00.0: saving config space at offset 0xc (reading 0x8)
> [ 6265.752348] nvme 0000:ab:00.0: saving config space at offset 0x10 (reading 0xe0710004)
> [ 6265.761647] nvme 0000:ab:00.0: saving config space at offset 0x14 (reading 0x0)
> [ 6265.770247] nvme 0000:ab:00.0: saving config space at offset 0x18 (reading 0x0)
> [ 6265.778857] nvme 0000:ab:00.0: saving config space at offset 0x1c (reading 0x0)
> [ 6265.787450] nvme 0000:ab:00.0: saving config space at offset 0x20 (reading 0x0)
> [ 6265.796034] nvme 0000:ab:00.0: saving config space at offset 0x24 (reading 0x0)
> [ 6265.804620] nvme 0000:ab:00.0: saving config space at offset 0x28 (reading 0x0)
> [ 6265.813201] nvme 0000:ab:00.0: saving config space at offset 0x2c (reading 0xa80a144d)
> [ 6265.822473] nvme 0000:ab:00.0: saving config space at offset 0x30 (reading 0xe0600000)
> [ 6265.831816] nvme 0000:ab:00.0: saving config space at offset 0x34 (reading 0x40)
> [ 6265.840482] nvme 0000:ab:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 6265.849037] nvme 0000:ab:00.0: saving config space at offset 0x3c (reading 0x1ff)
> [ 6275.037534] block nvme1n1: no usable path - requeuing I/O
> [ 6326.920009] nvme nvme1: I/O 22 QID 0 timeout, disable controller
> [ 6326.988701] nvme nvme1: Identify Controller failed (-4)
> [ 6326.995253] nvme nvme1: Disabling device after reset failure: -5
> [ 6327.032308] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 6327.039781] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 6327.047687] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
> [ 6327.083131] pcieport 0000:a5:01.0: EDR: EDR event received
> [ 6327.090173] pcieport 0000:a5:01.0: EDR: Reported EDR dev: 0000:a9:10.0
> [ 6327.097816] pcieport 0000:a9:10.0: DPC: containment event, status:0x2009 source:0x0000
> [ 6327.107009] pcieport 0000:a9:10.0: DPC: unmasked uncorrectable error detected
> [ 6327.115330] pcieport 0000:a9:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [ 6327.127640] pcieport 0000:a9:10.0: device [1000:c030] error status/mask=00040000/00180000
> [ 6327.137319] pcieport 0000:a9:10.0: [18] MalfTLP (First)
> [ 6327.145236] pcieport 0000:a9:10.0: AER: TLP Header: 60000080 ab0000ff 00000001 5ad65000
> [ 6327.154728] pcieport 0000:a9:10.0: AER: broadcast error_detected message
> [ 6327.162624] nvme nvme1: frozen state error detected, reset controller
> [ 6327.183979] pcieport 0000:a9:10.0: waiting 100 ms for downstream link, after activation
> [ 6327.387969] pcieport 0000:a9:10.0: AER: broadcast slot_reset message
> [ 6327.395596] nvme nvme1: restart after slot reset
> [ 6327.401313] nvme 0000:ab:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x1ff)
> [ 6327.411517] nvme 0000:ab:00.0: restoring config space at offset 0x30 (was 0x0, writing 0xe0600000)
> [ 6327.422045] nvme 0000:ab:00.0: restoring config space at offset 0x10 (was 0x4, writing 0xe0710004)
> [ 6327.432523] nvme 0000:ab:00.0: restoring config space at offset 0xc (was 0x0, writing 0x8)
> [ 6327.442212] nvme 0000:ab:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100546)
> [ 6327.452933] pcieport 0000:a9:10.0: AER: broadcast resume message
> [ 6327.460184] pcieport 0000:a9:10.0: AER: device recovery successful
> [ 6327.467533] pcieport 0000:a9:10.0: EDR: DPC port successfully recovered
> [ 6327.475367] pcieport 0000:a5:01.0: EDR: Status for 0000:a9:10.0: 0x80
>
>>
>> Catching the PDC event may be timing related. pciehp ignores the link events
>> during a DPC event, but it always reacts to PDC since it's indistinguishable
>> from a DPC occurring in response to a surprise removal, and these slots probably
>> don't have out-of-band presence detection.
>
> yeah, the In-Band PD Disable bit in the Slot Control register of the PCIe
> Downstream Switch port is set to '0'; no idea about out-of-band presence
> detection!
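One hedged way to read that bit straight from config space, assuming the port
implements the PCIe 5.0 In-Band PD Disable mechanism (bit position per the
spec):
$ setpci -s a9:10.0 cap_exp+0x18.w
# Slot Control; bit 14 (mask 0x4000) is In-Band PD Disable.
# 0 means in-band presence detect still drives PresDet, so a
# DPC-induced link bounce is indistinguishable from a surprise
# removal, as Keith notes above.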
I debugged further and noticed that pciehp_isr() is _not_ getting triggered at
all (in spite of DPC link down/up activity).
Wouldn't that suggest it has less to do with timing and more to do with the
hotplug interrupt itself?
In my case, pciehp_isr() doesn't fire even though the Downstream Switch port has
all the required settings and conditions met that should ideally fire a hotplug
interrupt for PDC (and DLLSC) events!
e.g.
Before the test:
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
Slot #272, PowerLimit 25.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg+
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet- LinkState-
And after the MPS test, with no engagement of pciehp:
(please note Changed: MRL- PresDet+ LinkState+)
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
Slot #272, PowerLimit 25.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg+
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState+
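A hedged way to cross-check the same state from raw config space (offsets per
the PCI Express Capability layout; a9:10.0 is the Downstream Port in question):
$ setpci -s a9:10.0 cap_exp+0x18.w
# Slot Control: HPIE is bit 5 (0x0020), PDCE bit 3 (0x0008),
# DLLSCE bit 12 (0x1000) - all set here per SltCtl above, so the
# port *should* signal an interrupt for PDC/DLLSC events
$ setpci -s a9:10.0 cap_exp+0x1a.w
# Slot Status: PDC is bit 3, DLLSC is bit 8; events latch here
# until cleared, which matches the stuck PresDet+/LinkState+
# "Changed" bits above while pciehp_isr() never runs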
-Tushar
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2023-03-18 0:15 UTC | newest]
Thread overview: 15+ messages
[not found] <bcbd48b5-1d6e-8fe3-d6a0-cb341e5c34e3@nvidia.com>
2023-03-09 17:53 ` nvme-pci: Disabling device after reset failure: -5 occurs while AER recovery Bjorn Helgaas
2023-03-10 22:39 ` Tushar Dave
2023-03-10 23:53 ` Bjorn Helgaas
2023-03-11 1:45 ` Tushar Dave
2023-03-11 8:22 ` Lukas Wunner
2023-03-11 16:46 ` Keith Busch
2023-03-14 0:57 ` Tushar Dave
2023-03-14 16:11 ` Bjorn Helgaas
2023-03-14 17:26 ` Keith Busch
2023-03-15 20:01 ` Tushar Dave
2023-03-15 20:43 ` Sathyanarayanan Kuppuswamy
2023-03-15 22:16 ` Tushar Dave
2023-03-15 22:23 ` Sathyanarayanan Kuppuswamy
2023-03-15 22:25 ` Sathyanarayanan Kuppuswamy
2023-03-18 0:15 ` Tushar Dave