Linux ATA/IDE development
 help / color / mirror / Atom feed
* Re: Re: [PATCH] ahci: libahci: clear pending interrupt status
@ 2023-08-14  4:17 Szuying Chen
  2023-08-22  8:41 ` Niklas Cassel
  0 siblings, 1 reply; 4+ messages in thread
From: Szuying Chen @ 2023-08-14  4:17 UTC (permalink / raw)
  To: dlemoal, linux-ide, linux-kernel; +Cc: Jesse1_Chang, Richard_Hsu, Szuying Chen

On 8/10/23 17:59, Damien Le Moal wrote: 
>On 8/10/23 18:31, Szuying Chen wrote:
>>> On 8/10/23 14:12, Damien Le Moal wrote: 
>>>    On 8/10/23 14:05, Szuying Chen wrote:
>>>    > When ISR handle interface fatal error with error recovery after clear PxIS
>>>    > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
>>>    > during error recovery, which causing the HBA not issue any new commands
>>>    > after cmd.ST set 1.
>>>
>>>    This is not very clear. If there was a fatal error, the drive should be in
>>>    error state and no other SDB FIS can be received as the drive does absolutely
>>>    nothing while in error state (it only waits for a read log 10h command to be>
>>>    issued to get it out of error state). So if you are seeing 2 SDB FIS with
>>>    errors one after the other, you have a buggy device...
>>>
>>>    However, I may be misunderstanding your issue here. Could you provide more
>>>    details and a dmesg output example of the issue ?
>> 
>> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is received
>> on an H2D Data FIS in normal operation, the HBA sets PxIS.IFS to 1
>> (fatal error) and halts operation. Referring to SATA 3.0 specification we
>> know the device will halt queued command processing and transmit SDB FIS to
>> host with ERR bit in Status field set to one(set PxIS.TFES to 1).
>
>Sure, but that SBD FIS should be completely ignored by the adapter since it
>stopped operation. If you see it, then it means that the handling of the first
>error was incomplete.
>
The SDB FIS  happens between pxIS.IE was cleared and PxCMD.ST not cleared
to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.  

>> In our case, the ISR handles fatal errors(PxIS.IFS) and enters error 
>> recovery after cleaning up PxIS and PxIE. Then a SDB FIS is received 
>> with interrupt bit(PxIS.TFES) set to 1. According to AHCI 1.3.1 
>> specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any new commands
>> under PxIS.TFES alive during error recovery.
>
>But how come you see a new command being issued ? This entire situation should
>result in a port reset with the first error. I do not see how this is possible.
>Are you saying that the port reset is not cleaning things up properly ? Could
>you share the dmesg output of this case ?
>
>
dmesg:
[  654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  654.460522] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  656.180375] ata7.00: exception Emask 0x10 SAct 0x7c0 SErr 0x200100 
action 0x6 frozen 
[  656.180384] ata7.00: irq_stat 0x08000000, interface fatal error 
[  656.180386] ata7: SError: { UnrecovData BadCRC } 
[  656.180389] ata7.00: failed command: READ FPDMA QUEUED 
[  656.180390] ata7.00: cmd 60/00:30:28:a9:9c/0a:00:01:00:00/40 tag 6 ncq 
dma 1310720 in 
	res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
[  656.180395] ata7.00: status: { DRDY } 
[  656.180397] ata7.00: failed command: READ FPDMA QUEUED 
[  656.180398] ata7.00: cmd 60/00:38:28:b3:9c/0a:00:01:00:00/40 tag 7 ncq 
dma 1310720 in 
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
[  656.180402] ata7.00: status: { DRDY } 
[  656.180403] ata7.00: failed command: READ FPDMA QUEUED 
[  656.180404] ata7.00: cmd 60/00:40:28:bd:9c/0a:00:01:00:00/40 tag 8 ncq 
dma 1310720 in
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
[  656.180408] ata7.00: status: { DRDY } 
[  656.180410] ata7.00: failed command: READ FPDMA QUEUED 
[  656.180411] ata7.00: cmd 60/00:48:28:c7:9c/0a:00:01:00:00/40 tag 9 ncq 
dma 1310720 in 
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
[  656.180414] ata7.00: status: { DRDY } 
[  656.180416] ata7.00: failed command: READ FPDMA QUEUED 
[  656.180417] ata7.00: cmd 60/00:50:28:d1:9c/0a:00:01:00:00/40 tag 10 ncq 
dma 1310720 in 
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
[  656.180421] ata7.00: status: { DRDY } 
[  656.180423] ata7: hard resetting link 
[  659.210328] amd_iommu_report_page_fault: 48 callbacks suppressed 
[  659.210333] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.210345] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.210352] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.811069] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.811080] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.811086] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.827669] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.827676] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.827680] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  659.878319] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.426805] amd_iommu_report_page_fault: 248 callbacks suppressed 
[  664.426809] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.426819] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.426825] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.676294] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.676305] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.676313] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.693501] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.693513] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.693521] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  664.711480] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000] 
[  666.180398] ata7: softreset failed (1st FIS failed) 
[  666.180405] ata7: hard resetting link 
[  670.054987] amd_iommu_report_page_fault: 99 callbacks suppressed
[  670.054992] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000]
[  670.055003] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000]
[  670.055010] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0019 address=0x0 flags=0x0000]


dmesg (add clear pending Interrupt):
[  120.236847] ahci_error_intr: interface fatal error
[  120.236854] ahci_error_intr: freeze (set PxIE to 0)
[  120.265268] ata7: limiting SATA link speed to 3.0 Gbps
[  120.265275] ata7.00: exception Emask 0x10 SAct 0x3f0 SErr 0x400000 
action 0x6 frozen
[  120.265279] ata7.00: irq_stat 0x08000000, interface fatal error
[  120.265281] ata7: SError: { Handshk }
[  120.265285] ata7.00: failed command: WRITE FPDMA QUEUED
[  120.265287] ata7.00: cmd 61/00:20:a8:b9:50/0a:00:00:00:00/40 tag 4 ncq 
dma 1310720 ou
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  120.265295] ata7.00: status: { DRDY }
[  120.265297] ata7.00: failed command: WRITE FPDMA QUEUED
[  120.265299] ata7.00: cmd 61/20:28:a8:c3:50/03:00:00:00:00/40 tag 5 ncq 
dma 409600 out
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  120.265305] ata7.00: status: { DRDY }
[  120.265307] ata7.00: failed command: WRITE FPDMA QUEUED
[  120.265309] ata7.00: cmd 61/00:30:c8:c6:50/0a:00:00:00:00/40 tag 6 ncq 
dma 1310720 ou
	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  120.265314] ata7.00: status: { DRDY }
[  120.265316] ata7.00: failed command: WRITE FPDMA QUEUED
[  120.265318] ata7.00: cmd 61/00:38:c8:d0:50/0a:00:00:00:00/40 tag 7 ncq 
dma 1310720 ou
	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  120.265324] ata7.00: status: { DRDY }
[  120.265325] ata7.00: failed command: WRITE FPDMA QUEUED
[  120.265327] ata7.00: cmd 61/00:40:c8:da:50/0a:00:00:00:00/40 tag 8 ncq 
dma 1310720 ou
	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  120.265333] ata7.00: status: { DRDY }
[  120.265334] ata7.00: failed command: WRITE FPDMA QUEUED
[  120.265336] ata7.00: cmd 61/00:48:c8:e4:50/0a:00:00:00:00/40 tag 9 ncq 
dma 1310720 ou
	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  120.265342] ata7.00: status: { DRDY }
[  120.265346] ata7: hard resetting link
[  120.265348] ata7: ahci_hardreset *****
[  120.265352] ahci 0000:0a:00.0: ahci_stop_engine: PORT_IRQ_STAT 0x40000008
PORT_CMD 0x0004c617
[  120.265356] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle 
(setting PxCMD.ST to 0) 
[  120.937522] ahci 0000:0a:00.0: ahci_start_engine: check 
PORT_SCR_ERR 0x00000000, and to clear
[  120.937534] ahci 0000:0a:00.0: ahci_start_engine: check 
PORT_IRQ_STAT 0x40000008, and to clear 
[  120.937544] ata7: ahci_softreset *****
[  120.937550] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle  
(setting PxCMD.ST to 0) 
[  120.937555] ahci 0000:0a:00.0: ahci_start_engine: check 
PORT_SCR_ERR 0x00000000, and to clear
[  120.937559] ahci 0000:0a:00.0: ahci_start_engine: check 
PORT_IRQ_STAT 0x00000000, and to clear 
[  121.097231] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  121.113493] ata7.00: configured for UDMA/133
[  121.113570] ata7: EH complete

I have printk PxIS & PxCMD values to check pending interrupt.
It SBD FIS with ERR to setting PxIS.TFES and PxIS.SDBS bit. 

>>
>>>    >
>>>    > Signed-off-by: Szuying Chen <Chloe_Chen@asmedia.com.tw>
>>>    > ---
>>>    >  drivers/ata/libahci.c | 12 ++++++++++++
>>>    >  1 file changed, 12 insertions(+)
>>>    >
>>>    > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
>>>    > index 06aec35f88f2..0ae51fd95d46 100644
>>>    > --- a/drivers/ata/libahci.c
>>>    > +++ b/drivers/ata/libahci.c
>>>    > @@ -679,9 +679,21 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
>>>    >
>>>    >  void ahci_start_engine(struct ata_port *ap)
>>>    >  {
>>>    > +     struct ahci_host_priv *hpriv = ap->host->private_data;
>>>    >       void __iomem *port_mmio = ahci_port_base(ap);
>>>    >       u32 tmp;
>>>    >
>>>    > +     /* clear SError */
>>>    > +     tmp = readl(port_mmio + PORT_SCR_ERR);
>>>    > +     writel(tmp, port_mmio + PORT_SCR_ERR);
>>>    > +
>>>    > +     /* clear port IRQ */
>>>    > +     tmp = readl(port_mmio + PORT_IRQ_STAT);
>>>    > +     if (tmp)
>>>    > +             writel(tmp, port_mmio + PORT_IRQ_STAT);
>>>    > +
>>>    > +     writel(1 << ap->port_no, hpriv->mmio + PORT_IRQ_STAT);
>>>    > +
>>>    >       /* start DMA */
>>>    >       tmp = readl(port_mmio + PORT_CMD);
>>>    >       tmp |= PORT_CMD_START;
>>>    > --
>>>    > 2.39.2
>>>   >
>>>
>>>   --
>>>   Damien Le Moal
>>>    Western Digital Research
>>>
>> Thanks.
>>
>
>-- 
>Damien Le Moal
>Western Digital Research

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: [PATCH] ahci: libahci: clear pending interrupt status
  2023-08-14  4:17 Re: [PATCH] ahci: libahci: clear pending interrupt status Szuying Chen
@ 2023-08-22  8:41 ` Niklas Cassel
  0 siblings, 0 replies; 4+ messages in thread
From: Niklas Cassel @ 2023-08-22  8:41 UTC (permalink / raw)
  To: Szuying Chen
  Cc: dlemoal@kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jesse1_Chang@asmedia.com.tw,
	Richard_Hsu@asmedia.com.tw, Szuying Chen

On Mon, Aug 14, 2023 at 12:17:20PM +0800, Szuying Chen wrote:
> On 8/10/23 17:59, Damien Le Moal wrote: 
> >On 8/10/23 18:31, Szuying Chen wrote:
> >>> On 8/10/23 14:12, Damien Le Moal wrote: 
> >>>    On 8/10/23 14:05, Szuying Chen wrote:
> >>>    > When ISR handle interface fatal error with error recovery after clear PxIS
> >>>    > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
> >>>    > during error recovery, which causing the HBA not issue any new commands
> >>>    > after cmd.ST set 1.
> >>>
> >>>    This is not very clear. If there was a fatal error, the drive should be in
> >>>    error state and no other SDB FIS can be received as the drive does absolutely
> >>>    nothing while in error state (it only waits for a read log 10h command to be>
> >>>    issued to get it out of error state). So if you are seeing 2 SDB FIS with
> >>>    errors one after the other, you have a buggy device...
> >>>
> >>>    However, I may be misunderstanding your issue here. Could you provide more
> >>>    details and a dmesg output example of the issue ?
> >> 
> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is received
> >> on an H2D Data FIS in normal operation, the HBA sets PxIS.IFS to 1
> >> (fatal error) and halts operation. Referring to SATA 3.0 specification we
> >> know the device will halt queued command processing and transmit SDB FIS to
> >> host with ERR bit in Status field set to one(set PxIS.TFES to 1).
> >
> >Sure, but that SBD FIS should be completely ignored by the adapter since it
> >stopped operation. If you see it, then it means that the handling of the first
> >error was incomplete.
> >
> The SDB FIS  happens between pxIS.IE was cleared and PxCMD.ST not cleared
> to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.  
> 
> >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters error 
> >> recovery after cleaning up PxIS and PxIE. Then a SDB FIS is received 
> >> with interrupt bit(PxIS.TFES) set to 1. According to AHCI 1.3.1 
> >> specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any new commands
> >> under PxIS.TFES alive during error recovery.
> >
> >But how come you see a new command being issued ? This entire situation should
> >result in a port reset with the first error. I do not see how this is possible.
> >Are you saying that the port reset is not cleaning things up properly ? Could
> >you share the dmesg output of this case ?
> >
> >
> dmesg:
> [  654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  654.460522] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  656.180375] ata7.00: exception Emask 0x10 SAct 0x7c0 SErr 0x200100

Hello Szuying Chen,

From your logs we can see that SErr 0x200100:
This means that PxSERR.DIAG.C was set.

This means that:
"Calculated Different CRC than Received: When this occurs, the HBA returns
R_ERR and returns to idle."

(Since PxIS.IFS is set, we know that it was a D2H Data FIS that had invalid
CRC. (If it was a D2H non-Data FIS PxIS.INFS would have been set instead.))


So your commit message is wrong. In your case it is the HBA that sends
a R_ERR to the device, so you should look at "6.1.10 CRC error in received
FIS" and not "6.1.9 Device responds to FIS with R_ERR".

"When a CRC error occurs on a D2H Data FIS, the HBA sets PxIS.IFS to ‘1’.
After a CRC error on a D2H Data FIS, the device may send a D2H Register FIS
to indicate the reason for the error. The host should allow posting of
non-Data FISes before PxCMD.ST is cleared to ‘0’."

Looking at the state machine "5.3.11.2 DR:Receive" you will jump to state
ERR:Fatal. (Meaning that we need to clear PxCMD.ST in order to issue new
commands.)


The HBA should this trigger an error irq when noticing the bad CRC.

It is possible that the HBA triggers another error IRQ when receiving
a D2H with TFES bit set, if it is received before PxCMD.ST is cleared
to 0, see NDR:Accept:
FIS Type is D2H Register and PxCMD.ST=0, jump to P:RegFISUpdate, which
does not trigger any IRQ.
If PxCMD.ST is set, the state machine will instead jump to RegFIS:Entry.


So, in order to understand what is going on, can you please add
unconditional prints in ahci_error_intr(), so we can actually see
how many times this is called. (Right now it seems to only be called
once). Please also print when PxCMD.ST is cleared to 0.
(So that we can verify that ahci_error_intr() is never called after
that). Also please remove all prints from other drivers, e.g. nouveau.


Kind regards,
Niklas



> action 0x6 frozen 
> [  656.180384] ata7.00: irq_stat 0x08000000, interface fatal error 
> [  656.180386] ata7: SError: { UnrecovData BadCRC } 
> [  656.180389] ata7.00: failed command: READ FPDMA QUEUED 
> [  656.180390] ata7.00: cmd 60/00:30:28:a9:9c/0a:00:01:00:00/40 tag 6 ncq 
> dma 1310720 in 
> 	res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
> [  656.180395] ata7.00: status: { DRDY } 
> [  656.180397] ata7.00: failed command: READ FPDMA QUEUED 
> [  656.180398] ata7.00: cmd 60/00:38:28:b3:9c/0a:00:01:00:00/40 tag 7 ncq 
> dma 1310720 in 
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
> [  656.180402] ata7.00: status: { DRDY } 
> [  656.180403] ata7.00: failed command: READ FPDMA QUEUED 
> [  656.180404] ata7.00: cmd 60/00:40:28:bd:9c/0a:00:01:00:00/40 tag 8 ncq 
> dma 1310720 in
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
> [  656.180408] ata7.00: status: { DRDY } 
> [  656.180410] ata7.00: failed command: READ FPDMA QUEUED 
> [  656.180411] ata7.00: cmd 60/00:48:28:c7:9c/0a:00:01:00:00/40 tag 9 ncq 
> dma 1310720 in 
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
> [  656.180414] ata7.00: status: { DRDY } 
> [  656.180416] ata7.00: failed command: READ FPDMA QUEUED 
> [  656.180417] ata7.00: cmd 60/00:50:28:d1:9c/0a:00:01:00:00/40 tag 10 ncq 
> dma 1310720 in 
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) 
> [  656.180421] ata7.00: status: { DRDY } 
> [  656.180423] ata7: hard resetting link 
> [  659.210328] amd_iommu_report_page_fault: 48 callbacks suppressed 
> [  659.210333] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.210345] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.210352] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.811069] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.811080] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.811086] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.827669] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.827676] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.827680] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  659.878319] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.426805] amd_iommu_report_page_fault: 248 callbacks suppressed 
> [  664.426809] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.426819] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.426825] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.676294] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.676305] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.676313] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.693501] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.693513] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.693521] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  664.711480] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000] 
> [  666.180398] ata7: softreset failed (1st FIS failed) 
> [  666.180405] ata7: hard resetting link 
> [  670.054987] amd_iommu_report_page_fault: 99 callbacks suppressed
> [  670.054992] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000]
> [  670.055003] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000]
> [  670.055010] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
> domain=0x0019 address=0x0 flags=0x0000]
> 
> 
> dmesg (add clear pending Interrupt):
> [  120.236847] ahci_error_intr: interface fatal error
> [  120.236854] ahci_error_intr: freeze (set PxIE to 0)
> [  120.265268] ata7: limiting SATA link speed to 3.0 Gbps
> [  120.265275] ata7.00: exception Emask 0x10 SAct 0x3f0 SErr 0x400000 
> action 0x6 frozen
> [  120.265279] ata7.00: irq_stat 0x08000000, interface fatal error
> [  120.265281] ata7: SError: { Handshk }
> [  120.265285] ata7.00: failed command: WRITE FPDMA QUEUED
> [  120.265287] ata7.00: cmd 61/00:20:a8:b9:50/0a:00:00:00:00/40 tag 4 ncq 
> dma 1310720 ou
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [  120.265295] ata7.00: status: { DRDY }
> [  120.265297] ata7.00: failed command: WRITE FPDMA QUEUED
> [  120.265299] ata7.00: cmd 61/20:28:a8:c3:50/03:00:00:00:00/40 tag 5 ncq 
> dma 409600 out
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [  120.265305] ata7.00: status: { DRDY }
> [  120.265307] ata7.00: failed command: WRITE FPDMA QUEUED
> [  120.265309] ata7.00: cmd 61/00:30:c8:c6:50/0a:00:00:00:00/40 tag 6 ncq 
> dma 1310720 ou
> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [  120.265314] ata7.00: status: { DRDY }
> [  120.265316] ata7.00: failed command: WRITE FPDMA QUEUED
> [  120.265318] ata7.00: cmd 61/00:38:c8:d0:50/0a:00:00:00:00/40 tag 7 ncq 
> dma 1310720 ou
> 	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [  120.265324] ata7.00: status: { DRDY }
> [  120.265325] ata7.00: failed command: WRITE FPDMA QUEUED
> [  120.265327] ata7.00: cmd 61/00:40:c8:da:50/0a:00:00:00:00/40 tag 8 ncq 
> dma 1310720 ou
> 	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [  120.265333] ata7.00: status: { DRDY }
> [  120.265334] ata7.00: failed command: WRITE FPDMA QUEUED
> [  120.265336] ata7.00: cmd 61/00:48:c8:e4:50/0a:00:00:00:00/40 tag 9 ncq 
> dma 1310720 ou
> 	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [  120.265342] ata7.00: status: { DRDY }
> [  120.265346] ata7: hard resetting link
> [  120.265348] ata7: ahci_hardreset *****
> [  120.265352] ahci 0000:0a:00.0: ahci_stop_engine: PORT_IRQ_STAT 0x40000008
> PORT_CMD 0x0004c617
> [  120.265356] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle 
> (setting PxCMD.ST to 0) 
> [  120.937522] ahci 0000:0a:00.0: ahci_start_engine: check 
> PORT_SCR_ERR 0x00000000, and to clear
> [  120.937534] ahci 0000:0a:00.0: ahci_start_engine: check 
> PORT_IRQ_STAT 0x40000008, and to clear 
> [  120.937544] ata7: ahci_softreset *****
> [  120.937550] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle  
> (setting PxCMD.ST to 0) 
> [  120.937555] ahci 0000:0a:00.0: ahci_start_engine: check 
> PORT_SCR_ERR 0x00000000, and to clear
> [  120.937559] ahci 0000:0a:00.0: ahci_start_engine: check 
> PORT_IRQ_STAT 0x00000000, and to clear 
> [  121.097231] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> [  121.113493] ata7.00: configured for UDMA/133
> [  121.113570] ata7: EH complete
> 
> I have printk PxIS & PxCMD values to check pending interrupt.
> It SBD FIS with ERR to setting PxIS.TFES and PxIS.SDBS bit. 
> 
> >>
> >>>    >
> >>>    > Signed-off-by: Szuying Chen <Chloe_Chen@asmedia.com.tw>
> >>>    > ---
> >>>    >  drivers/ata/libahci.c | 12 ++++++++++++
> >>>    >  1 file changed, 12 insertions(+)
> >>>    >
> >>>    > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> >>>    > index 06aec35f88f2..0ae51fd95d46 100644
> >>>    > --- a/drivers/ata/libahci.c
> >>>    > +++ b/drivers/ata/libahci.c
> >>>    > @@ -679,9 +679,21 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
> >>>    >
> >>>    >  void ahci_start_engine(struct ata_port *ap)
> >>>    >  {
> >>>    > +     struct ahci_host_priv *hpriv = ap->host->private_data;
> >>>    >       void __iomem *port_mmio = ahci_port_base(ap);
> >>>    >       u32 tmp;
> >>>    >
> >>>    > +     /* clear SError */
> >>>    > +     tmp = readl(port_mmio + PORT_SCR_ERR);
> >>>    > +     writel(tmp, port_mmio + PORT_SCR_ERR);
> >>>    > +
> >>>    > +     /* clear port IRQ */
> >>>    > +     tmp = readl(port_mmio + PORT_IRQ_STAT);
> >>>    > +     if (tmp)
> >>>    > +             writel(tmp, port_mmio + PORT_IRQ_STAT);
> >>>    > +
> >>>    > +     writel(1 << ap->port_no, hpriv->mmio + PORT_IRQ_STAT);
> >>>    > +
> >>>    >       /* start DMA */
> >>>    >       tmp = readl(port_mmio + PORT_CMD);
> >>>    >       tmp |= PORT_CMD_START;
> >>>    > --
> >>>    > 2.39.2
> >>>   >
> >>>
> >>>   --
> >>>   Damien Le Moal
> >>>    Western Digital Research
> >>>
> >> Thanks.
> >>
> >
> >-- 
> >Damien Le Moal
> >Western Digital Research

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: [PATCH] ahci: libahci: clear pending interrupt status
@ 2023-08-24 10:51 Szuying Chen
  2023-08-28 19:16 ` Niklas Cassel
  0 siblings, 1 reply; 4+ messages in thread
From: Szuying Chen @ 2023-08-24 10:51 UTC (permalink / raw)
  To: Niklas.Cassel, dlemoal, linux-ide, linux-kernel
  Cc: Jesse1_Chang, Richard_Hsu, Chloe_Chen


On 8/22/23 16:41, Niklas Cassel wrote: 
>
>On Mon, Aug 14, 2023 at 12:17:20PM +0800, Szuying Chen wrote:
>> On 8/10/23 17:59, Damien Le Moal wrote: 
>> >On 8/10/23 18:31, Szuying Chen wrote:
>> >>> On 8/10/23 14:12, Damien Le Moal wrote: 
>> >>>    On 8/10/23 14:05, Szuying Chen wrote:
>> >>>    > When ISR handle interface fatal error with error recovery after clear PxIS
>> >>>    > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
>> >>>    > during error recovery, which causing the HBA not issue any new commands
>> >>>    > after cmd.ST set 1.
>> >>>
>> >>>    This is not very clear. If there was a fatal error, the drive should be in
>> >>>    error state and no other SDB FIS can be received as the drive does absolutely
>> >>>    nothing while in error state (it only waits for a read log 10h command to be>
>> >>>    issued to get it out of error state). So if you are seeing 2 SDB FIS with
>> >>>    errors one after the other, you have a buggy device...
>> >>>
>> >>>    However, I may be misunderstanding your issue here. Could you provide more
>> >>>    details and a dmesg output example of the issue ?
>> >> 
>> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is 
>> >> received on an H2D Data FIS in normal operation, the HBA sets 
>> >> PxIS.IFS to 1 (fatal error) and halts operation. Referring to SATA 
>> >> 3.0 specification we know the device will halt queued command 
>> >> processing and transmit SDB FIS to host with ERR bit in Status field set to one(set PxIS.TFES to 1).
>> >
>> >Sure, but that SBD FIS should be completely ignored by the adapter 
>> >since it stopped operation. If you see it, then it means that the 
>> >handling of the first error was incomplete.
>> >
>> The SDB FIS  happens between pxIS.IE was cleared and PxCMD.ST not 
>> cleared to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.
>> 
>> >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters 
>> >> error recovery after cleaning up PxIS and PxIE. Then a SDB FIS is 
>> >> received with interrupt bit(PxIS.TFES) set to 1. According to AHCI 
>> >> 1.3.1 specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any 
>> >> new commands under PxIS.TFES alive during error recovery.
>> >
>> >But how come you see a new command being issued ? This entire 
>> >situation should result in a port reset with the first error. I do not see how this is possible.
>> >Are you saying that the port reset is not cleaning things up properly 
>> >? Could you share the dmesg output of this case ?
>> >
>> >
>> dmesg:
>> [  654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged 
>> [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  654.460522] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  656.180375] ata7.00: 
>> exception Emask 0x10 SAct 0x7c0 SErr 0x200100
>
>Hello Szuying Chen,
>
>From your logs we can see that SErr 0x200100:
>This means that PxSERR.DIAG.C was set.
>
>This means that:
>"Calculated Different CRC than Received: When this occurs, the HBA returns R_ERR and returns to idle."
>
>(Since PxIS.IFS is set, we know that it was a D2H Data FIS that had invalid CRC. (If it was a D2H non-Data FIS PxIS.INFS would have been set instead.))
>
>
>So your commit message is wrong. In your case it is the HBA that sends a R_ERR to the device, so you should look at "6.1.10 CRC error in received FIS" and not "6.1.9 Device responds to FIS with R_ERR".
>
>"When a CRC error occurs on a D2H Data FIS, the HBA sets PxIS.IFS to ‘1’.
>After a CRC error on a D2H Data FIS, the device may send a D2H Register FIS to indicate the reason for the error. The host should allow posting of non-Data FISes before PxCMD.ST is cleared to ‘0’."
>
>Looking at the state machine "5.3.11.2 DR:Receive" you will jump to state ERR:Fatal. (Meaning that we need to clear PxCMD.ST in order to issue new
>commands.)
>
>
>The HBA should this trigger an error irq when noticing the bad CRC.
>
>It is possible that the HBA triggers another error IRQ when receiving a D2H with TFES bit set, if it is received before PxCMD.ST is cleared to 0, see NDR:Accept:
>FIS Type is D2H Register and PxCMD.ST=0, jump to P:RegFISUpdate, which does not trigger any IRQ.
>If PxCMD.ST is set, the state machine will instead jump to RegFIS:Entry.
>
>
>So, in order to understand what is going on, can you please add unconditional prints in ahci_error_intr(), so we can actually see how many times this is called. (Right now it seems to only be called once). Please also print when PxCMD.ST is cleared to 0.
>(So that we can verify that ahci_error_intr() is never called after that). Also please remove all prints from other drivers, e.g. nouveau.
>
>
>Kind regards,
>Niklas
>
>
Hi Niklas,

In our case, receiving D2H with TFES bit set(PxIS.TFES set to 1) doesn't
triggers aother error IRQ before PxCMD.ST is cleared. Because IRQ was 
turn off(PxIE=0).

After received D2H with TFES bit set, port is restart(clearing PxCMD.ST to 
'0' and then setting PxCMD.ST to '1').In the dmesg output, device is lose 
after the port restart 3 times. Durring this time, the TFES bit was not 
cleared.

The Linux kernel dmesg output as shown below:
 [   40.251795] ahci_error_intr start (irq_stat 0x08000000) ****
[   40.251799] ahci_error_intr: interface fatal error
[   40.251800] ahci_error_intr: freeze (set pxIE to 0)
[   40.251805] ahci_error_intr end (PORT_IRQ_MASK = 0x00000000 )****
[   40.298113] ata8.00: exception Emask 0x10 SAct 0x70e000 SErr 0x280100 
action 0x6 frozen
[   40.298117] ata8.00: irq_stat 0x08000000, interface fatal error
[   40.298119] ata8: SError: { UnrecovData 10B8B BadCRC }
[   40.298121] ata8.00: failed command: READ FPDMA QUEUED
[   40.298122] ata8.00: cmd 60/b8:68:c0:c3:03/08:00:00:00:00/40 tag 13 
ncq dma 1142784 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[   40.298126] ata8.00: status: { DRDY }
[   40.298128] ata8.00: failed command: READ FPDMA QUEUED
[   40.298129] ata8.00: cmd 60/b8:70:78:cc:03/05:00:00:00:00/40 tag 14 
ncq dma 749568 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[   40.298132] ata8.00: status: { DRDY }
[   40.298133] ata8.00: failed command: READ FPDMA QUEUED
[   40.298134] ata8.00: cmd 60/00:78:30:d2:03/0a:00:00:00:00/40 tag 15 
ncq dma 1310720 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[   40.298137] ata8.00: status: { DRDY }
[   40.298138] ata8.00: failed command: READ FPDMA QUEUED
[   40.298139] ata8.00: cmd 60/00:a0:30:dc:03/0a:00:00:00:00/40 tag 20 
ncq dma 1310720 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[   40.298142] ata8.00: status: { DRDY }
[   40.298143] ata8.00: failed command: READ FPDMA QUEUED
[   40.298144] ata8.00: cmd 60/40:a8:30:e6:03/08:00:00:00:00/40 tag 21 
ncq dma 1081344 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[   40.298147] ata8.00: status: { DRDY }
[   40.298148] ata8.00: failed command: READ FPDMA QUEUED
[   40.298149] ata8.00: cmd 60/90:b0:70:ee:03/09:00:00:00:00/40 tag 22 
ncq dma 1253376 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[   40.298151] ata8.00: status: { DRDY }
[   40.298154] ata8: hard resetting link
[   40.298155] ata8: ahci_hardreset *****
[   40.298160] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004d517) 
[   40.298166] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   40.298167] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   40.608009] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   40.608031] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   40.608041] ata8: ahci_softreset *****
[   40.608049] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   40.608059] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   40.608064] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   40.608069] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   40.608079] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   50.199215] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   50.199236] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   50.199242] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   50.199254] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   50.199264] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   50.199269] ata8: softreset failed (1st FIS failed)
[   50.199278] ata8: hard resetting link
[   50.199283] ata8: ahci_hardreset *****
[   50.199289] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   50.199298] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   50.199302] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   50.514443] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   50.514464] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   50.514474] ata8: ahci_softreset *****
[   50.514482] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   50.514492] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   50.514496] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   50.514502] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   50.514511] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   60.173000] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   60.173020] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   60.173026] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   60.173038] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   60.173048] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   60.173053] ata8: softreset failed (1st FIS failed)
[   60.173063] ata8: hard resetting link
[   60.173067] ata8: ahci_hardreset *****
[   60.173073] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   60.173082] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   60.173086] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   60.488241] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   60.488262] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   60.488272] ata8: ahci_softreset *****
[   60.488280] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   60.488289] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   60.488294] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   60.488300] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   60.488309] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   95.144383] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   95.144402] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   95.144409] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   95.144420] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   95.144430] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   95.144435] ata8: softreset failed (1st FIS failed)
[   95.144450] ata8: limiting SATA link speed to 3.0 Gbps
[   95.144454] ata8: hard resetting link
[   95.144458] ata8: ahci_hardreset *****
[   95.144464] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   95.144473] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   95.144477] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   95.459765] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   95.459785] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[   95.459795] ata8: ahci_softreset *****
[   95.459803] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[   95.459813] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[   95.459817] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[   95.459823] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[   95.459832] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[  100.142968] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) 
[  100.142987] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[  100.142993] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[  100.143005] ahci 0000:6f:00.0: ahci_start_engine start 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) ***
[  100.143015] ahci 0000:6f:00.0: ahci_start_engine end 
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c017) ***
[  100.143020] ata8: softreset failed (1st FIS failed)
[  100.143035] ata8: reset failed, giving up
[  100.143040] ata8.00: disable device
[  100.143073] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x00000000 PORT_CMD 0x00044016) 
[  100.143079] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[  100.143095] ata8: EH complete
[  100.143157] sd 7:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET 
driverbyte=DRIVER_OK cmd_age=60s
[  100.143166] sd 7:0:0:0: [sdb] tag#13 CDB: Read(10) 28 00 00 03 c3 c0 00 
 08 b8 00
[  100.143170] I/O error, dev sdb, sector 246720 op 0x0:(READ) flags 0x4000 
phys_seg 168 prio class 2
[  100.143196] sd 7:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET 
driverbyte=DRIVER_OK cmd_age=60s
[  100.143201] sd 7:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 00 03 cc 78 00 05 
b8 00
[  100.143204] I/O error, dev sdb, sector 248952 op 0x0:(READ) flags 0x0 phys_seg
 88 prio class 2
[  100.143226] sd 7:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET 
driverbyte=DRIVER_OK cmd_age=0s
[  100.143238] sd 7:0:0:0: [sdb] tag#24 CDB: ATA command pass through(16) 85 06 
 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
[  100.143279] sd 7:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET 
driverbyte=DRIVER_OK cmd_age=60s
[  100.143287] sd 7:0:0:0: [sdb] tag#15 CDB: Read(10) 28 00 00 03 d2 30 00 
 0a 00 00
[  100.143291] I/O error, dev sdb, sector 250416 op 0x0:(READ) flags 0x4000 
phys_seg 112 prio class 2
[  100.143313] sd 7:0:0:0: [sdb] tag#16 FAILED Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=60s
[  100.143322] sd 7:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 00 03 dc 30 00
 0a 00 00
[  100.143327] I/O error, dev sdb, sector 252976 op 0x0:(READ) flags 0x4000 
phys_seg 81 prio class 2
[  100.143346] sd 7:0:0:0: [sdb] tag#17 FAILED Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=60s
[  100.143353] sd 7:0:0:0: [sdb] tag#17 CDB: Read(10) 28 00 00 03 e6
 30 00 08
 40 00
[  100.143357] I/O error, dev sdb, sector 255536 op 0x0:(READ) flags
 0x4000 phys_seg 65 prio class 2
[  100.143499] sd 7:0:0:0: [sdb] tag#18 FAILED Result:
 hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=60s
[  100.143508] sd 7:0:0:0: [sdb] tag#18 CDB: Read(10) 28 00 00 03 ee 70 00
 09 90 00
[  100.143513] I/O error, dev sdb, sector 257648 op 0x0:(READ) flags 0x0
 phys_seg 70 prio class 2
 

The Linux(add clear pending interrupt) kernel dmesg output as shown below:

[  331.569303] ahci_error_intr start (irq_stat 0x08000000) ****
[  331.569320] ahci_error_intr: interface fatal error
[  331.569324] ahci_error_intr: freeze (set pxIE to 0)
[  331.569345] ahci_error_intr end (PORT_IRQ_MASK = 0x00000000 )****
[  331.599010] ata8.00: exception Emask 0x10 SAct 0x18 SErr 0x280100 
action 0x6 frozen
[  331.599026] ata8.00: irq_stat 0x08000000, interface fatal error
[  331.599032] ata8: SError: { UnrecovData 10B8B BadCRC }
[  331.599042] ata8.00: failed command: READ FPDMA QUEUED
[  331.599047] ata8.00: cmd 60/00:18:e0:3d:19/0a:00:00:00:00/40 tag 3 
ncq dma 1310720 in
        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  331.599064] ata8.00: status: { DRDY }
[  331.599070] ata8.00: failed command: READ FPDMA QUEUED
[  331.599074] ata8.00: cmd 60/90:20:e0:47:19/06:00:00:00:00/40 tag 4 
ncq dma 860160 in
        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  331.599089] ata8.00: status: { DRDY }
[  331.599096] ata8: hard resetting link
[  331.599100] ata8: ahci_hardreset *****
[  331.599107] ahci 0000:6f:00.0: ahci_stop_engine start *** 
( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c317) 
[  331.599118] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x40000008 PORT_CMD 0x00044016) 
[  331.599123] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[  331.914440] ahci 0000:6f:00.0: ahci_start_engine: check 
PORT_SCR_ERR 0x00000000, and to clear
[  331.914447] ahci 0000:6f:00.0: ahci_start_engine: check 
PORT_IRQ_STAT 0x40000008, and to clear 
[  331.914456] ata8: ahci_softreset *****
[  331.914466] ahci 0000:6f:00.0: ahci_stop_engine: setting HBA to idle  
(PORT_IRQ_STAT 0x00000000 PORT_CMD 0x00044016) 
[  331.914470] ahci 0000:6f:00.0: ahci_stop_engine end *** 
[  331.914473] ahci 0000:6f:00.0: ahci_start_engine: check 
PORT_SCR_ERR 0x00000000, and to clear
[  331.914477] ahci 0000:6f:00.0: ahci_start_engine: check 
PORT_IRQ_STAT 0x00000000, and to clear 
[  332.075031] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  332.075718] ata8.00: configured for UDMA/133
[  332.075739] ata8: EH complete

Thanks.

>
>> action 0x6 frozen
>> [  656.180384] ata7.00: irq_stat 0x08000000, interface fatal error [  
>> 656.180386] ata7: SError: { UnrecovData BadCRC } [  656.180389] 
>> ata7.00: failed command: READ FPDMA QUEUED [  656.180390] ata7.00: cmd 
>> 60/00:30:28:a9:9c/0a:00:01:00:00/40 tag 6 ncq dma 1310720 in
>> 	res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 656.180395] ata7.00: status: { DRDY } [  656.180397] ata7.00: failed 
>> command: READ FPDMA QUEUED [  656.180398] ata7.00: cmd 
>> 60/00:38:28:b3:9c/0a:00:01:00:00/40 tag 7 ncq dma 1310720 in
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 656.180402] ata7.00: status: { DRDY } [  656.180403] ata7.00: failed 
>> command: READ FPDMA QUEUED [  656.180404] ata7.00: cmd 
>> 60/00:40:28:bd:9c/0a:00:01:00:00/40 tag 8 ncq dma 1310720 in
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 656.180408] ata7.00: status: { DRDY } [  656.180410] ata7.00: failed 
>> command: READ FPDMA QUEUED [  656.180411] ata7.00: cmd 
>> 60/00:48:28:c7:9c/0a:00:01:00:00/40 tag 9 ncq dma 1310720 in
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 656.180414] ata7.00: status: { DRDY } [  656.180416] ata7.00: failed 
>> command: READ FPDMA QUEUED [  656.180417] ata7.00: cmd 
>> 60/00:50:28:d1:9c/0a:00:01:00:00/40 tag 10 ncq dma 1310720 in
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 656.180421] ata7.00: status: { DRDY } [  656.180423] ata7: hard 
>> resetting link [  659.210328] amd_iommu_report_page_fault: 48 
>> callbacks suppressed [  659.210333] nouveau 0000:09:00.0: AMD-Vi: 
>> Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.210345] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.210352] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.811069] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.811080] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.811086] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.827669] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.827676] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.827680] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  659.878319] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.426805] 
>> amd_iommu_report_page_fault: 248 callbacks suppressed [  664.426809] 
>> nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.426819] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.426825] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.676294] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.676305] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.676313] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.693501] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.693513] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.693521] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  664.711480] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  666.180398] ata7: softreset 
>> failed (1st FIS failed) [  666.180405] ata7: hard resetting link [  
>> 670.054987] amd_iommu_report_page_fault: 99 callbacks suppressed [  
>> 670.054992] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  670.055003] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000] [  670.055010] nouveau 
>> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x0019 address=0x0 flags=0x0000]
>> 
>> 
>> dmesg (add clear pending Interrupt):
>> [  120.236847] ahci_error_intr: interface fatal error [  120.236854] 
>> ahci_error_intr: freeze (set PxIE to 0) [  120.265268] ata7: limiting 
>> SATA link speed to 3.0 Gbps [  120.265275] ata7.00: exception Emask 
>> 0x10 SAct 0x3f0 SErr 0x400000 action 0x6 frozen [  120.265279] 
>> ata7.00: irq_stat 0x08000000, interface fatal error [  120.265281] 
>> ata7: SError: { Handshk } [  120.265285] ata7.00: failed command: 
>> WRITE FPDMA QUEUED [  120.265287] ata7.00: cmd 
>> 61/00:20:a8:b9:50/0a:00:00:00:00/40 tag 4 ncq dma 1310720 ou
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 120.265295] ata7.00: status: { DRDY } [  120.265297] ata7.00: failed 
>> command: WRITE FPDMA QUEUED [  120.265299] ata7.00: cmd 
>> 61/20:28:a8:c3:50/03:00:00:00:00/40 tag 5 ncq dma 409600 out
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 120.265305] ata7.00: status: { DRDY } [  120.265307] ata7.00: failed 
>> command: WRITE FPDMA QUEUED [  120.265309] ata7.00: cmd 
>> 61/00:30:c8:c6:50/0a:00:00:00:00/40 tag 6 ncq dma 1310720 ou
>> 	res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 120.265314] ata7.00: status: { DRDY } [  120.265316] ata7.00: failed 
>> command: WRITE FPDMA QUEUED [  120.265318] ata7.00: cmd 
>> 61/00:38:c8:d0:50/0a:00:00:00:00/40 tag 7 ncq dma 1310720 ou
>> 	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 120.265324] ata7.00: status: { DRDY } [  120.265325] ata7.00: failed 
>> command: WRITE FPDMA QUEUED [  120.265327] ata7.00: cmd 
>> 61/00:40:c8:da:50/0a:00:00:00:00/40 tag 8 ncq dma 1310720 ou
>> 	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 120.265333] ata7.00: status: { DRDY } [  120.265334] ata7.00: failed 
>> command: WRITE FPDMA QUEUED [  120.265336] ata7.00: cmd 
>> 61/00:48:c8:e4:50/0a:00:00:00:00/40 tag 9 ncq dma 1310720 ou
>> 	res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [  
>> 120.265342] ata7.00: status: { DRDY } [  120.265346] ata7: hard 
>> resetting link [  120.265348] ata7: ahci_hardreset ***** [  
>> 120.265352] ahci 0000:0a:00.0: ahci_stop_engine: PORT_IRQ_STAT 
>> 0x40000008 PORT_CMD 0x0004c617 [  120.265356] ahci 0000:0a:00.0: 
>> ahci_stop_engine: setting HBA to idle (setting PxCMD.ST to 0) [  
>> 120.937522] ahci 0000:0a:00.0: ahci_start_engine: check PORT_SCR_ERR 
>> 0x00000000, and to clear [  120.937534] ahci 0000:0a:00.0: 
>> ahci_start_engine: check PORT_IRQ_STAT 0x40000008, and to clear [  
>> 120.937544] ata7: ahci_softreset ***** [  120.937550] ahci 
>> 0000:0a:00.0: ahci_stop_engine: setting HBA to idle (setting PxCMD.ST 
>> to 0) [  120.937555] ahci 0000:0a:00.0: ahci_start_engine: check 
>> PORT_SCR_ERR 0x00000000, and to clear [  120.937559] ahci 
>> 0000:0a:00.0: ahci_start_engine: check PORT_IRQ_STAT 0x00000000, and 
>> to clear [  121.097231] ata7: SATA link up 3.0 Gbps (SStatus 123 
>> SControl 320) [  121.113493] ata7.00: configured for UDMA/133 [  
>> 121.113570] ata7: EH complete
>> 
>> I have printk PxIS & PxCMD values to check pending interrupt.
>> It SBD FIS with ERR to setting PxIS.TFES and PxIS.SDBS bit. 
>> 
>> >>
>> >>>    >
>> >>>    > Signed-off-by: Szuying Chen <Chloe_Chen@asmedia.com.tw>
>> >>>    > ---
>> >>>    >  drivers/ata/libahci.c | 12 ++++++++++++
>> >>>    >  1 file changed, 12 insertions(+)
>> >>>    >
>> >>>    > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
>> >>>    > index 06aec35f88f2..0ae51fd95d46 100644
>> >>>    > --- a/drivers/ata/libahci.c
>> >>>    > +++ b/drivers/ata/libahci.c
>> >>>    > @@ -679,9 +679,21 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
>> >>>    >
>> >>>    >  void ahci_start_engine(struct ata_port *ap)
>> >>>    >  {
>> >>>    > +     struct ahci_host_priv *hpriv = ap->host->private_data;
>> >>>    >       void __iomem *port_mmio = ahci_port_base(ap);
>> >>>    >       u32 tmp;
>> >>>    >
>> >>>    > +     /* clear SError */
>> >>>    > +     tmp = readl(port_mmio + PORT_SCR_ERR);
>> >>>    > +     writel(tmp, port_mmio + PORT_SCR_ERR);
>> >>>    > +
>> >>>    > +     /* clear port IRQ */
>> >>>    > +     tmp = readl(port_mmio + PORT_IRQ_STAT);
>> >>>    > +     if (tmp)
>> >>>    > +             writel(tmp, port_mmio + PORT_IRQ_STAT);
>> >>>    > +
>> >>>    > +     writel(1 << ap->port_no, hpriv->mmio + PORT_IRQ_STAT);
>> >>>    > +
>> >>>    >       /* start DMA */
>> >>>    >       tmp = readl(port_mmio + PORT_CMD);
>> >>>    >       tmp |= PORT_CMD_START;
>> >>>    > --
>> >>>    > 2.39.2
>> >>>   >
>> >>>
>> >>>   --
>> >>>   Damien Le Moal
>> >>>    Western Digital Research
>> >>>
>> >> Thanks.
>> >>
>> >
>> >--
>> >Damien Le Moal
>> >Western Digital Research
><p></p>
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: [PATCH] ahci: libahci: clear pending interrupt status
  2023-08-24 10:51 Szuying Chen
@ 2023-08-28 19:16 ` Niklas Cassel
  0 siblings, 0 replies; 4+ messages in thread
From: Niklas Cassel @ 2023-08-28 19:16 UTC (permalink / raw)
  To: Szuying Chen
  Cc: dlemoal@kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jesse1_Chang@asmedia.com.tw,
	Richard_Hsu@asmedia.com.tw, Chloe_Chen@asmedia.com.tw

On Thu, Aug 24, 2023 at 06:51:59PM +0800, Szuying Chen wrote:
> 
> On 8/22/23 16:41, Niklas Cassel wrote: 
> >
> >On Mon, Aug 14, 2023 at 12:17:20PM +0800, Szuying Chen wrote:
> >> On 8/10/23 17:59, Damien Le Moal wrote: 
> >> >On 8/10/23 18:31, Szuying Chen wrote:
> >> >>> On 8/10/23 14:12, Damien Le Moal wrote: 
> >> >>>    On 8/10/23 14:05, Szuying Chen wrote:
> >> >>>    > When ISR handle interface fatal error with error recovery after clear PxIS
> >> >>>    > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
> >> >>>    > during error recovery, which causing the HBA not issue any new commands
> >> >>>    > after cmd.ST set 1.
> >> >>>
> >> >>>    This is not very clear. If there was a fatal error, the drive should be in
> >> >>>    error state and no other SDB FIS can be received as the drive does absolutely
> >> >>>    nothing while in error state (it only waits for a read log 10h command to be>
> >> >>>    issued to get it out of error state). So if you are seeing 2 SDB FIS with
> >> >>>    errors one after the other, you have a buggy device...
> >> >>>
> >> >>>    However, I may be misunderstanding your issue here. Could you provide more
> >> >>>    details and a dmesg output example of the issue ?
> >> >> 
> >> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is 
> >> >> received on an H2D Data FIS in normal operation, the HBA sets 
> >> >> PxIS.IFS to 1 (fatal error) and halts operation. Referring to SATA 
> >> >> 3.0 specification we know the device will halt queued command 
> >> >> processing and transmit SDB FIS to host with ERR bit in Status field set to one(set PxIS.TFES to 1).
> >> >
> >> >Sure, but that SBD FIS should be completely ignored by the adapter 
> >> >since it stopped operation. If you see it, then it means that the 
> >> >handling of the first error was incomplete.
> >> >
> >> The SDB FIS  happens between pxIS.IE was cleared and PxCMD.ST not 
> >> cleared to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.
> >> 
> >> >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters 
> >> >> error recovery after cleaning up PxIS and PxIE. Then a SDB FIS is 
> >> >> received with interrupt bit(PxIS.TFES) set to 1. According to AHCI 
> >> >> 1.3.1 specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any 
> >> >> new commands under PxIS.TFES alive during error recovery.
> >> >
> >> >But how come you see a new command being issued ? This entire 
> >> >situation should result in a port reset with the first error. I do not see how this is possible.
> >> >Are you saying that the port reset is not cleaning things up properly 
> >> >? Could you share the dmesg output of this case ?
> >> >
> >> >
> >> dmesg:
> >> [  654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged 
> >> [IO_PAGE_FAULT
> >> domain=0x0019 address=0x0 flags=0x0000] [  654.460522] nouveau 
> >> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> >> domain=0x0019 address=0x0 flags=0x0000] [  656.180375] ata7.00: 
> >> exception Emask 0x10 SAct 0x7c0 SErr 0x200100
> >
> >Hello Szuying Chen,
> >
> >From your logs we can see that SErr 0x200100:
> >This means that PxSERR.DIAG.C was set.
> >
> >This means that:
> >"Calculated Different CRC than Received: When this occurs, the HBA returns R_ERR and returns to idle."
> >
> >(Since PxIS.IFS is set, we know that it was a D2H Data FIS that had invalid CRC. (If it was a D2H non-Data FIS PxIS.INFS would have been set instead.))
> >
> >
> >So your commit message is wrong. In your case it is the HBA that sends a R_ERR to the device, so you should look at "6.1.10 CRC error in received FIS" and not "6.1.9 Device responds to FIS with R_ERR".
> >
> >"When a CRC error occurs on a D2H Data FIS, the HBA sets PxIS.IFS to ‘1’.
> >After a CRC error on a D2H Data FIS, the device may send a D2H Register FIS to indicate the reason for the error. The host should allow posting of non-Data FISes before PxCMD.ST is cleared to ‘0’."
> >
> >Looking at the state machine "5.3.11.2 DR:Receive" you will jump to state ERR:Fatal. (Meaning that we need to clear PxCMD.ST in order to issue new
> >commands.)
> >
> >
> >The HBA should this trigger an error irq when noticing the bad CRC.
> >
> >It is possible that the HBA triggers another error IRQ when receiving a D2H with TFES bit set, if it is received before PxCMD.ST is cleared to 0, see NDR:Accept:
> >FIS Type is D2H Register and PxCMD.ST=0, jump to P:RegFISUpdate, which does not trigger any IRQ.
> >If PxCMD.ST is set, the state machine will instead jump to RegFIS:Entry.
> >
> >
> >So, in order to understand what is going on, can you please add unconditional prints in ahci_error_intr(), so we can actually see how many times this is called. (Right now it seems to only be called once). Please also print when PxCMD.ST is cleared to 0.
> >(So that we can verify that ahci_error_intr() is never called after that). Also please remove all prints from other drivers, e.g. nouveau.
> >
> >
> >Kind regards,
> >Niklas
> >
> >
> Hi Niklas,
> 
> In our case, receiving D2H with TFES bit set(PxIS.TFES set to 1) doesn't
> triggers aother error IRQ before PxCMD.ST is cleared. Because IRQ was 
> turn off(PxIE=0).
> 
> After received D2H with TFES bit set, port is restart(clearing PxCMD.ST to 
> '0' and then setting PxCMD.ST to '1').In the dmesg output, device is lose 
> after the port restart 3 times. Durring this time, the TFES bit was not 
> cleared.
> 
> The Linux kernel dmesg output as shown below:
>  [   40.251795] ahci_error_intr start (irq_stat 0x08000000) ****
> [   40.251799] ahci_error_intr: interface fatal error
> [   40.251800] ahci_error_intr: freeze (set pxIE to 0)
> [   40.251805] ahci_error_intr end (PORT_IRQ_MASK = 0x00000000 )****
> [   40.298113] ata8.00: exception Emask 0x10 SAct 0x70e000 SErr 0x280100 
> action 0x6 frozen
> [   40.298117] ata8.00: irq_stat 0x08000000, interface fatal error
> [   40.298119] ata8: SError: { UnrecovData 10B8B BadCRC }
> [   40.298121] ata8.00: failed command: READ FPDMA QUEUED
> [   40.298122] ata8.00: cmd 60/b8:68:c0:c3:03/08:00:00:00:00/40 tag 13 
> ncq dma 1142784 in
>          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [   40.298126] ata8.00: status: { DRDY }
> [   40.298128] ata8.00: failed command: READ FPDMA QUEUED
> [   40.298129] ata8.00: cmd 60/b8:70:78:cc:03/05:00:00:00:00/40 tag 14 
> ncq dma 749568 in
>          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [   40.298132] ata8.00: status: { DRDY }
> [   40.298133] ata8.00: failed command: READ FPDMA QUEUED
> [   40.298134] ata8.00: cmd 60/00:78:30:d2:03/0a:00:00:00:00/40 tag 15 
> ncq dma 1310720 in
>          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [   40.298137] ata8.00: status: { DRDY }
> [   40.298138] ata8.00: failed command: READ FPDMA QUEUED
> [   40.298139] ata8.00: cmd 60/00:a0:30:dc:03/0a:00:00:00:00/40 tag 20 
> ncq dma 1310720 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [   40.298142] ata8.00: status: { DRDY }
> [   40.298143] ata8.00: failed command: READ FPDMA QUEUED
> [   40.298144] ata8.00: cmd 60/40:a8:30:e6:03/08:00:00:00:00/40 tag 21 
> ncq dma 1081344 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [   40.298147] ata8.00: status: { DRDY }
> [   40.298148] ata8.00: failed command: READ FPDMA QUEUED
> [   40.298149] ata8.00: cmd 60/90:b0:70:ee:03/09:00:00:00:00/40 tag 22 
> ncq dma 1253376 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [   40.298151] ata8.00: status: { DRDY }
> [   40.298154] ata8: hard resetting link
> [   40.298155] ata8: ahci_hardreset *****
> [   40.298160] ahci 0000:6f:00.0: ahci_stop_engine start ***
> ( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004d517)

Hello Szuying Chen,


Here we can see that PORT_IRQ_STAT is still set after a hardreset (COMRESET).

This does not seem to be in-line with AHCI 1.3.1.

libata does clear PxIS in ahci_port_init(), but that function seems to only
be called at probe time.
libata does also clear PxIS in ahci_thaw(), but that is after issuing the
COMRESET.



The current libata flow, for your error, is like this:

ahci_error_intr() -> ata_port_freeze()
ata_port_abort() -> EH scheduled for all active QCs.

ata_scsi_error() -> ata_scsi_port_error_handler() -> ahci_error_handler() ->
sata_pmp_error_handler() -> ata_eh_autopsy() -> ata_eh_link_autopsy() ->
ata_eh_analyze_serror()

ata_eh_analyze_serror() will see that SError has SERR_DATA data set,
so it will mark it with:
	err_mask |= AC_ERR_ATA_BUS;
	action |= ATA_EH_RESET;

sata_pmp_eh_recover() -> ata_eh_recover() -> ata_eh_reset() ->
ahci_hardreset() -> ahci_do_hardreset().



ahci_do_hardreset() does:

hpriv->stop_engine() which clears PxCMD.ST.

then performs a COMRESET (calls sata_link_hardreset())

then calls hpriv->start_engine().


After that ata_eh_reset() (which called ahci_do_hardreset()),
will call ata_eh_thaw_port() which will clear PxIS and enable IRQs.

ata_eh_reset() will then call postreset (ahci_postreset()).

ahci_postreset() will call ata_std_postreset() which will clear SError.



The problem with this is that a lot of this is not in line with
AHCI 1.3.1 - section 6.2.2.2, which states that error recovery
should be performed in the following order:

- Reads PxSACT to see which commands have not yet completed
- Clears PxCMD.ST to ‘0’ to reset the PxCI register, waits for PxCMD.CR to
clear to ‘0’
- Clears any error bits in PxSERR to enable capturing new errors.
- Clears status bits in PxIS as appropriate
- If PxTFD.STS.BSY or PxTFD.STS.DRQ is set to ‘1’, issue a COMRESET to the
device to put it in an idle state.
- Sets PxCMD.ST to ‘1’ to enable issuing new commands
- Issue READ LOG EXT to determine the cause of the error condition if
  software did not have to perform a reset (COMRESET or software reset)
  as part of the error recovery


To be in-line with the spec, I think the clearing of PxIS (and SError)
should optimally be done in ahci_do_hardreset(), between the clearing
and setting of PxCMD.ST, and before the COMRESET (before calling
sata_link_hardreset()).



Your current patch proposal (which makes things work for you),
clears PxIS after the COMRESET... Which is also not in line with the spec.
Since your non-spec compliant patch makes things work for you, perhaps you
could instead try to put the PxIS clearing in ahci_postreset().
At least then the clearing of SError and PxIS is done at the same time
(even if both are done after setting PxCMD.ST to 1...)


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-08-28 19:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-14  4:17 Re: [PATCH] ahci: libahci: clear pending interrupt status Szuying Chen
2023-08-22  8:41 ` Niklas Cassel
  -- strict thread matches above, loose matches on Subject: below --
2023-08-24 10:51 Szuying Chen
2023-08-28 19:16 ` Niklas Cassel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox