From: Niklas Cassel <Niklas.Cassel@wdc.com>
To: Szuying Chen <chensiying21@gmail.com>
Cc: "dlemoal@kernel.org" <dlemoal@kernel.org>,
"linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"Jesse1_Chang@asmedia.com.tw" <Jesse1_Chang@asmedia.com.tw>,
"Richard_Hsu@asmedia.com.tw" <Richard_Hsu@asmedia.com.tw>,
"Chloe_Chen@asmedia.com.tw" <Chloe_Chen@asmedia.com.tw>
Subject: Re: Re: [PATCH] ahci: libahci: clear pending interrupt status
Date: Mon, 28 Aug 2023 19:16:44 +0000 [thread overview]
Message-ID: <ZOzyl0r2+ijSx7il@x1-carbon> (raw)
In-Reply-To: <20230824105159.12384-1-Chloe_Chen@asmedia.com.tw>
On Thu, Aug 24, 2023 at 06:51:59PM +0800, Szuying Chen wrote:
>
> On 8/22/23 16:41, Niklas Cassel wrote:
> >
> >On Mon, Aug 14, 2023 at 12:17:20PM +0800, Szuying Chen wrote:
> >> On 8/10/23 17:59, Damien Le Moal wrote:
> >> >On 8/10/23 18:31, Szuying Chen wrote:
> >> >>> On 8/10/23 14:12, Damien Le Moal wrote:
> >> >>> On 8/10/23 14:05, Szuying Chen wrote:
> >> >>> > When ISR handle interface fatal error with error recovery after clear PxIS
> >> >>> > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
> >> >>> > during error recovery, which causing the HBA not issue any new commands
> >> >>> > after cmd.ST set 1.
> >> >>>
> >> >>> This is not very clear. If there was a fatal error, the drive should be in
> >> >>> error state and no other SDB FIS can be received as the drive does absolutely
> >> >>> nothing while in error state (it only waits for a read log 10h command to be>
> >> >>> issued to get it out of error state). So if you are seeing 2 SDB FIS with
> >> >>> errors one after the other, you have a buggy device...
> >> >>>
> >> >>> However, I may be misunderstanding your issue here. Could you provide more
> >> >>> details and a dmesg output example of the issue ?
> >> >>
> >> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is
> >> >> received on an H2D Data FIS in normal operation, the HBA sets
> >> >> PxIS.IFS to 1 (fatal error) and halts operation. Referring to SATA
> >> >> 3.0 specification we know the device will halt queued command
> >> >> processing and transmit SDB FIS to host with ERR bit in Status field set to one(set PxIS.TFES to 1).
> >> >
> >> >Sure, but that SBD FIS should be completely ignored by the adapter
> >> >since it stopped operation. If you see it, then it means that the
> >> >handling of the first error was incomplete.
> >> >
> >> The SDB FIS happens between pxIS.IE was cleared and PxCMD.ST not
> >> cleared to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.
> >>
> >> >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters
> >> >> error recovery after cleaning up PxIS and PxIE. Then a SDB FIS is
> >> >> received with interrupt bit(PxIS.TFES) set to 1. According to AHCI
> >> >> 1.3.1 specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any
> >> >> new commands under PxIS.TFES alive during error recovery.
> >> >
> >> >But how come you see a new command being issued ? This entire
> >> >situation should result in a port reset with the first error. I do not see how this is possible.
> >> >Are you saying that the port reset is not cleaning things up properly
> >> >? Could you share the dmesg output of this case ?
> >> >
> >> >
> >> dmesg:
> >> [ 654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged
> >> [IO_PAGE_FAULT
> >> domain=0x0019 address=0x0 flags=0x0000] [ 654.460522] nouveau
> >> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> >> domain=0x0019 address=0x0 flags=0x0000] [ 656.180375] ata7.00:
> >> exception Emask 0x10 SAct 0x7c0 SErr 0x200100
> >
> >Hello Szuying Chen,
> >
> >From your logs we can see that SErr 0x200100:
> >This means that PxSERR.DIAG.C was set.
> >
> >This means that:
> >"Calculated Different CRC than Received: When this occurs, the HBA returns R_ERR and returns to idle."
> >
> >(Since PxIS.IFS is set, we know that it was a D2H Data FIS that had invalid CRC. (If it was a D2H non-Data FIS PxIS.INFS would have been set instead.))
> >
> >
> >So your commit message is wrong. In your case it is the HBA that sends a R_ERR to the device, so you should look at "6.1.10 CRC error in received FIS" and not "6.1.9 Device responds to FIS with R_ERR".
> >
> >"When a CRC error occurs on a D2H Data FIS, the HBA sets PxIS.IFS to ‘1’.
> >After a CRC error on a D2H Data FIS, the device may send a D2H Register FIS to indicate the reason for the error. The host should allow posting of non-Data FISes before PxCMD.ST is cleared to ‘0’."
> >
> >Looking at the state machine "5.3.11.2 DR:Receive" you will jump to state ERR:Fatal. (Meaning that we need to clear PxCMD.ST in order to issue new
> >commands.)
> >
> >
> >The HBA should this trigger an error irq when noticing the bad CRC.
> >
> >It is possible that the HBA triggers another error IRQ when receiving a D2H with TFES bit set, if it is received before PxCMD.ST is cleared to 0, see NDR:Accept:
> >FIS Type is D2H Register and PxCMD.ST=0, jump to P:RegFISUpdate, which does not trigger any IRQ.
> >If PxCMD.ST is set, the state machine will instead jump to RegFIS:Entry.
> >
> >
> >So, in order to understand what is going on, can you please add unconditional prints in ahci_error_intr(), so we can actually see how many times this is called. (Right now it seems to only be called once). Please also print when PxCMD.ST is cleared to 0.
> >(So that we can verify that ahci_error_intr() is never called after that). Also please remove all prints from other drivers, e.g. nouveau.
> >
> >
> >Kind regards,
> >Niklas
> >
> >
> Hi Niklas,
>
> In our case, receiving D2H with TFES bit set(PxIS.TFES set to 1) doesn't
> triggers aother error IRQ before PxCMD.ST is cleared. Because IRQ was
> turn off(PxIE=0).
>
> After received D2H with TFES bit set, port is restart(clearing PxCMD.ST to
> '0' and then setting PxCMD.ST to '1').In the dmesg output, device is lose
> after the port restart 3 times. Durring this time, the TFES bit was not
> cleared.
>
> The Linux kernel dmesg output as shown below:
> [ 40.251795] ahci_error_intr start (irq_stat 0x08000000) ****
> [ 40.251799] ahci_error_intr: interface fatal error
> [ 40.251800] ahci_error_intr: freeze (set pxIE to 0)
> [ 40.251805] ahci_error_intr end (PORT_IRQ_MASK = 0x00000000 )****
> [ 40.298113] ata8.00: exception Emask 0x10 SAct 0x70e000 SErr 0x280100
> action 0x6 frozen
> [ 40.298117] ata8.00: irq_stat 0x08000000, interface fatal error
> [ 40.298119] ata8: SError: { UnrecovData 10B8B BadCRC }
> [ 40.298121] ata8.00: failed command: READ FPDMA QUEUED
> [ 40.298122] ata8.00: cmd 60/b8:68:c0:c3:03/08:00:00:00:00/40 tag 13
> ncq dma 1142784 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 40.298126] ata8.00: status: { DRDY }
> [ 40.298128] ata8.00: failed command: READ FPDMA QUEUED
> [ 40.298129] ata8.00: cmd 60/b8:70:78:cc:03/05:00:00:00:00/40 tag 14
> ncq dma 749568 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 40.298132] ata8.00: status: { DRDY }
> [ 40.298133] ata8.00: failed command: READ FPDMA QUEUED
> [ 40.298134] ata8.00: cmd 60/00:78:30:d2:03/0a:00:00:00:00/40 tag 15
> ncq dma 1310720 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 40.298137] ata8.00: status: { DRDY }
> [ 40.298138] ata8.00: failed command: READ FPDMA QUEUED
> [ 40.298139] ata8.00: cmd 60/00:a0:30:dc:03/0a:00:00:00:00/40 tag 20
> ncq dma 1310720 in
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 40.298142] ata8.00: status: { DRDY }
> [ 40.298143] ata8.00: failed command: READ FPDMA QUEUED
> [ 40.298144] ata8.00: cmd 60/40:a8:30:e6:03/08:00:00:00:00/40 tag 21
> ncq dma 1081344 in
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 40.298147] ata8.00: status: { DRDY }
> [ 40.298148] ata8.00: failed command: READ FPDMA QUEUED
> [ 40.298149] ata8.00: cmd 60/90:b0:70:ee:03/09:00:00:00:00/40 tag 22
> ncq dma 1253376 in
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 40.298151] ata8.00: status: { DRDY }
> [ 40.298154] ata8: hard resetting link
> [ 40.298155] ata8: ahci_hardreset *****
> [ 40.298160] ahci 0000:6f:00.0: ahci_stop_engine start ***
> ( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004d517)
Hello Szuying Chen,
Here we can see that PORT_IRQ_STAT is still set after a hardreset (COMRESET).
This does not seem to be in-line with AHCI 1.3.1.
libata does clear PxIS in ahci_port_init(), but that function seems to only
be called at probe time.
libata does also clear PxIS in ahci_thaw(), but that is after issuing the
COMRESET.
The current libata flow, for your error, is like this:
ahci_error_intr() -> ata_port_freeze()
ata_port_abort() -> EH scheduled for all active QCs.
ata_scsi_error() -> ata_scsi_port_error_handler() -> ahci_error_handler() ->
sata_pmp_error_handler() -> ata_eh_autopsy() -> ata_eh_link_autopsy() ->
ata_eh_analyze_serror()
ata_eh_analyze_serror() will see that SError has SERR_DATA data set,
so it will mark it with:
err_mask |= AC_ERR_ATA_BUS;
action |= ATA_EH_RESET;
sata_pmp_eh_recover() -> ata_eh_recover() -> ata_eh_reset() ->
ahci_hardreset() -> ahci_do_hardreset().
ahci_do_hardreset() does:
hpriv->stop_engine() which clears PxCMD.ST.
then performs a COMRESET (calls sata_link_hardreset())
then calls hpriv->start_engine().
After that ata_eh_reset() (which called ahci_do_hardreset()),
will call ata_eh_thaw_port() which will clear PxIS and enable IRQs.
ata_eh_reset() will then call postreset (ahci_postreset()).
ahci_postreset() will call ata_std_postreset() which will clear SError.
The problem with this is that a lot of this is not in line with
AHCI 1.3.1 - section 6.2.2.2, which states that error recovery
should be performed in the following order:
- Reads PxSACT to see which commands have not yet completed
- Clears PxCMD.ST to ‘0’ to reset the PxCI register, waits for PxCMD.CR to
clear to ‘0’
- Clears any error bits in PxSERR to enable capturing new errors.
- Clears status bits in PxIS as appropriate
- If PxTFD.STS.BSY or PxTFD.STS.DRQ is set to ‘1’, issue a COMRESET to the
device to put it in an idle state.
- Sets PxCMD.ST to ‘1’ to enable issuing new commands
- Issue READ LOG EXT to determine the cause of the error condition if
software did not have to perform a reset (COMRESET or software reset)
as part of the error recovery
To be in-line with the spec, I think the clearing of PxIS (and SError)
should optimally be done in ahci_do_hardreset(), between the clearing
and setting of PxCMD.ST, and before the COMRESET (before calling
sata_link_hardreset()).
Your current patch proposal (which makes things work for you),
clears PxIS after the COMRESET... Which is also not in line with the spec.
Since your non-spec compliant patch makes things work for you, perhaps you
could instead try to put the PxIS clearing in ahci_postreset().
At least then the clearing of SError and PxIS is done at the same time
(even if both are done after setting PxCMD.ST to 1...)
Kind regards,
Niklas
next prev parent reply other threads:[~2023-08-28 19:17 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-24 10:51 Re: [PATCH] ahci: libahci: clear pending interrupt status Szuying Chen
2023-08-28 19:16 ` Niklas Cassel [this message]
-- strict thread matches above, loose matches on Subject: below --
2023-08-14 4:17 Szuying Chen
2023-08-22 8:41 ` Niklas Cassel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZOzyl0r2+ijSx7il@x1-carbon \
--to=niklas.cassel@wdc.com \
--cc=Chloe_Chen@asmedia.com.tw \
--cc=Jesse1_Chang@asmedia.com.tw \
--cc=Richard_Hsu@asmedia.com.tw \
--cc=chensiying21@gmail.com \
--cc=dlemoal@kernel.org \
--cc=linux-ide@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox