* Why we were seeing so many spurious NCQ completions
@ 2007-12-07 2:18 Tejun Heo
2007-12-07 2:25 ` Jeff Garzik
0 siblings, 1 reply; 2+ messages in thread
From: Tejun Heo @ 2007-12-07 2:18 UTC (permalink / raw)
To: Jeff Garzik, IDE/ATA development list, diego torres,
Michael Tokarev
[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]
Hello, all.
This has been going on for quite some time now but I finally succeeded
to reproduce the problem and find out what has been going on. It
wasn't drive's or controller's fault. The spurious completion
detection logic was wrong which makes all of this my fault. :-)
The attached patch induces NCQ spurious completions by inserting
artificial delays during irq handling. The following is log with the
patch applied.
A [ 1125.478813] ata35: MON issue=0x0 SAct=0x1 sactive=0x3 SDB FIS=004040a1:00000002
B [ 1125.480248] ata35: MON issue=0x4 SAct=0x6 sactive=0x7 SDB FIS=004040a1:00000001
C [ 1125.481614] ata35: MON issue=0x0 SAct=0x5 sactive=0x7 SDB FIS=004040a1:00000002
D [ 1125.481704] ata35: YYY 0x2 -> 0x4
E [ 1125.481722] ata35: XXX issue=0x0 SAct=0x1 sactive=0x1 SDB FIS=004040a1:00000004
F [ 1125.483087] ata35: MON issue=0x0 SAct=0x0 sactive=0x1 SDB FIS=004040a1:00000001
G [ 1125.484297] ata35: MON issue=0x4 SAct=0x6 sactive=0x7 SDB FIS=004040a1:00000001
MON lines are printed on each SDB FIS while YYY line indicates that
SDB FIS RX area has changed during the artificial delay. XXX line
notes condition which triggers spurious NCQ completion - invoking EH
is disabled for debugging.
Here's what happens.
1. On A, NCQ command 0 and 1 are in flight - command 0 is still being
transmitted to the device. The first SDB FIS indicates completion
of command 1.
2. Between A and B, the driver issues NCQ commands 1 and 2. 0 is
still in flight.
3. On B, command 0 completes and drive sends completion for it.
4. Between B and C, the driver issues NCQ command 0.
5. On C, command 1 completes and drive sends completion for it.
6. On D, then the drive completes command 2 and sends completion for
it. This makes SDB FIS RX area updated and as the driver is still
in IRQ handler, sets IRQ pending bit again. Note that YYY line is
printed *before* actually completing commands. So, after printing
YYY line, the driver completes both commands 1 and 2.
7. On E, the IRQ handler is invoked again because of the IRQ pending
status set from #6. However, completions contained in the SDB FIS
which triggered this IRQ handler invocation is already processed.
ie. Completion for command 2 is already processed in the previous
IRQ handler invocation, so this time IRQ handler has nothing to do
but SDB FIS RX area shows that this IRQ is for SDB FIS which
includes completion for command 2, which triggers spurious NCQ
completion condition.
8. It goes on.
So, trying to detect spurious completions using IRQ and RX FIS area
turns out to be stupid as they aren't interlocked. I'll soon post a
patch to remove spurious completion check and blacklist resulted from
it.
Thanks a lot.
--
tejun
[-- Attachment #2: spurious-completion-debug.patch --]
[-- Type: text/x-patch, Size: 2174 bytes --]
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 4688dbf..9f9a658 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1664,6 +1664,10 @@ static void ahci_port_intr(struct ata_port *ap)
}
if (status & PORT_IRQ_SDB_FIS) {
+ const __le32 *f = pp->rx_fis + RX_FIS_SDB;
+ u32 t = le32_to_cpu(f[1]);
+ int i;
+
/* If SNotification is available, leave notification
* handling to sata_async_notification(). If not,
* emulate it by snooping SDB FIS RX area.
@@ -1686,6 +1690,32 @@ static void ahci_port_intr(struct ata_port *ap)
if (f0 & (1 << 15))
sata_async_notification(ap);
}
+
+ if (le32_to_cpu(f[1]) & ~pp->active_link->sactive)
+ ata_link_printk(pp->active_link, KERN_INFO,
+ "XXX issue=0x%x SAct=0x%x sactive=0x%x SDB FIS=%08x:%08x\n",
+ readl(port_mmio + PORT_CMD_ISSUE),
+ readl(port_mmio + PORT_SCR_ACT),
+ pp->active_link->sactive,
+ le32_to_cpu(f[0]), le32_to_cpu(f[1]));
+ else
+ ata_link_printk(pp->active_link, KERN_INFO,
+ "MON issue=0x%x SAct=0x%x sactive=0x%x SDB FIS=%08x:%08x\n",
+ readl(port_mmio + PORT_CMD_ISSUE),
+ readl(port_mmio + PORT_SCR_ACT),
+ pp->active_link->sactive,
+ le32_to_cpu(f[0]), le32_to_cpu(f[1]));
+
+ for (i = 0; i < 100; i++) {
+ if (t != le32_to_cpu(f[1])) {
+ ata_link_printk(pp->active_link, KERN_INFO,
+ "YYY 0x%x -> 0x%x\n",
+ t, le32_to_cpu(f[1]));
+ break;
+ }
+ udelay(1);
+ cpu_relax();
+ }
}
/* pp->active_link is valid iff any command is in flight */
@@ -1746,7 +1776,7 @@ static void ahci_port_intr(struct ata_port *ap)
* with HSM violation. EH will turn off NCQ
* after several such failures.
*/
- ata_ehi_push_desc(ehi,
+ /* ata_ehi_push_desc(ehi,
"spurious completions during NCQ "
"issue=0x%x SAct=0x%x FIS=%08x:%08x",
readl(port_mmio + PORT_CMD_ISSUE),
@@ -1754,7 +1784,7 @@ static void ahci_port_intr(struct ata_port *ap)
le32_to_cpu(f[0]), le32_to_cpu(f[1]));
ehi->err_mask |= AC_ERR_HSM;
ehi->action |= ATA_EH_SOFTRESET;
- ata_port_freeze(ap);
+ ata_port_freeze(ap);*/
} else {
if (!pp->ncq_saw_sdb)
ata_port_printk(ap, KERN_INFO,
^ permalink raw reply related [flat|nested] 2+ messages in thread* Re: Why we were seeing so many spurious NCQ completions
2007-12-07 2:18 Why we were seeing so many spurious NCQ completions Tejun Heo
@ 2007-12-07 2:25 ` Jeff Garzik
0 siblings, 0 replies; 2+ messages in thread
From: Jeff Garzik @ 2007-12-07 2:25 UTC (permalink / raw)
To: Tejun Heo; +Cc: IDE/ATA development list, diego torres, Michael Tokarev
Tejun Heo wrote:
> Hello, all.
>
> This has been going on for quite some time now but I finally succeeded
> to reproduce the problem and find out what has been going on. It
> wasn't drive's or controller's fault. The spurious completion
> detection logic was wrong which makes all of this my fault. :-)
>
> The attached patch induces NCQ spurious completions by inserting
> artificial delays during irq handling. The following is log with the
> patch applied.
>
> A [ 1125.478813] ata35: MON issue=0x0 SAct=0x1 sactive=0x3 SDB FIS=004040a1:00000002
> B [ 1125.480248] ata35: MON issue=0x4 SAct=0x6 sactive=0x7 SDB FIS=004040a1:00000001
> C [ 1125.481614] ata35: MON issue=0x0 SAct=0x5 sactive=0x7 SDB FIS=004040a1:00000002
> D [ 1125.481704] ata35: YYY 0x2 -> 0x4
> E [ 1125.481722] ata35: XXX issue=0x0 SAct=0x1 sactive=0x1 SDB FIS=004040a1:00000004
> F [ 1125.483087] ata35: MON issue=0x0 SAct=0x0 sactive=0x1 SDB FIS=004040a1:00000001
> G [ 1125.484297] ata35: MON issue=0x4 SAct=0x6 sactive=0x7 SDB FIS=004040a1:00000001
Thanks a lot for tracking this down, and thanks even more for being
humble enough to admit mistakes. More kernel hackers should follow your
example.
I continue to be a proud mentor, watching you kick ass on the Linux
kernel scene :)
Jeff
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2007-12-07 2:25 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-07 2:18 Why we were seeing so many spurious NCQ completions Tejun Heo
2007-12-07 2:25 ` Jeff Garzik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).