linux-ide.vger.kernel.org archive mirror
* NCQ-related READ/WRITE frozen ATA errors with Intel C220 and Intel s3610 SSDs
@ 2015-08-09  4:27 Andy Smith
  2015-08-10 15:35 ` Tim Small
  2015-09-08 21:23 ` Andy Smith
  0 siblings, 2 replies; 7+ messages in thread
From: Andy Smith @ 2015-08-09  4:27 UTC (permalink / raw)
  To: linux-ide

Hi,

I've just put together a system based on a Supermicro X10SDV-F
motherboard, which comes with an Intel C220 SATA controller. I've
got two Intel DC s3610 SSDs plugged into this, and have
intermittently been seeing the following ATA errors:

Jul 23 17:14:41 snaps kernel: [68044.504092] ata2.00: exception Emask 0x0 SAct 0x3000000 SErr 0x0 action 0x6 frozen
Jul 23 17:14:41 snaps kernel: [68044.504215] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 23 17:14:41 snaps kernel: [68044.504291] ata2.00: cmd 61/01:c0:00:a8:75/00:00:66:00:00/40 tag 24 ncq 512 out
Jul 23 17:14:41 snaps kernel: [68044.504291]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 23 17:14:41 snaps kernel: [68044.504357] ata2.00: status: { DRDY }
Jul 23 17:14:41 snaps kernel: [68044.504376] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 23 17:14:41 snaps kernel: [68044.504402] ata2.00: cmd 61/08:c8:d1:b1:b5/00:00:09:00:00/40 tag 25 ncq 4096 out
Jul 23 17:14:41 snaps kernel: [68044.504402]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 23 17:14:41 snaps kernel: [68044.504468] ata2.00: status: { DRDY }
Jul 23 17:14:41 snaps kernel: [68044.504488] ata2: hard resetting link
Jul 23 17:14:42 snaps kernel: [68044.824115] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 23 17:14:42 snaps kernel: [68044.825069] ata2.00: configured for UDMA/133
Jul 23 17:14:42 snaps kernel: [68044.825096] ata2.00: device reported invalid CHS sector 0
Jul 23 17:14:42 snaps kernel: [68044.825123] ata2.00: device reported invalid CHS sector 0
Jul 23 17:14:42 snaps kernel: [68044.825153] ata2: EH complete
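The SAct value and the "cmd" fields in this report can be decoded by hand. Below is a minimal POSIX sh sketch. The function names are hypothetical. Two parts are assumptions about how libata formats FPDMA QUEUED commands: the field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), and the placement of the tag in bits 7:3 of nsect. Both agree with the tags and byte counts printed in the log.

```shell
#!/bin/sh
# Sketch: decode fields from a libata "exception ... frozen" report.

# sact_tags HEX: print, one per line, every NCQ tag whose bit is set in
# the SAct bitmask (bit N set => the command with tag N was still active).
sact_tags() {
    sact=$(( $1 ))
    tag=0
    while [ "$tag" -lt 32 ]; do
        if [ $(( (sact >> tag) & 1 )) -eq 1 ]; then
            echo "$tag"
        fi
        tag=$(( tag + 1 ))
    done
}

# fpdma_decode CMD: decode a "cmd" field such as
# 61/01:c0:00:a8:75/00:00:66:00:00/40. Assumed field order:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# Assumed FPDMA encoding: sector count in hob_feature (high byte) and
# feature (low byte); NCQ tag in nsect bits 7:3.
fpdma_decode() {
    old_ifs=$IFS
    IFS='/:'
    set -- $1            # split into the 12 hex byte fields
    IFS=$old_ifs
    sectors=$(( (0x$7 << 8) | 0x$2 ))
    tag=$(( 0x$3 >> 3 ))
    hi=$(( (0x${11} << 40) | (0x${10} << 32) | (0x$9 << 24) ))
    lba=$(( hi | (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
    # bytes assumes 512-byte logical sectors, as smartctl reports here
    printf 'cmd=0x%s tag=%d sectors=%d bytes=%d lba=0x%x\n' \
        "$1" "$tag" "$sectors" $(( sectors * 512 )) "$lba"
}
```

Applied to the log above, `sact_tags 0x3000000` prints tags 24 and 25. `fpdma_decode` reports tag 24 as 1 sector (512 bytes) and tag 25 as 8 sectors (4096 bytes). These values match the "ncq 512" and "ncq 4096" annotations, so the two outstanding queued writes are the ones that timed out.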

This would happen once every couple of days, and it did not seem related
to any particular level of IO load.

Initially this was restricted to ata2, and when the SSDs were swapped
around the problem followed the drive to ata1, so some time was spent
asking Intel whether this was a faulty drive. However, while that was
taking place, similar errors started to happen on ata2 again, so in fact
both drives are affected and a pair of faulty drives now seems
unlikely.

This machine spent the first few weeks of its life with different
SSDs in it (Crucial M5s) without problems, which suggests to me that
the cabling etc. is all okay too.

I then disabled NCQ. I first booted with libata.force=noncq on the
kernel command line, but dmesg and /sys/block/sd?/device/queue_depth
still appeared to show NCQ enabled, so I also wrote 1 to
/sys/block/sd?/device/queue_depth.
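The runtime half of that workaround can be scripted. Below is a minimal sketch. It assumes root privileges and the /sys/block/sd?/device/queue_depth layout mentioned above. `SYSFS_ROOT` and `set_queue_depth` are hypothetical names introduced here so the loop can be tried against a scratch directory. Writes to sysfs do not persist across reboots.

```shell
#!/bin/sh
# Sketch: set queue_depth on every sd? device and print the value the
# kernel reports back. SYSFS_ROOT is a hypothetical override (default
# /sys) so the loop can be exercised against a scratch directory.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

set_queue_depth() {
    depth=$1
    for f in "$SYSFS_ROOT"/block/sd?/device/queue_depth; do
        [ -e "$f" ] || continue          # glob matched nothing: skip it
        dev=${f#"$SYSFS_ROOT"/block/}    # e.g. sda/device/queue_depth
        dev=${dev%%/*}                   # e.g. sda
        echo "$depth" > "$f" || return 1
        printf '%s queue_depth=%s\n' "$dev" "$(cat "$f")"
    done
}
```

Sourced from a root shell, `set_queue_depth 1` repeats the manual step described above for every sd? disk and echoes the depth the kernel accepted. Reading the value back shows what was written; it does not by itself prove that NCQ is off.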

It's now been 5 days and the problem hasn't manifested itself again.

So, should I be reporting this as a bug in the kernel bugzilla and
if so, would that be against the ahci driver?

The distribution is Debian 8.1, which comes with kernel package
3.16.7-ckt11-1+deb8u2. I also tried kernel 4.0.8-2 from Debian
testing with no improvement. I have not tried an upstream kernel yet.

$ sudo lspci -vvx -s 00:1f.2
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])
        Subsystem: Super Micro Computer Inc Device 086d
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 164
        Region 0: I/O ports at f070 [size=8]
        Region 1: I/O ports at f060 [size=4]
        Region 2: I/O ports at f050 [size=8]
        Region 3: I/O ports at f040 [size=4]
        Region 4: I/O ports at f020 [size=32]
        Region 5: Memory at fb312000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee002b8  Data: 0000
        Capabilities: [70] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
        Kernel driver in use: ahci
00: 86 80 02 8c 07 04 b0 02 05 01 06 01 00 00 00 00
10: 71 f0 00 00 61 f0 00 00 51 f0 00 00 41 f0 00 00
20: 21 f0 00 00 00 20 31 fb 00 00 00 00 d9 15 6d 08
30: 00 00 00 00 80 00 00 00 00 00 00 00 0b 01 00 00
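When reporting this against a driver, the numeric vendor:device pair is usually more useful than the marketing name. You can read it directly from the -x dump above, because PCI configuration space is little-endian:

* bytes 0x00-0x01 hold the vendor ID and 0x02-0x03 the device ID;
* 0x08 holds the revision and 0x09 the programming interface;
* 0x0a-0x0b hold the class code;
* 0x2c-0x2f hold the subsystem vendor and subsystem IDs.

The helper names in this sketch are hypothetical. `lspci -nn` would show the same IDs directly.

```shell
#!/bin/sh
# Sketch: pull IDs out of rows of an `lspci -x` dump. Each row is an
# offset label followed by 16 hex bytes; multi-byte fields are little-endian.

# pci_ids ROW: ROW is the "00:" row (config bytes 0x00-0x0f).
pci_ids() {
    set -- $1    # $1 = "00:", $2 = byte 0x00, ..., ${17} = byte 0x0f
    # vendor = bytes 0x01 0x00, device = 0x03 0x02, class = 0x0b 0x0a,
    # prog-if = 0x09, revision = 0x08
    printf '%s%s:%s%s class %s%s prog-if %s rev %s\n' \
        "$3" "$2" "$5" "$4" "${13}" "${12}" "${11}" "${10}"
}

# pci_subsys ROW: ROW is the "20:" row; subsystem vendor ID is at
# 0x2c-0x2d and subsystem ID at 0x2e-0x2f.
pci_subsys() {
    set -- $1    # $1 = "20:", $2 = byte 0x20, ..., ${17} = byte 0x2f
    printf '%s%s:%s%s\n' "${15}" "${14}" "${17}" "${16}"
}
```

Given the "00:" and "20:" rows above, these print `8086:8c02 class 0106 prog-if 01 rev 05` and `15d9:086d`. This agrees with the decoded lspci header (rev 05, prog-if 01, Super Micro subsystem 086d).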

$ for d in /dev/sd?; do sudo smartctl -i $d; done
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSC2BX016T4
Serial Number:    BTHC511604V41P6PGN
LU WWN Device Id: 5 5cd2e4 04b7b1bfa
Firmware Version: G2010110
User Capacity:    1,600,321,314,816 bytes [1.60 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug  9 04:24:26 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSC2BX016T4
Serial Number:    BTHC511604SD1P6PGN
LU WWN Device Id: 5 5cd2e4 04b7b1ba2
Firmware Version: G2010110
User Capacity:    1,600,321,314,816 bytes [1.60 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug  9 04:24:26 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Cheers,
Andy

* Re: NCQ-related READ/WRITE frozen ATA errors with Intel C220 and Intel s3610 SSDs
@ 2015-08-11 11:28 Andrey Korolyov
  0 siblings, 0 replies; 7+ messages in thread
From: Andrey Korolyov @ 2015-08-11 11:28 UTC (permalink / raw)
  To: linux-ide

Hello,

Sorry for the rude top-post, and please CC me on the main discussion afterwards.

Since yesterday, when I found the same issue with a C602 and an S3710
in my production environment and stood in horror for a couple of
moments, I have made a small number of runs with fio to check whether
the issue is present in certain configurations. Here are the
preliminary results:

- isci in all LTS kernels is affected (if "affected" is the right
  word, given that the hardware counterparts are obviously broken),
- setting the device's qd=1 does *not* fix the issue for me,
- setting noncq for the specific link closes the deal.
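The per-link disable described here can be expressed with libata.force. Its documented syntax is a comma-separated list of [ID:]VAL entries, where ID is PORT[.DEVICE]. Below is a minimal sketch under those assumptions. `noncq_param` and `ncq_state` are hypothetical helpers. The "NCQ (depth N/M)" and "NCQ (not used)" strings the sketch searches for are assumed to match what libata prints in its device-configuration line, and they may vary between kernel versions.

```shell
#!/bin/sh
# Sketch: build a per-device libata.force=...:noncq parameter and check
# what libata reported for a device. Helper names are hypothetical.

# noncq_param DEV...: DEV as printed in the kernel log, e.g. ata2.00.
noncq_param() {
    param=
    for dev in "$@"; do
        id=${dev#ata}                    # ata2.00 -> 2.00 (PORT.DEVICE)
        param="${param:+$param,}$id:noncq"
    done
    echo "libata.force=$param"
}

# ncq_state DEV: read kernel log text on stdin and print the last NCQ
# description logged for DEV, e.g. "NCQ (depth 31/32)" or "NCQ (not used)".
# The exact wording is an assumption and may differ between kernels.
ncq_state() {
    grep -F "$1: " | grep -oE 'NCQ[^,(]*\((depth [0-9/]+|not used)\)' | tail -n 1
}
```

For the two drives in the original report, `noncq_param ata1.00 ata2.00` gives `libata.force=1.00:noncq,2.00:noncq` for the kernel command line (on Debian, for example, via GRUB_CMDLINE_LINUX in /etc/default/grub). After a reboot, `dmesg | ncq_state ata2.00` should show "NCQ (not used)" if the setting took effect. That is a more direct check than reading queue_depth.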

Therefore, I'd like to propose either adding a hook in the SCSI code
that sets horkage_on when queue_depth is set to 1 and removes it again
afterwards, all at runtime, or rewording the libata FAQ [1]. For me
there is a clear difference between actually disabling NCQ for a
link/controller and setting the queue depth to 1: the bus resets still
appear in the second case.

1. https://ata.wiki.kernel.org/index.php/Libata_FAQ


end of thread

Thread overview: 7+ messages
2015-08-09  4:27 NCQ-related READ/WRITE frozen ATA errors with Intel C220 and Intel s3610 SSDs Andy Smith
2015-08-10 15:35 ` Tim Small
2015-08-10 16:37   ` Andy Smith
2015-08-10 17:49     ` Andy Smith
2015-08-11 10:15       ` Tim Small
2015-09-08 21:23 ` Andy Smith
  -- strict thread matches above, loose matches on Subject: below --
2015-08-11 11:28 Andrey Korolyov
