From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andy Smith
Subject: NCQ-related READ/WRITE frozen ATA errors with Intel C220 and Intel s3610 SSDs
Date: Sun, 9 Aug 2015 04:27:56 +0000
Message-ID: <20150809042756.GX4243@bitfolk.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Received: from bitfolk.com ([85.119.80.223]:48979 "EHLO mail.bitfolk.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751051AbbHIEpq (ORCPT ); Sun, 9 Aug 2015 00:45:46 -0400
Received: from andy by mail.bitfolk.com with local (Exim 4.72) (envelope-from ) id 1ZOIDE-0000eH-V7 for linux-ide@vger.kernel.org; Sun, 09 Aug 2015 04:27:57 +0000
Content-Disposition: inline
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: linux-ide@vger.kernel.org

Hi,

I've just put together a system based on a Supermicro X10SDV-F
motherboard, which comes with an Intel C220 SATA controller. I've
got two Intel DC s3610 SSDs plugged into this, and have
intermittently been seeing the following ATA errors:

Jul 23 17:14:41 snaps kernel: [68044.504092] ata2.00: exception Emask 0x0 SAct 0x3000000 SErr 0x0 action 0x6 frozen
Jul 23 17:14:41 snaps kernel: [68044.504215] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 23 17:14:41 snaps kernel: [68044.504291] ata2.00: cmd 61/01:c0:00:a8:75/00:00:66:00:00/40 tag 24 ncq 512 out
Jul 23 17:14:41 snaps kernel: [68044.504291]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 23 17:14:41 snaps kernel: [68044.504357] ata2.00: status: { DRDY }
Jul 23 17:14:41 snaps kernel: [68044.504376] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 23 17:14:41 snaps kernel: [68044.504402] ata2.00: cmd 61/08:c8:d1:b1:b5/00:00:09:00:00/40 tag 25 ncq 4096 out
Jul 23 17:14:41 snaps kernel: [68044.504402]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 23 17:14:41 snaps kernel: [68044.504468] ata2.00: status: { DRDY }
Jul 23 17:14:41 snaps kernel: [68044.504488] ata2: hard resetting link
Jul 23 17:14:42 snaps kernel: [68044.824115] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 23 17:14:42 snaps kernel: [68044.825069] ata2.00: configured for UDMA/133
Jul 23 17:14:42 snaps kernel: [68044.825096] ata2.00: device reported invalid CHS sector 0
Jul 23 17:14:42 snaps kernel: [68044.825123] ata2.00: device reported invalid CHS sector 0
Jul 23 17:14:42 snaps kernel: [68044.825153] ata2: EH complete

This would happen once every couple of days, not seemingly related
to any particular level of IO load.

Initially this was restricted to ata2, and after the SSDs were
swapped around the problem then followed the drive to ata1, so some
time was wasted asking Intel if this was a faulty drive. However
while that was taking place similar errors started to happen with
ata2 again, so in fact both drives are affected and the possibility
of it being a pair of faulty drives now seems unlikely.

This machine spent the first few weeks of its life with different
SSDs in it (Crucial M5s) without problem so that suggests to me that
the cabling etc is all okay too.

I then disabled NCQ. After first booting with kernel command line
libata.force=noncq yet still observing what looked to be NCQ still
enabled in dmesg and /sys/block/sd?/device/queue_depth, I also wrote
1 to /sys/block/sd?/device/queue_depth.

It's now been 5 days and the problem hasn't manifested itself again.

So, should I be reporting this as a bug in the kernel bugzilla and
if so, would that be against the ahci driver?

Distribution is Debian 8.1 so that comes with kernel package
3.16.7-ckt11-1+deb8u2. I also tried kernel 4.0.8-2 from Debian
testing with no improvement. I did not try an upstream kernel yet.

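For anyone wanting to check the same two things on their own system, here is a minimal Python sketch, not a definitive tool. It assumes that SAct in the libata exception line is the bitmask of NCQ tags outstanding when the error hit, and that queue_depth is exposed under /sys/block/<disk>/device/ as described above. The function names active_ncq_tags and limit_queue_depth, the sysfs_root parameter and the starting depth of 31 in the dry run are made up for illustration.

```python
import os
import tempfile


def active_ncq_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


def limit_queue_depth(disk, depth=1, sysfs_root="/sys/block"):
    """Write depth to <sysfs_root>/<disk>/device/queue_depth.

    Returns (old, new) as read back from the file. sysfs_root is a
    hypothetical parameter so the function can be tried on a scratch
    directory instead of the real sysfs tree.
    """
    path = os.path.join(sysfs_root, disk, "device", "queue_depth")
    with open(path) as f:
        old = int(f.read().strip())
    with open(path, "w") as f:
        f.write(f"{depth}\n")
    with open(path) as f:
        new = int(f.read().strip())
    return old, new


if __name__ == "__main__":
    # SAct from the exception line in the log: bits 24 and 25 are set,
    # which lines up with "tag 24" and "tag 25" on the two cmd lines.
    print(active_ncq_tags(0x3000000))

    # Dry run against a throwaway directory that mimics the sysfs layout;
    # the initial value 31 is only an example.
    with tempfile.TemporaryDirectory() as root:
        dev = os.path.join(root, "sda", "device")
        os.makedirs(dev)
        with open(os.path.join(dev, "queue_depth"), "w") as f:
            f.write("31\n")
        print(limit_queue_depth("sda", 1, sysfs_root=root))
```

Pointed at the real /sys/block this needs root, and the value written does not survive a reboot, so it would have to be reapplied at boot for as long as the workaround is needed.
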
$ sudo lspci -vvx -s 00:1f.2
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])
	Subsystem: Super Micro Computer Inc Device 086d
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR-