Hi Tim,

Thanks for getting back to me.

On Mon, Aug 10, 2015 at 04:35:03PM +0100, Tim Small wrote:
> Is there any useful info from smartctl -x ? e.g. reset counters or
> other logged errors - either drive internal errors or cabling/link
> related "Sata Phy Event Counters" etc.?

I do not think so, but I have attached the output from this for each
drive. sdb is where it has happened most recently (four times since
4th August).

> Is attempting to ready SMART info status (e.g. through some daemon e.g.
> munin hddtemp etc.) actually triggering these errors? FWIW smartctl -x
> seemed to elicit a reset on an Intel S3510 for me about 5 minutes ago.

I hadn't managed to get this machine into service before this started
happening, so none of this sort of disk monitoring is running. I can't
seem to trigger the problem with any of my usual invocations of
smartctl -i, -A or -x.

When this first occurred I ran a full SMART long self-test of what was
at the time sdb (at the moment it's sda), and that passed.

> Do the drives work any more reliably on a good PCIe card (such as an
> asmedia ASM1061 controller, or Sil3132 = both < $15 assuming you are
> able to add one) or do you see the same errors?

Unfortunately this is difficult to test as it's in a 1U enclosure with
only 1 slot at the back of the chassis, which is currently filled.
I'll be looking into what I can do with a riser card as a last resort
if I can find a SATA chipset that is definitely known to work with
the S3610 drives.

Offlist I had someone report that their entire batch of 3710 drives
was suffering this same problem when plugged into an Intel C220
chipset; this being on a platform where they previously had other
Intel drives including 3500 and 3700 without issue.

I have a support case open with Intel but they haven't given me any
useful info yet. They actually seem keen to take the drives back under
warranty but it looks like this would result in a replacement that has
the exact same problems.
I don't even know yet if the problem is in the drive or the SATA
chipset or the kernel or what. :(

> Have you checked the Intel docs for errata on either part?

Yep, can't find anything there.

> Disabling NCQ may just be significant, or it may of course just be
> slowing down throughput such that the error is less likely to occur -
> assuming it's timing-related.

Indeed. Disabling NCQ reduces the IOPS performance to around 25% of
what it is with NCQ enabled, so it's not something I can just put up
with! Increasing the queue depth to 8 restores pretty much full
performance so I will do some tests to see if this problem re-occurs
there and at other values, but this takes a long time as the problem
only appears rarely for me.

Cheers,
Andy
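
P.S. For anyone wanting to try the same queue depth experiment, here is
a minimal Python sketch. It assumes a Linux system that exposes the
per-device attribute /sys/block/<dev>/device/queue_depth, where a depth
of 1 effectively disables NCQ. The function names and the sysfs_root
parameter are made up for illustration, and "sdb" below is only an
example device name.

```python
from pathlib import Path

# NCQ allows at most 32 outstanding commands; the kernel may cap it lower.
MAX_NCQ_DEPTH = 32


def queue_depth_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Return the sysfs file holding the queue depth for e.g. 'sdb'."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def get_queue_depth(device: str, sysfs_root: str = "/sys") -> int:
    """Read the current queue depth of the device."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device: str, depth: int, sysfs_root: str = "/sys") -> int:
    """Write a new queue depth and return the value read back afterwards.

    The value is re-read because the kernel may clamp what was requested.
    Writing to the real sysfs file requires root.
    """
    if not 1 <= depth <= MAX_NCQ_DEPTH:
        raise ValueError(f"queue depth must be 1..{MAX_NCQ_DEPTH}, got {depth}")
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
    return get_queue_depth(device, sysfs_root)
```

Run as root, set_queue_depth("sdb", 1) would correspond to the "NCQ
disabled" case and set_queue_depth("sdb", 8) to the depth of 8 mentioned
above. The setting does not persist across reboots.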