Hi Tim,

Thanks for getting back to me.

On Mon, Aug 10, 2015 at 04:35:03PM +0100, Tim Small wrote:
> Is there any useful info from smartctl -x ?  e.g. reset counters or
> other logged errors - either drive internal errors or cabling/link
> related "Sata Phy Event Counters" etc.?

I do not think so, but I have attached the output from this for each
drive. sdb is where it has happened most recently (four times since
4th August).

> Is attempting to ready SMART info status (e.g. through some daemon e.g.
> munin hddtemp etc.) actually triggering these errors?  FWIW smartctl -x
> seemed to elicit a reset on an Intel S3510 for me about 5 minutes ago.

I hadn't managed to get this machine into service before this
started happening, so none of this sort of disk monitoring is
running. I can't seem to trigger the problem with any of my usual
invocations of smartctl -i, -A or -x.

When this first occurred I ran a full SMART long self-test of what
was at the time sdb (at the moment it's sda), and that passed.

> Do the drives work any more reliably on a good PCIe card (such as an
> asmedia ASM1061 controller, or Sil3132 = both < $15 assuming you are
> able to add one) or do you see the same errors?

Unfortunately this is difficult to test as it's in a 1U enclosure
with only 1 slot at the back of the chassis, which is currently
filled. I'll be looking into what I can do with a riser card as a
last resort if I can find a SATA chipset that is definitely known
to work with the S3610 drives.

Off-list, someone reported that their entire batch of 3710 drives
was suffering this same problem when plugged into an Intel C220
chipset; this was on a platform where they had previously used
other Intel drives, including 3500 and 3700, without issue.

I have a support case open with Intel but they haven't given me any
useful info yet. They actually seem keen to take the drives back
under warranty but it looks like this would result in a replacement
that has the exact same problems. I don't even know if the problem
is in the drive or the SATA chipset or the kernel or what, yet. :(

> Have you checked the Intel docs for errata on either part?

Yep, can't find anything there.

> Disabling NCQ may just be significant, or it may of course just be
> slowing down throughput such that the error is less likely to occur -
> assuming it's timing-related.

Indeed. Disabling NCQ reduces IOPS to around 25% of what I get with
it enabled, so it's not something I can just put up with!

Increasing the queue depth to 8 restores pretty much full
performance, so I will run some tests to see whether the problem
recurs there and at other values. This takes a long time, though,
as the problem only appears rarely for me.
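
For anyone wanting to try the same thing, here is a minimal sketch of
changing the queue depth at runtime. It assumes the usual Linux sysfs
attribute /sys/block/<dev>/device/queue_depth; the function name and
the sysfs_root parameter are only illustrative (the latter is there so
it can be pointed at a scratch directory). A depth of 1 effectively
disables NCQ, and the kernel may clamp whatever value is written.

```python
from pathlib import Path


def set_queue_depth(dev: str, depth: int, sysfs_root: str = "/sys") -> int:
    """Write `depth` to the device's queue_depth attribute.

    Returns the value read back afterwards, which may differ from the
    requested one if the kernel clamps it. Needs root on a real system.
    """
    # SATA NCQ supports at most 32 outstanding commands.
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")

    # Assumed layout: <sysfs_root>/block/<dev>/device/queue_depth
    attr = Path(sysfs_root) / "block" / dev / "device" / "queue_depth"
    attr.write_text(f"{depth}\n")
    return int(attr.read_text().strip())
```

Calling set_queue_depth("sdb", 8) as root would correspond to the
depth-8 test above; other values can be tried the same way.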

Cheers,
Andy