linux-raid.vger.kernel.org archive mirror
* possible HighPoint RocketRAID 2720SGL failure
@ 2017-09-21 11:12 Eyal Lebedinsky
  2017-09-21 12:20 ` Roman Mamedov
  0 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2017-09-21 11:12 UTC (permalink / raw)
  To: list linux-raid

I wonder if anyone has had this happen or is familiar with such a situation.

I have a HighPoint RocketRAID 2720SGL controller managing 7 disks in a software RAID6.
It had been running smoothly for about 4 years. Recently, at one point, all the
disks disappeared.

Looking at the logs I could see that the disks completely stopped responding;
3 minutes later they all reported read failures and the raid dropped to 0 of 7
devices up.

The disks do not have proper error handling, so they are set to a 180s timeout
(at boot time). I think this accounts for the 3-minute delay between the disks
going silent and the read errors being logged.
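
For reference, the timeout is raised at boot with something along these lines
(the device list matches the current names and is illustrative):

	# raise the kernel command timeout for drives lacking firmware
	# error recovery control
	for t in /sys/block/sd[c-i]/device/timeout; do
		echo 180 > "$t"
	done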

It looks like the controller failed, as all 7 disks disappeared together and
did not respond to any I/O, or even to SMART queries.

After a power cycle things looked OK. The raid6 did a very short recovery, then
the ext4 fs replayed its journal quickly. fsck found no problems.

I later started a raid 'check' but it failed in less than an hour (of an
expected 10) in the same way.
A day later I tried again and it failed within 15 minutes.
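
For the record, the 'check' is the normal md scrub, kicked off along these
lines, with progress visible in /proc/mdstat:

	$ echo check | sudo tee /sys/block/md127/md/sync_action
	$ cat /proc/mdstat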

So far it looks like nothing was lost but I am uncomfortable with this situation.
No surprise here...

The controller did not log any errors.

Does this look familiar to anyone?

TIA

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-21 11:12 possible HighPoint RocketRAID 2720SGL failure Eyal Lebedinsky
@ 2017-09-21 12:20 ` Roman Mamedov
  2017-09-22  1:08   ` Roger Heflin
  0 siblings, 1 reply; 9+ messages in thread
From: Roman Mamedov @ 2017-09-21 12:20 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: list linux-raid

On Thu, 21 Sep 2017 21:12:36 +1000
Eyal Lebedinsky <eyal@eyal.emu.id.au> wrote:

> It looks like the controller failed, as all 7 disks disappeared together
> and did not respond to any I/O, or even to SMART queries.
> 
> After a power cycle things looked OK. The raid6 did a very short recovery,
> then the ext4 fs replayed its journal quickly. fsck found no problems.
> 
> I later started a raid 'check' but it failed in less than an hour (of an
> expected 10) in the same way.
> A day later I tried again and it failed within 15 minutes.
> 
> So far it looks like nothing was lost but I am uncomfortable with this situation.
> No surprise here...
> 
> The controller did not log any errors.
> 
> Does this look familiar to anyone?

The controller is based on the Marvell 9485 chip and Marvell SATA/RAID
controllers seem to have a bad reputation for reliability:

https://www.jethrocarr.com/2013/11/24/adventures-in-io-hell/
https://www.youtube.com/watch?v=010urq9wY3A

I have also seen CRC errors and disk drop-outs/reconnects on 9123 cards, and
in one case all disks (or possibly the controller itself) disappearing from
the system until reboot on an 88SX7042-based controller.

-- 
With respect,
Roman


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-21 12:20 ` Roman Mamedov
@ 2017-09-22  1:08   ` Roger Heflin
  2017-09-22  8:27     ` Eyal Lebedinsky
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Heflin @ 2017-09-22  1:08 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Eyal Lebedinsky, list linux-raid

If it is the Marvell issue I had before, then quit doing smartctl
commands (disable all SMART queries of any sort), as that seemed to
massively increase the reliability.  It did not completely fix the
issues, but it made them happen a lot less often.
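
Concretely, stopping the smartd daemon (plus any cron jobs that run smartctl)
covers the usual sources of SMART polling, e.g.:

	# stop and disable smartmontools' background poller
	systemctl stop smartd
	systemctl disable smartd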

Good luck; I finally just gave up and quit using Marvell controllers.

On Thu, Sep 21, 2017 at 7:20 AM, Roman Mamedov <rm@romanrm.net> wrote:
> On Thu, 21 Sep 2017 21:12:36 +1000
> Eyal Lebedinsky <eyal@eyal.emu.id.au> wrote:
>
>> It looks like the controller failed, as all 7 disks disappeared together
>> and did not respond to any I/O, or even to SMART queries.
>>
>> After a power cycle things looked OK. The raid6 did a very short recovery,
>> then the ext4 fs replayed its journal quickly. fsck found no problems.
>>
>> I later started a raid 'check' but it failed in less than an hour (of an
>> expected 10) in the same way.
>> A day later I tried again and it failed within 15 minutes.
>>
>> So far it looks like nothing was lost but I am uncomfortable with this situation.
>> No surprise here...
>>
>> The controller did not log any errors.
>>
>> Does this look familiar to anyone?
>
> The controller is based on the Marvell 9485 chip and Marvell SATA/RAID
> controllers seem to have a bad reputation for reliability:
>
> https://www.jethrocarr.com/2013/11/24/adventures-in-io-hell/
> https://www.youtube.com/watch?v=010urq9wY3A
>
> I have also seen CRC errors and disk drop-outs/reconnects on 9123 cards, and
> in one case all disks (or possibly the controller itself) disappearing from
> the system until reboot on an 88SX7042-based controller.
>
> --
> With respect,
> Roman


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-22  1:08   ` Roger Heflin
@ 2017-09-22  8:27     ` Eyal Lebedinsky
  2017-09-22 13:20       ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2017-09-22  8:27 UTC (permalink / raw)
  To: list linux-raid

On 22/09/17 11:08, Roger Heflin wrote:
> If it is the Marvell issue I had before, then quit doing smartctl
> commands (disable all SMART queries of any sort), as that seemed to
> massively increase the reliability.  It did not completely fix the
> issues, but it made them happen a lot less often.

I have now tried this and it did not help. I stopped all SMART access, but
running an md 'check' fails as before (all the disks disappear). Each attempt
fails at a different address. This time the machine was mostly idle while the
check was running.

I should note that this HighPoint card had been running without any problems
for 4 years.

Maybe a driver issue? I was running Fedora 19 until a few weeks ago, and the
failures all started after I upgraded to f22 (I am now on f26).
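
One way to at least confirm which kernel driver has the card, in case that
changed across the Fedora upgrades:

	$ lspci -k | grep -iA3 marvell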

Eyal

> Good luck; I finally just gave up and quit using Marvell controllers.
> 
> On Thu, Sep 21, 2017 at 7:20 AM, Roman Mamedov <rm@romanrm.net> wrote:
>> On Thu, 21 Sep 2017 21:12:36 +1000
>> Eyal Lebedinsky <eyal@eyal.emu.id.au> wrote:
>>
>>> It looks like the controller failed, as all 7 disks disappeared together
>>> and did not respond to any I/O, or even to SMART queries.
>>>
>>> After a power cycle things looked OK. The raid6 did a very short recovery,
>>> then the ext4 fs replayed its journal quickly. fsck found no problems.
>>>
>>> I later started a raid 'check' but it failed in less than an hour (of an
>>> expected 10) in the same way.
>>> A day later I tried again and it failed within 15 minutes.
>>>
>>> So far it looks like nothing was lost but I am uncomfortable with this situation.
>>> No surprise here...
>>>
>>> The controller did not log any errors.
>>>
>>> Does this look familiar to anyone?
>>
>> The controller is based on the Marvell 9485 chip and Marvell SATA/RAID
>> controllers seem to have a bad reputation for reliability:
>>
>> https://www.jethrocarr.com/2013/11/24/adventures-in-io-hell/
>> https://www.youtube.com/watch?v=010urq9wY3A
>>
>> I have also seen CRC errors and disk drop-outs/reconnects on 9123 cards, and
>> in one case all disks (or possibly the controller itself) disappearing from
>> the system until reboot on an 88SX7042-based controller.
>>
>> --
>> With respect,
>> Roman

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-22  8:27     ` Eyal Lebedinsky
@ 2017-09-22 13:20       ` Phil Turmel
  2017-09-22 22:52         ` Eyal Lebedinsky
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2017-09-22 13:20 UTC (permalink / raw)
  To: Eyal Lebedinsky, list linux-raid

On 09/22/2017 04:27 AM, Eyal Lebedinsky wrote:
> On 22/09/17 11:08, Roger Heflin wrote:
>> If it is the Marvell issue I had before, then quit doing smartctl
>> commands (disable all SMART queries of any sort), as that seemed to
>> massively increase the reliability.  It did not completely fix the
>> issues, but it made them happen a lot less often.
> 
> I have now tried this and it did not help. I stopped all SMART access,
> but running an md 'check' fails as before (all the disks disappear).
> Each attempt fails at a different address. This time the machine was
> mostly idle while the check was running.
> 
> I should note that this HighPoint card had been running without any
> problems for 4 years.
> 
> Maybe a driver issue? I was running Fedora 19 until a few weeks ago,
> and the failures all started after I upgraded to f22 (I am now on f26).

Your issue sounds like an overheating controller chip.  Four years of
dust accumulation and/or fan bearing wear.  It's failing when you load
it down with a scrub.  Replace the controller card.

Phil


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-22 13:20       ` Phil Turmel
@ 2017-09-22 22:52         ` Eyal Lebedinsky
  2017-09-23  9:29           ` Eyal Lebedinsky
  0 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2017-09-22 22:52 UTC (permalink / raw)
  To: Phil Turmel, list linux-raid

On 22/09/17 23:20, Phil Turmel wrote:
> On 09/22/2017 04:27 AM, Eyal Lebedinsky wrote:
>> On 22/09/17 11:08, Roger Heflin wrote:
>>> If it is the Marvell issue I had before, then quit doing smartctl
>>> commands (disable all SMART queries of any sort), as that seemed to
>>> massively increase the reliability.  It did not completely fix the
>>> issues, but it made them happen a lot less often.
>>
>> I have now tried this and it did not help. I stopped all SMART access,
>> but running an md 'check' fails as before (all the disks disappear).
>> Each attempt fails at a different address. This time the machine was
>> mostly idle while the check was running.
>>
>> I should note that this HighPoint card had been running without any
>> problems for 4 years.
>>
>> Maybe a driver issue? I was running Fedora 19 until a few weeks ago,
>> and the failures all started after I upgraded to f22 (I am now on f26).
> 
> Your issue sounds like an overheating controller chip.  Four years of
> dust accumulation and/or fan bearing wear.  It's failing when you load
> it down with a scrub.  Replace the controller card.

Thanks Phil,

Interesting. I checked the card and it still looks "as new", no dust.
It does not have a fan, and the heatsink is glued firmly to the processor.
It does not look like there is anything I can do here.

I added a fan blowing external air directly at the card and will test again
later. I really need to understand the source of the problem. If this fixes
it then I will get a replacement controller (the fan is just a hack).

> Phil

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-22 22:52         ` Eyal Lebedinsky
@ 2017-09-23  9:29           ` Eyal Lebedinsky
  2017-09-23 14:49             ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2017-09-23  9:29 UTC (permalink / raw)
  To: Phil Turmel, list linux-raid

On 23/09/17 08:52, Eyal Lebedinsky wrote:
> On 22/09/17 23:20, Phil Turmel wrote:
>> On 09/22/2017 04:27 AM, Eyal Lebedinsky wrote:
>>> On 22/09/17 11:08, Roger Heflin wrote:
>>>> If it is the Marvell issue I had before, then quit doing smartctl
>>>> commands (disable all SMART queries of any sort), as that seemed to
>>>> massively increase the reliability.  It did not completely fix the
>>>> issues, but it made them happen a lot less often.
>>>
>>> I have now tried this and it did not help. I stopped all SMART access,
>>> but running an md 'check' fails as before (all the disks disappear).
>>> Each attempt fails at a different address. This time the machine was
>>> mostly idle while the check was running.
>>>
>>> I should note that this HighPoint card had been running without any
>>> problems for 4 years.
>>>
>>> Maybe a driver issue? I was running Fedora 19 until a few weeks ago,
>>> and the failures all started after I upgraded to f22 (I am now on f26).
>>
>> Your issue sounds like an overheating controller chip.  Four years of
>> dust accumulation and/or fan bearing wear.  It's failing when you load
>> it down with a scrub.  Replace the controller card.
> 
> Thanks Phil,
> 
> Interesting. I checked the card and it still looks "as new", no dust.
> It does not have a fan, and the heatsink is glued firmly to the processor.
> It does not look like there is anything I can do here.
> 
> I added a fan blowing external air directly at the card and will test again
> later. I really need to understand the source of the problem. If this fixes
> it then I will get a replacement controller (the fan is just a hack).
> 
>> Phil

The 'check' completed, something it was not able to do before, so it does
seem to have been an overheating problem. Now that I knew what to look for,
I searched for the relevant terms and found some notes suggesting this card
is known to have such an issue.

But why now? I can only guess that the heatsink tape (the only thing
connecting the processor to the heatsink) has aged to the point of poor
thermal performance. I am not sure that I can fix it, or even remove the
heatsink safely.

[OT: the following documents how I handled the 'check' results]

Moving on, I got one report during the 'check':
	kernel: md127: mismatch sector in range 770366344-770366352
The 'check' ended with 'mismatch_cnt=8', which is not that bad (I have had much worse).
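
The count comes from sysfs once the check finishes:

	$ cat /sys/block/md127/md/mismatch_cnt
	8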

I decided to give raid6check a try (I run raid6) and needed to convert the
reported sector range to the required argument.

The array is:

md127 : active raid6 sdi1[8] sdg1[9] sdh1[7] sdf1[10] sde1[14] sdd1[12] sdc1[13]
       19534425600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
       bitmap: 0/30 pages [0KB], 65536KB chunk

So, converting sectors to 512KiB chunks (the reported 8-sector range rounds up to 1 chunk):
	$ sudo raid6check /dev/md127 $((770366344/2/512)) 1
which reported
	Error detected at stripe 752310, page 113: possible failed disk slot 4: 6 --> /dev/sdi1
This looks reasonable.
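
Checking the arithmetic: a 512KiB chunk is 1024 512-byte sectors, and
770366344 / 1024 = 752310 with integer division, matching the stripe number
raid6check reported.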

I also have a script that finds which files reside in the bad area, and it found
one large MythTV recording, so not a big deal.
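
For the curious, one way to do such a lookup by hand - a rough sketch,
assuming 4KiB fs blocks and the ext4 fs starting at the beginning of
/dev/md127; the <inode> placeholder comes from the icheck output:

	$ echo $((770366344 / 8))	# 512-byte sectors -> 4KiB fs blocks
	96295793
	$ sudo debugfs -R "icheck 96295793" /dev/md127	# block -> inode
	$ sudo debugfs -R "ncheck <inode>" /dev/md127	# inode -> path(s)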

I copied the file sideways and ran raid6check in automatic repair mode. Now cmp
tells me that the repaired file differs from the copy in 4 bytes, which is expected.
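
For completeness, the repair invocation was along these lines (raid6check's
usage text has the exact syntax):

	$ sudo raid6check /dev/md127 752310 1 autorepair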

Thanks, I feel much better now.

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-23  9:29           ` Eyal Lebedinsky
@ 2017-09-23 14:49             ` Phil Turmel
  2017-09-23 16:53               ` Wols Lists
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2017-09-23 14:49 UTC (permalink / raw)
  To: Eyal Lebedinsky, list linux-raid

On 09/23/2017 05:29 AM, Eyal Lebedinsky wrote:
> On 23/09/17 08:52, Eyal Lebedinsky wrote:

>> Thanks Phil,
>>
>> Interesting. I checked the card and it still looks "as new", no dust.
>> It does not have a fan, and the heatsink is glued firmly to the processor.
>> It does not look like there is anything I can do here.
>>
>> I added a fan blowing external air directly at the card and will test again
>> later. I really need to understand the source of the problem. If this fixes
>> it then I will get a replacement controller (the fan is just a hack).
>>
>>> Phil
> 
> The 'check' completed, something it was not able to do before, so it does
> seem to have been an overheating problem. Now that I knew what to look for,
> I searched for the relevant terms and found some notes suggesting this card
> is known to have such an issue.
> 
> But why now? I can only guess that the heatsink tape (the only thing
> connecting the processor to the heatsink) has aged to the point of poor
> thermal performance. I am not sure that I can fix it, or even remove the
> heatsink safely.

Chips get old, primarily due to thermal cycling producing expansion
cracking.  Silicon and the various deposited metals expand and contract
with different coefficients, so temperature changes stress the mating
points between semiconductors and interconnect.  (That sounds like your
problem, as the extra cooling helps.)

If thermal stress doesn't kill them, then dopant diffusion within the
semiconductors eventually will.  This occurs even when turned off, but
proceeds much faster at elevated temperatures.  Dopant concentrations
are engineered to not suffer from diffusion effects well beyond the
expected life of the components (at rated operating temperatures), but
that doesn't mean there aren't marginal production runs, which can be
hard to discover before parts start failing years later.  If that
happens after the warranty period, the manufacturer has dodged a big
bullet, with a bit of tarnish on their name, of course.

> [OT: the following documents how I handled the 'check' results]

> Thanks, I feel much better now.

I like happy endings.  (-:

But it's too bad you have to replace it.

Phil


* Re: possible HighPoint RocketRAID 2720SGL failure
  2017-09-23 14:49             ` Phil Turmel
@ 2017-09-23 16:53               ` Wols Lists
  0 siblings, 0 replies; 9+ messages in thread
From: Wols Lists @ 2017-09-23 16:53 UTC (permalink / raw)
  To: Phil Turmel, Eyal Lebedinsky, list linux-raid

On 23/09/17 15:49, Phil Turmel wrote:
> If thermal stress doesn't kill them, then dopant diffusion within the
> semiconductors eventually will.  This occurs even when turned off, but
> proceeds much faster at elevated temperatures.  Dopant concentrations
> are engineered to not suffer from diffusion effects well beyond the
> expected life of the components (at rated operating temperatures), but
> that doesn't mean there aren't marginal production runs, which can be
> hard to discover before parts start failing years later.  If that
> happens after the warranty period, the manufacturer has dodged a big
> bullet, with a bit of tarnish on their name, of course.

And as manufacturers shrink their feature sizes, dopant diffusion and
quantum leakage become much bigger problems :-(  If your transistor is
only four or five atoms deep, it won't have many dopant atoms, and they
only have to diffuse a short distance before your transistor is toast.

Cheers,
Wol


end of thread

Thread overview: 9+ messages
2017-09-21 11:12 possible HighPoint RocketRAID 2720SGL failure Eyal Lebedinsky
2017-09-21 12:20 ` Roman Mamedov
2017-09-22  1:08   ` Roger Heflin
2017-09-22  8:27     ` Eyal Lebedinsky
2017-09-22 13:20       ` Phil Turmel
2017-09-22 22:52         ` Eyal Lebedinsky
2017-09-23  9:29           ` Eyal Lebedinsky
2017-09-23 14:49             ` Phil Turmel
2017-09-23 16:53               ` Wols Lists
