linux-raid.vger.kernel.org archive mirror
* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Andrew Burgess @ 2007-07-17 15:18 UTC
  To: linux-raid

> 	The 'MCE's have been ongoing for sometime .  I have replaced every item 
>in the system except the chassis & scsi backplane & power supply(750Watts) .
> 	Everything .  MB,cpu,memory,scsi controllers, ...
> 	These MCE's only happen when I am trying to build or bonnie++ test the 
>md3 .  It consists of (now 7+1spare) 146GB drives in the SuperMicro 
>SYS-6035B-8B's backplane attached to a LSI22320 .

Probably every old timer has a story about chasing a hardware problem
where changing the power supply finally fixed it. I keep spares now.

If an MCE (which means bad cpu) doesn't go away after changing the cpu
it would either have to be temperature, power or a bug in the MCE code.
What else could it be?



* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Mr. James W. Laferriere @ 2007-07-21 23:25 UTC
  To: Andrew Burgess; +Cc: linux-raid

 	Hello Andrew ,

On Tue, 17 Jul 2007, Andrew Burgess wrote:
>> 	The 'MCE's have been ongoing for sometime .  I have replaced every item
>> in the system except the chassis & scsi backplane & power supply(750Watts) .
>> 	Everything .  MB,cpu,memory,scsi controllers, ...
>> 	These MCE's only happen when I am trying to build or bonnie++ test the
>> md3 .  It consists of (now 7+1spare) 146GB drives in the SuperMicro
>> SYS-6035B-8B's backplane attached to a LSI22320 .
>
> Probably every old timer has a story about chasing a hardware problem
> where changing the power supply finally fixed it. I keep spares now.
>
> If an MCE (which means bad cpu) doesn't go away after changing the cpu
> it would either have to be temperature, power or a bug in the MCE code.
> What else could it be?

 	Thank you for the idea of 'changing out the PS' .  So I did it a bit 
different .  I removed the system PS from the raid backplane & dropped in a
known good ps of proper wattage & re-tested .  But left the systems ps attached 
to only the MB & fans .
 	It doesn't appear to be power load related .  I tried rebuilding my 7 
disk raid6 array & I got the same thing ,  MCE .
 	Now the raid backplane is still in the air stream in front of the cpu's 
and memory slots .  So it could be a marginal cpu or memory stick .

 	But here's the clincher ,  when I don't use the two drives in front of
the PS & cpu & memory slots ,  the array completes its resync .  So I'm back to
testing memory (again) .  If that passes then I'll try the new cpu(s) route .

 		Tnx All ,  JimL
-- 
+-----------------------------------------------------------------+
| James   W.   Laferriere | System   Techniques | Give me VMS     |
| Network        Engineer | 663  Beaumont  Blvd |  Give me Linux  |
| babydr@baby-dragons.com | Pacifica, CA. 94044 |   only  on  AXP |
+-----------------------------------------------------------------+


* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Bill Davidsen @ 2007-07-23 23:06 UTC
  To: Mr. James W. Laferriere; +Cc: Andrew Burgess, linux-raid

Mr. James W. Laferriere wrote:
>     Hello Andrew ,
>
> On Tue, 17 Jul 2007, Andrew Burgess wrote:
>>>     The 'MCE's have been ongoing for sometime .  I have replaced 
>>> every item
>>> in the system except the chassis & scsi backplane & power 
>>> supply(750Watts) .
>>>     Everything .  MB,cpu,memory,scsi controllers, ...
>>>     These MCE's only happen when I am trying to build or bonnie++ 
>>> test the
>>> md3 .  It consists of (now 7+1spare) 146GB drives in the SuperMicro
>>> SYS-6035B-8B's backplane attached to a LSI22320 .
>>
>> Probably every old timer has a story about chasing a hardware problem
>> where changing the power supply finally fixed it. I keep spares now.
>>
>> If an MCE (which means bad cpu) doesn't go away after changing the cpu
>> it would either have to be temperature, power or a bug in the MCE code.
>> What else could it be?
>
>     Thank you for the idea of 'changing out the PS' .  So I did it a 
> bit different .  I removed the system PS from the raid backplane &
> dropped in a known good ps of proper wattage & re-tested .  But left 
> the systems ps attached to only the MB & fans .
>     It doesn't appear to be power load related .  I tried rebuilding 
> my 7 disk raid6 array & I got the same thing ,  MCE .
>     Now the raid backplane is still in the air stream in front of the 
> cpu's and memory slots .  So it could be a marginal cpu or memory stick .
>
>     But here's the clincher ,  when I don't use the two drives in front
> of the PS & cpu & memory slots ,  the array completes its resync .
> So I'm back to testing memory (again) .  If that passes then I'll try
> the new cpu(s) route .
>
It does sound like a cooling problem, which does not have to imply the 
overheated parts are bad, although that may be true. Could be the total 
number of i/o in flight, etc. Have you tried dropping two other drives? 
Can you put in a bit more fan? Read the system board and CPU temps with 
the "sensors" package?
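
For what it's worth, on kernels that expose hardware monitoring through
sysfs, the same temperatures that "sensors" prints can be read directly.
What follows is only a minimal Python sketch, assuming the common
/sys/class/hwmon layout in which each chip directory holds tempN_input
files in millidegrees Celsius. The function name read_hwmon_temps and its
root parameter are made up for illustration, and drivers that place the
files under a device/ subdirectory are not handled.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chipname:tempM": degrees_C} for readable sensors.

    Assumes the usual hwmon sysfs layout; 'root' is a parameter only so
    the function can be pointed at a test directory.
    """
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the driver gives no name file.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            label = sensor.name[: -len("_input")]  # "temp1_input" -> "temp1"
            temps[f"{chip.name}:{chip_name}:{label}"] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for key, deg in read_hwmon_temps().items():
        print(f"{key}: {deg:.1f} C")
```

Run in a loop while the array resyncs, this would show whether the CPU or
board temperature climbs in the minutes before the MCE.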

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Mr. James W. Laferriere @ 2007-07-24  4:44 UTC
  To: Bill Davidsen; +Cc: Andrew Burgess, linux-raid

 	Hello Bill ,

On Mon, 23 Jul 2007, Bill Davidsen wrote:
> Mr. James W. Laferriere wrote:
>>     Hello Andrew ,
>> 
>> On Tue, 17 Jul 2007, Andrew Burgess wrote:
>>>>     The 'MCE's have been ongoing for sometime .  I have replaced every 
>>>> item
>>>> in the system except the chassis & scsi backplane & power 
>>>> supply(750Watts) .
>>>>     Everything .  MB,cpu,memory,scsi controllers, ...
>>>>     These MCE's only happen when I am trying to build or bonnie++ test 
>>>> the
>>>> md3 .  It consists of (now 7+1spare) 146GB drives in the SuperMicro
>>>> SYS-6035B-8B's backplane attached to a LSI22320 .
>>> 
>>> Probably every old timer has a story about chasing a hardware problem
>>> where changing the power supply finally fixed it. I keep spares now.
>>> 
>>> If an MCE (which means bad cpu) doesn't go away after changing the cpu
>>> it would either have to be temperature, power or a bug in the MCE code.
>>> What else could it be?
>>
>>     Thank you for the idea of 'changing out the PS' .  So I did it a bit 
>> different .  I removed the system PS from the raid backplane & dropped in a
>> known good ps of proper wattage & re-tested .  But left the systems ps 
>> attached to only the MB & fans .
>>     It doesn't appear to be power load related .  I tried rebuilding my 7 
>> disk raid6 array & I got the same thing ,  MCE .
>>     Now the raid backplane is still in the air stream in front of the cpu's 
>> and memory slots .  So it could be a marginal cpu or memory stick .
>>
>>     But here's the clincher ,  when I don't use the two drives in front of
>> the PS & cpu & memory slots ,  the array completes its resync .  So I'm
>> back to testing memory (again) .  If that passes then I'll try the new
>> cpu(s) route .
>> 
> It does sound like a cooling problem, which does not have to imply the 
> overheated parts are bad, although that may be true.
 	Fyi ,  memtest86+ @ 19 passes (~ 52hours) on 8GB of memory ,  no errors .

> Could be the total number of i/o in flight, etc.
 	Hmmm ,  I didn't think of this one .
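
One way to put a number on "i/o in flight" is to sample the per-device
counters during a resync. The sketch below is Python and assumes a kernel
that provides /sys/block/<dev>/inflight (two integers: reads and writes
currently in flight); older kernels may not have that file. The function
name read_inflight, the root parameter and the device names in the example
are illustrative assumptions, not taken from this system.

```python
from pathlib import Path


def read_inflight(devices, root="/sys/block"):
    """Return {device: (reads_in_flight, writes_in_flight)}.

    Devices whose inflight file is missing or malformed are left out.
    """
    result = {}
    for dev in devices:
        path = Path(root) / dev / "inflight"
        try:
            fields = path.read_text().split()
            reads, writes = int(fields[0]), int(fields[1])
        except (OSError, ValueError, IndexError):
            continue  # file absent on this kernel, or unexpected content
        result[dev] = (reads, writes)
    return result


if __name__ == "__main__":
    # Hypothetical member disks of the array; substitute the real ones.
    for dev, (r, w) in read_inflight(["sda", "sdb"]).items():
        print(f"{dev}: {r} reads, {w} writes in flight")
```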

> Have you tried dropping two other drives?
 	Well ,  no .  I dropped those two in front of the CPU as a test in 
working my way up the scsi backplane(BP) trying to find a point that worked & 
the last two drives in the BP just happened to be in front of the cpu/memory 
air path .  The minute I put those in the MD build tree within the usual time 
frame I get a MCE .  What I haven't tried is what you are probably suggesting :
make sure it is the drives in the air path by putting them in the MD build and
leaving another two out .  I'll try that as well .
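
The control runs described above can be listed mechanically. This is only
a Python sketch under assumed numbering: eight backplane slots numbered
0-7, with 6 and 7 standing in for the two drives in the CPU/memory air
path. The slot numbers and the function name leave_two_out_plans are
hypothetical. Every plan keeps both suspect slots in the array and leaves
out two of the others, so the drive count matches the run that completed.

```python
from itertools import combinations


def leave_two_out_plans(slots, suspects):
    """Yield (left_out, in_array) with both suspect slots kept in the array."""
    slots = list(slots)
    others = [s for s in slots if s not in suspects]
    for left_out in combinations(others, 2):
        in_array = [s for s in slots if s not in left_out]
        yield left_out, in_array


if __name__ == "__main__":
    # Assumed numbering: 8 slots, the last two in front of the CPUs.
    for left_out, in_array in leave_two_out_plans(range(8), suspects=(6, 7)):
        print("leave out", left_out, "build with", in_array)
```

If the MCE comes back in these runs, that points at the two suspect
positions (or the drives in them); if it does not, the total number of
drives in the build looks more likely. One or two of the fifteen plans is
probably enough to tell.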

> Can you put in a bit more fan?
 	Nope ,  It's maxed out .  sounds like a 747 on take off as it is .
 	It's a supermicro SYS-6035B-8B if you have the time to go look at the 
specs & pics .

> Read the system board and CPU temps with the "sensors" package?
 	Not yet ,  I am building the needed items into the kernel now .
 	Will report back (hopefully) sometime this weekend .

 		Tia ,  JimL
-- 
+-----------------------------------------------------------------+
| James   W.   Laferriere | System   Techniques | Give me VMS     |
| Network        Engineer | 663  Beaumont  Blvd |  Give me Linux  |
| babydr@baby-dragons.com | Pacifica, CA. 94044 |   only  on  AXP |
+-----------------------------------------------------------------+


* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Bill Davidsen @ 2007-07-24 18:59 UTC
  To: Mr. James W. Laferriere; +Cc: Andrew Burgess, linux-raid

Mr. James W. Laferriere wrote:

One more thought below...
> On Mon, 23 Jul 2007, Bill Davidsen wrote:
>> Mr. James W. Laferriere wrote:
>>>     Hello Andrew ,
>>>
>>> On Tue, 17 Jul 2007, Andrew Burgess wrote:
>>>>>     The 'MCE's have been ongoing for sometime .  I have replaced 
>>>>> every item
>>>>> in the system except the chassis & scsi backplane & power 
>>>>> supply(750Watts) .
>>>>>     Everything .  MB,cpu,memory,scsi controllers, ...
>>>>>     These MCE's only happen when I am trying to build or bonnie++ 
>>>>> test the
>>>>> md3 .  It consists of (now 7+1spare) 146GB drives in the SuperMicro
>>>>> SYS-6035B-8B's backplane attached to a LSI22320 .
>>>>
>>>> Probably every old timer has a story about chasing a hardware problem
>>>> where changing the power supply finally fixed it. I keep spares now.
>>>>
>>>> If an MCE (which means bad cpu) doesn't go away after changing the cpu
>>>> it would either have to be temperature, power or a bug in the MCE 
>>>> code.
>>>> What else could it be?
>>>
>>>     Thank you for the idea of 'changing out the PS' .  So I did it a 
>>> bit different .  I removed the system PS from the raid backplane &
>>> dropped in a known good ps of proper wattage & re-tested .  But left 
>>> the systems ps attached to only the MB & fans .
>>>     It doesn't appear to be power load related .  I tried rebuilding 
>>> my 7 disk raid6 array & I got the same thing ,  MCE .
>>>     Now the raid backplane is still in the air stream in front of 
>>> the cpu's and memory slots .  So it could be a marginal cpu or 
>>> memory stick .
>>>
>>>     But here's the clincher ,  when I don't use the two drives in
>>> front of the PS & cpu & memory slots ,  the array completes its
>>> resync .  So I'm back to testing memory (again) .  If that passes
>>> then I'll try the new cpu(s) route .
>>>
>> It does sound like a cooling problem, which does not have to imply 
>> the overheated parts are bad, although that may be true.
>     Fyi ,  memtest86+ @ 19 passes (~ 52hours) on 8GB of memory ,  no 
> errors .
>
>> Could be the total number of i/o in flight, etc.
>     Hmmm ,  I didn't think of this one .
>
Those are a PITA to find if that's it. It doesn't sound likely to be the
power supply, but as an unlikely but cheap test, have you reseated the
p/s to backplane connectors? Oh, and checked that the system board is
grounded to the case?
>> Have you tried dropping two other drives?
>     Well ,  no .  I dropped those two in front of the CPU as a test in 
> working my way up the scsi backplane(BP) trying to find a point that 
> worked & the last two drives in the BP just happened to be in front of 
> the cpu/memory air path .  The minute I put those in the MD build tree 
> within the usual time frame I get a MCE .  What I haven't tried is
> what you are probably suggesting :  make sure it is the drives in the air
> path by putting them in the MD build and leaving another two out .  
> I'll try that as well .
>
>> Can you put in a bit more fan?
>     Nope ,  It's maxed out .  sounds like a 747 on take off as it is .
>     It's a supermicro SYS-6035B-8B if you have the time to go look at 
> the specs & pics .
>
What I was thinking is that some of my cases actually have room to 
install fans in front of the drives, allowing push as well as pull. 
Haven't had to do it in several years, but looking at my tall tower 
cases, I believe I could.
>> Read the system board and CPU temps with the "sensors" package?
>     Not yet ,  I am building the needed items into the kernel now .
>     Will report back (hopefully) sometime this weekend .
>
Keep us posted. You have picked the low-hanging fruit; when you find out
what causes this I'm sure it will be something interesting.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

