* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Andrew Burgess @ 2007-07-17 15:18 UTC
To: linux-raid
> The 'MCE's have been ongoing for some time . I have replaced every item
>in the system except the chassis & scsi backplane & power supply(750Watts) .
> Everything . MB,cpu,memory,scsi controllers, ...
> These MCE's only happen when I am trying to build or bonnie++ test the
>md3 . It consists of (now 7+1spare) 146GB drives in the SuperMicro
>SYS-6035B-8B's backplane attached to a LSI22320 .
Probably every old timer has a story about chasing a hardware problem
where changing the power supply finally fixed it. I keep spares now.
If an MCE (which means bad cpu) doesn't go away after changing the cpu
it would either have to be temperature, power or a bug in the MCE code.
What else could it be?
* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Mr. James W. Laferriere @ 2007-07-21 23:25 UTC
To: Andrew Burgess; +Cc: linux-raid
Hello Andrew ,
On Tue, 17 Jul 2007, Andrew Burgess wrote:
>> The 'MCE's have been ongoing for some time . I have replaced every item
>> in the system except the chassis & scsi backplane & power supply(750Watts) .
>> Everything . MB,cpu,memory,scsi controllers, ...
>> These MCE's only happen when I am trying to build or bonnie++ test the
>> md3 . It consists of (now 7+1spare) 146GB drives in the SuperMicro
>> SYS-6035B-8B's backplane attached to a LSI22320 .
>
> Probably every old timer has a story about chasing a hardware problem
> where changing the power supply finally fixed it. I keep spares now.
>
> If an MCE (which means bad cpu) doesn't go away after changing the cpu
> it would either have to be temperature, power or a bug in the MCE code.
> What else could it be?
Thank you for the idea of 'changing out the PS' . So I did it a bit
differently . I removed the system PS from the raid backplane & dropped in a
known good ps of proper wattage & re-tested . But left the system's ps attached
to only the MB & fans .
It doesn't appear to be power load related . I tried rebuilding my 7
disk raid6 array & I got the same thing , MCE .
Now the raid backplane is still in the air stream in front of the cpu's
and memory slots . So it could be a marginal cpu or memory stick .
But here's the clincher , when I don't use the two drives in front of
the PS & cpu & memory slots , the array completes its resync . So I'm back to
testing memory (again) , If that passes then I'll try the new cpu(s) route .
Tnx All , JimL
--
+-----------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | 663 Beaumont Blvd | Give me Linux |
| babydr@baby-dragons.com | Pacifica, CA. 94044 | only on AXP |
+-----------------------------------------------------------------+
* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Bill Davidsen @ 2007-07-23 23:06 UTC
To: Mr. James W. Laferriere; +Cc: Andrew Burgess, linux-raid
Mr. James W. Laferriere wrote:
> Hello Andrew ,
>
> On Tue, 17 Jul 2007, Andrew Burgess wrote:
>>> The 'MCE's have been ongoing for some time . I have replaced
>>> every item
>>> in the system except the chassis & scsi backplane & power
>>> supply(750Watts) .
>>> Everything . MB,cpu,memory,scsi controllers, ...
>>> These MCE's only happen when I am trying to build or bonnie++
>>> test the
>>> md3 . It consists of (now 7+1spare) 146GB drives in the SuperMicro
>>> SYS-6035B-8B's backplane attached to a LSI22320 .
>>
>> Probably every old timer has a story about chasing a hardware problem
>> where changing the power supply finally fixed it. I keep spares now.
>>
>> If an MCE (which means bad cpu) doesn't go away after changing the cpu
>> it would either have to be temperature, power or a bug in the MCE code.
>> What else could it be?
>
> Thank you for the idea of 'changing out the PS' . So I did it a
> bit differently . I removed the system PS from the raid backplane &
> dropped in a known good ps of proper wattage & re-tested . But left
> the system's ps attached to only the MB & fans .
> It doesn't appear to be power load related . I tried rebuilding
> my 7 disk raid6 array & I got the same thing , MCE .
> Now the raid backplane is still in the air stream in front of the
> cpu's and memory slots . So it could be a marginal cpu or memory stick .
>
> But here's the clincher , when I don't use the two drives in front
> of the PS & cpu & memory slots , the array completes its resync .
> So I'm back to testing memory (again) , If that passes then I'll try
> the new cpu(s) route .
>
It does sound like a cooling problem, which does not have to imply the
overheated parts are bad, although that may be true. Could be the total
number of i/o in flight, etc. Have you tried dropping two other drives?
Can you put in a bit more fan? Read the system board and CPU temps with
the "sensors" package?
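
As a rough illustration of the temperature check, here is a minimal Python sketch that reads the same kind of values "sensors" reports, straight from the kernel's hwmon sysfs files. It assumes a kernel that exposes `temp*_input` files (in millidegrees Celsius) directly under `/sys/class/hwmon/hwmon*`; older kernels may place them under a `device/` subdirectory, and the function name `read_hwmon_temps` is made up for this example.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/tempN": degrees_C} for each readable temp*_input file.

    Assumption: sensor files sit directly in each hwmon* directory and
    hold an integer number of millidegrees Celsius.
    """
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the chip has no "name" file.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or parsed
            label = f"{chip_name}/{sensor.name[:-len('_input')]}"
            temps[label] = millideg / 1000.0
    return temps
```

Calling this once a minute while the array resyncs and logging the result would show whether the board or CPU readings climb in the run-up to an MCE.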
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Mr. James W. Laferriere @ 2007-07-24 4:44 UTC
To: Bill Davidsen; +Cc: Andrew Burgess, linux-raid
Hello Bill ,
On Mon, 23 Jul 2007, Bill Davidsen wrote:
> Mr. James W. Laferriere wrote:
>> Hello Andrew ,
>>
>> On Tue, 17 Jul 2007, Andrew Burgess wrote:
>>>> The 'MCE's have been ongoing for some time . I have replaced every
>>>> item
>>>> in the system except the chassis & scsi backplane & power
>>>> supply(750Watts) .
>>>> Everything . MB,cpu,memory,scsi controllers, ...
>>>> These MCE's only happen when I am trying to build or bonnie++ test
>>>> the
>>>> md3 . It consists of (now 7+1spare) 146GB drives in the SuperMicro
>>>> SYS-6035B-8B's backplane attached to a LSI22320 .
>>>
>>> Probably every old timer has a story about chasing a hardware problem
>>> where changing the power supply finally fixed it. I keep spares now.
>>>
>>> If an MCE (which means bad cpu) doesn't go away after changing the cpu
>>> it would either have to be temperature, power or a bug in the MCE code.
>>> What else could it be?
>>
>> Thank you for the idea of 'changing out the PS' . So I did it a bit
>> differently . I removed the system PS from the raid backplane & dropped in a
>> known good ps of proper wattage & re-tested . But left the system's ps
>> attached to only the MB & fans .
>> It doesn't appear to be power load related . I tried rebuilding my 7
>> disk raid6 array & I got the same thing , MCE .
>> Now the raid backplane is still in the air stream in front of the cpu's
>> and memory slots . So it could be a marginal cpu or memory stick .
>>
>> But here's the clincher , when I don't use the two drives in front of
>> the PS & cpu & memory slots , the array completes its resync . So I'm
>> back to testing memory (again) , If that passes then I'll try the new
>> cpu(s) route .
>>
> It does sound like a cooling problem, which does not have to imply the
> overheated parts are bad, although that may be true.
Fyi , memtest86+ @ 19 passes (~ 52hours) on 8GB of memory , no errors .
> Could be the total number of i/o in flight, etc.
Hmmm , I didn't think of this one .
> Have you tried dropping two other drives?
Well , no . I dropped those two in front of the CPU as a test in
working my way up the scsi backplane(BP) trying to find a point that worked &
the last two drives in the BP just happened to be in front of the cpu/memory
air path . The minute I put those in the MD build tree , within the usual time
frame I get a MCE . What I haven't tried is what you are probably suggesting :
make sure it is the drives in the air path by putting them in the MD build and
leaving another two out . I'll try that as well .
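
The experiment described above (same number of drives, different slots left idle) can be sketched in Python as follows. This is only an illustration: the slot numbers 0-7, the choice of 6 and 7 as the two slots in the CPU/memory air path, and the function names are assumptions made for the example, not details taken from the machine.

```python
from itertools import combinations

TRACKS_SLOTS = "MCE follows the suspect slots (position, airflow, or those drives)"
TRACKS_COUNT = "no MCE at this drive count either way (points at load or I/O in flight)"
INCONCLUSIVE = "inconclusive"
NEED_MORE = "need runs both with and without the suspect slots active"


def plan_trials(slots, suspects, leave_out=2):
    """Enumerate every way to leave `leave_out` slots idle.

    Each trial records which slots stay active and how many of the
    suspect slots are among them.
    """
    trials = []
    for idle in combinations(slots, leave_out):
        active = [s for s in slots if s not in idle]
        trials.append({
            "idle": idle,
            "active": active,
            "suspects_active": sum(s in active for s in suspects),
        })
    return trials


def interpret(results):
    """Classify runs that all used the same number of active drives.

    `results` is a list of (suspects_active, mce_seen) boolean pairs.
    """
    with_suspects = [mce for active, mce in results if active]
    without_suspects = [mce for active, mce in results if not active]
    if not with_suspects or not without_suspects:
        return NEED_MORE
    if all(with_suspects) and not any(without_suspects):
        return TRACKS_SLOTS
    if not any(with_suspects) and not any(without_suspects):
        return TRACKS_COUNT
    return INCONCLUSIVE
```

For example, `plan_trials(list(range(8)), suspects=(6, 7))` lists the candidate runs, and only one of them leaves both suspect slots idle. Running even one or two of the others is enough to feed `interpret`.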
> Can you put in a bit more fan?
Nope , It's maxed out . Sounds like a 747 on take off as it is .
It's a supermicro SYS-6035B-8B if you have the time to go look at the
specs & pics .
> Read the system board and CPU temps with the "sensors" package?
Not yet , I am building the needed items into the kernel now .
Will report back (hopefully) sometime this weekend .
Tia , JimL
--
+-----------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | 663 Beaumont Blvd | Give me Linux |
| babydr@baby-dragons.com | Pacifica, CA. 94044 | only on AXP |
+-----------------------------------------------------------------+
* Re: raid5:md3: read error corrected , followed by , Machine Check
From: Bill Davidsen @ 2007-07-24 18:59 UTC
To: Mr. James W. Laferriere; +Cc: Andrew Burgess, linux-raid
Mr. James W. Laferriere wrote:
One more thought below...
> On Mon, 23 Jul 2007, Bill Davidsen wrote:
>> Mr. James W. Laferriere wrote:
>>> Hello Andrew ,
>>>
>>> On Tue, 17 Jul 2007, Andrew Burgess wrote:
>>>>> The 'MCE's have been ongoing for some time . I have replaced
>>>>> every item
>>>>> in the system except the chassis & scsi backplane & power
>>>>> supply(750Watts) .
>>>>> Everything . MB,cpu,memory,scsi controllers, ...
>>>>> These MCE's only happen when I am trying to build or bonnie++
>>>>> test the
>>>>> md3 . It consists of (now 7+1spare) 146GB drives in the SuperMicro
>>>>> SYS-6035B-8B's backplane attached to a LSI22320 .
>>>>
>>>> Probably every old timer has a story about chasing a hardware problem
>>>> where changing the power supply finally fixed it. I keep spares now.
>>>>
>>>> If an MCE (which means bad cpu) doesn't go away after changing the cpu
>>>> it would either have to be temperature, power or a bug in the MCE
>>>> code.
>>>> What else could it be?
>>>
>>> Thank you for the idea of 'changing out the PS' . So I did it a
>>> bit differently . I removed the system PS from the raid backplane &
>>> dropped in a known good ps of proper wattage & re-tested . But left
>>> the system's ps attached to only the MB & fans .
>>> It doesn't appear to be power load related . I tried rebuilding
>>> my 7 disk raid6 array & I got the same thing , MCE .
>>> Now the raid backplane is still in the air stream in front of
>>> the cpu's and memory slots . So it could be a marginal cpu or
>>> memory stick .
>>>
>>> But here's the clincher , when I don't use the two drives in
>>> front of the PS & cpu & memory slots , the array completes its
>>> resync . So I'm back to testing memory (again) , If that passes
>>> then I'll try the new cpu(s) route .
>>>
>> It does sound like a cooling problem, which does not have to imply
>> the overheated parts are bad, although that may be true.
> Fyi , memtest86+ @ 19 passes (~ 52hours) on 8GB of memory , no
> errors .
>
>> Could be the total number of i/o in flight, etc.
> Hmmm , I didn't think of this one .
>
Those are a PITA to find if that's it. It doesn't sound likely to be the
power supply, but as an unlikely but cheap test, have you reseated the p/s
to backplane connectors? Oh, and checked that the system board is grounded
to the case?
>> Have you tried dropping two other drives?
> Well , no . I dropped those two in front of the CPU as a test in
> working my way up the scsi backplane(BP) trying to find a point that
> worked & the last two drives in the BP just happened to be in front of
> the cpu/memory air path . The minute I put those in the MD build tree
> within the usual time frame I get a MCE . What I haven't tried is
> what you are probably suggesting : make sure it is the drives in the air
> path by putting them in the MD build and leaving another two out .
> I'll try that as well .
>
>> Can you put in a bit more fan?
> Nope , It's maxed out . Sounds like a 747 on take off as it is .
> It's a supermicro SYS-6035B-8B if you have the time to go look at
> the specs & pics .
>
What I was thinking is that some of my cases actually have room to
install fans in front of the drives, allowing push as well as pull.
Haven't had to do it in several years, but looking at my tall tower
cases, I believe I could.
>> Read the system board and CPU temps with the "sensors" package?
> Not yet , I am building the needed items into the kernel now .
> Will report back (hopefully) sometime this weekend .
>
Keep us posted. You have picked the low-hanging fruit; when you find out
what causes this I'm sure it will be something interesting.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979