* bonnie++ on md device causes reboot on new motherboard
@ 2009-02-03 4:14 Matt Garman
2009-02-03 8:53 ` David Greaves
2009-02-23 14:45 ` Matt Garman
0 siblings, 2 replies; 6+ messages in thread
From: Matt Garman @ 2009-02-03 4:14 UTC (permalink / raw)
To: linux-raid
I have two four-disk RAID5 arrays on Ubuntu Linux 8.04 (AMD64).
Both are using XFS for the filesystem.
$ uname -a
Linux septictank 2.6.24-19-generic #1 SMP Wed Aug 20 17:53:40 UTC
2008 x86_64 GNU/Linux
I recently replaced the motherboard and processor: switched from an
Intel Q35 motherboard + E5200 CPU to a Gigabyte GA-MA74GM-S2
motherboard (AMD 740G/SB700) with a 4850e CPU.
I ran a ton of benchmarks under the old configuration, and intend to
do the same with the new hardware. (See my previous posts regarding
SATA/southbridge performance.)
Anyway, four times now, when I run bonnie++ (with my current working
directory on one of the md arrays), the computer immediately
reboots.
I saw this happen twice while I was logged in remotely; I did it again
from the console to see if there were any useful error messages.
The output flashed by too quickly to read, but the reboot looked like it
happened immediately after bonnie++ started "Writing intelligently...".
There are no useful indications in the system logs.
Every time this reboot happens, it forces a rebuild of the md array.
I have only tried it on the one array (which is empty, so I can
afford to have the rebuild fail). I'm afraid to try it on the other
one until I figure this out.
For what it's worth the md device in question is four Western
Digital 7500AAKS 750 GB 7200 rpm SATA drives. The controller is not
the onboard SATA, but actually two 2-port PCIe SATA cards.
Anyone seen anything like this or have any ideas where I can start
looking for more information?
Thanks!
Matt
* Re: bonnie++ on md device causes reboot on new motherboard
From: David Greaves @ 2009-02-03 8:53 UTC (permalink / raw)
To: Matt Garman; +Cc: linux-raid
Matt Garman wrote:
> Anyone seen anything like this or have any ideas where I can start
> looking for more information?
netconsole?
http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
At least then you may see what the error is.
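For reference, a minimal sketch of a netconsole setup along these lines. The IP addresses, interface name, and MAC address are placeholders (assumptions, not taken from this thread). The parameter layout follows the kernel's netconsole.txt, where 6665 and 6666 are the documented default source and target ports.

```shell
# Sketch only: every address and the interface name below is a placeholder.
# Builds the value of the netconsole= module parameter, which has the form
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
netconsole_param() {
    # $1 = source IP, $2 = network device, $3 = target IP, $4 = target MAC
    printf '%s@%s/%s,%s@%s/%s\n' 6665 "$1" "$2" 6666 "$3" "$4"
}

# On the machine that crashes (as root):
#   modprobe netconsole \
#     netconsole="$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
#
# On the machine that collects the messages, listen for UDP on port 6666:
#   nc -u -l 6666        (some netcat variants want: nc -u -l -p 6666)
```

Kernel messages at or above the console log level are then sent as UDP packets to the listener, so an oops printed just before a reset has a chance of being captured even if nothing reaches the disk.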
And for a crash like this I'd contact your distro kernel team too (not sure
about lkml with 2.6.24 but probably)
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
* Re: bonnie++ on md device causes reboot on new motherboard
From: John Stoffel @ 2009-02-03 15:22 UTC (permalink / raw)
To: David Greaves; +Cc: Matt Garman, linux-raid
David> Matt Garman wrote:
>> Anyone seen anything like this or have any ideas where I can start
>> looking for more information?
David> netconsole?
David> http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
Or a serial console...
David> At least then you may see what the error is. And for a crash
David> like this I'd contact your distro kernel team too (not sure
David> about lkml with 2.6.24 but probably)
From the sounds of it, it's a hardware problem of some sort. I'd run
a full memtest86 on the box, as well as some sort of CPU torture test.
Check all your cables, possibly remove two of the four disks, etc.
Remove as much memory as possible, re-seat memory board, etc. Have
you checked the BIOS version? Have you reset the BIOS defaults to the
'safe' or 'default' settings? Don't bother tweaking stuff to get more
speed; go for stability. The second you have problems with stability,
you've lost all the time you saved by tweaking things. :]
John
* Re: bonnie++ on md device causes reboot on new motherboard
From: Roger Heflin @ 2009-02-04 1:02 UTC (permalink / raw)
To: John Stoffel; +Cc: David Greaves, Matt Garman, linux-raid
John Stoffel wrote:
> David> Matt Garman wrote:
>>> Anyone seen anything like this or have any ideas where I can start
>>> looking for more information?
>
> David> netconsole?
> David> http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
>
> Or a serial console...
>
> David> At least then you may see what the error is. And for a crash
> David> like this I'd contact your distro kernel team too (not sure
> David> about lkml with 2.6.24 but probably)
>
> From the sounds of it, it's a hardware problem of some sort. I'd run
> a full memtest86 on the box, as well as some sort of CPU torture.
> Check all your cables, possibly remove two of the four disks, etc.
>
> Remove as much memory as possible, re-seat memory board, etc. Have
> you checked the BIOS version? Have you reset the BIOS defaults to the
> 'safe' or 'default' settings? Don't bother tweaking stuff to get more
> speed, go for stability. The second you have problems with stability,
> you've lost all that time you saved by tweaking things. :]
>
>
I would second the HW issue. If the machine is doing a full reset with
no printout of any type, I would suspect the PS or some other serious HW
issue; Linux generally does not crash without some error message.
How big a PS do you have?
I would try just dd'ing the 4 disks at the same time and see if that
also crashes.
And then, if you can, remove 2 disks from the machine and retest.
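A minimal sketch of such a parallel dd test, in its read-only form. The device names in the usage comment are hypothetical and the 4 GiB per-disk read size is an arbitrary choice. Output goes to /dev/null, so this exercises only the read path and never writes to the disks.

```shell
# Read from every given device at the same time and discard the data.
# Returns non-zero if any of the dd processes fails.
parallel_read() {
    pids=""
    for dev in "$@"; do
        # 4096 blocks of 1 MiB = up to 4 GiB read per device
        dd if="$dev" of=/dev/null bs=1048576 count=4096 &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Usage (as root; device names are hypothetical, check yours first):
#   parallel_read /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

A write test against the raw member disks would destroy the array, so the write path is better exercised through the filesystem on an array that holds no data.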
* Re: bonnie++ on md device causes reboot on new motherboard
From: Matt Garman @ 2009-02-05 15:57 UTC (permalink / raw)
To: Roger Heflin; +Cc: John Stoffel, David Greaves, linux-raid
On Tue, Feb 03, 2009 at 07:02:58PM -0600, Roger Heflin wrote:
> John Stoffel wrote:
>> David> Matt Garman wrote:
>>>> Anyone seen anything like this or have any ideas where I can start
>>>> looking for more information?
>> David> netconsole?
>> David> http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
>> Or a serial console...
>> David> At least then you may see what the error is. And for a crash
>> David> like this I'd contact your distro kernel team too (not sure
>> David> about lkml with 2.6.24 but probably)
>> From the sounds of it, it's a hardware problem of some sort. I'd run
>> a full memtest86 on the box, as well as some sort of CPU torture.
>> Check all your cables, possibly remove two of the four disks, etc.
>> Remove as much memory as possible, re-seat memory board, etc. Have
>> you checked the BIOS version? Have you reset the BIOS defaults to the
>> 'safe' or 'default' settings? Don't bother tweaking stuff to get more
>> speed, go for stability. The second you have problems with stability,
>> you've lost all that time you saved by tweaking things. :]
>>
>
> I would second the HW issue, if the machine is doing a full reset
> with no printout out of any type I would think PS, or some other
> serious HW issue, Linux generally does not crash without some
> error message.
>
> How big of PS do you have?
>
> I would try just dding the 4 disks at the same time and see if
> that also crashes.
>
> And then if you can remove 2 disks from the machine and retest.
Netconsole is a great idea, thanks for that!
I'm going to keep testing, but here are some answers to the above
and general notes. Maybe these will generate ideas...
- Power supply is a Seasonic 450 Watt. I doubt this is the
problem, as I've been using this same power supply---as well
as all other hardware (except mobo and cpu)---without any
stability problems for several months. This MB/CPU actually
uses less power than the previous. Plus I have a Kill-A-Watt
electricity meter hooked up; I have yet to see the machine
pull more than 200 W AC (even at boot, md resync, cpuburn,
memtest, etc).
- I did a dd *read* test from the four drives in parallel
numerous times without causing a crash.
- I ran 24 hours of memtest86 without a single error.
- BIOS settings are all set to stable/conservative values.
(There is a newer BIOS, but no changelog---just says "updated
CPU support". I'll try it anyway.)
- It's not just bonnie++, it appears to be any bulk write to the
filesystem. I tried to do a bulk copy (locally, using rsync)
from the other md array, and that also caused a reset
(unfortunately, I didn't have netconsole running when it
happened).
- One thing that's interesting is that every time this machine
has rebooted itself, it has to resync the md array. The resync
process itself has never caused a reboot.
- I got brave and both ran bonnie++ and wrote a bunch of data
(via NFS) to the other md array on the integrated (SB700) SATA
controller. No problems.
My hunch is that the board doesn't like one or both of those SiI
2-port PCIe SATA cards. The motherboard has a single PCIe x1 slot
and an x16 slot; the SATA cards are both PCIe x1. Maybe the board
doesn't like having an x1 device in the x16 slot? Although, my
understanding is that PCI Express is smart enough to handle this kind
of thing.
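One way to see what link each card actually negotiated is to look at the LnkCap (capability) and LnkSta (current status) lines that lspci -vv prints for PCIe devices. A small sketch, assuming pciutils is installed and the command is run as root so the capability details are visible. It shows negotiated link speed and width, nothing more.

```shell
# Filter "lspci -vv" output (read from stdin) down to the device header
# lines plus the PCIe link capability and link status lines.
pcie_links() {
    grep -E '^[0-9a-f:]+\.[0-7] |LnkCap:|LnkSta:'
}

# Usage (as root):
#   lspci -vv | pcie_links
```

For an x1 card, both LnkCap and LnkSta would be expected to report "Width x1" regardless of the physical size of the slot it sits in.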
* Re: bonnie++ on md device causes reboot on new motherboard
From: Matt Garman @ 2009-02-23 14:45 UTC (permalink / raw)
To: linux-raid
A few weeks ago I posted about a problem I was having with my machine
rebooting whenever I wrote to an md array (details below).
The problem turned out to be using the SiI 3132 2-port SATA card in
the PCIe x16 slot of the GA-MA74GM-S2 motherboard. I contacted
Gigabyte and they said that the x16 slot is designed for *video
only*.
So just a heads-up if you plan to use this board for a low-power
file server.
-Matt
On Mon, Feb 02, 2009 at 10:14:50PM -0600, Matt Garman wrote:
>
> I have two four disk raid5 arrays on Ubuntu Linux 8.04 (AMD64).
> Both are using XFS for the filesystem.
>
> $ uname -a
> Linux septictank 2.6.24-19-generic #1 SMP Wed Aug 20 17:53:40 UTC
> 2008 x86_64 GNU/Linux
>
> I recently replaced the motherboard and processor: switched from an
> Intel Q35 motherboard + E5200 CPU to a Gigabyte GA-MA74GM-S2
> motherboard (AMD 740G/SB700) with an 4850e CPU.
>
> I ran a ton of benchmarks under the old configuration, and intend to
> do the same with the new hardware. (See my previous posts regarding
> SATA/southbridge performance.)
>
> Anyway, four times now, when I run bonnie++ (with my current working
> directory on one of the md arrays), the computer immediately
> reboots.
>
> I saw this happen twice while I logged in remotely; I did it again
> from the console to see if there are any useful error messages.
>
> It flashed too quickly, but the reboot looked like it happened
> immediately after bonnie++ started "Writing intelligently...".
>
> There are no useful indications in the system logs.
>
> Every time this reboot happens, it forces a rebuild of the md array.
> I have only tried it on the one array (which is empty, so I can
> afford to have the rebuild fail). I'm afraid to try it on the other
> one until I figure this out.
>
> For what it's worth the md device in question is four Western
> Digital 7500AAKS 750 GB 7200 rpm SATA drives. The controller is not
> the onboard SATA, but actually two 2-port PCIe SATA cards.
>
> Anyone seen anything like this or have any ideas where I can start
> looking for more information?
>
> Thanks!
> Matt
>