* forcing check of RAID1 arrays causes lockup
From: kyle @ 2009-05-26 23:45 UTC
To: linux-raid

Greetings.
In the process of switching distributions on one of my machines, I've
run into a problem where forcing a check of the RAID arrays causes the
system to lock up. I've got a bug open at
https://bugzilla.redhat.com/show_bug.cgi?id=501126 , but I'm hoping
someone here can help me track it down.
To add a little more to the last post on that bug: I've since installed
Gentoo and am running Gentoo's patched 2.6.28-gentoo-r5 kernel (along
with mdadm 2.6.8), and the bug shows up there as well. I don't follow
kernel patches, and haven't looked too deeply into the distribution
patches, but diff isn't showing any changed code in drivers/{md,ata}/*.
And, to further describe "lock up", here's what happens in more detail:
I write "check" to every md array's 'sync_action' file in /sys.
Within about a second, the system stops responding to user input,
doesn't update the mouse pointer, doesn't respond to pings.
I've tried to get a run of dmesg in before the crash, and I've been
able to see the usual messages about checking arrays, but that's it.
Also, I once tried waiting for each array's check to finish before
starting the check on the next array, and that seemed to work for a
while, but some time late in the several-hour process the system locked
up again. (It's possible that a cron job tried to start a check--no way
to know.)
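
For illustration, a minimal sketch of that sequential approach
(assuming the usual /sys/block/mdX/md/sync_action layout; roughly what
the loop looks like, not an exact transcript of what I ran):

#!/usr/bin/env python3
# Start a "check" on one md array at a time and wait for it to return
# to "idle" before touching the next one.
import glob
import time

for action_path in sorted(glob.glob("/sys/block/md*/md/sync_action")):
    with open(action_path, "w") as f:
        f.write("check\n")       # same effect as: echo check > sync_action
    # sync_action reads back "idle" once the check has finished.
    while True:
        with open(action_path) as f:
            if f.read().strip() == "idle":
                break
        time.sleep(30)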
Any ideas?
Thanks,
Kyle Liddell

* Re: forcing check of RAID1 arrays causes lockup
From: Roger Heflin @ 2009-05-27 1:33 UTC
To: kyle; +Cc: linux-raid

kyle@foobox.homelinux.net wrote:
> In the process of switching distributions on one of my machines, I've
> run into a problem where forcing a check of the RAID arrays causes the
> system to lock up.
...
> Any ideas?

Not what you are going to want to hear, but this sounds like badly
designed hardware.

On a machine I had with 4 disks (2 on a built-in VIA controller, 2 on
other ports--either a built-in Promise or a SiI PCI card), whenever the
2 built-in VIA SATA ports were used heavily at the same time as any
others, the machine became unstable: it made other non-disk PCI cards do
flaky things, and it caused the machine to lock up (and once the machine
had crashed, getting the arrays to rebuild was almost impossible--the
MTBF was under 5 minutes after a reboot). My eventual solution was to
stop using the VIA SATA ports, after which it became stable--not a great
solution.

The VIA ports in my case were built into the motherboard on a 66 MHz PCI
link, while the real PCI bus was the standard 33 MHz. As designed, the
VIA chipset (and I think your chipset is pretty close to the one I was
using) did not appear to cope with high levels of traffic to several
devices at once, and would become unstable.

Once I figured out the issue, I could duplicate it in under 5 minutes,
and the only working solution was to not use the VIA ports.

My motherboard at the time was an Asus K8V SE Deluxe with a K8T800
chipset; as long as it was not heavily used it was stable, but under
heavy use it was junk.

* Re: forcing check of RAID1 arrays causes lockup
From: Kyle Liddell @ 2009-05-28 9:17 UTC
To: Roger Heflin; +Cc: linux-raid

On Tue, May 26, 2009 at 08:33:30PM -0500, Roger Heflin wrote:
> Not what you are going to want to hear, but this sounds like badly
> designed hardware.
...
> My motherboard at the time was an Asus K8V SE Deluxe with a K8T800
> chipset; as long as it was not heavily used it was stable, but under
> heavy use it was junk.

That does sound like my problem, and the hardware is similar. However, I
don't think the VIA controller is the problem here: I moved the two
drives off the on-board VIA controller and attached them as slaves on
the Promise card. That let me install Fedora, which was an improvement,
but once it was installed I could still bring the system down by forcing
a check. I've also got a spare Promise IDE controller, so I tried
swapping that in, with no change.

I suppose it's a weird hardware bug, although it's still strange that
only certain combinations of kernel (which makes a little sense) and
distribution (which makes no sense) work. I've gone back to Debian on
the machine, and it works fine.

I'm trying to reproduce the problem on another machine, but I'm not too
hopeful.

* Re: forcing check of RAID1 arrays causes lockup
From: Redeeman @ 2009-05-28 12:42 UTC
To: Kyle Liddell; +Cc: Roger Heflin, linux-raid

On Thu, 2009-05-28 at 05:17 -0400, Kyle Liddell wrote:
> That does sound like my problem, and the hardware is similar.
...
> I'm trying to reproduce the problem on another machine, but I'm not
> too hopeful.

I have a system with 6 drives in RAID5 on such a K8V SE Deluxe board,
using the VIA controller plus an additional PCI controller, and I am
seeing some weird problems on it too: when doing lots of reads and
writes it will sometimes freeze for up to 10 seconds, and what happens
is that one of the disks gets a bus soft reset.

With the old IDE driver this would fall over completely and the box had
to be rebooted; libata, however, was able to recover it nicely and just
continue after the freeze.

I was never able to solve the issue, and since it wasn't a major problem
for what the machine was used for, I just ignored it.

Do you suppose adding another PCI card with IDE ports and discontinuing
use of the VIA controller entirely would fix it? I have actually just
replaced this box with a newer one, so it won't do me much good now, but
a PCI IDE controller costs so little that it would be worth it just to
find out what was actually wrong.

* Re: forcing check of RAID1 arrays causes lockup
From: Roger Heflin @ 2009-05-28 23:25 UTC
To: Redeeman; +Cc: Kyle Liddell, linux-raid

Redeeman wrote:
> I have a system with 6 drives in RAID5 on such a K8V SE Deluxe board,
> using the VIA controller plus an additional PCI controller ...
...
> With the old IDE driver this would fall over completely and the box
> had to be rebooted; libata, however, was able to recover it nicely and
> just continue after the freeze.

With the VIA SATA ports and sata_via, under heavy enough load mine was
not recovering. Under lighter loads it did get some odd errors that were
not fatal.

> I was never able to solve the issue, and since it wasn't a major
> problem for what the machine was used for, I just ignored it.

On mine it was not an issue until I tried a full rebuild (after a
crash): with 2 disks on the VIA ports and 2 on either the Promise or a
SiI PCI card, it crashed quickly under a rebuild. The machine had been
mostly stable (a few odd things happened) for a couple of months, and
after moving away from the VIA ports those few odd things quit
happening, so I believe the odd behavior (video capture cards appearing
to lose parts of their video data, video capture cards locking up
internally--note that two completely different PCI card designs and
chipsets were both doing funny things) came from the same problem.

> Do you suppose adding another PCI card with IDE ports and
> discontinuing use of the VIA controller entirely would fix it?

On mine, moving everything off the sata_via ports made things slower but
stable (the VIA PATA/SATA ports sit on a 66 MHz x 32-bit PCI link, while
everything else shares a single 33 MHz x 32-bit PCI bus). In the first
setup (2 disks on the VIA ports, 2 on the PCI bus) the slowest disk's
share of bandwidth is in theory 66 MB/second; in the second case (all 4
disks on the PCI bus) each disk's share is in theory 33 MB/second. I
could see a large difference in RAID5 read/write speed between the two
cases, but the faster case was unstable, so useless.

> Though, I have actually just replaced this box with a newer one, so it
> won't do me much good now ...

I have since replaced mine as well (it is no longer a server needing
more than one drive), so it is no longer causing issues.
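
For reference, the rough arithmetic behind the 66 MB/second and 33
MB/second figures above -- a back-of-the-envelope sketch assuming ideal
shared-bus bandwidth and no protocol overhead:

# Theoretical per-disk share of a shared parallel-PCI bus.
def per_disk_mb_s(bus_mhz, bus_bits, disks):
    return bus_mhz * (bus_bits // 8) / disks

# First setup: only 2 disks sit on the 33 MHz x 32-bit PCI bus (the
# other 2 are on the faster VIA link), so the slowest disks get ~66 MB/s.
print(per_disk_mb_s(33, 32, 2))   # -> 66.0

# Second setup: all 4 disks share the 33 MHz x 32-bit PCI bus -> ~33 MB/s.
print(per_disk_mb_s(33, 32, 4))   # -> 33.0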

Thread overview: 5 messages
  2009-05-26 23:45  forcing check of RAID1 arrays causes lockup (kyle)
  2009-05-27  1:33  ` Roger Heflin
  2009-05-28  9:17  ` Kyle Liddell
  2009-05-28 12:42  ` Redeeman
  2009-05-28 23:25  ` Roger Heflin