* System hangs on raid md recovery/resync - revisit
From: Brad @ 2009-02-28 1:32 UTC
To: linux-raid
Hi. I'd like to revisit a problem I put to the mailing list on the
27th July 2008.
My linux system hangs if I have a lengthy recovery of a raid-1
device going on at the same time as any significant network
traffic. If I terminate my networking applications the re-sync
succeeds; if I allow them to run then the re-sync will almost always
hang the system.
My PC is about 1.5 years old; it has a Gigabyte GA-P35-DS4 motherboard
with an Intel Core 2 Quad Q6600 CPU. The motherboard
has an Intel ICH9R southbridge with 6 SATA 2 ports and a separate 'Gigabyte'
(JMicron 20360/20363) controller with 2 SATA 2 ports. I have two
500GB Western Digital SATA 2 internal disks, both on the ICH9R controller,
as I used to get occasional SATA disconnects/errors if I had a disk under
heavy load on the JMicron controller. The two disks have 400GB
partitions in a MD raid1 mirror. I typically experience this problem when
I plug in a third disk (also on the ICH9R controller) to synchronise as
a backup procedure, but it also happens if I just have the two permanent
disks synchronising between themselves.
I'm running Linux 2.6.28.6. The motherboard has a Realtek RTL8111/8168B
gigabit ethernet controller which I have running in a 100Mbit full duplex
link to my ADSL modem. I'm using the kernel's standard r8169 driver for the
network.
If I have no significant network activity taking place (other than trivial
traffic from named, ntpd and the like) then my md1 recoveries always
succeed. But if I have a program maxing out the connection to my ISP -
about 160KB/sec down, 30KB/sec up - then the re-synchronisation will
always end up hanging:
o disk I/O stops - the disk activity LED will stop flashing, iostat statistics
will drop to zero, 'cat /proc/mdstat' will show dwindling I/O speeds and
ever-increasing finish times (from 200 minutes to 30,000+ minutes!).
o any access to the filesystem I have mounted on top of the md1 device
hangs.
o access to OTHER filesystems is fine, and anything independent of the
hung filesystem works as normal.
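One way to capture the slowdown described above as it happens is to log the
recovery line from /proc/mdstat at intervals. The following is a minimal
sketch, not part of the original report: the function name md_sync_status is
made up for illustration, and it assumes the usual 'finish=... speed=...'
layout of the recovery/resync line.

```shell
# Print "<array> <finish-estimate> <speed>" for every array that is currently
# recovering or resyncing. Reads mdstat-formatted text from the file named in
# $1, defaulting to /proc/mdstat.
md_sync_status() {
    awk '
        /^md[0-9]+ :/ { array = $1 }          # remember the array being described
        /(recovery|resync) *=/ {
            finish = ""; speed = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^finish=/) finish = substr($i, 8)   # text after "finish="
                if ($i ~ /^speed=/)  speed  = substr($i, 7)   # text after "speed="
            }
            print array, finish, speed
        }
    ' "${1:-/proc/mdstat}"
}
```

Called with no argument it reads /proc/mdstat. Running it once a minute and
appending the output to a file on an unaffected filesystem (such as /var/log
here) would leave a record of how the speed decays before the hang.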
There are absolutely no errors reported by the system - nothing logged
to the console and nothing logged via syslog (the /var/log filesystem
is fully operational even while the recovering one is hung).
Looking at /proc/interrupts I can see that the 'eth0' driver has an
interrupt all to itself.
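Whether anything shares that interrupt line can be checked by listing the
handler names at the end of each /proc/interrupts row. This is a minimal
sketch, assuming the usual layout of one header row of CPU columns followed by
one row per IRQ; the function name irq_sharers is hypothetical.

```shell
# Print the IRQ row (controller type plus all handler names) for every IRQ
# on which the device named in $1 is registered. Reads text in
# /proc/interrupts format from the file named in $2 (default /proc/interrupts).
irq_sharers() {
    awk -v dev="$1" '
        NR == 1 { ncpu = NF; next }      # header row has one column per CPU
        $1 ~ /^[0-9]+:$/ {
            tail = ""; hit = 0
            # Fields after the per-CPU counts: controller type, then handlers.
            for (i = ncpu + 2; i <= NF; i++) {
                name = $i
                sub(/,$/, "", name)      # handler names are comma-separated
                if (name == dev) hit = 1
                tail = tail " " $i
            }
            if (hit) print "IRQ " $1 tail
        }
    ' "${2:-/proc/interrupts}"
}
```

With 'irq_sharers eth0', a row that lists only eth0 after the controller type
means the NIC has the line to itself; any additional name on the same row is a
driver sharing the interrupt.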
I haven't had a single SATA disconnect error since I moved all my disks
off the JMicron controller. I can 'dd' each drive simultaneously with
no errors and better than 70MB/sec throughput from each in parallel.
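The parallel read test mentioned above could be reproduced roughly as follows.
This is a sketch under assumptions: parallel_read is a made-up helper, the
device names in the note below are placeholders, and 'bs=1M' relies on GNU dd.

```shell
# Read the first $1 MiB of every remaining argument concurrently, discarding
# the data. Returns non-zero if any of the dd processes failed.
parallel_read() {
    mib="$1"; shift
    pids=""
    for src in "$@"; do
        dd if="$src" of=/dev/null bs=1M count="$mib" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1      # collect each reader's exit status
    done
    return "$status"
}
```

For example, 'parallel_read 4096 /dev/sda /dev/sdb' (as root; the device names
are placeholders) reads the first 4GiB of both disks at once, and each dd
prints its own transfer rate on stderr when it finishes.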
Does anyone know of any condition which would cause the md1
recovery process to silently hang like this? Can I get some sort of
debug/verbose log out of the raid software to work out why it's hanging?
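Some state is available without extra logging: the md driver exposes the
resync status under /sys/block/<array>/md/, and, as root, writing 'w' to
/proc/sysrq-trigger asks the kernel to log stack traces of blocked tasks
(assuming the kernel was built with CONFIG_MAGIC_SYSRQ). The following sketch
only collects the sysfs values; the function name md_sync_snapshot is invented
for illustration, and it is not claimed to reveal the cause of the hang.

```shell
# Print selected md sysfs attributes for the array named in $1 (e.g. md1) as
# name=value lines. $2 optionally overrides the sysfs block directory.
md_sync_snapshot() {
    md_dir="${2:-/sys/block}/$1/md"
    for attr in array_state sync_action sync_speed sync_completed; do
        if [ -r "$md_dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$md_dir/$attr")"
        fi
    done
}
```

Taking a snapshot with 'md_sync_snapshot md1' before and after the stall,
together with the blocked-task dump in the kernel log, would give the list
something concrete to look at.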
Has anyone ever experienced this sort of problem - md recovery
'sensitivity' to network traffic? - on this motherboard?
Thanks,
Brad
* Re: System hangs on raid md recovery/resync - revisit
From: Justin Piszcz @ 2009-02-28 9:08 UTC
To: Brad; +Cc: linux-raid
On Sat, 28 Feb 2009, Brad wrote:
> Hi. I'd like to revisit a problem I put to the mailing list on the
> 27th July 2008.
>
> My linux system hangs if I have a lengthy recovery of a raid-1
> device going on at the same time as any significant network
> traffic. If I terminate my networking applications the re-sync
> succeeds; if I allow them to run then the re-sync will almost always
> hang the system.
>
> My PC is about 1.5 years old; it has a Gigabyte GA-P35-DS4 motherboard
> with an Intel Core 2 Quad Q6600 CPU. The motherboard
> has an Intel ICH9R southbridge with 6 SATA 2 ports and a separate 'Gigabyte'
> (JMicron 20360/20363) controller with 2 SATA 2 ports. I have two
> 500GB Western Digital SATA 2 internal disks, both on the ICH9R controller,
> as I used to get occasional SATA disconnects/errors if I had a disk under
> heavy load on the JMicron controller. The two disks have 400GB
> partitions in a MD raid1 mirror. I typically experience this problem when
> I plug in a third disk (also on the ICH9R controller) to synchronise as
> a backup procedure, but it also happens if I just have the two permanent
> disks synchronising between themselves.
>
> I'm running Linux 2.6.28.6. The motherboard has a Realtek RTL8111/8168B
> gigabit ethernet controller which I have running in a 100Mbit full duplex
> link to my ADSL modem. I'm using the kernel's standard r8169 driver for the
> network.
>
> If I have no significant network activity taking place (other than trivial
> traffic from named, ntpd and the like) then my md1 recoveries always
> succeed. But if I have a program maxing out the connection to my ISP -
> about 160KB/sec down, 30KB/sec up - then the re-synchronisation will
> always end up hanging:
>
> o disk I/O stops - the disk activity LED will stop flashing, iostat statistics
> will drop to zero, 'cat /proc/mdstat' will show dwindling I/O speeds and
> ever-increasing finish times (from 200 minutes to 30,000+ minutes!).
>
> o any access to the filesystem I have mounted on top of the md1 device
> hangs.
>
> o access to OTHER filesystems is fine, and anything independent of the
> hung filesystem works as normal.
>
> There are absolutely no errors reported by the system - nothing logged
> to the console and nothing logged via syslog (the /var/log filesystem
> is fully operational even while the recovering one is hung).
>
> Looking at /proc/interrupts I can see that the 'eth0' driver has an
> interrupt all to itself.
>
> I haven't had a single SATA disconnect error since I moved all my disks
> off the JMicron controller. I can 'dd' each drive simultaneously with
> no errors and better than 70MB/sec throughput from each in parallel.
>
> Does anyone know of any condition which would cause the md1
> recovery process to silently hang like this? Can I get some sort of
> debug/verbose log out of the raid software to work out why it's hanging?
>
> Has anyone ever experienced this sort of problem - md recovery
> 'sensitivity' to network traffic? - on this motherboard?
I have the same mobo:
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: P35-DS4
I have a RAID1 and a RAID5. I do not use the JMicron SATA ports, only the
Intel ones and add-on PCI-e cards, and I have never had any problems with the
RAID volumes. The NIC is sort of flaky though [in Linux]; I recommend using an
Intel PCI-e 1Gbps card.
Justin.
* Re: System hangs on raid md recovery/resync - revisit
From: Brad @ 2009-02-28 10:41 UTC
To: Justin Piszcz; +Cc: linux-raid
On Sat, Feb 28, 2009 at 7:08 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
> On Sat, 28 Feb 2009, Brad wrote:
>
>> Hi. I'd like to revisit a problem I put to the mailing list on the
>> 27th July 2008.
>>
>> My linux system hangs if I have a lengthy recovery of a raid-1
>> device going on at the same time as any significant network
>> traffic.
>> ...
>
> I have the same mobo:
>
> Handle 0x0001, DMI type 1, 27 bytes
> System Information
> Manufacturer: Gigabyte Technology Co., Ltd.
> Product Name: P35-DS4
How did you get that information, please? Another linux command
for me to learn?! :-)
> Have a RAID1 and RAID5, I do not use the jmicron SATA ports, only the intel
> ones and add-on pci-e cards, never had any problems with the raid volumes.
> The NIC is sort of flaky though [in linux], I recommend using an intel
> pci-e 1gbps card.
I've had another problem with the Realtek network driver ... under network
load it seemed to miss interrupts and/or pass them to the IDE driver, which
would print out errors about unexpected/unknown interrupts. I had to take
IDE out of my kernel.
I *think* my current hanging problem was even worse when the pata_jmicron
driver module - which I need to use the ATA DVD drive connected to one of the
JMicron's IDE ports - shared the same interrupt as the Realtek driver. I
couldn't find a way to change interrupts (can one do that at will with the
Linux kernel?) so my backup script unloads the pata_jmicron module
before it attaches the third backup disk to the md1 array.
But it still hangs if there's any significant network traffic. Maybe,
even though I've gotten rid of anything using the same IRQ as the
Realtek - IDE or pata_jmicron - the NIC driver is still flubbing interrupts
and that's confusing the kernel?
Thanks for the advice Justin. Maybe the solution is to abandon use
of the Realtek NIC (a pity to 'waste' what's freely available on the
motherboard, though, in a way).
Brad
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: System hangs on raid md recovery/resync - revisit
From: Rudy Zijlstra @ 2009-02-28 11:02 UTC
To: Brad; +Cc: Justin Piszcz, linux-raid
On Sat, 2009-02-28 at 20:41 +1000, Brad wrote:
> On Sat, Feb 28, 2009 at 7:08 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> >
> > On Sat, 28 Feb 2009, Brad wrote:
> >
>
> But it still hangs if there's any significant network traffic. Maybe,
> even though I've gotten rid of anything using the same IRQ as the
> Realtek - IDE or pata_jmicron - the NIC driver is still flubbing interrupts
> and that's confusing the kernel?
>
I would not call the traffic you mention significant network traffic,
not for a gigabit card...
> Thanks for the advice Justin. Maybe the solution is to abandon use
> of the Realtek NIC (a pity to 'waste' what's freely available on the
> motherboard, though, in a way).
I do agree with the suggestion to use a different NIC. I have been
avoiding Realtek-based NICs since I learned that Realtek chips do not
adhere to the electrical parts of the Ethernet spec, resulting in an
inability to use long cable runs (I have seen that myself).
This was on the 100Mbit versions.
The good point of the realtek: they are cheap
bad point: potential troubles.
--
Cheers,
Rudy
* Re: System hangs on raid md recovery/resync - revisit
From: Justin Piszcz @ 2009-02-28 12:04 UTC
To: Brad; +Cc: linux-raid
On Sat, 28 Feb 2009, Brad wrote:
> On Sat, Feb 28, 2009 at 7:08 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>> On Sat, 28 Feb 2009, Brad wrote:
>>
>>> Hi. I'd like to revisit a problem I put to the mailing list on the
>>> 27th July 2008.
>>>
>>> My linux system hangs if I have a lengthy recovery of a raid-1
>>> device going on at the same time as any significant network
>>> traffic.
>>> ...
>>
>> I have the same mobo:
>>
>> Handle 0x0001, DMI type 1, 27 bytes
>> System Information
>> Manufacturer: Gigabyte Technology Co., Ltd.
>> Product Name: P35-DS4
>
> How did you get that information, please? Another linux command
> for me to learn?! :-)
dmidecode | more
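To pull out just the board identification rather than paging through the whole
table, the two fields quoted above can be filtered from the dmidecode output.
A small sketch, assuming the 'Manufacturer:' and 'Product Name:' lines look
like the ones above; the function name board_id is made up.

```shell
# Print "<manufacturer> <product name>" from dmidecode output read on stdin.
board_id() {
    awk -F': *' '
        /^[ \t]*Manufacturer:/ { vendor = $2 }
        /^[ \t]*Product Name:/ { product = $2 }
        END { print vendor " " product }
    '
}
```

dmidecode normally needs root; 'dmidecode -t system | board_id' limits the
input to the System Information record shown above.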
>
>> Have a RAID1 and RAID5, I do not use the jmicron SATA ports, only the intel
>> ones and add-on pci-e cards, never had any problems with the raid volumes.
>> The NIC is sort of flaky though [in linux], I recommend using an intel
>> pci-e 1gbps card.
>
> I've had another problem with the Realtek network driver ... under network
> load it seemed to miss interrupts and/or pass them to the IDE driver, which
> would print out errors about unexpected/unknown interrupts. I had to take
> IDE out of my kernel.
Correct, buy an Intel 1GBPS PCI-e card, I do for all of my main machines
that do not have Intel NICs, solves the problem. They are $30-40 and then
all of your network issues will be solved.
>
> I *think* my current hanging problem was even worse when the pata_jmicron
> driver module - which I need to use the ATA DVD drive connected to one of the
> JMicron's IDE ports - shared the same interrupt as the Realtek driver.
Hm, no, I also use this jmicron driver and have no problems, but I no longer
use the realtek nic.
I will offer a piece of advice though: on Gigabyte boards in general, the
timings for the RAM etc. have to be set just right, otherwise weird things
happen. I have seen the motherboard freeze, lock up and do weird things
before, mainly before I had the memory settings set correctly. Run memtest86
and let it run for at least 1-2 passes, and ENSURE you have no errors. If you
have errors, then the memory timings/parameters are likely set incorrectly.
This can cause system instability even though the memory is not bad; you will
still get errors because of the timings/multipliers etc.! (I tested the RAM in
another machine: no errors. Moved it to the Gigabyte board with default
settings: memory errors, and hence system instability!)
> I couldn't find a way to change interrupts (can one do that at will with the
> Linux kernel?) so my backup script unloads the pata_jmicron module
> before it attaches the third backup disk to the md1 array.
I hardly ever use modules, and I do not understand why people do, at least
for their main OS/system drivers. For cameras, USB devices, etc., I can see
how that would be useful, but I compile everything in when possible,
and only what is necessary.
>
> But it still hangs if there's any significant network traffic. Maybe,
> even though I've gotten rid of anything using the same IRQ as the
> Realtek - IDE or pata_jmicron - the NIC driver is still flubbing interrupts
> and that's confusing the kernel?
How often do you use the CD/DVD drive? There are SATA drives for $20-30 at Newegg
if you think the IDE/JMicron is the culprit for most of your problems.
>
> Thanks for the advice Justin. Maybe the solution is to abandon use
> of the Realtek NIC (a pity to 'waste' what's freely available on the
> motherboard, though, in a way).
No problem, suggestions:
1. Run memtest86, ensure no errors after 1-2 passes.
2. Buy intel pci-e nic, ~$30
3. Buy sata dvd+rw, ~$20
Justin.
* Re: System hangs on raid md recovery/resync - revisit
From: Kasper Sandberg @ 2009-03-01 14:14 UTC
To: Justin Piszcz; +Cc: Brad, linux-raid
On Sat, 2009-02-28 at 07:04 -0500, Justin Piszcz wrote:
>
> On Sat, 28 Feb 2009, Brad wrote:
>
> > On Sat, Feb 28, 2009 at 7:08 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> >>
> >> On Sat, 28 Feb 2009, Brad wrote:
> >>
> >>> Hi. I'd like to revisit a problem I put to the mailing list on the
<snip>
> > I've had another problem with the Realtek network driver ... under network
> > load it seemed to miss interrupts and/or pass them to the IDE driver, which
> > would print out errors about unexpected/unknown interrupts. I had to take
> > IDE out of my kernel.
> Correct, buy an Intel 1GBPS PCI-e card, I do for all of my main machines
> that do not have Intel NICs, solves the problem. They are $30-40 and then
> all of your network issues will be solved.
>
> >
I have a Gigabyte X48 board with two of those Realtek NICs, and apart
from some driver troubles which the r8169 maintainer fixed for me, I've
had no issues with it.
I suggest contacting the maintainer if you really believe it's the NIC
and/or driver.
<snip>
> Justin.