* Re: Software RAID1 deadlock in 2.6.25 kernels
[not found] <48650567.3000501@w1nr.net>
@ 2008-06-27 20:47 ` Neil Brown
2008-06-30 9:23 ` Gabor Gombas
0 siblings, 1 reply; 21+ messages in thread
From: Neil Brown @ 2008-06-27 20:47 UTC (permalink / raw)
To: Mike McCarthy; +Cc: linux-raid
On Friday June 27, mike@w1nr.net wrote:
> Hello Neil and Linus,
I've dropped Linus from cc as he won't be interested, and added
linux-raid.
> I am contacting both of you because you have submitted md raid code
> to the 2.6.25 kernels. I have found an apparent deadlock condition
> using software RAID1 with SUSE 11.0 and Fedora 9, both of which use the
> 2.6.25 kernels. Configuration is as follows:
>
> Dell Precision 450 with single Xeon 2.4 GHz processor
> (hyperthreaded, looks like a dual core)
> 2 Seagate 120GB IDE drives
> sda1,sdb1 - 1GB swap
> sda2,sdb2 - 100MB, RAID1, md0, /boot
> sda3,sdb3 - balance of disk, RAID1, md1 3 LVM filesystems
>
> System locks up after running a short time. Had it hang once during
> installation. Tried both Reiserfs and EXT3.
We need more details about the lockup. Does the whole system lock up,
or just the array?
Can you get
alt-sysrq-T
output? That might be useful.
>
> Are either of you aware of this issue? If not, I can file a bugzilla.
No, I'm not aware of such an issue.
I prefer to avoid bugzilla; let's just discuss it here.
NeilBrown
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-27 20:47 ` Neil Brown
@ 2008-06-30 9:23 ` Gabor Gombas
2008-06-30 11:31 ` Mike McCarthy
0 siblings, 1 reply; 21+ messages in thread
From: Gabor Gombas @ 2008-06-30 9:23 UTC (permalink / raw)
To: Neil Brown; +Cc: Mike McCarthy, linux-raid
On Sat, Jun 28, 2008 at 06:47:29AM +1000, Neil Brown wrote:
> > System locks up after running a short time. Had it hang once during
> > installation. Tried both Reiserfs and EXT3.
>
> We need more details about the lockup. Does the whole system lock up,
> or just the array.
> Can you get
> alt-sysrq-T
> output? That might be useful.
Also, try to set up netconsole and boot with nmi_watchdog=2. I now have 3
completely different HW setups where 2.6.25 hangs due to HPET issues.
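The boot setup suggested above could be assembled roughly as follows. This is only a sketch: the IP addresses, ports, interface name and MAC address are made-up placeholders, and both helper functions are hypothetical names used for illustration. It formats the netconsole= option in the form described in the kernel's netconsole documentation, [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr], and appends nmi_watchdog=2.

```python
def build_netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac):
    """Format a netconsole= option: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac."""
    return f"netconsole={src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def build_boot_params(netconsole_param, nmi_watchdog=2):
    """Join the extra options to append to the kernel command line."""
    return f"{netconsole_param} nmi_watchdog={nmi_watchdog}"


if __name__ == "__main__":
    # All values below are placeholders; substitute your own network details.
    nc = build_netconsole_param(6665, "192.168.1.10", "eth0",
                                6666, "192.168.1.20", "00:11:22:33:44:55")
    print(build_boot_params(nc))
```

The resulting string would be appended to the kernel line in the boot loader configuration. A second machine listening for UDP on the target port then receives the kernel messages, so an Oops is preserved even when the local disks and console are unusable.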
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 9:23 ` Gabor Gombas
@ 2008-06-30 11:31 ` Mike McCarthy
2008-06-30 11:59 ` Michael Bussmann
0 siblings, 1 reply; 21+ messages in thread
From: Mike McCarthy @ 2008-06-30 11:31 UTC (permalink / raw)
To: Gabor Gombas; +Cc: Neil Brown, linux-raid
Gabor Gombas wrote:
> On Sat, Jun 28, 2008 at 06:47:29AM +1000, Neil Brown wrote:
>
>
>>> System locks up after running a short time. Had it hang once during
>>> installation. Tried both Reiserfs and EXT3.
>>>
>> We need more details about the lockup. Does the whole system lock up,
>> or just the array.
>> Can you get
>> alt-sysrq-T
>> output? That might be useful.
>>
>
> Also, try to set up netconsole and boot with nmi_watchdog=2. I've now 3
> completely different HW setups where 2.6.25 hangs due to HPET issues.
>
> Gabor
>
>
Due to time constraints, I will be unable to do much on this until the
weekend.
In the meantime, a more detailed description of the symptoms:
When the system hangs, the mouse movement is tracked across the screen
and I can ping the node. There is no response to clicking on a mouse
button or trying to type anything into a field. It also does not respond
to ssh or to <CTRL><ALT><F1>.
I may also try and reproduce it on different hardware to rule out a
graphics problem if you wish.
Mike
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 11:31 ` Mike McCarthy
@ 2008-06-30 11:59 ` Michael Bussmann
2008-06-30 13:32 ` Bill Davidsen
0 siblings, 1 reply; 21+ messages in thread
From: Michael Bussmann @ 2008-06-30 11:59 UTC (permalink / raw)
To: linux-raid
Hi,
On 2008-06-30 07:31:28 -0400, Mike McCarthy wrote:
>>>> System locks up after running a short time. Had it hang once
>>>> during installation. Tried both Reiserfs and EXT3.
> When the system hangs, the mouse movement is tracked across the screen
> and I can ping the node. There is no response to clicking on a mouse
> button or trying to type anything into a field. It also does not respond
> to ssh or to <CTRL><ALT><F1>.
Maybe it's a totally different issue, but I also noticed system lockups,
that started after I converted the system to Software-RAID1. However, in
my case the lockups only occur after 3-10 days uptime. One day I was able
to capture a couple of syslog entries:
| Jun 12 09:50:47 tardis kernel: hdg: lost interrupt
| Jun 12 09:50:47 tardis kernel: hdg: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
| Jun 12 09:50:47 tardis kernel: hdg: drive_cmd: error=0x04 { DriveStatusError }
| Jun 12 09:50:47 tardis kernel: ide: failed opcode was: 0xb0
| Jun 12 09:51:07 tardis kernel: hdg: dma_timer_expiry: dma status == 0x21
| (2 x WD2500SB-01RFA0 on a PDC20276 (MBFastTrak133))
The HDD LED is permanently on.
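For reference, the status and error bytes in the log above can be decoded by hand. The following is a minimal sketch, assuming the standard ATA status register bit layout. The names for the bits that the log itself shows (DriveReady, SeekComplete, Error, DriveStatusError) are taken from the log; the remaining names and the `decode` helper are assumptions for illustration.

```python
# ATA status register bits, highest bit first.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Only the abort bit is named here; it is the one the log reports.
ERROR_BITS = [
    (0x04, "DriveStatusError"),
]


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    print("status=0x51:", decode(0x51, STATUS_BITS))
    print("error=0x04:", decode(0x04, ERROR_BITS))
```

Opcode 0xb0 is the ATA SMART command, so the command that failed was probably a SMART query (for example from a monitoring daemon), although the log alone does not say what issued it.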
Cheers,
MB
--
Michael Bussmann <bus@mb-net.net>
BOFH excuse #136:
Daemons loose in system.
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 11:59 ` Michael Bussmann
@ 2008-06-30 13:32 ` Bill Davidsen
2008-06-30 13:49 ` Mike McCarthy
0 siblings, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2008-06-30 13:32 UTC (permalink / raw)
To: Michael Bussmann; +Cc: linux-raid
Michael Bussmann wrote:
> Hi,
>
> On 2008-06-30 07:31:28 -0400, Mike McCarthy wrote:
>
>>>>> System locks up after running a short time. Had it hang once
>>>>> during installation. Tried both Reiserfs and EXT3.
>>>>>
>
>
>> When the system hangs, the mouse movement is tracked across the screen
>> and I can ping the node. There is no response to clicking on a mouse
>> button or trying to type anything into a field. It also does not respond
>> to ssh or to <CTRL><ALT><F1>.
>>
>
> Maybe it's a totally different issue, but I also noticed system lockups,
> that started after I converted the system to Software-RAID1. However, in
> my case the lockups only occur after 3-10 days uptime. One day I was able
> to capture a couple of syslog entries:
>
> | Jun 12 09:50:47 tardis kernel: hdg: lost interrupt
> | Jun 12 09:50:47 tardis kernel: hdg: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> | Jun 12 09:50:47 tardis kernel: hdg: drive_cmd: error=0x04 { DriveStatusError }
> | Jun 12 09:50:47 tardis kernel: ide: failed opcode was: 0xb0
> | Jun 12 09:51:07 tardis kernel: hdg: dma_timer_expiry: dma status == 0x21
> | (2 x WD2500SB-01RFA0 on a PDC20276 (MBFastTrak133))
>
> The HDD LED is permanently on.
>
I wonder whether this is a hardware or a software problem. It sounds like a
mishandled hardware error, but I'm guessing. I have a server with RAID1 and
the Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 13:32 ` Bill Davidsen
@ 2008-06-30 13:49 ` Mike McCarthy
2008-06-30 13:56 ` Justin Piszcz
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Mike McCarthy @ 2008-06-30 13:49 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Michael Bussmann, linux-raid
Bill Davidsen wrote:
>
> Wonder if hardware or software is happening, sounds like an mishandled
> hardware error, but I'm guessing. I have a server with RAID1 and
> Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>
2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9 and
SUSE 11.0)
Mike
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 13:49 ` Mike McCarthy
@ 2008-06-30 13:56 ` Justin Piszcz
2008-06-30 20:21 ` Richard Scobie
2008-07-01 15:34 ` Bill Davidsen
2 siblings, 0 replies; 21+ messages in thread
From: Justin Piszcz @ 2008-06-30 13:56 UTC (permalink / raw)
To: Mike McCarthy; +Cc: Bill Davidsen, Michael Bussmann, linux-raid
On Mon, 30 Jun 2008, Mike McCarthy wrote:
> Bill Davidsen wrote:
>>
>> Wonder if hardware or software is happening, sounds like an mishandled
>> hardware error, but I'm guessing. I have a server with RAID1 and Fedora
>> 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>
>
> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9 and SUSE
> 11.0)
>
> Mike
>
I am running 2.6.25 and have run every stable update up to the current
2.6.25.9. I use RAID1 on two hosts and have not seen any issues on either.
That error looks HW related. You should run some heavy disk I/O processes
for a few days and see if you can get it to recur on 2.6.22.
Or does it occur immediately after booting 2.6.25?
Justin.
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 20:21 ` Richard Scobie
@ 2008-06-30 20:19 ` michael
2008-07-01 19:00 ` David Rees
0 siblings, 1 reply; 21+ messages in thread
From: michael @ 2008-06-30 20:19 UTC (permalink / raw)
To: Richard Scobie; +Cc: Mike McCarthy, Bill Davidsen, Michael Bussmann, linux-raid
Richard Scobie wrote:
> Mike McCarthy wrote:
>> Bill Davidsen wrote:
>>
>>>
>>> Wonder if hardware or software is happening, sounds like an
>>> mishandled hardware error, but I'm guessing. I have a server with
>>> RAID1 and Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>>
>>
>> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9
>> and SUSE 11.0)
>
> FC9 running on RAID 1 (ata_piix) for the last 3 weeks or so, with no
> trouble.
Same here. I haven't had any issues with RAID 1 on Fedora 9.
So the issue isn't biting everyone but there is always the chance
something is wrong with a specific controller or configuration.
Michael
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 13:49 ` Mike McCarthy
2008-06-30 13:56 ` Justin Piszcz
@ 2008-06-30 20:21 ` Richard Scobie
2008-06-30 20:19 ` michael
2008-07-01 15:34 ` Bill Davidsen
2 siblings, 1 reply; 21+ messages in thread
From: Richard Scobie @ 2008-06-30 20:21 UTC (permalink / raw)
To: Mike McCarthy; +Cc: Bill Davidsen, Michael Bussmann, linux-raid
Mike McCarthy wrote:
> Bill Davidsen wrote:
>
>>
>> Wonder if hardware or software is happening, sounds like an mishandled
>> hardware error, but I'm guessing. I have a server with RAID1 and
>> Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>
>
> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9 and
> SUSE 11.0)
FC9 running on RAID 1 (ata_piix) for the last 3 weeks or so, with no
trouble.
Regards,
Richard
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 13:49 ` Mike McCarthy
2008-06-30 13:56 ` Justin Piszcz
2008-06-30 20:21 ` Richard Scobie
@ 2008-07-01 15:34 ` Bill Davidsen
2008-07-01 17:00 ` Mike McCarthy
2 siblings, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2008-07-01 15:34 UTC (permalink / raw)
To: Mike McCarthy; +Cc: Michael Bussmann, linux-raid
Mike McCarthy wrote:
> Bill Davidsen wrote:
>>
>> Wonder if hardware or software is happening, sounds like an
>> mishandled hardware error, but I'm guessing. I have a server with
>> RAID1 and Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>
>
> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9
> and SUSE 11.0)
Given heavy 2.6.25 use, my guess is still that the root cause of this is
hardware, and that the change in disk code either triggers the hardware
problem, or handles it differently. Are you by any chance running NCQ on
your system?
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-01 15:34 ` Bill Davidsen
@ 2008-07-01 17:00 ` Mike McCarthy
2008-07-01 19:45 ` Michael Bussmann
2008-07-07 16:07 ` Bill Davidsen
0 siblings, 2 replies; 21+ messages in thread
From: Mike McCarthy @ 2008-07-01 17:00 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Michael Bussmann, linux-raid
Bill Davidsen wrote:
> Mike McCarthy wrote:
>> Bill Davidsen wrote:
>>>
>>> Wonder if hardware or software is happening, sounds like an
>>> mishandled hardware error, but I'm guessing. I have a server with
>>> RAID1 and Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>>
>>
>> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9
>> and SUSE 11.0)
>
> Given heavy 2.6.25 use, my guess is still that the root cause of this
> is hardware, and that the change in disk code either triggers the
> hardware problem, or handles it differently. Are you by any chance
> running NCQ on your system?
>
No. This system and the drives pre-date NCQ. I think NCQ is only
implemented in SATA and these are IDE drives. Sometime over the
weekend, I am going to reload SUSE 11 and try to do some more debugging.
BTW: It's back to 10.3 (kernel 2.6.22) running happily with a VMware
server thrashing away at the disks.
Mike
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-06-30 20:19 ` michael
@ 2008-07-01 19:00 ` David Rees
0 siblings, 0 replies; 21+ messages in thread
From: David Rees @ 2008-07-01 19:00 UTC (permalink / raw)
To: michael@kmaclub.com
Cc: Richard Scobie, Mike McCarthy, Bill Davidsen, Michael Bussmann,
linux-raid
On Mon, Jun 30, 2008 at 1:19 PM, michael@kmaclub.com
<michael@kmaclub.com> wrote:
> Richard Scobie wrote:
>> Mike McCarthy wrote:
>>> Bill Davidsen wrote:
>>>> Wonder if hardware or software is happening, sounds like an mishandled
>>>> hardware error, but I'm guessing. I have a server with RAID1 and Fedora
>>>> 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>>
>>> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9 and
>>> SUSE 11.0)
>>
>> FC9 running on RAID 1 (ata_piix) for the last 3 weeks or so, with no
>> trouble.
>
> Same here. I haven't had any issues with RAID 1 on Fedora 9.
>
> So the issue isn't biting everyone but there is always the chance something
> is wrong with a specific controller or configuration.
FWIW, me too. I have three different Fedora 9 systems running the
latest Fedora kernels based on 2.6.25 which are running software RAID
1 without any issues at all.
-Dave
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-01 17:00 ` Mike McCarthy
@ 2008-07-01 19:45 ` Michael Bussmann
2008-07-02 10:37 ` Gabor Gombas
2008-07-07 16:07 ` Bill Davidsen
1 sibling, 1 reply; 21+ messages in thread
From: Michael Bussmann @ 2008-07-01 19:45 UTC (permalink / raw)
To: linux-raid
Hi,
On 2008-07-01 13:00:01 -0400, Mike McCarthy wrote:
> Bill Davidsen wrote:
>> Mike McCarthy wrote:
>>> Bill Davidsen wrote:
>>>>
>>>> Wonder if hardware or software is happening, sounds like an
>>>> mishandled hardware error, but I'm guessing. I have a server with
>>>> RAID1 and Fedora 2.6.22.14-72.fc6PAE kernel, up 72 days, no
>>>> problems.
I have a number of 2.6.25.7-9 machines with SW-RAID1 that are running
flawlessly so far.
>> Given heavy 2.6.25 use, my guess is still that the root cause of this
>> is hardware, and that the change in disk code either triggers the
>> hardware problem, or handles it differently. Are you by any chance
>> running NCQ on your system?
>>
> No. This system and the drives pre-date NCQ. I think NCQ is only
Same here. In my case the lockups are totally random and not related to
heavy disk I/O. Actually, most lockups occur when the system is quite idle.
So far I tried
- Replaced IDE cables
- kernel upgrades up to 2.6.25.9
- removed one drive from the RAID, thus running in degraded mode
- disabled CPU frequency scaling
- put one drive on the PDC20276, the other on the ICH4 (82801DB)
Maybe my hardware _is_ broken, but I'll try some other settings anyway
(including using RTC again for ztdummy instead of HPET, disabling NOHZ etc).
Cheers,
MB
--
Michael Bussmann <bus@mb-net.net>
* Re: Software RAID1 deadlock in 2.6.25 kernels
@ 2008-07-01 20:21 David Lethe
2008-07-01 21:24 ` michael
0 siblings, 1 reply; 21+ messages in thread
From: David Lethe @ 2008-07-01 20:21 UTC (permalink / raw)
To: David Rees, michael
Cc: Richard Scobie, Mike McCarthy, Bill Davidsen, Michael Bussmann,
linux-raid
Sounds like you worked for microsoft ... "i have 3 systems running XP and I never had a blue screen of death, so there is nothing wrong with the OS" :)
david
-----Original Message-----
From: "David Rees" <drees76@gmail.com>
Subj: Re: Software RAID1 deadlock in 2.6.25 kernels
Date: Tue Jul 1, 2008 2:03 pm
Size: 1K
To: "michael@kmaclub.com" <michael@kmaclub.com>
cc: "Richard Scobie" <richard@sauce.co.nz>; "Mike McCarthy" <mike@w1nr.net>; "Bill Davidsen" <davidsen@tmr.com>; "Michael Bussmann" <bus@mb-net.net>; "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
On Mon, Jun 30, 2008 at 1:19 PM, michael@kmaclub.com
<michael@kmaclub.com> wrote:
> Richard Scobie wrote:
>> Mike McCarthy wrote:
>>> Bill Davidsen wrote:
>>>> Wonder if hardware or software is happening, sounds like an mishandled
>>>> hardware error, but I'm guessing. I have a server with RAID1 and Fedora
>>>> 2.6.22.14-72.fc6PAE kernel, up 72 days, no problems.
>>>
>>> 2.6.22 is running fine. The problems are in the 2.6.25 kernel (FC9 and
>>> SUSE 11.0)
>>
>> FC9 running on RAID 1 (ata_piix) for the last 3 weeks or so, with no
>> trouble.
>
> Same here. I haven't had any issues with RAID 1 on Fedora 9.
>
> So the issue isn't biting everyone but there is always the chance something
> is wrong with a specific controller or configuration.
FWIW, me too. I have three different Fedora 9 systems running the
latest Fedora kernels based on 2.6.25 which are running software RAID
1 without any issues at all.
-Dave
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-01 20:21 Software RAID1 deadlock in 2.6.25 kernels David Lethe
@ 2008-07-01 21:24 ` michael
2008-07-01 21:42 ` David Rees
0 siblings, 1 reply; 21+ messages in thread
From: michael @ 2008-07-01 21:24 UTC (permalink / raw)
To: David
Cc: David Rees, Richard Scobie, Mike McCarthy, Bill Davidsen,
Michael Bussmann, linux-raid
David Lethe wrote:
> Sounds like you worked for microsoft ... "i have 3 systems running XP and I never had a blue screen of death, so there is nothing wrong with the OS" :)
>
I don't think anyone implied anything like that.
I think I was clear in my message that not everyone was seeing a problem
so it wasn't a widespread problem with 2.6.25.
That said, more information would be required about the original poster's
configuration if there is going to be any determination about a problem.
Michael
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-01 21:24 ` michael
@ 2008-07-01 21:42 ` David Rees
0 siblings, 0 replies; 21+ messages in thread
From: David Rees @ 2008-07-01 21:42 UTC (permalink / raw)
To: michael@kmaclub.com
Cc: David, Richard Scobie, Mike McCarthy, Bill Davidsen,
Michael Bussmann, linux-raid
On Tue, Jul 1, 2008 at 2:24 PM, michael@kmaclub.com <michael@kmaclub.com> wrote:
> David Lethe wrote:
>>
>> Sounds like you worked for microsoft ... "i have 3 systems running XP and
>> I never had a blue screen of death, so there is nothing wrong with the OS"
>> :)
>
> I don't think anyone implied anything like that.
>
> I think I was clear in my message that not everyone was seeing a problem so
> it wasn't a widespread problem with 2.6.25.
> That said, more information would be required about the original posters
> configuration if there is going to be any determination about a problem.
Exactly.
If anything, David Lethe looks like more of a Microsoft employee for
top-posting when it's clear that everyone else is bottom posting and
trimming their messages as appropriate. :-P (Even if his message was
posted tongue in cheek)
Anyway, this thread interests me because I have multiple systems
running RAID1 - if it is not hardware related or if I can help by
posting details of my apparently unaffected hardware, I will.
-Dave
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-01 19:45 ` Michael Bussmann
@ 2008-07-02 10:37 ` Gabor Gombas
2008-07-02 10:50 ` Gabor Gombas
0 siblings, 1 reply; 21+ messages in thread
From: Gabor Gombas @ 2008-07-02 10:37 UTC (permalink / raw)
To: Michael Bussmann; +Cc: linux-raid
On Tue, Jul 01, 2008 at 09:45:08PM +0200, Michael Bussmann wrote:
> So far I tried
> - Replaced IDE cables
> - kernel upgrades up to 2.6.25.9
> - removed one drive from the RAID, thus running in degraded mode
> - disabled CPU frequency scaling
> - put one drive on the PDC20276, the other on the ICH4 (82801DB)
Set up netconsole or serial console, boot with nmi_watchdog=2, and wait
for a lockup.
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-02 10:37 ` Gabor Gombas
@ 2008-07-02 10:50 ` Gabor Gombas
0 siblings, 0 replies; 21+ messages in thread
From: Gabor Gombas @ 2008-07-02 10:50 UTC (permalink / raw)
To: Michael Bussmann; +Cc: linux-raid
On Wed, Jul 02, 2008 at 12:37:54PM +0200, Gabor Gombas wrote:
> On Tue, Jul 01, 2008 at 09:45:08PM +0200, Michael Bussmann wrote:
>
> > So far I tried
> > - Replaced IDE cables
> > - kernel upgrades up to 2.6.25.9
> > - removed one drive from the RAID, thus running in degraded mode
> > - disabled CPU frequency scaling
> > - put one drive on the PDC20276, the other on the ICH4 (82801DB)
>
> Set up netconsole or serial console, boot with nmi_watchdog=2, and wait
> for a lockup.
Some motivation: I have a feeling that the HPET problem in 2.6.25 is
actually biting people other than me. But if you do not enable the NMI
watchdog and do not have a way to save the resulting Oops, then you
will never know what hit you. SysRq+W will show tasks blocked in places
related to disk I/O, but that's not the real reason, so messing with
the disks will not help you in this case.
And this also explains why machines w/o HPET (or HPET being disabled by
the BIOS or otherwise unaffected for yet unknown reasons) can run
2.6.25.x just fine.
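One quick way to see whether a given machine is using HPET as its clocksource is to read the clocksource entries in sysfs. The following is a rough sketch that assumes the usual /sys/devices/system/clocksource/clocksource0 location; the function names are hypothetical. It only reports which clocksource is selected and does not by itself prove or rule out the HPET problem described above.

```python
from pathlib import Path

# Usual sysfs location of the clocksource controls (assumed present).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def hpet_status(current, available):
    """Summarise whether HPET is selected, merely available, or absent."""
    if current == "hpet":
        return "in use"
    if "hpet" in available:
        return "available but not selected"
    return "not available"


if __name__ == "__main__":
    current, available = read_clocksources()
    print(f"current={current} available={available} hpet: {hpet_status(current, available)}")
```

Comparing this output between an affected machine and an unaffected one is a cheap first check before setting up the NMI watchdog.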
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-01 17:00 ` Mike McCarthy
2008-07-01 19:45 ` Michael Bussmann
@ 2008-07-07 16:07 ` Bill Davidsen
2008-07-07 18:11 ` Mike McCarthy
1 sibling, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2008-07-07 16:07 UTC (permalink / raw)
To: Mike McCarthy; +Cc: Michael Bussmann, linux-raid
Mike McCarthy wrote:
> Bill Davidsen wrote:
>>
>> Given heavy 2.6.25 use, my guess is still that the root cause of this
>> is hardware, and that the change in disk code either triggers the
>> hardware problem, or handles it differently. Are you by any chance
>> running NCQ on your system?
>>
> No. This system and the drives pre-date NCQ. I think NCQ is only
> implemented in SATA and these are IDE drives. Sometime over the
> weekend, I am going to reload SUSE 11 and try to do some more debugging.
>
> BTW: It's back to 10.3 (kernel 2.6.22) running happily with a VMware
> server thrashing away at the disks.
This has recycled back to the top of my todo list. I have a server in
mothballs with IDE drives; I'll pull it out, upgrade to FC9 current
(non-rawhide) and see if I have any problems. It's off due to lack of
need, not really obsolete, so it's a fair test. I'll put a few hundred
GB of raid-1 on it and beat on it.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-07 16:07 ` Bill Davidsen
@ 2008-07-07 18:11 ` Mike McCarthy
2008-07-08 3:24 ` Bill Davidsen
0 siblings, 1 reply; 21+ messages in thread
From: Mike McCarthy @ 2008-07-07 18:11 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Michael Bussmann, linux-raid
Bill Davidsen wrote:
> Mike McCarthy wrote:
>> Bill Davidsen wrote:
>>>
>>> Given heavy 2.6.25 use, my guess is still that the root cause of
>>> this is hardware, and that the change in disk code either triggers
>>> the hardware problem, or handles it differently. Are you by any
>>> chance running NCQ on your system?
>>>
>> No. This system and the drives pre-date NCQ. I think NCQ is only
>> implemented in SATA and these are IDE drives. Sometime over the
>> weekend, I am going to reload SUSE 11 and try to do some more debugging.
>>
>> BTW: It's back to 10.3 (kernel 2.6.22) running happily with a VMware
>> server thrashing away at the disks.
>
> This has recycled back to the top of my todo list, I have a server in
> mothballs with IDE drives, I'll pull it out, upgrade to FC9 current
> (non-rawhide) and see if I have any problems. It's off due to lack of
> need, not really obsolete, so it's a fair test. I'll put a few hundred
> GB of raid-1 on it and beat on it.
>
I was going to get back to you all today and let you know what I found.
On Thursday, I rebuilt the system with SUSE 11 but before I did I went
over all of the BIOS settings. The second IDE drive was set to "NONE"
instead of "AUTO". Well, the installation went without the previous
hitch of having to manually install grub on the first boot after the
install. It has also been running since then without issue.
Is it that simple? Could that be all that was wrong? What doesn't make
sense is how 10.3 (kernel 2.6.22) never had an issue. Perhaps without
the BIOS reporting the second drive, the later kernel chose the wrong
parameters setting it up and they didn't match what was set up by the
BIOS for the first drive?
Mike
* Re: Software RAID1 deadlock in 2.6.25 kernels
2008-07-07 18:11 ` Mike McCarthy
@ 2008-07-08 3:24 ` Bill Davidsen
0 siblings, 0 replies; 21+ messages in thread
From: Bill Davidsen @ 2008-07-08 3:24 UTC (permalink / raw)
To: Mike McCarthy; +Cc: Michael Bussmann, linux-raid
Mike McCarthy wrote:
> Bill Davidsen wrote:
>> Mike McCarthy wrote:
>>> Bill Davidsen wrote:
>>>>
>>>> Given heavy 2.6.25 use, my guess is still that the root cause of
>>>> this is hardware, and that the change in disk code either triggers
>>>> the hardware problem, or handles it differently. Are you by any
>>>> chance running NCQ on your system?
>>>>
>>> No. This system and the drives pre-date NCQ. I think NCQ is only
>>> implemented in SATA and these are IDE drives. Sometime over the
>>> weekend, I am going to reload SUSE 11 and try to do some more
>>> debugging.
>>>
>>> BTW: It's back to 10.3 (kernel 2.6.22) running happily with a VMware
>>> server thrashing away at the disks.
>>
>> This has recycled back to the top of my todo list, I have a server in
>> mothballs with IDE drives, I'll pull it out, upgrade to FC9 current
>> (non-rawhide) and see if I have any problems. It's off due to lack of
>> need, not really obsolete, so it's a fair test. I'll put a few
>> hundred GB of raid-1 on it and beat on it.
>>
> I was going to get back to you all today and let you know what I
> found. On Thursday, I rebuilt the system with SUSE 11 but before I
> did I went over all of the BIOS settings. The second IDE drive was
> set to "NONE" instead of "AUTO". Well, the installation went without
> the previous hitch of having to manually install grub on the first
> boot after the install. It has also been running since then without
> issue.
>
> Is it that simple? Could that be all that was wrong? What doesn't
> make sense is how 10.3 (kernel 2.6.22) never had an issue. Perhaps
> without the BIOS reporting the second drive, the later kernel chose
> the wrong parameters setting it up and they didn't match what was set
> up by the BIOS for the first drive?
Oh well, I needed to upgrade that system; I'll push testing back a day
or two. It could well have been that simple.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark