cerberus on 2.4.17-rc2 UP

public inbox for linux-kernel@vger.kernel.org

* cerberus on 2.4.17-rc2 UP
@ 2001-12-20 12:59 marc. h.
  2001-12-21 16:56 ` Marcelo Tosatti
  0 siblings, 1 reply; 11+ messages in thread
From: marc. h. @ 2001-12-20 12:59 UTC (permalink / raw)
  To: linux-kernel

[I am not subscribed to the list, please CC me directly]

Hi,

I tried out the latest cerberus from
http://people.redhat.com/bmatthews/cerberus/ on a UP redhat-7.2 box. I ran the
standard non-destructive RedHat tests.

It ran for about 14 hours and then became unresponsive. The machine still
pinged, and I could switch VCs and scroll up on the console, but that's it. I
could not log in, etc. Another point is that the hard drive light remained on
but the drive was not seeking; it seemed dead silent.

Kernel was compiled using redhat's stock gcc-2.96. Unfortunately, I did not have
sysrq turned on (I just assumed there would be no problem on UP). The machine
is very well ventilated. 

The symptoms sound very much like what Bob Matthews reported on his 8 way...
just that they take longer to manifest themselves.

I'm willing to test/provide more info. The machine is slated for production
soon so I might only have a week or 2. I'm running the tests again now with
sysrq turned on.

Machine has 128Mb, 270Mb swap, IDE, ext3. no DRI or X. stock 2.4.17-rc2 no
patches applied.

Here is dmesg, lspci, lsmod, /proc/cpuinfo output:

00:02.0 VGA compatible controller: Intel Corporation 82815 CGC [Chipset Graphics Controller]  (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801BAM PCI (rev 11)
00:1f.0 ISA bridge: Intel Corporation 82801BA ISA Bridge (ICH2) (rev 11)
00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 (rev 11)
00:1f.2 USB Controller: Intel Corporation 82801BA(M) USB (Hub A) (rev 11)
00:1f.3 SMBus: Intel Corporation 82801BA(M) SMBus (rev 11)
00:1f.4 USB Controller: Intel Corporation 82801BA(M) USB (Hub B) (rev 11)
01:00.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
01:04.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
01:08.0 Ethernet controller: Intel Corporation 82801BA(M) Ethernet (rev 03)

Module                  Size  Used by
eepro100               16896   1
3c59x                  25280   1
md                     43840   0  (unused)

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Celeron (Coppermine)
stepping        : 10
cpu MHz         : 801.838
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse
bogomips        : 1599.07

-m

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/


* Re: cerberus on 2.4.17-rc2 UP
@ 2001-12-20 16:26 Michael Govorun
  0 siblings, 0 replies; 11+ messages in thread
From: Michael Govorun @ 2001-12-20 16:26 UTC (permalink / raw)
  To: linux-kernel


On Thu, Dec 20, 2001 at 04:22:38PM +0300, Michael Govorun wrote:
> 
> I got the same problem on p120 machine with 16Mb RAM, IDE,
> (ext3/ext2) under stock 2.4.16 + trustix 1.5. I got this several
> times. Every time under big load and swap activity. 
> But now I can't supply more info since I'm switched to software RAID1
> on SCSI disks and can't reproduce this.

hmmm, maybe that's a data point.. you should forward it to the list.

-m

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/


* Re: cerberus on 2.4.17-rc2 UP
  2001-12-20 12:59 cerberus on 2.4.17-rc2 UP marc. h.
@ 2001-12-21 16:56 ` Marcelo Tosatti
  2001-12-22 20:22   ` Marc Heckmann
  2002-01-07 11:14   ` marc. h.
  0 siblings, 2 replies; 11+ messages in thread
From: Marcelo Tosatti @ 2001-12-21 16:56 UTC (permalink / raw)
  To: marc. h.; +Cc: linux-kernel


Can you please run Cerberus again and give me more information ?

I want Alt+SysRQ+T, Alt+SysRQ+M and Alt+SysRQ+P output.

If those keys simply print the sysrq header, please try Alt+SysRQ+8 then
the above again.

Thanks

On Thu, 20 Dec 2001, marc. h. wrote:

> I tried out the latest cerberus from
> http://people.redhat.com/bmatthews/cerberus/ on a UP redhat-7.2 box. I ran the
> standard non-destructive RedHat tests.
> 
> It ran for about 14 hours and then became unresponsive..  machine still ping'ed
> , I could switch VC's scroll up on console, but that's it. Could not log in,
> etc.. Another point is that the hard drive light remained on but it was not
> seeking, it seemed dead silent.




* Re: cerberus on 2.4.17-rc2 UP
  2001-12-21 16:56 ` Marcelo Tosatti
@ 2001-12-22 20:22   ` Marc Heckmann
  2002-01-07 11:14   ` marc. h.
  1 sibling, 0 replies; 11+ messages in thread
From: Marc Heckmann @ 2001-12-22 20:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Fri, Dec 21, 2001 at 02:56:34PM -0200, Marcelo Tosatti wrote:
> 
> Can you please run Cerberus again and give me more information ?

I did and the machine made it through this time (in 18 hours). The thing
is that I skipped the LTP and crashme tests because last time they
finished successfully in 8 hours and the machine only locked up after 14, so
I thought that one of the other tests that was still running was
responsible.

In any case, I don't have access to the box till Thursday due to the
holidays. I will re-run with all tests then.

Cheers, 

> 
> On Thu, 20 Dec 2001, marc. h. wrote:
> 
> > I tried out the latest cerberus from
> > http://people.redhat.com/bmatthews/cerberus/ on a UP redhat-7.2 box. I ran the
> > standard non-destructive RedHat tests.
> > 
> > It ran for about 14 hours and then became unresponsive..  machine still ping'ed
> > , I could switch VC's scroll up on console, but that's it. Could not log in,
> > etc.. Another point is that the hard drive light remained on but it was not
> > seeking, it seemed dead silent.
> 
> 
> 

-m

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/


* Re: cerberus on 2.4.17-rc2 UP
  2001-12-21 16:56 ` Marcelo Tosatti
  2001-12-22 20:22   ` Marc Heckmann
@ 2002-01-07 11:14   ` marc. h.
  2002-01-08 15:48     ` [problem captured] " marc. h.
  1 sibling, 1 reply; 11+ messages in thread
From: marc. h. @ 2002-01-07 11:14 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Fri, Dec 21, 2001 at 02:56:34PM -0200, Marcelo Tosatti wrote:
> 
> Can you please run Cerberus again and give me more information ?

ok, I *finally* got it to deadlock again. The trick is to run 2 simultaneous
cerberus runs. Same symptoms: it pings, I can change VCs, and the hard drive
light is constantly on, but the drive is silent and the light never blinks. I
had sysrq turned on this time (tested before the run), but once deadlocked,
doing Alt+SysRQ+8, Alt+SysRQ+T, etc. would print nothing at all.

-m

> 
> I want Alt+SysRQ+T, Alt+SysRQ+M and Alt+SysRQ+P output.
> 
> If those keys simply print the sysrq header, please try Alt+SysRQ+8 then
> the above again.
> 
> Thanks
> 
> On Thu, 20 Dec 2001, marc. h. wrote:
> 
> > I tried out the latest cerberus from
> > http://people.redhat.com/bmatthews/cerberus/ on a UP redhat-7.2 box. I ran the
> > standard non-destructive RedHat tests.
> > 
> > It ran for about 14 hours and then became unresponsive..  machine still ping'ed
> > , I could switch VC's scroll up on console, but that's it. Could not log in,
> > etc.. Another point is that the hard drive light remained on but it was not
> > seeking, it seemed dead silent.
> 

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/


* [problem captured] Re: cerberus on 2.4.17-rc2 UP
  2002-01-07 11:14   ` marc. h.
@ 2002-01-08 15:48     ` marc. h.
  2002-01-08 16:13       ` Alan Cox
  0 siblings, 1 reply; 11+ messages in thread
From: marc. h. @ 2002-01-08 15:48 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

ok,

I managed to get it to do it again. Captured the problem on serial console:

--------------------------------------------

end_request: buffer-list destroyed
hda1: bad access: block=12440, count=-8
end_request: I/O error, dev 03:01 (hda), sector 12440
hda1: bad access: block=12448, count=-16
end_request: I/O error, dev 03:01 (hda), sector 12448
hda: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hda: drive not ready for command
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt <-- they never stop appearing
[..more of the same..]

------------------------------------------
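
For anyone triaging a similar serial-console capture, here is a minimal
sketch in Python that pulls the "bad access" lines out of a log and flags
negative counts. The function name and the sample text are only
illustrative; the regular expression is written against the message format
shown in the capture above and is an assumption about how such lines look
in general.

```python
import re

# Matches lines such as "hda1: bad access: block=12440, count=-8".
# The pattern is derived from the capture above, not from kernel source.
BAD_ACCESS = re.compile(
    r"^(?P<dev>\w+): bad access: block=(?P<block>\d+), count=(?P<count>-?\d+)"
)


def find_bad_accesses(log_text):
    """Return (device, block, count) tuples for every 'bad access' line."""
    hits = []
    for line in log_text.splitlines():
        match = BAD_ACCESS.match(line.strip())
        if match:
            hits.append(
                (match.group("dev"), int(match.group("block")), int(match.group("count")))
            )
    return hits


# Sample taken from the capture above, shortened.
SAMPLE = """\
end_request: buffer-list destroyed
hda1: bad access: block=12440, count=-8
end_request: I/O error, dev 03:01 (hda), sector 12440
hda1: bad access: block=12448, count=-16
hda: lost interrupt
"""

if __name__ == "__main__":
    for dev, block, count in find_bad_accesses(SAMPLE):
        flag = "NEGATIVE" if count < 0 else "ok"
        print(f"{dev} block={block} count={count} [{flag}]")
```

A negative count is the interesting part: a request should never carry a
negative sector count, so every flagged line points at a request that was
modified after it was queued.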

Is this a bug or could it be the hardware's fault? The hardware is new. I also
have the output of sysrq T+M+P; if that would be useful to anyone, just ask
(I'd rather save the bandwidth and not send it over the list). hdparm -iv and
lspci output follow:


/dev/hda:
 multcount    = 16 (on)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 2501/255/63, sectors = 40188960, start = 0

 Model=IC35L020AVER07-0, FwRev=ER2OA45A, SerialNo=SVPTVFQ8610
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40
 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=-66060037, LBA=yes, LBAsects=40188960
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=yes: disabled (255)
 Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-2 ATA-3 ATA-4 ATA-5
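
A side note on the CurSects=-66060037 value above: it does not look like a
drive fault. The minimal Python sketch below (the helper name is
hypothetical) shows that this number is exactly what results when the
32-bit sector count for CurCHS=16383/16/63 has its two 16-bit words
swapped and is then printed as a signed integer. That hdparm does this
here is an assumption; the arithmetic itself is exact.

```python
def swap_words_signed32(value):
    """Swap the two 16-bit halves of a 32-bit value and read it as signed."""
    low = value & 0xFFFF
    high = (value >> 16) & 0xFFFF
    swapped = (low << 16) | high
    # Reinterpret as a signed 32-bit integer.
    return swapped - (1 << 32) if swapped & 0x80000000 else swapped


# CurCHS=16383/16/63 from the hdparm output above.
cur_sectors = 16383 * 16 * 63

# LBAsects from the hdparm output above, with 512-byte sectors.
LBA_SECTORS = 40188960
capacity_bytes = LBA_SECTORS * 512

if __name__ == "__main__":
    print("CHS sector count:", cur_sectors)
    print("word-swapped, signed:", swap_words_signed32(cur_sectors))
    print("LBA capacity in bytes:", capacity_bytes)
```

The LBA figure works out to roughly 20.5 GB, so the capacity the drive
reports through LBAsects is self-consistent even though CurSects prints
oddly.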

ontroller Hub (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82815 CGC [Chipset Graphics Controller]  (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801BAM PCI (rev 11)
00:1f.0 ISA bridge: Intel Corporation 82801BA ISA Bridge (ICH2) (rev 11)
00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 (rev 11)
00:1f.2 USB Controller: Intel Corporation 82801BA(M) USB (Hub A) (rev 11)
00:1f.3 SMBus: Intel Corporation 82801BA(M) SMBus (rev 11)
00:1f.4 USB Controller: Intel Corporation 82801BA(M) USB (Hub B) (rev 11)
01:00.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
01:04.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
01:08.0 Ethernet controller: Intel Corporation 82801BA(M) Ethernet (rev 03)

-m


On Mon, Jan 07, 2002 at 12:14:24PM +0100, marc. h. wrote:
> On Fri, Dec 21, 2001 at 02:56:34PM -0200, Marcelo Tosatti wrote:
> > 
> > Can you please run Cerberus again and give me more information ?
> 
> ok, I *finally* got it to deadlock again.. trick is to run 2 simultaneous
> cerberus runs.. same symptoms, pings, can change VC's, hard drive light
> constantly on but silent and no blinks. I had sysrq turned on this time (tested
> before the run), but once deadlocked, doing Alt+SysRQ+8, Alt+SysRQ+T, etc would
> print nothing at all.. 
> 
> -m
> 
> > 
> > I want Alt+SysRQ+T, Alt+SysRQ+M and Alt+SysRQ+P output.
> > 
> > If those keys simply print the sysrq header, please try Alt+SysRQ+8 then
> > the above again.
> > 
> > Thanks
> > 
> > On Thu, 20 Dec 2001, marc. h. wrote:
> > 
> > > I tried out the latest cerberus from
> > > http://people.redhat.com/bmatthews/cerberus/ on a UP redhat-7.2 box. I ran the
> > > standard non-destructive RedHat tests.
> > > 
> > > It ran for about 14 hours and then became unresponsive..  machine still ping'ed
> > > , I could switch VC's scroll up on console, but that's it. Could not log in,
> > > etc.. Another point is that the hard drive light remained on but it was not
> > > seeking, it seemed dead silent.
> > 
> 
> -- 
> 	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
> 	key: http://people.hbesoftware.com/~heckmann/

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/


* Re: [problem captured] Re: cerberus on 2.4.17-rc2 UP
  2002-01-08 15:48     ` [problem captured] " marc. h.
@ 2002-01-08 16:13       ` Alan Cox
  2002-01-08 20:33         ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Alan Cox @ 2002-01-08 16:13 UTC (permalink / raw)
  To: marc. h.; +Cc: Marcelo Tosatti, linux-kernel

> end_request: buffer-list destroyed
> hda1: bad access: block=12440, count=-8
> end_request: I/O error, dev 03:01 (hda), sector 12440
> hda1: bad access: block=12448, count=-16

That looks like a race in the IDE/block layer (or maybe somewhere above it).
Someone trashed a request in progress.

> Is this a bug or could it be the hardware's fault? The hardware is new lspci

Other people have reported it too. It's clearly a kernel race.


* Re: [problem captured] Re: cerberus on 2.4.17-rc2 UP
  2002-01-08 16:13       ` Alan Cox
@ 2002-01-08 20:33         ` Andrew Morton
  2002-01-08 21:05           ` Alex Scheele
                             ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrew Morton @ 2002-01-08 20:33 UTC (permalink / raw)
  To: Alan Cox; +Cc: marc. h., Marcelo Tosatti, linux-kernel

Alan Cox wrote:
> 
> > end_request: buffer-list destroyed
> > hda1: bad access: block=12440, count=-8
> > end_request: I/O error, dev 03:01 (hda), sector 12440
> > hda1: bad access: block=12448, count=-16
> 
> That looks like a race in the IDE/block layer (or somewhere above it maybe)
> Someone trashed a request in progress.
> 
> > Is this a bug or could it be the hardware's fault? The hardware is new lspci
> 
> Other people have reported it too. Its clearly a kernel race

Yes, I can generate it at will on two quite different IDE machines
with the run-bash-shared-mapping script from
http://www.zip.com.au/~akpm/ext3-tools.tar.gz

It's on my list of things-to-do, filed under "hard".  It even happens
on uniprocessor, with unmask_irq=0.

Interestingly, I _think_ it only ever occurs against the
swap device.  But I need to confirm this.  Marc, do you
have swap on /dev/hda1?

-


* RE: [problem captured] Re: cerberus on 2.4.17-rc2 UP
  2002-01-08 20:33         ` Andrew Morton
@ 2002-01-08 21:05           ` Alex Scheele
  2002-01-09  9:37           ` marc. h.
  2002-01-16 10:35           ` marc. h.
  2 siblings, 0 replies; 11+ messages in thread
From: Alex Scheele @ 2002-01-08 21:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: heckmann, Lkml

Andrew Morton wrote:
> 
> 
> Alan Cox wrote:
> > 
> > > end_request: buffer-list destroyed
> > > hda1: bad access: block=12440, count=-8
> > > end_request: I/O error, dev 03:01 (hda), sector 12440
> > > hda1: bad access: block=12448, count=-16
> > 
> > That looks like a race in the IDE/block layer (or somewhere 
> above it maybe)
> > Someone trashed a request in progress.
> > 
> > > Is this a bug or could it be the hardware's fault? The 
> hardware is new lspci
> > 
> > Other people have reported it too. Its clearly a kernel race
> 
> Yes, I can generate it at will on two quite different IDE machines
> with the run-bash-shared-mapping script from
> http://www.zip.com.au/~akpm/ext3-tools.tar.gz
> 
> It's on my list of things-to-do, filed under "hard".  It even happens
> on uniprocessor, with unmask_irq=0.
> 
> Interestingly, I _think_ it only ever occurs against the
> swap device.  But I need to confirm this.  Marc, do you
> have swap on /dev/hda1?

I have had this problem on several machines too, but not only
against the swap device. I have 1 machine with a SCSI disk as
root disk /dev/sda1; the swap device is /dev/sda2.
Then there is a 4 disk ide raid0 (software raid) mounted
on /mnt, and if I run it there I have the same problem.
This machine is an SMP machine, though it has also happened
on UP machines.

Hope it helps.

--
	Alex (alex@packetstorm.nu)




* Re: [problem captured] Re: cerberus on 2.4.17-rc2 UP
  2002-01-08 20:33         ` Andrew Morton
  2002-01-08 21:05           ` Alex Scheele
@ 2002-01-09  9:37           ` marc. h.
  2002-01-16 10:35           ` marc. h.
  2 siblings, 0 replies; 11+ messages in thread
From: marc. h. @ 2002-01-09  9:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Tue, Jan 08, 2002 at 12:33:33PM -0800, Andrew Morton wrote:
> Alan Cox wrote:
> > 
> > > end_request: buffer-list destroyed
> > > hda1: bad access: block=12440, count=-8
> > > end_request: I/O error, dev 03:01 (hda), sector 12440
> > > hda1: bad access: block=12448, count=-16
> > 
> > That looks like a race in the IDE/block layer (or somewhere above it maybe)
> > Someone trashed a request in progress.
> > 
> > Other people have reported it too. Its clearly a kernel race
> 
> Yes, I can generate it at will on two quite different IDE machines
> with the run-bash-shared-mapping script from
> http://www.zip.com.au/~akpm/ext3-tools.tar.gz

does this mean it's an ext3 bug? (haven't tried to reproduce it using ext2)

> It's on my list of things-to-do, filed under "hard".  It even happens
> on uniprocessor, with unmask_irq=0.

yes. my machine is UP and unmask_irq is 0.

> Interestingly, I _think_ it only ever occurs against the
> swap device.  But I need to confirm this.  Marc, do you
> have swap on /dev/hda1?

nope. hda1 is /.
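
One way to double-check this on a running box is to read /proc/swaps. A
minimal sketch in Python follows; the function names are hypothetical, and
the sample text is made up for illustration (it is not output from this
machine). It assumes the usual layout of one header line followed by one
line per swap area, with the device path in the first column.

```python
def swap_devices(proc_swaps_text):
    """Return the first column of every non-header line of /proc/swaps."""
    lines = proc_swaps_text.splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]


def is_swap(device, proc_swaps_text):
    """True if the given device path is listed as an active swap area."""
    return device in swap_devices(proc_swaps_text)


# Illustrative sample only, NOT taken from the machine in this thread.
SAMPLE_SWAPS = (
    "Filename\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/hda2\t\t\tpartition\t276472\t0\t-1\n"
)

if __name__ == "__main__":
    # On a live system, pass open("/proc/swaps").read() instead of the sample.
    print(swap_devices(SAMPLE_SWAPS))
    print(is_swap("/dev/hda1", SAMPLE_SWAPS))
```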

-m

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/


* Re: [problem captured] Re: cerberus on 2.4.17-rc2 UP
  2002-01-08 20:33         ` Andrew Morton
  2002-01-08 21:05           ` Alex Scheele
  2002-01-09  9:37           ` marc. h.
@ 2002-01-16 10:35           ` marc. h.
  2 siblings, 0 replies; 11+ messages in thread
From: marc. h. @ 2002-01-16 10:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

ok, the ext3-2.4-0.9.17 patch fixes this bug. Thank you. That means that
2.4.17-rc2 with the patch applied makes it through a full *double* cerberus run
successfully. I also tried the same with ext2 and it made it through as well. I
plan to try an 18-preX as well soon. If only such a test wasn't a day and a
half long...

The only other thing that seems to be affecting a lot of people (including a
friend of mine), and that I can't reproduce here, is the oopses... A box that
was stable with 2.4.12 oopses within a short time with 2.4.1[67]. The box in
question is a Dell desktop running a netfilter firewall.

-m

On Tue, Jan 08, 2002 at 12:33:33PM -0800, Andrew Morton wrote:
> Alan Cox wrote:
> > 
> > Other people have reported it too. Its clearly a kernel race
> 
> Yes, I can generate it at will on two quite different IDE machines
> with the run-bash-shared-mapping script from
> http://www.zip.com.au/~akpm/ext3-tools.tar.gz
> 
> It's on my list of things-to-do, filed under "hard".  It even happens
> on uniprocessor, with unmask_irq=0.
> 
> Interestingly, I _think_ it only ever occurs against the
> swap device.  But I need to confirm this.  Marc, do you
> have swap on /dev/hda1?
> 

-- 
	C3C5 9226 3C03 CDF7 2EF1  029F 4CAD FBA4 F5ED 68EB
	key: http://people.hbesoftware.com/~heckmann/
