what causes Machine Check exception? revisited (2.2.18)

public inbox for linux-kernel@vger.kernel.org

* what causes Machine Check exception? revisited (2.2.18)
@ 2001-05-07  9:50 Juhan-Peep Ernits
  2001-05-07 10:57 ` Alan Cox
  0 siblings, 1 reply; 12+ messages in thread
From: Juhan-Peep Ernits @ 2001-05-07  9:50 UTC (permalink / raw)
  To: linux-kernel

Hello!

After searching the archives of the list I found some similar reports
from September and December 2000 but as far as I understood the cause of
the error was blamed on the CPU. Is this the most probable case? 

Best regards,

Juhan Ernits

	-- /var/log/kern.log

May  6 06:47:25 market kernel: CPU 0: Machine Check Exception: 0000000000000004
May  6 06:47:25 market kernel: Bank 4: b200000000040151<0>Kernel panic: CPU context corrupt

	-- /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 1
cpu MHz         : 551.259
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
pat pse36 mmx fxsr xmm
bogomips        : 1101.00

	-- /var/log/dmesg

Linux version 2.2.18 (root@market.equitygate.com) (gcc version 2.95.2
20000220 (Debian GNU/Linux)) #6 Mon Jan 15 15:52:09 EET 2001
Detected 551259 kHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1101.00 BogoMIPS
Memory: 517620k/524288k available (776k kernel code, 416k reserved, 5440k
data, 36k init)
Dentry hash table entries: 65536 (order 7, 512k)
Buffer cache hash table entries: 524288 (order 9, 2048k)
Page cache hash table entries: 131072 (order 7, 512k)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
256K L2 cache (8 way)
CPU: L2 Cache: 256K
CPU: Intel Pentium III (Coppermine) stepping 01
Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.35a (19990819) Richard Gooch (rgooch@atnf.csiro.au)
PCI: PCI BIOS revision 2.10 entry at 0xfb2a0
PCI: Using configuration type 1
PCI: Probing PCI hardware
Linux NET4.0 for Linux 2.2
Based upon Swansea University Computer Society NET3.039
NET4: Unix domain sockets 1.0 for Linux NET4.0.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
TCP: Hash tables configured (ehash 524288 bhash 65536)
Initializing RT netlink socket
Starting kswapd v 1.5 
pty: 2048 Unix98 ptys configured
Real Time Clock Driver v1.09
keyboard: Timeout - AT keyboard not present?
keyboard: Timeout - AT keyboard not present?
scsi0 : IBM PCI ServeRAID 4.20.20  <ServeRAID 3L>
scsi : 1 host.
  Vendor:  IBM      Model:  SERVERAID        Rev:  1.0
  Type:   Direct-Access                      ANSI SCSI revision: 01
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
  Vendor:  IBM      Model:  SERVERAID        Rev:  1.0
  Type:   Processor                          ANSI SCSI revision: 01
scsi : detected 1 SCSI disk total.
SCSI device sda: hdwr sector= 512 bytes. Sectors= 35860480 [17510 MB]
[17.5 GB]
eepro100.c:v1.09j-t 9/29/99 Donald Becker
http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by Andrey
V. Savochkin <saw@saw.sw.com.sg> and others
eepro100.c: VA Linux custom, Dragan Stancevic <visitor@valinux.com>
2000/11/15
eth0: Intel PCI EtherExpress Pro100 82557, 00:D0:B7:16:9E:E2, IRQ 11.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 721383-008, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
eepro100.c:v1.09j-t 9/29/99 Donald Becker
http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by Andrey
V. Savochkin <saw@saw.sw.com.sg> and others
eepro100.c: VA Linux custom, Dragan Stancevic <visitor@valinux.com>
2000/11/15
Partition check:
 sda: sda1 sda2 sda3 < sda5 sda6 sda7 sda8 sda9 sda10 sda11 > sda4
apm: BIOS version 1.2 Flags 0x07 (Driver version 1.13)
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 36k freed
Adding Swap: 64004k swap-space (priority -1)
Adding Swap: 135976k swap-space (priority -2)
Adding Swap: 135976k swap-space (priority -3)
Adding Swap: 135976k swap-space (priority -4)
Adding Swap: 135976k swap-space (priority -5)


* RE: what causes Machine Check exception? revisited (2.2.18)
@ 2001-05-07 10:50 Bene, Martin
  2001-05-07 13:04 ` Simon Richter
  0 siblings, 1 reply; 12+ messages in thread
From: Bene, Martin @ 2001-05-07 10:50 UTC (permalink / raw)
  To: 'Juhan-Peep Ernits'; +Cc: 'linux-kernel@vger.kernel.org'

Hi Juhan,

> After searching the archives of the list I found some similar reports
> from September and December 2000 but as far as I understood 
> the cause of
> the error was blamed on the CPU. Is this the most probable case? 
> 
> Best regards,
> 
> Juhan Ernits
> 
> 	-- /var/log/kern.log
> 
> May  6 06:47:25 market kernel: CPU 0: Machine Check 
> Exception: 0000000000000004
> May  6 06:47:25 market kernel: Bank 4: b200000000040151<0>Kernel
> panic: CPU context corrupt

Yes. The consensus of the messages I received is that it's the CPU flagging an
internal hardware problem.

Suggested causes include:
	overclocking
	thermal problems
	CPU actually bad

Definitely not caused by:
	Bad Rams, mb-chipset.

In my case the error only occurred once and never again - chalked it up to bad
karma on that day.

Bye, Martin


* Re: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07  9:50 what causes Machine Check exception? revisited (2.2.18) Juhan-Peep Ernits
@ 2001-05-07 10:57 ` Alan Cox
  2001-05-09  5:41   ` Mike Fedyk
  0 siblings, 1 reply; 12+ messages in thread
From: Alan Cox @ 2001-05-07 10:57 UTC (permalink / raw)
  To: Juhan-Peep Ernits; +Cc: linux-kernel

> After searching the archives of the list I found some similar reports
> from September and December 2000 but as far as I understood the cause of
> the error was blamed on the CPU. Is this the most probable case? 

A machine check (trap 18) is signalled by the processor when it thinks it is
in an invalid state. Many x86 cpus have checking circuitry and the default
behaviour is to either reboot or continue-and-pray. 

Linux enables notification of these events. So yes your processor was unhappy.
But it can be unhappy because of wrong voltages, electrical noise, overheating
and many other things.

Generally it indicates a CPU problem but I've seen it caused by overclocking
and poorly fitted heatsinks.
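
For readers who want to interpret the two hexadecimal values in the original
report, here is a minimal Python sketch; it was not posted in the thread. It
assumes that the first value (0000000000000004) is the global machine-check
status word and the second (b200000000040151) is bank 4's status register, and
that the architectural Intel MCA layout applies to this CPU: flag bits 63 to 57
in the bank status, a model-specific code in bits 31:16, an MCA error code in
bits 15:0, and RIPV/EIPV/MCIP in bits 0 to 2 of the global word. The table and
function names are made up for illustration.

```python
# Flag bits of a per-bank status register (assumed architectural Intel layout).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the bank's MISC register holds extra information
    58: "ADDRV",  # the bank's ADDR register holds an address
    57: "PCC",    # processor context corrupt
}

# Flag bits of the global status word (assumed layout).
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_flags(value, table):
    """Return the names of all flags set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and error-code fields."""
    return {
        "flags": decode_flags(status, MCI_STATUS_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


bank4 = decode_mci_status(0xB200000000040151)
print(bank4["flags"])                      # ['VAL', 'UC', 'EN', 'PCC']
print(hex(bank4["model_specific_code"]),
      hex(bank4["mca_error_code"]))        # 0x4 0x151
print(decode_flags(0x4, MCG_STATUS_FLAGS)) # ['MCIP']
```

Under these assumptions the PCC flag is set, which lines up with the
"CPU context corrupt" panic text in the log.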


* RE: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 10:50 Bene, Martin
@ 2001-05-07 13:04 ` Simon Richter
  2001-05-07 18:23   ` Dan Hollis
  0 siblings, 1 reply; 12+ messages in thread
From: Simon Richter @ 2001-05-07 13:04 UTC (permalink / raw)
  To: Bene, Martin; +Cc: 'linux-kernel@vger.kernel.org'

On Mon, 7 May 2001, Bene, Martin wrote:

> Definitely not caused by:
> 	Bad Rams, mb-chipset.

Erm, it was bad RAM every time it happened to me. On standard PCs, you
don't see those because you don't have ECC and the error is simply not
detected.

   Simon

-- 
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
 Fingerprint: DC26 EB8D 1F35 4F44 2934  7583 DBB6 F98D 9198 3292
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!



* RE: what causes Machine Check exception? revisited (2.2.18)
@ 2001-05-07 13:33 Bene, Martin
  2001-05-07 13:51 ` Alan Cox
  2001-05-07 18:12 ` Simon Richter
  0 siblings, 2 replies; 12+ messages in thread
From: Bene, Martin @ 2001-05-07 13:33 UTC (permalink / raw)
  To: 'Simon Richter'; +Cc: 'linux-kernel@vger.kernel.org'

Hi Simon,

> On Mon, 7 May 2001, Bene, Martin wrote:
> 
> > Definitely not caused by:
> > 	Bad Rams, mb-chipset.
> 
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

Strange - definitely, strange. Of course you're correct about memory errors
going undetected on standard PC hardware, and usually these undetected
errors lead to other failures later on:

You get SIG11 errors when running programs (kernel compile seems to be a good
example), you get crashing processes, you get all sorts of weird funnies but
you really shouldn't get machine check exceptions.

I don't think there is a way a machine check exception can be triggered by
software - which it would have to be in order to be caused by bad RAMs.

Bye, Martin


* Re: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 13:33 Bene, Martin
@ 2001-05-07 13:51 ` Alan Cox
  2001-05-07 18:12 ` Simon Richter
  1 sibling, 0 replies; 12+ messages in thread
From: Alan Cox @ 2001-05-07 13:51 UTC (permalink / raw)
  To: Bene, Martin
  Cc: 'Simon Richter', 'linux-kernel@vger.kernel.org'

> You get SIG11 errors when running programs (kernel compile seems to be a good
> example), you get crashing processes, you get all sorts of weird funnies but
> you really shouldn't get machine check exceptions.
> 
> I don't think there is a way a machine check exception can be triggered by
> software - which it would have to be in order to be caused by bad RAMs.

Bad ECC memory and unrecoverable ECC faults can certainly be reported back to
the processor electrically. Also an L2 cache load failing when the RAM fails
to ack the signals is quite visible to a processor.



* RE: what causes Machine Check exception? revisited (2.2.18)
@ 2001-05-07 14:38 Ricardo Galli
  0 siblings, 0 replies; 12+ messages in thread
From: Ricardo Galli @ 2001-05-07 14:38 UTC (permalink / raw)
  To: linux-kernel

>> Definitely not caused by:
>> Bad Rams, mb-chipset.
>
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

I had the same problem with an SMP Intel 440LX which has run without any
problem since 1998. When I installed 2.2.18 it could not run for more than 5
minutes (Alan suggested me it was .

I am not sure it's a RAM problem, because it never gave/gives a SEGFAULT
compiling the kernel. I brought it back to 2.2.16 and it's running happy.

Could it be some SMP/BIOS-related problem? If it's the RAM or chipset, I am
scared how we could use it for three years and suddenly it hangs with a new
version of the kernel... Blame Intel?

--ricardo
http://m3d.uib.es/~gallir/


* RE: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 13:33 Bene, Martin
  2001-05-07 13:51 ` Alan Cox
@ 2001-05-07 18:12 ` Simon Richter
  1 sibling, 0 replies; 12+ messages in thread
From: Simon Richter @ 2001-05-07 18:12 UTC (permalink / raw)
  To: Bene, Martin; +Cc: 'linux-kernel@vger.kernel.org'

On Mon, 7 May 2001, Bene, Martin wrote:

[MCE caused by bad RAM]

> I don't think there is a way a machine check exception can be triggered by
> software - which it would have to be in order to be caused by bad RAMs.

An MCE is triggered by an ECC error - no software involved. A good trap
handler will then see if the error is recoverable (one-bit errors are),
notify userspace (so the admin gets mailed) and move the data out of this
page.
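
The first step described above, deciding whether a logged error is
recoverable, can be sketched from the status flags alone. This is a hedged
Python illustration, not kernel code: it assumes the architectural Intel bit
positions for VAL (63), UC (61) and PCC (57), the `triage` function and its
return strings are invented here, and the second status value is a
hypothetical example rather than data from this thread.

```python
# Assumed bit positions in a 64-bit bank status word.
VAL = 1 << 63  # a valid error is logged
UC = 1 << 61   # hardware could not correct the error
PCC = 1 << 57  # processor context may be corrupt


def triage(status: int) -> str:
    """Classify a bank status word by how safely execution could continue."""
    if not status & VAL:
        return "no valid error logged"
    if status & PCC:
        return "fatal: processor context corrupt, cannot continue safely"
    if status & UC:
        return "uncorrected: data lost, but context may still be intact"
    return "corrected: log it, notify the admin, keep running"


# Value from the original report in this thread.
print(triage(0xB200000000040151))
# Hypothetical value with only VAL and EN set, for comparison.
print(triage(0x9000000000000151))
```

With these assumptions the reported value falls into the fatal class, while
the hypothetical one would be the kind of event a handler could log and
survive.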

   Simon

-- 
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
 Fingerprint: DC26 EB8D 1F35 4F44 2934  7583 DBB6 F98D 9198 3292
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!



* RE: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 13:04 ` Simon Richter
@ 2001-05-07 18:23   ` Dan Hollis
  2001-05-07 18:30     ` nick
  2001-05-07 18:49     ` Simon Richter
  0 siblings, 2 replies; 12+ messages in thread
From: Dan Hollis @ 2001-05-07 18:23 UTC (permalink / raw)
  To: Simon Richter; +Cc: Bene, Martin, 'linux-kernel@vger.kernel.org'

On Mon, 7 May 2001, Simon Richter wrote:
> On Mon, 7 May 2001, Bene, Martin wrote:
> > Definitely not caused by:
> > 	Bad Rams, mb-chipset.
> Erm, it was bad RAM every time it happened to me. On standard PCs, you
> don't see those because you don't have ECC and the error is simply not
> detected.

So a 440bx motherboard with ECC ram is a non-standard PC?

-Dan



* RE: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 18:23   ` Dan Hollis
@ 2001-05-07 18:30     ` nick
  2001-05-07 18:49     ` Simon Richter
  1 sibling, 0 replies; 12+ messages in thread
From: nick @ 2001-05-07 18:30 UTC (permalink / raw)
  To: Dan Hollis
  Cc: Simon Richter, Bene, Martin,
	'linux-kernel@vger.kernel.org'

Yep, totally.  I've worked on hundreds of systems and fewer than 20 of the
workstations or PCs have been using ECC.  Most servers do, but not even
all of them.
	Nick

On Mon, 7 May 2001, Dan Hollis wrote:

> On Mon, 7 May 2001, Simon Richter wrote:
> > On Mon, 7 May 2001, Bene, Martin wrote:
> > > Definitely not caused by:
> > > 	Bad Rams, mb-chipset.
> > Erm, it was bad RAM every time it happened to me. On standard PCs, you
> > don't see those because you don't have ECC and the error is simply not
> > detected.
> 
> So a 440bx motherboard with ECC ram is a non-standard PC?
> 
> -Dan
> 



* RE: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 18:23   ` Dan Hollis
  2001-05-07 18:30     ` nick
@ 2001-05-07 18:49     ` Simon Richter
  1 sibling, 0 replies; 12+ messages in thread
From: Simon Richter @ 2001-05-07 18:49 UTC (permalink / raw)
  To: Dan Hollis; +Cc: Bene, Martin, 'linux-kernel@vger.kernel.org'

On Mon, 7 May 2001, Dan Hollis wrote:

> > Erm, it was bad RAM every time it happened to me. On standard PCs, you
> > don't see those because you don't have ECC and the error is simply not
> > detected.

> So a 440bx motherboard with ECC ram is a non-standard PC?

I bet the board doesn't force you to use ECC RAM, so manufacturers will
not use it because it's too expensive and the average customer doesn't
understand what memory is and what it's used for. So yes, it's
non-standard.

   Simon

-- 
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
 Fingerprint: DC26 EB8D 1F35 4F44 2934  7583 DBB6 F98D 9198 3292
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!



* Re: what causes Machine Check exception? revisited (2.2.18)
  2001-05-07 10:57 ` Alan Cox
@ 2001-05-09  5:41   ` Mike Fedyk
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Fedyk @ 2001-05-09  5:41 UTC (permalink / raw)
  To: linux-kernel

On Mon, May 07, 2001 at 11:57:17AM +0100, Alan Cox wrote:
> Generally it indicates a CPU problem but I've seen it caused by overclocking
> and poorly fitted heatsinks

I've been able to trigger a Machine check error on PPC when trying to boot
directly from OF with a COFF kernel.  The system has worked perfectly with
BootX.

I wonder why this is the first non-x86 report...

Mike

