* Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
@ 2002-03-20 21:35 Tom Epperly
  2002-03-20 23:26 ` Kurt Garloff
  2002-03-20 23:31 ` Alan Cox
  0 siblings, 2 replies; 11+ messages in thread
From: Tom Epperly @ 2002-03-20 21:35 UTC (permalink / raw)
  To: linux-kernel

This is a follow-up to an earlier thread whose subject was "Re: RH7.2
running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"". I am
now running a self-compiled 2.4.18 kernel with the small changes shown
below, which log illegal instruction traps to the kernel log.

The kernel log showed me that various standard programs such as
/bin/sh are generating bogus illegal instruction traps on a legal
opcode (0x55) as part of a standard function preamble. After receiving
an illegal instruction trap on opcode (0x55), the modified kernel does
a wbinvd() to flush the cache and a __flush_tlb() to flush the TLB
and then retries the "illegal" opcode. The retry produces a second
illegal instruction trap on the same legal opcode (0x55). Information
from /var/log/messages is shown below.
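
As a rough illustration of why opcode 0x55 is called legal here, the sketch
below checks the first bytes of a logged code dump against the usual i386
function preamble (prologue). It is a user-space toy, not kernel code; the
function names is_push_ebp() and has_frame_prologue() are made up for this
example, and the byte values come from the log excerpts in this message.

```c
#include <assert.h>
#include <stddef.h>

/* 0x55 encodes "push %ebp": a single-byte instruction with no operands
 * to decode, which is why a trap on it looks bogus. */
enum { OP_PUSH_EBP = 0x55 };

/* Returns 1 if the byte at the faulting EIP is push %ebp. */
int is_push_ebp(const unsigned char *code, size_t len)
{
    assert(code != NULL);
    return len >= 1 && code[0] == OP_PUSH_EBP;
}

/* Returns 1 if the bytes begin with the classic frame-pointer prologue
 * "push %ebp; mov %esp,%ebp", encoded as 55 89 e5. */
int has_frame_prologue(const unsigned char *code, size_t len)
{
    assert(code != NULL);
    return len >= 3 && code[0] == 0x55 && code[1] == 0x89 && code[2] == 0xe5;
}
```

The first logged dump (55 89 e5 83 ec 08 ...) matches the full frame-pointer
prologue; the later ones (55 53 56 57 ...) start with push %ebp followed by
further register pushes, so only is_push_ebp() matches them.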

The problem disappears if I disable the second CPU (via a BIOS
switch). I've tried physically switching processors on the
motherboard, and both chips behave correctly in single-CPU mode. The
system passes Dell's hardware diagnostics (twice) and memtest-86 2.9,
and I have seen identical problems on two other Dell Precision 530
Workstations purchased at different times with different clock speeds.

I initiated a support call with Dell at around 3:30pm PST on Friday
15-Mar-2002, and all the feedback I've received from this so far shows
that they are clueless. They are trying to portray this as a Linux
problem.

The machine doesn't run X11, so the nVidia drivers are never loaded. I
pulled the sound card out too. It has 512MB of ECC RAM.

Does anyone have any suggestions about what could be causing this
problem or how one might further diagnose the issue? Is there any way
that this might not be a hardware problem? Please Cc me in
replies.

Tom Epperly

*SAMPLE MESSAGES* from /var/log/messages:

Mar 18 20:56:30 tux06 kernel: Restarting 13766  0x805aa80 sh
Mar 18 20:56:30 tux06 kernel: 55 89 e5 83 ec 08 8b 45 08 85 c0 74 0a 8b 15 00 24 0c 08 85 
Mar 18 20:56:30 tux06 kernel: invalid operand: 0000
Mar 18 20:56:30 tux06 kernel: CPU:    1
Mar 18 20:56:30 tux06 kernel: EIP:    0023:[usb_stor_exit+134588960/-1072693344]    Not tainted
Mar 18 20:56:30 tux06 kernel: EIP:    0023:[<0805aa80>]    Not tainted
Mar 18 20:56:30 tux06 kernel: EFLAGS: 00010292
Mar 18 20:56:30 tux06 kernel: eax: 000035c6   ebx: 000035c6   ecx: bfffe730   edx: 00000001
Mar 18 20:56:30 tux06 kernel: esi: 00000000   edi: 00000000   ebp: bfffe7c8   esp: bfffe69c
Mar 18 20:56:30 tux06 kernel: ds: 002b   es: 002b   ss: 002b
Mar 18 20:56:30 tux06 kernel: Process sh (pid: 13766, stackpage=caff1000)
Mar 18 20:56:30 tux06 kernel: Stack: 0806f58c 00000000 bfffe730 bfffe6b0 0806f580 00010000 00000000 00000000 
Mar 18 20:56:30 tux06 kernel:        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
Mar 18 20:56:30 tux06 kernel:        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
Mar 18 20:56:30 tux06 kernel: Call Trace: 
Mar 18 20:56:30 tux06 kernel: 
Mar 18 20:56:30 tux06 kernel: Code: 55 89 e5 83 ec 08 8b 45 08 85 c0 74 0a 8b 15 00 24 0c 08 85 
Mar 19 05:13:01 tux06 kernel:  Restarting 11895  0x4011f8a0 sh
Mar 19 05:13:01 tux06 kernel: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel: invalid operand: 0000
Mar 19 05:13:01 tux06 kernel: CPU:    0
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[usb_stor_exit+1074919488/-1072693344]    Not tainted
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[<4011f8a0>]    Not tainted
Mar 19 05:13:01 tux06 kernel: EFLAGS: 00010206
Mar 19 05:13:01 tux06 kernel: eax: 00001000   ebx: 4017c690   ecx: bfffb200   edx: 00001000
Mar 19 05:13:01 tux06 kernel: esi: 080cd00c   edi: 00001000   ebp: bfffb278   esp: bfffb1dc
Mar 19 05:13:01 tux06 kernel: ds: 002b   es: 002b   ss: 002b
Mar 19 05:13:01 tux06 kernel: Process sh (pid: 11895, stackpage=c6f97000)
Mar 19 05:13:01 tux06 kernel: Stack: 4009d145 00000000 00001000 00000003 00000022 ffffffff 00000000 00000001 
Mar 19 05:13:01 tux06 kernel:        00000001 00000805 00000000 00000000 000517b5 000081a4 00000001 00000000 
Mar 19 05:13:01 tux06 kernel:        00000000 00000000 00000000 00000000 00000a29 00000000 00001000 00000008 
Mar 19 05:13:01 tux06 kernel: Call Trace: 
Mar 19 05:13:01 tux06 kernel: 
Mar 19 05:13:01 tux06 kernel: Code: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel:  Restarting 11898  0x4011f8a0 sh
Mar 19 05:13:01 tux06 kernel: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel: invalid operand: 0000
Mar 19 05:13:01 tux06 kernel: CPU:    0
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[usb_stor_exit+1074919488/-1072693344]    Not tainted
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[<4011f8a0>]    Not tainted
Mar 19 05:13:01 tux06 kernel: EFLAGS: 00010206
Mar 19 05:13:01 tux06 kernel: eax: 00001000   ebx: 4017c690   ecx: bfffb200   edx: 00001000
Mar 19 05:13:01 tux06 kernel: esi: 080cd00c   edi: 00001000   ebp: bfffb278   esp: bfffb1dc
Mar 19 05:13:01 tux06 kernel: ds: 002b   es: 002b   ss: 002b
Mar 19 05:13:01 tux06 kernel: Process sh (pid: 11898, stackpage=c6f97000)
Mar 19 05:13:01 tux06 kernel: Stack: 4009d145 00000000 00001000 00000003 00000022 ffffffff 00000000 00000001 
Mar 19 05:13:01 tux06 kernel:        00000001 00000805 00000000 00000000 000517b5 000081a4 00000001 00000000 
Mar 19 05:13:01 tux06 kernel:        00000000 00000000 00000000 00000000 00000a29 00000000 00001000 00000008 
Mar 19 05:13:01 tux06 kernel: Call Trace: 
Mar 19 05:13:01 tux06 kernel: 
Mar 19 05:13:01 tux06 kernel: Code: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel:  Restarting 11902  0x4011f8a0 runAll
Mar 19 05:13:01 tux06 kernel: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel: invalid operand: 0000
Mar 19 05:13:01 tux06 kernel: CPU:    0
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[usb_stor_exit+1074919488/-1072693344]    Not tainted
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[<4011f8a0>]    Not tainted
Mar 19 05:13:01 tux06 kernel: EFLAGS: 00010206
Mar 19 05:13:01 tux06 kernel: eax: 00001000   ebx: 4017c690   ecx: bfffb260   edx: 00001000
Mar 19 05:13:01 tux06 kernel: esi: 080cd00c   edi: 00001000   ebp: bfffb2d8   esp: bfffb23c
Mar 19 05:13:01 tux06 kernel: ds: 002b   es: 002b   ss: 002b
Mar 19 05:13:01 tux06 kernel: Process runAll (pid: 11902, stackpage=cbe0b000)
Mar 19 05:13:01 tux06 kernel: Stack: 4009d145 00000000 00001000 00000003 00000022 ffffffff 00000000 00000001 
Mar 19 05:13:01 tux06 kernel:        00000001 00000805 00000000 00000000 000517b5 000081a4 00000001 00000000 
Mar 19 05:13:01 tux06 kernel:        00000000 00000000 00000000 00000000 00000a29 00000000 00001000 00000008 
Mar 19 05:13:01 tux06 kernel: Call Trace: 
Mar 19 05:13:01 tux06 kernel: 
Mar 19 05:13:01 tux06 kernel: Code: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel:  Restarting 11919  0x4011f8a0 runAll
Mar 19 05:13:01 tux06 kernel: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 
Mar 19 05:13:01 tux06 kernel: invalid operand: 0000
Mar 19 05:13:01 tux06 kernel: CPU:    0
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[usb_stor_exit+1074919488/-1072693344]    Not tainted
Mar 19 05:13:01 tux06 kernel: EIP:    0023:[<4011f8a0>]    Not tainted
Mar 19 05:13:01 tux06 kernel: EFLAGS: 00010206
Mar 19 05:13:01 tux06 kernel: eax: 00001000   ebx: 4017c690   ecx: bfffb250   edx: 00001000
Mar 19 05:13:01 tux06 kernel: esi: 080cd00c   edi: 00001000   ebp: bfffb2c8   esp: bfffb22c
Mar 19 05:13:01 tux06 kernel: ds: 002b   es: 002b   ss: 002b
Mar 19 05:13:01 tux06 kernel: Process runAll (pid: 11919, stackpage=cbe0b000)
Mar 19 05:13:01 tux06 kernel: Stack: 4009d145 00000000 00001000 00000003 00000022 ffffffff 00000000 00000001 
Mar 19 05:13:01 tux06 kernel:        00000001 00000805 00000000 00000000 000517b5 000081a4 00000001 00000000 
Mar 19 05:13:01 tux06 kernel:        00000000 00000000 00000000 00000000 00000a29 00000000 00001000 00000008 
Mar 19 05:13:01 tux06 kernel: Call Trace: 
Mar 19 05:13:01 tux06 kernel: 
Mar 19 05:13:01 tux06 kernel: Code: 55 53 56 57 8b 5c 24 14 8b 4c 24 18 8b 54 24 1c 8b 74 24 20 

*PATCH* to add the logging (note this patch is not intended for anything other than experimenting & debugging):


$ diff -c ~epperly/linux/arch/i386/kernel/traps.c /usr/src/linux/arch/i386/kernel/traps.c
*** /home/epperly/linux/arch/i386/kernel/traps.c	Sun Sep 30 12:26:08 2001
--- /usr/src/linux/arch/i386/kernel/traps.c	Fri Mar 15 16:06:06 2002
***************
*** 214,227 ****
  	 * When in-kernel, we also print out the stack and code at the
  	 * time of the fault..
  	 */
! 	if (in_kernel) {
  
  		printk("\nStack: ");
  		show_stack((unsigned long*)esp);
  
  		printk("\nCode: ");
  		if(regs->eip < PAGE_OFFSET)
  			goto bad;
  
  		for(i=0;i<20;i++)
  		{
--- 214,229 ----
  	 * When in-kernel, we also print out the stack and code at the
  	 * time of the fault..
  	 */
! 	if (1|in_kernel) {
  
  		printk("\nStack: ");
  		show_stack((unsigned long*)esp);
  
  		printk("\nCode: ");
+                 /*
  		if(regs->eip < PAGE_OFFSET)
  			goto bad;
+                 */
  
  		for(i=0;i<20;i++)
  		{
***************
*** 267,304 ****
  }
  
  static void inline do_trap(int trapnr, int signr, char *str, int vm86,
! 			   struct pt_regs * regs, long error_code, siginfo_t *info)  {
! 	if (vm86 && regs->eflags & VM_MASK)
! 		goto vm86_trap;
! 	if (!(regs->xcs & 3))
! 		goto kernel_trap;
! 
! 	trap_signal: {
! 		struct task_struct *tsk = current;
! 		tsk->thread.error_code = error_code;
! 		tsk->thread.trap_no = trapnr;
! 		if (info)
! 			force_sig_info(signr, info, tsk);
! 		else
! 			force_sig(signr, tsk);
! 		return;
! 	}
! 
! 	kernel_trap: {
! 		unsigned long fixup = search_exception_table(regs->eip);
! 		if (fixup)
! 			regs->eip = fixup;
! 		else	
! 			die(str, regs, error_code);
! 		return;
! 	}
! 
! 	vm86_trap: {
! 		int ret = handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, trapnr);
! 		if (ret) goto trap_signal;
! 		return;
! 	}
  }
  
  #define DO_ERROR(trapnr, signr, str, name) \
--- 269,340 ----
  }
  
  static void inline do_trap(int trapnr, int signr, char *str, int vm86,
!                            struct pt_regs * regs, long error_code,
! siginfo_t *info)
  {
!   int i;
!   if (vm86 && regs->eflags & VM_MASK)
!     goto vm86_trap;
!   if (!(regs->xcs & 3))
!     goto kernel_trap;
!   
!         trap_signal: {
!                 struct task_struct *tsk = current;
!                 tsk->thread.error_code = error_code;
!                 tsk->thread.trap_no = trapnr;
! 
!                 /*debug for processes getting illegal operation faults*/
!                 if(trapnr==6){
!                         unsigned char c;
! 
!                         __get_user(c, &((unsigned char*)regs->eip)[0]);
! 
!                         if( c==0x55 ){ /*push ebp*/
!                                 if(tsk->per_cpu_utime[31]==regs->eip) {
!                                         /*This guy's been through the mill
! once already*/
!                                         die(str, regs, error_code);
!                                 }else{
!                                         /*first timer, so flag him*/
!                                         tsk->per_cpu_utime[31]=regs->eip;
!                                         printk("Restarting %d  0x%lx %s\n",tsk->pid,regs->eip,tsk->comm);
!                                         for(i=0;i<20;i++) {
!                                                 unsigned char c;
!                                                 if(__get_user(c,
! &((unsigned char*)regs->eip)[i])) {
!                                                         printk(" Bad EIP value.");
!                                                         break;
!                                                 }
!                                                 printk("%02x ", c);
!                                         }
!                                         printk("\n");
!                                         wbinvd();
!                                         __flush_tlb();
!                                         return;
!                                 }
!                         }
!                 }
!                 if (info)
!                         force_sig_info(signr, info, tsk);
!                 else
!                         force_sig(signr, tsk);
!                 return;
!         }
! 
!  kernel_trap: {
!    unsigned long fixup = search_exception_table(regs->eip);
!    if (fixup)
!      regs->eip = fixup;
!    else	
!      die(str, regs, error_code);
!    return;
!  }
!   
!  vm86_trap: {
!    int ret = handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, trapnr);
!    if (ret) goto trap_signal;
!    return;
!  }
  }
  
  #define DO_ERROR(trapnr, signr, str, name) \


* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-20 21:35 Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box Tom Epperly
@ 2002-03-20 23:26 ` Kurt Garloff
  2002-03-21  0:04   ` Alan Cox
  2002-03-20 23:31 ` Alan Cox
  1 sibling, 1 reply; 11+ messages in thread
From: Kurt Garloff @ 2002-03-20 23:26 UTC (permalink / raw)
  To: Tom Epperly; +Cc: Linux kernel list

On Wed, Mar 20, 2002 at 01:35:30PM -0800, Tom Epperly wrote:
> The kernel log showed me that various standard programs such as
> /bin/sh are generating bogus illegal instruction traps on a legal
> opcode (0x55) as part of a standard function preamble. After receiving
> an illegal instruction trap on opcode (0x55), the modified kernel does
> a wbinvd() to flush the cache and a __flush_tlb() to flush the TLB
> and then retries the "illegal" opcode. The retry produces a second
> illegal instruction trap on the same legal opcode (0x55). Information
> from /var/log/messages is shown below.

The CPU is what triggers the exception.
So this sounds like a defective (or overheated) CPU to me.

OTOH, the kernel logs "invalid operand". Could you run ksymoops to get a
disassembly?
AFAICS, it's a push %ebp instruction, which should not be illegal. So either
your stack is overflowing or my suspicion about a defective CPU applies.

Regards,
-- 
Kurt Garloff                   <kurt@garloff.de>         [Eindhoven, NL]
Physics: Plasma simulations  <K.Garloff@Phys.TUE.NL>  [TU Eindhoven, NL]
Linux: SCSI, Security          <garloff@suse.de>    [SuSE Nuernberg, DE]
 (See mail header or public key servers for PGP2 and GPG public keys.)


* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-20 21:35 Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box Tom Epperly
  2002-03-20 23:26 ` Kurt Garloff
@ 2002-03-20 23:31 ` Alan Cox
  2002-03-20 23:31   ` Tom Epperly
  1 sibling, 1 reply; 11+ messages in thread
From: Alan Cox @ 2002-03-20 23:31 UTC (permalink / raw)
  To: Tom Epperly; +Cc: linux-kernel

> I initiated a support call with Dell at around 3:30pm PST on Friday
> 15-Mar-2002, and all the feedback I've received from this so far shows
> that they are clueless. They are trying to portray this as a Linux
> problem.

Well, to be honest, they aren't the only ones who are totally baffled by it.
Do you have the current microcode updates in your BIOS or via the ucode
driver?

Do all the problem boxes have the same stepping of CPU?


* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-20 23:31 ` Alan Cox
@ 2002-03-20 23:31   ` Tom Epperly
  2002-03-21  0:03     ` Dave Jones
  2002-03-21  0:04     ` Alan Cox
  0 siblings, 2 replies; 11+ messages in thread
From: Tom Epperly @ 2002-03-20 23:31 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Alan Cox wrote:

>>I initiated a support call with Dell at around 3:30pm PST on Friday
>>15-Mar-2002, and all the feedback I've received from this so far shows
>>that they are clueless. They are trying to portray this as a Linux
>>problem.
>>
>
>Well to be honest they aren't the only ones who are totally baffled by it.
>Do you have the current microcode updates in your BIOS or via the ucode
>driver ?
>
One box, tux06, has the latest Dell BIOS, A05. I don't know how to
determine if it has the latest microcode updates. Where can one get the
current microcode updates, and how do I install them?

>
>
>Do all the problem boxes have the same stepping of CPU ?
>
According to cat /proc/cpuinfo, two boxes, tux06 and tux34, have stepping
10, and tux47 has stepping 2. I have seen the unexplained "Illegal
instruction" messages on tux34 and tux47, but I haven't run the modified
kernel on them. Root access is restricted here.

Tom



* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
@ 2002-03-20 23:46 James Washer
  0 siblings, 0 replies; 11+ messages in thread
From: James Washer @ 2002-03-20 23:46 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: linux-kernel


Just to clarify things.

Lots of processes die from illegal op traps (gcc, bash, make, etc.), but
the instruction is ALWAYS opcode 0x55 and is part of a subroutine preamble
in every case. You are correct: 0x55 should not generate a trap.

Bad CPU? Hmm, Tom has 6 different CPUs (all P4 Xeons), on three
systems, that have this EXACT same problem.

And why does this require the system to be running SMP?


 - jim

Kurt Garloff <kurt@garloff.de>@vger.kernel.org on 03/20/2002 03:26:10 PM

Sent by:    linux-kernel-owner@vger.kernel.org


To:    Tom Epperly <tepperly@llnl.gov>
cc:    Linux kernel list <linux-kernel@vger.kernel.org>
Subject:    Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell
       box



On Wed, Mar 20, 2002 at 01:35:30PM -0800, Tom Epperly wrote:
> The kernel log showed me that various standard programs such as
> /bin/sh are generating bogus illegal instruction traps on a legal
> opcode (0x55) as part of a standard function preamble. After receiving
> an illegal instruction trap on opcode (0x55), the modified kernel does
> a wbinvd() to flush the cache and a __flush_tlb() to flush the TLB
> and then retries the "illegal" opcode. The retry produces a second
> illegal instruction trap on the same legal opcode (0x55). Information
> from /var/log/messages is shown below.

The CPU is what triggers the exception.
So this sounds like a defect (or overheated) CPU to me.

OTOH, the kernel logs "invalid operand". Could you run ksymoops to get a
disassembly?
AFAICS, its a push %ebp instruction, which should not be illegal. So either
your stack is overflowing or my suspicion with the defect CPU is
applicable.

Regards,
--
Kurt Garloff                   <kurt@garloff.de>         [Eindhoven, NL]
Physics: Plasma simulations  <K.Garloff@Phys.TUE.NL>  [TU Eindhoven, NL]
Linux: SCSI, Security          <garloff@suse.de>    [SuSE Nuernberg, DE]
 (See mail header or public key servers for PGP2 and GPG public keys.)



* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-20 23:31   ` Tom Epperly
@ 2002-03-21  0:03     ` Dave Jones
  2002-03-21  0:04     ` Alan Cox
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Jones @ 2002-03-21  0:03 UTC (permalink / raw)
  To: Tom Epperly; +Cc: Alan Cox, linux-kernel

On Wed, Mar 20, 2002 at 03:31:50PM -0800, Tom Epperly wrote:
 
 > One box, tux06, has the latest Dell BIOS, A05. I don't know how to 
 > determine if it has the latest microcode updates. Where can one get the 
 > current microcode updates, and how do I install it?

http://www.urbanmyth.org/microcode/

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-20 23:31   ` Tom Epperly
  2002-03-21  0:03     ` Dave Jones
@ 2002-03-21  0:04     ` Alan Cox
  1 sibling, 0 replies; 11+ messages in thread
From: Alan Cox @ 2002-03-21  0:04 UTC (permalink / raw)
  To: Tom Epperly; +Cc: Alan Cox, linux-kernel

> One box, tux06, has the latest Dell BIOS, A05. I don't know how to 
> determine if it has the latest microcode updates. Where can one get the 
> current microcode updates, and how do I install it?

The microcode updates change the stepping value for the CPU afaik.

> According to cat /proc/cpuinfo, two boxes tux06 & tux34 have stepping 
> 10, and tux47 has stepping 2. I have seen the unexplained "Illegal 
> instruction" messages on tux34 and tux47, but I haven't run the modified 
> kernel on them. root access is restricted here.

Hmm. I'm still as baffled as Dell, I'm afraid.


* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-20 23:26 ` Kurt Garloff
@ 2002-03-21  0:04   ` Alan Cox
  0 siblings, 0 replies; 11+ messages in thread
From: Alan Cox @ 2002-03-21  0:04 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: Tom Epperly, Linux kernel list

> disassembly?
> AFAICS, its a push %ebp instruction, which should not be illegal. So either
> your stack is overflowing or my suspicion with the defect CPU is applicable.

Or somehow the I/D TLBs got messed up and the ITLB entry for that page is now
wrong.


* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
@ 2002-03-21  0:30 James Washer
  2002-03-21  7:19 ` Zwane Mwaikambo
  0 siblings, 1 reply; 11+ messages in thread
From: James Washer @ 2002-03-21  0:30 UTC (permalink / raw)
  To: Alan Cox, linux-kernel


The iTLB would be flushed when he did the reload of cr3 (per your
suggestion) UNLESS the G bit was set.
I suppose there's some small chance that, at the time this instruction was
first cached and its corresponding iTLB entry was loaded, the G bit may
have been set. It seems unlikely, but I'll hack up something to
unconditionally flush the iTLB.

 - jim

Alan Cox <alan@lxorguk.ukuu.org.uk>@vger.kernel.org on 03/20/2002 04:04:51
PM

Sent by:    linux-kernel-owner@vger.kernel.org


To:    kurt@garloff.de (Kurt Garloff)
cc:    tepperly@llnl.gov (Tom Epperly), linux-kernel@vger.kernel.org (Linux
       kernel list)
Subject:    Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell
       box



> disassembly?
> AFAICS, its a push %ebp instruction, which should not be illegal. So either
> your stack is overflowing or my suspicion with the defect CPU is applicable.

Or somehow the I/D TLB's got messed up and the ITLB for that entry is now
wrong.

* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
  2002-03-21  0:30 James Washer
@ 2002-03-21  7:19 ` Zwane Mwaikambo
  0 siblings, 0 replies; 11+ messages in thread
From: Zwane Mwaikambo @ 2002-03-21  7:19 UTC (permalink / raw)
  To: James Washer; +Cc: Alan Cox, linux-kernel

On Wed, 20 Mar 2002, James Washer wrote:

> 
> The iTLB would be flushed when he did the reload of cr3 ( per your
> suggestion ) UNLESS the G bit was set.
> I suppose theres some small chance, that at the time this instruction was
> first cached and its corresponding iTLB entry was loaded, the G bit may
> have been set.. Seems unlikely. but I'll hack up something to
> unconditionally flush the iTLB.

I find vol3 somewhat confusing in this regard...

P104 - The only ways to deterministically invalidate global page entries 
are as follows:
o Clear the PGE flag and then invalidate the TLBs.
o Execute the INVLPG instruction to invalidate individual page-directory 
  or page-table entries in the TLBs.
o Write to control register CR3 to invalidate all TLB entries.

Then on page 381.

The following operations invalidate all TLB entries except global entries. 
(A global entry is one for which the G (global) flag is set in its 
corresponding page-directory or page-table entry. The global flag was 
introduced into the IA-32 architecture in the P6 family processors, see 
Section 10.5., "Cache Control".)

o Writing to control register CR3.
o A task switch that changes control register CR3.

I would reckon reference 1 (p104) is incorrect; can someone shed some
light?
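
The difference between the two readings can be made concrete with a toy
user-space model of a TLB that tracks the G flag. It follows the page-381
behaviour (a CR3 write spares global entries while PGE is enabled), which is
also how the architecture behaves as far as I know. Everything here is a
hypothetical simulation: struct tlb, the slot count and the function names are
invented for illustration, and nothing in it touches real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define TLB_SLOTS 8

struct tlb_entry {
    unsigned long vpage;  /* virtual page number */
    int valid;
    int global;           /* mirrors the G flag of the PDE/PTE */
};

struct tlb {
    struct tlb_entry e[TLB_SLOTS];
    int pge_enabled;      /* models the PGE flag in CR4 */
};

/* CR3 write: drops every entry except global ones while PGE is set. */
void tlb_write_cr3(struct tlb *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < TLB_SLOTS; i++)
        if (!(t->pge_enabled && t->e[i].global))
            t->e[i].valid = 0;
}

/* INVLPG: drops the entry for one page, whether or not it is global. */
void tlb_invlpg(struct tlb *t, unsigned long vpage)
{
    assert(t != NULL);
    for (size_t i = 0; i < TLB_SLOTS; i++)
        if (t->e[i].valid && t->e[i].vpage == vpage)
            t->e[i].valid = 0;
}

/* Clearing PGE: drops everything, global entries included. */
void tlb_clear_pge(struct tlb *t)
{
    assert(t != NULL);
    t->pge_enabled = 0;
    for (size_t i = 0; i < TLB_SLOTS; i++)
        t->e[i].valid = 0;
}

/* Returns 1 if a valid entry for vpage is still cached. */
int tlb_holds(const struct tlb *t, unsigned long vpage)
{
    assert(t != NULL);
    for (size_t i = 0; i < TLB_SLOTS; i++)
        if (t->e[i].valid && t->e[i].vpage == vpage)
            return 1;
    return 0;
}
```

In this model a stale global entry survives tlb_write_cr3() and is removed
only by tlb_invlpg() for its page or by tlb_clear_pge(), which is the scenario
raised in the message quoted above.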

Thanks,
	Zwane




* Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell box
@ 2002-03-21 14:52 James Washer
  0 siblings, 0 replies; 11+ messages in thread
From: James Washer @ 2002-03-21 14:52 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: linux-kernel


Yes, I agree that page 104 (Section 3.11) is inconsistent with itself
with respect to
      "Write to control register CR3 to invalidate all TLB entries."

For the particular problem Tom is seeing, however, I've recoded do_trap() to
do an invlpg on the particular page that is causing the problem, just in
case the G bit was set and the PTE was stale. I suspect he'll be able to
test this code this morning.
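
For reference, here is a small user-space sketch of which page such an invlpg
would affect for a given faulting EIP. It assumes 4 KiB pages; the EX_PAGE_*
macros and the helper names are defined only for this example. INVLPG itself
accepts any address inside the page, so the masking is shown purely to make
the page boundary visible.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, i.e. 12 offset bits. */
#define EX_PAGE_SHIFT 12UL
#define EX_PAGE_SIZE  (1UL << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  (~(EX_PAGE_SIZE - 1UL))

/* Start address of the page containing addr. */
unsigned long page_base(unsigned long addr)
{
    unsigned long base = addr & EX_PAGE_MASK;
    /* Sanity check: addr lies within [base, base + page size). */
    assert(base <= addr && addr - base < EX_PAGE_SIZE);
    return base;
}

/* Offset of addr within its page. */
unsigned long page_offset(unsigned long addr)
{
    return addr & (EX_PAGE_SIZE - 1UL);
}
```

With the EIP values from Tom's log, 0x4011f8a0 falls in the page starting at
0x4011f000 (offset 0x8a0) and 0x0805aa80 in the page starting at 0x0805a000
(offset 0xa80).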

 - jim

Zwane Mwaikambo <zwane@linux.realnet.co.sz>@vger.kernel.org on 03/20/2002
11:19:41 PM

Sent by:    linux-kernel-owner@vger.kernel.org


To:    James Washer/Beaverton/IBM@IBMUS
cc:    Alan Cox <alan@lxorguk.ukuu.org.uk>, <linux-kernel@vger.kernel.org>
Subject:    Re: Bad Illegal instruction traps on dual-Xeon (p4) Linux Dell
       box



On Wed, 20 Mar 2002, James Washer wrote:

>
> The iTLB would be flushed when he did the reload of cr3 ( per your
> suggestion ) UNLESS the G bit was set.
> I suppose theres some small chance, that at the time this instruction was
> first cached and its corresponding iTLB entry was loaded, the G bit may
> have been set.. Seems unlikely. but I'll hack up something to
> unconditionally flush the iTLB.

I find vol3 somewhat confusing in this regard...

P104 - The only ways to deterministically invalidate global page entries
are as follows:
o Clear the PGE flag and then invalidate the TLBs.
o Execute the INVLPG instruction to invalidate individual page-directory
  or page-table entries in the TLBs.
o Write to control register CR3 to invalidate all TLB entries.

Then on page 381.

The following operations invalidate all TLB entries except global entries.
(A global entry is one for which the G (global) flag is set in its
corresponding page-directory or page-table entry. The global flag was
introduced into the IA-32 architecture in the P6 family processors, see
Section 10.5., "Cache Control".)

o Writing to control register CR3.
o A task switch that changes control register CR3.

I would reckon reference 1 (p104) is incorrect, can someone shed some
light?

Thanks,
 Zwane

