public inbox for linux-kernel@vger.kernel.org
* 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-22 14:49 Matthias Andree
  0 siblings, 0 replies; 7+ messages in thread
From: Matthias Andree @ 2004-03-22 14:49 UTC
  To: Linux-Kernel mailing list

Hi,

I found this in the logs of a Dual Athlon MP machine (Tyan board)
running 2.4.25-SMP:

kernel BUG at page_alloc.c:105! 
invalid operand: 0000 
CPU:    0 
EIP: 0010:[__free_pages_ok+80/704]    Not tainted 
EFLAGS: 00010286 
eax: c0333674   ebx: c1b2d720   ecx: 00000000   edx: f22f7a84 
esi: 00000001   edi: 00000000   ebp: 00000001   esp: f6901e3c 
ds: 0018   es: 0018   ss: 0018 
Process svscan (pid: 1348, stackpage=f6901000) 
Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004  
       00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000  
       00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440  
Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] [mmput+88/176] [do_exit+259/800] 
  [sig_exit+195/208] [dequeue_signal+95/192] [do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] [sys_nanosleep+232/448] 
  [do_page_fault+0/1347] [signal_return+20/24] 

Other than this BUG (that took down the machine hard, I was lucky to log
across the network), there appear to be no relevant logs shortly before
this crash.

What's causing this?

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95


* Re: 2.4.25 SMP - BUG at page_alloc.c:105
  2004-03-24 20:58 2.4.25 SMP - BUG at page_alloc.c:105 Marcelo Tosatti
@ 2004-03-24 20:28 ` Andrew Morton
  2004-03-24 21:12   ` Matthias Andree
  2004-03-24 21:51   ` Marcelo Tosatti
  0 siblings, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2004-03-24 20:28 UTC
  To: Marcelo Tosatti; +Cc: matthias.andree, andrea, linux-kernel

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> 
> The backtrace is odd to me. 
> 
> set_page_dirty() does not call __free_pages_ok() directly or indirectly.
> 

I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
anonymous page which had a non-null ->mapping.  It could be a hardware bug.
Without seeing the actual value of page->mapping it's hard to know.

It would be good to backport the bad_page() debug code so we get a bit more
info when this sort of thing happens.



> ---
> 
> Hi,
> 
> I found this in the logs of a Dual Athlon MP machine (Tyan board)
> running 2.4.25-SMP:
> 
> kernel BUG at page_alloc.c:105! 
> invalid operand: 0000 
> CPU:    0 
> EIP: 0010:[__free_pages_ok+80/704]    Not tainted 
> EFLAGS: 00010286 
> eax: c0333674   ebx: c1b2d720   ecx: 00000000   edx: f22f7a84 
> esi: 00000001   edi: 00000000   ebp: 00000001   esp: f6901e3c 
> ds: 0018   es: 0018   ss: 0018 
> Process svscan (pid: 1348, stackpage=f6901000) 
> Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004  
>        00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000  
>        00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440  
> Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] \
> [mmput+88/176] [do_exit+259/800]   [sig_exit+195/208] [dequeue_signal+95/192] \
> [do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] \
> [sys_nanosleep+232/448]   [do_page_fault+0/1347] [signal_return+20/24] 
> 
> Other than this BUG (that took down the machine hard, I was lucky to log
> across the network), there appear to be no relevant logs shortly before
> this crash.


* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-24 20:58 Marcelo Tosatti
  2004-03-24 20:28 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Marcelo Tosatti @ 2004-03-24 20:58 UTC
  To: Matthias Andree; +Cc: akpm, andrea, linux-kernel


The backtrace is odd to me. 

set_page_dirty() does not call __free_pages_ok() directly or indirectly.

How can it be?

---

Hi,

I found this in the logs of a Dual Athlon MP machine (Tyan board)
running 2.4.25-SMP:

kernel BUG at page_alloc.c:105! 
invalid operand: 0000 
CPU:    0 
EIP: 0010:[__free_pages_ok+80/704]    Not tainted 
EFLAGS: 00010286 
eax: c0333674   ebx: c1b2d720   ecx: 00000000   edx: f22f7a84 
esi: 00000001   edi: 00000000   ebp: 00000001   esp: f6901e3c 
ds: 0018   es: 0018   ss: 0018 
Process svscan (pid: 1348, stackpage=f6901000) 
Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004  
       00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000  
       00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440  
Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] \
[mmput+88/176] [do_exit+259/800]   [sig_exit+195/208] [dequeue_signal+95/192] \
[do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] \
[sys_nanosleep+232/448]   [do_page_fault+0/1347] [signal_return+20/24] 

Other than this BUG (that took down the machine hard, I was lucky to log
across the network), there appear to be no relevant logs shortly before
this crash.




* Re: 2.4.25 SMP - BUG at page_alloc.c:105
  2004-03-24 20:28 ` Andrew Morton
@ 2004-03-24 21:12   ` Matthias Andree
  2004-03-24 21:51   ` Marcelo Tosatti
  1 sibling, 0 replies; 7+ messages in thread
From: Matthias Andree @ 2004-03-24 21:12 UTC
  To: Andrew Morton; +Cc: Marcelo Tosatti, matthias.andree, andrea, linux-kernel

On Wed, 24 Mar 2004, Andrew Morton wrote:

> I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
> anonymous page which had a non-null ->mapping.  It could be a hardware bug.
> Without seeing the actual value of page->mapping it's hard to know.

Any chance to retrieve that when the machine has been rebooted since? I
fear there is none.

I have these log entries from boot-up (after the crash), seems the BIOS
isn't perfect (Tyan S2460 "Tiger MP" w/ BIOS 1.05):

...
128MB HIGHMEM available.
896MB LOWMEM available.
ACPI: have wakeup address 0xc0002000
found SMP MP-table at 000f7510
hm, page 000f7000 reserved twice.
hm, page 000f8000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 000a0000 reserved twice.
On node 0 totalpages: 262144
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32768 pages.
ACPI: Unable to locate RSDP
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: TYAN     Product ID: GUINNESS     APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 16
Processor #0 Pentium(tm) Pro APIC version 16
I/O APIC #2 Version 17 at 0xFEC00000.
Enabling APIC mode: Flat.       Using 1 I/O APICs
Processors: 2
Kernel command line: root=/dev/hda5 vga=791 splash=silent showopts noapic
Initializing CPU#0
Detected 1533.378 MHz processor.
Console: colour dummy device 80x25
Calibrating delay loop... 3060.53 BogoMIPS
Memory: 1032772k/1048576k available (1902k kernel code, 15416k reserved, 636k data, 152k init, 131072k highmem)
...
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0383fbff c1cbfbff 00000000 00000000
CPU:             Common caps: 0383fbff c1cbfbff 00000000 00000000
CPU0: AMD Athlon(tm) MP 1800+ stepping 02
Intel machine check reporting enabled on CPU#1.
CPU:     After generic, caps: 0383fbff c1cbfbff 00000000 00000000
CPU:             Common caps: 0383fbff c1cbfbff 00000000 00000000
CPU1: AMD Athlon(tm) Processor stepping 02
Total of 2 processors activated (6121.06 BogoMIPS).
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1533.3658 MHz.
..... host bus clock speed is 266.6723 MHz.
cpu: 0, clocks: 2666723, slice: 888907
CPU0<T0:2666720,T1:1777808,D:5,S:888907,C:2666723>
cpu: 1, clocks: 2666723, slice: 888907
CPU1<T0:2666720,T1:888896,D:10,S:888907,C:2666723>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs
ACPI: Subsystem revision 20040116
ACPI: Interpreter disabled.
PCI: PCI BIOS revision 2.10 entry at 0xfd7e0, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: ACPI tables contain no PCI IRQ routing entries
PCI: Probing PCI hardware (bus 00)
BIOS failed to enable PCI standards compliance, fixing this error.
I/O APIC: AMD Errata #22 may be present. In the event of instability try
        : booting with the "noapic" option.
...


Don't waste countless efforts debugging this 

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95


* Re: 2.4.25 SMP - BUG at page_alloc.c:105
  2004-03-24 21:51   ` Marcelo Tosatti
@ 2004-03-24 21:36     ` Matthias Andree
  2004-03-25  0:22       ` Marcelo Tosatti
  0 siblings, 1 reply; 7+ messages in thread
From: Matthias Andree @ 2004-03-24 21:36 UTC
  To: Marcelo Tosatti; +Cc: Andrew Morton, matthias.andree, andrea, linux-kernel

On Wed, 24 Mar 2004, Marcelo Tosatti wrote:

> This should work. Matthias, please apply and try to reproduce.

Didn't compile. I have changed that line 119 to bad_page(__FUNCTION__,
page); instead. If the first argument must be something else, let me
know. It doesn't immediately make sense with just one caller, but I know
nothing better right now.

As I don't know a specific scenario to reproduce the crash, it may take
longer (possibly weeks) until I can come up with results.

Here's the error:

gcc -D__KERNEL__ -I/usr/src/linux-2.4.25/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=athlon -nostdinc -iwithprefix include -DKBUILD_BASENAME=page_alloc -DEXPORT_SYMTAB -c page_alloc.c
page_alloc.c: In function `__free_pages_ok':
page_alloc.c:119: warning: passing arg 1 of `bad_page' from incompatible pointer type
page_alloc.c:119: error: too few arguments to function `bad_page'
make[2]: *** [page_alloc.o] Error 1
make[2]: Leaving directory `/usr/src/linux-2.4.25/mm'

The relevant parts of the patch were:

> --- mm/page_alloc.c.orig	2004-03-24 18:42:53.693251224 -0300
> +++ mm/page_alloc.c	2004-03-24 18:47:52.484828000 -0300
> @@ -81,6 +81,20 @@
>   * -- wli
>   */
>  
> +static void bad_page(const char *function, struct page *page)
> +{
> +        printk("Bad page state at %s\n", function);
...
> @@ -101,8 +115,8 @@
>  
>  	if (page->buffers)
>  		BUG();
> -	if (page->mapping)
> -		BUG();
> +	if (page->mapping) 
> +		bad_page(page);
>  	if (!VALID_PAGE(page))
>  		BUG();
>  	if (PageLocked(page))

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95


* Re: 2.4.25 SMP - BUG at page_alloc.c:105
  2004-03-24 20:28 ` Andrew Morton
  2004-03-24 21:12   ` Matthias Andree
@ 2004-03-24 21:51   ` Marcelo Tosatti
  2004-03-24 21:36     ` Matthias Andree
  1 sibling, 1 reply; 7+ messages in thread
From: Marcelo Tosatti @ 2004-03-24 21:51 UTC
  To: Andrew Morton; +Cc: matthias.andree, andrea, linux-kernel


On Wed, Mar 24, 2004 at 12:28:06PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > 
> > The backtrace is odd to me. 
> > 
> > set_page_dirty() does not call __free_pages_ok() directly or indirectly.
> > 
> 
> I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
> anonymous page which had a non-null ->mapping.  It could be a hardware bug.
> Without seeing the actual value of page->mapping it's hard to know.
> 
> It would be good to backport the bad_page() debug code so we get a bit more
> info when this sort of thing happens.

This should work. Matthias, please apply and try to reproduce.

--- mm/page_alloc.c.orig	2004-03-24 18:42:53.693251224 -0300
+++ mm/page_alloc.c	2004-03-24 18:47:52.484828000 -0300
@@ -81,6 +81,20 @@
  * -- wli
  */
 
+static void bad_page(const char *function, struct page *page)
+{
+        printk("Bad page state at %s\n", function);
+        printk("flags:0x%08lx mapping:%p buffers:%p count:%d\n",
+                page->flags, page->mapping,
+		page->buffers, page_count(page));
+        printk("Backtrace:\n");
+        dump_stack();
+	printk("bad_page: Trying to fix it up.\n");
+        set_page_count(page, 0);
+        page->mapping = NULL;
+}
+
+
 static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
 static void __free_pages_ok (struct page *page, unsigned int order)
 {
@@ -101,8 +115,8 @@
 
 	if (page->buffers)
 		BUG();
-	if (page->mapping)
-		BUG();
+	if (page->mapping) 
+		bad_page(page);
 	if (!VALID_PAGE(page))
 		BUG();
 	if (PageLocked(page))


* Re: 2.4.25 SMP - BUG at page_alloc.c:105
  2004-03-24 21:36     ` Matthias Andree
@ 2004-03-25  0:22       ` Marcelo Tosatti
  0 siblings, 0 replies; 7+ messages in thread
From: Marcelo Tosatti @ 2004-03-25  0:22 UTC
  To: Andrew Morton, andrea, linux-kernel

On Wed, Mar 24, 2004 at 10:36:48PM +0100, Matthias Andree wrote:
> On Wed, 24 Mar 2004, Marcelo Tosatti wrote:
> 
> > This should work. Matthias, please apply and try to reproduce.
> 
> Didn't compile. I have changed that line 119 to bad_page(__FUNCTION__,
> page); instead. If the first argument must be something else, let me
> know. It doesn't immediately make sense with just one caller, but I know
> nothing better right now.

Right. My mistake.

> As I don't know a specific scenario to reproduce the crash, it may take
> longer (possibly weeks) until I can come up with results.

Let's wait and see.

Did you try older 2.4 kernels or 2.6?

