* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-24 20:58 Marcelo Tosatti
From: Marcelo Tosatti @ 2004-03-24 20:58 UTC (permalink / raw)
To: Matthias Andree; +Cc: akpm, andrea, linux-kernel
The backtrace is odd to me.
set_page_dirty() does not call __free_pages_ok() directly or indirectly.
How can it be?
---
Hi,
I found this in the logs of a Dual Athlon MP machine (Tyan board)
running 2.4.25-SMP:
kernel BUG at page_alloc.c:105!
invalid operand: 0000
CPU: 0
EIP: 0010:[__free_pages_ok+80/704] Not tainted
EFLAGS: 00010286
eax: c0333674 ebx: c1b2d720 ecx: 00000000 edx: f22f7a84
esi: 00000001 edi: 00000000 ebp: 00000001 esp: f6901e3c
ds: 0018 es: 0018 ss: 0018
Process svscan (pid: 1348, stackpage=f6901000)
Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004
00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000
00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440
Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] \
[mmput+88/176] [do_exit+259/800] [sig_exit+195/208] [dequeue_signal+95/192] \
[do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] \
[sys_nanosleep+232/448] [do_page_fault+0/1347] [signal_return+20/24]
Other than this BUG (that took down the machine hard, I was lucky to log
across the network), there appear to be no relevant logs shortly before
this crash.
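Each bracketed entry in the trace above has the form symbol+offset/size. For example, __free_pages_ok+80/704 means 80 bytes into a function that is 704 bytes long. The following is a minimal sketch for splitting such an entry, assuming the numbers are decimal (as they usually are when klogd resolves the symbols). The names struct trace_entry and parse_trace_entry are hypothetical and exist only for this illustration; they are not part of the kernel or of ksymoops.

```c
#include <stdio.h>

/* Hypothetical helper type: one "[symbol+offset/size]" entry from a trace. */
struct trace_entry {
    char symbol[64];
    unsigned int offset;   /* bytes from the start of the symbol */
    unsigned int size;     /* total size of the symbol in bytes */
};

/*
 * Parse an entry such as "[__free_pages_ok+80/704]".
 * Returns 0 on success and -1 if the entry is malformed.
 * Assumption: offset and size are decimal numbers.
 */
static int parse_trace_entry(const char *text, struct trace_entry *out)
{
    /* %63[^+] reads the symbol name up to the '+' separator. */
    if (sscanf(text, "[%63[^+]+%u/%u]",
               out->symbol, &out->offset, &out->size) != 3)
        return -1;

    /* Sanity check: the offset has to lie inside the symbol. */
    return out->offset < out->size ? 0 : -1;
}
```

Calling parse_trace_entry("[set_page_dirty+166/176]", &e) fills e with the symbol name, offset 166 and size 176. The parse only says where a return address points; it cannot tell whether that address is part of the real call chain or a stale value left on the stack.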
* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-24 20:28 Andrew Morton
From: Andrew Morton @ 2004-03-24 20:28 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: matthias.andree, andrea, linux-kernel
Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> The backtrace is odd to me.
>
> set_page_dirty() does not call __free_pages_ok() directly or indirectly.
>

I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
anonymous page which had a non-null ->mapping. It could be a hardware bug.
Without seeing the actual value of page->mapping it's hard to know.

It would be good to backport the bad_page() debug code so we get a bit more
info when this sort of thing happens.

> ---
>
> Hi,
>
> I found this in the logs of a Dual Athlon MP machine (Tyan board)
> running 2.4.25-SMP:
>
> kernel BUG at page_alloc.c:105!
> invalid operand: 0000
> CPU: 0
> EIP: 0010:[__free_pages_ok+80/704] Not tainted
> EFLAGS: 00010286
> eax: c0333674 ebx: c1b2d720 ecx: 00000000 edx: f22f7a84
> esi: 00000001 edi: 00000000 ebp: 00000001 esp: f6901e3c
> ds: 0018 es: 0018 ss: 0018
> Process svscan (pid: 1348, stackpage=f6901000)
> Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004
> 00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000
> 00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440
> Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] \
> [mmput+88/176] [do_exit+259/800] [sig_exit+195/208] [dequeue_signal+95/192] \
> [do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] \
> [sys_nanosleep+232/448] [do_page_fault+0/1347] [signal_return+20/24]
>
> Other than this BUG (that took down the machine hard, I was lucky to log
> across the network), there appear to be no relevant logs shortly before
> this crash.
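The hypothesis above is about the sanity checks at the top of __free_pages_ok(): a page that is being freed must no longer belong to an address space. Judging from the hunk header of the patch posted later in this thread (@@ -101,8), line 105 appears to be the BUG() under the page->mapping test. The following userspace model is only a sketch of that sequence of checks. struct page_model and free_check are hypothetical names, the VALID_PAGE() test is left out, and the real 2.4 code calls BUG() instead of returning a string.

```c
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's struct page. Only the fields
 * mentioned in this thread are modeled; this is not kernel code.
 */
struct page_model {
    void *mapping;   /* owning address space, expected to be NULL at free time */
    void *buffers;   /* attached buffers, expected to be NULL at free time */
    int locked;      /* models PageLocked() */
};

/*
 * Returns the name of the first violated invariant, or NULL if the
 * page passes the modeled checks. The order follows the context lines
 * of the patch: buffers first, then mapping, then the locked bit.
 */
static const char *free_check(const struct page_model *page)
{
    if (page->buffers)
        return "buffers";
    if (page->mapping)
        return "mapping";
    if (page->locked)
        return "locked";
    return NULL;
}
```

A page with a stale, non-NULL mapping pointer makes free_check() return "mapping", which corresponds to the reported BUG. The model also shows why the value of page->mapping matters for diagnosis: the check itself only reports that the pointer was non-NULL, not what it pointed to.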
* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-24 21:12 Matthias Andree
From: Matthias Andree @ 2004-03-24 21:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Marcelo Tosatti, matthias.andree, andrea, linux-kernel
On Wed, 24 Mar 2004, Andrew Morton wrote:
> I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
> anonymous page which had a non-null ->mapping. It could be a hardware bug.
> Without seeing the actual value of page->mapping it's hard to know.

Any chance to retrieve that when the machine has been rebooted since? I
fear there is none.

I have these log entries from boot-up (after the crash), seems the BIOS
isn't perfect (Tyan S2460 "Tiger MP" w/ BIOS 1.05):

...
128MB HIGHMEM available.
896MB LOWMEM available.
ACPI: have wakeup address 0xc0002000
found SMP MP-table at 000f7510
hm, page 000f7000 reserved twice.
hm, page 000f8000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 000a0000 reserved twice.
On node 0 totalpages: 262144
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32768 pages.
ACPI: Unable to locate RSDP
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: TYAN Product ID: GUINNESS APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 16
Processor #0 Pentium(tm) Pro APIC version 16
I/O APIC #2 Version 17 at 0xFEC00000.
Enabling APIC mode: Flat. Using 1 I/O APICs
Processors: 2
Kernel command line: root=/dev/hda5 vga=791 splash=silent showopts noapic
Initializing CPU#0
Detected 1533.378 MHz processor.
Console: colour dummy device 80x25
Calibrating delay loop... 3060.53 BogoMIPS
Memory: 1032772k/1048576k available (1902k kernel code, 15416k reserved, 636k data, 152k init, 131072k highmem)
...
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff c1cbfbff 00000000 00000000
CPU: Common caps: 0383fbff c1cbfbff 00000000 00000000
CPU0: AMD Athlon(tm) MP 1800+ stepping 02
Intel machine check reporting enabled on CPU#1.
CPU: After generic, caps: 0383fbff c1cbfbff 00000000 00000000
CPU: Common caps: 0383fbff c1cbfbff 00000000 00000000
CPU1: AMD Athlon(tm) Processor stepping 02
Total of 2 processors activated (6121.06 BogoMIPS).
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1533.3658 MHz.
..... host bus clock speed is 266.6723 MHz.
cpu: 0, clocks: 2666723, slice: 888907
CPU0<T0:2666720,T1:1777808,D:5,S:888907,C:2666723>
cpu: 1, clocks: 2666723, slice: 888907
CPU1<T0:2666720,T1:888896,D:10,S:888907,C:2666723>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs
ACPI: Subsystem revision 20040116
ACPI: Interpreter disabled.
PCI: PCI BIOS revision 2.10 entry at 0xfd7e0, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: ACPI tables contain no PCI IRQ routing entries
PCI: Probing PCI hardware (bus 00)
BIOS failed to enable PCI standards compliance, fixing this error.
I/O APIC: AMD Errata #22 may be present. In the event of instability try
: booting with the "noapic" option.
...

Don't waste countless efforts debugging this
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
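The zone sizes in the boot log can be cross-checked against the memory totals it reports. This is a small sketch that assumes the 4 KiB page size normally used on 32-bit x86. PAGE_KIB and pages_to_mib are hypothetical names chosen for this illustration.

```c
/* Assumption: 4 KiB pages, as is usual on 32-bit x86. */
#define PAGE_KIB 4UL

/* Convert a page count taken from the boot log into MiB. */
static unsigned long pages_to_mib(unsigned long pages)
{
    return pages * PAGE_KIB / 1024;
}
```

With the values from the log, pages_to_mib(4096 + 225280) gives 896, which matches "896MB LOWMEM available", and pages_to_mib(32768) gives 128, which matches "128MB HIGHMEM available" and the 131072k highmem figure. The three zones add up to the 262144 total pages, or 1024 MiB (1048576k). The log is therefore internally consistent about memory size, even though it shows BIOS problems elsewhere.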
* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-24 21:51 Marcelo Tosatti
From: Marcelo Tosatti @ 2004-03-24 21:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: matthias.andree, andrea, linux-kernel
On Wed, Mar 24, 2004 at 12:28:06PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > The backtrace is odd to me.
> >
> > set_page_dirty() does not call __free_pages_ok() directly or indirectly.
> >
>
> I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
> anonymous page which had a non-null ->mapping. It could be a hardware bug.
> Without seeing the actual value of page->mapping it's hard to know.
>
> It would be good to backport the bad_page() debug code so we get a bit more
> info when this sort of thing happens.

This should work. Matthias, please apply and try to reproduce.

--- mm/page_alloc.c.orig	2004-03-24 18:42:53.693251224 -0300
+++ mm/page_alloc.c	2004-03-24 18:47:52.484828000 -0300
@@ -81,6 +81,20 @@
  * -- wli
  */
 
+static void bad_page(const char *function, struct page *page)
+{
+	printk("Bad page state at %s\n", function);
+	printk("flags:0x%08lx mapping:%p buffers:%p count:%d\n",
+		page->flags, page->mapping,
+		page->buffers, page_count(page));
+	printk("Backtrace:\n");
+	dump_stack();
+	printk("bad_page: Trying to fix it up.\n");
+	set_page_count(page, 0);
+	page->mapping = NULL;
+}
+
+
 static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
 static void __free_pages_ok (struct page *page, unsigned int order)
 {
@@ -101,8 +115,8 @@
 
 	if (page->buffers)
 		BUG();
-	if (page->mapping)
-		BUG();
+	if (page->mapping)
+		bad_page(page);
 	if (!VALID_PAGE(page))
 		BUG();
 	if (PageLocked(page))
* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-24 21:36 Matthias Andree
From: Matthias Andree @ 2004-03-24 21:36 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Andrew Morton, matthias.andree, andrea, linux-kernel
On Wed, 24 Mar 2004, Marcelo Tosatti wrote:
> This should work. Matthias, please apply and try to reproduce.

Didn't compile. I have changed that line 119 to bad_page(__FUNCTION__,
page); instead. If the first argument must be something else, let me
know. It doesn't immediately make sense with just one caller, but I know
nothing better right now.

As I don't know a specific scenario to reproduce the crash, it may take
longer (possibly weeks) until I can come up with results.

Here's the error:

gcc -D__KERNEL__ -I/usr/src/linux-2.4.25/include -Wall -Wstrict-prototypes
  -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer
  -pipe -mpreferred-stack-boundary=2 -march=athlon -nostdinc
  -iwithprefix include -DKBUILD_BASENAME=page_alloc -DEXPORT_SYMTAB
  -c page_alloc.c
page_alloc.c: In function `__free_pages_ok':
page_alloc.c:119: warning: passing arg 1 of `bad_page' from incompatible pointer type
page_alloc.c:119: error: too few arguments to function `bad_page'
make[2]: *** [page_alloc.o] Error 1
make[2]: Leaving directory `/usr/src/linux-2.4.25/mm'

The relevant parts of the patch were:

> --- mm/page_alloc.c.orig	2004-03-24 18:42:53.693251224 -0300
> +++ mm/page_alloc.c	2004-03-24 18:47:52.484828000 -0300
> @@ -81,6 +81,20 @@
>   * -- wli
>   */
>
> +static void bad_page(const char *function, struct page *page)
> +{
> +	printk("Bad page state at %s\n", function);
...
> @@ -101,8 +115,8 @@
>
> 	if (page->buffers)
> 		BUG();
> -	if (page->mapping)
> -		BUG();
> +	if (page->mapping)
> +		bad_page(page);
> 	if (!VALID_PAGE(page))
> 		BUG();
> 	if (PageLocked(page))

--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
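The compile error comes from calling the two-parameter bad_page() with a single argument. The following userspace sketch shows the corrected call pattern under stated assumptions: printf() stands in for printk(), struct fake_page and free_page_checked are hypothetical names, and the standard C99 __func__ is used where the message above uses GCC's __FUNCTION__. It illustrates the report-and-repair idea only and is not the kernel implementation.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for the kernel's struct page. */
struct fake_page {
    void *mapping;
    int count;
};

/* Report the inconsistent state, then repair it so the free can go on. */
static void bad_page(const char *function, struct fake_page *page)
{
    printf("Bad page state at %s\n", function);
    printf("mapping:%p count:%d\n", page->mapping, page->count);
    page->count = 0;
    page->mapping = NULL;
}

/* Returns 1 if the page had to be fixed up, 0 if it was already clean. */
static int free_page_checked(struct fake_page *page)
{
    if (page->mapping) {
        bad_page(__func__, page);   /* both arguments supplied */
        return 1;
    }
    return 0;
}
```

Passing the caller's name as the first argument looks redundant with a single call site, as noted above, but it keeps the helper reusable if further checks are converted from BUG() to bad_page() later.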
* Re: 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-25 0:22 Marcelo Tosatti
From: Marcelo Tosatti @ 2004-03-25 0:22 UTC (permalink / raw)
To: Andrew Morton, andrea, linux-kernel
On Wed, Mar 24, 2004 at 10:36:48PM +0100, Matthias Andree wrote:
> On Wed, 24 Mar 2004, Marcelo Tosatti wrote:
>
> > This should work. Matthias, please apply and try to reproduce.
>
> Didn't compile. I have changed that line 119 to bad_page(__FUNCTION__,
> page); instead. If the first argument must be something else, let me
> know. It doesn't immediately make sense with just one caller, but I know
> nothing better right now.

Right. My mistake.

> As I don't know a specific scenario to reproduce the crash, it may take
> longer (possibly weeks) until I can come up with results.

Let's wait and see. Did you try older 2.4's or 2.6?
* 2.4.25 SMP - BUG at page_alloc.c:105
@ 2004-03-22 14:49 Matthias Andree
From: Matthias Andree @ 2004-03-22 14:49 UTC (permalink / raw)
To: Linux-Kernel mailing list
Hi,
I found this in the logs of a Dual Athlon MP machine (Tyan board)
running 2.4.25-SMP:
kernel BUG at page_alloc.c:105!
invalid operand: 0000
CPU: 0
EIP: 0010:[__free_pages_ok+80/704] Not tainted
EFLAGS: 00010286
eax: c0333674 ebx: c1b2d720 ecx: 00000000 edx: f22f7a84
esi: 00000001 edi: 00000000 ebp: 00000001 esp: f6901e3c
ds: 0018 es: 0018 ss: 0018
Process svscan (pid: 1348, stackpage=f6901000)
Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004
00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000
00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440
Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] [mmput+88/176] [do_exit+259/800]
[sig_exit+195/208] [dequeue_signal+95/192] [do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] [sys_nanosleep+232/448]
[do_page_fault+0/1347] [signal_return+20/24]
Other than this BUG (that took down the machine hard, I was lucky to log
across the network), there appear to be no relevant logs shortly before
this crash.
What's causing this?
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95