Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels

public inbox for linux-kernel@vger.kernel.org

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
@ 2004-06-15 17:44 Nick Warne
  2004-06-15 19:15 ` Stian Jordet
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Warne @ 2004-06-15 17:44 UTC (permalink / raw)
  To: linux-kernel

FYI.

I have a box here that was originally running 2.4.x.  I updated to
2.6.x a few months ago, and all was well.  Then I started to get
curious oopses, no two of them the same.

I started to suspect NFS, as I use an old 486 to hold the web pages
and serve them to the box via NFS... the oops occurred every Saturday
morning at 4:02.  That led me to think it was some sort of cron.weekly
issue with the disc activity and file access or the like... I didn't
know - I was on a fishing exercise (with a lot of searching on the
LKML).

But after I talked to a member of the HantsLUG and showed him logs and
stuff, he brought up the swap size.  This box once had 64MB of RAM, but
now has 128MB - with 128MB swap.  I created an additional swap file
(256MB), and (touch wood) no oops since, all healthy :)  I never
looked at this before, as swap was never used _during_ normal running
of the box, but as he said, maybe the cron.weekly ran a lot of stuff
that did use it up...

Nick

-- 
"When you're chewing on life's gristle,
Don't grumble, Give a whistle..."
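
One way to test the swap theory above would be to sample swap usage while
cron.weekly is running. The following is only a minimal sketch, assuming a
/proc/meminfo that carries SwapTotal and SwapFree lines in kB; the function
name and the sample numbers are illustrative and do not come from the
original mail.

```python
def swap_usage_kb(meminfo_text):
    """Return (total_kb, used_kb) parsed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


# Made-up sample text (assumption), used when /proc/meminfo is unavailable.
SAMPLE = "MemTotal:  127000 kB\nSwapTotal: 131064 kB\nSwapFree: 1024 kB\n"

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    total, used = swap_usage_kb(text)
    print(f"swap used: {used} kB of {total} kB")
```

Run from cron every minute around 4:00 on Saturday, something like this
would show whether swap really fills up before the oops.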

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 17:44 Oopses with both recent 2.4.x kernels and 2.6.x kernels Nick Warne
@ 2004-06-15 19:15 ` Stian Jordet
  0 siblings, 0 replies; 13+ messages in thread
From: Stian Jordet @ 2004-06-15 19:15 UTC (permalink / raw)
  To: Nick Warne; +Cc: linux-kernel

On Tue, 15 Jun 2004 at 18:44 +0100, Nick Warne wrote:
> But after I talked to a member of the HantsLUG and showed him logs and
> stuff, he brought up the swap size.  This box once had 64MB of RAM, but
> now has 128MB - with 128MB swap.  I created an additional swap file
> (256MB), and (touch wood) no oops since, all healthy :)  I never
> looked at this before, as swap was never used _during_ normal running
> of the box, but as he said, maybe the cron.weekly ran a lot of stuff
> that did use it up...

I doubt that was my problem... The box in question had 180 MB of RAM
and 512 MB of swap. The script can't have used that much. And even if
it did, it is a bug that the box oopses and dies, I guess.

Best regards,
Stian



* Oopses with both recent 2.4.x kernels and 2.6.x kernels
@ 2004-02-03 18:26 Stian Jordet
  2004-02-05 23:51 ` Marcelo Tosatti
  0 siblings, 1 reply; 13+ messages in thread
From: Stian Jordet @ 2004-02-03 18:26 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,

I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
each, without problems. After an upgrade to 2.4.22, the box hasn't been
up for 30 days in a row. This happened in early November. I have captured
oopses with both 2.4.23 and 2.6.1, which I have sent decoded to the list,
but I have never got any reply.

I have run memtest86 on the box, with no errors. What else can be the
problem? I could of course go back to 2.4.19, which I know worked fine,
but there have been some security holes fixed since then...

Any thoughts?

Best regards,
Stian


* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-03 18:26 Stian Jordet
@ 2004-02-05 23:51 ` Marcelo Tosatti
  2004-03-02 11:03   ` Stian Jordet
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2004-02-05 23:51 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Linux Kernel Mailing List



On Tue, 3 Feb 2004, Stian Jordet wrote:

> Hello,
>
> I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> each, without problems. After an upgrade to 2.4.22, the box hasn't been
> up for 30 days in a row. This happened in early November. I have captured
> oopses with both 2.4.23 and 2.6.1, which I have sent decoded to the list,
> but I have never got any reply.
>
> I have run memtest86 on the box, with no errors. What else can be the
> problem? I could of course go back to 2.4.19, which I know worked fine,
> but there have been some security holes fixed since then...
>
> Any thoughts?

Stian,

I have seen your 2.4.x oopses and they seemed odd. The faults were
happening in different functions (mostly inside VM "freeing"), due to
what seems to be random crap in memory:

 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
c0132e86
*pde = 00000000

eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4

>>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====

>>edi; c17e38c0 <_end+14b5844/bd23f84>
>>ebp; c1047a00 <_end+d19984/bd23f84>
>>esp; c86cbdb4 <_end+839dd38/bd23f84>

Trace; c0132fdc <try_to_free_buffers+c0/ec>

Code;  c0132e86 <sync_page_buffers+e/a4>
00000000 <_EIP>:
Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
   0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
Code;  c0132e8a <sync_page_buffers+12/a4>
   4:   74 7c                     je     82 <_EIP+0x82> c0132f08 <sync_page_buffers+90/a4>
Code;  c0132e8c <sync_page_buffers+14/a4>
   6:   b8 07 00 00 00            mov    $0x7,%eax
Code;  c0132e91 <sync_page_buffers+19/a4>




 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000028
c015e3a2
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c015e3a2>]    Not tainted
EFLAGS: 00010203

eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000

Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
00000000 <_EIP>:
Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
   0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
   3:   f6 42 19 04               testb  $0x4,0x19(%edx)
Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
   7:   74 17                     je     20 <_EIP+0x20> c015e3c2 <journal_try_to_free_buffers+7a/98>

And other similar oopses.

Are you sure there is nothing messing up the hardware?

How long have you run memtest86? It can sometimes take a long time to
show up errors.

The 2.6.x oopses on the same hardware are also a useful source of
information.
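
In both 2.4 traces above, the faulting instruction dereferences %ebx with a
small displacement, so the reported fault address should simply be ebx plus
that displacement. The sketch below checks the arithmetic with the register
values and displacements copied from the traces; the helper name is
illustrative only.

```python
def effective_address(base, displacement):
    """Address touched by an x86 `disp(%reg)` operand, wrapped to 32 bits."""
    return (base + displacement) & 0xFFFFFFFF


# (function, ebx from the register dump, displacement from the Code; line)
faults = [
    ("sync_page_buffers", 0x00000009, 0x18),            # testb $0x6,0x18(%ebx)
    ("journal_try_to_free_buffers", 0x00000000, 0x28),  # mov 0x28(%ebx),%ebx
]

for name, ebx, disp in faults:
    print(f"{name}: fault at {effective_address(ebx, disp):08x}")
```

The results match the addresses in the two "Unable to handle" lines. In the
first trace ebx holds 9, which is neither NULL nor a plausible kernel
pointer; that is at least consistent with a stored pointer having been
overwritten, though it does not say by what.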


* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-05 23:51 ` Marcelo Tosatti
@ 2004-03-02 11:03   ` Stian Jordet
  2004-03-02 12:31     ` Stian Jordet
  2004-06-14 17:07     ` Steven Dake
  0 siblings, 2 replies; 13+ messages in thread
From: Stian Jordet @ 2004-03-02 11:03 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

On Fri, 06 Feb 2004 at 00:51, Marcelo Tosatti wrote:
> On Tue, 3 Feb 2004, Stian Jordet wrote:
> 
> > Hello,
> >
> > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > each, without problems. After an upgrade to 2.4.22, the box hasn't been
> > up for 30 days in a row. This happened in early November. I have captured
> > oopses with both 2.4.23 and 2.6.1, which I have sent decoded to the list,
> > but I have never got any reply.
> >
> > I have run memtest86 on the box, with no errors. What else can be the
> > problem? I could of course go back to 2.4.19, which I know worked fine,
> > but there have been some security holes fixed since then...
> >
> > Any thoughts?
> 
> Stian,
> 
> I have seen your 2.4.x oopses and they seemed odd. The faults were
> happening in different functions (mostly inside VM "freeing"), due to
> what seems to be random crap in memory:
>
>  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
> c0132e86
> *pde = 00000000
> 
> eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
> esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4
> 
> >>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====
> 
> >>edi; c17e38c0 <_end+14b5844/bd23f84>
> >>ebp; c1047a00 <_end+d19984/bd23f84>
> >>esp; c86cbdb4 <_end+839dd38/bd23f84>
> 
> Trace; c0132fdc <try_to_free_buffers+c0/ec>
> 
> Code;  c0132e86 <sync_page_buffers+e/a4>
> 00000000 <_EIP>:
> Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
>    0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
> Code;  c0132e8a <sync_page_buffers+12/a4>
>    4:   74 7c                     je     82 <_EIP+0x82> c0132f08 <sync_page_buffers+90/a4>
> Code;  c0132e8c <sync_page_buffers+14/a4>
>    6:   b8 07 00 00 00            mov    $0x7,%eax
> Code;  c0132e91 <sync_page_buffers+19/a4>
> 
> 
> 
> 
>  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000028
> c015e3a2
> *pde = 00000000
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c015e3a2>]    Not tainted
> EFLAGS: 00010203
> 
> eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000
> 
> Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
> 00000000 <_EIP>:
> Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
>    0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
> Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
>    3:   f6 42 19 04               testb  $0x4,0x19(%edx)
> Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
>    7:   74 17                     je     20 <_EIP+0x20> c015e3c2 <journal_try_to_free_buffers+7a/98>
> 
> And other similar oopses.
> 
> Are you sure there is nothing messing up the hardware?
> 
> How long have you run memtest86? It can sometimes take a long time to
> show up errors.
> 
> The 2.6.x oopses on the same hardware are also a useful source of
> information.

Marcelo,

sorry for getting back to you so insanely late. This was a production
server, and I have now moved the services it was running to another
box, so I could run a more exhaustive memtest86. It has now run for two
days, without any errors. Of course there could be other flaky hardware,
but since I don't know any way to test it, and the oops occurs at
two-to-four-week intervals, it's quite time consuming to find out. I'm
not even sure if it will oops without the typical load it used to have.

Anyway, thank you very much for at least answering me. Much appreciated
:)

Best regards,
Stian



* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03   ` Stian Jordet
@ 2004-03-02 12:31     ` Stian Jordet
  2004-03-09 19:22       ` Marcelo Tosatti
  2004-06-14 17:07     ` Steven Dake
  1 sibling, 1 reply; 13+ messages in thread
From: Stian Jordet @ 2004-03-02 12:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 86 bytes --]

Btw, here is one of the 2.6.x oopses as well (as you requested).

Best regards,
Stian

[-- Attachment #2: syslog.txt --]
[-- Type: text/plain, Size: 5902 bytes --]

Jan 20 00:20:04 dodge kernel: ------------[ cut here ]------------
Jan 20 00:20:04 dodge kernel: kernel BUG at mm/page_alloc.c:201!
Jan 20 00:20:04 dodge kernel: invalid operand: 0000 [#1]
Jan 20 00:20:04 dodge kernel: CPU:    0
Jan 20 00:20:04 dodge kernel: EIP:    0060:[free_pages_bulk+482/512]    Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010002
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel: eax: 00000001   ebx: c00609c8   ecx: 00000000   edx: 666026a5
Jan 20 00:20:04 dodge kernel: esi: 666026a4   edi: ffffffff   ebp: 33301352   esp: c86d5d90
Jan 20 00:20:04 dodge kernel: ds: 007b   es: 007b   ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c038ad80 c00609c8 00000000 c038ae40 00000001 c00609a0 c038adbc 00000000 
Jan 20 00:20:04 dodge kernel:        c1000000 c038adbc 00000086 ffffffff c038ad80 c00609a0 c038ae5c 00000246 
Jan 20 00:20:04 dodge kernel:        c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c00609a0 00000005 
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel:  [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel:  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel:  [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel:  [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel:  [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
Jan 20 00:20:04 dodge kernel: 
Jan 20 00:20:04 dodge kernel: Code: 0f 0b c9 00 9b d1 34 c0 e9 51 ff ff ff 0f 0b bc 00 9b d1 34 
Jan 20 00:20:04 dodge kernel:  <1>Unable to handle kernel paging request at virtual address 00100104
Jan 20 00:20:04 dodge kernel:  printing eip:
Jan 20 00:20:04 dodge kernel: c0133cbf
Jan 20 00:20:04 dodge kernel: *pde = 00000000
Jan 20 00:20:04 dodge kernel: Oops: 0002 [#2]
Jan 20 00:20:04 dodge kernel: CPU:    0
Jan 20 00:20:04 dodge kernel: EIP:    0060:[free_pages_bulk+143/512]    Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010003
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x8f/0x200
Jan 20 00:20:04 dodge kernel: eax: c00609a8   ebx: c038ae5c   ecx: 00200200   edx: 00100100
Jan 20 00:20:04 dodge kernel: esi: c00609a0   edi: c038ae5c   ebp: 00000203   esp: c86d5b34
Jan 20 00:20:04 dodge kernel: ds: 007b   es: 007b   ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c11a48c0 00268000 00000000 c038af54 00000001 c1044d20 c038aed0 00000000 
Jan 20 00:20:04 dodge kernel:        c1000000 c038adbc 00000082 ffffffff c038ad80 c100a230 c038ae5c 00000203 
Jan 20 00:20:04 dodge kernel:        c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c5c7ac0c 0020d000 
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [zap_pte_range+334/400] zap_pte_range+0x14e/0x190
Jan 20 00:20:04 dodge kernel:  [zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
Jan 20 00:20:04 dodge kernel:  [unmap_page_range+75/128] unmap_page_range+0x4b/0x80
Jan 20 00:20:04 dodge kernel:  [unmap_vmas+254/544] unmap_vmas+0xfe/0x220
Jan 20 00:20:04 dodge kernel:  [exit_mmap+109/384] exit_mmap+0x6d/0x180
Jan 20 00:20:04 dodge kernel:  [mmput+81/160] mmput+0x51/0xa0
Jan 20 00:20:04 dodge kernel:  [do_exit+290/800] do_exit+0x122/0x320
Jan 20 00:20:04 dodge kernel:  [do_invalid_op+0/208] do_invalid_op+0x0/0xd0
Jan 20 00:20:04 dodge kernel:  [die+203/208] die+0xcb/0xd0
Jan 20 00:20:04 dodge kernel:  [do_invalid_op+202/208] do_invalid_op+0xca/0xd0
Jan 20 00:20:04 dodge kernel:  [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel:  [update_wall_time+22/64] update_wall_time+0x16/0x40
Jan 20 00:20:04 dodge kernel:  [do_timer+224/240] do_timer+0xe0/0xf0
Jan 20 00:20:04 dodge kernel:  [timer_interrupt+56/240] timer_interrupt+0x38/0xf0
Jan 20 00:20:04 dodge kernel:  [handle_IRQ_event+73/128] handle_IRQ_event+0x49/0x80
Jan 20 00:20:04 dodge kernel:  [do_IRQ+140/240] do_IRQ+0x8c/0xf0
Jan 20 00:20:04 dodge kernel:  [error_code+45/56] error_code+0x2d/0x38
Jan 20 00:20:04 dodge kernel:  [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel:  [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel:  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel:  [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel:  [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel:  [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
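
One possible reading of the second oops (#2) in the log above, offered as an
assumption rather than a diagnosis: 2.6 kernels poison the pointers of a
list entry on list_del() with the constants LIST_POISON1 (0x00100100) and
LIST_POISON2 (0x00200200). Both values appear in edx and ecx, and the fault
address is LIST_POISON1 + 4, which is where an unlink would write `next->prev`
on a machine with 4-byte pointers if `next` already held the poison. The
sketch below only checks that arithmetic against the logged values; the
function names are illustrative.

```python
LIST_POISON1 = 0x00100100  # stored in entry->next by list_del() in 2.6
LIST_POISON2 = 0x00200200  # stored in entry->prev by list_del() in 2.6
PREV_OFFSET = 4            # offset of `prev` in struct list_head, 4-byte pointers

# Values copied from oops #2 in the syslog attachment.
regs = {"eax": 0xC00609A8, "ebx": 0xC038AE5C, "ecx": 0x00200200, "edx": 0x00100100}
fault_addr = 0x00100104


def poisoned_registers(regs):
    """Names of registers that hold one of the list poison values."""
    return sorted(r for r, v in regs.items() if v in (LIST_POISON1, LIST_POISON2))


def matches_unlink_of_deleted_entry(addr):
    """True if addr is where `next->prev = prev` writes when next is LIST_POISON1."""
    return addr == LIST_POISON1 + PREV_OFFSET


print(poisoned_registers(regs))
print(matches_unlink_of_deleted_entry(fault_addr))
```

If this reading holds, the second oops would be a page being unlinked from a
free list twice. Since it happens while the process is exiting after the
first BUG, it may be a consequence of that first failure rather than a
separate problem.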


* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 12:31     ` Stian Jordet
@ 2004-03-09 19:22       ` Marcelo Tosatti
  2004-03-09 22:28         ` Stian Jordet
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2004-03-09 19:22 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List



On Tue, 2 Mar 2004, Stian Jordet wrote:

> Btw, here is one of the 2.6.x oopses as well (as you requested).

Stian, 

This sounds like bad hardware. Did I already ask you to try memtest86 ? 



* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-09 19:22       ` Marcelo Tosatti
@ 2004-03-09 22:28         ` Stian Jordet
  0 siblings, 0 replies; 13+ messages in thread
From: Stian Jordet @ 2004-03-09 22:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

Quoting Marcelo Tosatti <marcelo.tosatti@cyclades.com>:

> 
> 
> On Tue, 2 Mar 2004, Stian Jordet wrote:
> 
> > Btw, here is one of the 2.6.x oopses as well (as you requested).
> 
> Stian, 
> 
> This sounds like bad hardware. Did I already ask you to try memtest86 ? 
> 

Yup, and I think I wrote that I had it running for almost two days with no
errors.  Oh well. Thanks for looking into this :) I guess I'll try to afford a
new server (very) soon.

Best regards,
Stian


* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03   ` Stian Jordet
  2004-03-02 12:31     ` Stian Jordet
@ 2004-06-14 17:07     ` Steven Dake
  2004-06-14 18:26       ` Chris Shoemaker
  2004-06-15 13:16       ` Marcelo Tosatti
  1 sibling, 2 replies; 13+ messages in thread
From: Steven Dake @ 2004-06-14 17:07 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

Marcelo and Stian,

I have also seen this oops relating to low memory situations.  I think
ext3 allocates some data, gets a null return, sets something to null, and
then later it is dereferenced in kswapd.

Anyone have a patch for this problem?

Thanks
-steve




* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07     ` Steven Dake
@ 2004-06-14 18:26       ` Chris Shoemaker
  2004-06-15 13:16       ` Marcelo Tosatti
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Shoemaker @ 2004-06-14 18:26 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Marcelo Tosatti, Linux Kernel Mailing List

Marcelo, Stian, Steve,
	I have also seen this, more than a dozen times -- always related
	to writes to my ext3 partition concurrent with heavy swapping
	due to memory pressure.  My oopses are usually not NULL
	dereferences, though, but simply bad addresses.  In the two or
	three cases where I've actually traced the oops back to the code
	(search the April archives for my name), it's been a corrupted
	pointer, always in the high part of the word, but not always the
	same part.  For weeks, I thought that it was flaky RAM, but since
	days of memtest86 didn't fail, I went back to trying to strip
	more stuff out of the kernel.  Since the last cut, I haven't had
	a single oops, for approx. 3 weeks of uptime.  The previous
	pattern was 1 to 3 days between oopses.
	
	I didn't really learn much that would help you.  However, if you
	are trying to reproduce the failure more rapidly, my experience
	would suggest that it is necessary to run a memory hog
	concurrently with an I/O process (e.g. a kernel compile).  Either
	one alone may run for days with no failure.

	Let me know if I can help.

	-Chris
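
The reproduction recipe above (a memory hog running at the same time as an
I/O load) can be sketched roughly as follows. This is a stand-in, not what
Chris actually ran: he used a kernel compile as the I/O load, while this
sketch just writes a scratch file. All function names and sizes are
assumptions, and the demo sizes are far too small to create real memory
pressure.

```python
import os
import tempfile
import threading

PAGE = 4096  # assumed page size


def memory_hog(n_bytes):
    """Allocate n_bytes and touch every page; return the number of pages touched."""
    buf = bytearray(n_bytes)
    for off in range(0, n_bytes, PAGE):
        buf[off] = 1
    return sum(buf[off] for off in range(0, n_bytes, PAGE))


def io_writer(directory, n_bytes, block=1 << 16):
    """Write n_bytes to a scratch file in `directory`, fsync it, return bytes written."""
    written = 0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            while written < n_bytes:
                chunk = min(block, n_bytes - written)
                f.write(b"\0" * chunk)
                written += chunk
            f.flush()
            os.fsync(f.fileno())
    finally:
        os.unlink(path)
    return written


def stress(directory, mem_bytes, io_bytes):
    """Run the memory hog and the writer concurrently and collect their results."""
    results = {}

    def run(key, fn, *args):
        results[key] = fn(*args)

    threads = [
        threading.Thread(target=run, args=("pages", memory_hog, mem_bytes)),
        threading.Thread(target=run, args=("written", io_writer, directory, io_bytes)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(stress(tempfile.gettempdir(), 8 * PAGE, 1 << 17))
```

To get anywhere near the conditions described, mem_bytes would have to
exceed physical RAM so the box swaps, and `directory` would have to sit on
the ext3 partition in question.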



* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07     ` Steven Dake
  2004-06-14 18:26       ` Chris Shoemaker
@ 2004-06-15 13:16       ` Marcelo Tosatti
  2004-06-15 14:35         ` Stian Jordet
  2004-06-15 17:56         ` Steven Dake
  1 sibling, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2004-06-15 13:16 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Linux Kernel Mailing List

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
> 
> I have also seen this oops relating to low memory situations.  I think
> ext3 allocates some data, gets a null return, sets something to null, and
> then later it is dereferenced in kswapd.
> 
> Anyone have a patch for this problem?

Steven, 

From what I remember, Stian's oopses were happening in random places in the VM freeing
routines. That makes me believe what he was seeing was some kind of hardware issue,
because otherwise the oopses would be happening in the same place (if it were
a software bug). The codepaths in which he saw attempts to access invalid addresses are
executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.

Assuming his HW is not faulty, I can think of some driver corrupting his memory.

Do you have any traces of the oopses you are seeing?

Stian, you told us you switched servers; I assume the problem is now gone?
Are you still running v2.4 on that server?

Thanks!
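
The heuristic above (oopses scattered over many places suggest hardware or
memory corruption, oopses at one place suggest a software bug) can be
sketched as a simple tally of faulting sites. The patterns below cover only
the two line formats that appear in this thread, the sample lines are copied
from the traces posted earlier, and the function name is illustrative.

```python
import re
from collections import Counter

EIP_PATTERNS = [
    # ksymoops output (2.4): ">>EIP; c0132e86 <sync_page_buffers+e/a4>"
    re.compile(r">>EIP;\s+[0-9a-f]+\s+<([^+>]+)\+([0-9a-f]+)/"),
    # 2.6 syslog: "EIP is at free_pages_bulk+0x1e2/0x200"
    re.compile(r"EIP is at ([^+\s]+)\+0x([0-9a-f]+)/"),
]


def faulting_sites(log_text):
    """Count oopses per (function, offset) found in the log text."""
    sites = Counter()
    for line in log_text.splitlines():
        for pattern in EIP_PATTERNS:
            m = pattern.search(line)
            if m:
                sites[(m.group(1), int(m.group(2), 16))] += 1
                break
    return sites


SAMPLE = """\
>>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====
EIP is at free_pages_bulk+0x1e2/0x200
EIP is at free_pages_bulk+0x8f/0x200
"""

for (func, offset), count in sorted(faulting_sites(SAMPLE).items()):
    print(f"{func}+0x{offset:x}: {count}")
```

With only a handful of traces a tally like this proves nothing either way;
it just makes the "same place or not" question easy to answer once more
reports are collected.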


* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16       ` Marcelo Tosatti
@ 2004-06-15 14:35         ` Stian Jordet
  2004-06-15 17:56         ` Steven Dake
  1 sibling, 0 replies; 13+ messages in thread
From: Stian Jordet @ 2004-06-15 14:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Steven Dake, Linux Kernel Mailing List

On Tue, 15 Jun 2004 at 10:16 -0300, Marcelo Tosatti wrote:
> Stian, you told us you switched servers; I assume the problem is now gone?
> Are you still running v2.4 on that server?

I switched servers, and the oops is gone. The old server is still
running, but without the nightly memory-intensive Perl script, and I
have never seen an oops since I stopped running that. Both servers are
running 2.6 now.

Best regards,
Stian



* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16       ` Marcelo Tosatti
  2004-06-15 14:35         ` Stian Jordet
@ 2004-06-15 17:56         ` Steven Dake
  1 sibling, 0 replies; 13+ messages in thread
From: Steven Dake @ 2004-06-15 17:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Stian Jordet, Linux Kernel Mailing List

On Tue, 2004-06-15 at 06:16, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> > Marcelo and Stian,
> > 
> > I have also seen this oops relating to low memory situations.  I think
> > ext3 allocates some data, has a null return, sets something to null, and
> > then later it is dereferenced in kswapd.
> > 
> > Anyone have a patch for this problem?

> Steven, 
> 
> From what I remember, Stian's oopses were happening in random places in the VM
> freeing routines. That makes me believe he was seeing some kind of hardware
> issue, because a software bug would make the oopses happen in the same place.
> The code paths in which he saw invalid addresses being accessed are executed
> flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.
> 
> Assuming his HW is not faulty, I can think of some driver corrupting his memory. 
> 
> Do you have any traces of the oopses you are seeing?  
> 
> Stian, you told us you have switched servers now; I assume the problem is gone?
> Are you still running v2.4 on that server?
> 

Marcelo,

Stian responded saying he upgraded to 2.6 and also removed the
memory-intensive script, and his problems went away.  I suspect removing the
memory-intensive script did the trick.  2.6 could also be fixed, who
knows?

After reading lkml, I found about 4-5 oops reports in the same function at
the same location.  In some of those bug reports, people say they were using
memory heavily (a low-memory situation).

I have tracked down the problem to a null dereference during a buffer
cache rebalance (which occurs during low-memory situations).  Here is
the info I have.  Unfortunately I don't know much about how the VM handles
low-memory situations with the VFS, so if you have any ideas, it would
be helpful. :)

Unable to handle kernel NULL pointer dereference at virtual address 00000028
<4> printing eip:
<4>c018aa67
<1>*pde = 00000000
<4>Oops: 0000
<4>CPU:	 2
<4>EIP:	 0010:[<c018aa67>]    Not tainted
<4>EFLAGS: 00010203
<4>eax: 00000000	 ebx: 00000000	 ecx: c0380490	 edx: c217a540
<4>esi: f2757a80	 edi: 00000001	 ebp: 00000000	 esp: f7bd1ee8
<4>ds: 0018   es: 0018   ss: 0018
<4>Process kswapd (pid: 11, stackpage=f7bd1000)
<4>Stack: 00000000 00000000 c217a600 00000000 00000202 00000000 c217a540 000001d0
<4>	c217a540 000102ec c018144f f7abe400 c217a540 000001d0 c0155006 c217a540
<4>	000001d0 f7bd0000 00000000 c0147f0a c217a540 000001d0 f7bd0000 f7bd0000
<4>Call Trace: [<c018144f>] [<c0155006>] [<c0147f0a>] [<c014832b>] [<c0148398>]
<4>   [<c0148431>] [<c014847f>] [<c0148593>] [<c0105000>] [<c010578a>] [<c01484f8>]
<4>
<4>Code: 8b 5b 28 f6 40 19 02 75 47 39 f3 75 f1 c6 05 80 24 38 c0 01


>>EIP; c018aa67 <journal_try_to_free_buffers+45/f4>   <=====
Trace; c018144f <ext3_releasepage+2d/32>
Trace; c0155006 <try_to_release_page+4e/78>
Trace; c0147f0a <shrink_cache+270/4f6>
Trace; c014832b <shrink_caches+61/98>
Trace; c0148398 <try_to_free_pages+36/54>
Trace; c0148431 <kswapd_balance_pgdat+51/8a>
Trace; c014847f <kswapd_balance+15/2c>
Trace; c0148593 <kswapd+9b/b6>
Trace; c0105000 <_stext+0/0>
Trace; c010578a <kernel_thread+2e/40>
Trace; c01484f8 <kswapd+0/b6>

The problem is that in this function:

int journal_try_to_free_buffers(journal_t *journal,
		struct page *page, int gfp_mask)
{
	struct buffer_head *bh;
	struct buffer_head *tmp;
	int locked_or_dirty = 0;
	int call_ttfb = 1;

	J_ASSERT(PageLocked(page));

	bh = page->buffers;
	tmp = bh;
	spin_lock(&journal_datalist_lock);
	do {
		struct buffer_head *p = tmp;

		tmp = tmp->b_this_page;
	...

tmp (initialized from page->buffers) above is null.  b_this_page is at offset
0x28, which is the faulting address in the oops.  This means that
page->buffers was set to null by some other routine, which results in the oops.

I read the page allocation code
(ext3_readpage->block_read_full_page->create_empty_buffers->create_buffers),
and it appears that the allocation path cannot leave page->buffers set to
zero.  I am having difficulty reproducing the oops, however, so I cannot debug
further.  Can page->buffers be set to zero somewhere else?  Perhaps kswapd and
some other thread are racing on the free?

Thanks
-steve
> Thanks!


