public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Oopses with both recent 2.4.x kernels and 2.6.x kernels
@ 2004-02-03 18:26 Stian Jordet
  2004-02-05 23:51 ` Marcelo Tosatti
  0 siblings, 1 reply; 22+ messages in thread
From: Stian Jordet @ 2004-02-03 18:26 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,

I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
each, without problems. After an upgrade to 2.4.22, the box haven't been
up for 30 days in a row. This happened early november. I have caputered
oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
but have never got any reply.

I have ran memtest86 on the box, no errors. What else can be the
problem? I could of course go back to 2.4.19, which I know worked fine,
but I there have been some fixed security holes since then...

Any thoughts?

Best regards,
Stian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-03 18:26 Oopses with both recent 2.4.x kernels and 2.6.x kernels Stian Jordet
@ 2004-02-05 23:51 ` Marcelo Tosatti
  2004-03-02 11:03   ` Stian Jordet
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2004-02-05 23:51 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Linux Kernel Mailing List



On Tue, 3 Feb 2004, Stian Jordet wrote:

> Hello,
>
> I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> each, without problems. After an upgrade to 2.4.22, the box haven't been
> up for 30 days in a row. This happened early november. I have caputered
> oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
> but have never got any reply.
>
> I have ran memtest86 on the box, no errors. What else can be the
> problem? I could of course go back to 2.4.19, which I know worked fine,
> but I there have been some fixed security holes since then...
>
> Any thoughts?

Stian,

I have seen your 2.4.x oopses and they seemed odd. The faults were
happening in different functions (mostly inside VM "freeing" , due to
what seems to be random crap in memory:

 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
c0132e86
*pde = 00000000

eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4

>>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====

>>edi; c17e38c0 <_end+14b5844/bd23f84>
>>ebp; c1047a00 <_end+d19984/bd23f84>
>>esp; c86cbdb4 <_end+839dd38/bd23f84>

Trace; c0132fdc <try_to_free_buffers+c0/ec>

Code;  c0132e86 <sync_page_buffers+e/a4>
00000000 <_EIP>:
Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
   0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
Code;  c0132e8a <sync_page_buffers+12/a4>
   4:   74 7c                     je     82 <_EIP+0x82> c0132f08
<sync_page_buffers+90/a4>
Code;  c0132e8c <sync_page_buffers+14/a4>
   6:   b8 07 00 00 00            mov    $0x7,%eax
Code;  c0132e91 <sync_page_buffers+19/a4>




 <1>Unable to handle kernel NULL pointer dereference at virtual address
00000028
c015e3a2
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c015e3a2>]    Not tainted
EFLAGS: 00010203

eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000

Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
00000000 <_EIP>:
Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
   0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
   3:   f6 42 19 04               testb  $0x4,0x19(%edx)
Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
   7:   74 17                     je     20 <_EIP+0x20> c015e3c2
<journal_try_to_free_buffers+7a/98>

And other similar oopses.

Are you sure there is nothing messing up the hardware ?

How long have you ran memtest86? It can, sometimes, take a long to showup
errors.

The 2.6.x oopses on the same hardware is also a useful source of
information.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-05 23:51 ` Marcelo Tosatti
@ 2004-03-02 11:03   ` Stian Jordet
  2004-03-02 12:31     ` Stian Jordet
  2004-06-14 17:07     ` Steven Dake
  0 siblings, 2 replies; 22+ messages in thread
From: Stian Jordet @ 2004-03-02 11:03 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

fre, 06.02.2004 kl. 00.51 skrev Marcelo Tosatti:
> On Tue, 3 Feb 2004, Stian Jordet wrote:
> 
> > Hello,
> >
> > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > each, without problems. After an upgrade to 2.4.22, the box haven't been
> > up for 30 days in a row. This happened early november. I have caputered
> > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
> > but have never got any reply.
> >
> > I have ran memtest86 on the box, no errors. What else can be the
> > problem? I could of course go back to 2.4.19, which I know worked fine,
> > but I there have been some fixed security holes since then...
> >
> > Any thoughts?
> 
> Stian,
> 
> I have seen your 2.4.x oopses and they seemed odd. The faults were
> happening in different functions (mostly inside VM "freeing" , due to
> what seems to be random crap in memory:
>
>  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
> c0132e86
> *pde = 00000000
> 
> eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
> esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4
> 
> >>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====
> 
> >>edi; c17e38c0 <_end+14b5844/bd23f84>
> >>ebp; c1047a00 <_end+d19984/bd23f84>
> >>esp; c86cbdb4 <_end+839dd38/bd23f84>
> 
> Trace; c0132fdc <try_to_free_buffers+c0/ec>
> 
> Code;  c0132e86 <sync_page_buffers+e/a4>
> 00000000 <_EIP>:
> Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
>    0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
> Code;  c0132e8a <sync_page_buffers+12/a4>
>    4:   74 7c                     je     82 <_EIP+0x82> c0132f08
> <sync_page_buffers+90/a4>
> Code;  c0132e8c <sync_page_buffers+14/a4>
>    6:   b8 07 00 00 00            mov    $0x7,%eax
> Code;  c0132e91 <sync_page_buffers+19/a4>
> 
> 
> 
> 
>  <1>Unable to handle kernel NULL pointer dereference at virtual address
> 00000028
> c015e3a2
> *pde = 00000000
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c015e3a2>]    Not tainted
> EFLAGS: 00010203
> 
> eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000
> 
> Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
> 00000000 <_EIP>:
> Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
>    0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
> Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
>    3:   f6 42 19 04               testb  $0x4,0x19(%edx)
> Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
>    7:   74 17                     je     20 <_EIP+0x20> c015e3c2
> <journal_try_to_free_buffers+7a/98>
> 
> And other similar oopses.
> 
> Are you sure there is nothing messing up the hardware ?
> 
> How long have you ran memtest86? It can, sometimes, take a long to showup
> errors.
> 
> The 2.6.x oopses on the same hardware is also a useful source of
> information.

Marcelo,

sorry for getting back to you so insanely late. This was a production
server, and I have now moved the services it were running to another
box, so I could run a more exhaustive memtest86. It has now ran for two
days, without any errors. Of course there could be other flaky hardware,
but since I don't know any way to test it, and the oops occurs with
two-four weeks interval, it's quite time consuming to find out. I'm not
even sure if it will oops without the typical load it used to have.

Anyway, thank you very much for at least answering me. Much appreciated
:)

Best regards,
Stian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03   ` Stian Jordet
@ 2004-03-02 12:31     ` Stian Jordet
  2004-03-09 19:22       ` Marcelo Tosatti
  2004-06-14 17:07     ` Steven Dake
  1 sibling, 1 reply; 22+ messages in thread
From: Stian Jordet @ 2004-03-02 12:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 86 bytes --]

Btw, here is one of the 2.6.x oopses as well (as you requested).

Best regards,
Stian

[-- Attachment #2: syslog.txt --]
[-- Type: text/plain, Size: 5902 bytes --]

Jan 20 00:20:04 dodge kernel: ------------[ cut here ]------------
Jan 20 00:20:04 dodge kernel: kernel BUG at mm/page_alloc.c:201!
Jan 20 00:20:04 dodge kernel: invalid operand: 0000 [#1]
Jan 20 00:20:04 dodge kernel: CPU:    0
Jan 20 00:20:04 dodge kernel: EIP:    0060:[free_pages_bulk+482/512]    Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010002
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel: eax: 00000001   ebx: c00609c8   ecx: 00000000   edx: 666026a5
Jan 20 00:20:04 dodge kernel: esi: 666026a4   edi: ffffffff   ebp: 33301352   esp: c86d5d90
Jan 20 00:20:04 dodge kernel: ds: 007b   es: 007b   ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c038ad80 c00609c8 00000000 c038ae40 00000001 c00609a0 c038adbc 00000000 
Jan 20 00:20:04 dodge kernel:        c1000000 c038adbc 00000086 ffffffff c038ad80 c00609a0 c038ae5c 00000246 
Jan 20 00:20:04 dodge kernel:        c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c00609a0 00000005 
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel:  [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel:  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel:  [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel:  [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel:  [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
Jan 20 00:20:04 dodge kernel: 
Jan 20 00:20:04 dodge kernel: Code: 0f 0b c9 00 9b d1 34 c0 e9 51 ff ff ff 0f 0b bc 00 9b d1 34 
Jan 20 00:20:04 dodge kernel:  <1>Unable to handle kernel paging request at virtual address 00100104
Jan 20 00:20:04 dodge kernel:  printing eip:
Jan 20 00:20:04 dodge kernel: c0133cbf
Jan 20 00:20:04 dodge kernel: *pde = 00000000
Jan 20 00:20:04 dodge kernel: Oops: 0002 [#2]
Jan 20 00:20:04 dodge kernel: CPU:    0
Jan 20 00:20:04 dodge kernel: EIP:    0060:[free_pages_bulk+143/512]    Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010003
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x8f/0x200
Jan 20 00:20:04 dodge kernel: eax: c00609a8   ebx: c038ae5c   ecx: 00200200   edx: 00100100
Jan 20 00:20:04 dodge kernel: esi: c00609a0   edi: c038ae5c   ebp: 00000203   esp: c86d5b34
Jan 20 00:20:04 dodge kernel: ds: 007b   es: 007b   ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c11a48c0 00268000 00000000 c038af54 00000001 c1044d20 c038aed0 00000000 
Jan 20 00:20:04 dodge kernel:        c1000000 c038adbc 00000082 ffffffff c038ad80 c100a230 c038ae5c 00000203 
Jan 20 00:20:04 dodge kernel:        c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c5c7ac0c 0020d000 
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [zap_pte_range+334/400] zap_pte_range+0x14e/0x190
Jan 20 00:20:04 dodge kernel:  [zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
Jan 20 00:20:04 dodge kernel:  [unmap_page_range+75/128] unmap_page_range+0x4b/0x80
Jan 20 00:20:04 dodge kernel:  [unmap_vmas+254/544] unmap_vmas+0xfe/0x220
Jan 20 00:20:04 dodge kernel:  [exit_mmap+109/384] exit_mmap+0x6d/0x180
Jan 20 00:20:04 dodge kernel:  [mmput+81/160] mmput+0x51/0xa0
Jan 20 00:20:04 dodge kernel:  [do_exit+290/800] do_exit+0x122/0x320
Jan 20 00:20:04 dodge kernel:  [do_invalid_op+0/208] do_invalid_op+0x0/0xd0
Jan 20 00:20:04 dodge kernel:  [die+203/208] die+0xcb/0xd0
Jan 20 00:20:04 dodge kernel:  [do_invalid_op+202/208] do_invalid_op+0xca/0xd0
Jan 20 00:20:04 dodge kernel:  [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel:  [update_wall_time+22/64] update_wall_time+0x16/0x40
Jan 20 00:20:04 dodge kernel:  [do_timer+224/240] do_timer+0xe0/0xf0
Jan 20 00:20:04 dodge kernel:  [timer_interrupt+56/240] timer_interrupt+0x38/0xf0
Jan 20 00:20:04 dodge kernel:  [handle_IRQ_event+73/128] handle_IRQ_event+0x49/0x80
Jan 20 00:20:04 dodge kernel:  [do_IRQ+140/240] do_IRQ+0x8c/0xf0
Jan 20 00:20:04 dodge kernel:  [error_code+45/56] error_code+0x2d/0x38
Jan 20 00:20:04 dodge kernel:  [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel:  [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel:  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel:  [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel:  [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel:  [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel:  [syscall_call+7/11] syscall_call+0x7/0xb

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 12:31     ` Stian Jordet
@ 2004-03-09 19:22       ` Marcelo Tosatti
  2004-03-09 22:28         ` Stian Jordet
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2004-03-09 19:22 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List



On Tue, 2 Mar 2004, Stian Jordet wrote:

> Btw, here is one of the 2.6.x oopses as well (as you requested).

Stian, 

This sounds like bad hardware. Did I already ask you to try memtest86 ? 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-09 19:22       ` Marcelo Tosatti
@ 2004-03-09 22:28         ` Stian Jordet
  0 siblings, 0 replies; 22+ messages in thread
From: Stian Jordet @ 2004-03-09 22:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

Quoting Marcelo Tosatti <marcelo.tosatti@cyclades.com>:

> 
> 
> On Tue, 2 Mar 2004, Stian Jordet wrote:
> 
> > Btw, here is one of the 2.6.x oopses as well (as you requested).
> 
> Stian, 
> 
> This sounds like bad hardware. Did I already ask you to try memtest86 ? 
> 

Yup, and I think I wrote that I had it running for almost two days with no
errors.  Oh well. Thanks for looking into this :) I guess I'll try go afford a
new server (very) soon.

Best regards,
Stian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03   ` Stian Jordet
  2004-03-02 12:31     ` Stian Jordet
@ 2004-06-14 17:07     ` Steven Dake
  2004-06-14 18:26       ` Chris Shoemaker
  2004-06-15 13:16       ` Marcelo Tosatti
  1 sibling, 2 replies; 22+ messages in thread
From: Steven Dake @ 2004-06-14 17:07 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

Marcelo and Stian,

I have also seen this oops relating to low memory situations.  I think
ext3 allocates some data, has a null return, sets something to null, and
then later it is dereferenced in kwapd.

Anyone have a patch for this problem?

Thanks
-steve

On Tue, 2004-03-02 at 04:03, Stian Jordet wrote:
> fre, 06.02.2004 kl. 00.51 skrev Marcelo Tosatti:
> > On Tue, 3 Feb 2004, Stian Jordet wrote:
> > 
> > > Hello,
> > >
> > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > > each, without problems. After an upgrade to 2.4.22, the box haven't been
> > > up for 30 days in a row. This happened early november. I have caputered
> > > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
> > > but have never got any reply.
> > >
> > > I have ran memtest86 on the box, no errors. What else can be the
> > > problem? I could of course go back to 2.4.19, which I know worked fine,
> > > but I there have been some fixed security holes since then...
> > >
> > > Any thoughts?
> > 
> > Stian,
> > 
> > I have seen your 2.4.x oopses and they seemed odd. The faults were
> > happening in different functions (mostly inside VM "freeing" , due to
> > what seems to be random crap in memory:
> >
> >  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
> > c0132e86
> > *pde = 00000000
> > 
> > eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
> > esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4
> > 
> > >>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====
> > 
> > >>edi; c17e38c0 <_end+14b5844/bd23f84>
> > >>ebp; c1047a00 <_end+d19984/bd23f84>
> > >>esp; c86cbdb4 <_end+839dd38/bd23f84>
> > 
> > Trace; c0132fdc <try_to_free_buffers+c0/ec>
> > 
> > Code;  c0132e86 <sync_page_buffers+e/a4>
> > 00000000 <_EIP>:
> > Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
> >    0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
> > Code;  c0132e8a <sync_page_buffers+12/a4>
> >    4:   74 7c                     je     82 <_EIP+0x82> c0132f08
> > <sync_page_buffers+90/a4>
> > Code;  c0132e8c <sync_page_buffers+14/a4>
> >    6:   b8 07 00 00 00            mov    $0x7,%eax
> > Code;  c0132e91 <sync_page_buffers+19/a4>
> > 
> > 
> > 
> > 
> >  <1>Unable to handle kernel NULL pointer dereference at virtual address
> > 00000028
> > c015e3a2
> > *pde = 00000000
> > Oops: 0000
> > CPU:    0
> > EIP:    0010:[<c015e3a2>]    Not tainted
> > EFLAGS: 00010203
> > 
> > eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000
> > 
> > Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
> > 00000000 <_EIP>:
> > Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
> >    0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
> > Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
> >    3:   f6 42 19 04               testb  $0x4,0x19(%edx)
> > Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
> >    7:   74 17                     je     20 <_EIP+0x20> c015e3c2
> > <journal_try_to_free_buffers+7a/98>
> > 
> > And other similar oopses.
> > 
> > Are you sure there is nothing messing up the hardware ?
> > 
> > How long have you ran memtest86? It can, sometimes, take a long to showup
> > errors.
> > 
> > The 2.6.x oopses on the same hardware is also a useful source of
> > information.
> 
> Marcelo,
> 
> sorry for getting back to you so insanely late. This was a production
> server, and I have now moved the services it were running to another
> box, so I could run a more exhaustive memtest86. It has now ran for two
> days, without any errors. Of course there could be other flaky hardware,
> but since I don't know any way to test it, and the oops occurs with
> two-four weeks interval, it's quite time consuming to find out. I'm not
> even sure if it will oops without the typical load it used to have.
> 
> Anyway, thank you very much for at least answering me. Much appreciated
> :)
> 
> Best regards,
> Stian
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07     ` Steven Dake
@ 2004-06-14 18:26       ` Chris Shoemaker
  2004-06-15 13:16       ` Marcelo Tosatti
  1 sibling, 0 replies; 22+ messages in thread
From: Chris Shoemaker @ 2004-06-14 18:26 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Marcelo Tosatti, Linux Kernel Mailing List

Marcelo, Stian, Steve,
	I also have seen this, more than a dozen times -- always related
	to writes to my ext3 partition concurrent with heavy swapping
	due to memory pressure.  Although my oopses are usually not NULL
	dereferences, but simply bad addresses.  In the two or three cases
	where I've actually traced the oops back to the code, (search
	the April archives for my name) it's been a corrupted pointer,
	always in the high part of the word, but not always the same
	part.  For weeks, I thought that it was flakey RAM, but since
	days of memtest86 didn't fail I went back to trying to strip
	more stuff out of the kernel.  Since the last cut, I haven't had
	a single oops, for approx. 3 weeks of uptime.  Previous pattern
	was 1 to 3 days between oopses.
	
	I didn't really learn much that would help you.  However, if you
	are trying to reproduce the failure more rapidly, my experience
	would suggest that it is necessary to run a memory hog,
	concurrent with an io process (e.g kernel compile.)  Either one
	alone may run for days with no failure.

	Let me know if I can help.

	-Chris

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
> 
> I have also seen this oops relating to low memory situations.  I think
> ext3 allocates some data, has a null return, sets something to null, and
> then later it is dereferenced in kwapd.
> 
> Anyone have a patch for this problem?
> 
> Thanks
> -steve
> 
> On Tue, 2004-03-02 at 04:03, Stian Jordet wrote:
> > fre, 06.02.2004 kl. 00.51 skrev Marcelo Tosatti:
> > > On Tue, 3 Feb 2004, Stian Jordet wrote:
> > > 
> > > > Hello,
> > > >
> > > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > > > each, without problems. After an upgrade to 2.4.22, the box haven't been
> > > > up for 30 days in a row. This happened early november. I have caputered
> > > > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
> > > > but have never got any reply.
> > > >
> > > > I have ran memtest86 on the box, no errors. What else can be the
> > > > problem? I could of course go back to 2.4.19, which I know worked fine,
> > > > but I there have been some fixed security holes since then...
> > > >
> > > > Any thoughts?
> > > 
> > > Stian,
> > > 
> > > I have seen your 2.4.x oopses and they seemed odd. The faults were
> > > happening in different functions (mostly inside VM "freeing" , due to
> > > what seems to be random crap in memory:
> > >
> > >  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
> > > c0132e86
> > > *pde = 00000000
> > > 
> > > eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
> > > esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4
> > > 
> > > >>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====
> > > 
> > > >>edi; c17e38c0 <_end+14b5844/bd23f84>
> > > >>ebp; c1047a00 <_end+d19984/bd23f84>
> > > >>esp; c86cbdb4 <_end+839dd38/bd23f84>
> > > 
> > > Trace; c0132fdc <try_to_free_buffers+c0/ec>
> > > 
> > > Code;  c0132e86 <sync_page_buffers+e/a4>
> > > 00000000 <_EIP>:
> > > Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
> > >    0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
> > > Code;  c0132e8a <sync_page_buffers+12/a4>
> > >    4:   74 7c                     je     82 <_EIP+0x82> c0132f08
> > > <sync_page_buffers+90/a4>
> > > Code;  c0132e8c <sync_page_buffers+14/a4>
> > >    6:   b8 07 00 00 00            mov    $0x7,%eax
> > > Code;  c0132e91 <sync_page_buffers+19/a4>
> > > 
> > > 
> > > 
> > > 
> > >  <1>Unable to handle kernel NULL pointer dereference at virtual address
> > > 00000028
> > > c015e3a2
> > > *pde = 00000000
> > > Oops: 0000
> > > CPU:    0
> > > EIP:    0010:[<c015e3a2>]    Not tainted
> > > EFLAGS: 00010203
> > > 
> > > eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000
> > > 
> > > Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
> > > 00000000 <_EIP>:
> > > Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
> > >    0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
> > > Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
> > >    3:   f6 42 19 04               testb  $0x4,0x19(%edx)
> > > Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
> > >    7:   74 17                     je     20 <_EIP+0x20> c015e3c2
> > > <journal_try_to_free_buffers+7a/98>
> > > 
> > > And other similar oopses.
> > > 
> > > Are you sure there is nothing messing up the hardware ?
> > > 
> > > How long have you ran memtest86? It can, sometimes, take a long to showup
> > > errors.
> > > 
> > > The 2.6.x oopses on the same hardware is also a useful source of
> > > information.
> > 
> > Marcelo,
> > 
> > sorry for getting back to you so insanely late. This was a production
> > server, and I have now moved the services it were running to another
> > box, so I could run a more exhaustive memtest86. It has now ran for two
> > days, without any errors. Of course there could be other flaky hardware,
> > but since I don't know any way to test it, and the oops occurs with
> > two-four weeks interval, it's quite time consuming to find out. I'm not
> > even sure if it will oops without the typical load it used to have.
> > 
> > Anyway, thank you very much for at least answering me. Much appreciated
> > :)
> > 
> > Best regards,
> > Stian
> > 
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07     ` Steven Dake
  2004-06-14 18:26       ` Chris Shoemaker
@ 2004-06-15 13:16       ` Marcelo Tosatti
  2004-06-15 14:35         ` Stian Jordet
  2004-06-15 17:56         ` Steven Dake
  1 sibling, 2 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2004-06-15 13:16 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Linux Kernel Mailing List

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
> 
> I have also seen this oops relating to low memory situations.  I think
> ext3 allocates some data, has a null return, sets something to null, and
> then later it is dereferenced in kwapd.
> 
> Anyone have a patch for this problem?

Steven, 

For what I remember Stian oopses were happening in random places in the VM freeing 
routines. That makes me belive what he was seeing was some kind of hardware issue, 
because otherwise the oopses would be happening in the same place (in case it was 
a software bug). The codepaths which he saw trying to access invalid addresses are 
executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.

Assuming his HW is not faulty, I can think of some driver corrupting his memory. 

Do you have any traces of the oopses you are seeing?  

Stian, you told us switched servers now, I assume the problem is gone? 
Are you still running v2.4 on that server?

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16       ` Marcelo Tosatti
@ 2004-06-15 14:35         ` Stian Jordet
  2004-06-15 17:56         ` Steven Dake
  1 sibling, 0 replies; 22+ messages in thread
From: Stian Jordet @ 2004-06-15 14:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Steven Dake, Linux Kernel Mailing List

tir, 15.06.2004 kl. 10.16 -0300, skrev Marcelo Tosatti:
> Stian, you told us switched servers now, I assume the problem is gone? 
> Are you still running v2.4 on that server?

I switched servers, and the oops is gone. The old server is still
running, but without the nightly memory intensive perl-script, and have
never seen an oops since I stopped running that. Both servers are
running 2.6 now.

Best regards,
Stian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16       ` Marcelo Tosatti
  2004-06-15 14:35         ` Stian Jordet
@ 2004-06-15 17:56         ` Steven Dake
  2004-06-17 13:16           ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Dake @ 2004-06-15 17:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Stian Jordet, Linux Kernel Mailing List

On Tue, 2004-06-15 at 06:16, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> > Marcelo and Stian,
> > 
> > I have also seen this oops relating to low memory situations.  I think
> > ext3 allocates some data, has a null return, sets something to null, and
> > then later it is dereferenced in kwapd.
> > 
> > Anyone have a patch for this problem?

> Steven, 
> 
> For what I remember Stian oopses were happening in random places in the VM freeing 
> routines. That makes me belive what he was seeing was some kind of hardware issue, 
> because otherwise the oopses would be happening in the same place (in case it was 
> a software bug). The codepaths which he saw trying to access invalid addresses are 
> executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.
> 
> Assuming his HW is not faulty, I can think of some driver corrupting his memory. 
> 
> Do you have any traces of the oopses you are seeing?  
> 
> Stian, you told us switched servers now, I assume the problem is gone? 
> Are you still running v2.4 on that server?
> 

Marcelo,

Stian responded saying he upgraded to 2.6 and also removed the memory
intensive script and his problems went away.  I suspect removing the
memory intesive script did the trick.  2.6 could also be fixed, who
knows?

After reading lkml, there are about 4-5 Oops in the same function at the
same location.  The people report they were heavily using memory (low
memory situation) in some of the bug reports.

I have tracked down the problem to a null dereference during a buffer
cache rebalance (which occurs during low memory situations).  Here is
the info I have.  Unfortunately I don't know much about how vm handles
low memory situations with the vfs, so if you have any ideas, it would
be helpful. :)

nable to handle kernel NULL pointer dereference at virtual address
00000028
<4> printing eip:
<4>c018aa67
<1>*pde = 00000000
<4>Oops: 0000
<4>CPU:	 2
<4>EIP:	 0010:[<c018aa67>]    Not tainted
<4>EFLAGS: 00010203
<4>eax: 00000000	 ebx: 00000000	 ecx: c0380490	 edx: c217a540
<4>esi: f2757a80	 edi: 00000001	 ebp: 00000000	 esp: f7bd1ee8
<4>ds: 0018   es: 0018   ss: 0018
<4>Process kswapd (pid: 11, stackpage=f7bd1000)
<4>Stack: 00000000 00000000 c217a600 00000000 00000202 00000000 c217a540 000001d0
<4>	c217a540 000102ec c018144f f7abe400 c217a540 000001d0 c0155006 c217a540
<4>	000001d0 f7bd0000 00000000 c0147f0a c217a540 000001d0 f7bd0000 f7bd0000
<4>Call Trace: [<c018144f>] [<c0155006>] [<c0147f0a>]
[<c014832b>] [<c0148398>]
<4>   [<c0148431>] [<c014847f>] [<c0148593>] [<c0105000>]
[<c010578a>] [<c01484f8>]
<4>
<4>Code: 8b 5b 28 f6 40 19 02 75 47 39 f3 75 f1 c6 05 80 24 38 c0 01


>>EIP; c018aa67 <journal_try_to_free_buffers+45/f4>   <=====
Trace; c018144f <ext3_releasepage+2d/32>
Trace; c0155006 <try_to_release_page+4e/78>
Trace; c0147f0a <shrink_cache+270/4f6>
Trace; c014832b <shrink_caches+61/98>
Trace; c0148398 <try_to_free_pages+36/54>
Trace; c0148431 <kswapd_balance_pgdat+51/8a>
Trace; c014847f <kswapd_balance+15/2c>
Trace; c0148593 <kswapd+9b/b6>
Trace; c0105000 <_stext+0/0>
Trace; c010578a <kernel_thread+2e/40>
Trace; c01484f8 <kswapd+0/b6>

The problem is that in this function:

int journal_try_to_free_buffers(journal_t *journal,
		struct page *page, int gfp_mask)
{
    struct buffer_head *bh;
    struct buffer_head *tmp;
    int locked_or_dirty = 0;
    int call_ttfb = 1;

    J_ASSERT(PageLocked(page));

    bh = page->buffers;
    tmp = bh;
    spin_lock(&journal_datalist_lock);
    do {
	struct buffer_head *p = tmp;

	tmp = tmp->b_this_page;
    ...

tmp (page->buffers) above is null.  b_this_page is at offset 0x28 (the accessed address in the oops).  This means that
page->buffers is set to null by some other routine which results in the oops.

I read the page allocate code
(ext3_read_page->block_read_full_page->create_emty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function.  I am having difficulty reproducing and cannot debug further, however.  Can page->buffers be set to zero somewhere else?  Perhaps kswapd and some other thread are racing on the free?

Thansk
-steve
> Thanks!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-15 17:56         ` Steven Dake
@ 2004-06-17 13:16           ` Marcelo Tosatti
  2004-06-18  3:08             ` Andrew Morton
  2004-06-21 15:06             ` Stephen C. Tweedie
  0 siblings, 2 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2004-06-17 13:16 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Linux Kernel Mailing List, sct, akpm


On Tue, Jun 15, 2004 at 10:56:38AM -0700, Steven Dake wrote:
> On Tue, 2004-06-15 at 06:16, Marcelo Tosatti wrote:
> > On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> > > Marcelo and Stian,
> > > 
> > > I have also seen this oops relating to low memory situations.  I think
> > > ext3 allocates some data, has a null return, sets something to null, and
> > > then later it is dereferenced in kwapd.
> > > 
> > > Anyone have a patch for this problem?
> 
> > Steven, 
> > 
> > For what I remember Stian oopses were happening in random places in the VM freeing 
> > routines. That makes me belive what he was seeing was some kind of hardware issue, 
> > because otherwise the oopses would be happening in the same place (in case it was 
> > a software bug). The codepaths which he saw trying to access invalid addresses are 
> > executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.
> > 
> > Assuming his HW is not faulty, I can think of some driver corrupting his memory. 
> > 
> > Do you have any traces of the oopses you are seeing?  
> > 
> > Stian, you told us switched servers now, I assume the problem is gone? 
> > Are you still running v2.4 on that server?
> > 
> 
> Marcelo,
> 
> Stian responded saying he upgraded to 2.6 and also removed the memory
> intensive script and his problems went away.  I suspect removing the
> memory intesive script did the trick.  2.6 could also be fixed, who
> knows?
> 
> After reading lkml, there are about 4-5 Oops in the same function at the
> same location.  The people report they were heavily using memory (low
> memory situation) in some of the bug reports.
> 
> I have tracked down the problem to a null dereference during a buffer
> cache rebalance (which occurs during low memory situations).  Here is
> the info I have.  Unfortunately I don't know much about how vm handles
> low memory situations with the vfs, so if you have any ideas, it would
> be helpful. :)
> 
> nable to handle kernel NULL pointer dereference at virtual address
> 00000028
> <4> printing eip:
> <4>c018aa67
> <1>*pde = 00000000
> <4>Oops: 0000
> <4>CPU:	 2
> <4>EIP:	 0010:[<c018aa67>]    Not tainted
> <4>EFLAGS: 00010203
> <4>eax: 00000000	 ebx: 00000000	 ecx: c0380490	 edx: c217a540
> <4>esi: f2757a80	 edi: 00000001	 ebp: 00000000	 esp: f7bd1ee8
> <4>ds: 0018   es: 0018   ss: 0018
> <4>Process kswapd (pid: 11, stackpage=f7bd1000)
> <4>Stack: 00000000 00000000 c217a600 00000000 00000202 00000000 c217a540 000001d0
> <4>	c217a540 000102ec c018144f f7abe400 c217a540 000001d0 c0155006 c217a540
> <4>	000001d0 f7bd0000 00000000 c0147f0a c217a540 000001d0 f7bd0000 f7bd0000
> <4>Call Trace: [<c018144f>] [<c0155006>] [<c0147f0a>]
> [<c014832b>] [<c0148398>]
> <4>   [<c0148431>] [<c014847f>] [<c0148593>] [<c0105000>]
> [<c010578a>] [<c01484f8>]
> <4>
> <4>Code: 8b 5b 28 f6 40 19 02 75 47 39 f3 75 f1 c6 05 80 24 38 c0 01
> 
> 
> >>EIP; c018aa67 <journal_try_to_free_buffers+45/f4>   <=====
> Trace; c018144f <ext3_releasepage+2d/32>
> Trace; c0155006 <try_to_release_page+4e/78>
> Trace; c0147f0a <shrink_cache+270/4f6>
> Trace; c014832b <shrink_caches+61/98>
> Trace; c0148398 <try_to_free_pages+36/54>
> Trace; c0148431 <kswapd_balance_pgdat+51/8a>
> Trace; c014847f <kswapd_balance+15/2c>
> Trace; c0148593 <kswapd+9b/b6>
> Trace; c0105000 <_stext+0/0>
> Trace; c010578a <kernel_thread+2e/40>
> Trace; c01484f8 <kswapd+0/b6>
> 
> The problem is that in this function:
> 
> int journal_try_to_free_buffers(journal_t *journal,
> 		struct page *page, int gfp_mask)
> {
>     struct buffer_head *bh;
>     struct buffer_head *tmp;
>     int locked_or_dirty = 0;
>     int call_ttfb = 1;
> 
>     J_ASSERT(PageLocked(page));
> 
>     bh = page->buffers;
>     tmp = bh;
>     spin_lock(&journal_datalist_lock);
>     do {
> 	struct buffer_head *p = tmp;
> 
> 	tmp = tmp->b_this_page;
>     ...
> 
> tmp (page->buffers) above is null.  b_this_page is at offset 0x28 (the accessed address in the oops).  This means that
> page->buffers is set to null by some other routine which results in the oops.
> 
> I read the page allocate code
> (ext3_read_page->block_read_full_page->create_emty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function.  I am having difficulty reproducing and cannot debug further, however.  Can page->buffers be set to zero somewhere else?  
>Perhaps kswapd and some other thread are racing on the free?

Steve, 

Hum, I'm starting to believe we might have an issue here.

Searching lkml archives I find other similar oopses at the same place 
(trying to access 00000028, tmp->b_this_page), as you said.

However I wonder what other kernel codepath could remove the page buffers
under us, the page MUST be locked here. In the backtrace above the page 
is locked by shrink_cache(). And with the page locked, we guarantee the VM
freeing routines (shrink_cache) wont try to mess with the page.

Can you reproduce the oopsen?

Stephen, Andrew, do you have any idea how the buffers could have vanished
under us with the page locked? That should not be possible. 

I dont see how this "page->buffers = NULL" could be caused by hardware problem, 
which is usually one or two bit flip.

Thanks everyone

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-17 13:16           ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti
@ 2004-06-18  3:08             ` Andrew Morton
  2004-06-19 19:48               ` Marcelo Tosatti
  2004-06-21 15:06             ` Stephen C. Tweedie
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2004-06-18  3:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: sdake, liste, linux-kernel, sct

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
>  > tmp (page->buffers) above is null.  b_this_page is at offset 0x28 (the accessed address in the oops).  This means that
>  > page->buffers is set to null by some other routine which results in the oops.
>  > 
>  > I read the page allocate code
>  > (ext3_read_page->block_read_full_page->create_emty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function.  I am having difficulty reproducing and cannot debug further, however.  Can page->buffers be set to zero somewhere else?  
>  >Perhaps kswapd and some other thread are racing on the free?
> 
>  Steve, 
> 
>  Hum, I'm starting to believe we might have an issue here.
> 
>  Searching lkml archives I find other similar oopses at the same place 
>  (trying to access 00000028, tmp->b_this_page), as you said.
> 
>  However I wonder what other kernel codepath could remove the page buffers
>  under us, the page MUST be locked here. In the backtrace above the page 
>  is locked by shrink_cache(). And with the page locked, we guarantee the VM
>  freeing routines (shrink_cache) wont try to mess with the page.
> 
>  Can you reproduce the oopsen?
> 
>  Stephen, Andrew, do you have any idea how the buffers could have vanished
>  under us with the page locked? That should not be possible. 
> 
>  I dont see how this "page->buffers = NULL" could be caused by hardware problem, 
>  which is usually one or two bit flip.

It's a bit odd.  The page is definitely locked, and definitely had non-null
->buffers a few tens of instructions beforehand.

Is this an SMP machine?

One possibility is that we died on the second pass around the loop:
page->buffers points at a buffer_head which has a NULL ->b_this_page.  But
I cannot suggest how ->b_this_page could have been zapped.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-18  3:08             ` Andrew Morton
@ 2004-06-19 19:48               ` Marcelo Tosatti
  2004-06-19 19:50                 ` Frank van Maarseveen
                                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2004-06-19 19:48 UTC (permalink / raw)
  To: Andrew Morton, frankvm; +Cc: sdake, liste, linux-kernel, sct

On Thu, Jun 17, 2004 at 08:08:59PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> >  > tmp (page->buffers) above is null.  b_this_page is at offset 0x28 (the accessed address in the oops).  This means that
> >  > page->buffers is set to null by some other routine which results in the oops.
> >  > 
> >  > I read the page allocate code
> >  > (ext3_read_page->block_read_full_page->create_emty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function.  I am having difficulty reproducing and cannot debug further, however.  Can page->buffers be set to zero somewhere else?  
> >  >Perhaps kswapd and some other thread are racing on the free?
> > 
> >  Steve, 
> > 
> >  Hum, I'm starting to believe we might have an issue here.
> > 
> >  Searching lkml archives I find other similar oopses at the same place 
> >  (trying to access 00000028, tmp->b_this_page), as you said.
> > 
 >  However I wonder what other kernel codepath could remove the page buffers
> >  under us, the page MUST be locked here. In the backtrace above the page 
> >  is locked by shrink_cache(). And with the page locked, we guarantee the VM
> >  freeing routines (shrink_cache) wont try to mess with the page.
> > 
> >  Can you reproduce the oopsen?
> > 
> >  Stephen, Andrew, do you have any idea how the buffers could have vanished
> >  under us with the page locked? That should not be possible. 
> > 
> >  I dont see how this "page->buffers = NULL" could be caused by hardware problem, 
> >  which is usually one or two bit flip.
> 
> It's a bit odd.  The page is definitely locked, and definitely had non-null
> ->buffers a few tens of instructions beforehand.
> 
> Is this an SMP machine?

Steven, did you see the NULL ->b_this_page on SMP or UP?

Stian Jordet had an SMP server, but he also was seeing oopses with v2.6:

 kernel BUG at mm/page_alloc.c:201!
 invalid operand: 0000 [#1]
 CPU:    0
 EIP:    0060:[free_pages_bulk+482/512]    Not tainted
 EIP is at free_pages_bulk+0x1e2/0x200
 eax: 00000001   ebx: c00609c8   ecx: 00000000   edx: \
                666026a5
 esi: 666026a4   edi: ffffffff   ebp: 33301352   esp: \
                c86d5d90
 Process mrtg (pid: 26804, threadinfo=c86d4000 \
                task=c9b860c0)
 Call Trace:
  [free_hot_cold_page+217/240] \
                free_hot_cold_page+0xd9/0xf0
  [do_generic_mapping_read+714/1008] \
                do_generic_mapping_read+0x2ca/0x3f0
  [file_read_actor+0/256] file_read_actor+0x0/0x100
  [__generic_file_aio_read+454/512] \
                __generic_file_aio_read+0x1c6/0x200
  [file_read_actor+0/256] file_read_actor+0x0/0x100
  [generic_file_aio_read+91/128] \
                generic_file_aio_read+0x5b/0x80
  [do_sync_read+137/192] do_sync_read+0x89/0xc0
  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
  [do_brk+324/560] do_brk+0x144/0x230
  [vfs_read+184/304] vfs_read+0xb8/0x130
  [sys_read+66/112] sys_read+0x42/0x70
  [syscall_call+7/11] syscall_call+0x7/0xb
 
and different oopses on v2.4, including sync_page_buffers (also NULL+offset access): 

<1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
c0132e86

eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4

>>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====

Trace; c0132fdc <try_to_free_buffers+c0/ec>

Code;  c0132e86 <sync_page_buffers+e/a4>
00000000 <_EIP>:
Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
   0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
Code;  c0132e8a <sync_page_buffers+12/a4>

and the journal_try_to_free_buffers() one:

Unable to handle kernel NULL pointer dereference at virtual address 00000028
c015e3a2
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c015e3a2>]    Not tainted
EFLAGS: 00010203

eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000

Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
00000000 <_EIP>:
Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
   0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>


He upgraded the box and stopped seeing the crashes, running 
recent v2.6. 

However, he also mentioned that his crashes started after upgrading 
from v2.4.19->2.4.22. Should search the diff between them looking for 
anything suspicious.

I can't figure out from the archived reports if this is UP or SMP only. 

Frank van Maarseveen has also seen the journal_try_to_free_buffers() NULL 
b_this_page. Frank, were you running SMP or UP when you reported the oops 
with 2.4.23? 

> One possibility is that we died on the second pass around the loop:
> page->buffers points at a buffer_head which has a NULL ->b_this_page.  But
> I cannot suggest how ->b_this_page could have been zapped.

Oh, yes, indeed. 

Maybe adding this (untested) to v2.4 mainline helps? Comments?

--- transaction.c.orig	2004-06-19 15:21:32.861148560 -0300
+++ transaction.c	2004-06-19 15:23:18.214132472 -0300
@@ -1694,6 +1694,24 @@
 	return 0;
 }
 
+void debug_page(struct page *p)
+{
+	struct buffer_head *bh;
+
+	bh = p->buffers;
+
+	printk(KERN_ERR "%s: page index:%u count:%d flags:%x\n", __FUNCTION__,
+		,p->index , atomic_read(&p->count), p->flags);
+
+	do {
+		printk(KERN_ERR "%s: bh b_next:%p blocknr:%u b_list:%u state:%x\n",
+			__FUNCTION__, bh->b_next, bh->b_blocknr, bh->b_list,
+				bh->b_state);
+		bh = bh->b_this_page;
+	} while (bh);
+}
+
+
 
 /** 
  * int journal_try_to_free_buffers() - try to free page buffers.
@@ -1752,6 +1770,11 @@
 	do {
 		struct buffer_head *p = tmp;
 
+		if (!unlikely(tmp)) {
+			debug_page(page);
+			BUG();
+		}
+			
 		tmp = tmp->b_this_page;
 		if (buffer_jbd(p))
 			if (!__journal_try_to_free_buffer(p, &locked_or_dirty))

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-19 19:48               ` Marcelo Tosatti
@ 2004-06-19 19:50                 ` Frank van Maarseveen
  2004-06-19 22:17                   ` Marcelo Tosatti
  2004-06-19 20:04                 ` Andrew Morton
  2004-06-20  7:56                 ` Willy Tarreau
  2 siblings, 1 reply; 22+ messages in thread
From: Frank van Maarseveen @ 2004-06-19 19:50 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, sdake, liste, linux-kernel, sct

On Sat, Jun 19, 2004 at 04:48:49PM -0300, Marcelo Tosatti wrote:
> 
> However, he also mentioned that his crashes started after upgrading 
> from v2.4.19->2.4.22. Should search the diff between them looking for 
> anything suspicious.
> 
> I can't figure out from the archived reports if this is UP or SMP only. 
> 
> Frank van Maarseveen has also seen the journal_try_to_free_buffers() NULL 
> b_this_page. Frank, were you running SMP or UP when you reported the oops 
> with 2.4.23? 

UP

-- 
Frank

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-19 19:48               ` Marcelo Tosatti
  2004-06-19 19:50                 ` Frank van Maarseveen
@ 2004-06-19 20:04                 ` Andrew Morton
  2004-06-20  7:56                 ` Willy Tarreau
  2 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2004-06-19 20:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: frankvm, sdake, liste, linux-kernel, sct

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> Maybe adding this (untested) to v2.4 mainline helps? Comments?

It would be helpful.

>  --- transaction.c.orig	2004-06-19 15:21:32.861148560 -0300
>  +++ transaction.c	2004-06-19 15:23:18.214132472 -0300
>  @@ -1694,6 +1694,24 @@
>   	return 0;
>   }
>   
>  +void debug_page(struct page *p)
>  +{
>  +	struct buffer_head *bh;
>  +
>  +	bh = p->buffers;
>  +
>  +	printk(KERN_ERR "%s: page index:%u count:%d flags:%x\n", __FUNCTION__,
>  +		,p->index , atomic_read(&p->count), p->flags);

                ^

>  +
>  +	do {
>  +		printk(KERN_ERR "%s: bh b_next:%p blocknr:%u b_list:%u state:%x\n",
>  +			__FUNCTION__, bh->b_next, bh->b_blocknr, bh->b_list,
>  +				bh->b_state);
>  +		bh = bh->b_this_page;
>  +	} while (bh);
>  +}
>  +

you'll want to make this a while (!bh) {} loop, to handle the
page->buffers==NULL case.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-19 19:50                 ` Frank van Maarseveen
@ 2004-06-19 22:17                   ` Marcelo Tosatti
  2004-06-19 22:44                     ` Frank van Maarseveen
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2004-06-19 22:17 UTC (permalink / raw)
  To: Frank van Maarseveen, Andrew Morton, sdake, liste, linux-kernel,
	sct

On Sat, Jun 19, 2004 at 09:50:13PM +0200, Frank van Maarseveen wrote:
> On Sat, Jun 19, 2004 at 04:48:49PM -0300, Marcelo Tosatti wrote:
> > 
> > However, he also mentioned that his crashes started after upgrading 
> > from v2.4.19->2.4.22. Should search the diff between them looking for 
> > anything suspicious.
> > 
> > I can't figure out from the archived reports if this is UP or SMP only. 
> > 
> > Frank van Maarseveen has also seen the journal_try_to_free_buffers() NULL 
> > b_this_page. Frank, were you running SMP or UP when you reported the oops 
> > with 2.4.23? 
> 
> UP

Hi Frank,

Has the oops happened again? What kernel are you running now?

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-19 22:17                   ` Marcelo Tosatti
@ 2004-06-19 22:44                     ` Frank van Maarseveen
  0 siblings, 0 replies; 22+ messages in thread
From: Frank van Maarseveen @ 2004-06-19 22:44 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, sdake, liste, linux-kernel, sct

On Sat, Jun 19, 2004 at 07:17:11PM -0300, Marcelo Tosatti wrote:
> 
> Has the oops happened again? What kernel are you running now?

no new oopses.

The machine has used 2.4.21 for a while, then 2.4.25 and now 2.4.26. The
problem _seems_ to show up here only in 2.4.22, 2.4.23 and 2.4.24.



-- 
Frank

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-19 19:48               ` Marcelo Tosatti
  2004-06-19 19:50                 ` Frank van Maarseveen
  2004-06-19 20:04                 ` Andrew Morton
@ 2004-06-20  7:56                 ` Willy Tarreau
  2 siblings, 0 replies; 22+ messages in thread
From: Willy Tarreau @ 2004-06-20  7:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, frankvm, sdake, liste, linux-kernel, sct

Hi Marcelo,

On Sat, Jun 19, 2004 at 04:48:49PM -0300, Marcelo Tosatti wrote:
 
> +		if (!unlikely(tmp)) {

I think you meant "if (unlikely(!tmp))" here. Howver this does not make
a big difference.

Regards,
willy


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-17 13:16           ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti
  2004-06-18  3:08             ` Andrew Morton
@ 2004-06-21 15:06             ` Stephen C. Tweedie
  2004-06-21 15:53               ` Marcelo Tosatti
  1 sibling, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2004-06-21 15:06 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen Tweedie, Steven Dake, Stian Jordet,
	Linux Kernel Mailing List, sct, Andrew Morton, Stephen Tweedie

Hi,

On Thu, 2004-06-17 at 14:16, Marcelo Tosatti wrote:

> Stephen, Andrew, do you have any idea how the buffers could have vanished
> under us with the page locked? That should not be possible. 

No, especially not on UP as Frank reported.  

> I dont see how this "page->buffers = NULL" could be caused by hardware problem, 
> which is usually one or two bit flip.

We don't know for sure that it's page->buffers.  If we have gone round
the bh->b_this_page loop already, we could have ended up following the
pointers either to an invalid bh, or to one that's not on the current
page.  So it could also be the previous buffer's b_this_page that got
clobbered, rather than page->buffers.

That's possible in this case, but it's still a bit surprising that we'd
*always* get a NULL pointer rather than some other random pointer as a
result. 

The buffer-ring debug patch that you posted looks like the obvious way
to dig further into this.  If that doesn't get anyway, we can also trap
the case where following bh->b_this_page gives us a buffer whose b_page
is on a different page.

--Stephen



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-21 15:06             ` Stephen C. Tweedie
@ 2004-06-21 15:53               ` Marcelo Tosatti
  2004-06-22 22:13                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2004-06-21 15:53 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Steven Dake, Stian Jordet, Linux Kernel Mailing List, sct,
	Andrew Morton

Hi Stephen,

On Mon, Jun 21, 2004 at 04:06:50PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, 2004-06-17 at 14:16, Marcelo Tosatti wrote:
> 
> > Stephen, Andrew, do you have any idea how the buffers could have vanished
> > under us with the page locked? That should not be possible. 
> 
> No, especially not on UP as Frank reported.  
> 
> > I dont see how this "page->buffers = NULL" could be caused by hardware problem, 
> > which is usually one or two bit flip.
> 
> We don't know for sure that it's page->buffers.  If we have gone round
> the bh->b_this_page loop already, we could have ended up following the
> pointers either to an invalid bh, or to one that's not on the current
> page.  So it could also be the previous buffer's b_this_page that got
> clobbered, rather than page->buffers.
> 
> That's possible in this case, but it's still a bit surprising that we'd
> *always* get a NULL pointer rather than some other random pointer as a
> result. 

I dont remember seeing any case which was not a NULL pointer dereference.

> The buffer-ring debug patch that you posted looks like the obvious way
> to dig further into this.  If that doesn't get anyway, we can also trap
> the case where following bh->b_this_page gives us a buffer whose b_page
> is on a different page.

Fine.  Just printing out bh->b_page at debug_page() will allow us to verify that, yes?

--- transaction.c.orig  2004-06-21 12:50:01.090082264 -0300
+++ transaction.c       2004-06-21 12:50:45.574319632 -0300
@@ -1704,9 +1704,9 @@ void debug_page(struct page *p)
                 p->index, atomic_read(&p->count), p->flags);
  
        while (bh) {
-               printk(KERN_ERR "%s: bh b_next:%p blocknr:%lu b_list:%u state:%lx\n",
+               printk(KERN_ERR "%s: bh b_next:%p blocknr:%lu b_list:%u state:%lx b_page:%p\n",
                        __FUNCTION__, bh->b_next, bh->b_blocknr, bh->b_list,
-                               bh->b_state);
+                               bh->b_state, bh->b_page);
                bh = bh->b_this_page;
        }
 }




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers()
  2004-06-21 15:53               ` Marcelo Tosatti
@ 2004-06-22 22:13                 ` Stephen C. Tweedie
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen C. Tweedie @ 2004-06-22 22:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Steven Dake, Stian Jordet, Linux Kernel Mailing List, sct,
	Andrew Morton, Stephen Tweedie

Hi,

On Mon, 2004-06-21 at 16:53, Marcelo Tosatti wrote:

> > The buffer-ring debug patch that you posted looks like the obvious way
> > to dig further into this.  If that doesn't get anyway, we can also trap
> > the case where following bh->b_this_page gives us a buffer whose b_page
> > is on a different page.
> 
> Fine.  Just printing out bh->b_page at debug_page() will allow us to verify that, yes?

For most cases, yes.  There are basically three corruption cases ---
b_this_page leads us to an oops, an infinite loop, or a loop including a
bogus page.  Trapping the b_this_page ring walks to trap on any bad
b_page would help in the latter two cases, but if we're always getting
the first case, just extending the existing debug patch would be fine.

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2004-06-22 22:17 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-02-03 18:26 Oopses with both recent 2.4.x kernels and 2.6.x kernels Stian Jordet
2004-02-05 23:51 ` Marcelo Tosatti
2004-03-02 11:03   ` Stian Jordet
2004-03-02 12:31     ` Stian Jordet
2004-03-09 19:22       ` Marcelo Tosatti
2004-03-09 22:28         ` Stian Jordet
2004-06-14 17:07     ` Steven Dake
2004-06-14 18:26       ` Chris Shoemaker
2004-06-15 13:16       ` Marcelo Tosatti
2004-06-15 14:35         ` Stian Jordet
2004-06-15 17:56         ` Steven Dake
2004-06-17 13:16           ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti
2004-06-18  3:08             ` Andrew Morton
2004-06-19 19:48               ` Marcelo Tosatti
2004-06-19 19:50                 ` Frank van Maarseveen
2004-06-19 22:17                   ` Marcelo Tosatti
2004-06-19 22:44                     ` Frank van Maarseveen
2004-06-19 20:04                 ` Andrew Morton
2004-06-20  7:56                 ` Willy Tarreau
2004-06-21 15:06             ` Stephen C. Tweedie
2004-06-21 15:53               ` Marcelo Tosatti
2004-06-22 22:13                 ` Stephen C. Tweedie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox