* Oopses with both recent 2.4.x kernels and 2.6.x kernels
@ 2004-02-03 18:26 Stian Jordet
  2004-02-05 23:51 ` Marcelo Tosatti
  0 siblings, 1 reply; 22+ messages in thread
From: Stian Jordet @ 2004-02-03 18:26 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,

I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
each, without problems. After an upgrade to 2.4.22, the box hasn't been
up for 30 days in a row. This happened in early November. I have captured
oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
but have never got any reply.

I have run memtest86 on the box, no errors. What else can be the
problem? I could of course go back to 2.4.19, which I know worked fine,
but there have been some security holes fixed since then...

Any thoughts?

Best regards,
Stian

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-03 18:26 Oopses with both recent 2.4.x kernels and 2.6.x kernels Stian Jordet
@ 2004-02-05 23:51 ` Marcelo Tosatti
  2004-03-02 11:03   ` Stian Jordet
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2004-02-05 23:51 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Linux Kernel Mailing List

On Tue, 3 Feb 2004, Stian Jordet wrote:

> Hello,
>
> I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> each, without problems. After an upgrade to 2.4.22, the box hasn't been
> up for 30 days in a row. This happened in early November. I have captured
> oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
> but have never got any reply.
>
> I have run memtest86 on the box, no errors. What else can be the
> problem? I could of course go back to 2.4.19, which I know worked fine,
> but there have been some security holes fixed since then...
>
> Any thoughts?

Stian,

I have seen your 2.4.x oopses and they seemed odd. The faults were
happening in different functions (mostly inside VM "freeing"), due to
what seems to be random crap in memory:

<1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
c0132e86
*pde = 00000000

eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012
esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4

>>EIP; c0132e86 <sync_page_buffers+e/a4> <=====

>>edi; c17e38c0 <_end+14b5844/bd23f84>
>>ebp; c1047a00 <_end+d19984/bd23f84>
>>esp; c86cbdb4 <_end+839dd38/bd23f84>

Trace; c0132fdc <try_to_free_buffers+c0/ec>

Code; c0132e86 <sync_page_buffers+e/a4>
00000000 <_EIP>:
Code; c0132e86 <sync_page_buffers+e/a4> <=====
   0: f6 43 18 06 testb $0x6,0x18(%ebx) <=====
Code; c0132e8a <sync_page_buffers+12/a4>
   4: 74 7c je 82 <_EIP+0x82> c0132f08 <sync_page_buffers+90/a4>
Code; c0132e8c <sync_page_buffers+14/a4>
   6: b8 07 00 00 00 mov $0x7,%eax
Code; c0132e91 <sync_page_buffers+19/a4>


<1>Unable to handle kernel NULL pointer dereference at virtual address 00000028
c015e3a2
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c015e3a2>] Not tainted
EFLAGS: 00010203

eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000

Code; c015e3a2 <journal_try_to_free_buffers+5a/98>
00000000 <_EIP>:
Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <=====
   0: 8b 5b 28 mov 0x28(%ebx),%ebx <=====
Code; c015e3a5 <journal_try_to_free_buffers+5d/98>
   3: f6 42 19 04 testb $0x4,0x19(%edx)
Code; c015e3a9 <journal_try_to_free_buffers+61/98>
   7: 74 17 je 20 <_EIP+0x20> c015e3c2 <journal_try_to_free_buffers+7a/98>

And other similar oopses.

Are you sure there is nothing messing up the hardware?

How long have you run memtest86? It can sometimes take a long time to show
errors.

The 2.6.x oopses on the same hardware are also a useful source of
information.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels 2004-02-05 23:51 ` Marcelo Tosatti @ 2004-03-02 11:03 ` Stian Jordet 2004-03-02 12:31 ` Stian Jordet 2004-06-14 17:07 ` Steven Dake 0 siblings, 2 replies; 22+ messages in thread From: Stian Jordet @ 2004-03-02 11:03 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List fre, 06.02.2004 kl. 00.51 skrev Marcelo Tosatti: > On Tue, 3 Feb 2004, Stian Jordet wrote: > > > Hello, > > > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days > > each, without problems. After an upgrade to 2.4.22, the box haven't been > > up for 30 days in a row. This happened early november. I have caputered > > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list, > > but have never got any reply. > > > > I have ran memtest86 on the box, no errors. What else can be the > > problem? I could of course go back to 2.4.19, which I know worked fine, > > but I there have been some fixed security holes since then... > > > > Any thoughts? > > Stian, > > I have seen your 2.4.x oopses and they seemed odd. 
The faults were > happening in different functions (mostly inside VM "freeing" , due to > what seems to be random crap in memory: > > <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021 > c0132e86 > *pde = 00000000 > > eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012 > esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4 > > >>EIP; c0132e86 <sync_page_buffers+e/a4> <===== > > >>edi; c17e38c0 <_end+14b5844/bd23f84> > >>ebp; c1047a00 <_end+d19984/bd23f84> > >>esp; c86cbdb4 <_end+839dd38/bd23f84> > > Trace; c0132fdc <try_to_free_buffers+c0/ec> > > Code; c0132e86 <sync_page_buffers+e/a4> > 00000000 <_EIP>: > Code; c0132e86 <sync_page_buffers+e/a4> <===== > 0: f6 43 18 06 testb $0x6,0x18(%ebx) <===== > Code; c0132e8a <sync_page_buffers+12/a4> > 4: 74 7c je 82 <_EIP+0x82> c0132f08 > <sync_page_buffers+90/a4> > Code; c0132e8c <sync_page_buffers+14/a4> > 6: b8 07 00 00 00 mov $0x7,%eax > Code; c0132e91 <sync_page_buffers+19/a4> > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address > 00000028 > c015e3a2 > *pde = 00000000 > Oops: 0000 > CPU: 0 > EIP: 0010:[<c015e3a2>] Not tainted > EFLAGS: 00010203 > > eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000 > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> > 00000000 <_EIP>: > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <===== > 0: 8b 5b 28 mov 0x28(%ebx),%ebx <===== > Code; c015e3a5 <journal_try_to_free_buffers+5d/98> > 3: f6 42 19 04 testb $0x4,0x19(%edx) > Code; c015e3a9 <journal_try_to_free_buffers+61/98> > 7: 74 17 je 20 <_EIP+0x20> c015e3c2 > <journal_try_to_free_buffers+7a/98> > > And other similar oopses. > > Are you sure there is nothing messing up the hardware ? > > How long have you ran memtest86? It can, sometimes, take a long to showup > errors. > > The 2.6.x oopses on the same hardware is also a useful source of > information. Marcelo, sorry for getting back to you so insanely late. 
This was a production server, and I have now moved the services it was running to another box, so I could run a more exhaustive memtest86. It has now run for two days, without any errors. Of course there could be other flaky hardware, but since I don't know any way to test it, and the oops occurs at intervals of two to four weeks, it's quite time-consuming to find out. I'm not even sure if it will oops without the typical load it used to have.

Anyway, thank you very much for at least answering me. Much appreciated :)

Best regards,
Stian

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03   ` Stian Jordet
@ 2004-03-02 12:31     ` Stian Jordet
  2004-03-09 19:22       ` Marcelo Tosatti
  2004-06-14 17:07     ` Steven Dake
  1 sibling, 1 reply; 22+ messages in thread
From: Stian Jordet @ 2004-03-02 12:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 86 bytes --]

Btw, here is one of the 2.6.x oopses as well (as you requested).

Best regards,
Stian

[-- Attachment #2: syslog.txt --]
[-- Type: text/plain, Size: 5902 bytes --]

Jan 20 00:20:04 dodge kernel: ------------[ cut here ]------------
Jan 20 00:20:04 dodge kernel: kernel BUG at mm/page_alloc.c:201!
Jan 20 00:20:04 dodge kernel: invalid operand: 0000 [#1]
Jan 20 00:20:04 dodge kernel: CPU: 0
Jan 20 00:20:04 dodge kernel: EIP: 0060:[free_pages_bulk+482/512] Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010002
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel: eax: 00000001 ebx: c00609c8 ecx: 00000000 edx: 666026a5
Jan 20 00:20:04 dodge kernel: esi: 666026a4 edi: ffffffff ebp: 33301352 esp: c86d5d90
Jan 20 00:20:04 dodge kernel: ds: 007b es: 007b ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c038ad80 c00609c8 00000000 c038ae40 00000001 c00609a0 c038adbc 00000000
Jan 20 00:20:04 dodge kernel: c1000000 c038adbc 00000086 ffffffff c038ad80 c00609a0 c038ae5c 00000246
Jan 20 00:20:04 dodge kernel: c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c00609a0 00000005
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel: [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel: [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel: [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel: [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel: [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel: [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel: [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel: [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel: [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel: [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel: [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Jan 20 00:20:04 dodge kernel:
Jan 20 00:20:04 dodge kernel: Code: 0f 0b c9 00 9b d1 34 c0 e9 51 ff ff ff 0f 0b bc 00 9b d1 34
Jan 20 00:20:04 dodge kernel: <1>Unable to handle kernel paging request at virtual address 00100104
Jan 20 00:20:04 dodge kernel: printing eip:
Jan 20 00:20:04 dodge kernel: c0133cbf
Jan 20 00:20:04 dodge kernel: *pde = 00000000
Jan 20 00:20:04 dodge kernel: Oops: 0002 [#2]
Jan 20 00:20:04 dodge kernel: CPU: 0
Jan 20 00:20:04 dodge kernel: EIP: 0060:[free_pages_bulk+143/512] Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010003
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x8f/0x200
Jan 20 00:20:04 dodge kernel: eax: c00609a8 ebx: c038ae5c ecx: 00200200 edx: 00100100
Jan 20 00:20:04 dodge kernel: esi: c00609a0 edi: c038ae5c ebp: 00000203 esp: c86d5b34
Jan 20 00:20:04 dodge kernel: ds: 007b es: 007b ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c11a48c0 00268000 00000000 c038af54 00000001 c1044d20 c038aed0 00000000
Jan 20 00:20:04 dodge kernel: c1000000 c038adbc 00000082 ffffffff c038ad80 c100a230 c038ae5c 00000203
Jan 20 00:20:04 dodge kernel: c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c5c7ac0c 0020d000
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel: [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel: [zap_pte_range+334/400] zap_pte_range+0x14e/0x190
Jan 20 00:20:04 dodge kernel: [zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
Jan 20 00:20:04 dodge kernel: [unmap_page_range+75/128] unmap_page_range+0x4b/0x80
Jan 20 00:20:04 dodge kernel: [unmap_vmas+254/544] unmap_vmas+0xfe/0x220
Jan 20 00:20:04 dodge kernel: [exit_mmap+109/384] exit_mmap+0x6d/0x180
Jan 20 00:20:04 dodge kernel: [mmput+81/160] mmput+0x51/0xa0
Jan 20 00:20:04 dodge kernel: [do_exit+290/800] do_exit+0x122/0x320
Jan 20 00:20:04 dodge kernel: [do_invalid_op+0/208] do_invalid_op+0x0/0xd0
Jan 20 00:20:04 dodge kernel: [die+203/208] die+0xcb/0xd0
Jan 20 00:20:04 dodge kernel: [do_invalid_op+202/208] do_invalid_op+0xca/0xd0
Jan 20 00:20:04 dodge kernel: [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel: [update_wall_time+22/64] update_wall_time+0x16/0x40
Jan 20 00:20:04 dodge kernel: [do_timer+224/240] do_timer+0xe0/0xf0
Jan 20 00:20:04 dodge kernel: [timer_interrupt+56/240] timer_interrupt+0x38/0xf0
Jan 20 00:20:04 dodge kernel: [handle_IRQ_event+73/128] handle_IRQ_event+0x49/0x80
Jan 20 00:20:04 dodge kernel: [do_IRQ+140/240] do_IRQ+0x8c/0xf0
Jan 20 00:20:04 dodge kernel: [error_code+45/56] error_code+0x2d/0x38
Jan 20 00:20:04 dodge kernel: [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel: [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel: [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel: [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel: [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel: [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel: [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel: [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel: [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel: [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel: [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel: [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel: [syscall_call+7/11] syscall_call+0x7/0xb

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 12:31     ` Stian Jordet
@ 2004-03-09 19:22       ` Marcelo Tosatti
  2004-03-09 22:28         ` Stian Jordet
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2004-03-09 19:22 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

On Tue, 2 Mar 2004, Stian Jordet wrote:

> Btw, here is one of the 2.6.x oopses as well (as you requested).

Stian,

This sounds like bad hardware. Did I already ask you to try memtest86?

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-09 19:22       ` Marcelo Tosatti
@ 2004-03-09 22:28         ` Stian Jordet
  0 siblings, 0 replies; 22+ messages in thread
From: Stian Jordet @ 2004-03-09 22:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

Quoting Marcelo Tosatti <marcelo.tosatti@cyclades.com>:

> On Tue, 2 Mar 2004, Stian Jordet wrote:
>
> > Btw, here is one of the 2.6.x oopses as well (as you requested).
>
> Stian,
>
> This sounds like bad hardware. Did I already ask you to try memtest86?

Yup, and I think I wrote that I had it running for almost two days with no
errors. Oh well. Thanks for looking into this :) I guess I'll try to afford a
new server (very) soon.

Best regards,
Stian

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03   ` Stian Jordet
  2004-03-02 12:31     ` Stian Jordet
@ 2004-06-14 17:07     ` Steven Dake
  2004-06-14 18:26       ` Chris Shoemaker
  2004-06-15 13:16       ` Marcelo Tosatti
  1 sibling, 2 replies; 22+ messages in thread
From: Steven Dake @ 2004-06-14 17:07 UTC (permalink / raw)
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

Marcelo and Stian,

I have also seen this oops relating to low memory situations. I think
ext3 allocates some data, has a null return, sets something to null, and
then later it is dereferenced in kswapd.

Anyone have a patch for this problem?

Thanks
-steve

On Tue, 2004-03-02 at 04:03, Stian Jordet wrote:
> Fri, 06.02.2004 at 00.51, Marcelo Tosatti wrote:
> > On Tue, 3 Feb 2004, Stian Jordet wrote:
> >
> > > Hello,
> > >
> > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > > each, without problems. After an upgrade to 2.4.22, the box hasn't been
> > > up for 30 days in a row. This happened in early November. I have captured
> > > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list,
> > > but have never got any reply.
> > >
> > > I have run memtest86 on the box, no errors. What else can be the
> > > problem? I could of course go back to 2.4.19, which I know worked fine,
> > > but there have been some security holes fixed since then...
> > >
> > > Any thoughts?
> >
> > Stian,
> >
> > I have seen your 2.4.x oopses and they seemed odd.
The faults were > > happening in different functions (mostly inside VM "freeing" , due to > > what seems to be random crap in memory: > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021 > > c0132e86 > > *pde = 00000000 > > > > eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012 > > esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4 > > > > >>EIP; c0132e86 <sync_page_buffers+e/a4> <===== > > > > >>edi; c17e38c0 <_end+14b5844/bd23f84> > > >>ebp; c1047a00 <_end+d19984/bd23f84> > > >>esp; c86cbdb4 <_end+839dd38/bd23f84> > > > > Trace; c0132fdc <try_to_free_buffers+c0/ec> > > > > Code; c0132e86 <sync_page_buffers+e/a4> > > 00000000 <_EIP>: > > Code; c0132e86 <sync_page_buffers+e/a4> <===== > > 0: f6 43 18 06 testb $0x6,0x18(%ebx) <===== > > Code; c0132e8a <sync_page_buffers+12/a4> > > 4: 74 7c je 82 <_EIP+0x82> c0132f08 > > <sync_page_buffers+90/a4> > > Code; c0132e8c <sync_page_buffers+14/a4> > > 6: b8 07 00 00 00 mov $0x7,%eax > > Code; c0132e91 <sync_page_buffers+19/a4> > > > > > > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address > > 00000028 > > c015e3a2 > > *pde = 00000000 > > Oops: 0000 > > CPU: 0 > > EIP: 0010:[<c015e3a2>] Not tainted > > EFLAGS: 00010203 > > > > eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000 > > > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> > > 00000000 <_EIP>: > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <===== > > 0: 8b 5b 28 mov 0x28(%ebx),%ebx <===== > > Code; c015e3a5 <journal_try_to_free_buffers+5d/98> > > 3: f6 42 19 04 testb $0x4,0x19(%edx) > > Code; c015e3a9 <journal_try_to_free_buffers+61/98> > > 7: 74 17 je 20 <_EIP+0x20> c015e3c2 > > <journal_try_to_free_buffers+7a/98> > > > > And other similar oopses. > > > > Are you sure there is nothing messing up the hardware ? > > > > How long have you ran memtest86? It can, sometimes, take a long to showup > > errors. 
> > > > The 2.6.x oopses on the same hardware is also a useful source of > > information. > > Marcelo, > > sorry for getting back to you so insanely late. This was a production > server, and I have now moved the services it were running to another > box, so I could run a more exhaustive memtest86. It has now ran for two > days, without any errors. Of course there could be other flaky hardware, > but since I don't know any way to test it, and the oops occurs with > two-four weeks interval, it's quite time consuming to find out. I'm not > even sure if it will oops without the typical load it used to have. > > Anyway, thank you very much for at least answering me. Much appreciated > :) > > Best regards, > Stian > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07     ` Steven Dake
@ 2004-06-14 18:26       ` Chris Shoemaker
  2004-06-15 13:16       ` Marcelo Tosatti
  1 sibling, 0 replies; 22+ messages in thread
From: Chris Shoemaker @ 2004-06-14 18:26 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Marcelo Tosatti, Linux Kernel Mailing List

Marcelo, Stian, Steve,

I also have seen this, more than a dozen times -- always related to writes
to my ext3 partition concurrent with heavy swapping due to memory pressure.
My oopses are usually not NULL dereferences, though, but simply bad
addresses. In the two or three cases where I've actually traced the oops
back to the code (search the April archives for my name), it's been a
corrupted pointer, always in the high part of the word, but not always the
same part.

For weeks, I thought that it was flaky RAM, but since days of memtest86
didn't fail I went back to trying to strip more stuff out of the kernel.
Since the last cut, I haven't had a single oops, for approx. 3 weeks of
uptime. The previous pattern was 1 to 3 days between oopses.

I didn't really learn much that would help you. However, if you are trying
to reproduce the failure more rapidly, my experience would suggest that it
is necessary to run a memory hog concurrently with an I/O process (e.g. a
kernel compile). Either one alone may run for days with no failure.

Let me know if I can help.

-Chris

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
>
> I have also seen this oops relating to low memory situations. I think
> ext3 allocates some data, has a null return, sets something to null, and
> then later it is dereferenced in kswapd.
>
> Anyone have a patch for this problem?
>
> Thanks
> -steve
>
> On Tue, 2004-03-02 at 04:03, Stian Jordet wrote:
> > fre, 06.02.2004 kl.
00.51 skrev Marcelo Tosatti: > > > On Tue, 3 Feb 2004, Stian Jordet wrote: > > > > > > > Hello, > > > > > > > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days > > > > each, without problems. After an upgrade to 2.4.22, the box haven't been > > > > up for 30 days in a row. This happened early november. I have caputered > > > > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list, > > > > but have never got any reply. > > > > > > > > I have ran memtest86 on the box, no errors. What else can be the > > > > problem? I could of course go back to 2.4.19, which I know worked fine, > > > > but I there have been some fixed security holes since then... > > > > > > > > Any thoughts? > > > > > > Stian, > > > > > > I have seen your 2.4.x oopses and they seemed odd. The faults were > > > happening in different functions (mostly inside VM "freeing" , due to > > > what seems to be random crap in memory: > > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021 > > > c0132e86 > > > *pde = 00000000 > > > > > > eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012 > > > esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4 > > > > > > >>EIP; c0132e86 <sync_page_buffers+e/a4> <===== > > > > > > >>edi; c17e38c0 <_end+14b5844/bd23f84> > > > >>ebp; c1047a00 <_end+d19984/bd23f84> > > > >>esp; c86cbdb4 <_end+839dd38/bd23f84> > > > > > > Trace; c0132fdc <try_to_free_buffers+c0/ec> > > > > > > Code; c0132e86 <sync_page_buffers+e/a4> > > > 00000000 <_EIP>: > > > Code; c0132e86 <sync_page_buffers+e/a4> <===== > > > 0: f6 43 18 06 testb $0x6,0x18(%ebx) <===== > > > Code; c0132e8a <sync_page_buffers+12/a4> > > > 4: 74 7c je 82 <_EIP+0x82> c0132f08 > > > <sync_page_buffers+90/a4> > > > Code; c0132e8c <sync_page_buffers+14/a4> > > > 6: b8 07 00 00 00 mov $0x7,%eax > > > Code; c0132e91 <sync_page_buffers+19/a4> > > > > > > > > > > > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address > > > 
00000028 > > > c015e3a2 > > > *pde = 00000000 > > > Oops: 0000 > > > CPU: 0 > > > EIP: 0010:[<c015e3a2>] Not tainted > > > EFLAGS: 00010203 > > > > > > eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000 > > > > > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> > > > 00000000 <_EIP>: > > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <===== > > > 0: 8b 5b 28 mov 0x28(%ebx),%ebx <===== > > > Code; c015e3a5 <journal_try_to_free_buffers+5d/98> > > > 3: f6 42 19 04 testb $0x4,0x19(%edx) > > > Code; c015e3a9 <journal_try_to_free_buffers+61/98> > > > 7: 74 17 je 20 <_EIP+0x20> c015e3c2 > > > <journal_try_to_free_buffers+7a/98> > > > > > > And other similar oopses. > > > > > > Are you sure there is nothing messing up the hardware ? > > > > > > How long have you ran memtest86? It can, sometimes, take a long to showup > > > errors. > > > > > > The 2.6.x oopses on the same hardware is also a useful source of > > > information. > > > > Marcelo, > > > > sorry for getting back to you so insanely late. This was a production > > server, and I have now moved the services it were running to another > > box, so I could run a more exhaustive memtest86. It has now ran for two > > days, without any errors. Of course there could be other flaky hardware, > > but since I don't know any way to test it, and the oops occurs with > > two-four weeks interval, it's quite time consuming to find out. I'm not > > even sure if it will oops without the typical load it used to have. > > > > Anyway, thank you very much for at least answering me. 
Much appreciated
> > :)
> >
> > Best regards,
> > Stian

^ permalink raw reply [flat|nested] 22+ messages in thread
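Chris's reproduction recipe above pairs a memory hog with an I/O load. A minimal sketch of the hog half, in C, might look like the following. The names `hog_grab` and `hog_release`, and the choice to write one byte per page-sized stride, are assumptions made for illustration; the thread does not say which tool Chris actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical memory hog state; not taken from the thread. */
struct hog {
    unsigned char *buf;    /* the held allocation, or NULL */
    size_t pages_touched;  /* number of page-sized strides written */
};

/*
 * Allocate `bytes` and write one byte every `page_size` bytes, so the
 * kernel has to back the range with real memory instead of leaving it
 * untouched. The memory stays held until hog_release().
 * Returns 0 on success, -1 on bad arguments or allocation failure.
 */
int hog_grab(struct hog *h, size_t bytes, size_t page_size)
{
    assert(h != NULL);
    h->buf = NULL;
    h->pages_touched = 0;
    if (bytes == 0 || page_size == 0)
        return -1;

    h->buf = malloc(bytes);
    if (h->buf == NULL)
        return -1;

    for (size_t off = 0; off < bytes; off += page_size) {
        h->buf[off] = 1;
        h->pages_touched++;
    }
    return 0;
}

/* Give the memory back and reset the bookkeeping. */
void hog_release(struct hog *h)
{
    assert(h != NULL);
    free(h->buf);
    h->buf = NULL;
    h->pages_touched = 0;
}
```

To follow the recipe, one would presumably call `hog_grab` with a size close to the machine's RAM and hold it while the I/O job runs; the right size for any given box is a guess, not something the thread specifies.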
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07     ` Steven Dake
  2004-06-14 18:26       ` Chris Shoemaker
@ 2004-06-15 13:16       ` Marcelo Tosatti
  2004-06-15 14:35         ` Stian Jordet
  2004-06-15 17:56         ` Steven Dake
  1 sibling, 2 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2004-06-15 13:16 UTC (permalink / raw)
  To: Steven Dake; +Cc: Stian Jordet, Linux Kernel Mailing List

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
>
> I have also seen this oops relating to low memory situations. I think
> ext3 allocates some data, has a null return, sets something to null, and
> then later it is dereferenced in kswapd.
>
> Anyone have a patch for this problem?

Steven,

From what I remember, Stian's oopses were happening in random places in the VM freeing
routines. That makes me believe what he was seeing was some kind of hardware issue,
because otherwise the oopses would be happening in the same place (in case it was
a software bug). The codepaths which he saw trying to access invalid addresses are
executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.

Assuming his HW is not faulty, I can think of some driver corrupting his memory.

Do you have any traces of the oopses you are seeing?

Stian, you told us you switched servers; I assume the problem is gone?
Are you still running v2.4 on that server?

Thanks!

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16       ` Marcelo Tosatti
@ 2004-06-15 14:35         ` Stian Jordet
  2004-06-15 17:56         ` Steven Dake
  1 sibling, 0 replies; 22+ messages in thread
From: Stian Jordet @ 2004-06-15 14:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Steven Dake, Linux Kernel Mailing List

Tue, 15.06.2004 at 10.16 -0300, Marcelo Tosatti wrote:
> Stian, you told us switched servers now, I assume the problem is gone?
> Are you still running v2.4 on that server?

I switched servers, and the oops is gone. The old server is still
running, but without the nightly memory-intensive Perl script, and I have
never seen an oops on it since I stopped running that. Both servers are
running 2.6 now.

Best regards,
Stian

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16       ` Marcelo Tosatti
  2004-06-15 14:35         ` Stian Jordet
@ 2004-06-15 17:56         ` Steven Dake
  2004-06-17 13:16           ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Dake @ 2004-06-15 17:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Stian Jordet, Linux Kernel Mailing List

On Tue, 2004-06-15 at 06:16, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> > Marcelo and Stian,
> >
> > I have also seen this oops relating to low memory situations. I think
> > ext3 allocates some data, has a null return, sets something to null, and
> > then later it is dereferenced in kswapd.
> >
> > Anyone have a patch for this problem?
>
> Steven,
>
> From what I remember, Stian's oopses were happening in random places in the VM freeing
> routines. That makes me believe what he was seeing was some kind of hardware issue,
> because otherwise the oopses would be happening in the same place (in case it was
> a software bug). The codepaths which he saw trying to access invalid addresses are
> executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6.
>
> Assuming his HW is not faulty, I can think of some driver corrupting his memory.
>
> Do you have any traces of the oopses you are seeing?
>
> Stian, you told us you switched servers; I assume the problem is gone?
> Are you still running v2.4 on that server?
>

Marcelo,

Stian responded saying he upgraded to 2.6 and also removed the
memory-intensive script, and his problems went away. I suspect removing the
memory-intensive script did the trick. 2.6 could also be fixed, who
knows?

After reading lkml, I found about 4-5 oopses in the same function at the
same location. In some of the bug reports, people say they were using
memory heavily (a low-memory situation).
I have tracked down the problem to a null dereference during a buffer
cache rebalance (which occurs during low memory situations). Here is
the info I have. Unfortunately I don't know much about how the VM handles
low memory situations with the VFS, so if you have any ideas, it would
be helpful. :)

Unable to handle kernel NULL pointer dereference at virtual address 00000028
<4> printing eip:
<4>c018aa67
<1>*pde = 00000000
<4>Oops: 0000
<4>CPU: 2
<4>EIP: 0010:[<c018aa67>] Not tainted
<4>EFLAGS: 00010203
<4>eax: 00000000 ebx: 00000000 ecx: c0380490 edx: c217a540
<4>esi: f2757a80 edi: 00000001 ebp: 00000000 esp: f7bd1ee8
<4>ds: 0018 es: 0018 ss: 0018
<4>Process kswapd (pid: 11, stackpage=f7bd1000)
<4>Stack: 00000000 00000000 c217a600 00000000 00000202 00000000 c217a540 000001d0
<4> c217a540 000102ec c018144f f7abe400 c217a540 000001d0 c0155006 c217a540
<4> 000001d0 f7bd0000 00000000 c0147f0a c217a540 000001d0 f7bd0000 f7bd0000
<4>Call Trace: [<c018144f>] [<c0155006>] [<c0147f0a>] [<c014832b>] [<c0148398>]
<4> [<c0148431>] [<c014847f>] [<c0148593>] [<c0105000>] [<c010578a>] [<c01484f8>]
<4>
<4>Code: 8b 5b 28 f6 40 19 02 75 47 39 f3 75 f1 c6 05 80 24 38 c0 01

>>EIP; c018aa67 <journal_try_to_free_buffers+45/f4> <=====
Trace; c018144f <ext3_releasepage+2d/32>
Trace; c0155006 <try_to_release_page+4e/78>
Trace; c0147f0a <shrink_cache+270/4f6>
Trace; c014832b <shrink_caches+61/98>
Trace; c0148398 <try_to_free_pages+36/54>
Trace; c0148431 <kswapd_balance_pgdat+51/8a>
Trace; c014847f <kswapd_balance+15/2c>
Trace; c0148593 <kswapd+9b/b6>
Trace; c0105000 <_stext+0/0>
Trace; c010578a <kernel_thread+2e/40>
Trace; c01484f8 <kswapd+0/b6>

The problem is that in this function:

int journal_try_to_free_buffers(journal_t *journal,
                                struct page *page, int gfp_mask)
{
        struct buffer_head *bh;
        struct buffer_head *tmp;
        int locked_or_dirty = 0;
        int call_ttfb = 1;

        J_ASSERT(PageLocked(page));

        bh = page->buffers;
        tmp = bh;
        spin_lock(&journal_datalist_lock);
        do {
                struct buffer_head *p = tmp;

                tmp = tmp->b_this_page;
...

tmp (page->buffers) above is null. b_this_page is at offset 0x28 (the
accessed address in the oops). This means that page->buffers is set to
null by some other routine, which results in the oops.

I read the page allocate code
(ext3_read_page->block_read_full_page->create_empty_buffers->create_buffers),
and it appears that it is not possible to allocate a page->buffers value
of zero in the allocate function. I am having difficulty reproducing
and cannot debug further, however.

Can page->buffers be set to zero somewhere else? Perhaps kswapd and
some other thread are racing on the free?

Thanks
-steve

> Thanks!

^ permalink raw reply [flat|nested] 22+ messages in thread
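Steven's reading of the oops above (a NULL `page->buffers` plus the 0x28 offset of `b_this_page` gives the faulting address 00000028) can be checked with a small sketch. The struct below is a hypothetical 32-bit stand-in, not the kernel's `struct buffer_head`: only the 0x28 offset comes from the message, and the names `mock_buffer_head`, `fault_address` and `b_this_page_address` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical 32-bit stand-in for the buffer head; NOT the kernel's
 * struct buffer_head. The fields before b_this_page are opaque padding
 * sized so that b_this_page lands at offset 0x28, as stated in the message.
 */
struct mock_buffer_head {
    uint32_t earlier_fields[10];  /* 10 * 4 = 40 = 0x28 bytes */
    uint32_t b_this_page;         /* 32-bit "pointer" to the next buffer */
};

/* Address the CPU touches when reading a field `field_offset` bytes
 * into an object located at `base`. */
uint32_t fault_address(uint32_t base, size_t field_offset)
{
    return base + (uint32_t)field_offset;
}

/* Address touched by evaluating bh->b_this_page when bh == bh_addr. */
uint32_t b_this_page_address(uint32_t bh_addr)
{
    assert(offsetof(struct mock_buffer_head, b_this_page) == 0x28);
    return fault_address(bh_addr,
                         offsetof(struct mock_buffer_head, b_this_page));
}
```

With `bh_addr` equal to 0 the function returns 0x28, which lines up with `ebx: 00000000` in the register dump and with the leading `8b 5b 28` bytes of the Code line, which Marcelo's earlier decoded oops shows as `mov 0x28(%ebx),%ebx`. The sketch only illustrates the arithmetic; it says nothing about which code path cleared `page->buffers`.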
* [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-15 17:56 ` Steven Dake @ 2004-06-17 13:16 ` Marcelo Tosatti 2004-06-18 3:08 ` Andrew Morton 2004-06-21 15:06 ` Stephen C. Tweedie 0 siblings, 2 replies; 22+ messages in thread From: Marcelo Tosatti @ 2004-06-17 13:16 UTC (permalink / raw) To: Steven Dake; +Cc: Stian Jordet, Linux Kernel Mailing List, sct, akpm On Tue, Jun 15, 2004 at 10:56:38AM -0700, Steven Dake wrote: > On Tue, 2004-06-15 at 06:16, Marcelo Tosatti wrote: > > On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote: > > > Marcelo and Stian, > > > > > > I have also seen this oops relating to low memory situations. I think > > > ext3 allocates some data, has a null return, sets something to null, and > > > then later it is dereferenced in kswapd. > > > > > > Anyone have a patch for this problem? > > > Steven, > > > > From what I remember, Stian's oopses were happening in random places in the VM freeing > > routines. That makes me believe what he was seeing was some kind of hardware issue, > > because otherwise the oopses would be happening in the same place (in case it was > > a software bug). The codepaths which he saw trying to access invalid addresses are > > executed flawlessly by all 2.4.x mainline users. He was also seeing oopses with v2.6. > > > > Assuming his HW is not faulty, I can think of some driver corrupting his memory. > > > > Do you have any traces of the oopses you are seeing? > > > > Stian, you told us you switched servers now, I assume the problem is gone? > > Are you still running v2.4 on that server? > > > > Marcelo, > > Stian responded saying he upgraded to 2.6 and also removed the memory > intensive script and his problems went away. I suspect removing the > memory intensive script did the trick. 2.6 could also be fixed, who > knows? > > After reading lkml, there are about 4-5 oopses in the same function at the > same location. 
The people report they were heavily using memory (low > memory situation) in some of the bug reports. > > I have tracked down the problem to a null dereference during a buffer > cache rebalance (which occurs during low memory situations). Here is > the info I have. Unfortunately I don't know much about how vm handles > low memory situations with the vfs, so if you have any ideas, it would > be helpful. :) > > nable to handle kernel NULL pointer dereference at virtual address > 00000028 > <4> printing eip: > <4>c018aa67 > <1>*pde = 00000000 > <4>Oops: 0000 > <4>CPU: 2 > <4>EIP: 0010:[<c018aa67>] Not tainted > <4>EFLAGS: 00010203 > <4>eax: 00000000 ebx: 00000000 ecx: c0380490 edx: c217a540 > <4>esi: f2757a80 edi: 00000001 ebp: 00000000 esp: f7bd1ee8 > <4>ds: 0018 es: 0018 ss: 0018 > <4>Process kswapd (pid: 11, stackpage=f7bd1000) > <4>Stack: 00000000 00000000 c217a600 00000000 00000202 00000000 c217a540 000001d0 > <4> c217a540 000102ec c018144f f7abe400 c217a540 000001d0 c0155006 c217a540 > <4> 000001d0 f7bd0000 00000000 c0147f0a c217a540 000001d0 f7bd0000 f7bd0000 > <4>Call Trace: [<c018144f>] [<c0155006>] [<c0147f0a>] > [<c014832b>] [<c0148398>] > <4> [<c0148431>] [<c014847f>] [<c0148593>] [<c0105000>] > [<c010578a>] [<c01484f8>] > <4> > <4>Code: 8b 5b 28 f6 40 19 02 75 47 39 f3 75 f1 c6 05 80 24 38 c0 01 > > > >>EIP; c018aa67 <journal_try_to_free_buffers+45/f4> <===== > Trace; c018144f <ext3_releasepage+2d/32> > Trace; c0155006 <try_to_release_page+4e/78> > Trace; c0147f0a <shrink_cache+270/4f6> > Trace; c014832b <shrink_caches+61/98> > Trace; c0148398 <try_to_free_pages+36/54> > Trace; c0148431 <kswapd_balance_pgdat+51/8a> > Trace; c014847f <kswapd_balance+15/2c> > Trace; c0148593 <kswapd+9b/b6> > Trace; c0105000 <_stext+0/0> > Trace; c010578a <kernel_thread+2e/40> > Trace; c01484f8 <kswapd+0/b6> > > The problem is that in this function: > > int journal_try_to_free_buffers(journal_t *journal, > struct page *page, int gfp_mask) > { > struct buffer_head *bh; > 
struct buffer_head *tmp; > int locked_or_dirty = 0; > int call_ttfb = 1; > > J_ASSERT(PageLocked(page)); > > bh = page->buffers; > tmp = bh; > spin_lock(&journal_datalist_lock); > do { > struct buffer_head *p = tmp; > > tmp = tmp->b_this_page; > ... > > tmp (page->buffers) above is null. b_this_page is at offset 0x28 (the accessed address in the oops). This means that > page->buffers is set to null by some other routine which results in the oops. > > I read the page allocate code > (ext3_read_page->block_read_full_page->create_empty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function. I am having difficulty reproducing and cannot debug further, however. Can page->buffers be set to zero somewhere else? >Perhaps kswapd and some other thread are racing on the free? Steve, Hmm, I'm starting to believe we might have an issue here. Searching the lkml archives I find other similar oopses at the same place (trying to access 00000028, tmp->b_this_page), as you said. However, I wonder what other kernel codepath could remove the page buffers under us; the page MUST be locked here. In the backtrace above the page is locked by shrink_cache(). And with the page locked, we guarantee the VM freeing routines (shrink_cache) won't try to mess with the page. Can you reproduce the oops? Stephen, Andrew, do you have any idea how the buffers could have vanished under us with the page locked? That should not be possible. I don't see how this "page->buffers = NULL" could be caused by a hardware problem, which is usually a one- or two-bit flip. Thanks everyone ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-17 13:16 ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti @ 2004-06-18 3:08 ` Andrew Morton 2004-06-19 19:48 ` Marcelo Tosatti 2004-06-21 15:06 ` Stephen C. Tweedie 1 sibling, 1 reply; 22+ messages in thread From: Andrew Morton @ 2004-06-18 3:08 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: sdake, liste, linux-kernel, sct Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote: > > > tmp (page->buffers) above is null. b_this_page is at offset 0x28 (the accessed address in the oops). This means that > > page->buffers is set to null by some other routine which results in the oops. > > > > I read the page allocate code > > (ext3_read_page->block_read_full_page->create_emty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function. I am having difficulty reproducing and cannot debug further, however. Can page->buffers be set to zero somewhere else? > >Perhaps kswapd and some other thread are racing on the free? > > Steve, > > Hum, I'm starting to believe we might have an issue here. > > Searching lkml archives I find other similar oopses at the same place > (trying to access 00000028, tmp->b_this_page), as you said. > > However I wonder what other kernel codepath could remove the page buffers > under us, the page MUST be locked here. In the backtrace above the page > is locked by shrink_cache(). And with the page locked, we guarantee the VM > freeing routines (shrink_cache) wont try to mess with the page. > > Can you reproduce the oopsen? > > Stephen, Andrew, do you have any idea how the buffers could have vanished > under us with the page locked? That should not be possible. > > I dont see how this "page->buffers = NULL" could be caused by hardware problem, > which is usually one or two bit flip. It's a bit odd. 
The page is definitely locked, and definitely had non-null ->buffers a few tens of instructions beforehand. Is this an SMP machine? One possibility is that we died on the second pass around the loop: page->buffers points at a buffer_head which has a NULL ->b_this_page. But I cannot suggest how ->b_this_page could have been zapped. ^ permalink raw reply [flat|nested] 22+ messages in thread
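The "second pass" possibility can be made concrete with a small user-space sketch. The types are stripped-down, hypothetical stand-ins for the kernel structures, null_pass is a made-up helper, and the loop termination test (stop when the walk returns to page->buffers) is assumed to match the 2.4 do/while loop rather than copied from it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal stand-ins for the kernel types. */
struct buffer_head {
    struct buffer_head *b_this_page; /* next buffer in the page's ring */
};

struct page {
    struct buffer_head *buffers;     /* head of the ring */
};

/*
 * Follow the ring in the same order as the do/while loop and report
 * how many buffers were visited before a NULL pointer would be
 * dereferenced:
 *   0   page->buffers itself is NULL (oops on the first pass)
 *   k   the k-th buffer's b_this_page is NULL (oops on pass k + 1)
 *  -1   the ring closes back on page->buffers, no oops
 *  -2   the ring did not close within max_steps
 */
static int null_pass(const struct page *page, int max_steps)
{
    const struct buffer_head *tmp = page->buffers;
    int pass;

    for (pass = 0; pass < max_steps; pass++) {
        if (!tmp)
            return pass;
        tmp = tmp->b_this_page;
        if (tmp == page->buffers)
            return -1;
    }
    return -2;
}
```

A return value of 1 corresponds to the case described above: page->buffers is valid, but the buffer it points at has a NULL b_this_page, so the fault happens on the second pass and still reports address 0x28.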
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-18 3:08 ` Andrew Morton @ 2004-06-19 19:48 ` Marcelo Tosatti 2004-06-19 19:50 ` Frank van Maarseveen ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Marcelo Tosatti @ 2004-06-19 19:48 UTC (permalink / raw) To: Andrew Morton, frankvm; +Cc: sdake, liste, linux-kernel, sct On Thu, Jun 17, 2004 at 08:08:59PM -0700, Andrew Morton wrote: > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote: > > > > > tmp (page->buffers) above is null. b_this_page is at offset 0x28 (the accessed address in the oops). This means that > > > page->buffers is set to null by some other routine which results in the oops. > > > > > > I read the page allocate code > > > (ext3_read_page->block_read_full_page->create_emty_buffers->create_buffers), and it appears that it is not possible to allocate a page->buffers value of zero in the allocate function. I am having difficulty reproducing and cannot debug further, however. Can page->buffers be set to zero somewhere else? > > >Perhaps kswapd and some other thread are racing on the free? > > > > Steve, > > > > Hum, I'm starting to believe we might have an issue here. > > > > Searching lkml archives I find other similar oopses at the same place > > (trying to access 00000028, tmp->b_this_page), as you said. > > > However I wonder what other kernel codepath could remove the page buffers > > under us, the page MUST be locked here. In the backtrace above the page > > is locked by shrink_cache(). And with the page locked, we guarantee the VM > > freeing routines (shrink_cache) wont try to mess with the page. > > > > Can you reproduce the oopsen? > > > > Stephen, Andrew, do you have any idea how the buffers could have vanished > > under us with the page locked? That should not be possible. > > > > I dont see how this "page->buffers = NULL" could be caused by hardware problem, > > which is usually one or two bit flip. > > It's a bit odd. 
The page is definitely locked, and definitely had non-null > ->buffers a few tens of instructions beforehand. > > Is this an SMP machine? Steven, did you see the NULL ->b_this_page on SMP or UP? Stian Jordet had an SMP server, but he also was seeing oopses with v2.6: kernel BUG at mm/page_alloc.c:201! invalid operand: 0000 [#1] CPU: 0 EIP: 0060:[free_pages_bulk+482/512] Not tainted EIP is at free_pages_bulk+0x1e2/0x200 eax: 00000001 ebx: c00609c8 ecx: 00000000 edx: \ 666026a5 esi: 666026a4 edi: ffffffff ebp: 33301352 esp: \ c86d5d90 Process mrtg (pid: 26804, threadinfo=c86d4000 \ task=c9b860c0) Call Trace: [free_hot_cold_page+217/240] \ free_hot_cold_page+0xd9/0xf0 [do_generic_mapping_read+714/1008] \ do_generic_mapping_read+0x2ca/0x3f0 [file_read_actor+0/256] file_read_actor+0x0/0x100 [__generic_file_aio_read+454/512] \ __generic_file_aio_read+0x1c6/0x200 [file_read_actor+0/256] file_read_actor+0x0/0x100 [generic_file_aio_read+91/128] \ generic_file_aio_read+0x5b/0x80 [do_sync_read+137/192] do_sync_read+0x89/0xc0 [do_page_fault+300/1328] do_page_fault+0x12c/0x530 [do_brk+324/560] do_brk+0x144/0x230 [vfs_read+184/304] vfs_read+0xb8/0x130 [sys_read+66/112] sys_read+0x42/0x70 [syscall_call+7/11] syscall_call+0x7/0xb and different oopses on v2.4, including sync_page_buffers (also NULL+offset access): <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021 c0132e86 eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012 esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4 >>EIP; c0132e86 <sync_page_buffers+e/a4> <===== Trace; c0132fdc <try_to_free_buffers+c0/ec> Code; c0132e86 <sync_page_buffers+e/a4> 00000000 <_EIP>: Code; c0132e86 <sync_page_buffers+e/a4> <===== 0: f6 43 18 06 testb $0x6,0x18(%ebx) <===== Code; c0132e8a <sync_page_buffers+12/a4> and the journal_try_to_free_buffers() one: Unable to handle kernel NULL pointer dereference at virtual address 00000028 c015e3a2 *pde = 00000000 Oops: 0000 CPU: 0 EIP: 0010:[<c015e3a2>] Not tainted 
EFLAGS: 00010203 eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000 Code; c015e3a2 <journal_try_to_free_buffers+5a/98> 00000000 <_EIP>: Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <===== 0: 8b 5b 28 mov 0x28(%ebx),%ebx <===== Code; c015e3a5 <journal_try_to_free_buffers+5d/98>

He upgraded the box and stopped seeing the crashes, running recent v2.6.

However, he also mentioned that his crashes started after upgrading from v2.4.19->2.4.22. We should search the diff between them for anything suspicious.

I can't figure out from the archived reports if this is UP or SMP only.

Frank van Maarseveen has also seen the journal_try_to_free_buffers() NULL b_this_page. Frank, were you running SMP or UP when you reported the oops with 2.4.23?

> One possibility is that we died on the second pass around the loop:
> page->buffers points at a buffer_head which has a NULL ->b_this_page. But
> I cannot suggest how ->b_this_page could have been zapped.

Oh, yes, indeed.

Maybe adding this (untested) to v2.4 mainline helps? Comments?

--- transaction.c.orig 2004-06-19 15:21:32.861148560 -0300
+++ transaction.c 2004-06-19 15:23:18.214132472 -0300
@@ -1694,6 +1694,24 @@
         return 0;
 }
 
+void debug_page(struct page *p)
+{
+        struct buffer_head *bh;
+
+        bh = p->buffers;
+
+        printk(KERN_ERR "%s: page index:%u count:%d flags:%x\n", __FUNCTION__,
+                ,p->index , atomic_read(&p->count), p->flags);
+
+        do {
+                printk(KERN_ERR "%s: bh b_next:%p blocknr:%u b_list:%u state:%x\n",
+                        __FUNCTION__, bh->b_next, bh->b_blocknr, bh->b_list,
+                        bh->b_state);
+                bh = bh->b_this_page;
+        } while (bh);
+}
+
+
 /**
  * int journal_try_to_free_buffers() - try to free page buffers.
@@ -1752,6 +1770,11 @@
         do {
                 struct buffer_head *p = tmp;
 
+                if (!unlikely(tmp)) {
+                        debug_page(page);
+                        BUG();
+                }
+
                 tmp = tmp->b_this_page;
                 if (buffer_jbd(p))
                         if (!__journal_try_to_free_buffer(p, &locked_or_dirty))

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-19 19:48 ` Marcelo Tosatti @ 2004-06-19 19:50 ` Frank van Maarseveen 2004-06-19 22:17 ` Marcelo Tosatti 2004-06-19 20:04 ` Andrew Morton 2004-06-20 7:56 ` Willy Tarreau 2 siblings, 1 reply; 22+ messages in thread From: Frank van Maarseveen @ 2004-06-19 19:50 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, sdake, liste, linux-kernel, sct On Sat, Jun 19, 2004 at 04:48:49PM -0300, Marcelo Tosatti wrote: > > However, he also mentioned that his crashes started after upgrading > from v2.4.19->2.4.22. Should search the diff between them looking for > anything suspicious. > > I can't figure out from the archived reports if this is UP or SMP only. > > Frank van Maarseveen has also seen the journal_try_to_free_buffers() NULL > b_this_page. Frank, were you running SMP or UP when you reported the oops > with 2.4.23? UP -- Frank ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-19 19:50 ` Frank van Maarseveen @ 2004-06-19 22:17 ` Marcelo Tosatti 2004-06-19 22:44 ` Frank van Maarseveen 0 siblings, 1 reply; 22+ messages in thread From: Marcelo Tosatti @ 2004-06-19 22:17 UTC (permalink / raw) To: Frank van Maarseveen, Andrew Morton, sdake, liste, linux-kernel, sct On Sat, Jun 19, 2004 at 09:50:13PM +0200, Frank van Maarseveen wrote: > On Sat, Jun 19, 2004 at 04:48:49PM -0300, Marcelo Tosatti wrote: > > > > However, he also mentioned that his crashes started after upgrading > > from v2.4.19->2.4.22. Should search the diff between them looking for > > anything suspicious. > > > > I can't figure out from the archived reports if this is UP or SMP only. > > > > Frank van Maarseveen has also seen the journal_try_to_free_buffers() NULL > > b_this_page. Frank, were you running SMP or UP when you reported the oops > > with 2.4.23? > > UP Hi Frank, Has the oops happened again? What kernel are you running now? Thanks! ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-19 22:17 ` Marcelo Tosatti @ 2004-06-19 22:44 ` Frank van Maarseveen 0 siblings, 0 replies; 22+ messages in thread From: Frank van Maarseveen @ 2004-06-19 22:44 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, sdake, liste, linux-kernel, sct On Sat, Jun 19, 2004 at 07:17:11PM -0300, Marcelo Tosatti wrote: > > Has the oops happened again? What kernel are you running now? no new oopses. The machine has used 2.4.21 for a while, then 2.4.25 and now 2.4.26. The problem _seems_ to show up here only in 2.4.22, 2.4.23 and 2.4.24. -- Frank ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-19 19:48 ` Marcelo Tosatti 2004-06-19 19:50 ` Frank van Maarseveen @ 2004-06-19 20:04 ` Andrew Morton 2004-06-20 7:56 ` Willy Tarreau 2 siblings, 0 replies; 22+ messages in thread From: Andrew Morton @ 2004-06-19 20:04 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: frankvm, sdake, liste, linux-kernel, sct

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> Maybe adding this (untested) to v2.4 mainline helps? Comments?

It would be helpful.

> --- transaction.c.orig 2004-06-19 15:21:32.861148560 -0300
> +++ transaction.c 2004-06-19 15:23:18.214132472 -0300
> @@ -1694,6 +1694,24 @@
>          return 0;
>  }
>
> +void debug_page(struct page *p)
> +{
> +        struct buffer_head *bh;
> +
> +        bh = p->buffers;
> +
> +        printk(KERN_ERR "%s: page index:%u count:%d flags:%x\n", __FUNCTION__,
> +                ,p->index , atomic_read(&p->count), p->flags);
                   ^
> +
> +        do {
> +                printk(KERN_ERR "%s: bh b_next:%p blocknr:%u b_list:%u state:%x\n",
> +                        __FUNCTION__, bh->b_next, bh->b_blocknr, bh->b_list,
> +                        bh->b_state);
> +                bh = bh->b_this_page;
> +        } while (bh);
> +}
> +

You'll want to make this a while (bh) {} loop, to handle the page->buffers==NULL case.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-19 19:48 ` Marcelo Tosatti 2004-06-19 19:50 ` Frank van Maarseveen 2004-06-19 20:04 ` Andrew Morton @ 2004-06-20 7:56 ` Willy Tarreau 2 siblings, 0 replies; 22+ messages in thread From: Willy Tarreau @ 2004-06-20 7:56 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, frankvm, sdake, liste, linux-kernel, sct Hi Marcelo, On Sat, Jun 19, 2004 at 04:48:49PM -0300, Marcelo Tosatti wrote: > + if (!unlikely(tmp)) { I think you meant "if (unlikely(!tmp))" here. However, this does not make a big difference. Regards, willy ^ permalink raw reply [flat|nested] 22+ messages in thread
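Why the two spellings "do not make a big difference" can be shown with a short sketch: both tests take the branch exactly when tmp is NULL, and only the prediction hint given to the compiler differs. The unlikely() macro below is written in the style used by later kernels on top of the GCC/Clang builtin __builtin_expect; its exact 2.4 definition is an assumption here, and check_as_posted and check_as_suggested are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hint in the style of the kernel macro (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Form used in the posted patch. The hint claims "tmp is probably
 * NULL", so the debug branch is predicted as the common one.
 */
static int check_as_posted(const void *tmp)
{
    return !unlikely(tmp) ? 1 : 0;
}

/*
 * Suggested form. The hint claims "a NULL tmp is the rare case",
 * which is what the debug check intends.
 */
static int check_as_suggested(const void *tmp)
{
    return unlikely(!tmp) ? 1 : 0;
}
```

Both helpers return 1 for a NULL pointer and 0 otherwise, so the BUG() path fires in the same situations; the suggested form merely keeps the hot path laid out for the non-NULL case.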
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-17 13:16 ` [2.4] page->buffers vanished in journal_try_to_free_buffers() Marcelo Tosatti 2004-06-18 3:08 ` Andrew Morton @ 2004-06-21 15:06 ` Stephen C. Tweedie 2004-06-21 15:53 ` Marcelo Tosatti 1 sibling, 1 reply; 22+ messages in thread From: Stephen C. Tweedie @ 2004-06-21 15:06 UTC (permalink / raw) To: Marcelo Tosatti Cc: Stephen Tweedie, Steven Dake, Stian Jordet, Linux Kernel Mailing List, sct, Andrew Morton, Stephen Tweedie Hi, On Thu, 2004-06-17 at 14:16, Marcelo Tosatti wrote: > Stephen, Andrew, do you have any idea how the buffers could have vanished > under us with the page locked? That should not be possible. No, especially not on UP as Frank reported. > I don't see how this "page->buffers = NULL" could be caused by hardware problem, > which is usually one or two bit flip. We don't know for sure that it's page->buffers. If we have gone round the bh->b_this_page loop already, we could have ended up following the pointers either to an invalid bh, or to one that's not on the current page. So it could also be the previous buffer's b_this_page that got clobbered, rather than page->buffers. That's possible in this case, but it's still a bit surprising that we'd *always* get a NULL pointer rather than some other random pointer as a result. The buffer-ring debug patch that you posted looks like the obvious way to dig further into this. If that doesn't get anywhere, we can also trap the case where following bh->b_this_page gives us a buffer whose b_page is on a different page. --Stephen ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-21 15:06 ` Stephen C. Tweedie @ 2004-06-21 15:53 ` Marcelo Tosatti 2004-06-22 22:13 ` Stephen C. Tweedie 0 siblings, 1 reply; 22+ messages in thread From: Marcelo Tosatti @ 2004-06-21 15:53 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Steven Dake, Stian Jordet, Linux Kernel Mailing List, sct, Andrew Morton Hi Stephen, On Mon, Jun 21, 2004 at 04:06:50PM +0100, Stephen C. Tweedie wrote: > Hi, > > On Thu, 2004-06-17 at 14:16, Marcelo Tosatti wrote: > > > Stephen, Andrew, do you have any idea how the buffers could have vanished > > under us with the page locked? That should not be possible. > > No, especially not on UP as Frank reported. > > > I don't see how this "page->buffers = NULL" could be caused by hardware problem, > > which is usually one or two bit flip. > > We don't know for sure that it's page->buffers. If we have gone round > the bh->b_this_page loop already, we could have ended up following the > pointers either to an invalid bh, or to one that's not on the current > page. So it could also be the previous buffer's b_this_page that got > clobbered, rather than page->buffers. > > That's possible in this case, but it's still a bit surprising that we'd > *always* get a NULL pointer rather than some other random pointer as a > result. I don't remember seeing any case which was not a NULL pointer dereference. > The buffer-ring debug patch that you posted looks like the obvious way > to dig further into this. If that doesn't get anywhere, we can also trap > the case where following bh->b_this_page gives us a buffer whose b_page > is on a different page. Fine. Just printing out bh->b_page at debug_page() will allow us to verify that, yes?
--- transaction.c.orig 2004-06-21 12:50:01.090082264 -0300
+++ transaction.c 2004-06-21 12:50:45.574319632 -0300
@@ -1704,9 +1704,9 @@ void debug_page(struct page *p)
                 p->index, atomic_read(&p->count), p->flags);
 
         while (bh) {
-                printk(KERN_ERR "%s: bh b_next:%p blocknr:%lu b_list:%u state:%lx\n",
+                printk(KERN_ERR "%s: bh b_next:%p blocknr:%lu b_list:%u state:%lx b_page:%p\n",
                         __FUNCTION__, bh->b_next, bh->b_blocknr, bh->b_list,
-                        bh->b_state);
+                        bh->b_state, bh->b_page);
                 bh = bh->b_this_page;
         }
 }

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [2.4] page->buffers vanished in journal_try_to_free_buffers() 2004-06-21 15:53 ` Marcelo Tosatti @ 2004-06-22 22:13 ` Stephen C. Tweedie 0 siblings, 0 replies; 22+ messages in thread From: Stephen C. Tweedie @ 2004-06-22 22:13 UTC (permalink / raw) To: Marcelo Tosatti Cc: Steven Dake, Stian Jordet, Linux Kernel Mailing List, sct, Andrew Morton, Stephen Tweedie Hi, On Mon, 2004-06-21 at 16:53, Marcelo Tosatti wrote: > > The buffer-ring debug patch that you posted looks like the obvious way > > to dig further into this. If that doesn't get anywhere, we can also trap > > the case where following bh->b_this_page gives us a buffer whose b_page > > is on a different page. > > Fine. Just printing out bh->b_page at debug_page() will allow us to verify that, yes? For most cases, yes. There are basically three corruption cases: b_this_page leads us to an oops, an infinite loop, or a loop including a bogus page. Making the b_this_page ring walks trap on any bad b_page would help in the latter two cases, but if we're always getting the first case, just extending the existing debug patch would be fine. --Stephen ^ permalink raw reply [flat|nested] 22+ messages in thread
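The three corruption cases listed above can be told apart by a bounded ring walk. The following is a user-space sketch only: the types are hypothetical stand-ins for the kernel structures, check_ring and the enum names are made up, and the bound max_bufs is an assumption (for example 8, if a 4 KiB page holds at most eight 512-byte buffers). It also cannot catch a garbage non-NULL pointer, which would still fault when dereferenced.

```c
#include <assert.h>
#include <stddef.h>

struct page;

/* Hypothetical, minimal stand-ins for the kernel types. */
struct buffer_head {
    struct buffer_head *b_this_page; /* next buffer in the page's ring */
    struct page *b_page;             /* page this buffer belongs to */
};

struct page {
    struct buffer_head *buffers;     /* head of the ring */
};

enum ring_status {
    RING_OK,          /* ring closes on page->buffers, every b_page matches */
    RING_NULL,        /* walk reaches a NULL pointer: the oops case */
    RING_NO_CLOSE,    /* walk never returns to the head: the endless loop case */
    RING_WRONG_PAGE   /* a buffer on the ring belongs to another page */
};

/* Walk at most max_bufs buffers and classify the state of the ring. */
static enum ring_status check_ring(const struct page *page, int max_bufs)
{
    const struct buffer_head *bh = page->buffers;
    int n;

    for (n = 0; n < max_bufs; n++) {
        if (!bh)
            return RING_NULL;
        if (bh->b_page != page)
            return RING_WRONG_PAGE;
        bh = bh->b_this_page;
        if (bh == page->buffers)
            return RING_OK;
    }
    return RING_NO_CLOSE;
}
```

If every report comes back as the NULL case, printing b_page from the existing debug patch is enough; a check along these lines only becomes necessary once one of the other two cases is suspected.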