* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
@ 2004-06-15 17:44 Nick Warne
  2004-06-15 19:15 ` Stian Jordet
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Warne @ 2004-06-15 17:44 UTC
  To: linux-kernel

FYI.

I have a box here that was originally running 2.4.x. I updated to 2.6.x
a few months ago, and all was well. Then I started to get curious
oopses, none of them the same.

I started to suspect NFS, as I use an old 486 to hold the web pages and
serve them to the box via NFS... The oopses occurred every Saturday
morning at 4:02. That led me to think it was some sort of cron.weekly
issue with the disc activity and file access or the like, or whatever...
I didn't know - I was on a fishing exercise (with a lot of searching on
the LKML).

But, after I talked to a member of the HantsLUG and showed him logs and
stuff, he brought up the swap size. This box was once 64MB, but is now
128MB - with 128MB swap. I created an additional swap file (256MB), and
(touch wood) no oops since, all healthy :)

I never looked at this before, as swap was never used _during_ normal
running of the box, but as he said, maybe the cron.weekly ran a lot of
stuff that did use it up...

Nick
--
"When you're chewing on life's gristle,
Don't grumble, Give a whistle..."
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 17:44 Oopses with both recent 2.4.x kernels and 2.6.x kernels Nick Warne
@ 2004-06-15 19:15 ` Stian Jordet
  0 siblings, 0 replies; 13+ messages in thread
From: Stian Jordet @ 2004-06-15 19:15 UTC
  To: Nick Warne; +Cc: linux-kernel

Tue, 15.06.2004 at 18:44 +0100, Nick Warne wrote:
> But, after I talked to a member of the HantsLUG and showed him logs and
> stuff, he brought up the swap size. This box was once 64MB, but
> is now 128MB - with 128MB swap. I created an additional swap file
> (256MB), and (touch wood) no oops since, all healthy :) I never
> looked at this before, as swap was never used _during_ normal running
> of the box, but as he said, maybe the cron.weekly ran a lot of stuff
> that did use it up...

I doubt that has been my problem... The box in question had 180 MB RAM
and 512 MB swap. The script can't have used that much... And even if it
did, it is a bug that the box oopses and dies, I guess.

Best regards,
Stian
* Oopses with both recent 2.4.x kernels and 2.6.x kernels
@ 2004-02-03 18:26 Stian Jordet
  2004-02-05 23:51 ` Marcelo Tosatti
  0 siblings, 1 reply; 13+ messages in thread
From: Stian Jordet @ 2004-02-03 18:26 UTC
  To: Linux Kernel Mailing List

Hello,

I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
each, without problems. After an upgrade to 2.4.22, the box hasn't been
up for 30 days in a row. This happened in early November. I have
captured oopses with both 2.4.23 and 2.6.1, which I have sent decoded to
the list, but I never got any reply.

I have run memtest86 on the box, no errors. What else can be the
problem? I could of course go back to 2.4.19, which I know worked fine,
but some security holes have been fixed since then...

Any thoughts?

Best regards,
Stian
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-03 18:26 Stian Jordet
@ 2004-02-05 23:51 ` Marcelo Tosatti
  2004-03-02 11:03 ` Stian Jordet
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2004-02-05 23:51 UTC
  To: Stian Jordet; +Cc: Linux Kernel Mailing List

On Tue, 3 Feb 2004, Stian Jordet wrote:

> Hello,
>
> I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> each, without problems. After an upgrade to 2.4.22, the box hasn't been
> up for 30 days in a row. This happened in early November. I have
> captured oopses with both 2.4.23 and 2.6.1, which I have sent decoded to
> the list, but I never got any reply.
>
> I have run memtest86 on the box, no errors. What else can be the
> problem? I could of course go back to 2.4.19, which I know worked fine,
> but some security holes have been fixed since then...
>
> Any thoughts?

Stian,

I have seen your 2.4.x oopses and they seemed odd. The faults were
happening in different functions (mostly inside VM "freeing"), due to
what seems to be random crap in memory:

<1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
c0132e86
*pde = 00000000

eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4

>>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====

>>edi; c17e38c0 <_end+14b5844/bd23f84>
>>ebp; c1047a00 <_end+d19984/bd23f84>
>>esp; c86cbdb4 <_end+839dd38/bd23f84>

Trace; c0132fdc <try_to_free_buffers+c0/ec>

Code;  c0132e86 <sync_page_buffers+e/a4>
00000000 <_EIP>:
Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
   0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
Code;  c0132e8a <sync_page_buffers+12/a4>
   4:   74 7c                     je     82 <_EIP+0x82> c0132f08 <sync_page_buffers+90/a4>
Code;  c0132e8c <sync_page_buffers+14/a4>
   6:   b8 07 00 00 00            mov    $0x7,%eax
Code;  c0132e91 <sync_page_buffers+19/a4>


<1>Unable to handle kernel NULL pointer dereference at virtual address 00000028
c015e3a2
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c015e3a2>]    Not tainted
EFLAGS: 00010203

eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000

Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
00000000 <_EIP>:
Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
   0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
   3:   f6 42 19 04               testb  $0x4,0x19(%edx)
Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
   7:   74 17                     je     20 <_EIP+0x20> c015e3c2 <journal_try_to_free_buffers+7a/98>

And other similar oopses.

Are you sure there is nothing messing up the hardware?

How long have you run memtest86? It can sometimes take a long time to
show up errors.

The 2.6.x oopses on the same hardware are also a useful source of
information.
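
A hedged aside on reading the two traces above: for an i386 operand
written as disp(%reg), the address that gets accessed is the register
value plus the displacement. The sketch below only redoes that
arithmetic with the register values printed in the oopses. The helper
names are made up for illustration (they are not kernel functions), and
the 4 KiB first-page cutoff for the "NULL pointer dereference" wording
is an assumption about the i386 fault handler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: address accessed by a "disp(%reg)" operand. */
static uint32_t effective_address(uint32_t base_reg, uint32_t disp)
{
    return base_reg + disp;
}

/*
 * Hypothetical helper: true for addresses inside the first page.
 * Assumption: 4 KiB pages, and that faults below that limit are the
 * ones reported as "NULL pointer dereference".
 */
static int in_null_page(uint32_t addr)
{
    return addr < 4096u;
}

/* Re-check both oopses using only the numbers printed in them. */
static void check_oops_arithmetic(void)
{
    /* First oops: testb $0x6,0x18(%ebx) with ebx = 00000009 */
    assert(effective_address(0x00000009u, 0x18u) == 0x00000021u);
    /* Second oops: mov 0x28(%ebx),%ebx with ebx = 00000000 */
    assert(effective_address(0x00000000u, 0x28u) == 0x00000028u);
    assert(in_null_page(0x21u) && in_null_page(0x28u));
}
```

Both sums match the "virtual address" lines, so in each case the bad
value is the base pointer in ebx (9 in the first oops, 0 in the second)
rather than the displacement.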
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-02-05 23:51 ` Marcelo Tosatti
@ 2004-03-02 11:03 ` Stian Jordet
  2004-03-02 12:31 ` Stian Jordet
  2004-06-14 17:07 ` Steven Dake
  0 siblings, 2 replies; 13+ messages in thread
From: Stian Jordet @ 2004-03-02 11:03 UTC
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

Fri, 06.02.2004 at 00:51, Marcelo Tosatti wrote:
> On Tue, 3 Feb 2004, Stian Jordet wrote:
>
> > Hello,
> >
> > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > each, without problems. After an upgrade to 2.4.22, the box hasn't been
> > up for 30 days in a row. This happened in early November. I have
> > captured oopses with both 2.4.23 and 2.6.1, which I have sent decoded to
> > the list, but I never got any reply.
> >
> > I have run memtest86 on the box, no errors. What else can be the
> > problem? I could of course go back to 2.4.19, which I know worked fine,
> > but some security holes have been fixed since then...
> >
> > Any thoughts?
>
> Stian,
>
> I have seen your 2.4.x oopses and they seemed odd. The faults were
> happening in different functions (mostly inside VM "freeing"), due to
> what seems to be random crap in memory:
>
> <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021
> c0132e86
> *pde = 00000000
>
> eax: 00000000   ebx: 00000009   ecx: 000001d2   edx: 00000012
> esi: 00000000   edi: c17e38c0   ebp: c1047a00   esp: c86cbdb4
>
> >>EIP; c0132e86 <sync_page_buffers+e/a4>   <=====
>
> >>edi; c17e38c0 <_end+14b5844/bd23f84>
> >>ebp; c1047a00 <_end+d19984/bd23f84>
> >>esp; c86cbdb4 <_end+839dd38/bd23f84>
>
> Trace; c0132fdc <try_to_free_buffers+c0/ec>
>
> Code;  c0132e86 <sync_page_buffers+e/a4>
> 00000000 <_EIP>:
> Code;  c0132e86 <sync_page_buffers+e/a4>   <=====
>    0:   f6 43 18 06               testb  $0x6,0x18(%ebx)   <=====
> Code;  c0132e8a <sync_page_buffers+12/a4>
>    4:   74 7c                     je     82 <_EIP+0x82> c0132f08 <sync_page_buffers+90/a4>
> Code;  c0132e8c <sync_page_buffers+14/a4>
>    6:   b8 07 00 00 00            mov    $0x7,%eax
> Code;  c0132e91 <sync_page_buffers+19/a4>
>
>
> <1>Unable to handle kernel NULL pointer dereference at virtual address 00000028
> c015e3a2
> *pde = 00000000
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c015e3a2>]    Not tainted
> EFLAGS: 00010203
>
> eax: 0100004d   ebx: 00000000   ecx: 000001d2   edx: 00000000
>
> Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>
> 00000000 <_EIP>:
> Code;  c015e3a2 <journal_try_to_free_buffers+5a/98>   <=====
>    0:   8b 5b 28                  mov    0x28(%ebx),%ebx   <=====
> Code;  c015e3a5 <journal_try_to_free_buffers+5d/98>
>    3:   f6 42 19 04               testb  $0x4,0x19(%edx)
> Code;  c015e3a9 <journal_try_to_free_buffers+61/98>
>    7:   74 17                     je     20 <_EIP+0x20> c015e3c2 <journal_try_to_free_buffers+7a/98>
>
> And other similar oopses.
>
> Are you sure there is nothing messing up the hardware?
>
> How long have you run memtest86? It can sometimes take a long time to
> show up errors.
>
> The 2.6.x oopses on the same hardware are also a useful source of
> information.

Marcelo,

sorry for getting back to you so insanely late. This was a production
server, and I have now moved the services it was running to another
box, so I could run a more exhaustive memtest86. It has now run for two
days without any errors. Of course there could be other flaky hardware,
but since I don't know any way to test it, and the oops occurs at
two-to-four-week intervals, it's quite time-consuming to find out. I'm
not even sure it will oops without the typical load it used to have.

Anyway, thank you very much for at least answering me. Much appreciated
:)

Best regards,
Stian
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03 ` Stian Jordet
@ 2004-03-02 12:31 ` Stian Jordet
  2004-03-09 19:22 ` Marcelo Tosatti
  2004-06-14 17:07 ` Steven Dake
  1 sibling, 1 reply; 13+ messages in thread
From: Stian Jordet @ 2004-03-02 12:31 UTC
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 86 bytes --]

Btw, here is one of the 2.6.x oopses as well (as you requested).

Best regards,
Stian

[-- Attachment #2: syslog.txt --]
[-- Type: text/plain, Size: 5902 bytes --]

Jan 20 00:20:04 dodge kernel: ------------[ cut here ]------------
Jan 20 00:20:04 dodge kernel: kernel BUG at mm/page_alloc.c:201!
Jan 20 00:20:04 dodge kernel: invalid operand: 0000 [#1]
Jan 20 00:20:04 dodge kernel: CPU:    0
Jan 20 00:20:04 dodge kernel: EIP:    0060:[free_pages_bulk+482/512]    Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010002
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel: eax: 00000001   ebx: c00609c8   ecx: 00000000   edx: 666026a5
Jan 20 00:20:04 dodge kernel: esi: 666026a4   edi: ffffffff   ebp: 33301352   esp: c86d5d90
Jan 20 00:20:04 dodge kernel: ds: 007b   es: 007b   ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c038ad80 c00609c8 00000000 c038ae40 00000001 c00609a0 c038adbc 00000000
Jan 20 00:20:04 dodge kernel:        c1000000 c038adbc 00000086 ffffffff c038ad80 c00609a0 c038ae5c 00000246
Jan 20 00:20:04 dodge kernel:        c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c00609a0 00000005
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel:  [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel:  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel:  [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel:  [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel:  [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
Jan 20 00:20:04 dodge kernel:
Jan 20 00:20:04 dodge kernel: Code: 0f 0b c9 00 9b d1 34 c0 e9 51 ff ff ff 0f 0b bc 00 9b d1 34
Jan 20 00:20:04 dodge kernel:  <1>Unable to handle kernel paging request at virtual address 00100104
Jan 20 00:20:04 dodge kernel:  printing eip:
Jan 20 00:20:04 dodge kernel: c0133cbf
Jan 20 00:20:04 dodge kernel: *pde = 00000000
Jan 20 00:20:04 dodge kernel: Oops: 0002 [#2]
Jan 20 00:20:04 dodge kernel: CPU:    0
Jan 20 00:20:04 dodge kernel: EIP:    0060:[free_pages_bulk+143/512]    Not tainted
Jan 20 00:20:04 dodge kernel: EFLAGS: 00010003
Jan 20 00:20:04 dodge kernel: EIP is at free_pages_bulk+0x8f/0x200
Jan 20 00:20:04 dodge kernel: eax: c00609a8   ebx: c038ae5c   ecx: 00200200   edx: 00100100
Jan 20 00:20:04 dodge kernel: esi: c00609a0   edi: c038ae5c   ebp: 00000203   esp: c86d5b34
Jan 20 00:20:04 dodge kernel: ds: 007b   es: 007b   ss: 0068
Jan 20 00:20:04 dodge kernel: Process mrtg (pid: 26804, threadinfo=c86d4000 task=c9b860c0)
Jan 20 00:20:04 dodge kernel: Stack: c11a48c0 00268000 00000000 c038af54 00000001 c1044d20 c038aed0 00000000
Jan 20 00:20:04 dodge kernel:        c1000000 c038adbc 00000082 ffffffff c038ad80 c100a230 c038ae5c 00000203
Jan 20 00:20:04 dodge kernel:        c01341b9 c038ad80 00000000 c038ae5c 00000000 c038ad80 c5c7ac0c 0020d000
Jan 20 00:20:04 dodge kernel: Call Trace:
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [zap_pte_range+334/400] zap_pte_range+0x14e/0x190
Jan 20 00:20:04 dodge kernel:  [zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
Jan 20 00:20:04 dodge kernel:  [unmap_page_range+75/128] unmap_page_range+0x4b/0x80
Jan 20 00:20:04 dodge kernel:  [unmap_vmas+254/544] unmap_vmas+0xfe/0x220
Jan 20 00:20:04 dodge kernel:  [exit_mmap+109/384] exit_mmap+0x6d/0x180
Jan 20 00:20:04 dodge kernel:  [mmput+81/160] mmput+0x51/0xa0
Jan 20 00:20:04 dodge kernel:  [do_exit+290/800] do_exit+0x122/0x320
Jan 20 00:20:04 dodge kernel:  [do_invalid_op+0/208] do_invalid_op+0x0/0xd0
Jan 20 00:20:04 dodge kernel:  [die+203/208] die+0xcb/0xd0
Jan 20 00:20:04 dodge kernel:  [do_invalid_op+202/208] do_invalid_op+0xca/0xd0
Jan 20 00:20:04 dodge kernel:  [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel:  [update_wall_time+22/64] update_wall_time+0x16/0x40
Jan 20 00:20:04 dodge kernel:  [do_timer+224/240] do_timer+0xe0/0xf0
Jan 20 00:20:04 dodge kernel:  [timer_interrupt+56/240] timer_interrupt+0x38/0xf0
Jan 20 00:20:04 dodge kernel:  [handle_IRQ_event+73/128] handle_IRQ_event+0x49/0x80
Jan 20 00:20:04 dodge kernel:  [do_IRQ+140/240] do_IRQ+0x8c/0xf0
Jan 20 00:20:04 dodge kernel:  [error_code+45/56] error_code+0x2d/0x38
Jan 20 00:20:04 dodge kernel:  [free_pages_bulk+482/512] free_pages_bulk+0x1e2/0x200
Jan 20 00:20:04 dodge kernel:  [free_hot_cold_page+217/240] free_hot_cold_page+0xd9/0xf0
Jan 20 00:20:04 dodge kernel:  [do_generic_mapping_read+714/1008] do_generic_mapping_read+0x2ca/0x3f0
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [__generic_file_aio_read+454/512] __generic_file_aio_read+0x1c6/0x200
Jan 20 00:20:04 dodge kernel:  [file_read_actor+0/256] file_read_actor+0x0/0x100
Jan 20 00:20:04 dodge kernel:  [generic_file_aio_read+91/128] generic_file_aio_read+0x5b/0x80
Jan 20 00:20:04 dodge kernel:  [do_sync_read+137/192] do_sync_read+0x89/0xc0
Jan 20 00:20:04 dodge kernel:  [do_page_fault+300/1328] do_page_fault+0x12c/0x530
Jan 20 00:20:04 dodge kernel:  [do_brk+324/560] do_brk+0x144/0x230
Jan 20 00:20:04 dodge kernel:  [vfs_read+184/304] vfs_read+0xb8/0x130
Jan 20 00:20:04 dodge kernel:  [sys_read+66/112] sys_read+0x42/0x70
Jan 20 00:20:04 dodge kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 12:31 ` Stian Jordet
@ 2004-03-09 19:22 ` Marcelo Tosatti
  2004-03-09 22:28 ` Stian Jordet
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2004-03-09 19:22 UTC
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

On Tue, 2 Mar 2004, Stian Jordet wrote:

> Btw, here is one of the 2.6.x oopses as well (as you requested).

Stian,

This sounds like bad hardware. Did I already ask you to try memtest86?
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-09 19:22 ` Marcelo Tosatti
@ 2004-03-09 22:28 ` Stian Jordet
  0 siblings, 0 replies; 13+ messages in thread
From: Stian Jordet @ 2004-03-09 22:28 UTC
  To: Marcelo Tosatti; +Cc: Linux Kernel Mailing List

Quoting Marcelo Tosatti <marcelo.tosatti@cyclades.com>:

> On Tue, 2 Mar 2004, Stian Jordet wrote:
>
> > Btw, here is one of the 2.6.x oopses as well (as you requested).
>
> Stian,
>
> This sounds like bad hardware. Did I already ask you to try memtest86?

Yup, and I think I wrote that I had it running for almost two days with
no errors. Oh well. Thanks for looking into this :) I guess I'll try to
afford a new server (very) soon.

Best regards,
Stian
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-03-02 11:03 ` Stian Jordet
  2004-03-02 12:31 ` Stian Jordet
@ 2004-06-14 17:07 ` Steven Dake
  2004-06-14 18:26 ` Chris Shoemaker
  2004-06-15 13:16 ` Marcelo Tosatti
  1 sibling, 2 replies; 13+ messages in thread
From: Steven Dake @ 2004-06-14 17:07 UTC
  To: Stian Jordet; +Cc: Marcelo Tosatti, Linux Kernel Mailing List

Marcelo and Stian,

I have also seen this oops relating to low memory situations. I think
ext3 allocates some data, has a null return, sets something to null, and
then later it is dereferenced in kswapd.

Anyone have a patch for this problem?

Thanks
-steve

On Tue, 2004-03-02 at 04:03, Stian Jordet wrote:
> Fri, 06.02.2004 at 00:51, Marcelo Tosatti wrote:
> > On Tue, 3 Feb 2004, Stian Jordet wrote:
> >
> > > Hello,
> > >
> > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days
> > > each, without problems. After an upgrade to 2.4.22, the box hasn't been
> > > up for 30 days in a row. This happened in early November. I have
> > > captured oopses with both 2.4.23 and 2.6.1, which I have sent decoded to
> > > the list, but I never got any reply.
> > >
> > > I have run memtest86 on the box, no errors. What else can be the
> > > problem? I could of course go back to 2.4.19, which I know worked fine,
> > > but some security holes have been fixed since then...
> > >
> > > Any thoughts?
> >
> > Stian,
> >
> > I have seen your 2.4.x oopses and they seemed odd.
The faults were > > happening in different functions (mostly inside VM "freeing" , due to > > what seems to be random crap in memory: > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021 > > c0132e86 > > *pde = 00000000 > > > > eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012 > > esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4 > > > > >>EIP; c0132e86 <sync_page_buffers+e/a4> <===== > > > > >>edi; c17e38c0 <_end+14b5844/bd23f84> > > >>ebp; c1047a00 <_end+d19984/bd23f84> > > >>esp; c86cbdb4 <_end+839dd38/bd23f84> > > > > Trace; c0132fdc <try_to_free_buffers+c0/ec> > > > > Code; c0132e86 <sync_page_buffers+e/a4> > > 00000000 <_EIP>: > > Code; c0132e86 <sync_page_buffers+e/a4> <===== > > 0: f6 43 18 06 testb $0x6,0x18(%ebx) <===== > > Code; c0132e8a <sync_page_buffers+12/a4> > > 4: 74 7c je 82 <_EIP+0x82> c0132f08 > > <sync_page_buffers+90/a4> > > Code; c0132e8c <sync_page_buffers+14/a4> > > 6: b8 07 00 00 00 mov $0x7,%eax > > Code; c0132e91 <sync_page_buffers+19/a4> > > > > > > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address > > 00000028 > > c015e3a2 > > *pde = 00000000 > > Oops: 0000 > > CPU: 0 > > EIP: 0010:[<c015e3a2>] Not tainted > > EFLAGS: 00010203 > > > > eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000 > > > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> > > 00000000 <_EIP>: > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <===== > > 0: 8b 5b 28 mov 0x28(%ebx),%ebx <===== > > Code; c015e3a5 <journal_try_to_free_buffers+5d/98> > > 3: f6 42 19 04 testb $0x4,0x19(%edx) > > Code; c015e3a9 <journal_try_to_free_buffers+61/98> > > 7: 74 17 je 20 <_EIP+0x20> c015e3c2 > > <journal_try_to_free_buffers+7a/98> > > > > And other similar oopses. > > > > Are you sure there is nothing messing up the hardware ? > > > > How long have you ran memtest86? It can, sometimes, take a long to showup > > errors. 
> > > > The 2.6.x oopses on the same hardware is also a useful source of > > information. > > Marcelo, > > sorry for getting back to you so insanely late. This was a production > server, and I have now moved the services it were running to another > box, so I could run a more exhaustive memtest86. It has now ran for two > days, without any errors. Of course there could be other flaky hardware, > but since I don't know any way to test it, and the oops occurs with > two-four weeks interval, it's quite time consuming to find out. I'm not > even sure if it will oops without the typical load it used to have. > > Anyway, thank you very much for at least answering me. Much appreciated > :) > > Best regards, > Stian > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07 ` Steven Dake
@ 2004-06-14 18:26 ` Chris Shoemaker
  2004-06-15 13:16 ` Marcelo Tosatti
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Shoemaker @ 2004-06-14 18:26 UTC
  To: Steven Dake; +Cc: Stian Jordet, Marcelo Tosatti, Linux Kernel Mailing List

Marcelo, Stian, Steve,

I have also seen this, more than a dozen times -- always related to
writes to my ext3 partition concurrent with heavy swapping due to memory
pressure. My oopses are usually not NULL dereferences, though, but
simply bad addresses. In the two or three cases where I've actually
traced the oops back to the code (search the April archives for my
name), it's been a corrupted pointer, always in the high part of the
word, but not always the same part.

For weeks, I thought that it was flaky RAM, but since days of memtest86
didn't fail, I went back to trying to strip more stuff out of the
kernel. Since the last cut, I haven't had a single oops, for approx. 3
weeks of uptime. The previous pattern was 1 to 3 days between oopses.

I didn't really learn much that would help you. However, if you are
trying to reproduce the failure more rapidly, my experience would
suggest that it is necessary to run a memory hog concurrently with an
I/O process (e.g. a kernel compile). Either one alone may run for days
with no failure.

Let me know if I can help.

-Chris

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
>
> I have also seen this oops relating to low memory situations. I think
> ext3 allocates some data, has a null return, sets something to null, and
> then later it is dereferenced in kswapd.
>
> Anyone have a patch for this problem?
>
> Thanks
> -steve
>
> On Tue, 2004-03-02 at 04:03, Stian Jordet wrote:
> > fre, 06.02.2004 kl.
00.51 skrev Marcelo Tosatti: > > > On Tue, 3 Feb 2004, Stian Jordet wrote: > > > > > > > Hello, > > > > > > > > I have a server which was running 2.4.18 and 2.4.19 for almost 200 days > > > > each, without problems. After an upgrade to 2.4.22, the box haven't been > > > > up for 30 days in a row. This happened early november. I have caputered > > > > oopses with both 2.4.23 and 2.6.1 which I have sent decoded to the list, > > > > but have never got any reply. > > > > > > > > I have ran memtest86 on the box, no errors. What else can be the > > > > problem? I could of course go back to 2.4.19, which I know worked fine, > > > > but I there have been some fixed security holes since then... > > > > > > > > Any thoughts? > > > > > > Stian, > > > > > > I have seen your 2.4.x oopses and they seemed odd. The faults were > > > happening in different functions (mostly inside VM "freeing" , due to > > > what seems to be random crap in memory: > > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address 00000021 > > > c0132e86 > > > *pde = 00000000 > > > > > > eax: 00000000 ebx: 00000009 ecx: 000001d2 edx: 00000012 > > > esi: 00000000 edi: c17e38c0 ebp: c1047a00 esp: c86cbdb4 > > > > > > >>EIP; c0132e86 <sync_page_buffers+e/a4> <===== > > > > > > >>edi; c17e38c0 <_end+14b5844/bd23f84> > > > >>ebp; c1047a00 <_end+d19984/bd23f84> > > > >>esp; c86cbdb4 <_end+839dd38/bd23f84> > > > > > > Trace; c0132fdc <try_to_free_buffers+c0/ec> > > > > > > Code; c0132e86 <sync_page_buffers+e/a4> > > > 00000000 <_EIP>: > > > Code; c0132e86 <sync_page_buffers+e/a4> <===== > > > 0: f6 43 18 06 testb $0x6,0x18(%ebx) <===== > > > Code; c0132e8a <sync_page_buffers+12/a4> > > > 4: 74 7c je 82 <_EIP+0x82> c0132f08 > > > <sync_page_buffers+90/a4> > > > Code; c0132e8c <sync_page_buffers+14/a4> > > > 6: b8 07 00 00 00 mov $0x7,%eax > > > Code; c0132e91 <sync_page_buffers+19/a4> > > > > > > > > > > > > > > > <1>Unable to handle kernel NULL pointer dereference at virtual address > > > 
00000028 > > > c015e3a2 > > > *pde = 00000000 > > > Oops: 0000 > > > CPU: 0 > > > EIP: 0010:[<c015e3a2>] Not tainted > > > EFLAGS: 00010203 > > > > > > eax: 0100004d ebx: 00000000 ecx: 000001d2 edx: 00000000 > > > > > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> > > > 00000000 <_EIP>: > > > Code; c015e3a2 <journal_try_to_free_buffers+5a/98> <===== > > > 0: 8b 5b 28 mov 0x28(%ebx),%ebx <===== > > > Code; c015e3a5 <journal_try_to_free_buffers+5d/98> > > > 3: f6 42 19 04 testb $0x4,0x19(%edx) > > > Code; c015e3a9 <journal_try_to_free_buffers+61/98> > > > 7: 74 17 je 20 <_EIP+0x20> c015e3c2 > > > <journal_try_to_free_buffers+7a/98> > > > > > > And other similar oopses. > > > > > > Are you sure there is nothing messing up the hardware ? > > > > > > How long have you ran memtest86? It can, sometimes, take a long to showup > > > errors. > > > > > > The 2.6.x oopses on the same hardware is also a useful source of > > > information. > > > > Marcelo, > > > > sorry for getting back to you so insanely late. This was a production > > server, and I have now moved the services it were running to another > > box, so I could run a more exhaustive memtest86. It has now ran for two > > days, without any errors. Of course there could be other flaky hardware, > > but since I don't know any way to test it, and the oops occurs with > > two-four weeks interval, it's quite time consuming to find out. I'm not > > even sure if it will oops without the typical load it used to have. > > > > Anyway, thank you very much for at least answering me. 
Much appreciated > > :) > > > > Best regards, > > Stian > > > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 13+ messages in thread
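
As a rough illustration of the "memory hog" half of Chris's
reproduction suggestion, here is a minimal sketch in C. Everything in
it is hypothetical: the function name, the fixed 4 KiB page size, and
the idea of writing one byte per page so the allocation is actually
backed by RAM or swap. It is not the program anyone in the thread ran.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages, as on i386 */

/*
 * Allocate `mib` MiB and write one byte into every page so the memory
 * is really touched, not just reserved. Returns the buffer (the caller
 * frees it) and stores the number of pages touched in *pages_touched.
 * Returns NULL if the allocation fails.
 */
static unsigned char *hog_memory(size_t mib, size_t *pages_touched)
{
    size_t bytes = mib * 1024u * 1024u;
    unsigned char *buf = malloc(bytes);
    size_t off;

    *pages_touched = 0;
    if (buf == NULL)
        return NULL;

    for (off = 0; off < bytes; off += ASSUMED_PAGE_SIZE) {
        /* Store a changing value so each page gets dirtied. */
        buf[off] = (unsigned char)(off / ASSUMED_PAGE_SIZE);
        (*pages_touched)++;
    }
    return buf;
}
```

To follow the suggestion, one would call hog_memory() with a size above
the free RAM of a test box, keep the buffer alive, and run an
I/O-heavy job such as a kernel compile on the ext3 partition at the
same time. Only try this on a machine you can afford to crash.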
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-14 17:07 ` Steven Dake
  2004-06-14 18:26 ` Chris Shoemaker
@ 2004-06-15 13:16 ` Marcelo Tosatti
  2004-06-15 14:35 ` Stian Jordet
  2004-06-15 17:56 ` Steven Dake
  1 sibling, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2004-06-15 13:16 UTC
  To: Steven Dake; +Cc: Stian Jordet, Linux Kernel Mailing List

On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> Marcelo and Stian,
>
> I have also seen this oops relating to low memory situations. I think
> ext3 allocates some data, has a null return, sets something to null, and
> then later it is dereferenced in kswapd.
>
> Anyone have a patch for this problem?

Steven,

From what I remember, Stian's oopses were happening in random places in
the VM freeing routines. That makes me believe what he was seeing was
some kind of hardware issue, because otherwise (if it were a software
bug) the oopses would be happening in the same place. The codepaths
which he saw trying to access invalid addresses are executed flawlessly
by all 2.4.x mainline users. He was also seeing oopses with v2.6.

Assuming his HW is not faulty, I can think of some driver corrupting his
memory.

Do you have any traces of the oopses you are seeing?

Stian, you told us you switched servers now, so I assume the problem is
gone? Are you still running v2.4 on that server?

Thanks!
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16 ` Marcelo Tosatti
@ 2004-06-15 14:35 ` Stian Jordet
  2004-06-15 17:56 ` Steven Dake
  1 sibling, 0 replies; 13+ messages in thread
From: Stian Jordet @ 2004-06-15 14:35 UTC
  To: Marcelo Tosatti; +Cc: Steven Dake, Linux Kernel Mailing List

Tue, 15.06.2004 at 10:16 -0300, Marcelo Tosatti wrote:
> Stian, you told us you switched servers now, so I assume the problem is
> gone? Are you still running v2.4 on that server?

I switched servers, and the oops is gone. The old server is still
running, but without the nightly memory-intensive Perl script, and I
have never seen an oops since I stopped running that. Both servers are
running 2.6 now.

Best regards,
Stian
* Re: Oopses with both recent 2.4.x kernels and 2.6.x kernels
  2004-06-15 13:16 ` Marcelo Tosatti
  2004-06-15 14:35 ` Stian Jordet
@ 2004-06-15 17:56 ` Steven Dake
  1 sibling, 0 replies; 13+ messages in thread
From: Steven Dake @ 2004-06-15 17:56 UTC
  To: Marcelo Tosatti; +Cc: Stian Jordet, Linux Kernel Mailing List

On Tue, 2004-06-15 at 06:16, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2004 at 10:07:05AM -0700, Steven Dake wrote:
> > Marcelo and Stian,
> >
> > I have also seen this oops relating to low memory situations. I think
> > ext3 allocates some data, has a null return, sets something to null, and
> > then later it is dereferenced in kswapd.
> >
> > Anyone have a patch for this problem?
>
> Steven,
>
> From what I remember, Stian's oopses were happening in random places in
> the VM freeing routines. That makes me believe what he was seeing was
> some kind of hardware issue, because otherwise (if it were a software
> bug) the oopses would be happening in the same place. The codepaths
> which he saw trying to access invalid addresses are executed flawlessly
> by all 2.4.x mainline users. He was also seeing oopses with v2.6.
>
> Assuming his HW is not faulty, I can think of some driver corrupting his
> memory.
>
> Do you have any traces of the oopses you are seeing?
>
> Stian, you told us you switched servers now, so I assume the problem is
> gone? Are you still running v2.4 on that server?

Marcelo,

Stian responded saying he upgraded to 2.6 and also removed the
memory-intensive script, and his problems went away. I suspect removing
the memory-intensive script did the trick. 2.6 could also be fixed, who
knows?

After reading lkml, there are about 4-5 oopses in the same function at
the same location. In some of the bug reports, people say they were
heavily using memory (a low memory situation).

I have tracked down the problem to a null dereference during a buffer
cache rebalance (which occurs during low memory situations). Here is the
info I have. Unfortunately I don't know much about how the VM handles
low memory situations with the VFS, so if you have any ideas, they would
be helpful. :)

Unable to handle kernel NULL pointer dereference at virtual address 00000028
<4> printing eip:
<4>c018aa67
<1>*pde = 00000000
<4>Oops: 0000
<4>CPU:    2
<4>EIP:    0010:[<c018aa67>]    Not tainted
<4>EFLAGS: 00010203
<4>eax: 00000000   ebx: 00000000   ecx: c0380490   edx: c217a540
<4>esi: f2757a80   edi: 00000001   ebp: 00000000   esp: f7bd1ee8
<4>ds: 0018   es: 0018   ss: 0018
<4>Process kswapd (pid: 11, stackpage=f7bd1000)
<4>Stack: 00000000 00000000 c217a600 00000000 00000202 00000000 c217a540 000001d0
<4>       c217a540 000102ec c018144f f7abe400 c217a540 000001d0 c0155006 c217a540
<4>       000001d0 f7bd0000 00000000 c0147f0a c217a540 000001d0 f7bd0000 f7bd0000
<4>Call Trace: [<c018144f>] [<c0155006>] [<c0147f0a>] [<c014832b>] [<c0148398>]
<4>   [<c0148431>] [<c014847f>] [<c0148593>] [<c0105000>] [<c010578a>] [<c01484f8>]
<4>
<4>Code: 8b 5b 28 f6 40 19 02 75 47 39 f3 75 f1 c6 05 80 24 38 c0 01

>>EIP; c018aa67 <journal_try_to_free_buffers+45/f4>   <=====
Trace; c018144f <ext3_releasepage+2d/32>
Trace; c0155006 <try_to_release_page+4e/78>
Trace; c0147f0a <shrink_cache+270/4f6>
Trace; c014832b <shrink_caches+61/98>
Trace; c0148398 <try_to_free_pages+36/54>
Trace; c0148431 <kswapd_balance_pgdat+51/8a>
Trace; c014847f <kswapd_balance+15/2c>
Trace; c0148593 <kswapd+9b/b6>
Trace; c0105000 <_stext+0/0>
Trace; c010578a <kernel_thread+2e/40>
Trace; c01484f8 <kswapd+0/b6>

The problem is that in this function:

int journal_try_to_free_buffers(journal_t *journal,
                                struct page *page, int gfp_mask)
{
        struct buffer_head *bh;
        struct buffer_head *tmp;
        int locked_or_dirty = 0;
        int call_ttfb = 1;

        J_ASSERT(PageLocked(page));

        bh = page->buffers;
        tmp = bh;
        spin_lock(&journal_datalist_lock);
        do {
                struct buffer_head *p = tmp;

                tmp = tmp->b_this_page;
...

tmp (page->buffers) above is null. b_this_page is at offset 0x28 (the
accessed address in the oops). This means that page->buffers is set to
null by some other routine, which results in the oops.

I read the page allocate code
(ext3_read_page->block_read_full_page->create_empty_buffers->create_buffers),
and it appears that it is not possible to allocate a page->buffers value
of zero in the allocate function. I am having difficulty reproducing
this and cannot debug further, however.

Can page->buffers be set to zero somewhere else? Perhaps kswapd and some
other thread are racing on the free?

Thanks
-steve

> Thanks!
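
To make Steven's "offset 0x28" observation concrete, here is a small
hedged sketch. The struct is a mock laid out with fixed-width types to
imitate a 32-bit build; apart from b_this_page, its field names and
order are an assumption made for illustration and should be checked
against the real 2.4 struct buffer_head before relying on them. The
helper name is also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Mock of the leading fields of a 32-bit buffer_head.
 * Assumption: this layout is illustrative only; it is padded so that
 * b_this_page lands at 0x28, the offset quoted in the message above.
 */
struct mock_buffer_head {
    uint32_t b_next;       /* 0x00 */
    uint32_t b_blocknr;    /* 0x04 */
    uint16_t b_size;       /* 0x08 */
    uint16_t b_list;       /* 0x0a */
    uint16_t b_dev;        /* 0x0c, followed by 2 bytes of padding */
    int32_t  b_count;      /* 0x10 */
    uint16_t b_rdev;       /* 0x14, followed by 2 bytes of padding */
    uint32_t b_state;      /* 0x18 */
    uint32_t b_flushtime;  /* 0x1c */
    uint32_t b_next_free;  /* 0x20 */
    uint32_t b_prev_free;  /* 0x24 */
    uint32_t b_this_page;  /* 0x28 */
};

/*
 * Hypothetical helper: the address touched when a field at
 * `field_offset` is read through a struct pointer whose value is
 * `base` (0 for a NULL pointer).
 */
static size_t faulting_address(size_t base, size_t field_offset)
{
    return base + field_offset;
}
```

With tmp == NULL, tmp->b_this_page reads from
faulting_address(0, offsetof(struct mock_buffer_head, b_this_page)),
which is 0x28 under this mock layout and matches the address in the
oops. That supports the reading that the pointer itself was NULL, but
it says nothing about which code path cleared page->buffers.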
end of thread, other threads:[~2004-06-15 19:15 UTC | newest]

Thread overview: 13+ messages
2004-06-15 17:44 Oopses with both recent 2.4.x kernels and 2.6.x kernels Nick Warne
2004-06-15 19:15 ` Stian Jordet
-- strict thread matches above, loose matches on Subject: below --
2004-02-03 18:26 Stian Jordet
2004-02-05 23:51 ` Marcelo Tosatti
2004-03-02 11:03   ` Stian Jordet
2004-03-02 12:31     ` Stian Jordet
2004-03-09 19:22       ` Marcelo Tosatti
2004-03-09 22:28         ` Stian Jordet
2004-06-14 17:07     ` Steven Dake
2004-06-14 18:26       ` Chris Shoemaker
2004-06-15 13:16       ` Marcelo Tosatti
2004-06-15 14:35         ` Stian Jordet
2004-06-15 17:56         ` Steven Dake