* msync() oops in 2.6.7-rc2-bk1
From: Miquel van Smoorenburg @ 2004-06-04 14:09 UTC
  To: linux-kernel

I'm running a news server. The innd process uses mmap()s for several
files and uses msync() to force synchronization to disk every so
often. Suddenly, an msync() causes an oops (and innd SEGVs). This
is after the box has been up and running for 3 days:

# uname -a
Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux

 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
c0149120
*pde = 00000000
Oops: 0002 [#5]
Modules linked in: e100 mii
CPU:    0
EIP:    0060:[<c0149120>]    Not tainted
EFLAGS: 00010213   (2.6.7-rc2-bk1)
EIP is at __set_page_dirty_buffers+0x20/0xb0
eax: 00000000   ebx: f77a1e7c   ecx: c15706e0   edx: eba5a83c
esi: 5ccfb000   edi: 00000000   ebp: 5d000000   esp: f44b5efc
ds: 007b   es: 007b   ss: 0068
Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
       c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
       00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
Call Trace:
 [<c013143e>] set_page_dirty+0x3e/0x50
 [<c013da1c>] filemap_sync_pte+0x5c/0x80
 [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
 [<c013dbc3>] filemap_sync+0x93/0x100
 [<c013dc96>] msync_interval+0x66/0xf0
 [<c013de37>] sys_msync+0x117/0x123
 [<c0103c7b>] syscall_call+0x7/0xb

Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75
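
For reading the dump above: on i386 the number after "Oops:" is the
page-fault error code, and 0002 has only the write bit set, i.e. a
kernel-mode write to a not-present page. A minimal sketch of that decoding
follows; the macro and function names are hypothetical and are not taken
from the kernel source.

```c
#include <assert.h>
#include <stdio.h>

/* Low three bits of the i386 page-fault error code printed after "Oops:". */
#define OOPS_PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define OOPS_PF_WRITE 0x2u  /* 0: read access,      1: write access         */
#define OOPS_PF_USER  0x4u  /* 0: kernel mode,      1: user mode            */

/* Hypothetical helpers, named for this sketch only. */
static int oops_is_protection(unsigned int code) { return (code & OOPS_PF_PROT) != 0; }
static int oops_is_write(unsigned int code)      { return (code & OOPS_PF_WRITE) != 0; }
static int oops_is_user(unsigned int code)       { return (code & OOPS_PF_USER) != 0; }

/* Print a one-line description of an error code to the given stream. */
static void oops_describe(unsigned int code, FILE *out)
{
    assert(out != NULL);
    fprintf(out, "%s %s page in %s mode\n",
            oops_is_write(code) ? "write to a" : "read from a",
            oops_is_protection(code) ? "protected" : "not-present",
            oops_is_user(code) ? "user" : "kernel");
}
```

Calling oops_describe(0x0002, stdout) describes the fault reported here.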

Mike.



* Re: msync() oops in 2.6.7-rc2-bk1
From: Andrew Morton @ 2004-06-05  4:33 UTC
  To: Miquel van Smoorenburg; +Cc: linux-kernel

"Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
>
> I'm running a news server. The innd process uses mmap()s for several
> files and uses msync() to force synchronization to disk every so
> often. Suddenly, an msync() causes an oops (and innd SEGVs). This
> is after the box has been up and running for 3 days:
> 
> # uname -a
> Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux
> 
>  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
> c0149120
> *pde = 00000000
> Oops: 0002 [#5]
> Modules linked in: e100 mii
> CPU:    0
> EIP:    0060:[<c0149120>]    Not tainted
> EFLAGS: 00010213   (2.6.7-rc2-bk1)
> EIP is at __set_page_dirty_buffers+0x20/0xb0
> eax: 00000000   ebx: f77a1e7c   ecx: c15706e0   edx: eba5a83c
> esi: 5ccfb000   edi: 00000000   ebp: 5d000000   esp: f44b5efc
> ds: 007b   es: 007b   ss: 0068
> Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
> Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
>        c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
>        00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
> Call Trace:
>  [<c013143e>] set_page_dirty+0x3e/0x50
>  [<c013da1c>] filemap_sync_pte+0x5c/0x80
>  [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
>  [<c013dbc3>] filemap_sync+0x93/0x100
>  [<c013dc96>] msync_interval+0x66/0xf0
>  [<c013de37>] sys_msync+0x117/0x123
>  [<c0103c7b>] syscall_call+0x7/0xb
> 
> Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75

You have a page which has PG_private set, but page->private is NULL.  And
the machine is non-SMP, non-preempt, yes?

I'd be wondering whether that machine has flipped a bit in page->flags,
frankly.  How old is it?
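
The inconsistency described above can be modelled in user space. This is
only a sketch: struct model_page, the bit number and the helper names are
assumptions made for illustration and are not the kernel's definitions. It
shows how a single flipped flag bit leaves a page that claims to have
private data while its pointer is still NULL.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Illustrative bit number only; not taken from the kernel headers. */
#define MODEL_PG_PRIVATE 12u

/* Simplified stand-in for the two struct page fields under discussion. */
struct model_page {
    unsigned long flags;
    void *private;  /* would point at the page's buffers when the flag is set */
};

static int model_page_has_private(const struct model_page *p)
{
    return (int)((p->flags >> MODEL_PG_PRIVATE) & 1UL);
}

/* The invariant: the flag is set exactly when the pointer is non-NULL. */
static int model_page_is_consistent(const struct model_page *p)
{
    return model_page_has_private(p) == (p->private != NULL);
}

/* Simulate a single-bit memory error in the flags word. */
static void model_flip_flag_bit(struct model_page *p, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    p->flags ^= 1UL << bit;
}
```

Code that trusts the flag would go on to follow the NULL pointer, which is
the failure pattern suggested above.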

Was the mmap of a regular file or of a block device?


* Re: msync() oops in 2.6.7-rc2-bk1
From: Miquel van Smoorenburg @ 2004-06-05 10:35 UTC
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, linux-kernel

On Sat, 05 Jun 2004 06:33:15, Andrew Morton wrote:
> "Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
> >
> > I'm running a news server. The innd process uses mmap()s for several
> > files and uses msync() to force synchronization to disk every so
> > often. Suddenly, an msync() causes an oops (and innd SEGVs). This
> > is after the box has been up and running for 3 days:
> > 
> > # uname -a
> > Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux
> > 
> >  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
> > c0149120
> > *pde = 00000000
> > Oops: 0002 [#5]
> > Modules linked in: e100 mii
> > CPU:    0
> > EIP:    0060:[<c0149120>]    Not tainted
> > EFLAGS: 00010213   (2.6.7-rc2-bk1)
> > EIP is at __set_page_dirty_buffers+0x20/0xb0
> > eax: 00000000   ebx: f77a1e7c   ecx: c15706e0   edx: eba5a83c
> > esi: 5ccfb000   edi: 00000000   ebp: 5d000000   esp: f44b5efc
> > ds: 007b   es: 007b   ss: 0068
> > Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
> > Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
> >        c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
> >        00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
> > Call Trace:
> >  [<c013143e>] set_page_dirty+0x3e/0x50
> >  [<c013da1c>] filemap_sync_pte+0x5c/0x80
> >  [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
> >  [<c013dbc3>] filemap_sync+0x93/0x100
> >  [<c013dc96>] msync_interval+0x66/0xf0
> >  [<c013de37>] sys_msync+0x117/0x123
> >  [<c0103c7b>] syscall_call+0x7/0xb
> > 
> > Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75
> 
> You have a page which has PG_private set, but page->private is NULL.  And
> the machine is non-SMP, non-preempt, yes?

Indeed.

> I'd be wondering whether that machine has flipped a bit in page->flags,
> frankly.  How old is it?

It's an Athlon XP 2000+ from September 2002, 1 GB RAM.

> Was the mmap of a regular file or of a block device?

I'm not sure. My colleague was stracing the process when it happened
for the 3rd time but didn't save the output to a file.

I now see that just after 2.6.7-rc2-bk was booted it oopsed too,
with a different oops. Perhaps that corrupted something.

Oops: 0002 [#1]
Modules linked in: e100 mii
CPU:    0
EIP:    0060:[drop_buffers+84/144]    Not tainted
EFLAGS: 00010207   (2.6.7-rc2-bk1)
EIP is at drop_buffers+0x54/0x90
eax: b552e000   ebx: eba5a834   ecx: eba5a80c   edx: 01000406
esi: eba5addc   edi: eba5a26c   ebp: c15af680   esp: c198cd18
ds: 007b   es: 007b   ss: 0068
Process kswapd0 (pid: 9, threadinfo=c198c000 task=c19a65d0)
Stack: 00000001 c15af680 c15af680 00000000 c198ce1c c014b917 c15af680 c198cd38
       00000000 f7bff63c c15af680 c01361e3 c15af680 000000d0 c198c000 00000001
       0000001b 00000000 c198cd60 c198cd60 00000000 00000000 00000000 00000020
Call Trace:
 [try_to_free_buffers+55/144] try_to_free_buffers+0x37/0x90
 [shrink_list+835/1072] shrink_list+0x343/0x430
 [shrink_cache+331/768] shrink_cache+0x14b/0x300
 [balance_pgdat+462/608] balance_pgdat+0x1ce/0x260
 [kswapd+279/304] kswapd+0x117/0x130
 [autoremove_wake_function+0/96] autoremove_wake_function+0x0/0x60
 [autoremove_wake_function+0/96] autoremove_wake_function+0x0/0x60
 [kswapd+0/304] kswapd+0x0/0x130
 [kernel_thread_helper+5/20] kernel_thread_helper+0x5/0x14
Code: 89 42 04 89 10 89 5b 04 89 59 28 39 fe 89 f1 75 df 8b 44 24

But it might indeed be a case of flakey hardware. Re-examining the
history of this machine I see that it has spontaneously rebooted
before. What I'll do is this - it's now running 2.6.6 vanilla. I'll
see what it does over the weekend. After the weekend I'll boot
2.6.7-rcX-bkY and see what it does with that. If I can reproduce
this problem reliably or can find a pattern in it myself, I'll
let you know.

I think I'll boot 2.6.7-rcX-bkY at the same time on a known-good
machine that has a comparable load.

Mike.


* Re: msync() oops in 2.6.7-rc2-bk1
From: Bill Davidsen @ 2004-06-08 17:34 UTC
  To: linux-kernel

Andrew Morton wrote:
> "Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
> 
>>I'm running a news server. The innd process uses mmap()s for several
>>files and uses msync() to force synchronization to disk every so
>>often. Suddenly, an msync() causes an oops (and innd SEGVs). This
>>is after the box has been up and running for 3 days:
>>
>># uname -a
>>Linux enterprise 2.6.7-rc2-bk1 #1 Mon May 31 15:03:52 CEST 2004 i686 GNU/Linux
>>
>> <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip:
>>c0149120
>>*pde = 00000000
>>Oops: 0002 [#5]
>>Modules linked in: e100 mii
>>CPU:    0
>>EIP:    0060:[<c0149120>]    Not tainted
>>EFLAGS: 00010213   (2.6.7-rc2-bk1)
>>EIP is at __set_page_dirty_buffers+0x20/0xb0
>>eax: 00000000   ebx: f77a1e7c   ecx: c15706e0   edx: eba5a83c
>>esi: 5ccfb000   edi: 00000000   ebp: 5d000000   esp: f44b5efc
>>ds: 007b   es: 007b   ss: 0068
>>Process innd (pid: 10936, threadinfo=f44b5000 task=d0eb2c70)
>>Stack: f77a1de4 00000004 00000000 c4af23ec c013143e c15706e0 c013da1c 5ccfb000
>>       c4af23f0 c013db0f c4af23ec f7101900 5ccfb000 00000001 5cc00000 ce8b25d0
>>       00000000 5d000000 c013dbc3 ce8b25cc 5cc00000 5d000000 f7101900 00000001
>>Call Trace:
>> [<c013143e>] set_page_dirty+0x3e/0x50
>> [<c013da1c>] filemap_sync_pte+0x5c/0x80
>> [<c013db0f>] filemap_sync_pte_range+0xcf/0xf0
>> [<c013dbc3>] filemap_sync+0x93/0x100
>> [<c013dc96>] msync_interval+0x66/0xf0
>> [<c013de37>] sys_msync+0x117/0x123
>> [<c0103c7b>] syscall_call+0x7/0xb
>>
>>Code: 0f ba 28 01 8b 40 08 39 d0 75 f5 0f ba 29 04 19 c0 85 c0 75
> 
> 
> You have a page which has PG_private set, but page->private is NULL.  And
> the machine is non-SMP, non-preempt, yes?
> 
> I'd be wondering whether that machine has flipped a bit in page->flags,
> frankly.  How old is it?
> 
> Was the mmap of a regular file or of a block device?

I don't think so... I have a number of machines, also news servers, 
which are producing various errors in filemap. In the RH2.4.21-15 kernel 
it shows up in filemap_sync_pte_range, which is now in msync. It appears 
to be a bad pte coming in, I see the error at pmd_bad sometimes.

I see it with several applications, all news, all mmap() heavily.
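
The access pattern under discussion, a shared file mapping that is modified
in memory and then flushed with msync(), looks roughly like the sketch
below. It is a generic illustration under stated assumptions, not code from
innd or any other news server; the helper name and its parameters are
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first `len` bytes of `path` as a shared
 * mapping, copy `msg` into it, and force the change to disk with msync().
 * Returns 0 on success and -1 on failure.
 */
static int write_and_sync(const char *path, size_t len, const char *msg)
{
    char *map;
    int fd, rc;

    assert(strlen(msg) < len);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;

    strcpy(map, msg);                   /* dirties the mapped page */
    rc = msync(map, len, MS_SYNC);      /* the call that ends up in sys_msync */
    munmap(map, len);
    return rc;
}
```

MS_SYNC is used here so the call returns only after the write-back has
completed; whether the applications in this thread use MS_SYNC or MS_ASYNC
is not stated.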

Sorry I can't tell you more, but it happens on multiple brand-new Xeon
systems. If he is seeing something similar on an Athlon I would assume
that it's some "less traveled way" in the kernel, because I have a
limited ability to believe in coincidence.


-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

