public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!
@ 2004-04-14  8:57 Alexander Y. Fomichev
  2004-04-14  9:54 ` Andrea Arcangeli
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Y. Fomichev @ 2004-04-14  8:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrea Arcangeli, admin

G'Day,

I've got a bug on 2.6.5-aa3 today

System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty
	heavy loaded at most time, but i've got it approximately 
	at 02:00 (MSD) when no serious load should be.
	System remained accessible all the time but operations
	with proclist from userspace (i.e. ps, w) appears to be locked. 

Here is a trace:

kernel BUG at mm/objrmap.c:137!
invalid operand: 0000 [#1]
PREEMPT SMP
CPU:    1
EIP:    0060:[<c0148ebb>]    Not tainted
EFLAGS: 00010282   (2.6.5-aa3)
EIP is at page_referenced_one+0x8c/0x96
eax: dafaaae0   ebx: dafaaae0   ecx: 00000000   edx: c19ef4a0
esi: 00000000   edi: dade48c4   ebp: c19ef4a0   esp: e9049c0c
ds: 007b   es: 007b   ss: 0068
Process apache2 (pid: 29070, threadinfo=e9048000 task=cf0e8630)
Stack: dafaaae0 f4531e30 00000000 f4531e2c c0149073 c19ef4a0 00000001 c02d2d00
       e9048000 00000166 c0149103 c19ef4a0 c013ff58 00000000 00000000 c02d2e04
       00000001 00000000 00000000 e9049d30 00000166 c1a59890 c1841cd0 e9049c68
Call Trace:
 [<c0149073>] page_referenced_anon+0x45/0x94
 [<c0149103>] page_referenced+0x41/0x4d
 [<c013ff58>] refill_inactive_zone+0x56c/0x678
 [<c0140123>] shrink_zone+0xbf/0xc1
 [<c014019d>] shrink_caches+0x78/0x7c
 [<c014023b>] try_to_free_pages+0x9a/0x141
 [<c0138f68>] __alloc_pages+0x1cb/0x30a
 [<c0144594>] do_anonymous_page+0x82/0x246
 [<c027fbe2>] inet_recvmsg+0x4a/0x69
 [<c01447b5>] do_no_page+0x5d/0x58c
 [<c0144ed8>] handle_mm_fault+0xe2/0x17b
 [<c0115752>] do_page_fault+0x2eb/0x4d2
 [<c015384d>] vfs_read+0xbc/0xf5
 [<c0115467>] do_page_fault+0x0/0x4d2
 [<c0107789>] error_code+0x2d/0x38

Code: 0f 0b 89 00 a9 20 2a c0 eb e1 55 31 ed 57 89 c7 56 53 83 ec
 <6>note: apache2[29070] exited with preempt_count 2
bad: scheduling while atomic!
Call Trace:
 [<c01182a4>] schedule+0x6a1/0x6a6
 [<c012c2be>] rcu_process_callbacks+0x13a/0x169
 [<c012c2be>] rcu_process_callbacks+0x13a/0x169
 [<c01208f8>] tasklet_action+0x5f/0xa5
 [<c01c9261>] rwsem_down_read_failed+0xa5/0x155
 [<c012f5d4>] .text.lock.futex+0x7/0xb3
 [<c0111b86>] delay_tsc+0xb/0x15
 [<c012f4a5>] do_futex+0x6b/0x6d
 [<c012f5a1>] sys_futex+0xfa/0x112
 [<c011cc90>] call_console_drivers+0x77/0xfc
 [<c011a872>] mm_release+0x96/0xa6
 [<c011ea17>] do_exit+0x84/0x4d9
 [<c029007b>] packet_ioctl+0x1f/0x192
 [<c0107de2>] do_divide_error+0x0/0xd3
 [<c010805c>] do_invalid_op+0x0/0xaa
 [<c0108104>] do_invalid_op+0xa8/0xaa
 [<c0148ebb>] page_referenced_one+0x8c/0x96
 [<c0117642>] scheduler_tick+0x3b/0x5f7
 [<c0116451>] recalc_task_prio+0x8c/0x1a6
 [<c0107789>] error_code+0x2d/0x38
 [<c014007b>] shrink_zone+0x17/0xc1
 [<c0148ebb>] page_referenced_one+0x8c/0x96
 [<c0149073>] page_referenced_anon+0x45/0x94
 [<c0149103>] page_referenced+0x41/0x4d
 [<c013ff58>] refill_inactive_zone+0x56c/0x678
 [<c0140123>] shrink_zone+0xbf/0xc1
 [<c014019d>] shrink_caches+0x78/0x7c
 [<c014023b>] try_to_free_pages+0x9a/0x141
 [<c0138f68>] __alloc_pages+0x1cb/0x30a
 [<c0144594>] do_anonymous_page+0x82/0x246
 [<c027fbe2>] inet_recvmsg+0x4a/0x69
 [<c01447b5>] do_no_page+0x5d/0x58c
 [<c0144ed8>] handle_mm_fault+0xe2/0x17b
 [<c0115752>] do_page_fault+0x2eb/0x4d2
 [<c015384d>] vfs_read+0xbc/0xf5
 [<c0115467>] do_page_fault+0x0/0x4d2
 [<c0107789>] error_code+0x2d/0x38
	
& some (probably) useful info:

cpuinfo: 	http://sysadminday.org.ru/objrmap_bug/cpuinfo
lspci:		http://sysadminday.org.ru/objrmap_bug/lspci 
lspci -vvn	http://sysadminday.org.ru/objrmap_bug/lspci-vvn
full_trace (including initial boot,SysRq-[M,P,T])
		http://sysadminday.org.ru/objrmap_bug/full_trace

-- 
Best regards.
        Alexander Y. Fomichev <gluk@php4.ru>
        Public PGP key: http://sysadminday.org.ru/gluk.asc

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!
  2004-04-14  8:57 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! Alexander Y. Fomichev
@ 2004-04-14  9:54 ` Andrea Arcangeli
  2004-04-14 11:39   ` Alexander Y. Fomichev
  0 siblings, 1 reply; 7+ messages in thread
From: Andrea Arcangeli @ 2004-04-14  9:54 UTC (permalink / raw)
  To: Alexander Y. Fomichev; +Cc: linux-kernel, admin

On Wed, Apr 14, 2004 at 12:57:16PM +0400, Alexander Y. Fomichev wrote:
> G'Day,
> 
> I've got a bug on 2.6.5-aa3 today
> 
> System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty
> 	heavy loaded at most time, but i've got it approximately 
> 	at 02:00 (MSD) when no serious load should be.
> 	System remained accessible all the time but operations
> 	with proclist from userspace (i.e. ps, w) appears to be locked. 

I don't think apache2 uses nonlinear. there was an smp race fix in
2.6.5-aa4, so you may want to try again with 2.6.5-aa5 (latest) just in
case this was mm corruption triggered by the race.

Are you using threading with apache2? Such a race could trigger only
with threads.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!
  2004-04-14  9:54 ` Andrea Arcangeli
@ 2004-04-14 11:39   ` Alexander Y. Fomichev
  2004-04-14 11:47     ` Andrea Arcangeli
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Y. Fomichev @ 2004-04-14 11:39 UTC (permalink / raw)
  To: admin; +Cc: Andrea Arcangeli, linux-kernel

On Wednesday 14 April 2004 13:54, Andrea Arcangeli wrote:
> On Wed, Apr 14, 2004 at 12:57:16PM +0400, Alexander Y. Fomichev wrote:
> > G'Day,
> >
> > I've got a bug on 2.6.5-aa3 today
> >
> > System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty   
    btw, i was wrong, this box has only 2G ECC RAM not 4. 
> > 	heavy loaded at most time, but i've got it approximately
> > 	at 02:00 (MSD) when no serious load should be.
> > 	System remained accessible all the time but operations
> > 	with proclist from userspace (i.e. ps, w) appears to be locked.
>
> I don't think apache2 uses nonlinear. there was an smp race fix in
> 2.6.5-aa4, so you may want to try again with 2.6.5-aa5 (latest) just in
> case this was mm corruption triggered by the race.
tnx, i'll try it.
btw, system running on 2.6.5-aa3 almost for a month (at 2004/04/18) and
this is first time i catch this, so 2.6.5-aa3 pretty stable for me. (except 
for trouble mentioned by Sergey Kostyliov as "2.6.X kernel memory 
leak?"/"2.6.1 IO lockup on SMP systems")

> Are you using threading with apache2? Such a race could trigger only
> with threads.

exactly right, apache2 compiled --with-mpm=worker.

-- 
Best regards.
        Alexander Y. Fomichev <gluk@php4.ru>
        Public PGP key: http://sysadminday.org.ru/gluk.asc

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!
  2004-04-14 11:39   ` Alexander Y. Fomichev
@ 2004-04-14 11:47     ` Andrea Arcangeli
  2004-04-19 12:32       ` Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) Alexander Y. Fomichev
  0 siblings, 1 reply; 7+ messages in thread
From: Andrea Arcangeli @ 2004-04-14 11:47 UTC (permalink / raw)
  To: Alexander Y. Fomichev; +Cc: admin, linux-kernel

On Wed, Apr 14, 2004 at 03:39:49PM +0400, Alexander Y. Fomichev wrote:
> exactly right, apache2 compiled --with-mpm=worker.

ok so there are good chances that 2.6.5-aa5 will fix it, if not then
please notify me again, thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!)
  2004-04-14 11:47     ` Andrea Arcangeli
@ 2004-04-19 12:32       ` Alexander Y. Fomichev
  2004-04-23  2:27         ` Andrea Arcangeli
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Y. Fomichev @ 2004-04-19 12:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: admin, linux-kernel

On Wednesday 14 April 2004 15:47, Andrea Arcangeli wrote:
> ok so there are good chances that 2.6.5-aa5 will fix it, if not then
> please notify me again, thanks.

  I've noticed a behaviour today very similar to mentioned in previous report
  (on 14   Apr 2004) except for message from BUG() (kernel 2.6.5-aa5) i.e.
  system remained accessible but procps(ps,pkill) appears to be locked and
  Sysrq-T is similar to previous one - some of apache2 & all of procps 
  have been blocked.

  Here is a traces:
  
  http://sysadminday.org.ru/2.6.5-ps-lockup/sysrq-M
  http://sysadminday.org.ru/2.6.5-ps-lockup/full_trace
 
-- 
Best regards.
        Alexander Y. Fomichev <gluk@php4.ru>
        Public PGP key: http://sysadminday.org.ru/gluk.asc

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!)
  2004-04-19 12:32       ` Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) Alexander Y. Fomichev
@ 2004-04-23  2:27         ` Andrea Arcangeli
  2004-04-23  9:24           ` Alexander Y. Fomichev
  0 siblings, 1 reply; 7+ messages in thread
From: Andrea Arcangeli @ 2004-04-23  2:27 UTC (permalink / raw)
  To: Alexander Y. Fomichev; +Cc: admin, linux-kernel

On Mon, Apr 19, 2004 at 04:32:53PM +0400, Alexander Y. Fomichev wrote:
> On Wednesday 14 April 2004 15:47, Andrea Arcangeli wrote:
> > ok so there are good chances that 2.6.5-aa5 will fix it, if not then
> > please notify me again, thanks.
> 
>   I've noticed a behaviour today very similar to mentioned in previous report
>   (on 14   Apr 2004) except for message from BUG() (kernel 2.6.5-aa5) i.e.
>   system remained accessible but procps(ps,pkill) appears to be locked and
>   Sysrq-T is similar to previous one - some of apache2 & all of procps 
>   have been blocked.
> 
>   Here is a traces:
>   
>   http://sysadminday.org.ru/2.6.5-ps-lockup/sysrq-M
>   http://sysadminday.org.ru/2.6.5-ps-lockup/full_trace

following your trace I got to some futex issue (likely invoked by NPTL),
and following the futex code I noticed a race with threads in
expand_stack and likely the futex is effectively calling expand_stack
too. Then I found a race in expand_stack with threads.

here the details:

in 2.6-aa expand_stack is moving down vm_start and vm_end with only the
read semaphore held. this means this thing can even run in parallel in
both cpus, the latter will corrupt vm_pgoff which generate malfunction
with anon-vma:

        vma->vm_start = address;
        vma->vm_pgoff -= grow;

this isn't an issue for the file mappings because only anon vmas can be
growsdown (the filemappings could never work right in filemap_nopage if
expand_stack would mess with pgoff and vm_start without holding the
mmap_sem in write mode).

serializing this with anon-vma is easy and scalable with the
anon_vma_lock (not an mm-wide lock like the page_table_lock).
This will also serialize perfectly with the objrmap. objrmap should be
the only thing that cares about vm_pgoff and vm_start being coherent at
the same time for anon-mappings in anon-vma, and the anon_vma_lock
provides perfect locking for that.

One of the scalability and simplicity locking beauty of anon-vma is the
total avoidance of page_table_lock for vma merging and all other vma
operations in general, and true usage of page_table_lock only for the
pagetables (and future usage of vma->page_table_lock only for pagetables
too). I wouldn't really like to giveup on this and to reintroduce the
whole mess of page_table_lock in the vma merging and in expand stack
that all other implementations like 2.6 mainline and anonmm are
suffering from, and I hope my fix will be enough. Could you test if this
patch helps?

(there was also an ancient page_table_lock in split_vma, I forgot to
delete that previously)

--- 2.6.5-aa3/mm/mmap.c.~1~	2004-04-14 11:51:33.000000000 +0200
+++ 2.6.5-aa3/mm/mmap.c	2004-04-23 04:04:31.258066752 +0200
@@ -977,6 +977,15 @@ int expand_stack(struct vm_area_struct *
 		return -EFAULT;
 
 	/*
+	 * We must make sure the anon-vma is allocated
+	 * to make sure the anon-vma locking is not a noop.
+	 */
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+
+	anon_vma_lock(vma);
+
+	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
 	 * is required to hold the mmap_sem in read mode. We need to get
 	 * the spinlock only before relocating the vma range ourself.
@@ -986,13 +995,15 @@ int expand_stack(struct vm_area_struct *
 	grow = (address - vma->vm_end) >> PAGE_SHIFT;
 
 	/* Overcommit.. */
-	if (security_vm_enough_memory(grow)) {
+	if (unlikely(security_vm_enough_memory(grow))) {
+		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	
-	if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
-			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
-			current->rlim[RLIMIT_AS].rlim_cur) {
+	if (unlikely(address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
+		     ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
+		     current->rlim[RLIMIT_AS].rlim_cur)) {
+		anon_vma_unlock(vma);
 		vm_unacct_memory(grow);
 		return -ENOMEM;
 	}
@@ -1000,6 +1011,9 @@ int expand_stack(struct vm_area_struct *
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
+
+	anon_vma_unlock(vma);
+
 	return 0;
 }
 
@@ -1028,6 +1042,24 @@ int expand_stack(struct vm_area_struct *
 	unsigned long grow;
 
 	/*
+	 * We must make sure the anon-vma is allocated
+	 * to make sure the anon-vma locking is not a noop.
+	 */
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+
+	/*
+	 * We must serialize against other thread and against
+	 * objrmap while moving the vm_start/vm_pgoff of anon-vmas.
+	 * The total_vm/locked_vm as well needs serialization
+	 * against other threads, the serialization of
+	 * locked_vm/total_vm against syscalls is provided by
+	 * the mmap_sem that we hold in read mode here (all
+	 * syscalls holds it in write mode).
+	 */
+	anon_vma_lock(vma);
+
+	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
 	 * is required to hold the mmap_sem in read mode. We need to get
 	 * the spinlock only before relocating the vma range ourself.
@@ -1036,13 +1068,15 @@ int expand_stack(struct vm_area_struct *
 	grow = (vma->vm_start - address) >> PAGE_SHIFT;
 
 	/* Overcommit.. */
-	if (security_vm_enough_memory(grow)) {
+	if (unlikely(security_vm_enough_memory(grow))) {
+		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	
-	if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
-			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
-			current->rlim[RLIMIT_AS].rlim_cur) {
+	if (unlikely(vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
+		     ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
+		     current->rlim[RLIMIT_AS].rlim_cur)) {
+		anon_vma_unlock(vma);
 		vm_unacct_memory(grow);
 		return -ENOMEM;
 	}
@@ -1051,6 +1085,9 @@ int expand_stack(struct vm_area_struct *
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
+
+	anon_vma_unlock(vma);
+
 	return 0;
 }
 
@@ -1289,7 +1326,6 @@ int split_vma(struct mm_struct * mm, str
 
 	if (mapping)
 		down(&mapping->i_shared_sem);
-	spin_lock(&mm->page_table_lock);
 	anon_vma_lock(vma);
 
 	if (new_below)
@@ -1301,7 +1337,6 @@ int split_vma(struct mm_struct * mm, str
 	__insert_vm_struct(mm, new);
 
 	anon_vma_unlock(vma);
-	spin_unlock(&mm->page_table_lock);
 	if (mapping)
 		up(&mapping->i_shared_sem);
 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!)
  2004-04-23  2:27         ` Andrea Arcangeli
@ 2004-04-23  9:24           ` Alexander Y. Fomichev
  0 siblings, 0 replies; 7+ messages in thread
From: Alexander Y. Fomichev @ 2004-04-23  9:24 UTC (permalink / raw)
  To: admin; +Cc: Andrea Arcangeli, linux-kernel

On Friday 23 April 2004 06:27, Andrea Arcangeli wrote:
> following your trace I got to some futex issue (likely invoked by NPTL),
> and following the futex code I noticed a race with threads in
> expand_stack and likely the futex is effectively calling expand_stack
> too. Then I found a race in expand_stack with threads.
>
> here the details:
>
> in 2.6-aa expand_stack is moving down vm_start and vm_end with only the
> read semaphore held. this means this thing can even run in parallel in
> both cpus, the latter will corrupt vm_pgoff which generate malfunction
> with anon-vma:
>
>         vma->vm_start = address;
>         vma->vm_pgoff -= grow;
>
> this isn't an issue for the file mappings because only anon vmas can be
> growsdown (the filemappings could never work right in filemap_nopage if
> expand_stack would mess with pgoff and vm_start without holding the
> mmap_sem in write mode).
>
> serializing this with anon-vma is easy and scalable with the
> anon_vma_lock (not an mm-wide lock like the page_table_lock).
> This will also serialize perfectly with the objrmap. objrmap should be
> the only thing that cares about vm_pgoff and vm_start being coherent at
> the same time for anon-mappings in anon-vma, and the anon_vma_lock
> provides perfect locking for that.
>
> One of the scalability and simplicity locking beauty of anon-vma is the
> total avoidance of page_table_lock for vma merging and all other vma
> operations in general, and true usage of page_table_lock only for the
> pagetables (and future usage of vma->page_table_lock only for pagetables
> too). I wouldn't really like to giveup on this and to reintroduce the
> whole mess of page_table_lock in the vma merging and in expand stack
> that all other implementations like 2.6 mainline and anonmm are
> suffering from, and I hope my fix will be enough. Could you test if this
> patch helps?

       Of course, but it will take some time,i've got two followed oops 
  on previous kernel after almost two weeks of stable work. I think week 
  or two would be enough  "timeslice" to say something. Unfortunately i 
  can't try artificial tests on a production system but i guess ordinary 
  load would be enough.

-- 
Best regards.
        Alexander Y. Fomichev <gluk@php4.ru>
        Public PGP key: http://sysadminday.org.ru/gluk.asc

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2004-04-23  9:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-04-14  8:57 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! Alexander Y. Fomichev
2004-04-14  9:54 ` Andrea Arcangeli
2004-04-14 11:39   ` Alexander Y. Fomichev
2004-04-14 11:47     ` Andrea Arcangeli
2004-04-19 12:32       ` Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) Alexander Y. Fomichev
2004-04-23  2:27         ` Andrea Arcangeli
2004-04-23  9:24           ` Alexander Y. Fomichev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox