* 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!
@ 2004-04-14 8:57 Alexander Y. Fomichev
2004-04-14 9:54 ` Andrea Arcangeli
0 siblings, 1 reply; 7+ messages in thread
From: Alexander Y. Fomichev @ 2004-04-14 8:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrea Arcangeli, admin
G'Day,
I've got a bug on 2.6.5-aa3 today
System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty
heavy loaded at most time, but i've got it approximately
at 02:00 (MSD) when no serious load should be.
System remained accessible all the time but operations
with proclist from userspace (i.e. ps, w) appears to be locked.
Here is a trace:
kernel BUG at mm/objrmap.c:137!
invalid operand: 0000 [#1]
PREEMPT SMP
CPU: 1
EIP: 0060:[<c0148ebb>] Not tainted
EFLAGS: 00010282 (2.6.5-aa3)
EIP is at page_referenced_one+0x8c/0x96
eax: dafaaae0 ebx: dafaaae0 ecx: 00000000 edx: c19ef4a0
esi: 00000000 edi: dade48c4 ebp: c19ef4a0 esp: e9049c0c
ds: 007b es: 007b ss: 0068
Process apache2 (pid: 29070, threadinfo=e9048000 task=cf0e8630)
Stack: dafaaae0 f4531e30 00000000 f4531e2c c0149073 c19ef4a0 00000001 c02d2d00
e9048000 00000166 c0149103 c19ef4a0 c013ff58 00000000 00000000 c02d2e04
00000001 00000000 00000000 e9049d30 00000166 c1a59890 c1841cd0 e9049c68
Call Trace:
[<c0149073>] page_referenced_anon+0x45/0x94
[<c0149103>] page_referenced+0x41/0x4d
[<c013ff58>] refill_inactive_zone+0x56c/0x678
[<c0140123>] shrink_zone+0xbf/0xc1
[<c014019d>] shrink_caches+0x78/0x7c
[<c014023b>] try_to_free_pages+0x9a/0x141
[<c0138f68>] __alloc_pages+0x1cb/0x30a
[<c0144594>] do_anonymous_page+0x82/0x246
[<c027fbe2>] inet_recvmsg+0x4a/0x69
[<c01447b5>] do_no_page+0x5d/0x58c
[<c0144ed8>] handle_mm_fault+0xe2/0x17b
[<c0115752>] do_page_fault+0x2eb/0x4d2
[<c015384d>] vfs_read+0xbc/0xf5
[<c0115467>] do_page_fault+0x0/0x4d2
[<c0107789>] error_code+0x2d/0x38
Code: 0f 0b 89 00 a9 20 2a c0 eb e1 55 31 ed 57 89 c7 56 53 83 ec
<6>note: apache2[29070] exited with preempt_count 2
bad: scheduling while atomic!
Call Trace:
[<c01182a4>] schedule+0x6a1/0x6a6
[<c012c2be>] rcu_process_callbacks+0x13a/0x169
[<c012c2be>] rcu_process_callbacks+0x13a/0x169
[<c01208f8>] tasklet_action+0x5f/0xa5
[<c01c9261>] rwsem_down_read_failed+0xa5/0x155
[<c012f5d4>] .text.lock.futex+0x7/0xb3
[<c0111b86>] delay_tsc+0xb/0x15
[<c012f4a5>] do_futex+0x6b/0x6d
[<c012f5a1>] sys_futex+0xfa/0x112
[<c011cc90>] call_console_drivers+0x77/0xfc
[<c011a872>] mm_release+0x96/0xa6
[<c011ea17>] do_exit+0x84/0x4d9
[<c029007b>] packet_ioctl+0x1f/0x192
[<c0107de2>] do_divide_error+0x0/0xd3
[<c010805c>] do_invalid_op+0x0/0xaa
[<c0108104>] do_invalid_op+0xa8/0xaa
[<c0148ebb>] page_referenced_one+0x8c/0x96
[<c0117642>] scheduler_tick+0x3b/0x5f7
[<c0116451>] recalc_task_prio+0x8c/0x1a6
[<c0107789>] error_code+0x2d/0x38
[<c014007b>] shrink_zone+0x17/0xc1
[<c0148ebb>] page_referenced_one+0x8c/0x96
[<c0149073>] page_referenced_anon+0x45/0x94
[<c0149103>] page_referenced+0x41/0x4d
[<c013ff58>] refill_inactive_zone+0x56c/0x678
[<c0140123>] shrink_zone+0xbf/0xc1
[<c014019d>] shrink_caches+0x78/0x7c
[<c014023b>] try_to_free_pages+0x9a/0x141
[<c0138f68>] __alloc_pages+0x1cb/0x30a
[<c0144594>] do_anonymous_page+0x82/0x246
[<c027fbe2>] inet_recvmsg+0x4a/0x69
[<c01447b5>] do_no_page+0x5d/0x58c
[<c0144ed8>] handle_mm_fault+0xe2/0x17b
[<c0115752>] do_page_fault+0x2eb/0x4d2
[<c015384d>] vfs_read+0xbc/0xf5
[<c0115467>] do_page_fault+0x0/0x4d2
[<c0107789>] error_code+0x2d/0x38
& some (probably) useful info:
cpuinfo: http://sysadminday.org.ru/objrmap_bug/cpuinfo
lspci: http://sysadminday.org.ru/objrmap_bug/lspci
lspci -vvn http://sysadminday.org.ru/objrmap_bug/lspci-vvn
full_trace (including initial boot,SysRq-[M,P,T])
http://sysadminday.org.ru/objrmap_bug/full_trace
--
Best regards.
Alexander Y. Fomichev <gluk@php4.ru>
Public PGP key: http://sysadminday.org.ru/gluk.asc
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! 2004-04-14 8:57 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! Alexander Y. Fomichev @ 2004-04-14 9:54 ` Andrea Arcangeli 2004-04-14 11:39 ` Alexander Y. Fomichev 0 siblings, 1 reply; 7+ messages in thread From: Andrea Arcangeli @ 2004-04-14 9:54 UTC (permalink / raw) To: Alexander Y. Fomichev; +Cc: linux-kernel, admin On Wed, Apr 14, 2004 at 12:57:16PM +0400, Alexander Y. Fomichev wrote: > G'Day, > > I've got a bug on 2.6.5-aa3 today > > System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty > heavy loaded at most time, but i've got it approximately > at 02:00 (MSD) when no serious load should be. > System remained accessible all the time but operations > with proclist from userspace (i.e. ps, w) appears to be locked. I don't think apache2 uses nonlinear. there was an smp race fix in 2.6.5-aa4, so you may want to try again with 2.6.5-aa5 (latest) just in case this was mm corruption triggered by the race. Are you using threading with apache2? Such a race could trigger only with threads. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! 2004-04-14 9:54 ` Andrea Arcangeli @ 2004-04-14 11:39 ` Alexander Y. Fomichev 2004-04-14 11:47 ` Andrea Arcangeli 0 siblings, 1 reply; 7+ messages in thread From: Alexander Y. Fomichev @ 2004-04-14 11:39 UTC (permalink / raw) To: admin; +Cc: Andrea Arcangeli, linux-kernel On Wednesday 14 April 2004 13:54, Andrea Arcangeli wrote: > On Wed, Apr 14, 2004 at 12:57:16PM +0400, Alexander Y. Fomichev wrote: > > G'Day, > > > > I've got a bug on 2.6.5-aa3 today > > > > System: Dual P4 Xeon 2.4GHz, 4G ECC RAM - (production web-server) pretty btw, i was wrong, this box has only 2G ECC RAM not 4. > > heavy loaded at most time, but i've got it approximately > > at 02:00 (MSD) when no serious load should be. > > System remained accessible all the time but operations > > with proclist from userspace (i.e. ps, w) appears to be locked. > > I don't think apache2 uses nonlinear. there was an smp race fix in > 2.6.5-aa4, so you may want to try again with 2.6.5-aa5 (latest) just in > case this was mm corruption triggered by the race. tnx, i'll try it. btw, system running on 2.6.5-aa3 almost for a month (at 2004/04/18) and this is first time i catch this, so 2.6.5-aa3 pretty stable for me. (except for trouble mentioned by Sergey Kostyliov as "2.6.X kernel memory leak?"/"2.6.1 IO lockup on SMP systems") > Are you using threading with apache2? Such a race could trigger only > with threads. exactly right, apache2 compiled --with-mpm=worker. -- Best regards. Alexander Y. Fomichev <gluk@php4.ru> Public PGP key: http://sysadminday.org.ru/gluk.asc ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! 2004-04-14 11:39 ` Alexander Y. Fomichev @ 2004-04-14 11:47 ` Andrea Arcangeli 2004-04-19 12:32 ` Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) Alexander Y. Fomichev 0 siblings, 1 reply; 7+ messages in thread From: Andrea Arcangeli @ 2004-04-14 11:47 UTC (permalink / raw) To: Alexander Y. Fomichev; +Cc: admin, linux-kernel On Wed, Apr 14, 2004 at 03:39:49PM +0400, Alexander Y. Fomichev wrote: > exactly right, apache2 compiled --with-mpm=worker. ok so there are good chances that 2.6.5-aa5 will fix it, if not then please notify me again, thanks. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) 2004-04-14 11:47 ` Andrea Arcangeli @ 2004-04-19 12:32 ` Alexander Y. Fomichev 2004-04-23 2:27 ` Andrea Arcangeli 0 siblings, 1 reply; 7+ messages in thread From: Alexander Y. Fomichev @ 2004-04-19 12:32 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: admin, linux-kernel On Wednesday 14 April 2004 15:47, Andrea Arcangeli wrote: > ok so there are good chances that 2.6.5-aa5 will fix it, if not then > please notify me again, thanks. I've noticed a behaviour today very similar to mentioned in previous report (on 14 Apr 2004) except for message from BUG() (kernel 2.6.5-aa5) i.e. system remained accessible but procps(ps,pkill) appears to be locked and Sysrq-T is similar to previous one - some of apache2 & all of procps have been blocked. Here is a traces: http://sysadminday.org.ru/2.6.5-ps-lockup/sysrq-M http://sysadminday.org.ru/2.6.5-ps-lockup/full_trace -- Best regards. Alexander Y. Fomichev <gluk@php4.ru> Public PGP key: http://sysadminday.org.ru/gluk.asc ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) 2004-04-19 12:32 ` Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) Alexander Y. Fomichev @ 2004-04-23 2:27 ` Andrea Arcangeli 2004-04-23 9:24 ` Alexander Y. Fomichev 0 siblings, 1 reply; 7+ messages in thread From: Andrea Arcangeli @ 2004-04-23 2:27 UTC (permalink / raw) To: Alexander Y. Fomichev; +Cc: admin, linux-kernel On Mon, Apr 19, 2004 at 04:32:53PM +0400, Alexander Y. Fomichev wrote: > On Wednesday 14 April 2004 15:47, Andrea Arcangeli wrote: > > ok so there are good chances that 2.6.5-aa5 will fix it, if not then > > please notify me again, thanks. > > I've noticed a behaviour today very similar to mentioned in previous report > (on 14 Apr 2004) except for message from BUG() (kernel 2.6.5-aa5) i.e. > system remained accessible but procps(ps,pkill) appears to be locked and > Sysrq-T is similar to previous one - some of apache2 & all of procps > have been blocked. > > Here is a traces: > > http://sysadminday.org.ru/2.6.5-ps-lockup/sysrq-M > http://sysadminday.org.ru/2.6.5-ps-lockup/full_trace following your trace I got to some futex issue (likely invoked by NPTL), and following the futex code I noticed a race with threads in expand_stack and likely the futex is effectively calling expand_stack too. Then I found a race in expand_stack with threads. here the details: in 2.6-aa expand_stack is moving down vm_start and vm_end with only the read semaphore held. this means this thing can even run in parallel in both cpus, the latter will corrupt vm_pgoff which generate malfunction with anon-vma: vma->vm_start = address; vma->vm_pgoff -= grow; this isn't an issue for the file mappings because only anon vmas can be growsdown (the filemappings could never work right in filemap_nopage if expand_stack would mess with pgoff and vm_start without holding the mmap_sem in write mode). serializing this with anon-vma is easy and scalable with the anon_vma_lock (not an mm-wide lock like the page_table_lock). This will also serialize perfectly with the objrmap. objrmap should be the only thing that cares about vm_pgoff and vm_start being coherent at the same time for anon-mappings in anon-vma, and the anon_vma_lock provides perfect locking for that. One of the scalability and simplicity locking beauty of anon-vma is the total avoidance of page_table_lock for vma merging and all other vma operations in general, and true usage of page_table_lock only for the pagetables (and future usage of vma->page_table_lock only for pagetables too). I wouldn't really like to giveup on this and to reintroduce the whole mess of page_table_lock in the vma merging and in expand stack that all other implementations like 2.6 mainline and anonmm are suffering from, and I hope my fix will be enough. Could you test if this patch helps? (there was also an ancient page_table_lock in split_vma, I forgot to delete that previously) --- 2.6.5-aa3/mm/mmap.c.~1~ 2004-04-14 11:51:33.000000000 +0200 +++ 2.6.5-aa3/mm/mmap.c 2004-04-23 04:04:31.258066752 +0200 @@ -977,6 +977,15 @@ int expand_stack(struct vm_area_struct * return -EFAULT; /* + * We must make sure the anon-vma is allocated + * to make sure the anon-vma locking is not a noop. + */ + if (unlikely(anon_vma_prepare(vma))) + return -ENOMEM; + + anon_vma_lock(vma); + + /* * vma->vm_start/vm_end cannot change under us because the caller * is required to hold the mmap_sem in read mode. We need to get * the spinlock only before relocating the vma range ourself. @@ -986,13 +995,15 @@ int expand_stack(struct vm_area_struct * grow = (address - vma->vm_end) >> PAGE_SHIFT; /* Overcommit.. */ - if (security_vm_enough_memory(grow)) { + if (unlikely(security_vm_enough_memory(grow))) { + anon_vma_unlock(vma); return -ENOMEM; } - if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur || - ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > - current->rlim[RLIMIT_AS].rlim_cur) { + if (unlikely(address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur || + ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > + current->rlim[RLIMIT_AS].rlim_cur)) { + anon_vma_unlock(vma); vm_unacct_memory(grow); return -ENOMEM; } @@ -1000,6 +1011,9 @@ int expand_stack(struct vm_area_struct * vma->vm_mm->total_vm += grow; if (vma->vm_flags & VM_LOCKED) vma->vm_mm->locked_vm += grow; + + anon_vma_unlock(vma); + return 0; } @@ -1028,6 +1042,24 @@ int expand_stack(struct vm_area_struct * unsigned long grow; /* + * We must make sure the anon-vma is allocated + * to make sure the anon-vma locking is not a noop. + */ + if (unlikely(anon_vma_prepare(vma))) + return -ENOMEM; + + /* + * We must serialize against other thread and against + * objrmap while moving the vm_start/vm_pgoff of anon-vmas. + * The total_vm/locked_vm as well needs serialization + * against other threads, the serialization of + * locked_vm/total_vm against syscalls is provided by + * the mmap_sem that we hold in read mode here (all + * syscalls holds it in write mode). + */ + anon_vma_lock(vma); + + /* * vma->vm_start/vm_end cannot change under us because the caller * is required to hold the mmap_sem in read mode. We need to get * the spinlock only before relocating the vma range ourself. @@ -1036,13 +1068,15 @@ int expand_stack(struct vm_area_struct * grow = (vma->vm_start - address) >> PAGE_SHIFT; /* Overcommit.. */ - if (security_vm_enough_memory(grow)) { + if (unlikely(security_vm_enough_memory(grow))) { + anon_vma_unlock(vma); return -ENOMEM; } - if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur || - ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > - current->rlim[RLIMIT_AS].rlim_cur) { + if (unlikely(vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur || + ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > + current->rlim[RLIMIT_AS].rlim_cur)) { + anon_vma_unlock(vma); vm_unacct_memory(grow); return -ENOMEM; } @@ -1051,6 +1085,9 @@ int expand_stack(struct vm_area_struct * vma->vm_mm->total_vm += grow; if (vma->vm_flags & VM_LOCKED) vma->vm_mm->locked_vm += grow; + + anon_vma_unlock(vma); + return 0; } @@ -1289,7 +1326,6 @@ int split_vma(struct mm_struct * mm, str if (mapping) down(&mapping->i_shared_sem); - spin_lock(&mm->page_table_lock); anon_vma_lock(vma); if (new_below) @@ -1301,7 +1337,6 @@ int split_vma(struct mm_struct * mm, str __insert_vm_struct(mm, new); anon_vma_unlock(vma); - spin_unlock(&mm->page_table_lock); if (mapping) up(&mapping->i_shared_sem); ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) 2004-04-23 2:27 ` Andrea Arcangeli @ 2004-04-23 9:24 ` Alexander Y. Fomichev 0 siblings, 0 replies; 7+ messages in thread From: Alexander Y. Fomichev @ 2004-04-23 9:24 UTC (permalink / raw) To: admin; +Cc: Andrea Arcangeli, linux-kernel On Friday 23 April 2004 06:27, Andrea Arcangeli wrote: > following your trace I got to some futex issue (likely invoked by NPTL), > and following the futex code I noticed a race with threads in > expand_stack and likely the futex is effectively calling expand_stack > too. Then I found a race in expand_stack with threads. > > here the details: > > in 2.6-aa expand_stack is moving down vm_start and vm_end with only the > read semaphore held. this means this thing can even run in parallel in > both cpus, the latter will corrupt vm_pgoff which generate malfunction > with anon-vma: > > vma->vm_start = address; > vma->vm_pgoff -= grow; > > this isn't an issue for the file mappings because only anon vmas can be > growsdown (the filemappings could never work right in filemap_nopage if > expand_stack would mess with pgoff and vm_start without holding the > mmap_sem in write mode). > > serializing this with anon-vma is easy and scalable with the > anon_vma_lock (not an mm-wide lock like the page_table_lock). > This will also serialize perfectly with the objrmap. objrmap should be > the only thing that cares about vm_pgoff and vm_start being coherent at > the same time for anon-mappings in anon-vma, and the anon_vma_lock > provides perfect locking for that. > > One of the scalability and simplicity locking beauty of anon-vma is the > total avoidance of page_table_lock for vma merging and all other vma > operations in general, and true usage of page_table_lock only for the > pagetables (and future usage of vma->page_table_lock only for pagetables > too). I wouldn't really like to giveup on this and to reintroduce the > whole mess of page_table_lock in the vma merging and in expand stack > that all other implementations like 2.6 mainline and anonmm are > suffering from, and I hope my fix will be enough. Could you test if this > patch helps? Of course, but it will take some time,i've got two followed oops on previous kernel after almost two weeks of stable work. I think week or two would be enough "timeslice" to say something. Unfortunately i can't try artificial tests on a production system but i guess ordinary load would be enough. -- Best regards. Alexander Y. Fomichev <gluk@php4.ru> Public PGP key: http://sysadminday.org.ru/gluk.asc ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2004-04-23 9:24 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-04-14 8:57 2.6.5-aa3: kernel BUG at mm/objrmap.c:137! Alexander Y. Fomichev 2004-04-14 9:54 ` Andrea Arcangeli 2004-04-14 11:39 ` Alexander Y. Fomichev 2004-04-14 11:47 ` Andrea Arcangeli 2004-04-19 12:32 ` Similar behaviour without BUG() message(was: Re: 2.6.5-aa3: kernel BUG at mm/objrmap.c:137!) Alexander Y. Fomichev 2004-04-23 2:27 ` Andrea Arcangeli 2004-04-23 9:24 ` Alexander Y. Fomichev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox