From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759453AbZBMQ4P (ORCPT ); Fri, 13 Feb 2009 11:56:15 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1750734AbZBMQz7 (ORCPT ); Fri, 13 Feb 2009 11:55:59 -0500 Received: from tomts16-srv.bellnexxia.net ([209.226.175.4]:33363 "EHLO tomts16-srv.bellnexxia.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750713AbZBMQz6 (ORCPT ); Fri, 13 Feb 2009 11:55:58 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ah0FAKcxlUlMQWt2/2dsb2JhbACBbtEPhBgG Date: Fri, 13 Feb 2009 11:55:55 -0500 From: Mathieu Desnoyers To: Masami Hiramatsu Cc: Peter Zijlstra , Nick Piggin , akpm , linux-kernel , Ingo Molnar Subject: Re: irq-disabled vs vmap vs text_poke Message-ID: <20090213165555.GA4684@Krystal> References: <1234529407.6519.28.camel@twins> <20090213141839.GA31922@Krystal> <4995A0B7.1030705@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline In-Reply-To: <4995A0B7.1030705@redhat.com> X-Editor: vi X-Info: http://krystal.dyndns.org:8080 X-Operating-System: Linux/2.6.21.3-grsec (i686) X-Uptime: 11:51:37 up 43 days, 16:49, 4 users, load average: 0.41, 0.37, 0.41 User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Masami Hiramatsu (mhiramat@redhat.com) wrote: > Mathieu Desnoyers wrote: > > * Peter Zijlstra (peterz@infradead.org) wrote: > >> Hi, > >> > >> Ingo got the following splat: > >> > >> [ 5.101748] ------------[ cut here ]------------ > >> [ 5.104305] WARNING: at kernel/smp.c:329 smp_call_function_many+0x34/0x1ea() > >> [ 5.104305] Hardware name: P4DC6 > >> [ 5.104305] Modules linked in: > >> [ 5.104305] Pid: 1, comm: swapper Not tainted 2.6.29-rc4-tip-01766-g1757c19-dirty #2 > >> [ 5.104305] Call Trace: > >> [ 5.104305] [] warn_slowpath+0x79/0x8f > >> [ 5.104305] [] ? dump_trace+0x7d/0xac > >> [ 5.104305] [] ? __lock_acquire+0x319/0x382 > >> [ 5.104305] [] smp_call_function_many+0x34/0x1ea > >> [ 5.104305] [] ? do_flush_tlb_all+0x0/0x48 > >> [ 5.104305] [] ? do_flush_tlb_all+0x0/0x48 > >> [ 5.104305] [] smp_call_function+0x21/0x28 > >> [ 5.104305] [] on_each_cpu+0x14/0x23 > >> [ 5.104305] [] flush_tlb_all+0x19/0x1b > >> [ 5.104305] [] flush_tlb_kernel_range+0xd/0xf > >> [ 5.104305] [] vmap_debug_free_range+0x1c/0x20 > >> [ 5.104305] [] remove_vm_area+0x28/0x67 > >> [ 5.104305] [] __vunmap+0x30/0xab > >> [ 5.104305] [] vunmap+0x27/0x29 > >> [ 5.104305] [] text_poke+0xd6/0x104 > >> [ 5.104305] [] ? kprobe_target+0x0/0x15 > >> [ 5.104305] [] arch_disarm_kprobe+0x13/0x15 > >> [ 5.104305] [] __unregister_kprobe_top+0x68/0xe8 > >> [ 5.104305] [] unregister_kretprobes+0x2c/0xb9 > >> [ 5.104305] [] unregister_kretprobe+0x16/0x18 > >> [ 5.104305] [] init_test_probes+0x2ed/0x40c > >> [ 5.104305] [] init_kprobes+0x127/0x131 > >> [ 5.104305] [] ? lock_release_holdtime+0x43/0x48 > >> [ 5.104305] [] ? __slab_alloc+0x5d/0x27a > >> [ 5.104305] [] ? lockdep_init_map+0x80/0xe7 > >> [ 5.104305] [] ? clocksource_read+0xd/0xf > >> [ 5.104305] [] ? getnstimeofday+0x5e/0xe9 > >> [ 5.104305] [] ? timespec_to_ktime+0xe/0x11 > >> [ 5.104305] [] ? init_kprobes+0x0/0x131 > >> [ 5.104305] [] do_one_initcall+0x6a/0x169 > >> [ 5.104305] [] ? number+0x10d/0x1cf > >> [ 5.104305] [] ? register_lock_class+0x17/0x228 > >> [ 5.104305] [] ? __lock_acquire+0x319/0x382 > >> [ 5.104305] [] ? debug_check_no_obj_freed+0xda/0x126 > >> [ 5.104305] [] ? __lock_acquire+0x319/0x382 > >> [ 5.104305] [] ? lock_release_holdtime+0x43/0x48 > >> [ 5.104305] [] ? _spin_unlock+0x22/0x25 > >> [ 5.104305] [] ? proc_register+0x14b/0x15c > >> [ 5.104305] [] ? create_proc_entry+0x76/0x8c > >> [ 5.104305] [] ? default_affinity_write+0x3f/0x8a > >> [ 5.104305] [] ? init_irq_proc+0x58/0x65 > >> [ 5.104305] [] kernel_init+0x118/0x169 > >> [ 5.104305] [] ? kernel_init+0x0/0x169 > >> [ 5.104305] [] kernel_thread_helper+0x7/0x10 > >> [ 5.104305] ---[ end trace e93713a9d40cd06c ]--- > >> > >> Which points to vunmap() being called with interrupts disabled. > >> > >> Which made me look at the vmap/vunmap calls, and they appear to not be > >> irq-safe, therefore this would be a bug in text_poke(). > >> > >> [ that is, vmap() can end up calling get_vm_area_caller() which in turn > >> calls __get_vm_area_node() with GFP_KERNEL, ergo, don't do this from > >> an atomic context. ] > >> > >> Now text_poke() uses local_irq_save/restore(), which conveys that it can > >> be called with IRQs disabled, which is exactly what happens in the trace > >> above, however we just established that vmap/vunmap() are not irq-safe. > >> > > > > text_poke should actually use local_irq_disable/enable rather than > > local_irq_save/restore because it _must_ be called with interrupts on. > > Could you tell me why it must be called with irq on? > Because it uses vmap, but see below.. > > > >> Anybody got an idea on how to fix this? > > > > There is probably something wrong with the caller, kprobes, which calls > > text_poke with interrupts off. > > Hmm, kprobe's smoke test caused this problem. Since (un)register_kprobe() > may sleep for waiting a mutex, it should not be called with interrupts off. > So, it's not text_poke()'s issue nor vmap(). > > BTW, what about using map_vm_area() in text_poke() instead of vmap()? > Since text_poke() just maps text pages to alias pages temporarily, > I think we don't need to use delayed vunmap(). > As with this patch from you ? Sorry, when it has been initially submitted, the discussion turned in a different direction. Please see comments inline. > From: Masami Hiramatsu > > Use vm_map_ram() instead of vmap() in text_poke(), because text_poke() > just want to map pages temporarily. > > --- > As far as I tested, this patch works fine for me. > However, there might be another hidden bug in the kernel... > We need to fix that too. > > arch/x86/kernel/alternative.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > Index: 2.6-rc/arch/x86/kernel/alternative.c > =================================================================== > --- 2.6-rc.orig/arch/x86/kernel/alternative.c > +++ 2.6-rc/arch/x86/kernel/alternative.c > @@ -515,12 +515,12 @@ void *__kprobes text_poke(void *addr, co > BUG_ON(!pages[0]); > if (!pages[1]) > nr_pages = 1; > - vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); > + vaddr = vm_map_ram(pages, nr_pages, -1, PAGE_KERNEL); Hrm, maybe we should do : vaddr = vm_map_ram(pages, nr_pages, numa_node_id(), PAGE_KERNEL); instead ? Related question : can vm_map_ram sleep ? Mathieu > BUG_ON(!vaddr); > local_irq_save(flags); > memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len); > local_irq_restore(flags); > - vunmap(vaddr); > + vm_unmap_ram(vaddr, nr_pages); > sync_core(); > /* Could also do a CLFLUSH here to speed up CPU recovery; but > that causes hangs on some VIA CPUs. */ -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68