From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757473AbZEKRU5 (ORCPT ); Mon, 11 May 2009 13:20:57 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753688AbZEKRUt (ORCPT ); Mon, 11 May 2009 13:20:49 -0400 Received: from e23smtp03.au.ibm.com ([202.81.31.145]:48590 "EHLO e23smtp03.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753248AbZEKRUr (ORCPT ); Mon, 11 May 2009 13:20:47 -0400 Date: Mon, 11 May 2009 22:50:34 +0530 From: "K.Prasad" To: Alan Stern Cc: Steven Rostedt , Frederic Weisbecker , Ingo Molnar , Linux Kernel Mailing List , Andrew Morton , Benjamin Herrenschmidt , maneesh@linux.vnet.ibm.com, Roland McGrath Subject: Re: [Patch 00/12] Hardware Breakpoint Interfaces Message-ID: <20090511172034.GA8774@in.ibm.com> Reply-To: prasad@linux.vnet.ibm.com References: <20090511113644.GB26721@in.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 11, 2009 at 10:50:39AM -0400, Alan Stern wrote: > On Mon, 11 May 2009, K.Prasad wrote: > > > > You shouldn't call on_each_cpu() while holding a spinlock. The same > > > thing happens in unregister_kernel_hw_breakpoint(). > > > > > > > First, on_each_cpu() will now be changed to return only after all > > functions invoked through IPIs have returned (by changing @wait > > parameter to 1). This is required to prevent side effects of > > incrementing hbp_kernel_pos after on_each_cpu() in > > unregister_kernel_hw_breakpoint() [hbp_kernel_pos is still incremented > > after IPI and I will explain it below]. > > Good. I thought this was already done; it's lucky you noticed it > wasn't. > > > on_each_cpu() isn't a blocking call (despite @wait being set to 1, which > > does a busy wait through cpu_relax()) and should be safe to invoke > > inside a spin_lock() context. I would like to know if you think > > otherwise. > > I guess you're right. Why did you change the spin_lock to > spin_lock_bh? > I realised that flush_thread_hw_breakpoint()-->flush_thread() is invoked through a softIRQ. Considering that it can cause potential deadlock due to circular lock dependancy, spin_lock() calls were changed to spin_lock_bh(). Here's a traceback and message from lockdep that helped me: ================================= [ INFO: inconsistent lock state ] 2.6.30-rc4-tip.hbkpt #31 --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (hw_breakpoint_lock){+.?...}, at: [] flush_thread_hw_breakpoint+0x16/0x75 {SOFTIRQ-ON-W} state was registered at: [] __lock_acquire+0x2e8/0xb55 [] lock_acquire+0x9b/0xbe [] _spin_lock+0x23/0x53 [] load_debug_registers+0x1b/0x77 [] start_secondary+0x1b9/0x1c6 [] 0xffffffff irq event stamp: 1447500 hardirqs last enabled at (1447500): [] kmem_cache_free+0xb2/0xe7 hardirqs last disabled at (1447499): [] kmem_cache_free+0x49/0xe7 softirqs last enabled at (1447468): [] __do_softirq+0x14c/0x154 softirqs last disabled at (1447473): [] do_softirq+0x6d/0xcd other info that might help us debug this: no locks held by swapper/0. stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.30-rc4-tip.hbkpt #31 Call Trace: [] ? printk+0x14/0x1a [] valid_state+0x12a/0x13d [] mark_lock+0xe6/0x1e9 [] ? check_usage_forwards+0x0/0x3f [] __lock_acquire+0x266/0xb55 [] ? page_address+0xe/0xb1 [] lock_acquire+0x9b/0xbe [] ? flush_thread_hw_breakpoint+0x16/0x75 [] _spin_lock+0x23/0x53 [] ? flush_thread_hw_breakpoint+0x16/0x75 [] flush_thread_hw_breakpoint+0x16/0x75 [] free_thread_xstate+0x40/0x62 [] free_thread_info+0x12/0x1e [] free_task+0x1e/0x34 [] __put_task_struct+0xa7/0xac [] delayed_put_task_struct+0x75/0x7c [] __rcu_process_callbacks+0x1a5/0x229 [] rcu_process_callbacks+0x24/0x42 [] __do_softirq+0x9d/0x154 [] ? __do_softirq+0x0/0x154 [] ? irq_exit+0x3a/0x68 [] ? smp_apic_timer_interrupt+0x79/0x87 [] ? apic_timer_interrupt+0x2f/0x34 [] ? native_safe_halt+0xa/0xc [] ? default_idle+0x4f/0x81 [] ? stop_critical_timings+0x25/0x2a [] ? cpu_idle+0x58/0x79 [] ? rest_init+0x58/0x5a [] ? start_kernel+0x2fc/0x301 [] ? __init_begin+0x6a/0x6f > > > > What happens if a kernel breakpoint is triggered on another CPU while > > > this loop is running? Or what happens if the breakpoint being removed > > > is triggered on another CPU before on_each_cpu() is called below? > > > > > > Basically, it's impossible to change the kernel breakpoints > > > simultaneously on all CPUs. That means you somehow have to keep both > > > the old set and the new set around until all the CPUs are updated. > > > > > > > I must admit that the code did not handle the above scenario. I am > > adding a per-cpu instance of 'hbp_kernel[]' called 'this_hbp_kernel[]'. > > The breakpoint handler would use the per-cpu instance which will remain > > valid throughout the execution of the handler. The per-cpu instance will > > be updated with hbp_kernel[] values in arch_update_kernel_hw_breakpoint(). > > [This necessitates hbp_kernel_pos increment to happen after the IPI call > > in unregister_kernel code]. > > Yes, okay. I'm a little nervous about making this_hbp_kernel[] be > per-cpu while hbp_kernel_pos isn't. It means the two values will > sometimes be out of sync with each other. But it ought to be safe > since during the out-of-sync periods, hbp_kernel_pos will always point > to an empty breakpoint slot. > > Still, in hw_breakpoint_handler(), perhaps the "if (i >= > hbp_kernel_pos)" path should also check for !bp. Just to be safe. In > other words, move the > > + if (!bp) > + continue; > + rc = NOTIFY_DONE; > > lines outside the "if" statement. > > I don't see where the out-of-sync situation between hbp_kernel_pos and this_hbp_kernel[] can cause problem. Our concern is when an exception arises in the small time-window starting at the time when hbp_kernel[] is updated in common code and ends when exceptions are disabled through the IPI routine arch_update_kernel_hw_breakpoint. Consider these two cases now: i)register_kernel_hw_breakpoint() - Although hbp_kernel_pos is decremented before the IPI, no new exceptions will arise because of the same on a CPU where this_hbp_kernel[] is not synced because the physical debug registers are not yet set. So, the 'i' in "if (i >= hbp_kernel_pos)" cannot belong to the newly registered breakpoint. If it is for the new breakpoint, then it is on a CPU where this_hbp_kernel[] is synced with hbp_kernel[]. ii)unregister_kernel_hw_breakpoint() - hbp_kernel_pos is incremented after all the IPIs have finished, so hbp_kernel_pos still points to the old value whose hbp_kernel[] is NULL. However in "if (i >= hbp_kernel_pos)" if 'i' points to the 'pending-delete' breakpoint, then 'bp' is derived from this_hbp_kernel[] which contains the old value. The callback bp->triggered() is invoked and the exception returns fine. Let me know if I am missing any possibility leading to incorrect behaviour. > > > Another problem: kdr7 is a global variable, and here you've got every > > > CPU recomputing it whenever a kernel breakpoint is added or removed. > > > It should be computed just once, before the on_each_cpu() call. > > > > > > > If kdr7 needs to be updated only once, it has to be kept outside the IPI > > through the use of a wrapper routine (in arch/x86/kernel/hw_breakpoint.c > > as it is arch-specific). This would mean one more function call in > > (un)register_kernel_<> routines taking the code back to one of its previous > > designs. In a trade-off between code-brevity and efficiency, the present one > > chose the former keeping in mind some of the comments received during the > > early stages of this patch. > > You don't appreciate the seriousness of this problem. Here's the new code: > > + /* Clear all kernel-space breakpoints in kdr7 */ > + kdr7 = 0; > (A) > + for (i = hbp_kernel_pos; i < HB_NUM; i++) { > + per_cpu(this_hbp_kernel[i], cpu) = bp = hbp_kernel[i]; > + if (bp) { > + kdr7 |= encode_dr7(i, bp->info.len, bp->info.type); > + set_debugreg(hbp_kernel[i]->info.address, i); > + } > + } > + > + /* No need to set DR6. Update the debug registers with kernel-space > + * breakpoint values from kdr7 and user-space requests from the > + * current process > + */ > (B) > + set_debugreg(kdr7 | current->thread.debugreg7, 7); > > Suppose you send out the IPIs, and one of the CPUs happens to reach (A) > at the same time another CPU reaches (B). The second CPU will load an > incorrect value into DR7. > > If you prefer you can replace kdr7 above with a local variable, and > set kdr7 to the local variable's value at the end of the function. > Also, in the first set_debugreg() line above, you can replace > hbp_kernel[i] with bp. > > I failed to realise the potential inconsistency in kdr7 value. The code will be modified like this: void arch_update_kernel_hw_breakpoints(void *unused) { struct hw_breakpoint *bp; int i, cpu = get_cpu(); unsigned long temp_kdr7 = 0; /* Don't allow debug exceptions while we update the registers */ set_debugreg(0UL, 7); for (i = hbp_kernel_pos; i < HB_NUM; i++) { per_cpu(this_hbp_kernel[i], cpu) = bp = hbp_kernel[i]; if (bp) { temp_kdr7 |= encode_dr7(i, bp->info.len, bp->info.type); set_debugreg(bp->info.address, i); } } /* No need to set DR6. Update the debug registers with kernel-space * breakpoint values from temp_kdr7 and user-space requests from * the current process */ kdr7 = temp_kdr7; set_debugreg(kdr7 | current->thread.debugreg7, 7); put_cpu_no_resched(); } > > > > + void *temp_mem = kzalloc(sizeof(struct hw_breakpoint), GFP_KERNEL); > > > > + if (!temp_mem) > > > > + temp_mem_used = -ENOMEM; > > > > > > I don't think this is a good idea... > > > > > > > I agree that it turned out to be wrong. ptrace is now modified to use > > the (un)register_user_hw_breakpoint() interfaces directly and not the > > worker routines, thereby avoiding all this complexity. Please find the > > changes in the new patch. > > > > And now if something went wrong, you have already freed the memory > > > holding the original breakpoint structures. It would be better to > > > keep them around until you know they aren't going to be needed. > > I still think it would be a good idea to avoid freeing any breakpoint > structures until you know for certain they aren't needed. Otherwise > you might find that you're unable to allocate memory to re-create a > structure that was already present. > > Alan Stern > I agree that we shouldn't free memory temporarily that would be needed later, as memory may not be available. However such a situation does not arise in ptrace_write_dr7() after the re-write using (un)register_user_hw_breakpoint(). kfree(bp) is done when i) an existing breakpoint is disabled ii) a breakpoint is requested but register_user_hw_breakpoint() routine fails for some reasons. Hence the breakpoint structure kzalloc()'ed afresh is kfree()'ed. iii) In case of modifications to the type of breakpoint (such as read<-->write), the breakpoint structure is not de-allocated and this is where __modify_user_hw_breakpoint() helps us. There is no opportunity window where the breakpoint can be grabbed by other requests when a modification is done. Let me know if you think there's still some discrepancy here. Thanks for your detailed review and pointing me to the errors. I am sending the patchset again with the changes discussed above. Thanks, K.Prasad