All of lore.kernel.org
 help / color / mirror / Atom feed
From: "K.Prasad" <prasad@linux.vnet.ibm.com>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Ingo Molnar <mingo@elte.hu>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Benjamin Herrenschmidt <benh@au1.ibm.com>,
	maneesh@linux.vnet.ibm.com, Roland McGrath <roland@redhat.com>
Subject: Re: [Patch 00/12] Hardware Breakpoint Interfaces
Date: Mon, 11 May 2009 22:50:34 +0530	[thread overview]
Message-ID: <20090511172034.GA8774@in.ibm.com> (raw)
In-Reply-To: <Pine.LNX.4.44L0.0905111025500.2975-100000@iolanthe.rowland.org>

On Mon, May 11, 2009 at 10:50:39AM -0400, Alan Stern wrote:
> On Mon, 11 May 2009, K.Prasad wrote:
> 
> > > You shouldn't call on_each_cpu() while holding a spinlock.  The same
> > > thing happens in unregister_kernel_hw_breakpoint().
> > > 
> > 
> > First, on_each_cpu() will now be changed to return only after all
> > functions invoked through IPIs have returned (by changing @wait
> > parameter to 1). This is required to prevent side effects of
> > incrementing hbp_kernel_pos after on_each_cpu() in
> > unregister_kernel_hw_breakpoint() [hbp_kernel_pos is still incremented
> > after IPI and I will explain it below].
> 
> Good.  I thought this was already done; it's lucky you noticed it
> wasn't.
> 
> > on_each_cpu() isn't a blocking call (despite @wait being set to 1, which
> > does a busy wait through cpu_relax()) and should be safe to invoke
> > inside a spin_lock() context. I would like to know if you think
> > otherwise.
> 
> I guess you're right.  Why did you change the spin_lock to 
> spin_lock_bh?
> 

I realised that flush_thread_hw_breakpoint()-->flush_thread() is invoked
through a softIRQ. Considering that it can cause potential deadlock due
to circular lock dependancy, spin_lock() calls were changed to
spin_lock_bh(). Here's a traceback and message from lockdep that helped
me:

=================================
[ INFO: inconsistent lock state ]
2.6.30-rc4-tip.hbkpt #31
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
 (hw_breakpoint_lock){+.?...}, at: [<c0498f73>]
flush_thread_hw_breakpoint+0x16/0x75
{SOFTIRQ-ON-W} state was registered at:
  [<c0459309>] __lock_acquire+0x2e8/0xb55
  [<c0459c11>] lock_acquire+0x9b/0xbe
  [<c0701dbd>] _spin_lock+0x23/0x53
  [<c0499190>] load_debug_registers+0x1b/0x77
  [<c06fcf20>] start_secondary+0x1b9/0x1c6
  [<ffffffff>] 0xffffffff
irq event stamp: 1447500
hardirqs last  enabled at (1447500): [<c04be9ff>]
kmem_cache_free+0xb2/0xe7
hardirqs last disabled at (1447499): [<c04be996>]
kmem_cache_free+0x49/0xe7
softirqs last  enabled at (1447468): [<c043cdda>]
__do_softirq+0x14c/0x154
softirqs last disabled at (1447473): [<c04051b8>] do_softirq+0x6d/0xcd

other info that might help us debug this:
no locks held by swapper/0.

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.30-rc4-tip.hbkpt #31
Call Trace:
 [<c06ff8d2>] ? printk+0x14/0x1a
 [<c04582e7>] valid_state+0x12a/0x13d
 [<c04583e0>] mark_lock+0xe6/0x1e9
 [<c0458ae8>] ? check_usage_forwards+0x0/0x3f
 [<c0459287>] __lock_acquire+0x266/0xb55
 [<c04aa403>] ? page_address+0xe/0xb1
 [<c0459c11>] lock_acquire+0x9b/0xbe
 [<c0498f73>] ? flush_thread_hw_breakpoint+0x16/0x75
 [<c0701dbd>] _spin_lock+0x23/0x53
 [<c0498f73>] ? flush_thread_hw_breakpoint+0x16/0x75
 [<c0498f73>] flush_thread_hw_breakpoint+0x16/0x75
 [<c0409751>] free_thread_xstate+0x40/0x62
 [<c0409785>] free_thread_info+0x12/0x1e
 [<c043611b>] free_task+0x1e/0x34
 [<c0437522>] __put_task_struct+0xa7/0xac
 [<c0439084>] delayed_put_task_struct+0x75/0x7c
 [<c047bd9c>] __rcu_process_callbacks+0x1a5/0x229
 [<c047be44>] rcu_process_callbacks+0x24/0x42
 [<c043cd2b>] __do_softirq+0x9d/0x154
 [<c043cc8e>] ? __do_softirq+0x0/0x154
 <IRQ>  [<c043c816>] ? irq_exit+0x3a/0x68
 [<c07055d4>] ? smp_apic_timer_interrupt+0x79/0x87
 [<c0403d47>] ? apic_timer_interrupt+0x2f/0x34
 [<c041eeb4>] ? native_safe_halt+0xa/0xc
 [<c0409296>] ? default_idle+0x4f/0x81
 [<c048c417>] ? stop_critical_timings+0x25/0x2a
 [<c040263b>] ? cpu_idle+0x58/0x79
 [<c06f0004>] ? rest_init+0x58/0x5a
 [<c08cc805>] ? start_kernel+0x2fc/0x301
 [<c08cc06a>] ? __init_begin+0x6a/0x6f

> 
> > > What happens if a kernel breakpoint is triggered on another CPU while
> > > this loop is running?  Or what happens if the breakpoint being removed
> > > is triggered on another CPU before on_each_cpu() is called below?
> > > 
> > > Basically, it's impossible to change the kernel breakpoints 
> > > simultaneously on all CPUs.  That means you somehow have to keep both 
> > > the old set and the new set around until all the CPUs are updated.
> > > 
> > 
> > I must admit that the code did not handle the above scenario. I am
> > adding a per-cpu instance of 'hbp_kernel[]' called 'this_hbp_kernel[]'.
> > The breakpoint handler would use the per-cpu instance which will remain
> > valid throughout the execution of the handler. The per-cpu instance will
> > be updated with hbp_kernel[] values in arch_update_kernel_hw_breakpoint().
> > [This necessitates hbp_kernel_pos increment to happen after the IPI call
> > in unregister_kernel code].
> 
> Yes, okay.  I'm a little nervous about making this_hbp_kernel[] be
> per-cpu while hbp_kernel_pos isn't.  It means the two values will
> sometimes be out of sync with each other.  But it ought to be safe
> since during the out-of-sync periods, hbp_kernel_pos will always point
> to an empty breakpoint slot.
> 
> Still, in hw_breakpoint_handler(), perhaps the "if (i >= 
> hbp_kernel_pos)" path should also check for !bp.  Just to be safe.  In 
> other words, move the 
> 
> +			if (!bp)
> +				continue;
> +			rc = NOTIFY_DONE;
> 
> lines outside the "if" statement.
> 
> 

I don't see where the out-of-sync situation between hbp_kernel_pos and
this_hbp_kernel[] can cause problem. Our concern is when an exception
arises in the small time-window starting at the time when hbp_kernel[]
is updated in common code and ends when exceptions are disabled through
the IPI routine arch_update_kernel_hw_breakpoint. Consider these two
cases now:

i)register_kernel_hw_breakpoint() - Although hbp_kernel_pos is
decremented before the IPI, no new exceptions will arise because of the
same on a CPU where this_hbp_kernel[] is not synced because the physical
debug registers are not yet set. So, the 'i' in "if (i >= hbp_kernel_pos)"
cannot belong to the newly registered breakpoint. If it is for the new
breakpoint, then it is on a CPU where this_hbp_kernel[] is synced with
hbp_kernel[].

ii)unregister_kernel_hw_breakpoint() - hbp_kernel_pos is incremented
after all the IPIs have finished, so hbp_kernel_pos still points to the
old value whose hbp_kernel[] is NULL. However in "if (i >= hbp_kernel_pos)"
if 'i' points to the 'pending-delete' breakpoint, then 'bp' is derived
from this_hbp_kernel[] which contains the old value. The callback
bp->triggered() is invoked and the exception returns fine.

Let me know if I am missing any possibility leading to incorrect
behaviour. 

> > > Another problem: kdr7 is a global variable, and here you've got every
> > > CPU recomputing it whenever a kernel breakpoint is added or removed.
> > > It should be computed just once, before the on_each_cpu() call.
> > > 
> > 
> > If kdr7 needs to be updated only once, it has to be kept outside the IPI
> > through the use of a wrapper routine (in arch/x86/kernel/hw_breakpoint.c
> > as it is arch-specific). This would mean one more function call in
> > (un)register_kernel_<> routines taking the code back to one of its previous
> > designs. In a trade-off between code-brevity and efficiency, the present one
> > chose the former keeping in mind some of the comments received during the
> > early stages of this patch.
> 
> You don't appreciate the seriousness of this problem.  Here's the new code:
> 
> +	/* Clear all kernel-space breakpoints in kdr7 */
> +	kdr7 = 0;
> (A)
> +	for (i = hbp_kernel_pos; i < HB_NUM; i++) {
> +		per_cpu(this_hbp_kernel[i], cpu) = bp = hbp_kernel[i];
> +		if (bp) {
> +			kdr7 |= encode_dr7(i, bp->info.len, bp->info.type);
> +			set_debugreg(hbp_kernel[i]->info.address, i);
> +		}
> +	}
> +
> +	/* No need to set DR6. Update the debug registers with kernel-space
> +	 * breakpoint values from kdr7 and user-space requests from the
> +	 * current process
> +	 */
> (B)
> +	set_debugreg(kdr7 | current->thread.debugreg7, 7);
> 
> Suppose you send out the IPIs, and one of the CPUs happens to reach (A)
> at the same time another CPU reaches (B).  The second CPU will load an
> incorrect value into DR7.
> 
> If you prefer you can replace kdr7 above with a local variable, and 
> set kdr7 to the local variable's value at the end of the function.  
> Also, in the first set_debugreg() line above, you can replace 
> hbp_kernel[i] with bp.
> 
>

I failed to realise the potential inconsistency in kdr7 value. The code
will be modified like this:

void arch_update_kernel_hw_breakpoints(void *unused)
{
        struct hw_breakpoint *bp;
        int i, cpu = get_cpu();
        unsigned long temp_kdr7 = 0;

        /* Don't allow debug exceptions while we update the registers */
        set_debugreg(0UL, 7);

        for (i = hbp_kernel_pos; i < HB_NUM; i++) {
                per_cpu(this_hbp_kernel[i], cpu) = bp = hbp_kernel[i];
                if (bp) {
                        temp_kdr7 |= encode_dr7(i, bp->info.len, bp->info.type);
                        set_debugreg(bp->info.address, i);
                }
        }

        /* No need to set DR6. Update the debug registers with kernel-space
         * breakpoint values from temp_kdr7 and user-space requests from
         * the current process
         */
        kdr7 = temp_kdr7;
        set_debugreg(kdr7 | current->thread.debugreg7, 7);
        put_cpu_no_resched();
}

 
> > > > +	void *temp_mem = kzalloc(sizeof(struct hw_breakpoint), GFP_KERNEL);
> > > > +	if (!temp_mem)
> > > > +		temp_mem_used = -ENOMEM;
> > > 
> > > I don't think this is a good idea...
> > > 
> > 
> > I agree that it turned out to be wrong. ptrace is now modified to use
> > the (un)register_user_hw_breakpoint() interfaces directly and not the
> > worker routines, thereby avoiding all this complexity. Please find the
> > changes in the new patch.
> 
> > > And now if something went wrong, you have already freed the memory
> > > holding the original breakpoint structures.  It would be better to
> > > keep them around until you know they aren't going to be needed.
> 
> I still think it would be a good idea to avoid freeing any breakpoint
> structures until you know for certain they aren't needed.  Otherwise
> you might find that you're unable to allocate memory to re-create a
> structure that was already present.
> 
> Alan Stern
>

I agree that we shouldn't free memory temporarily that would be needed
later, as memory may not be available. However such a situation does not
arise in ptrace_write_dr7() after the re-write using
(un)register_user_hw_breakpoint(). kfree(bp) is done when
i) an existing breakpoint is disabled
ii) a breakpoint is requested but register_user_hw_breakpoint() routine
fails for some reasons. Hence the breakpoint structure kzalloc()'ed
afresh is kfree()'ed.
iii) In case of modifications to the type of breakpoint (such as
read<-->write), the breakpoint structure is not de-allocated and this is
where __modify_user_hw_breakpoint() helps us. There is no opportunity
window where the breakpoint can be grabbed by other requests when a
modification is done.

Let me know if you think there's still some discrepancy here.

Thanks for your detailed review and pointing me to the errors. I am
sending the patchset again with the changes discussed above.

Thanks,
K.Prasad

 

  reply	other threads:[~2009-05-11 17:20 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-24  6:13 [Patch 00/12] Hardware Breakpoint Interfaces K.Prasad
2009-05-04 20:55 ` Alan Stern
2009-05-11 11:36   ` K.Prasad
2009-05-11 14:50     ` Alan Stern
2009-05-11 17:20       ` K.Prasad [this message]
2009-05-11 18:09         ` Alan Stern
2009-05-12 14:05           ` K.Prasad
2009-05-12 14:55             ` Alan Stern
2009-05-12 17:12               ` K.Prasad
2009-05-12 20:39                 ` Alan Stern
  -- strict thread matches above, loose matches on Subject: below --
2009-05-11 11:51 [Patch 00/12] Hardware Breakpoint interfaces K.Prasad
2009-05-13 16:12 K.Prasad
2009-05-14 20:02 ` Alan Stern
2009-05-14 20:08   ` K.Prasad
2009-05-14 20:45     ` K.Prasad
2009-05-15 10:53 K.Prasad
2009-05-16  0:23 K.Prasad
2009-05-21 14:00 K.Prasad
2009-05-30 10:47 K.Prasad
2009-06-01 18:12 K.Prasad

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090511172034.GA8774@in.ibm.com \
    --to=prasad@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=benh@au1.ibm.com \
    --cc=fweisbec@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maneesh@linux.vnet.ibm.com \
    --cc=mingo@elte.hu \
    --cc=roland@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=stern@rowland.harvard.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.