LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [v1][PATCH 7/7] book3e/kexec/kdump: recover "r4 = 0" to create the initial TLB
From: Tiejun Chen @ 2013-02-26  9:19 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1361870382-743-1-git-send-email-tiejun.chen@windriver.com>

In commit 96f013f, "powerpc/kexec: Add kexec "hold" support for Book3e
processors", requires that GPR4 survive the "hold" process, for IBM Blue
Gene/Q with with some very strange firmware. But for FSL Book3E, r4 = 1
to indicate that the initial TLB entry for this core already exists so
we still should set r4 with 0 to create that initial TLB.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/head_64.S |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 038e81d..e60f078 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -129,6 +129,10 @@ __secondary_hold:
 	/* Grab our physical cpu number */
 	mr	r24,r3
 	/* stash r4 for book3e */
+#ifdef CONFIG_PPC_FSL_BOOK3E
+	/* we need to setup initial TLB entry. */
+	li	r4,0
+#endif
 	mr	r25,r4
 
 	/* Tell the master cpu we're here */
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH 1/1] book3e/corenet64: increase CPU numbers by default
From: Tiejun Chen @ 2013-02-26  9:33 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev

Currently we already support p5040ds which has 4 e5500 cores, but
twelve dual-threaded e6500 cores are also built on T4240, we can
change CONFIG_NR_CPUS with this value now.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/configs/corenet64_smp_defconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/configs/corenet64_smp_defconfig b/arch/powerpc/configs/corenet64_smp_defconfig
index 7375961..7b0dfe3 100644
--- a/arch/powerpc/configs/corenet64_smp_defconfig
+++ b/arch/powerpc/configs/corenet64_smp_defconfig
@@ -2,7 +2,7 @@ CONFIG_PPC64=y
 CONFIG_PPC_BOOK3E_64=y
 # CONFIG_VIRT_CPU_ACCOUNTING is not set
 CONFIG_SMP=y
-CONFIG_NR_CPUS=2
+CONFIG_NR_CPUS=24
 CONFIG_EXPERIMENTAL=y
 CONFIG_SYSVIPC=y
 CONFIG_BSD_PROCESS_ACCT=y
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Lai Jiangshan @ 2013-02-26 12:59 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, vincent.guittot, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512C7A38.8060906@linux.vnet.ibm.com>

On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> Hi Lai,
>>>
>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>> Hi, Srivatsa,
>>>>
>>>> The target of the whole patchset is nice for me.
>>>
>>> Cool! Thanks :-)
>>>
> [...]
>>>> I wrote an untested draft here.
>>>>
>>>> Thanks,
>>>> Lai
>>>>
>>>> PS: Some HA tools(I'm writing one) which takes checkpoints of
>>>> virtual-machines frequently, I guess this patchset can speedup the
>>>> tools.
>>>>
>>>> From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> Date: Mon, 25 Feb 2013 23:14:27 +0800
>>>> Subject: [PATCH] lglock: add read-preference local-global rwlock
>>>>
>>>> locality via lglock(trylock)
>>>> read-preference read-write-lock via fallback rwlock_t
>>>>
>>>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> ---
>>>>  include/linux/lglock.h |   31 +++++++++++++++++++++++++++++++
>>>>  kernel/lglock.c        |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 76 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/lglock.h b/include/linux/lglock.h
>>>> index 0d24e93..30fe887 100644
>>>> --- a/include/linux/lglock.h
>>>> +++ b/include/linux/lglock.h
>>>> @@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
>>>>  void lg_global_lock(struct lglock *lg);
>>>>  void lg_global_unlock(struct lglock *lg);
>>>>
>>>> +struct lgrwlock {
>>>> +     unsigned long __percpu *fallback_reader_refcnt;
>>>> +     struct lglock lglock;
>>>> +     rwlock_t fallback_rwlock;
>>>> +};
>>>> +
>>>> +#define DEFINE_LGRWLOCK(name)                                                \
>>>> +     static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)           \
>>>> +     = __ARCH_SPIN_LOCK_UNLOCKED;                                    \
>>>> +     static DEFINE_PER_CPU(unsigned long, name ## _refcnt);          \
>>>> +     struct lgrwlock name = {                                        \
>>>> +             .fallback_reader_refcnt = &name ## _refcnt,             \
>>>> +             .lglock = { .lock = &name ## _lock } }
>>>> +
>>>> +#define DEFINE_STATIC_LGRWLOCK(name)                                 \
>>>> +     static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)           \
>>>> +     = __ARCH_SPIN_LOCK_UNLOCKED;                                    \
>>>> +     static DEFINE_PER_CPU(unsigned long, name ## _refcnt);          \
>>>> +     static struct lgrwlock name = {                                 \
>>>> +             .fallback_reader_refcnt = &name ## _refcnt,             \
>>>> +             .lglock = { .lock = &name ## _lock } }
>>>> +
>>>> +static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
>>>> +{
>>>> +     lg_lock_init(&lgrw->lglock, name);
>>>> +}
>>>> +
>>>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
>>>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
>>>> +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
>>>> +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
>>>>  #endif
>>>> diff --git a/kernel/lglock.c b/kernel/lglock.c
>>>> index 6535a66..463543a 100644
>>>> --- a/kernel/lglock.c
>>>> +++ b/kernel/lglock.c
>>>> @@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
>>>>       preempt_enable();
>>>>  }
>>>>  EXPORT_SYMBOL(lg_global_unlock);
>>>> +
>>>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>>>> +{
>>>> +     struct lglock *lg = &lgrw->lglock;
>>>> +
>>>> +     preempt_disable();
>>>> +     if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
>>>> +             if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
>>>> +                     rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>>>> +                     return;
>>>> +             }
>>>> +             read_lock(&lgrw->fallback_rwlock);
>>>> +     }
>>>> +
>>>> +     __this_cpu_inc(*lgrw->fallback_reader_refcnt);
>>>> +}
>>>> +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
>>>> +
>>>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>>>> +{
>>>> +     if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
>>>> +             lg_local_unlock(&lgrw->lglock);
>>>> +             return;
>>>> +     }
>>>> +
>>>> +     if (!__this_cpu_dec_return(*lgrw->fallback_reader_refcnt))
>>>> +             read_unlock(&lgrw->fallback_rwlock);
>>>> +
>>>> +     preempt_enable();
>>>> +}
>>>> +EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
>>>> +
>>>
>>> If I read the code above correctly, all you are doing is implementing a
>>> recursive reader-side primitive (ie., allowing the reader to call these
>>> functions recursively, without resulting in a self-deadlock).
>>>
>>> But the thing is, making the reader-side recursive is the least of our
>>> problems! Our main challenge is to make the locking extremely flexible
>>> and also safe-guard it against circular-locking-dependencies and deadlocks.
>>> Please take a look at the changelog of patch 1 - it explains the situation
>>> with an example.
>>
>>
>> My lock fixes your requirements(I read patch 1-6 before I sent). In
>> readsite, lglock 's lock is token via trylock, the lglock doesn't
>> contribute to deadlocks, we can consider it doesn't exist when we find
>> deadlock from it. And global fallback rwlock doesn't result to
>> deadlocks because it is read-preference(you need to inc the
>> fallback_reader_refcnt inside the cpu-hotplug write-side, I don't do
>> it in generic lgrwlock)
>>
>
> Ah, since you hadn't mentioned the increment at the writer-side in your
> previous email, I had missed the bigger picture of what you were trying
> to achieve.
>
>>
>> If lg_rwlock_local_read_lock() spins, which means
>> lg_rwlock_local_read_lock() spins on fallback_rwlock, and which means
>> lg_rwlock_global_write_lock() took the lgrwlock successfully and
>> return, and which means lg_rwlock_local_read_lock() will stop spinning
>> when the write side finished.
>>
>
> Unfortunately, I see quite a few issues with the code above. IIUC, the
> writer and the reader both increment the same counters. So how will the
> unlock() code in the reader path know when to unlock which of the locks?

The same as your code, the reader(which nested in write C.S.) just dec
the counters.

> (The counter-dropping-to-zero logic is not safe, since it can be updated
> due to different reasons). And now that I look at it again, in the absence
> of the writer, the reader is allowed to be recursive at the heavy cost of
> taking the global rwlock for read, every 2nd time you nest (because the
> spinlock is non-recursive).

(I did not understand your comments of this part)
nested reader is considered seldom. But if N(>=2) nested readers happen,
the overhead is:
    1 spin_try_lock() + 1 read_lock() + (N-1) __this_cpu_inc()

> Also, this lg_rwlock implementation uses 3
> different data-structures - a per-cpu spinlock, a global rwlock and
> a per-cpu refcnt, and its not immediately apparent why you need those many
> or even those many varieties.

data-structures is the same as yours.
fallback_reader_refcnt <--> reader_refcnt
per-cpu spinlock <--> write_signal
fallback_rwlock  <---> global_rwlock

> Also I see that this doesn't handle the
> case of interrupt-handlers also being readers.

handled. nested reader will see the ref or take the fallback_rwlock

>
> IMHO, the per-cpu rwlock scheme that I have implemented in this patchset
> has a clean, understandable design and just enough data-structures/locks
> to achieve its goal and has several optimizations (like reducing the
> interrupts-disabled time etc) included - all in a very straight-forward
> manner. Since this is non-trivial, IMHO, starting from a clean slate is
> actually better than trying to retrofit the logic into some locking scheme
> which we actively want to avoid (and hence effectively we aren't even
> borrowing anything from!).
>
> To summarize, if you are just pointing out that we can implement the same
> logic by altering lglocks, then sure, I acknowledge the possibility.
> However, I don't think doing that actually makes it better; it either
> convolutes the logic unnecessarily, or ends up looking _very_ similar to
> the implementation in this patchset, from what I can see.
>
> Regards,
> Srivatsa S. Bhat
>

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Lai Jiangshan @ 2013-02-26 13:34 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	Lai Jiangshan, netdev, oleg, vincent.guittot, sbw, tj, akpm,
	linuxppc-dev
In-Reply-To: <512BBAD8.8010006@linux.vnet.ibm.com>

On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Hi Lai,
>
> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>> Hi, Srivatsa,
>>
>> The target of the whole patchset is nice for me.
>
> Cool! Thanks :-)
>
>> A question: How did you find out the such usages of
>> "preempt_disable()" and convert them? did all are converted?
>>
>
> Well, I scanned through the source tree for usages which implicitly
> disabled CPU offline and converted them over.

How do you scan? could you show the way you scan the source tree.
I can follow your instructions for double checking.

> Its not limited to uses
> of preempt_disable() alone - even spin_locks, rwlocks, local_irq_disable()
> etc also help disable CPU offline. So I tried to dig out all such uses
> and converted them. However, since the merge window is open, a lot of
> new code is flowing into the tree. So I'll have to rescan the tree to
> see if there are any more places to convert.

I remember some code has such assumption:
    preempt_disable() (or something else)
    //the code assume that the cpu_online_map can't be changed.
    preempt_enable()

It is very hard to find out all such kinds of assumptions and fixes them.
(I notice your code mainly fixes code around send_xxxx())


>
>> And I think the lock is too complex and reinvent the wheel, why don't
>> you reuse the lglock?
>
> lglocks? No way! ;-) See below...
>
>> I wrote an untested draft here.
>>
>> Thanks,
>> Lai
>>
>> PS: Some HA tools(I'm writing one) which takes checkpoints of
>> virtual-machines frequently, I guess this patchset can speedup the
>> tools.
>>
>> From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Date: Mon, 25 Feb 2013 23:14:27 +0800
>> Subject: [PATCH] lglock: add read-preference local-global rwlock
>>
>> locality via lglock(trylock)
>> read-preference read-write-lock via fallback rwlock_t
>>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> ---
>>  include/linux/lglock.h |   31 +++++++++++++++++++++++++++++++
>>  kernel/lglock.c        |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 76 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/lglock.h b/include/linux/lglock.h
>> index 0d24e93..30fe887 100644
>> --- a/include/linux/lglock.h
>> +++ b/include/linux/lglock.h
>> @@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
>>  void lg_global_lock(struct lglock *lg);
>>  void lg_global_unlock(struct lglock *lg);
>>
>> +struct lgrwlock {
>> +     unsigned long __percpu *fallback_reader_refcnt;
>> +     struct lglock lglock;
>> +     rwlock_t fallback_rwlock;
>> +};
>> +
>> +#define DEFINE_LGRWLOCK(name)                                                \
>> +     static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)           \
>> +     = __ARCH_SPIN_LOCK_UNLOCKED;                                    \
>> +     static DEFINE_PER_CPU(unsigned long, name ## _refcnt);          \
>> +     struct lgrwlock name = {                                        \
>> +             .fallback_reader_refcnt = &name ## _refcnt,             \
>> +             .lglock = { .lock = &name ## _lock } }
>> +
>> +#define DEFINE_STATIC_LGRWLOCK(name)                                 \
>> +     static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)           \
>> +     = __ARCH_SPIN_LOCK_UNLOCKED;                                    \
>> +     static DEFINE_PER_CPU(unsigned long, name ## _refcnt);          \
>> +     static struct lgrwlock name = {                                 \
>> +             .fallback_reader_refcnt = &name ## _refcnt,             \
>> +             .lglock = { .lock = &name ## _lock } }
>> +
>> +static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
>> +{
>> +     lg_lock_init(&lgrw->lglock, name);
>> +}
>> +
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
>> +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
>> +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
>>  #endif
>> diff --git a/kernel/lglock.c b/kernel/lglock.c
>> index 6535a66..463543a 100644
>> --- a/kernel/lglock.c
>> +++ b/kernel/lglock.c
>> @@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
>>       preempt_enable();
>>  }
>>  EXPORT_SYMBOL(lg_global_unlock);
>> +
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +     struct lglock *lg = &lgrw->lglock;
>> +
>> +     preempt_disable();
>> +     if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
>> +             if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
>> +                     rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>> +                     return;
>> +             }
>> +             read_lock(&lgrw->fallback_rwlock);
>> +     }
>> +
>> +     __this_cpu_inc(*lgrw->fallback_reader_refcnt);
>> +}
>> +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
>> +
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +     if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
>> +             lg_local_unlock(&lgrw->lglock);
>> +             return;
>> +     }
>> +
>> +     if (!__this_cpu_dec_return(*lgrw->fallback_reader_refcnt))
>> +             read_unlock(&lgrw->fallback_rwlock);
>> +
>> +     preempt_enable();
>> +}
>> +EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
>> +
>
> If I read the code above correctly, all you are doing is implementing a
> recursive reader-side primitive (ie., allowing the reader to call these
> functions recursively, without resulting in a self-deadlock).
>
> But the thing is, making the reader-side recursive is the least of our
> problems! Our main challenge is to make the locking extremely flexible
> and also safe-guard it against circular-locking-dependencies and deadlocks.
> Please take a look at the changelog of patch 1 - it explains the situation
> with an example.
>
>> +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
>> +{
>> +     lg_global_lock(&lgrw->lglock);
>
> This does a for-loop on all CPUs and takes their locks one-by-one. That's
> exactly what we want to prevent, because that is the _source_ of all our
> deadlock woes in this case. In the presence of perfect lock ordering
> guarantees, this wouldn't have been a problem (that's why lglocks are
> being used successfully elsewhere in the kernel). In the stop-machine()
> removal case, the over-flexibility of preempt_disable() forces us to provide
> an equally flexible locking alternative. Hence we can't use such per-cpu
> locking schemes.
>
> You might note that, for exactly this reason, I haven't actually used any
> per-cpu _locks_ in this synchronization scheme, though it is named as
> "per-cpu rwlocks". The only per-cpu component here are the refcounts, and
> we consciously avoid waiting/spinning on them (because then that would be
> equivalent to having per-cpu locks, which are deadlock-prone). We use
> global rwlocks to get the deadlock-safety that we need.
>
>> +     write_lock(&lgrw->fallback_rwlock);
>> +}
>> +EXPORT_SYMBOL(lg_rwlock_global_write_lock);
>> +
>> +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
>> +{
>> +     write_unlock(&lgrw->fallback_rwlock);
>> +     lg_global_unlock(&lgrw->lglock);
>> +}
>> +EXPORT_SYMBOL(lg_rwlock_global_write_unlock);
>>
>
> Regards,
> Srivatsa S. Bhat
>

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Lai Jiangshan @ 2013-02-26 14:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
	linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, vincent.guittot, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <20130218123856.26245.46705.stgit@srivatsabhat.in.ibm.com>

On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> lock-ordering related problems (unlike per-cpu locks). However, global
> rwlocks lead to unnecessary cache-line bouncing even when there are no
> writers present, which can slow down the system needlessly.

per-CPU rwlocks(yours and mine) are the exactly same as rwlock_t in
the view of lock dependency(except reader-C.S. can be nested in
writer-C.S.)
so they can deadlock in this order:

spin_lock(some_lock);				percpu_write_lock_irqsave()
						case CPU_DYING
percpu_read_lock_irqsafe();	<---deadlock--->	spin_lock(some_lock);

The lockdep can find out such dependency, but we must try our best to
find out them before merge the patchset to mainline. We can  review
all the code of cpu_disable() and CPU_DYING and fix this kinds of lock
dependency, but it is not easy thing, it may be a long term project.

======
And if there is any CPU_DYING code takes no locks and do some
works(because they know they are called via stop_machine()) we need to
add that locking code back if there is such code.(I don't know whether
such code exist or not)

>
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
>
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>
> IOW, the readers just increment/decrement their per-cpu refcounts (disabling
> interrupts during the updates, if necessary) when no writer is active.
> When a writer becomes active, he signals all readers to switch to global
> rwlocks for the duration of his activity. The readers switch over when it
> is safe for them (ie., when they are about to start a fresh, non-nested
> read-side critical section) and start using (holding) the global rwlock for
> read in their subsequent critical sections.
>
> The writer waits for every existing reader to switch, and then acquires the
> global rwlock for write and enters his critical section. Later, the writer
> signals all readers that he is done, and that they can go back to using their
> per-cpu refcounts again.
>
> Note that the lock-safety (despite the per-cpu scheme) comes from the fact
> that the readers can *choose* _when_ to switch to rwlocks upon the writer's
> signal. And the readers don't wait on anybody based on the per-cpu counters.
> The only true synchronization that involves waiting at the reader-side in this
> scheme, is the one arising from the global rwlock, which is safe from circular
> locking dependency issues.
>
> Reader-writer locks and per-cpu counters are recursive, so they can be
> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
> recursive. Also, this design of switching the synchronization scheme ensures
> that you can safely nest and use these locks in a very flexible manner.
>
> I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
> suggestions and ideas, which inspired and influenced many of the decisions in
> this as well as previous designs. Thanks a lot Michael and Xiao!
>
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>  lib/percpu-rwlock.c |  139 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 137 insertions(+), 2 deletions(-)
>
> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
> index f938096..edefdea 100644
> --- a/lib/percpu-rwlock.c
> +++ b/lib/percpu-rwlock.c
> @@ -27,6 +27,24 @@
>  #include <linux/percpu-rwlock.h>
>  #include <linux/errno.h>
>
> +#include <asm/processor.h>
> +
> +
> +#define reader_yet_to_switch(pcpu_rwlock, cpu)                             \
> +       (ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, cpu)->reader_refcnt))
> +
> +#define reader_percpu_nesting_depth(pcpu_rwlock)                 \
> +       (__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
> +
> +#define reader_uses_percpu_refcnt(pcpu_rwlock)                         \
> +                               reader_percpu_nesting_depth(pcpu_rwlock)
> +
> +#define reader_nested_percpu(pcpu_rwlock)                              \
> +                       (reader_percpu_nesting_depth(pcpu_rwlock) > 1)
> +
> +#define writer_active(pcpu_rwlock)                                     \
> +       (__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))
> +
>
>  int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
>                          const char *name, struct lock_class_key *rwlock_key)
> @@ -55,21 +73,138 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
>
>  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -       read_lock(&pcpu_rwlock->global_rwlock);
> +       preempt_disable();
> +
> +       /*
> +        * Let the writer know that a reader is active, even before we choose
> +        * our reader-side synchronization scheme.
> +        */
> +       this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
> +
> +       /*
> +        * If we are already using per-cpu refcounts, it is not safe to switch
> +        * the synchronization scheme. So continue using the refcounts.
> +        */
> +       if (reader_nested_percpu(pcpu_rwlock))
> +               return;
> +
> +       /*
> +        * The write to 'reader_refcnt' must be visible before we read
> +        * 'writer_signal'.
> +        */
> +       smp_mb();
> +
> +       if (likely(!writer_active(pcpu_rwlock))) {
> +               goto out;
> +       } else {
> +               /* Writer is active, so switch to global rwlock. */
> +               read_lock(&pcpu_rwlock->global_rwlock);
> +
> +               /*
> +                * We might have raced with a writer going inactive before we
> +                * took the read-lock. So re-evaluate whether we still need to
> +                * hold the rwlock or if we can switch back to per-cpu
> +                * refcounts. (This also helps avoid heterogeneous nesting of
> +                * readers).
> +                */
> +               if (writer_active(pcpu_rwlock)) {
> +                       /*
> +                        * The above writer_active() check can get reordered
> +                        * with this_cpu_dec() below, but this is OK, because
> +                        * holding the rwlock is conservative.
> +                        */
> +                       this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
> +               } else {
> +                       read_unlock(&pcpu_rwlock->global_rwlock);
> +               }
> +       }
> +
> +out:
> +       /* Prevent reordering of any subsequent reads/writes */
> +       smp_mb();
>  }
>
>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -       read_unlock(&pcpu_rwlock->global_rwlock);
> +       /*
> +        * We never allow heterogeneous nesting of readers. So it is trivial
> +        * to find out the kind of reader we are, and undo the operation
> +        * done by our corresponding percpu_read_lock().
> +        */
> +
> +       /* Try to fast-path: a nested percpu reader is the simplest case */
> +       if (reader_nested_percpu(pcpu_rwlock)) {
> +               this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
> +               preempt_enable();
> +               return;
> +       }
> +
> +       /*
> +        * Now we are left with only 2 options: a non-nested percpu reader,
> +        * or a reader holding rwlock
> +        */
> +       if (reader_uses_percpu_refcnt(pcpu_rwlock)) {
> +               /*
> +                * Complete the critical section before decrementing the
> +                * refcnt. We can optimize this away if we are a nested
> +                * reader (the case above).
> +                */
> +               smp_mb();
> +               this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
> +       } else {
> +               read_unlock(&pcpu_rwlock->global_rwlock);
> +       }
> +
> +       preempt_enable();
>  }
>
>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +       unsigned int cpu;
> +
> +       /*
> +        * Tell all readers that a writer is becoming active, so that they
> +        * start switching over to the global rwlock.
> +        */
> +       for_each_possible_cpu(cpu)
> +               per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = true;
> +
> +       smp_mb();
> +
> +       /*
> +        * Wait for every reader to see the writer's signal and switch from
> +        * percpu refcounts to global rwlock.
> +        *
> +        * If a reader is still using percpu refcounts, wait for him to switch.
> +        * Else, we can safely go ahead, because either the reader has already
> +        * switched over, or the next reader that comes along on that CPU will
> +        * notice the writer's signal and will switch over to the rwlock.
> +        */
> +
> +       for_each_possible_cpu(cpu) {
> +               while (reader_yet_to_switch(pcpu_rwlock, cpu))
> +                       cpu_relax();
> +       }
> +
> +       smp_mb(); /* Complete the wait-for-readers, before taking the lock */
>         write_lock(&pcpu_rwlock->global_rwlock);
>  }
>
>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +       unsigned int cpu;
> +
> +       /* Complete the critical section before clearing ->writer_signal */
> +       smp_mb();
> +
> +       /*
> +        * Inform all readers that we are done, so that they can switch back
> +        * to their per-cpu refcounts. (We don't need to wait for them to
> +        * see it).
> +        */
> +       for_each_possible_cpu(cpu)
> +               per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = false;
> +
>         write_unlock(&pcpu_rwlock->global_rwlock);
>  }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-02-26 14:22 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, vincent.guittot, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <CACvQF51RvhM8XNqu3-1OTQ=cZUZ=pAX+_zx0a29+W67N1X+BgQ@mail.gmail.com>

Hi Lai,

I'm really not convinced that piggy-backing on lglocks would help
us in any way. But still, let me try to address some of the points
you raised...

On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>> Hi Lai,
>>>>
>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>> Hi, Srivatsa,
>>>>>
>>>>> The target of the whole patchset is nice for me.
>>>>
>>>> Cool! Thanks :-)
>>>>
>> [...]
>>
>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>> writer and the reader both increment the same counters. So how will the
>> unlock() code in the reader path know when to unlock which of the locks?
> 
> The same as your code, the reader(which nested in write C.S.) just dec
> the counters.

And that works fine in my case because the writer and the reader update
_two_ _different_ counters. If both of them update the same counter, there
will be a semantic clash - an increment of the counter can either mean that
a new writer became active, or it can also indicate a nested reader. A decrement
can also similarly have 2 meanings. And thus it will be difficult to decide
the right action to take, based on the value of the counter.

> 
>> (The counter-dropping-to-zero logic is not safe, since it can be updated
>> due to different reasons). And now that I look at it again, in the absence
>> of the writer, the reader is allowed to be recursive at the heavy cost of
>> taking the global rwlock for read, every 2nd time you nest (because the
>> spinlock is non-recursive).
> 
> (I did not understand your comments of this part)
> nested reader is considered seldom.

No, nested readers can be _quite_ frequent. Because, potentially all users
of preempt_disable() are readers - and its well-known how frequently we
nest preempt_disable(). As a simple example, any atomic reader who calls
smp_call_function() will become a nested reader, because smp_call_function()
itself is a reader. So reader nesting is expected to be quite frequent.

> But if N(>=2) nested readers happen,
> the overhead is:
>     1 spin_try_lock() + 1 read_lock() + (N-1) __this_cpu_inc()
> 

In my patch, its just this_cpu_inc(). Note that these are _very_ hot paths.
So every bit of optimization that you can add is worthwhile.

And your read_lock() is a _global_ lock - thus, it can lead to a lot of
cache-line bouncing. That's *exactly* why I have used per-cpu refcounts in
my synchronization scheme, to avoid taking the global rwlock as much as possible.

Another important point to note is that, the overhead we are talking about
here, exists even when _not_ performing hotplug. And its the replacement to
the super-fast preempt_disable(). So its extremely important to consciously
minimize this overhead - else we'll end up slowing down the system significantly.

>> Also, this lg_rwlock implementation uses 3
>> different data-structures - a per-cpu spinlock, a global rwlock and
>> a per-cpu refcnt, and its not immediately apparent why you need those many
>> or even those many varieties.
> 
> data-structures is the same as yours.
> fallback_reader_refcnt <--> reader_refcnt

This has semantic problems, as noted above.

> per-cpu spinlock <--> write_signal

Acquire/release of (spin) lock is costlier than inc/dec of a counter, IIUC.

> fallback_rwlock  <---> global_rwlock
> 
>> Also I see that this doesn't handle the
>> case of interrupt-handlers also being readers.
> 
> handled. nested reader will see the ref or take the fallback_rwlock
>

I'm not referring to simple nested readers here, but interrupt handlers who
can act as readers. For starters, the arch_spin_trylock() is not safe when
interrupt handlers can also run the same code, right? You'll need to save
and restore interrupts at critical points in the code. Also, the __foo()
variants used to read/update the counters are not interrupt-safe. And,
the unlock() code in the reader path is again going to be confused about
what to do when interrupt handlers interrupt regular readers, due to the
messed up refcount.

>>
>> IMHO, the per-cpu rwlock scheme that I have implemented in this patchset
>> has a clean, understandable design and just enough data-structures/locks
>> to achieve its goal and has several optimizations (like reducing the
>> interrupts-disabled time etc) included - all in a very straight-forward
>> manner. Since this is non-trivial, IMHO, starting from a clean slate is
>> actually better than trying to retrofit the logic into some locking scheme
>> which we actively want to avoid (and hence effectively we aren't even
>> borrowing anything from!).
>>
>> To summarize, if you are just pointing out that we can implement the same
>> logic by altering lglocks, then sure, I acknowledge the possibility.
>> However, I don't think doing that actually makes it better; it either
>> convolutes the logic unnecessarily, or ends up looking _very_ similar to
>> the implementation in this patchset, from what I can see.
>>

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-02-26 14:37 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-doc, peterz, fweisbec, linux-kernel, namhyung, walken,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, vincent.guittot, tglx,
	linux-arm-kernel, netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <CACvQF529c72xLuosD44CPd_y_0zJBc-tjgNQpBiD0UhbzS8XDg@mail.gmail.com>

On 02/26/2013 07:47 PM, Lai Jiangshan wrote:
> On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>> lock-ordering related problems (unlike per-cpu locks). However, global
>> rwlocks lead to unnecessary cache-line bouncing even when there are no
>> writers present, which can slow down the system needlessly.
> 
> per-CPU rwlocks(yours and mine) are the exactly same as rwlock_t in
> the view of lock dependency(except reader-C.S. can be nested in
> writer-C.S.)
> so they can deadlock in this order:
> 
> spin_lock(some_lock);				percpu_write_lock_irqsave()
> 						case CPU_DYING
> percpu_read_lock_irqsafe();	<---deadlock--->	spin_lock(some_lock);
> 

Yes, of course! But this is the most straight-forward of cases to fix,
because there are a well-defined number of CPU_DYING notifiers, and the fix
is to make the lock ordering same at both sides.

The real challenge is with cases like below:

spin_lock(some_lock)			percpu_read_lock_irqsafe()

percpu_read_lock_irqsafe()		spin_lock(some_lock)

The locking primitives (percpu_read_lock/unlock_irqsafe) have been explicitly
designed to keep cases like the above deadlock-free. Because, they ease
conversions from preempt_disable() to the new locking primitive without
lock-ordering headache.

> The lockdep can find out such dependency, but we must try our best to
> find out them before merge the patchset to mainline. We can  review
> all the code of cpu_disable() and CPU_DYING and fix this kinds of lock
> dependency, but it is not easy thing, it may be a long term project.
> 

:-)

That's exactly what I have done in this patchset, and that's why there
are 46 patches in this series! :-) I have run this patchset with lockdep
turned on, and verified that there are no locking problems due to the
conversion. I even posted out performance numbers from this patchset
(ie., in comparison to stop_machine), if you are curious...
http://article.gmane.org/gmane.linux.kernel/1435249

But yes, I'll have to re-verify this because of the new code that went in
during this merge window.

> ======
> And if there is any CPU_DYING code takes no locks and do some
> works(because they know they are called via stop_machine()) we need to
> add that locking code back if there is such code.(I don't know whether
> such code exist or not)
>

Yes, I explicitly verified this too. (I had mentioned this in the cover letter).

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-02-26 15:17 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	Lai Jiangshan, netdev, oleg, vincent.guittot, sbw, tj, akpm,
	linuxppc-dev
In-Reply-To: <CACvQF53gcuEfgM=L8-mPTt4W3GFMte-w6eyO0iv2CobAN68SNw@mail.gmail.com>

On 02/26/2013 07:04 PM, Lai Jiangshan wrote:
> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Hi Lai,
>>
>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>> Hi, Srivatsa,
>>>
>>> The target of the whole patchset is nice for me.
>>
>> Cool! Thanks :-)
>>
>>> A question: How did you find out the such usages of
>>> "preempt_disable()" and convert them? did all are converted?
>>>
>>
>> Well, I scanned through the source tree for usages which implicitly
>> disabled CPU offline and converted them over.
> 
> How do you scan? could you show the way you scan the source tree.
> I can follow your instructions for double checking.
> 

Its nothing special. I grepped the source tree for anything dealing with
cpu_online_mask or its derivatives and also for functions/constructs that
rely on the cpumasks internally (eg: smp_call_function). Then I audited all
such call-sites and converted them (if needed) accordingly.

>> Its not limited to uses
>> of preempt_disable() alone - even spin_locks, rwlocks, local_irq_disable()
>> etc also help disable CPU offline. So I tried to dig out all such uses
>> and converted them. However, since the merge window is open, a lot of
>> new code is flowing into the tree. So I'll have to rescan the tree to
>> see if there are any more places to convert.
> 
> I remember some code has such assumption:
>     preempt_disable() (or something else)
>     //the code assume that the cpu_online_map can't be changed.
>     preempt_enable()
> 
> It is very hard to find out all such kinds of assumptions and fixes them.
> (I notice your code mainly fixes code around send_xxxx())
> 

The conversion can be carried out using the method I mentioned above.

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* [ 3.5.y.z extended stable ] Patch "uprobes/powerpc: Add dependency on single step emulation" has been added to staging queue
From: Luis Henriques @ 2013-02-26 16:12 UTC (permalink / raw)
  To: Suzuki K. Poulose; +Cc: Luis Henriques, linuxppc-dev, kernel-team

This is a note to let you know that I have just added a patch titled

    uprobes/powerpc: Add dependency on single step emulation

to the linux-3.5.y-queue branch of the 3.5.y.z extended stable tree 
which can be found at:

 http://kernel.ubuntu.com/git?p=ubuntu/linux.git;a=shortlog;h=refs/heads/linux-3.5.y-queue

If you, or anyone else, feels it should not be added to this tree, please 
reply to this email.

For more information about the 3.5.y.z tree, see
https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable

Thanks.
-Luis

------

>From 8f290ba295f6a2ecf5d1e2d9c80080b1407b8634 Mon Sep 17 00:00:00 2001
From: "Suzuki K. Poulose" <suzuki@in.ibm.com>
Date: Mon, 7 Jan 2013 00:26:57 +0000
Subject: [PATCH] uprobes/powerpc: Add dependency on single step emulation

commit 5e249d4528528c9a77da051a89ec7f99d31b83eb upstream.

Uprobes uses emulate_step in sstep.c, but we haven't explicitly specified
the dependency. On pseries HAVE_HW_BREAKPOINT protects us, but 44x has no
such luxury.

Consolidate other users that depend on sstep and create a new config option.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[ luis: adjust context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
---
 arch/powerpc/Kconfig      | 4 ++++
 arch/powerpc/lib/Makefile | 4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 050cb37..4d8336c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -264,6 +264,10 @@ config PPC_ADV_DEBUG_DAC_RANGE
 	depends on PPC_ADV_DEBUG_REGS && 44x
 	default y

+config PPC_EMULATE_SSTEP
+	bool
+	default y if KPROBES || UPROBES || XMON || HAVE_HW_BREAKPOINT
+
 source "init/Kconfig"

 source "kernel/Kconfig.freezer"
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 7735a2c..3230bc1 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -18,9 +18,7 @@ obj-$(CONFIG_PPC64)	+= copypage_64.o copyuser_64.o \
 			   memcpy_64.o usercopy_64.o mem_64.o string.o \
 			   checksum_wrappers_64.o hweight_64.o \
 			   copyuser_power7.o
-obj-$(CONFIG_XMON)	+= sstep.o ldstfp.o
-obj-$(CONFIG_KPROBES)	+= sstep.o ldstfp.o
-obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= sstep.o ldstfp.o
+obj-$(CONFIG_PPC_EMULATE_SSTEP)	+= sstep.o ldstfp.o

 ifeq ($(CONFIG_PPC64),y)
 obj-$(CONFIG_SMP)	+= locks.o
--
1.8.1.2

^ permalink raw reply related

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Lai Jiangshan @ 2013-02-26 16:25 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, vincent.guittot, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512CC509.1050000@linux.vnet.ibm.com>

On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
> Hi Lai,
>
> I'm really not convinced that piggy-backing on lglocks would help
> us in any way. But still, let me try to address some of the points
> you raised...
>
> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>> Hi Lai,
>>>>>
>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>> Hi, Srivatsa,
>>>>>>
>>>>>> The target of the whole patchset is nice for me.
>>>>>
>>>>> Cool! Thanks :-)
>>>>>
>>> [...]
>>>
>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>> writer and the reader both increment the same counters. So how will the
>>> unlock() code in the reader path know when to unlock which of the locks?
>>
>> The same as your code, the reader(which nested in write C.S.) just dec
>> the counters.
>
> And that works fine in my case because the writer and the reader update
> _two_ _different_ counters.

I can't find any magic in your code, they are the same counter.

        /*
         * It is desirable to allow the writer to acquire the percpu-rwlock
         * for read (if necessary), without deadlocking or getting complaints
         * from lockdep. To achieve that, just increment the reader_refcnt of
         * this CPU - that way, any attempt by the writer to acquire the
         * percpu-rwlock for read, will get treated as a case of nested percpu
         * reader, which is safe, from a locking perspective.
         */
        this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);


> If both of them update the same counter, there
> will be a semantic clash - an increment of the counter can either mean that
> a new writer became active, or it can also indicate a nested reader. A decrement
> can also similarly have 2 meanings. And thus it will be difficult to decide
> the right action to take, based on the value of the counter.
>
>>
>>> (The counter-dropping-to-zero logic is not safe, since it can be updated
>>> due to different reasons). And now that I look at it again, in the absence
>>> of the writer, the reader is allowed to be recursive at the heavy cost of
>>> taking the global rwlock for read, every 2nd time you nest (because the
>>> spinlock is non-recursive).
>>
>> (I did not understand your comments of this part)
>> nested reader is considered seldom.
>
> No, nested readers can be _quite_ frequent. Because, potentially all users
> of preempt_disable() are readers - and its well-known how frequently we
> nest preempt_disable(). As a simple example, any atomic reader who calls
> smp_call_function() will become a nested reader, because smp_call_function()
> itself is a reader. So reader nesting is expected to be quite frequent.
>
>> But if N(>=2) nested readers happen,
>> the overhead is:
>>     1 spin_try_lock() + 1 read_lock() + (N-1) __this_cpu_inc()
>>
>
> In my patch, its just this_cpu_inc(). Note that these are _very_ hot paths.
> So every bit of optimization that you can add is worthwhile.
>
> And your read_lock() is a _global_ lock - thus, it can lead to a lot of
> cache-line bouncing. That's *exactly* why I have used per-cpu refcounts in
> my synchronization scheme, to avoid taking the global rwlock as much as possible.
>
> Another important point to note is that, the overhead we are talking about
> here, exists even when _not_ performing hotplug. And its the replacement to
> the super-fast preempt_disable(). So its extremely important to consciously
> minimize this overhead - else we'll end up slowing down the system significantly.
>

All I was considered is "nested reader is seldom", so I always
fallback to rwlock when nested.
If you like, I can add 6 lines of code, the overhead is
1 spin_try_lock()(fast path)  + N  __this_cpu_inc()

The overhead of your code is
2 smp_mb() + N __this_cpu_inc()

I don't see how much different.

>>> Also, this lg_rwlock implementation uses 3
>>> different data-structures - a per-cpu spinlock, a global rwlock and
>>> a per-cpu refcnt, and its not immediately apparent why you need those many
>>> or even those many varieties.
>>
>> data-structures is the same as yours.
>> fallback_reader_refcnt <--> reader_refcnt
>
> This has semantic problems, as noted above.
>
>> per-cpu spinlock <--> write_signal
>
> Acquire/release of (spin) lock is costlier than inc/dec of a counter, IIUC.
>
>> fallback_rwlock  <---> global_rwlock
>>
>>> Also I see that this doesn't handle the
>>> case of interrupt-handlers also being readers.
>>
>> handled. nested reader will see the ref or take the fallback_rwlock
>>

Sorry, _reentrance_ read_lock() will see the ref or take the fallback_rwlock

>
> I'm not referring to simple nested readers here, but interrupt handlers who
> can act as readers. For starters, the arch_spin_trylock() is not safe when
> interrupt handlers can also run the same code, right? You'll need to save
> and restore interrupts at critical points in the code. Also, the __foo()
> variants used to read/update the counters are not interrupt-safe.

I must missed something.

Could you elaborate more why arch_spin_trylock() is not safe when
interrupt handlers can also run the same code?

Could you elaborate more why __this_cpu_op variants is not
interrupt-safe since they are always called paired.


> And,
> the unlock() code in the reader path is again going to be confused about
> what to do when interrupt handlers interrupt regular readers, due to the
> messed up refcount.

I still can't understand.

>
>>>
>>> IMHO, the per-cpu rwlock scheme that I have implemented in this patchset
>>> has a clean, understandable design and just enough data-structures/locks
>>> to achieve its goal and has several optimizations (like reducing the
>>> interrupts-disabled time etc) included - all in a very straight-forward
>>> manner. Since this is non-trivial, IMHO, starting from a clean slate is
>>> actually better than trying to retrofit the logic into some locking scheme
>>> which we actively want to avoid (and hence effectively we aren't even
>>> borrowing anything from!).
>>>
>>> To summarize, if you are just pointing out that we can implement the same
>>> logic by altering lglocks, then sure, I acknowledge the possibility.
>>> However, I don't think doing that actually makes it better; it either
>>> convolutes the logic unnecessarily, or ends up looking _very_ similar to
>>> the implementation in this patchset, from what I can see.
>>>
>

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-02-26 19:30 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, vincent.guittot, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <CACvQF50UgiqPpDk3HCmV86zKSf6tPjqr0bZowfPEOJs0g0r0MQ@mail.gmail.com>

On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
> On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>
>> Hi Lai,
>>
>> I'm really not convinced that piggy-backing on lglocks would help
>> us in any way. But still, let me try to address some of the points
>> you raised...
>>
>> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>> Hi Lai,
>>>>>>
>>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>>> Hi, Srivatsa,
>>>>>>>
>>>>>>> The target of the whole patchset is nice for me.
>>>>>>
>>>>>> Cool! Thanks :-)
>>>>>>
>>>> [...]
>>>>
>>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>>> writer and the reader both increment the same counters. So how will the
>>>> unlock() code in the reader path know when to unlock which of the locks?
>>>
>>> The same as your code, the reader(which nested in write C.S.) just dec
>>> the counters.
>>
>> And that works fine in my case because the writer and the reader update
>> _two_ _different_ counters.
> 
> I can't find any magic in your code, they are the same counter.
> 
>         /*
>          * It is desirable to allow the writer to acquire the percpu-rwlock
>          * for read (if necessary), without deadlocking or getting complaints
>          * from lockdep. To achieve that, just increment the reader_refcnt of
>          * this CPU - that way, any attempt by the writer to acquire the
>          * percpu-rwlock for read, will get treated as a case of nested percpu
>          * reader, which is safe, from a locking perspective.
>          */
>         this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
>

Whoa! Hold on, were you really referring to _this_ increment when you said
that, in your patch you would increment the refcnt at the writer? Then I guess
there is a major disconnect in our conversations. (I had assumed that you were
referring to the update of writer_signal, and were just trying to have a single
refcnt instead of reader_refcnt and writer_signal).

So, please let me clarify things a bit here. Forget about the above increment
of reader_refcnt at the writer side. Its almost utterly insignificant for our
current discussion. We can simply replace it with a check as shown below, at
the reader side:

void percpu_read_lock_irqsafe()
{
	if (current == active_writer)
		return;

	/* Rest of the code */
}

Now, assuming that, in your patch, you were trying to use the per-cpu refcnt
to allow the writer to safely take the reader path, you can simply get rid of
that percpu-refcnt, as demonstrated above.

So that would reduce your code to the following (after simplification):

lg_rwlock_local_read_lock()
{
	if (current == active_writer)
		return;
	if (arch_spin_trylock(per-cpu-spinlock))
		return;
	read_lock(global-rwlock);
}

Now, let us assume that hotplug is not happening, meaning, nobody is running
the writer side code. Now let's see what happens at the reader side in your
patch. As I mentioned earlier, the readers are _very_ frequent and can be in
very hot paths. And they also happen to be nested quite often. 

So, a non-nested reader acquires the per-cpu spinlock. Every subsequent nested
reader on that CPU has to acquire the global rwlock for read. Right there you
have 2 significant performance issues -
1. Acquiring the (spin) lock is costly
2. Acquiring the global rwlock causes cacheline bouncing, which hurts
   performance.

And why do we care so much about performance here? Because, the existing
kernel does an efficient preempt_disable() here - which is an optimized
per-cpu counter increment. Replacing that with such heavy primitives on the
reader side can be very bad.

Now, how does my patchset tackle this? My scheme just requires an increment
of a per-cpu refcnt (reader_refcnt) and memory barrier. Which is acceptable
from a performance-perspective, because IMHO its not horrendously worse than
a preempt_disable().

> 
>> If both of them update the same counter, there
>> will be a semantic clash - an increment of the counter can either mean that
>> a new writer became active, or it can also indicate a nested reader. A decrement
>> can also similarly have 2 meanings. And thus it will be difficult to decide
>> the right action to take, based on the value of the counter.
>>
>>>
>>>> (The counter-dropping-to-zero logic is not safe, since it can be updated
>>>> due to different reasons). And now that I look at it again, in the absence
>>>> of the writer, the reader is allowed to be recursive at the heavy cost of
>>>> taking the global rwlock for read, every 2nd time you nest (because the
>>>> spinlock is non-recursive).
>>>
>>> (I did not understand your comments of this part)
>>> nested reader is considered seldom.
>>
>> No, nested readers can be _quite_ frequent. Because, potentially all users
>> of preempt_disable() are readers - and its well-known how frequently we
>> nest preempt_disable(). As a simple example, any atomic reader who calls
>> smp_call_function() will become a nested reader, because smp_call_function()
>> itself is a reader. So reader nesting is expected to be quite frequent.
>>
>>> But if N(>=2) nested readers happen,
>>> the overhead is:
>>>     1 spin_try_lock() + 1 read_lock() + (N-1) __this_cpu_inc()
>>>
>>
>> In my patch, its just this_cpu_inc(). Note that these are _very_ hot paths.
>> So every bit of optimization that you can add is worthwhile.
>>
>> And your read_lock() is a _global_ lock - thus, it can lead to a lot of
>> cache-line bouncing. That's *exactly* why I have used per-cpu refcounts in
>> my synchronization scheme, to avoid taking the global rwlock as much as possible.
>>
>> Another important point to note is that, the overhead we are talking about
>> here, exists even when _not_ performing hotplug. And its the replacement to
>> the super-fast preempt_disable(). So its extremely important to consciously
>> minimize this overhead - else we'll end up slowing down the system significantly.
>>
> 
> All I was considered is "nested reader is seldom", so I always
> fallback to rwlock when nested.
> If you like, I can add 6 lines of code, the overhead is
> 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()
> 

I'm assuming that calculation is no longer valid, considering that
we just discussed how the per-cpu refcnt that you were using is quite
unnecessary and can be removed.

IIUC, the overhead with your code, as per above discussion would be:
1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).

Note that I'm referring to the scenario when hotplug is _not_ happening
(ie., nobody is running writer side code).

> The overhead of your code is
> 2 smp_mb() + N __this_cpu_inc()
> 

Right. And as you can see, this is much much better than the overhead
shown above.

> I don't see how much different.
> 
>>>> Also, this lg_rwlock implementation uses 3
>>>> different data-structures - a per-cpu spinlock, a global rwlock and
>>>> a per-cpu refcnt, and its not immediately apparent why you need those many
>>>> or even those many varieties.
>>>
>>> data-structures is the same as yours.
>>> fallback_reader_refcnt <--> reader_refcnt
>>
>> This has semantic problems, as noted above.
>>
>>> per-cpu spinlock <--> write_signal
>>
>> Acquire/release of (spin) lock is costlier than inc/dec of a counter, IIUC.
>>
>>> fallback_rwlock  <---> global_rwlock
>>>
>>>> Also I see that this doesn't handle the
>>>> case of interrupt-handlers also being readers.
>>>
>>> handled. nested reader will see the ref or take the fallback_rwlock
>>>
> 
> Sorry, _reentrance_ read_lock() will see the ref or take the fallback_rwlock
> 
>>
>> I'm not referring to simple nested readers here, but interrupt handlers who
>> can act as readers. For starters, the arch_spin_trylock() is not safe when
>> interrupt handlers can also run the same code, right? You'll need to save
>> and restore interrupts at critical points in the code. Also, the __foo()
>> variants used to read/update the counters are not interrupt-safe.
> 
> I must missed something.
> 
> Could you elaborate more why arch_spin_trylock() is not safe when
> interrupt handlers can also run the same code?
> 
> Could you elaborate more why __this_cpu_op variants is not
> interrupt-safe since they are always called paired.
>

Take a look at include/linux/percpu.h. You'll note that __this_cpu_*
operations map to __this_cpu_generic_to_op(), which doesn't disable interrupts
while doing the update. Hence you can get inconsistent results if an interrupt
hits the CPU at that time and the interrupt handler tries to do the same thing.
In contrast, if you use this_cpu_inc() for example, interrupts are explicitly
disabled during the update and hence you won't get inconsistent results.

> 
>> And,
>> the unlock() code in the reader path is again going to be confused about
>> what to do when interrupt handlers interrupt regular readers, due to the
>> messed up refcount.
> 
> I still can't understand.
>

The primary reason _I_ was using the refcnt vs the reason _you_ were using the
refcnt, appears to be very different. Maybe that's why the above statement
didn't make sense. In your case, IIUC, you can simply get rid of the refcnt
and replace it with the simple check I mentioned above. Whereas, I use
refcnts to keep the reader-side synchronization fast (and for reader-writer
communication).

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
From: Sukadev Bhattiprolu @ 2013-02-26 20:03 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Anton Blanchard,
	linux-kernel, Stephane Eranian, linuxppc-dev, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20130226052646.GB21553@concordia>

Michael Ellerman [michael@ellerman.id.au] wrote:
| On Tue, Jan 22, 2013 at 10:26:13PM -0800, Sukadev Bhattiprolu wrote:
| > 
| > [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
| > 
| > Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event'
| > which describes the format of a POWER cpu.
| 
| Did this patch go upstream? I don't see it.

Hmm, patches 1..4,6 are in linux-tip and Arnaldo's trees but patch 5 is
in neither.

| 
| If not, please don't merge it.
| 
| > The format of the event is the same for all POWER cpus at least in
| > (Power6, Power7), so bulk of this change is common in the code common
| > to POWER cpus.
| 
| No. The event format is different on most POWER cpus, in particular it
| is different on Power6 and Power7, and will be different again on
| Power8.

Sigh.  The port of this patchset to Power6 has not started yet.

But this patchset does work on Power7 correct ?

If so, and we figure out what happened to patch 5, can we add a patch to
to move the format code to power7-pmu.c ?

Sukadev

^ permalink raw reply

* Re: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU API implementation.
From: Stuart Yoder @ 2013-02-26 22:33 UTC (permalink / raw)
  To: Varun Sethi
  Cc: Joerg Roedel, Stuart Yoder, linux-kernel, iommu, Scott Wood,
	linuxppc-dev
In-Reply-To: <1361191939-21260-7-git-send-email-Varun.Sethi@freescale.com>

Have not got through the entire file, but have a few comments...

+/*
+ * Set the PAACE type as primary and set the coherency required domain
+ * attribute
+ */
+static void pamu_setup_default_xfer_to_host_ppaace(struct paace *ppaace)
+{
+       set_bf(ppaace->addr_bitfields, PAACE_AF_PT, PAACE_PT_PRIMARY);
+
+       set_bf(ppaace->domain_attr.to_host.coherency_required, PAACE_DA_HOST_CR,
+              PAACE_M_COHERENCE_REQ);
+}
+
+/*
+ * Set the PAACE type as secondary and set the coherency required domain
+ * attribute.
+ */
+static void pamu_setup_default_xfer_to_host_spaace(struct paace *spaace)
+{
+       set_bf(spaace->addr_bitfields, PAACE_AF_PT, PAACE_PT_SECONDARY);
+       set_bf(spaace->domain_attr.to_host.coherency_required, PAACE_DA_HOST_CR,
+              PAACE_M_COHERENCE_REQ);
+}

Can we change the names of the above functions...   I know there is some history
with the name, but "xfer_to_host" is confusing.

Maybe just call them:

pamu_init_paace()
pamu_init_spaace()

> +/**
> + * pamu_config_spaace() - Sets up SPAACE entry for specified subwindow
> + *
> + * @liodn:  Logical IO device number
> + * @subwin_cnt:  number of sub-windows associated with dma-window
> + * @subwin_addr: starting address of subwindow
> + * @subwin_size: size of subwindow
> + * @omi: Operation mapping index
> + * @rpn: real (true physical) page number
> + * @snoopid: snoop id for hardware coherency -- if ~snoopid == 0 then
> + *                       snoopid not defined
> + * @stashid: cache stash id for associated cpu
> + * @enable: enable/disable subwindow after reconfiguration
> + * @prot: sub window permissions
> + *
> + * Returns 0 upon success else error code < 0 returned
> + */
> +int pamu_config_spaace(int liodn, u32 subwin_cnt, u32 subwin_addr,
> +                      phys_addr_t subwin_size, u32 omi, unsigned long rpn,
> +                      u32 snoopid, u32 stashid, int enable, int prot)
> +{
> +       struct paace *paace;
> +
> +       /* setup sub-windows */
> +       if (!subwin_cnt) {
> +               pr_err("Invalid subwindow count\n");
> +               return -EINVAL;
> +       }
> +
> +       paace = pamu_get_ppaace(liodn);
> +       if (subwin_addr > 0 && subwin_addr < subwin_cnt && paace) {

Why is the comparison subwin_addr < subwin_cnt?   Seems wrong...

> +               paace = pamu_get_spaace(paace, subwin_addr - 1);
> +
> +               if (paace && !(paace->addr_bitfields & PAACE_V_VALID)) {
> +                       pamu_setup_default_xfer_to_host_spaace(paace);
> +                       set_bf(paace->addr_bitfields, SPAACE_AF_LIODN, liodn);
> +               }
> +       }
> +
> +       if (!paace) {
> +               pr_err("Invalid liodn entry\n");
> +               return -ENOENT;
> +       }
> +
> +       if (!enable && prot == PAACE_AP_PERMS_DENIED) {
> +               if (subwin_addr > 0)
> +                       set_bf(paace->addr_bitfields, PAACE_AF_V,
> +                                PAACE_V_INVALID);
> +               else
> +                       set_bf(paace->addr_bitfields, PAACE_AF_AP,
> +                                prot);
> +               mb();
> +               return 0;
> +       }

Can you add a comment to the above if statement...when is this function called
with PAACE_AP_PERMS_DENIED?


> +       if (subwin_size & (subwin_size - 1) || subwin_size < PAMU_PAGE_SIZE) {
> +               pr_err("subwindow size out of range, or not a power of 2\n");
> +               return -EINVAL;
> +       }
> +
> +       if (rpn == ULONG_MAX) {
> +               pr_err("real page number out of range\n");
> +               return -EINVAL;
> +       }
> +
> +       /* window size is 2^(WSE+1) bytes */
> +       set_bf(paace->win_bitfields, PAACE_WIN_SWSE,
> +              map_addrspace_size_to_wse(subwin_size));
> +
> +       set_bf(paace->impl_attr, PAACE_IA_ATM, PAACE_ATM_WINDOW_XLATE);
> +       paace->twbah = rpn >> 20;
> +       set_bf(paace->win_bitfields, PAACE_WIN_TWBAL, rpn);
> +       set_bf(paace->addr_bitfields, PAACE_AF_AP, prot);
> +
> +       /* configure snoop id */
> +       if (~snoopid != 0)
> +               paace->domain_attr.to_host.snpid = snoopid;
> +
> +       /* set up operation mapping if it's configured */
> +       if (omi < OME_NUMBER_ENTRIES) {
> +               set_bf(paace->impl_attr, PAACE_IA_OTM, PAACE_OTM_INDEXED);
> +               paace->op_encode.index_ot.omi = omi;
> +       } else if (~omi != 0) {
> +               pr_err("bad operation mapping index: %d\n", omi);
> +               return -EINVAL;
> +       }
> +
> +       if (~stashid != 0)
> +               set_bf(paace->impl_attr, PAACE_IA_CID, stashid);
> +
> +       smp_wmb();
> +
> +       if (enable)
> +               paace->addr_bitfields |= PAACE_V_VALID;
> +
> +       mb();
> +
> +       return 0;
> +}

^ permalink raw reply

* [ 110/150] uprobes/powerpc: Add dependency on single step emulation
From: Greg Kroah-Hartman @ 2013-02-26 23:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Greg Kroah-Hartman, stable, linuxppc-dev, Suzuki K. Poulose
In-Reply-To: <20130226235523.930663721@linuxfoundation.org>

3.8-stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Suzuki K. Poulose" <suzuki@in.ibm.com>

commit 5e249d4528528c9a77da051a89ec7f99d31b83eb upstream.

Uprobes uses emulate_step in sstep.c, but we haven't explicitly specified
the dependency. On pseries HAVE_HW_BREAKPOINT protects us, but 44x has no
such luxury.

Consolidate other users that depend on sstep and create a new config option.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/powerpc/Kconfig      |    4 ++++
 arch/powerpc/lib/Makefile |    4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)

--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -275,6 +275,10 @@ config PPC_ADV_DEBUG_DAC_RANGE
 	depends on PPC_ADV_DEBUG_REGS && 44x
 	default y

+config PPC_EMULATE_SSTEP
+	bool
+	default y if KPROBES || UPROBES || XMON || HAVE_HW_BREAKPOINT
+
 source "init/Kconfig"

 source "kernel/Kconfig.freezer"
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -19,9 +19,7 @@ obj-$(CONFIG_PPC64)	+= copypage_64.o cop
 			   checksum_wrappers_64.o hweight_64.o \
 			   copyuser_power7.o string_64.o copypage_power7.o \
 			   memcpy_power7.o
-obj-$(CONFIG_XMON)	+= sstep.o ldstfp.o
-obj-$(CONFIG_KPROBES)	+= sstep.o ldstfp.o
-obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= sstep.o ldstfp.o
+obj-$(CONFIG_PPC_EMULATE_SSTEP)	+= sstep.o ldstfp.o

 ifeq ($(CONFIG_PPC64),y)
 obj-$(CONFIG_SMP)	+= locks.o

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Lai Jiangshan @ 2013-02-27  0:33 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, vincent.guittot, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512D0D67.9010609@linux.vnet.ibm.com>

On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
>> On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>
>>> Hi Lai,
>>>
>>> I'm really not convinced that piggy-backing on lglocks would help
>>> us in any way. But still, let me try to address some of the points
>>> you raised...
>>>
>>> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>>>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>>> Hi Lai,
>>>>>>>
>>>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>>>> Hi, Srivatsa,
>>>>>>>>
>>>>>>>> The target of the whole patchset is nice for me.
>>>>>>>
>>>>>>> Cool! Thanks :-)
>>>>>>>
>>>>> [...]
>>>>>
>>>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>>>> writer and the reader both increment the same counters. So how will the
>>>>> unlock() code in the reader path know when to unlock which of the locks?
>>>>
>>>> The same as your code, the reader(which nested in write C.S.) just dec
>>>> the counters.
>>>
>>> And that works fine in my case because the writer and the reader update
>>> _two_ _different_ counters.
>>
>> I can't find any magic in your code, they are the same counter.
>>
>>         /*
>>          * It is desirable to allow the writer to acquire the percpu-rwlock
>>          * for read (if necessary), without deadlocking or getting complaints
>>          * from lockdep. To achieve that, just increment the reader_refcnt of
>>          * this CPU - that way, any attempt by the writer to acquire the
>>          * percpu-rwlock for read, will get treated as a case of nested percpu
>>          * reader, which is safe, from a locking perspective.
>>          */
>>         this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
>>
>
> Whoa! Hold on, were you really referring to _this_ increment when you said
> that, in your patch you would increment the refcnt at the writer? Then I guess
> there is a major disconnect in our conversations. (I had assumed that you were
> referring to the update of writer_signal, and were just trying to have a single
> refcnt instead of reader_refcnt and writer_signal).

https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d

Sorry the name "fallback_reader_refcnt" misled you.

>
> So, please let me clarify things a bit here. Forget about the above increment
> of reader_refcnt at the writer side. Its almost utterly insignificant for our
> current discussion. We can simply replace it with a check as shown below, at
> the reader side:
>
> void percpu_read_lock_irqsafe()
> {
>         if (current == active_writer)
>                 return;
>
>         /* Rest of the code */
> }
>
> Now, assuming that, in your patch, you were trying to use the per-cpu refcnt
> to allow the writer to safely take the reader path, you can simply get rid of
> that percpu-refcnt, as demonstrated above.
>
> So that would reduce your code to the following (after simplification):
>
> lg_rwlock_local_read_lock()
> {
>         if (current == active_writer)
>                 return;
>         if (arch_spin_trylock(per-cpu-spinlock))
>                 return;
>         read_lock(global-rwlock);
> }
>
> Now, let us assume that hotplug is not happening, meaning, nobody is running
> the writer side code. Now let's see what happens at the reader side in your
> patch. As I mentioned earlier, the readers are _very_ frequent and can be in
> very hot paths. And they also happen to be nested quite often.
>
> So, a non-nested reader acquires the per-cpu spinlock. Every subsequent nested
> reader on that CPU has to acquire the global rwlock for read. Right there you
> have 2 significant performance issues -
> 1. Acquiring the (spin) lock is costly
> 2. Acquiring the global rwlock causes cacheline bouncing, which hurts
>    performance.
>
> And why do we care so much about performance here? Because, the existing
> kernel does an efficient preempt_disable() here - which is an optimized
> per-cpu counter increment. Replacing that with such heavy primitives on the
> reader side can be very bad.
>
> Now, how does my patchset tackle this? My scheme just requires an increment
> of a per-cpu refcnt (reader_refcnt) and memory barrier. Which is acceptable
> from a performance-perspective, because IMHO its not horrendously worse than
> a preempt_disable().
>
>>
>>> If both of them update the same counter, there
>>> will be a semantic clash - an increment of the counter can either mean that
>>> a new writer became active, or it can also indicate a nested reader. A decrement
>>> can also similarly have 2 meanings. And thus it will be difficult to decide
>>> the right action to take, based on the value of the counter.
>>>
>>>>
>>>>> (The counter-dropping-to-zero logic is not safe, since it can be updated
>>>>> due to different reasons). And now that I look at it again, in the absence
>>>>> of the writer, the reader is allowed to be recursive at the heavy cost of
>>>>> taking the global rwlock for read, every 2nd time you nest (because the
>>>>> spinlock is non-recursive).
>>>>
>>>> (I did not understand your comments of this part)
>>>> nested reader is considered seldom.
>>>
>>> No, nested readers can be _quite_ frequent. Because, potentially all users
>>> of preempt_disable() are readers - and its well-known how frequently we
>>> nest preempt_disable(). As a simple example, any atomic reader who calls
>>> smp_call_function() will become a nested reader, because smp_call_function()
>>> itself is a reader. So reader nesting is expected to be quite frequent.
>>>
>>>> But if N(>=2) nested readers happen,
>>>> the overhead is:
>>>>     1 spin_try_lock() + 1 read_lock() + (N-1) __this_cpu_inc()
>>>>
>>>
>>> In my patch, its just this_cpu_inc(). Note that these are _very_ hot paths.
>>> So every bit of optimization that you can add is worthwhile.
>>>
>>> And your read_lock() is a _global_ lock - thus, it can lead to a lot of
>>> cache-line bouncing. That's *exactly* why I have used per-cpu refcounts in
>>> my synchronization scheme, to avoid taking the global rwlock as much as possible.
>>>
>>> Another important point to note is that, the overhead we are talking about
>>> here, exists even when _not_ performing hotplug. And its the replacement to
>>> the super-fast preempt_disable(). So its extremely important to consciously
>>> minimize this overhead - else we'll end up slowing down the system significantly.
>>>
>>
>> All I was considered is "nested reader is seldom", so I always
>> fallback to rwlock when nested.
>> If you like, I can add 6 lines of code, the overhead is
>> 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()
>>
>
> I'm assuming that calculation is no longer valid, considering that
> we just discussed how the per-cpu refcnt that you were using is quite
> unnecessary and can be removed.
>
> IIUC, the overhead with your code, as per above discussion would be:
> 1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).

https://github.com/laijs/linux/commit/46334544bb7961550b7065e015da76f6dab21f16

Again, I'm so sorry the name "fallback_reader_refcnt" misled you.

>
> Note that I'm referring to the scenario when hotplug is _not_ happening
> (ie., nobody is running writer side code).
>
>> The overhead of your code is
>> 2 smp_mb() + N __this_cpu_inc()
>>
>
> Right. And as you can see, this is much much better than the overhead
> shown above.

I will write a test to compare it to "1 spin_try_lock()(fast path)  +
N  __this_cpu_inc()"

>
>> I don't see how much different.
>>
>>>>> Also, this lg_rwlock implementation uses 3
>>>>> different data-structures - a per-cpu spinlock, a global rwlock and
>>>>> a per-cpu refcnt, and its not immediately apparent why you need those many
>>>>> or even those many varieties.
>>>>
>>>> data-structures is the same as yours.
>>>> fallback_reader_refcnt <--> reader_refcnt
>>>
>>> This has semantic problems, as noted above.
>>>
>>>> per-cpu spinlock <--> write_signal
>>>
>>> Acquire/release of (spin) lock is costlier than inc/dec of a counter, IIUC.
>>>
>>>> fallback_rwlock  <---> global_rwlock
>>>>
>>>>> Also I see that this doesn't handle the
>>>>> case of interrupt-handlers also being readers.
>>>>
>>>> handled. nested reader will see the ref or take the fallback_rwlock
>>>>
>>
>> Sorry, _reentrance_ read_lock() will see the ref or take the fallback_rwlock
>>
>>>
>>> I'm not referring to simple nested readers here, but interrupt handlers who
>>> can act as readers. For starters, the arch_spin_trylock() is not safe when
>>> interrupt handlers can also run the same code, right? You'll need to save
>>> and restore interrupts at critical points in the code. Also, the __foo()
>>> variants used to read/update the counters are not interrupt-safe.
>>
>> I must missed something.
>>
>> Could you elaborate more why arch_spin_trylock() is not safe when
>> interrupt handlers can also run the same code?
>>
>> Could you elaborate more why __this_cpu_op variants is not
>> interrupt-safe since they are always called paired.
>>
>
> Take a look at include/linux/percpu.h. You'll note that __this_cpu_*
> operations map to __this_cpu_generic_to_op(), which doesn't disable interrupts
> while doing the update. Hence you can get inconsistent results if an interrupt
> hits the CPU at that time and the interrupt handler tries to do the same thing.
> In contrast, if you use this_cpu_inc() for example, interrupts are explicitly
> disabled during the update and hence you won't get inconsistent results.


xx_lock()/xx_unlock() are must called paired, if interrupts happens
the value of the data is recovered after the interrupts return.

the same reason, preempt_disable() itself it is not irqsafe,
but preempt_disable()/preempt_enable() are called paired, so they are
all irqsafe.

>
>>
>>> And,
>>> the unlock() code in the reader path is again going to be confused about
>>> what to do when interrupt handlers interrupt regular readers, due to the
>>> messed up refcount.
>>
>> I still can't understand.
>>
>
> The primary reason _I_ was using the refcnt vs the reason _you_ were using the
> refcnt, appears to be very different.  Maybe that's why the above statement
> didn't make sense. In your case, IIUC, you can simply get rid of the refcnt
> and replace it with the simple check I mentioned above. Whereas, I use
> refcnts to keep the reader-side synchronization fast (and for reader-writer
> communication).
>
>
> Regards,
> Srivatsa S. Bhat
>

^ permalink raw reply

* Re: [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
From: Michael Ellerman @ 2013-02-27  1:17 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Anton Blanchard,
	linux-kernel, Stephane Eranian, linuxppc-dev, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20130226200343.GA21543@us.ibm.com>

On Tue, Feb 26, 2013 at 12:03:43PM -0800, Sukadev Bhattiprolu wrote:
> Michael Ellerman [michael@ellerman.id.au] wrote:
> | On Tue, Jan 22, 2013 at 10:26:13PM -0800, Sukadev Bhattiprolu wrote:
> | > 
> | > [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
> | > 
> | > Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event'
> | > which describes the format of a POWER cpu.
> | 
> | Did this patch go upstream? I don't see it.
> 
> Hmm, patches 1..4,6 are in linux-tip and Arnaldo's trees but patch 5 is
> in neither.

I suspect Arnaldo was either waiting for an ACK from Ben, or was
expecting Ben to take it?

> | > The format of the event is the same for all POWER cpus at least in
> | > (Power6, Power7), so bulk of this change is common in the code common
> | > to POWER cpus.
> | 
> | No. The event format is different on most POWER cpus, in particular it
> | is different on Power6 and Power7, and will be different again on
> | Power8.
> 
> Sigh.  The port of this patchset to Power6 has not started yet.

It should hardly require a port, it's about five lines. But it doesn't
matter for now.
 
> But this patchset does work on Power7 correct ?

Yes, and without it the rest of the series is essentially useless. So I
guess it's better than nothing and we should get it in, we can fix it up
later to work across different chips.

Ben can you grab it, it's all arch/powerpc.

cheers

^ permalink raw reply

* Re: [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
From: Michael Ellerman @ 2013-02-27  1:27 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Anton Blanchard,
	linux-kernel, Stephane Eranian, linuxppc-dev, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20130123062613.GF13720@us.ibm.com>

On Tue, Jan 22, 2013 at 10:26:13PM -0800, Sukadev Bhattiprolu wrote:
> 
> [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
> 
> Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event'
> which describes the format of a POWER cpu.

> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index fa476d5..4ae044b 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -1315,6 +1315,18 @@ ssize_t power_events_sysfs_show(struct device *dev,
>  	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
>  }
>  
> +PMU_FORMAT_ATTR(event, "config:0-20");

Just noticed, it's 20 bits, which is 0-19.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: kernel/kgdb.c: fix memory leakage
From: tiejun.chen @ 2013-02-27  2:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Stephen Rothwell, Michael Neuling, Cong Ding, linux-kernel,
	Paul Mackerras, Jason Wessel, linuxppc-dev
In-Reply-To: <1360291273.2650.52.camel@pasglop>

On 02/08/2013 10:41 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-01-31 at 20:04 -0600, Jason Wessel wrote:
>
>>> diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
>>> index 8747447..5ca82cd 100644
>>> --- a/arch/powerpc/kernel/kgdb.c
>>> +++ b/arch/powerpc/kernel/kgdb.c
>>> @@ -154,12 +154,12 @@ static int kgdb_handle_breakpoint(struct pt_regs *regs)
>>>   static int kgdb_singlestep(struct pt_regs *regs)
>>>   {
>>>   	struct thread_info *thread_info, *exception_thread_info;
>>> -	struct thread_info *backup_current_thread_info = \
>>> -		(struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
>>> +	struct thread_info *backup_current_thread_info;
>>
>>
>>
>> Woh...  This is definitely wrong.  You have found a problem for sure,
>> but this is not the right way to fix it.
>>
>> It is not a good idea to kmalloc while single stepping because you can
>> hang the kernel if you single step any operation in kmalloc().
>
> Ok. I applied the fix because it was obviously correct vs. the previous
> version of the code (ie. the kmalloc was already there)
>
>> I am in the process of going through all the kgdb mails from the last
>> few months while I had been away from the project, so I didn't catch
>> this one and I see it has upstream commit (fefd9e6f8).  I'll submit
>> another patch to fix this the right way and use a static variable.
>> This is ok to use a static variable here because this is not something
>> we can recursively call at a single CPU level.
>
> But a static will be shared by all CPUs or do you mean a per-cpu one ?
>
>> If Ben prefers we not burn the memory unless kgdb is active we can
>> kmalloc / kfree the space we need at the time that kgdb is
>> initialized.  Else we can go with this patch you see below.  We'll see
>> what Ben desires.
>
> What about the stack ? thread info isn't very big...


Actually booke already copy the thread_info when we enter the debug exception, 
and I will send a patch to do the same thing for book3e, so I also remove these 
copying action here then we're happy without these nasty stuff :)

Tiejun

>
> Cheers,
> Ben.
>
>> -----
>> diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
>> index a7bc752..bb12c8b 100644
>> --- a/arch/powerpc/kernel/kgdb.c
>> +++ b/arch/powerpc/kernel/kgdb.c
>> @@ -151,15 +151,16 @@ static int kgdb_handle_breakpoint(struct pt_regs *regs)
>>   	return 1;
>>   }
>>
>> +static struct thread_info kgdb_backup_thread_info[NR_CPUS];
>> +
>>   static int kgdb_singlestep(struct pt_regs *regs)
>>   {
>>   	struct thread_info *thread_info, *exception_thread_info;
>> -	struct thread_info *backup_current_thread_info;
>> +	int cpu = raw_smp_processor_id();
>>
>>   	if (user_mode(regs))
>>   		return 0;
>>
>> -	backup_current_thread_info = (struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
>>   	/*
>>   	 * On Book E and perhaps other processors, singlestep is handled on
>>   	 * the critical exception stack.  This causes current_thread_info()
>> @@ -175,7 +176,7 @@ static int kgdb_singlestep(struct pt_regs *regs)
>>
>>   	if (thread_info != exception_thread_info) {
>>   		/* Save the original current_thread_info. */
>> -		memcpy(backup_current_thread_info, exception_thread_info, sizeof *thread_info);
>> +		memcpy(&kgdb_backup_thread_info[cpu], exception_thread_info, sizeof *thread_info);
>>   		memcpy(exception_thread_info, thread_info, sizeof *thread_info);
>>   	}
>>
>> @@ -183,9 +184,8 @@ static int kgdb_singlestep(struct pt_regs *regs)
>>
>>   	if (thread_info != exception_thread_info)
>>   		/* Restore current_thread_info lastly. */
>> -		memcpy(exception_thread_info, backup_current_thread_info, sizeof *thread_info);
>> +		memcpy(exception_thread_info, &kgdb_backup_thread_info[cpu], sizeof *thread_info);
>>
>> -	kfree(backup_current_thread_info);
>>   	return 1;
>>   }
>>
>>
>> -----
>>
>>
>> Thanks,
>> Jason.
>>
>>
>>>
>>>   	if (user_mode(regs))
>>>   		return 0;
>>>
>>> +	backup_current_thread_info = (struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
>>>   	/*
>>>   	 * On Book E and perhaps other processors, singlestep is handled on
>>>   	 * the critical exception stack.  This causes current_thread_info()
>>> @@ -185,6 +185,7 @@ static int kgdb_singlestep(struct pt_regs *regs)
>>>   		/* Restore current_thread_info lastly. */
>>>   		memcpy(exception_thread_info, backup_current_thread_info, sizeof *thread_info);
>>>
>>> +	kfree(backup_current_thread_info);
>>>   	return 1;
>>>   }
>>>
>>>
>
>
>
>

^ permalink raw reply

* [v3][PATCH 0/6] powerpc/book3e: make kgdb to work well
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev

This patchset is used to support kgdb/gdb on book3e.

Validated on p4080ds and p5040ds with test single step and breakpoint

v3:

* make work when enable CONFIG_RELOCATABLE
* fix one typo in patch,
  "powerpc/book3e: store critical/machine/debug exception thread info": 
	ld	r1,PACAKSAVE(r13);
    ->  ld	r14,PACAKSAVE(r13);
* remove copying the thread_info since booke and book3e always copy
  the thead_info now when we enter the debug exception, and so drop
  the v2 patch, "book3e/kgdb: Fix a single stgep case of lazy IRQ"

v2:

* Make sure we cover CONFIG_PPC_BOOK3E_64 safely
* Use LOAD_REG_IMMEDIATE() to load properly
	the value of the constant expression in load debug exception stack 
* Copy thread infor form the kernel stack coming from usr
* Rebase latest powerpc git tree

v1:
* Copy thread info only when we are from !user mode since we'll get kernel stack
  coming from usr directly.
* remove save/restore EX_R14/EX_R15 since DBG_EXCEPTION_PROLOG already covered
  this.
* use CURRENT_THREAD_INFO() conveniently to get thread.
* fix some typos
* add a patch to make sure gdb can generate a single step properly to invoke a
  kgdb state.
* add a patch to if we need to replay an interrupt, we shouldn't restore that
  previous backup thread info to make sure we can replay an interrupt lately
  with a proper thread info.
* rebase latest powerpc git tree

v0:
This patchset is used to support kgdb for book3e.

------
Tiejun Chen (6):
      powerpc/book3e: load critical/machine/debug exception stack
      powerpc/book3e: store critical/machine/debug exception thread info
      book3e/kgdb: update thread's dbcr0
      powerpc/book3e: support kgdb for kernel space
      kgdb/kgdbts: support ppc64
      powerpc/kgdb: remove copying the thread_info

 arch/powerpc/kernel/exceptions-64e.S |   69 ++++++++++++++++++++++++++++++++--
 arch/powerpc/kernel/kgdb.c           |   41 +++++---------------
 drivers/misc/kgdbts.c                |    2 +
 3 files changed, 77 insertions(+), 35 deletions(-)

Tiejun

^ permalink raw reply

* [v3][PATCH 1/6] powerpc/book3e: load critical/machine/debug exception stack
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev
In-Reply-To: <1361934261-31840-1-git-send-email-tiejun.chen@windriver.com>

We always alloc critical/machine/debug check exceptions. This is
different from the normal exception. So we should load these exception
stack properly like we did for booke.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/exceptions-64e.S |   49 +++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 1e7782b..7fd6af0 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -36,6 +36,37 @@
  */
 #define	SPECIAL_EXC_FRAME_SIZE	INT_FRAME_SIZE
 
+/* only on book3e */
+#define DBG_STACK_BASE		dbgirq_ctx
+#define MC_STACK_BASE		mcheckirq_ctx
+#define CRIT_STACK_BASE		critirq_ctx
+
+#ifdef CONFIG_RELOCATABLE
+#define LOAD_STACK_BASE(reg, level)			\
+	tovirt(r2,r2);					\
+	LOAD_REG_ADDR(reg, level##_STACK_BASE);
+#else
+#define LOAD_STACK_BASE(reg, level)			\
+	LOAD_REG_IMMEDIATE(reg, level##_STACK_BASE);
+#endif
+
+#ifdef CONFIG_SMP
+#define BOOK3E_LOAD_EXC_LEVEL_STACK(level)		\
+	mfspr	r14,SPRN_PIR;				\
+	slwi	r14,r14,3;				\
+	LOAD_STACK_BASE(r10, level);			\
+	add	r10,r10,r14;				\
+	ld	r10,0(r10);				\
+	addi	r10,r10,THREAD_SIZE;			\
+	std	r10,PACA_##level##_STACK(r13);
+#else
+#define BOOK3E_LOAD_EXC_LEVEL_STACK(level)		\
+	LOAD_STACK_BASE(r10, level);			\
+	ld	r10,0(r10);				\
+	addi	r10,r10,THREAD_SIZE;			\
+	std	r10,PACA_##level##_STACK(r13);
+#endif
+
 /* Exception prolog code for all exceptions */
 #define EXCEPTION_PROLOG(n, intnum, type, addition)	    		    \
 	mtspr	SPRN_SPRG_##type##_SCRATCH,r13;	/* get spare registers */   \
@@ -68,20 +99,32 @@
 #define SPRN_GDBELL_SRR1	SPRN_GSRR1
 
 #define CRIT_SET_KSTACK						            \
+	andi.	r10,r11,MSR_PR;							\
+	bne	1f;								\
+	BOOK3E_LOAD_EXC_LEVEL_STACK(CRIT);					\
 	ld	r1,PACA_CRIT_STACK(r13);				    \
-	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;
+	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;					\
+1:
 #define SPRN_CRIT_SRR0	SPRN_CSRR0
 #define SPRN_CRIT_SRR1	SPRN_CSRR1
 
 #define DBG_SET_KSTACK						            \
+	andi.	r10,r11,MSR_PR;							\
+	bne	1f;								\
+	BOOK3E_LOAD_EXC_LEVEL_STACK(DBG);					\
 	ld	r1,PACA_DBG_STACK(r13);					    \
-	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;
+	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;					\
+1:
 #define SPRN_DBG_SRR0	SPRN_DSRR0
 #define SPRN_DBG_SRR1	SPRN_DSRR1
 
 #define MC_SET_KSTACK						            \
+	andi.	r10,r11,MSR_PR;							\
+	bne	1f;								\
+	BOOK3E_LOAD_EXC_LEVEL_STACK(MC);					\
 	ld	r1,PACA_MC_STACK(r13);					    \
-	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;
+	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;					\
+1:
 #define SPRN_MC_SRR0	SPRN_MCSRR0
 #define SPRN_MC_SRR1	SPRN_MCSRR1
 
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 2/6] powerpc/book3e: store critical/machine/debug exception thread info
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev
In-Reply-To: <1361934261-31840-1-git-send-email-tiejun.chen@windriver.com>

We need to store thread info to these exception thread info like something
we already did for PPC32.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/exceptions-64e.S |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 7fd6af0..7df9a1f 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -67,6 +67,18 @@
 	std	r10,PACA_##level##_STACK(r13);
 #endif
 
+/* Store something to exception thread info */
+#define	BOOK3E_STORE_EXC_LEVEL_THEAD_INFO(type)					\
+	ld	r14,PACAKSAVE(r13);						\
+	CURRENT_THREAD_INFO(r14, r14);						\
+	CURRENT_THREAD_INFO(r15, r1);						\
+	ld	r10,TI_FLAGS(r14);		     				\
+	std	r10,TI_FLAGS(r15);			     			\
+	ld	r10,TI_PREEMPT(r14);		     				\
+	std	r10,TI_PREEMPT(r15);		     				\
+	ld	r10,TI_TASK(r14);			     			\
+	std	r10,TI_TASK(r15);
+
 /* Exception prolog code for all exceptions */
 #define EXCEPTION_PROLOG(n, intnum, type, addition)	    		    \
 	mtspr	SPRN_SPRG_##type##_SCRATCH,r13;	/* get spare registers */   \
@@ -104,6 +116,7 @@
 	BOOK3E_LOAD_EXC_LEVEL_STACK(CRIT);					\
 	ld	r1,PACA_CRIT_STACK(r13);				    \
 	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;					\
+	BOOK3E_STORE_EXC_LEVEL_THEAD_INFO(CRIT);				\
 1:
 #define SPRN_CRIT_SRR0	SPRN_CSRR0
 #define SPRN_CRIT_SRR1	SPRN_CSRR1
@@ -114,6 +127,7 @@
 	BOOK3E_LOAD_EXC_LEVEL_STACK(DBG);					\
 	ld	r1,PACA_DBG_STACK(r13);					    \
 	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;					\
+	BOOK3E_STORE_EXC_LEVEL_THEAD_INFO(DBG);					\
 1:
 #define SPRN_DBG_SRR0	SPRN_DSRR0
 #define SPRN_DBG_SRR1	SPRN_DSRR1
@@ -124,6 +138,7 @@
 	BOOK3E_LOAD_EXC_LEVEL_STACK(MC);					\
 	ld	r1,PACA_MC_STACK(r13);					    \
 	subi	r1,r1,SPECIAL_EXC_FRAME_SIZE;					\
+	BOOK3E_STORE_EXC_LEVEL_THEAD_INFO(MC);					\
 1:
 #define SPRN_MC_SRR0	SPRN_MCSRR0
 #define SPRN_MC_SRR1	SPRN_MCSRR1
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 3/6] book3e/kgdb: update thread's dbcr0
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev
In-Reply-To: <1361934261-31840-1-git-send-email-tiejun.chen@windriver.com>

gdb always need to generate a single step properly to invoke
a kgdb state. But with lazy interrupt, book3e can't always
trigger a debug exception with a single step since the current
is blocked for handling those pending exception, then we miss
that expected dbcr configuration at last to generate a debug
exception.

So here we also update thread's dbcr0 to make sure the current
can go back with that missed dbcr0 configuration.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/kgdb.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
index 5ca82cd..1a57307 100644
--- a/arch/powerpc/kernel/kgdb.c
+++ b/arch/powerpc/kernel/kgdb.c
@@ -410,7 +410,7 @@ int kgdb_arch_handle_exception(int vector, int signo, int err_code,
 			       struct pt_regs *linux_regs)
 {
 	char *ptr = &remcom_in_buffer[1];
-	unsigned long addr;
+	unsigned long addr, dbcr0;
 
 	switch (remcom_in_buffer[0]) {
 		/*
@@ -427,8 +427,15 @@ int kgdb_arch_handle_exception(int vector, int signo, int err_code,
 		/* set the trace bit if we're stepping */
 		if (remcom_in_buffer[0] == 's') {
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
-			mtspr(SPRN_DBCR0,
-			      mfspr(SPRN_DBCR0) | DBCR0_IC | DBCR0_IDM);
+			dbcr0 = mfspr(SPRN_DBCR0) | DBCR0_IC | DBCR0_IDM;
+			mtspr(SPRN_DBCR0, dbcr0);
+#ifdef CONFIG_PPC_BOOK3E_64
+			/* With lazy interrut we have to update thread dbcr0 here
+			 * to make sure we can set debug properly at last to invoke
+			 * kgdb again to work well.
+			 */
+			current->thread.dbcr0 = dbcr0;
+#endif
 			linux_regs->msr |= MSR_DE;
 #else
 			linux_regs->msr |= MSR_SE;
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 4/6] powerpc/book3e: support kgdb for kernel space
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev
In-Reply-To: <1361934261-31840-1-git-send-email-tiejun.chen@windriver.com>

Currently we need to skip this for supporting KGDB.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/exceptions-64e.S |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 7df9a1f..800e2a3 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -598,11 +598,14 @@ kernel_dbg_exc:
 	rfdi
 
 	/* Normal debug exception */
+1:
+#ifndef CONFIG_KGDB
 	/* XXX We only handle coming from userspace for now since we can't
 	 *     quite save properly an interrupted kernel state yet
 	 */
-1:	andi.	r14,r11,MSR_PR;		/* check for userspace again */
+	andi.	r14,r11,MSR_PR;		/* check for userspace again */
 	beq	kernel_dbg_exc;		/* if from kernel mode */
+#endif
 
 	/* Now we mash up things to make it look like we are coming on a
 	 * normal exception
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 5/6] kgdb/kgdbts: support ppc64
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev
In-Reply-To: <1361934261-31840-1-git-send-email-tiejun.chen@windriver.com>

We can't look up the address of the entry point of the function simply
via that function symbol for all architectures.

For PPC64 ABI, actually there is a function descriptors structure.

A function descriptor is a three doubleword data structure that contains
the following values:
	* The first doubleword contains the address of the entry point of
		the function.
	* The second doubleword contains the TOC base address for
		the function.
	* The third doubleword contains the environment pointer for
		languages such as Pascal and PL/1.

So we should call a wapperred dereference_function_descriptor() to get
the address of the entry point of the function.

Note this is also safe for other architecture after refer to
"include/asm-generic/sections.h" since:

dereference_function_descriptor(p) always is (p) if without arched definition.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 drivers/misc/kgdbts.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/misc/kgdbts.c b/drivers/misc/kgdbts.c
index 3aa9a96..4799e1f 100644
--- a/drivers/misc/kgdbts.c
+++ b/drivers/misc/kgdbts.c
@@ -103,6 +103,7 @@
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/module.h>
+#include <asm/sections.h>
 
 #define v1printk(a...) do { \
 	if (verbose) \
@@ -222,6 +223,7 @@ static unsigned long lookup_addr(char *arg)
 		addr = (unsigned long)do_fork;
 	else if (!strcmp(arg, "hw_break_val"))
 		addr = (unsigned long)&hw_break_val;
+	addr = (unsigned long )dereference_function_descriptor((void *)addr);
 	return addr;
 }
 
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 6/6] powerpc/kgdb: remove copying the thread_info
From: Tiejun Chen @ 2013-02-27  3:04 UTC (permalink / raw)
  To: benh, galak; +Cc: linuxppc-dev
In-Reply-To: <1361934261-31840-1-git-send-email-tiejun.chen@windriver.com>

Currently BookE and Book3E always copy the thread_info from
the kernel stack when we enter the debug exception, so we can
remove these action here to avoid copying again.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/kgdb.c |   28 ----------------------------
 1 file changed, 28 deletions(-)

diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
index 1a57307..e954888 100644
--- a/arch/powerpc/kernel/kgdb.c
+++ b/arch/powerpc/kernel/kgdb.c
@@ -153,39 +153,11 @@ static int kgdb_handle_breakpoint(struct pt_regs *regs)
 
 static int kgdb_singlestep(struct pt_regs *regs)
 {
-	struct thread_info *thread_info, *exception_thread_info;
-	struct thread_info *backup_current_thread_info;
-
 	if (user_mode(regs))
 		return 0;
 
-	backup_current_thread_info = (struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
-	/*
-	 * On Book E and perhaps other processors, singlestep is handled on
-	 * the critical exception stack.  This causes current_thread_info()
-	 * to fail, since it it locates the thread_info by masking off
-	 * the low bits of the current stack pointer.  We work around
-	 * this issue by copying the thread_info from the kernel stack
-	 * before calling kgdb_handle_exception, and copying it back
-	 * afterwards.  On most processors the copy is avoided since
-	 * exception_thread_info == thread_info.
-	 */
-	thread_info = (struct thread_info *)(regs->gpr[1] & ~(THREAD_SIZE-1));
-	exception_thread_info = current_thread_info();
-
-	if (thread_info != exception_thread_info) {
-		/* Save the original current_thread_info. */
-		memcpy(backup_current_thread_info, exception_thread_info, sizeof *thread_info);
-		memcpy(exception_thread_info, thread_info, sizeof *thread_info);
-	}
-
 	kgdb_handle_exception(0, SIGTRAP, 0, regs);
 
-	if (thread_info != exception_thread_info)
-		/* Restore current_thread_info lastly. */
-		memcpy(exception_thread_info, backup_current_thread_info, sizeof *thread_info);
-
-	kfree(backup_current_thread_info);
 	return 1;
 }
 
-- 
1.7.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox