LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH V2] lglock: add read-preference local-global rwlock
From: Srivatsa S. Bhat @ 2013-03-02 17:11 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, linux-kernel,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, Oleg Nesterov, vincent.guittot,
	sbw, tj, akpm, linuxppc-dev
In-Reply-To: <5131FB4C.7070408@cn.fujitsu.com>

On 03/02/2013 06:44 PM, Lai Jiangshan wrote:
> From 345a7a75c314ff567be48983e0892bc69c4452e7 Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> Date: Sat, 2 Mar 2013 20:33:14 +0800
> Subject: [PATCH] lglock: add read-preference local-global rwlock
> 
> Current lglock is not read-preference, so it can't be used on some cases
> which read-preference rwlock can do. Example, get_cpu_online_atomic().
> 
[...]
> diff --git a/kernel/lglock.c b/kernel/lglock.c
> index 6535a66..52e9b2c 100644
> --- a/kernel/lglock.c
> +++ b/kernel/lglock.c
> @@ -87,3 +87,71 @@ void lg_global_unlock(struct lglock *lg)
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL(lg_global_unlock);
> +
> +#define FALLBACK_BASE	(1UL << 30)
> +
> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
> +{
> +	struct lglock *lg = &lgrw->lglock;
> +
> +	preempt_disable();
> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
> +		rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
> +		if (unlikely(!arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
> +			read_lock(&lgrw->fallback_rwlock);
> +			__this_cpu_write(*lgrw->reader_refcnt, FALLBACK_BASE);
> +			return;
> +		}
> +	}
> +
> +	__this_cpu_inc(*lgrw->reader_refcnt);
> +}
> +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
> +
> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
> +{
> +	switch (__this_cpu_read(*lgrw->reader_refcnt)) {
> +	case 1:
> +		__this_cpu_write(*lgrw->reader_refcnt, 0);
> +		lg_local_unlock(&lgrw->lglock);
> +		return;

This should be a break, instead of a return, right?
Otherwise, there will be a preempt imbalance...

> +	case FALLBACK_BASE:
> +		__this_cpu_write(*lgrw->reader_refcnt, 0);
> +		read_unlock(&lgrw->fallback_rwlock);
> +		rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
> +		break;
> +	default:
> +		__this_cpu_dec(*lgrw->reader_refcnt);
> +		break;
> +	}
> +
> +	preempt_enable();
> +}


Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Oleg Nesterov @ 2013-03-02 17:06 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, Lai Jiangshan,
	netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
	akpm, linuxppc-dev
In-Reply-To: <CANN689HEW=b5S93J27p+rBX-5m7jkW4FER=eHDCp7OgMaTWLPw@mail.gmail.com>

On 03/02, Michel Lespinasse wrote:
>
> My version would be slower if it needs to take the
> slow path in a reentrant way, but I'm not sure it matters either :)

I'd say, this doesn't matter at all, simply because this can only happen
if we race with the active writer.

Oleg.

^ permalink raw reply

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Oleg Nesterov @ 2013-03-02 17:01 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
	akpm, linuxppc-dev
In-Reply-To: <513201B7.5070004@cn.fujitsu.com>

On 03/02, Lai Jiangshan wrote:
>
> On 02/03/13 02:28, Oleg Nesterov wrote:
> > Lai, I didn't read this discussion except the code posted by Michel.
> > I'll try to read this patch carefully later, but I'd like to ask
> > a couple of questions.
> >
> > This version looks more complex than Michel's, why? Just curious, I
> > am trying to understand what I missed. See
> > http://marc.info/?l=linux-kernel&m=136196350213593
>
> Michel changed my old draft version a little, his version is good enough for me.

Yes, I see. But imho Michel suggested the valuable cleanup, the code
becomes even more simple with the same perfomance.

Your v2 looks almost correct to me, but I still think it makes sense
to incorporate the simplification from Michel.

> My new version tries to add a little better nestable support with only
> adding single __this_cpu_op() in _read_[un]lock().

How? Afaics with or without FALLBACK_BASE you need _reed + _inc/dec in
_read_lock/unlock.

Oleg.

^ permalink raw reply

* Re: Linux kernel 3.x problems on PowerMac G5
From: Andreas Schwab @ 2013-03-02 16:22 UTC (permalink / raw)
  To: Phileas Fogg; +Cc: linuxppc-dev, Aaro Koskinen
In-Reply-To: <5132198C.8050300__15885.293405927$1362234296$gmane$org@mail.ru>

Phileas Fogg <phileas-fogg@mail.ru> writes:

> Aaro Koskinen wrote:
>> Hi,
>>
>> On Sat, Mar 02, 2013 at 03:40:19PM +0100, Phileas Fogg wrote:
>>> Have anyone tested Linux 3 kernels on PowerMac G5 recently ?
>>
>> 3.8 boots normally to shell and is stable on G5 iMac (PowerMac8,1).
>>
>> A.
>>
>
> Thanks. Then i configured something wrongly probably.
> I tried Linux 2.6.39.4 and it boots too.
> I looked around on the Internet and it seems i'm not the only one
> who has problems with Linux 3.x on my PowerMac G5.

I have no problem with 3.8 either, but mine is a PowerMac7,3.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply

* Re: Linux kernel 3.x problems on PowerMac G5
From: Phileas Fogg @ 2013-03-02 15:23 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: linuxppc-dev
In-Reply-To: <20130302141352.GG22941@blackmetal.musicnaut.iki.fi>

Aaro Koskinen wrote:
> Hi,
>
> On Sat, Mar 02, 2013 at 03:40:19PM +0100, Phileas Fogg wrote:
>> Have anyone tested Linux 3 kernels on PowerMac G5 recently ?
>
> 3.8 boots normally to shell and is stable on G5 iMac (PowerMac8,1).
>
> A.
>

Thanks. Then i configured something wrongly probably.
I tried Linux 2.6.39.4 and it boots too.
I looked around on the Internet and it seems i'm not the only one
who has problems with Linux 3.x on my PowerMac G5.

The guy here has the same problem:
http://forums.gentoo.org/viewtopic-p-7222918.html


My PowerMac CPU
--------------------

cat /proc/cpuinfo
processor	: 0
cpu		: PPC970MP, altivec supported
clock		: 2000.000000MHz
revision	: 1.0 (pvr 0044 0100)

processor	: 1
cpu		: PPC970MP, altivec supported
clock		: 2000.000000MHz
revision	: 1.0 (pvr 0044 0100)

timebase	: 33333333
platform	: PowerMac
model		: PowerMac11,2
machine		: PowerMac11,2
motherboard	: PowerMac11,2 MacRISC4 Power Macintosh
detected as	: 337 (PowerMac G5 Dual Core)
pmac flags	: 00000000
L2 cache	: 1024K unified
pmac-generation	: NewWorld

^ permalink raw reply

* Re: Linux kernel 3.x problems on PowerMac G5
From: Aaro Koskinen @ 2013-03-02 14:13 UTC (permalink / raw)
  To: Phileas Fogg; +Cc: linuxppc-dev
In-Reply-To: <51320F53.9040000@mail.ru>

Hi,

On Sat, Mar 02, 2013 at 03:40:19PM +0100, Phileas Fogg wrote:
> Have anyone tested Linux 3 kernels on PowerMac G5 recently ?

3.8 boots normally to shell and is stable on G5 iMac (PowerMac8,1).

A.

^ permalink raw reply

* [PATCH] powerpc: remove unused BITOP_LE_SWIZZLE macro
From: Akinobu Mita @ 2013-03-02 14:06 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Paul Mackerras, Akinobu Mita

The BITOP_LE_SWIZZLE macro was used in the little-endian bitops functions
for powerpc.  But these functions were converted to generic bitops and
the BITOP_LE_SWIZZLE is not used anymore.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/bitops.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h b/arch/powerpc/include/asm/bitops.h
index ef918a2..08bd299 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -52,8 +52,6 @@
 #define smp_mb__before_clear_bit()	smp_mb()
 #define smp_mb__after_clear_bit()	smp_mb()
 
-#define BITOP_LE_SWIZZLE	((BITS_PER_LONG-1) & ~0x7)
-
 /* Macro for generating the ***_bits() functions */
 #define DEFINE_BITOP(fn, op, prefix, postfix)	\
 static __inline__ void fn(unsigned long mask,	\
-- 
1.8.1.2

^ permalink raw reply related

* Linux kernel 3.x problems on PowerMac G5
From: Phileas Fogg @ 2013-03-02 14:40 UTC (permalink / raw)
  To: linuxppc-dev

Hi,

recently i got a PowerMac G5 and installed Debian Linux 2.6.32 on it.
Everything works so far and Debian boots properly.

Today i tried to boot Linux 3 on the machine and it doesn't boot.
The Linux 3 kernel was cross-compiled by me.

On Linux 3.8.1 it hangs after this line:
---
windfarm: Drive bay control loop started.

And then i'm getting RCU stall call traces.

On Linux 3.2 it hangs too but not at the same place.
It hangs after some SCSI message.

Have anyone tested Linux 3 kernels on PowerMac G5 recently ?

Regards

^ permalink raw reply

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Lai Jiangshan @ 2013-03-02 13:42 UTC (permalink / raw)
  To: Oleg Nesterov, paulmck
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
	akpm, linuxppc-dev
In-Reply-To: <20130301182854.GA3631@redhat.com>

On 02/03/13 02:28, Oleg Nesterov wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
> 
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

Michel changed my old draft version a little, his version is good enough for me.
My new version tries to add a little better nestable support with only
adding single __this_cpu_op() in _read_[un]lock().

> 
> And I can't understand FALLBACK_BASE...
> 
> OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.
> 
> CPU_1 does _read_lock(), and ...
> 
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +	struct lglock *lg = &lgrw->lglock;
>> +
>> +	preempt_disable();
>> +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
>> +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
> 
> _trylock() fails,
> 
>> +			read_lock(&lgrw->fallback_rwlock);
>> +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
> 
> so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.
> 
> CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().
> 
> Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
> 
> Then irq does _read_unlock(), and
> 
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +	case 0:
>> +		lg_local_unlock(&lgrw->lglock);
>> +		return;
>> +	case FALLBACK_BASE:
>> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +		read_unlock(&lgrw->fallback_rwlock);
> 
> hits this case?
> 
> Doesn't look right, but most probably I missed something.

Your are right, I just realized that I had spit a code which should be atomic.

I hope this patch(V2) can get more reviews.

My first and many locking knowledge is learned from Paul.
Paul, would you also review it?

Thanks,
Lai

> 
> Oleg.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply

* [PATCH V2] lglock: add read-preference local-global rwlock
From: Lai Jiangshan @ 2013-03-02 13:14 UTC (permalink / raw)
  To: Michel Lespinasse, Srivatsa S. Bhat
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, linux-kernel, mingo,
	linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, Oleg Nesterov, vincent.guittot, sbw, tj, akpm,
	linuxppc-dev
In-Reply-To: <CANN689HEW=b5S93J27p+rBX-5m7jkW4FER=eHDCp7OgMaTWLPw@mail.gmail.com>

>From 345a7a75c314ff567be48983e0892bc69c4452e7 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Sat, 2 Mar 2013 20:33:14 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

Current lglock is not read-preference, so it can't be used on some cases
which read-preference rwlock can do. Example, get_cpu_online_atomic().

Although we can use rwlock for these cases which needs read-preference.
but it leads to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly. It will be
worse when we have a lot of CPUs, it is not scale.

So we look forward to lglock. lglock is read-write-lock based on percpu locks,
but it is not read-preference due to its underlining percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

         CPU 0                                CPU 1
         ------                               ------
1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);

Writer:
         CPU 2:
         ------
      for_each_online_cpu(cpu)
        write_lock(my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


(holds              Waiting for
 random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
                                               for write)
               ^                   |
               |                   |
        Waiting|                   | Waiting
          for  |                   |  for
               |                   V
                ------ CPU 1 <------

                (holds my_rwlock of
                 CPU 1 for read)


So obviously this "straight-forward" way of implementing percpu rwlocks is
deadlock-prone. So we can't implement read-preference local-global rwlock
like this.


The implement of this patch reuse current lglock as frontend to achieve
local-read-lock, and reuse global fallback rwlock as backend to achieve
read-preference, and use percpu reader counter to indicate 1) the depth
of the nested reader lockes and 2) whether the outmost lock is percpu lock
or fallback rwlock.

The algorithm is simple, in the read site:
If it is nested reader, just increase the counter
If it is the outmost reader,
	1) try to lock its cpu's lock of the frontend lglock.
	   (reader count +=1 if success)
	2) if the above step fails, read_lock(&fallback_rwlock).
	   (reader count += FALLBACK_BASE + 1)

Write site:
Do the lg_global_lock() of the frontend lglock.
And then write_lock(&fallback_rwlock).


Prof:
1) reader-writer exclusive:
write-site must requires all percpu locks and fallback_rwlock.
outmost read-site must requires one of these locks.

2) read-preference:
before write site lock finished acquired, read site at least
wins at read_lock(&fallback_rwlock) due to rwlock is read-preference.

3) read site functions are irqsafe(reentrance-safe)
   (read site functions is not protected by disabled irq, but they are irqsafe)
	If read site function is interrupted at any point and reenters read site
again, reentranced read site will not be mislead by the first read site if the
reader counter > 0, in this case, it means currently frontend(this cpu lock of
lglock) or backend(fallback rwlock) lock is held, it is safe to act as
nested reader.
	if the reader counter=0, eentranced reader considers it is the
outmost read site, and it always successes after the write side release the lock.
(even the interrupted read-site has already hold the cpu lock of lglock
or the fallback_rwlock).
	And reentranced read site only calls arch_spin_trylock(), read_lock()
and __this_cpu_op(), arch_spin_trylock(), read_lock() is already reentrance-safe.
Although __this_cpu_op() is not reentrance-safe, but the value of the counter
will be restored after the interrupted finished, so read site functions
are still reentrance-safe.


Performance:
We only focus on the performance of the read site. this read site's fast path
is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(),
It has only one heavy memory operation. it will be expected fast.

We test three locks.
1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
2) this lock(lgrwlock)
3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186)

		nested=1(no nested)	nested=2	nested=4
opt-rwlock	 517181			1009200		2010027
lgrwlock	 452897			 700026		1201415
percpu-rwlock	1192955			1451343		1951757

The value is the time(nano-second) of 10000 times of the operations
{
	read-lock
	[nested read-lock]...
	[nested read-unlock]...
	read-unlock
}
If the way of this test is wrong nor bad, please correct me,
the code of test is here: https://gist.github.com/laijs/5066159
(i5 760, 64bit)

Changed from V1
 fix a reentrance-bug which is cought by Oleg
 don't touch lockdep for nested reader(needed by below change)
 add two special APIs for cpuhotplug.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/lglock.h |   38 ++++++++++++++++++++++++++
 kernel/lglock.c        |   68 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 0d24e93..90bfe79 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -67,4 +67,42 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+/* read-preference read-write-lock like rwlock but provides local-read-lock */
+struct lgrwlock {
+	unsigned long __percpu *reader_refcnt;
+	struct lglock lglock;
+	rwlock_t fallback_rwlock;
+};
+
+#define __DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
+	= __ARCH_SPIN_LOCK_UNLOCKED;					\
+	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);
+
+#define __LGRWLOCK_INIT(name)						\
+	{								\
+		.reader_refcnt = &name ## _refcnt,			\
+		.lglock = { .lock = &name ## _lock },			\
+		.fallback_rwlock = __RW_LOCK_UNLOCKED(name.fallback_rwlock)\
+	}
+
+#define DEFINE_LGRWLOCK(name)						\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+#define DEFINE_STATIC_LGRWLOCK(name)					\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
+{
+	lg_lock_init(&lgrw->lglock, name);
+}
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
+void __lg_rwlock_global_read_write_lock(struct lgrwlock *lgrw);
+void __lg_rwlock_global_read_write_unlock(struct lgrwlock *lgrw);
 #endif
diff --git a/kernel/lglock.c b/kernel/lglock.c
index 6535a66..52e9b2c 100644
--- a/kernel/lglock.c
+++ b/kernel/lglock.c
@@ -87,3 +87,71 @@ void lg_global_unlock(struct lglock *lg)
 	preempt_enable();
 }
 EXPORT_SYMBOL(lg_global_unlock);
+
+#define FALLBACK_BASE	(1UL << 30)
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+	struct lglock *lg = &lgrw->lglock;
+
+	preempt_disable();
+	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
+		rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+		if (unlikely(!arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
+			read_lock(&lgrw->fallback_rwlock);
+			__this_cpu_write(*lgrw->reader_refcnt, FALLBACK_BASE);
+			return;
+		}
+	}
+
+	__this_cpu_inc(*lgrw->reader_refcnt);
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+	switch (__this_cpu_read(*lgrw->reader_refcnt)) {
+	case 1:
+		__this_cpu_write(*lgrw->reader_refcnt, 0);
+		lg_local_unlock(&lgrw->lglock);
+		return;
+	case FALLBACK_BASE:
+		__this_cpu_write(*lgrw->reader_refcnt, 0);
+		read_unlock(&lgrw->fallback_rwlock);
+		rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+		break;
+	default:
+		__this_cpu_dec(*lgrw->reader_refcnt);
+		break;
+	}
+
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+	lg_global_lock(&lgrw->lglock);
+	write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+	write_unlock(&lgrw->fallback_rwlock);
+	lg_global_unlock(&lgrw->lglock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);
+
+/* special write-site APIs allolw nested reader in such write-site */
+void __lg_rwlock_global_read_write_lock(struct lgrwlock *lgrw)
+{
+	lg_rwlock_global_write_lock(lgrw);
+	__this_cpu_write(*lgrw->reader_refcnt, 1);
+}
+
+void __lg_rwlock_global_read_write_unlock(struct lgrwlock *lgrw)
+{
+	__this_cpu_write(*lgrw->reader_refcnt, 0);
+	lg_rwlock_global_write_unlock(lgrw);
+}
-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Michel Lespinasse @ 2013-03-02 12:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, Lai Jiangshan,
	netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
	akpm, linuxppc-dev
In-Reply-To: <20130301182854.GA3631@redhat.com>

On Sat, Mar 2, 2013 at 2:28 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
>
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

>From what I can see, my version used local_refcnt to count how many
reentrant locks are represented by the fastpath lglock spinlock; Lai's
version uses it to count how many reentrant locks are represented by
either the fastpath lglock spinlock or the global rwlock, with
FALLBACK_BASE being a bit thrown in so we can remember which of these
locks was acquired. My version would be slower if it needs to take the
slow path in a reentrant way, but I'm not sure it matters either :)

> Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
>
> Then irq does _read_unlock(), and
>
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +     switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +     case 0:
>> +             lg_local_unlock(&lgrw->lglock);
>> +             return;
>> +     case FALLBACK_BASE:
>> +             __this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +             read_unlock(&lgrw->fallback_rwlock);
>
> hits this case?
>
> Doesn't look right, but most probably I missed something.

Good catch. I think this is easily fixed by setting reader_refcn
directly to FALLBACK_BASE+1, instead of setting it to FALLBACK_BASE
and then incrementing it to FALLBACK_BASE+1.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply

* Re: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU API implementation.
From: Stuart Yoder @ 2013-03-01 23:27 UTC (permalink / raw)
  To: Varun Sethi
  Cc: Joerg Roedel, Stuart Yoder, linux-kernel, iommu, Scott Wood,
	linuxppc-dev
In-Reply-To: <1361191939-21260-7-git-send-email-Varun.Sethi@freescale.com>

On Mon, Feb 18, 2013 at 6:52 AM, Varun Sethi <Varun.Sethi@freescale.com> wrote:
[cut]
> +static phys_addr_t get_phys_addr(struct fsl_dma_domain *dma_domain, unsigned long iova)
> +{
> +       u32 win_cnt = dma_domain->win_cnt;
> +       struct dma_window *win_ptr =
> +                               &dma_domain->win_arr[0];
> +       struct iommu_domain_geometry *geom;
> +
> +       geom = &dma_domain->iommu_domain->geometry;
> +
> +       if (!win_cnt || !dma_domain->geom_size) {
> +               pr_err("Number of windows/geometry not configured for the domain\n");
> +               return 0;
> +       }
> +
> +       if (win_cnt > 1) {
> +               u64 subwin_size;
> +               unsigned long subwin_iova;
> +               u32 wnd;
> +
> +               subwin_size = dma_domain->geom_size >> ilog2(win_cnt);

Could it be just geom_size / win_cnt ??

> +               subwin_iova = iova & ~(subwin_size - 1);
> +               wnd = (subwin_iova - geom->aperture_start) >> ilog2(subwin_size);
> +               win_ptr = &dma_domain->win_arr[wnd];
> +       }
> +
> +       if (win_ptr->valid)
> +               return (win_ptr->paddr + (iova & (win_ptr->size - 1)));
> +
> +       return 0;
> +}
> +
> +static int map_liodn_subwins(int liodn, struct fsl_dma_domain *dma_domain)

Just call it map_subwins().  They are just sub windows, not "liodn sub windows".

[cut]

> +static int map_liodn_win(int liodn, struct fsl_dma_domain *dma_domain)

Call it map_win().

[cut]
> +static struct fsl_dma_domain *iommu_alloc_dma_domain(void)
> +{
> +       struct fsl_dma_domain *domain;
> +
> +       domain = kmem_cache_zalloc(fsl_pamu_domain_cache, GFP_KERNEL);
> +       if (!domain)
> +               return NULL;
> +
> +       domain->stash_id = ~(u32)0;
> +       domain->snoop_id = ~(u32)0;
> +       domain->win_cnt = max_subwindow_count;

To align with my previous comments on fsl_pamu.c, I think instead of referencing
a global variable (in fsl_pamu.c) you should be making an accessor API
call here to get
the max subwindow _count.

> +       domain->geom_size = 0;
> +
> +       INIT_LIST_HEAD(&domain->devices);
> +
> +       spin_lock_init(&domain->domain_lock);
> +
> +       return domain;
> +}
> +
> +static inline struct device_domain_info *find_domain(struct device *dev)
> +{
> +       return dev->archdata.iommu_domain;
> +}
> +
> +static void remove_domain_ref(struct device_domain_info *info, u32 win_cnt)
> +{
> +               list_del(&info->link);
> +               spin_lock(&iommu_lock);
> +               if (win_cnt)
> +                       pamu_free_subwins(info->liodn);
> +               pamu_disable_liodn(info->liodn);
> +               spin_unlock(&iommu_lock);
> +               spin_lock(&device_domain_lock);
> +               info->dev->archdata.iommu_domain = NULL;
> +               kmem_cache_free(iommu_devinfo_cache, info);
> +               spin_unlock(&device_domain_lock);
> +}

The above function is literally removing the _device_ reference from the domain.
The name implies that it is removing a "domain reference".   Suggestion is
to call it "remove_device_ref".

Also, the whitespace is messed up there.  You have 2 tabs instead of 1.

> +static void destroy_domain(struct fsl_dma_domain *dma_domain)
> +{
> +       struct device_domain_info *info;
> +
> +       /* Dissociate all the devices from this domain */
> +       while (!list_empty(&dma_domain->devices)) {
> +               info = list_entry(dma_domain->devices.next,
> +                       struct device_domain_info, link);
> +               remove_domain_ref(info, dma_domain->win_cnt);
> +       }
> +}

This function is removing all devices from a domain...maybe to be
consistent with my
suggestion below on detach_domain(), call this detach_all_devices().
 We have 2 functions
doing almost the same thing....one detaches a single device, one
detaches all devices.
The current names "destroy_domain" and "detach_domain" are not as clear to me.

> +static void detach_domain(struct device *dev, struct fsl_dma_domain *dma_domain)
> +{
> +       struct device_domain_info *info;
> +       struct list_head *entry, *tmp;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&dma_domain->domain_lock, flags);
> +       /* Remove the device from the domain device list */
> +       if (!list_empty(&dma_domain->devices)) {
> +               list_for_each_safe(entry, tmp, &dma_domain->devices) {
> +                       info = list_entry(entry, struct device_domain_info, link);
> +                       if (info->dev == dev)
> +                               remove_domain_ref(info, dma_domain->win_cnt);
> +               }
> +       }
> +       spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +}

This function is not "detaching a domain", but is detaching a device.
 Call it detach_device().

> +static void attach_domain(struct fsl_dma_domain *dma_domain, int liodn, struct device *dev)
> +{

Same thing here.   This is not attaching a domain, but attaching a
device.  Call it attach_device.

> +       struct device_domain_info *info, *old_domain_info;
> +
> +       spin_lock(&device_domain_lock);
> +       /*
> +        * Check here if the device is already attached to domain or not.
> +        * If the device is already attached to a domain detach it.
> +        */
> +       old_domain_info = find_domain(dev);
> +       if (old_domain_info && old_domain_info->domain != dma_domain) {
> +               spin_unlock(&device_domain_lock);
> +               detach_domain(dev, old_domain_info->domain);
> +               spin_lock(&device_domain_lock);
> +       }
> +
> +       info = kmem_cache_zalloc(iommu_devinfo_cache, GFP_KERNEL);
> +
> +       info->dev = dev;
> +       info->liodn = liodn;
> +       info->domain = dma_domain;
> +
> +       list_add(&info->link, &dma_domain->devices);
> +       /*
> +        * In case of devices with multiple LIODNs just store
> +        * the info for the first LIODN as all
> +        * LIODNs share the same domain
> +        */
> +       if (!old_domain_info)
> +               dev->archdata.iommu_domain = info;
> +       spin_unlock(&device_domain_lock);
> +
> +}
> +

[cut]
> +/* Configure geometry settings for all LIODNs associated with domain */
> +static int configure_domain(struct fsl_dma_domain *dma_domain,
> +                           struct iommu_domain_geometry *geom_attr,
> +                           u32 win_cnt)

This function is not configuring the iommu domain...which is a concept
in the Linux driver, it is taking the domain geometry and setting
up the PAMU tables for all LIODNs currently in the domain.

Maybe it would help if you used a prefix like "pamu" or "paact" to
identify functions that operate on the actual PAMU tables.

maybe:  pamu_set_domain_geometry()

> +{
> +       struct device_domain_info *info;
> +       int ret = 0;
> +
> +       if (!list_empty(&dma_domain->devices)) {
> +               list_for_each_entry(info, &dma_domain->devices, link) {
> +                       ret = configure_liodn(info->liodn, info->dev, dma_domain,
> +                                             geom_attr, win_cnt);

...and following the above naming convention, call this (?): pamu_set_liodn

> +                       if (ret)
> +                               break;
> +               }
> +       }
> +
> +       return ret;
> +}
> +

[cut]
> +static int fsl_pamu_window_enable(struct iommu_domain *domain, u32 wnd_nr,
> +                                 phys_addr_t paddr, u64 size, int iommu_prot)
> +{
> +       struct fsl_dma_domain *dma_domain = domain->priv;
> +       struct dma_window *wnd;
> +       int prot = 0;
> +       int ret;
> +       unsigned long flags;
> +       u64 win_size;
> +
> +       if (iommu_prot & IOMMU_READ)
> +               prot |= PAACE_AP_PERMS_QUERY;
> +       if (iommu_prot & IOMMU_WRITE)
> +               prot |= PAACE_AP_PERMS_UPDATE;
> +
> +       spin_lock_irqsave(&dma_domain->domain_lock, flags);
> +       if (!dma_domain->win_arr) {
> +               pr_err("Number of windows not configured\n");
> +               spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +               return -ENODEV;
> +       }
> +
> +       if (wnd_nr >= dma_domain->win_cnt) {
> +               pr_err("Invalid window index\n");
> +               spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +               return -EINVAL;
> +       }
> +
> +       win_size = dma_domain->geom_size >> ilog2(dma_domain->win_cnt);

Isn't size just geom_size / win_cnt.   Not sure why you do the ilog2()
and right shift...
We already validated that  win_cnt is power of 2 when it was set.

> +       if (size > win_size) {
> +               pr_err("Invalid window size \n");
> +               spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +               return -EINVAL;
> +       }
> +
> +       if (dma_domain->win_cnt == 1) {
> +               if (dma_domain->enabled) {
> +                       pr_err("Disable the window before updating the mapping\n");
> +                       spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +                       return -EBUSY;
> +               }
> +
> +               ret = check_size(size, domain->geometry.aperture_start);
> +               if (ret) {
> +                       pr_err("Aperture start not aligned to the size\n");
> +                       spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +                       return -EINVAL;
> +               }
> +       }

Why is win_cnt==1 a special case?    Would the geometry not have been verified
when it was originally set up?

Also, do you need a check here to verify if the geometry has been set.  Is it a
requirement to set the geometry prior to window enable?

> +       wnd = &dma_domain->win_arr[wnd_nr];
> +       if (!wnd->valid) {
> +               wnd->paddr = paddr;
> +               wnd->size = size;
> +               wnd->prot = prot;
> +
> +               ret = update_domain_mapping(dma_domain, wnd_nr);
> +               if (!ret) {
> +                       wnd->valid = 1;
> +                       dma_domain->mapped++;
> +               }
> +       } else {
> +               pr_err("Disable the window before updating the mapping\n");
> +               ret = -EBUSY;
> +       }
> +
> +       spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> +
> +       return ret;
> +}
> +
> +/*

[cut]
> +static int fsl_pamu_attach_device(struct iommu_domain *domain,
> +                                 struct device *dev)
> +{
> +       struct fsl_dma_domain *dma_domain = domain->priv;
> +       const u32 *prop;
> +       u32 prop_cnt;
> +       int len, ret = 0;
> +       struct pci_dev *pdev = NULL;
> +       struct pci_controller *pci_ctl;
> +
> +       /*
> +        * Hack to make attach device work for the PCI devices. Simply assign the
> +        * the LIODN for the PCI controller to the PCI device.
> +        */

Instead of "Simply assign...", perhaps say instead:  Use the LIODN for the
PCI controller when attaching a PCI device.

> +       if (dev->bus == &pci_bus_type) {
> +               pdev = to_pci_dev(dev);
> +               pci_ctl = pci_bus_to_host(pdev->bus);
> +               /*
> +                * make dev point to pci controller device
> +                * so we can get the LIODN programmed by
> +                * u-boot;
> +                */
> +               dev = pci_ctl->parent;
> +       }
> +
> +       prop = of_get_property(dev->of_node, "fsl,liodn", &len);

suggestion: name "prop" to be "liodn" and "prop_cnt" to be "liodn_cnt".  That
will be more clear.

> +       if (prop) {
> +               prop_cnt = len / sizeof(u32);
> +               ret = handle_attach_device(dma_domain, dev,
> +                                        prop, prop_cnt);
> +       } else {
> +               pr_err("missing fsl,liodn property at %s\n",
> +                         dev->of_node->full_name);
> +                       ret = -EINVAL;
> +       }
> +
> +       return ret;
> +}
> +
> +static void fsl_pamu_detach_device(struct iommu_domain *domain,
> +                                     struct device *dev)
> +{
> +       struct fsl_dma_domain *dma_domain = domain->priv;
> +       const u32 *prop;
> +       int len;
> +       struct pci_dev *pdev = NULL;
> +       struct pci_controller *pci_ctl;
> +
> +       /*
> +        * Hack to make detach device work for the PCI devices. Simply assign the
> +        * the LIODN for the PCI controller to the PCI device.
> +        */
> +       if (dev->bus == &pci_bus_type) {
> +               pdev = to_pci_dev(dev);
> +               pci_ctl = pci_bus_to_host(pdev->bus);
> +               /*
> +                * make dev point to pci controller device
> +                * so we can get the LIODN programmed by
> +                * u-boot;
> +                */
> +               dev = pci_ctl->parent;
> +       }
> +
> +       prop = of_get_property(dev->of_node, "fsl,liodn", &len);
> +       if (prop)
> +               detach_domain(dev, dma_domain);
> +       else
> +               pr_err("missing fsl,liodn property at %s\n",
> +                         dev->of_node->full_name);
> +}

I understand why you need the LIODN when attaching, but why do you get it
in the detatch function.   You are not using "prop" here.

[cut]
> +static  bool check_pci_ctl_endpt_part(struct pci_controller *pci_ctl)
> +{
> +       u32 version;
> +
> +       /* Check the PCI controller version number by readding BRR1 register */
> +       version = in_be32(pci_ctl->cfg_addr + (PCI_FSL_BRR1 >> 2));
> +       version &= PCI_FSL_BRR1_VER;
> +       /* If PCI controller version is >= 0x204 we can partition endpoints*/
> +       if (version >= 0x204)
> +               return 1;
> +
> +       return 0;
> +}

Can we just use the compatible string here to identify the different
kinds of PCI
controllers?   Reading the actual device registers is ugly right now because
you are #defining the BRR1 stuff in a generic powerpc header.

> +static int fsl_pamu_add_device(struct device *dev)
> +{
> +       struct iommu_group *group = NULL;
> +       struct pci_dev *pdev;
> +       struct pci_dev *bridge, *dma_pdev = NULL;
> +       struct pci_controller *pci_ctl;
> +       int ret;
> +
> +       /*
> +        * For platform devices we allocate a separate group for
> +        * each of the devices.
> +        */
> +       if (dev->bus == &pci_bus_type) {
> +               bool pci_endpt_part;

That variable name is not clear, the abbreviations are not natural.  I
would just
call it pci_endpoint_partitioning.

> +               pdev = to_pci_dev(dev);
> +               /* Don't create device groups for virtual PCI bridges */
> +               if (pdev->subordinate)
> +                       return 0;
> +
> +               pci_ctl = pci_bus_to_host(pdev->bus);
> +               pci_endpt_part = check_pci_ctl_endpt_part(pci_ctl);
> +               /* We can partition PCIe devices so assign device group to the device */
> +               if (pci_endpt_part) {
> +                       bridge = pci_find_upstream_pcie_bridge(pdev);
> +                       if (bridge) {
> +                               if (pci_is_pcie(bridge))
> +                                       dma_pdev = pci_get_domain_bus_and_slot(
> +                                                               pci_domain_nr(pdev->bus),
> +                                                               bridge->subordinate->number, 0);
> +                               if (!dma_pdev)
> +                                       dma_pdev = pci_dev_get(bridge);
> +                       } else
> +                               dma_pdev = pci_dev_get(pdev);
> +                       /* Account for quirked devices */
> +                       swap_pci_ref(&dma_pdev, pci_get_dma_source(dma_pdev));
> +                       group = get_device_iommu_group(&pdev->dev);
> +                       pci_dev_put(pdev);
> +                       /*
> +                        * PCIe controller is not a paritionable entity
> +                        * free the controller device iommu_group.
> +                        */
> +                       if (pci_ctl->parent->iommu_group)
> +                               iommu_group_remove_device(pci_ctl->parent);
> +               } else {
> +                       /*
> +                        * All devices connected to the controller will share the
> +                        * PCI controllers device group. If this is the first
> +                        * device to be probed for the pci controller, copy the
> +                        * device group information from the PCI controller device
> +                        * node and remove the PCI controller iommu group.
> +                        * For subsequent devices, the iommu group information can
> +                        * be obtained from sibling devices (i.e. from the bus_devices
> +                        * link list).
> +                        */
> +                       if (pci_ctl->parent->iommu_group) {
> +                               group = get_device_iommu_group(pci_ctl->parent);
> +                               iommu_group_remove_device(pci_ctl->parent);
> +                       } else {
> +                               /* check if this is the first device on the bus*/
> +                               if (pdev->bus_list.next == pdev->bus_list.prev) {
> +                                       struct pci_bus *bus = pdev->bus->parent;
> +                                       int found = 0;
> +                                       /* Traverese the parent bus list to get
> +                                        * pdev & dev for the sibling device.
> +                                        */
> +                                       while (bus) {
> +                                               if (!list_empty(&bus->devices)) {
> +                                                       pdev = container_of(bus->devices.next,
> +                                                                struct pci_dev, bus_list);
> +                                                       group = iommu_group_get(&pdev->dev);
> +                                                       /*
> +                                                        * All devices should be associated with
> +                                                        * a group.
> +                                                        */
> +                                                       if (group)
> +                                                               found = 1;
> +                                                       break;
> +                                               } else
> +                                                       bus = bus->parent;
> +                                       }
> +                                       if (!found)
> +                                               dev_err(&pdev->dev, "Failed to allocate group for the device\n");
> +                               } else {
> +                                       /*
> +                                        * Get the pdev & dev for the sibling device
> +                                        */
> +                                       pdev = container_of(pdev->bus_list.prev, struct pci_dev, bus_list);
> +                                       group = iommu_group_get(&pdev->dev);
> +                               }
> +                       }
> +               }
> +       } else
> +               group = get_device_iommu_group(dev);
> +
> +       if (IS_ERR(group))
> +               return PTR_ERR(group);
> +
> +       ret = iommu_group_add_device(group, dev);
> +
> +       iommu_group_put(group);
> +       return ret;
> +}
> +

Regards,
Stuart

^ permalink raw reply

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Srivatsa S. Bhat @ 2013-03-01 20:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, Lai Jiangshan, netdev, linux-kernel,
	vincent.guittot, sbw, akpm, linuxppc-dev
In-Reply-To: <20130301175349.GC2481@mtj.dyndns.org>

On 03/01/2013 11:23 PM, Tejun Heo wrote:
> Hey, guys and Oleg (yes, I'm singling you out ;p because you're that
> awesome.)
> 
> On Sat, Mar 02, 2013 at 01:44:02AM +0800, Lai Jiangshan wrote:
>> Performance:
>> We only focus on the performance of the read site. this read site's fast path
>> is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(),
>> It has only one heavy memory operation. it will be expected fast.
>>
>> We test three locks.
>> 1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
>> 2) this lock(lgrwlock)
>> 3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
>>    (https://lkml.org/lkml/2013/2/18/186)
>>
>> 		nested=1(no nested)	nested=2	nested=4
>> opt-rwlock	 517181			1009200		2010027
>> lgrwlock	 452897			 700026		1201415
>> percpu-rwlock	1192955			1451343		1951757
> 
> On the first glance, the numbers look pretty good and I kinda really
> like the fact that if this works out we don't have to introduce yet
> another percpu synchronization construct and get to reuse lglock.
> 
> So, Oleg, can you please see whether you can find holes in this one?
> 
> Srivatsa, I know you spent a lot of time on percpu_rwlock but as you
> wrote before Lai's work can be seen as continuation of yours, and if
> we get to extend what's already there instead of introducing something
> completely new, there's no reason not to

Yep, I agree!

> (and my apologies for not
> noticing the possibility of extending lglock before).

No problem at all! You gave so many invaluable suggestions to make this
whole thing work in the first place! Now, if we can reuse the existing
stuff and extend it to what we want, then its just even better! :-)

>  So, if this can
> work, it would be awesome if you guys can work together.

Absolutely!

>  Lai might
> not be very good at communicating in english yet but he's really good
> at spotting patterns in complex code and playing with them.
>

That sounds great! :-)

I'll soon take a closer look at his code and the comparisons he posted,
and work towards taking this effort forward.

Thank you very much!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-03-01 19:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, linux-kernel, vincent.guittot,
	sbw, akpm, linuxppc-dev
In-Reply-To: <20130301181054.GD2481@mtj.dyndns.org>

Hi Tejun,

On 03/01/2013 11:40 PM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Thu, Feb 28, 2013 at 02:49:53AM +0530, Srivatsa S. Bhat wrote:
>> Don't get me wrong - I'll whole-heartedly acknowledge and appreciate if
>> _your_ code is better than mine. I just don't like the idea of somebody
>> plagiarizing my ideas/code (or even others' ideas for that matter).
>> However, I sincerely apologize in advance if I misunderstood/misjudged your
>> intentions; I just wanted to voice my concerns out loud at this point,
>> considering the bad feeling I got by looking at your responses collectively.
> 
> Although I don't know Lai personally, from my experience working with
> him past several months on workqueue, I strongly doubt he has dark
> urterior intenions.  The biggest problem probably is that his
> communication in English isn't very fluent yet and thus doesn't carry
> the underlying intentions or tones very well and he tends to bombard
> patches in quick succession without much explanation inbetween (during
> 3.8 cycle, I was receiving several workqueue patchsets days apart
> trying to solve the same problem with completely different approaches
> way before the discussion on the previous postings reach any kind of
> consensus).  A lot of that also probably comes from communication
> problems.
> 
> Anyways, I really don't think he's trying to take claim of your work.
> At least for now, you kinda need to speculate what he's trying to do
> and then confirm that back to him, but once you get used to that part,
> he's pretty nice to work with and I'm sure his communication will
> improve in time (I think it has gotten a lot better than only some
> months ago).
>

Thank you so much for helping clear the air, Tejun! I'm truly sorry again
for having drawn the wrong conclusions based on gaps in our communication.
It was a mistake on my part and I would like to request Lai to kindly
forgive me. I'm looking forward to working with Lai effectively, going
forward.

Thank you!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-03-01 19:47 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, linux-kernel, vincent.guittot,
	sbw, tj, akpm, linuxppc-dev
In-Reply-To: <5130EA6B.6030901@cn.fujitsu.com>

On 03/01/2013 11:20 PM, Lai Jiangshan wrote:
> On 28/02/13 05:19, Srivatsa S. Bhat wrote:
>> On 02/27/2013 06:03 AM, Lai Jiangshan wrote:
>>> On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>> On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
>>>>> On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>>
>>>>>> Hi Lai,
>>>>>>
>>>>>> I'm really not convinced that piggy-backing on lglocks would help
>>>>>> us in any way. But still, let me try to address some of the points
>>>>>> you raised...
>>>>>>
>>>>>> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>>>>>>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>>>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>>>>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>>>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>>>>>> Hi Lai,
>>>>>>>>>>
>>>>>>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>>>>>>> Hi, Srivatsa,
>>>>>>>>>>>
>>>>>>>>>>> The target of the whole patchset is nice for me.
>>>>>>>>>>
>>>>>>>>>> Cool! Thanks :-)
>>>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>>>>>>> writer and the reader both increment the same counters. So how will the
>>>>>>>> unlock() code in the reader path know when to unlock which of the locks?
>>>>>>>
>>>>>>> The same as your code, the reader(which nested in write C.S.) just dec
>>>>>>> the counters.
>>>>>>
>>>>>> And that works fine in my case because the writer and the reader update
>>>>>> _two_ _different_ counters.
>>>>>
>>>>> I can't find any magic in your code, they are the same counter.
>>>>>
>>>>>         /*
>>>>>          * It is desirable to allow the writer to acquire the percpu-rwlock
>>>>>          * for read (if necessary), without deadlocking or getting complaints
>>>>>          * from lockdep. To achieve that, just increment the reader_refcnt of
>>>>>          * this CPU - that way, any attempt by the writer to acquire the
>>>>>          * percpu-rwlock for read, will get treated as a case of nested percpu
>>>>>          * reader, which is safe, from a locking perspective.
>>>>>          */
>>>>>         this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
>>>>>
>>>>
>>>> Whoa! Hold on, were you really referring to _this_ increment when you said
>>>> that, in your patch you would increment the refcnt at the writer? Then I guess
>>>> there is a major disconnect in our conversations. (I had assumed that you were
>>>> referring to the update of writer_signal, and were just trying to have a single
>>>> refcnt instead of reader_refcnt and writer_signal).
>>>
>>> https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d
>>>
>>> Sorry the name "fallback_reader_refcnt" misled you.
>>>
>> [...]
>>
>>>>> All I was considered is "nested reader is seldom", so I always
>>>>> fallback to rwlock when nested.
>>>>> If you like, I can add 6 lines of code, the overhead is
>>>>> 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()
>>>>>
>>>>
>>>> I'm assuming that calculation is no longer valid, considering that
>>>> we just discussed how the per-cpu refcnt that you were using is quite
>>>> unnecessary and can be removed.
>>>>
>>>> IIUC, the overhead with your code, as per above discussion would be:
>>>> 1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).
>>>
>>> https://github.com/laijs/linux/commit/46334544bb7961550b7065e015da76f6dab21f16
>>>
>>> Again, I'm so sorry the name "fallback_reader_refcnt" misled you.
>>>
>>
>> At this juncture I really have to admit that I don't understand your
>> intentions at all. What are you really trying to prove? Without giving
>> a single good reason why my code is inferior, why are you even bringing
>> up the discussion about a complete rewrite of the synchronization code?
>> http://article.gmane.org/gmane.linux.kernel.cross-arch/17103
>> http://article.gmane.org/gmane.linux.power-management.general/31345
>>
>> I'm beginning to add 2 + 2 together based on the kinds of questions you
>> have been asking...
>>
>> You posted a patch in this thread and started a discussion around it without
>> even establishing a strong reason to do so. Now you point me to your git
>> tree where your patches have even more traces of ideas being borrowed from
>> my patchset (apart from my own ideas/code, there are traces of others' ideas
>> being borrowed too - for example, it was Oleg who originally proposed the
>> idea of splitting up the counter into 2 parts and I'm seeing that it is
>> slowly crawling into your code with no sign of appropriate credits).
>> http://article.gmane.org/gmane.linux.network/260288
>>
>> And in reply to my mail pointing out the performance implications of the
>> global read_lock at the reader side in your code, you said you'll come up
>> with a comparison between that and my patchset.
>> http://article.gmane.org/gmane.linux.network/260288
>> The issue has been well-documented in my patch description of patch 4.
>> http://article.gmane.org/gmane.linux.kernel/1443258
>>
>> Are you really trying to pit bits and pieces of my own ideas/versions
>> against one another and claiming them as your own?
>>
>> You projected the work involved in handling the locking issues pertaining
>> to CPU_DYING notifiers etc as a TODO, despite the fact that I had explicitly
>> noted in my cover letter that I had audited and taken care of all of them.
>> http://article.gmane.org/gmane.linux.documentation/9727
>> http://article.gmane.org/gmane.linux.documentation/9520
>>
>> You failed to acknowledge (on purpose?) that I had done a tree-wide
>> conversion despite the fact that you were replying to the very thread which
>> had the 46 patches which did exactly that (and I had also mentioned it
>> explicitly in my cover letter).
>> http://article.gmane.org/gmane.linux.documentation/9727
>> http://article.gmane.org/gmane.linux.documentation/9520
>>
>> You then started probing more and more about the technique I used to do
>> the tree-wide conversion.
>> http://article.gmane.org/gmane.linux.kernel.cross-arch/17111
>>
>> You also retorted saying you did go through my patch descriptions, so
>> its not like you have missed reading them.
>> http://article.gmane.org/gmane.linux.power-management.general/31345
>>
>> Each of these when considered individually, might appear like innocuous and
>> honest attempts at evaluating my code. But when put together, I'm beginning
>> to sense a whole different angle to it altogether, as if you are trying
>> to spin your own patch series, complete with the locking framework _and_
>> the tree-wide conversion, heavily borrowed from mine. At the beginning of
>> this discussion, I predicted that the lglock version that you are proposing
>> would end up being either less efficient than my version or look very similar
>> to my version. http://article.gmane.org/gmane.linux.kernel/1447139
>>
>> I thought it was just the former till now, but its not hard to see how it
>> is getting closer to becoming the latter too. So yeah, I'm not amused.
>>
>> Maybe (and hopefully) you are just trying out different ideas on your own,
>> and I'm just being paranoid. I really hope that is the case. If you are just
>> trying to review my code, then please stop sending patches with borrowed ideas
>> with your sole Signed-off-by, and purposefully ignoring the work already done
>> in my patchset, because it is really starting to look suspicious, at least
>> to me.
>>
>> Don't get me wrong - I'll whole-heartedly acknowledge and appreciate if
>> _your_ code is better than mine. I just don't like the idea of somebody
>> plagiarizing my ideas/code (or even others' ideas for that matter).
>> However, I sincerely apologize in advance if I misunderstood/misjudged your
>> intentions; I just wanted to voice my concerns out loud at this point,
>> considering the bad feeling I got by looking at your responses collectively.
>>
> 
> Hi, Srivatsa
> 
> I'm sorry, big apology to you.
> I'm bad in communication and I did be wrong.
> I tended to improve the codes but in false direction.
> 

OK, in that case, I'm extremely sorry too, for jumping on you like that.
I hope you'll forgive me for the uneasiness it caused.

Now that I understand that you were simply trying to help, I would like to
express my gratitude for your time, effort and inputs in improving the design
of the stop-machine replacement.

I'm looking forward to working with you on this as well as future endeavours,
so I sincerely hope that we can put this unfortunate incident behind us and
collaborate effectively with renewed mutual trust and good-will.

Thank you very much!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Oleg Nesterov @ 2013-03-01 18:28 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, Michel Lespinasse,
	mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
	akpm, linuxppc-dev
In-Reply-To: <5130E8E2.50206@cn.fujitsu.com>

Lai, I didn't read this discussion except the code posted by Michel.
I'll try to read this patch carefully later, but I'd like to ask
a couple of questions.

This version looks more complex than Michel's, why? Just curious, I
am trying to understand what I missed. See
http://marc.info/?l=linux-kernel&m=136196350213593

And I can't understand FALLBACK_BASE...

OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.

CPU_1 does _read_lock(), and ...

> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
> +{
> +	struct lglock *lg = &lgrw->lglock;
> +
> +	preempt_disable();
> +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
> +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {

_trylock() fails,

> +			read_lock(&lgrw->fallback_rwlock);
> +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);

so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.

CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().

Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.

Then irq does _read_unlock(), and

> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
> +{
> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
> +	case 0:
> +		lg_local_unlock(&lgrw->lglock);
> +		return;
> +	case FALLBACK_BASE:
> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
> +		read_unlock(&lgrw->fallback_rwlock);

hits this case?

Doesn't look right, but most probably I missed something.

Oleg.

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Tejun Heo @ 2013-03-01 18:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, linux-kernel, vincent.guittot,
	sbw, akpm, linuxppc-dev
In-Reply-To: <512E7879.20109@linux.vnet.ibm.com>

Hello, Srivatsa.

On Thu, Feb 28, 2013 at 02:49:53AM +0530, Srivatsa S. Bhat wrote:
> Don't get me wrong - I'll whole-heartedly acknowledge and appreciate if
> _your_ code is better than mine. I just don't like the idea of somebody
> plagiarizing my ideas/code (or even others' ideas for that matter).
> However, I sincerely apologize in advance if I misunderstood/misjudged your
> intentions; I just wanted to voice my concerns out loud at this point,
> considering the bad feeling I got by looking at your responses collectively.

Although I don't know Lai personally, from my experience working with
him past several months on workqueue, I strongly doubt he has dark
urterior intenions.  The biggest problem probably is that his
communication in English isn't very fluent yet and thus doesn't carry
the underlying intentions or tones very well and he tends to bombard
patches in quick succession without much explanation inbetween (during
3.8 cycle, I was receiving several workqueue patchsets days apart
trying to solve the same problem with completely different approaches
way before the discussion on the previous postings reach any kind of
consensus).  A lot of that also probably comes from communication
problems.

Anyways, I really don't think he's trying to take claim of your work.
At least for now, you kinda need to speculate what he's trying to do
and then confirm that back to him, but once you get used to that part,
he's pretty nice to work with and I'm sure his communication will
improve in time (I think it has gotten a lot better than only some
months ago).

Thanks!

-- 
tejun

^ permalink raw reply

* Re: [PATCH] lglock: add read-preference local-global rwlock
From: Tejun Heo @ 2013-03-01 17:53 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, linux-kernel, vincent.guittot,
	sbw, Srivatsa S. Bhat, akpm, linuxppc-dev
In-Reply-To: <5130E8E2.50206@cn.fujitsu.com>

Hey, guys and Oleg (yes, I'm singling you out ;p because you're that
awesome.)

On Sat, Mar 02, 2013 at 01:44:02AM +0800, Lai Jiangshan wrote:
> Performance:
> We only focus on the performance of the read site. this read site's fast path
> is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(),
> It has only one heavy memory operation. it will be expected fast.
> 
> We test three locks.
> 1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
> 2) this lock(lgrwlock)
> 3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
>    (https://lkml.org/lkml/2013/2/18/186)
> 
> 		nested=1(no nested)	nested=2	nested=4
> opt-rwlock	 517181			1009200		2010027
> lgrwlock	 452897			 700026		1201415
> percpu-rwlock	1192955			1451343		1951757

On the first glance, the numbers look pretty good and I kinda really
like the fact that if this works out we don't have to introduce yet
another percpu synchronization construct and get to reuse lglock.

So, Oleg, can you please see whether you can find holes in this one?

Srivatsa, I know you spent a lot of time on percpu_rwlock but as you
wrote before Lai's work can be seen as continuation of yours, and if
we get to extend what's already there instead of introducing something
completely new, there's no reason not to (and my apologies for not
noticing the possibility of extending lglock before).  So, if this can
work, it would be awesome if you guys can work together.  Lai might
not be very good at communicating in english yet but he's really good
at spotting patterns in complex code and playing with them.

Thanks!

-- 
tejun

^ permalink raw reply

* Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Lai Jiangshan @ 2013-03-01 17:50 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, linux-kernel, vincent.guittot,
	sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512E7879.20109@linux.vnet.ibm.com>

On 28/02/13 05:19, Srivatsa S. Bhat wrote:
> On 02/27/2013 06:03 AM, Lai Jiangshan wrote:
>> On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
>>>> On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>
>>>>> Hi Lai,
>>>>>
>>>>> I'm really not convinced that piggy-backing on lglocks would help
>>>>> us in any way. But still, let me try to address some of the points
>>>>> you raised...
>>>>>
>>>>> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>>>>>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>>>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>>>>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>>>>>>> Hi Lai,
>>>>>>>>>
>>>>>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>>>>>> Hi, Srivatsa,
>>>>>>>>>>
>>>>>>>>>> The target of the whole patchset is nice for me.
>>>>>>>>>
>>>>>>>>> Cool! Thanks :-)
>>>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>>>>>> writer and the reader both increment the same counters. So how will the
>>>>>>> unlock() code in the reader path know when to unlock which of the locks?
>>>>>>
>>>>>> The same as your code, the reader(which nested in write C.S.) just dec
>>>>>> the counters.
>>>>>
>>>>> And that works fine in my case because the writer and the reader update
>>>>> _two_ _different_ counters.
>>>>
>>>> I can't find any magic in your code, they are the same counter.
>>>>
>>>>         /*
>>>>          * It is desirable to allow the writer to acquire the percpu-rwlock
>>>>          * for read (if necessary), without deadlocking or getting complaints
>>>>          * from lockdep. To achieve that, just increment the reader_refcnt of
>>>>          * this CPU - that way, any attempt by the writer to acquire the
>>>>          * percpu-rwlock for read, will get treated as a case of nested percpu
>>>>          * reader, which is safe, from a locking perspective.
>>>>          */
>>>>         this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
>>>>
>>>
>>> Whoa! Hold on, were you really referring to _this_ increment when you said
>>> that, in your patch you would increment the refcnt at the writer? Then I guess
>>> there is a major disconnect in our conversations. (I had assumed that you were
>>> referring to the update of writer_signal, and were just trying to have a single
>>> refcnt instead of reader_refcnt and writer_signal).
>>
>> https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d
>>
>> Sorry the name "fallback_reader_refcnt" misled you.
>>
> [...]
> 
>>>> All I was considered is "nested reader is seldom", so I always
>>>> fallback to rwlock when nested.
>>>> If you like, I can add 6 lines of code, the overhead is
>>>> 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()
>>>>
>>>
>>> I'm assuming that calculation is no longer valid, considering that
>>> we just discussed how the per-cpu refcnt that you were using is quite
>>> unnecessary and can be removed.
>>>
>>> IIUC, the overhead with your code, as per above discussion would be:
>>> 1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).
>>
>> https://github.com/laijs/linux/commit/46334544bb7961550b7065e015da76f6dab21f16
>>
>> Again, I'm so sorry the name "fallback_reader_refcnt" misled you.
>>
> 
> At this juncture I really have to admit that I don't understand your
> intentions at all. What are you really trying to prove? Without giving
> a single good reason why my code is inferior, why are you even bringing
> up the discussion about a complete rewrite of the synchronization code?
> http://article.gmane.org/gmane.linux.kernel.cross-arch/17103
> http://article.gmane.org/gmane.linux.power-management.general/31345
> 
> I'm beginning to add 2 + 2 together based on the kinds of questions you
> have been asking...
> 
> You posted a patch in this thread and started a discussion around it without
> even establishing a strong reason to do so. Now you point me to your git
> tree where your patches have even more traces of ideas being borrowed from
> my patchset (apart from my own ideas/code, there are traces of others' ideas
> being borrowed too - for example, it was Oleg who originally proposed the
> idea of splitting up the counter into 2 parts and I'm seeing that it is
> slowly crawling into your code with no sign of appropriate credits).
> http://article.gmane.org/gmane.linux.network/260288
> 
> And in reply to my mail pointing out the performance implications of the
> global read_lock at the reader side in your code, you said you'll come up
> with a comparison between that and my patchset.
> http://article.gmane.org/gmane.linux.network/260288
> The issue has been well-documented in my patch description of patch 4.
> http://article.gmane.org/gmane.linux.kernel/1443258
> 
> Are you really trying to pit bits and pieces of my own ideas/versions
> against one another and claiming them as your own?
> 
> You projected the work involved in handling the locking issues pertaining
> to CPU_DYING notifiers etc as a TODO, despite the fact that I had explicitly
> noted in my cover letter that I had audited and taken care of all of them.
> http://article.gmane.org/gmane.linux.documentation/9727
> http://article.gmane.org/gmane.linux.documentation/9520
> 
> You failed to acknowledge (on purpose?) that I had done a tree-wide
> conversion despite the fact that you were replying to the very thread which
> had the 46 patches which did exactly that (and I had also mentioned it
> explicitly in my cover letter).
> http://article.gmane.org/gmane.linux.documentation/9727
> http://article.gmane.org/gmane.linux.documentation/9520
> 
> You then started probing more and more about the technique I used to do
> the tree-wide conversion.
> http://article.gmane.org/gmane.linux.kernel.cross-arch/17111
> 
> You also retorted saying you did go through my patch descriptions, so
> its not like you have missed reading them.
> http://article.gmane.org/gmane.linux.power-management.general/31345
> 
> Each of these when considered individually, might appear like innocuous and
> honest attempts at evaluating my code. But when put together, I'm beginning
> to sense a whole different angle to it altogether, as if you are trying
> to spin your own patch series, complete with the locking framework _and_
> the tree-wide conversion, heavily borrowed from mine. At the beginning of
> this discussion, I predicted that the lglock version that you are proposing
> would end up being either less efficient than my version or look very similar
> to my version. http://article.gmane.org/gmane.linux.kernel/1447139
> 
> I thought it was just the former till now, but its not hard to see how it
> is getting closer to becoming the latter too. So yeah, I'm not amused.
> 
> Maybe (and hopefully) you are just trying out different ideas on your own,
> and I'm just being paranoid. I really hope that is the case. If you are just
> trying to review my code, then please stop sending patches with borrowed ideas
> with your sole Signed-off-by, and purposefully ignoring the work already done
> in my patchset, because it is really starting to look suspicious, at least
> to me.
> 
> Don't get me wrong - I'll whole-heartedly acknowledge and appreciate if
> _your_ code is better than mine. I just don't like the idea of somebody
> plagiarizing my ideas/code (or even others' ideas for that matter).
> However, I sincerely apologize in advance if I misunderstood/misjudged your
> intentions; I just wanted to voice my concerns out loud at this point,
> considering the bad feeling I got by looking at your responses collectively.
> 

Hi, Srivatsa

I'm sorry, big apology to you.
I'm bad in communication and I did be wrong.
I tended to improve the codes but in false direction.

Thanks,
Lai

^ permalink raw reply

* [PATCH] lglock: add read-preference local-global rwlock
From: Lai Jiangshan @ 2013-03-01 17:44 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, oleg,
	Michel Lespinasse, mingo, linux-arch, linux, xiaoguangrong,
	wangyun, paulmck, nikunj, linux-pm, rusty, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, linux-kernel, vincent.guittot,
	sbw, tj, akpm, linuxppc-dev
In-Reply-To: <512E7879.20109@linux.vnet.ibm.com>

>From c63f2be9a4cf7106a521dda169a0e14f8e4f7e3b Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Mon, 25 Feb 2013 23:14:27 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

Current lglock is not read-preference, so it can't be used on some cases
which read-preference rwlock can do. Example, get_cpu_online_atomic().

Although we can use rwlock for these cases which needs read-preference.
but it leads to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly. It will be
worse when we have a lot of CPUs, it is not scale.

So we look forward to lglock. lglock is read-write-lock based on percpu locks,
but it is not read-preference due to its underlining percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

         CPU 0                                CPU 1
         ------                               ------
1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);

Writer:
         CPU 2:
         ------
      for_each_online_cpu(cpu)
        write_lock(my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


(holds              Waiting for
 random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
                                               for write)
               ^                   |
               |                   |
        Waiting|                   | Waiting
          for  |                   |  for
               |                   V
                ------ CPU 1 <------

                (holds my_rwlock of
                 CPU 1 for read)


So obviously this "straight-forward" way of implementing percpu rwlocks is
deadlock-prone. So we can't implement read-preference local-global rwlock
like this.


The implement of this patch reuse current lglock as frontend to achieve
local-read-lock, and reuse global fallback rwlock as backend to achieve
read-preference, and use percpu reader counter to indicate 1) the depth
of the nested reader lockes and 2) whether the outmost lock is percpu lock
or fallback rwlock.

The algorithm is simple, in the read site:
If it is nested reader, just increase the counter
If it is the outmost reader,
	1) try to lock its cpu's lock of the frontend lglock.
	   (reader count +=1 if success)
	2) if the above step fails, read_lock(&fallback_rwlock).
	   (reader count += FALLBACK_BASE + 1)

Write site:
Do the lg_global_lock() of the frontend lglock.
And then write_lock(&fallback_rwlock).


Prof:
1) reader-writer exclusive:
The two steps of write site finished, no reader. Vice-verse.

2) read-preference:
before write site lock finished acquired, read site at least
wins at read_lock(&fallback_rwlock) for rwlock is read-preference.
read-preference also implies nestable.

3) read site functions are irqsafe(reentrance-safe)
	If read site function is interrupted at any point and reenters read site
again, reentranced read site will not be mislead by the first read site if the
reader counter > 0, in this case, it means currently frontend(this cpu lock of
lglock) or backend(fallback rwlock) lock is held, it is safe to act as
nested reader.
	if the reader counter=0, eentranced reader considers it is the
outmost read site, and it always successes after the write side release the lock.
(even the interrupted read-site has already hold the cpu lock of lglock
or the fallback_rwlock).
	And reentranced read site only calls arch_spin_trylock(), read_lock()
and __this_cpu_op(), arch_spin_trylock(), read_lock() is already reentrance-safe.
Although __this_cpu_op() is not reentrance-safe, but the value of the counter
will be restored after the interrupted finished, so read site functions
are still reentrance-safe.


Performance:
We only focus on the performance of the read site. this read site's fast path
is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(),
It has only one heavy memory operation. it will be expected fast.

We test three locks.
1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
2) this lock(lgrwlock)
3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186)

		nested=1(no nested)	nested=2	nested=4
opt-rwlock	 517181			1009200		2010027
lgrwlock	 452897			 700026		1201415
percpu-rwlock	1192955			1451343		1951757

The value is the time(nano-second) of 10000 times of the operations:
{
	read-lock
	[nested read-lock]...
	[nested read-unlock]...
	read-unlock
}
If the way of this test is wrong nor bad, please correct me,
the code of test is here: https://gist.github.com/laijs/5066159
(i5 760, 64bit)

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/lglock.h |   35 ++++++++++++++++++++++++++++++++
 kernel/lglock.c        |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 0d24e93..afb22bd 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -67,4 +67,39 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+struct lgrwlock {
+	unsigned long __percpu *reader_refcnt;
+	struct lglock lglock;
+	rwlock_t fallback_rwlock;
+};
+
+#define __DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
+	= __ARCH_SPIN_LOCK_UNLOCKED;					\
+	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);
+
+#define __LGRWLOCK_INIT(name)						\
+	{								\
+		.reader_refcnt = &name ## _refcnt,			\
+		.lglock = { .lock = &name ## _lock },			\
+		.fallback_rwlock = __RW_LOCK_UNLOCKED(name.fallback_rwlock)\
+	}
+
+#define DEFINE_LGRWLOCK(name)						\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+#define DEFINE_STATIC_LGRWLOCK(name)					\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
+{
+	lg_lock_init(&lgrw->lglock, name);
+}
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
 #endif
diff --git a/kernel/lglock.c b/kernel/lglock.c
index 6535a66..6c60cf7 100644
--- a/kernel/lglock.c
+++ b/kernel/lglock.c
@@ -87,3 +87,55 @@ void lg_global_unlock(struct lglock *lg)
 	preempt_enable();
 }
 EXPORT_SYMBOL(lg_global_unlock);
+
+#define FALLBACK_BASE	(1UL << 30)
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+	struct lglock *lg = &lgrw->lglock;
+
+	preempt_disable();
+	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
+		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
+			read_lock(&lgrw->fallback_rwlock);
+			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
+		}
+	}
+
+	__this_cpu_inc(*lgrw->reader_refcnt);
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
+	case 0:
+		lg_local_unlock(&lgrw->lglock);
+		return;
+	case FALLBACK_BASE:
+		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
+		read_unlock(&lgrw->fallback_rwlock);
+		break;
+	default:
+		break;
+	}
+
+	rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+	lg_global_lock(&lgrw->lglock);
+	write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+	write_unlock(&lgrw->fallback_rwlock);
+	lg_global_unlock(&lgrw->lglock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);
-- 
1.7.7.6

^ permalink raw reply related

* RE: [PATCH 1/6 v8] iommu/fsl: Store iommu domain information pointer in archdata.
From: Yoder Stuart-B08248 @ 2013-03-01 16:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Sethi Varun-B16395
  Cc: Wood Scott-B07421, Alex Williamson, Joerg Roedel,
	linux-kernel@vger.kernel.org list,
	iommu@lists.linux-foundation.org, Paul Mackerras,
	linuxppc-dev@lists.ozlabs.org list, David Gibson
In-Reply-To: <51307DCE.1050506@ozlabs.ru>

> -----Original Message-----
> From: Alexey Kardashevskiy [mailto:aik@ozlabs.ru]
> Sent: Friday, March 01, 2013 4:07 AM
> To: Sethi Varun-B16395
> Cc: Kumar Gala; Benjamin Herrenschmidt; iommu@lists.linux-foundation.org;=
 linuxppc-dev@lists.ozlabs.org
> list; linux-kernel@vger.kernel.org list; Wood Scott-B07421; Yoder Stuart-=
B08248; Joerg Roedel; Paul
> Mackerras; David Gibson; Alex Williamson
> Subject: Re: [PATCH 1/6 v8] iommu/fsl: Store iommu domain information poi=
nter in archdata.
>=20
> btw the device struct already has a pointer to its iommu_group, and the
> iommu_group struct itself has a pointer void *iommu_data which you could
> use for anything you want (iommu_group_get_iommudata(),
> iommu_group_set_iommudata()).
>
> By design you are expected to add iommu groups to a domain but not device=
s
> so I am not so sure that you really need a pointer to domain in the devic=
e
> struct.

Well, at the lowest level the IOMMU API does attach devices to domains-- i.=
e.
API attach_dev().  So, it seems to conceptually make sense to have a ptr
from the device to the associated domain.  When you implement attach_dev()
you need to be able to see whether the device is already attached to
a domain.   Adding a couple of levels of indirection...from device to
group to domain...doesn't seems to make things simpler or better IMHO.

x86 keeps a pointer to the domain in device archdata and since there is
a direct correlation between a device and domain I'd rather see it
where this patch has it.

Stuart

^ permalink raw reply

* [PATCH 14/22] powerpc/eeh: I/O chip EEH enable option
From: Gavin Shan @ 2013-03-01 14:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1362147440-14096-1-git-send-email-shangw@linux.vnet.ibm.com>

The patch adds the backend to enable or disable EEH functionality
for the specified PE. The backend is also used to enable MMIO or
DMA path for the problematic PE. It's notable that all PEs on
PowerNV platform support EEH functionality by default, and we
disallow to disable EEH for the specific PE.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-ioda.c |   65 ++++++++++++++++++++++++++++-
 1 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 38c1efe..1b5ddf4 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -59,9 +59,72 @@ static int ioda_eeh_post_init(struct pci_controller *hose)
 	return 0;
 }
 
+/**
+ * ioda_eeh_set_option - Set EEH operation or I/O setting
+ * @pe: EEH PE
+ * @option: options
+ *
+ * Enable or disable EEH option for the indicated PE. The
+ * function also can be used to enable I/O or DMA for the
+ * PE.
+ */
+static int ioda_eeh_set_option(struct eeh_pe *pe, int option)
+{
+	s64 ret;
+	u32 pe_no;
+	struct pci_controller *hose = pe->phb;
+	struct pnv_phb *phb = hose->private_data;
+
+	/* Check on PE number */
+	if (pe->addr < 0 || pe->addr >= phb->ioda.total_pe) {
+		pr_err("%s: PE address %x out of range [0, %x] "
+		       "on PHB#%x\n",
+			__func__, pe->addr, phb->ioda.total_pe,
+			hose->global_number);
+		return -EINVAL;
+	}
+
+	pe_no = pe->addr;
+	switch (option) {
+	case EEH_OPT_DISABLE:
+		ret = -EEXIST;
+		break;
+	case EEH_OPT_ENABLE:
+		ret = 0;
+		break;
+	case EEH_OPT_THAW_MMIO:
+		ret = opal_pci_eeh_freeze_clear(phb->opal_id, pe_no,
+				OPAL_EEH_ACTION_CLEAR_FREEZE_MMIO);
+		if (ret) {
+			pr_warning("%s: Failed to enable MMIO for "
+				   "PHB#%x-PE#%x, err=%lld\n",
+				__func__, hose->global_number, pe_no, ret);
+			return -EIO;
+		}
+
+		break;
+	case EEH_OPT_THAW_DMA:
+		ret = opal_pci_eeh_freeze_clear(phb->opal_id, pe_no,
+				OPAL_EEH_ACTION_CLEAR_FREEZE_DMA);
+		if (ret) {
+			pr_warning("%s: Failed to enable DMA for "
+				   "PHB#%x-PE#%x, err=%lld\n",
+				__func__, hose->global_number, pe_no, ret);
+			return -EIO;
+		}
+
+		break;
+	default:
+		pr_warning("%s: Invalid option %d\n", __func__, option);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
 struct pnv_eeh_ops ioda_eeh_ops = {
 	.post_init		= ioda_eeh_post_init,
-	.set_option		= NULL,
+	.set_option		= ioda_eeh_set_option,
 	.get_state		= NULL,
 	.reset			= NULL,
 	.get_log		= NULL,
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 16/22] powerpc/eeh: I/O chip PE reset
From: Gavin Shan @ 2013-03-01 14:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1362147440-14096-1-git-send-email-shangw@linux.vnet.ibm.com>

The patch adds the I/O chip backend to do PE reset. For now, we
focus on PCI bus dependent PE. If PHB PE has been put into error
state, the PHB will take complete reset. Besides, the root bridge
will take fundamental or hot reset accordingly if the indicated
PE locates at the toppest of PCI hierarchy tree. Otherwise, the
upstream p2p bridge will take hot reset.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-ioda.c |  231 ++++++++++++++++++++++++++++-
 1 files changed, 230 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index cf7bb3d..8e29113 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -224,11 +224,240 @@ static int ioda_eeh_get_state(struct eeh_pe *pe, int *state)
 	return result;
 }
 
+static int ioda_eeh_pe_clear(struct eeh_pe *pe)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	u32 pe_no;
+	u8 fstate;
+	u16 pcierr;
+	s64 ret;
+
+	pe_no = pe->addr;
+	hose = pe->phb;
+	phb = pe->phb->private_data;
+
+	/* Clear the EEH error on the PE */
+	ret = opal_pci_eeh_freeze_clear(phb->opal_id,
+			pe_no, OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+	if (ret) {
+		pr_err("%s: Failed to clear EEH error for "
+		       "PHB#%x-PE#%x, err=%lld\n",
+			__func__, hose->global_number, pe_no, ret);
+		return -EIO;
+	}
+
+	/*
+	 * Read the PE state back and verify that the frozen
+	 * state has been removed.
+	 */
+	ret = opal_pci_eeh_freeze_status(phb->opal_id, pe_no,
+			&fstate, &pcierr, NULL);
+	if (ret) {
+		pr_err("%s: Failed to get EEH status on "
+		       "PHB#%x-PE#%x\n, err=%lld\n",
+			__func__, hose->global_number, pe_no, ret);
+		return -EIO;
+	}
+	if (fstate != OPAL_EEH_STOPPED_NOT_FROZEN) {
+		pr_err("%s: Frozen state not cleared on "
+		       "PHB#%x-PE#%x, sts=%x\n",
+			__func__, hose->global_number, pe_no, fstate);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static s64 ioda_eeh_phb_poll(struct pnv_phb *phb)
+{
+	s64 rc = OPAL_HARDWARE;
+
+	while (1) {
+		rc = opal_pci_poll(phb->opal_id);
+		if (rc <= 0)
+			break;
+	}
+
+	return rc;
+}
+
+static int ioda_eeh_phb_reset(struct pci_controller *hose, int option)
+{
+	struct pnv_phb *phb = hose->private_data;
+	s64 rc = OPAL_HARDWARE;
+
+	pr_debug("%s: Reset PHB#%x, option=%d\n",
+		__func__, hose->global_number, option);
+
+	/* Issue PHB complete reset request */
+	if (option == EEH_RESET_FUNDAMENTAL ||
+	    option == EEH_RESET_HOT)
+		rc = opal_pci_reset(phb->opal_id,
+				OPAL_PHB_COMPLETE,
+				OPAL_ASSERT_RESET);
+	else if (option == EEH_RESET_DEACTIVATE)
+		rc = opal_pci_reset(phb->opal_id,
+				OPAL_PHB_COMPLETE,
+				OPAL_DEASSERT_RESET);
+	if (rc < 0)
+		goto out;
+
+	/*
+	 * Poll state of the PHB until the request is done
+	 * successfully.
+	 */
+	rc = ioda_eeh_phb_poll(phb);
+out:
+	if (rc != OPAL_SUCCESS)
+		return -EIO;
+
+	return 0;
+}
+
+static int ioda_eeh_root_reset(struct pci_controller *hose, int option)
+{
+	struct pnv_phb *phb = hose->private_data;
+	s64 rc = OPAL_SUCCESS;
+
+	pr_debug("%s: Reset PHB#%x, option=%d\n",
+		__func__, hose->global_number, option);
+
+	/*
+	 * During the reset deassert time, we needn't care
+	 * the reset scope because the firmware does nothing
+	 * for fundamental or hot reset during deassert phase.
+	 */
+	if (option == EEH_RESET_FUNDAMENTAL)
+		rc = opal_pci_reset(phb->opal_id,
+				OPAL_PCI_FUNDAMENTAL_RESET,
+				OPAL_ASSERT_RESET);
+	else if (option == EEH_RESET_HOT)
+		rc = opal_pci_reset(phb->opal_id,
+				OPAL_PCI_HOT_RESET,
+				OPAL_ASSERT_RESET);
+	else if (option == EEH_RESET_DEACTIVATE)
+		rc = opal_pci_reset(phb->opal_id,
+				OPAL_PCI_HOT_RESET,
+				OPAL_DEASSERT_RESET);
+	if (rc < 0)
+		goto out;
+
+	/* Poll state of the PHB until the request is done */
+	rc = ioda_eeh_phb_poll(phb);
+out:
+	if (rc != OPAL_SUCCESS)
+		return -EIO;
+
+	return 0;
+}
+
+static int ioda_eeh_bridge_reset(struct pci_controller *hose,
+		struct pci_dev *dev, int option)
+{
+	u16 ctrl;
+
+	pr_debug("%s: Reset device %04x:%02x:%02x.%01x with option %d\n",
+		__func__, hose->global_number, dev->bus->number,
+		PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), option);
+
+	switch (option) {
+	case EEH_RESET_FUNDAMENTAL:
+	case EEH_RESET_HOT:
+		pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+		ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+		pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+		break;
+	case EEH_RESET_DEACTIVATE:
+		pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+		ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+		pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+		break;
+	}
+
+	return 0;
+}
+
+/**
+ * ioda_eeh_reset - Reset the indicated PE
+ * @pe: EEH PE
+ * @option: reset option
+ *
+ * Do reset on the indicated PE. For PCI bus sensitive PE,
+ * we need to reset the parent p2p bridge. The PHB has to
+ * be reinitialized if the p2p bridge is root bridge. For
+ * PCI device sensitive PE, we will try to reset the device
+ * through FLR. For now, we don't have OPAL APIs to do HARD
+ * reset yet, so all reset would be SOFT (HOT) reset.
+ */
+static int ioda_eeh_reset(struct eeh_pe *pe, int option)
+{
+	struct pci_controller *hose = pe->phb;
+	struct eeh_dev *edev;
+	struct pci_dev *dev;
+	int ret;
+
+	/*
+	 * Anyway, we have to clear the problematic state for the
+	 * corresponding PE. However, we needn't do it if the PE
+	 * is PHB associated. That means the PHB is having fatal
+	 * errors and it needs reset. Further more, the AIB interface
+	 * isn't reliable any more.
+	 */
+	if (!(pe->type & EEH_PE_PHB) &&
+	    (option == EEH_RESET_HOT ||
+	    option == EEH_RESET_FUNDAMENTAL)) {
+		ret = ioda_eeh_pe_clear(pe);
+		if (ret)
+			return -EIO;
+	}
+
+	/*
+	 * The rules applied to reset, either fundamental or hot reset:
+	 *
+	 * We always reset the direct upstream bridge of the PE. If the
+	 * direct upstream bridge isn't root bridge, we always take hot
+	 * reset no matter what option (fundamental or hot) is. Otherwise,
+	 * we should do the reset according to the required option.
+	 */
+	if (pe->type & EEH_PE_PHB) {
+		ret = ioda_eeh_phb_reset(hose, option);
+	} else {
+		if (pe->type & EEH_PE_DEVICE) {
+			/*
+			 * If it's device PE, we didn't refer to the parent
+			 * PCI bus yet. So we have to figure it out indirectly.
+			 */
+			edev = list_first_entry(&pe->edevs,
+					struct eeh_dev, list);
+			dev = eeh_dev_to_pci_dev(edev);
+			dev = dev->bus->self;
+		} else {
+			/*
+			 * If it's bus PE, the parent PCI bus is already there
+			 * and just pick it up.
+			 */
+			dev = pe->bus->self;
+		}
+
+		/*
+		 * Do reset based on the fact that the direct upstream bridge
+		 * is root bridge (port) or not.
+		 */
+		if (dev->bus->number == 0)
+			ret = ioda_eeh_root_reset(hose, option);
+		else
+			ret = ioda_eeh_bridge_reset(hose, dev, option);
+	}
+
+	return ret;
+}
+
 struct pnv_eeh_ops ioda_eeh_ops = {
 	.post_init		= ioda_eeh_post_init,
 	.set_option		= ioda_eeh_set_option,
 	.get_state		= ioda_eeh_get_state,
-	.reset			= NULL,
+	.reset			= ioda_eeh_reset,
 	.get_log		= NULL,
 	.configure_bridge	= NULL
 };
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 22/22] powerpc/eeh: Connect EEH error interrupt handle
From: Gavin Shan @ 2013-03-01 14:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1362147440-14096-1-git-send-email-shangw@linux.vnet.ibm.com>

The EEH error interrupts should have been exported by firmware
through device tree. The OS already installed the interrupt
handler (opal_interrupt()) for them. The patch checks if we have
any existing EEH errors and wakes the kernel thread up to process
that if possible.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/opal.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index aaa0dba..8aa99de 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -17,6 +17,8 @@
 #include <linux/interrupt.h>
 #include <asm/opal.h>
 #include <asm/firmware.h>
+#include <asm/io.h>
+#include <asm/eeh.h>
 
 #include "powernv.h"
 
@@ -271,6 +273,10 @@ static irqreturn_t opal_interrupt(int irq, void *data)
 	uint64_t events;
 
 	opal_handle_interrupt(virq_to_hw(irq), &events);
+#ifdef CONFIG_EEH
+	if (events & OPAL_EVENT_PCI_ERROR)
+		pci_err_event();
+#endif
 
 	/* XXX TODO: Do something with the events */
 
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 12/22] powerpc/eeh: EEH backend for P7IOC
From: Gavin Shan @ 2013-03-01 14:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1362147440-14096-1-git-send-email-shangw@linux.vnet.ibm.com>

For EEH on PowerNV platform, the overall architecture is a bit
different from that on pSeries platform. In order to support multiple
I/O chips in future, we split EEH to 3 layers for PowerNV platform:
EEH core, platform layer, I/O layer. It would give EEH implementation
on PowerNV much more flexibility in future.

The patch adds the EEH backend for P7IOC.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/Makefile   |    1 +
 arch/powerpc/platforms/powernv/eeh-ioda.c |   52 +++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci.h      |   22 +++++++++++-
 3 files changed, 74 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-ioda.c

diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index bcc3cb4..09bd0cb 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -3,3 +3,4 @@ obj-y			+= opal-rtc.o opal-nvram.o
 
 obj-$(CONFIG_SMP)	+= smp.o
 obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
+obj-$(CONFIG_EEH)	+= eeh-ioda.o
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
new file mode 100644
index 0000000..ee1b538
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -0,0 +1,52 @@
+/*
+ * The file intends to implement the functions needed by EEH, which is
+ * built on IODA compliant chip. Actually, lots of functions related
+ * to EEH would be built based on the OPAL APIs.
+ *
+ * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2013.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
+ */
+
+#include <linux/bootmem.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+#include <linux/string.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/opal.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/tce.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+struct pnv_eeh_ops ioda_eeh_ops = {
+	.post_init		= NULL,
+	.set_option		= NULL,
+	.get_state		= NULL,
+	.reset			= NULL,
+	.get_log		= NULL,
+	.configure_bridge	= NULL
+};
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 7cfb7c8..f1af391 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -63,15 +63,32 @@ struct pnv_ioda_pe {
 	struct list_head	list;
 };
 
+/* IOC dependent EEH operations */
+#ifdef CONFIG_EEH
+struct pnv_eeh_ops {
+	int (*post_init)(struct pci_controller *hose);
+	int (*set_option)(struct eeh_pe *pe, int option);
+	int (*get_state)(struct eeh_pe *pe, int *state);
+	int (*reset)(struct eeh_pe *pe, int option);
+	int (*get_log)(struct eeh_pe *pe, int severity,
+		       char *drv_log, unsigned long len);
+	int (*configure_bridge)(struct eeh_pe *pe);
+};
+#endif /* CONFIG_EEH */
+
 struct pnv_phb {
 	struct pci_controller	*hose;
 	enum pnv_phb_type	type;
 	enum pnv_phb_model	model;
+	u64			hub_id;
 	u64			opal_id;
 	void __iomem		*regs;
 	int			initialized;
 	spinlock_t		lock;
-
+#ifdef CONFIG_EEH
+	struct pnv_eeh_ops	*eeh_ops;
+	int			eeh_enabled;
+#endif
 #ifdef CONFIG_PCI_MSI
 	unsigned long		*msi_map;
 	unsigned int		msi_base;
@@ -144,6 +161,9 @@ struct pnv_phb {
 };
 
 extern struct pci_ops pnv_pci_ops;
+#ifdef CONFIG_EEH
+extern struct pnv_eeh_ops ioda_eeh_ops;
+#endif
 
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
-- 
1.7.5.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox