LinuxPPC-Dev Archive on lore.kernel.org
* Re: smp: Start up non-boot CPUs asynchronously
From: Srivatsa S. Bhat @ 2012-02-14 19:32 UTC
  To: Arjan van de Ven
  Cc: Stephen Rothwell, mikey, Peter Zijlstra, gregkh, ppc-dev,
	linux-kernel, Milton Miller, Srivatsa Vaddagiri, Andrew Morton,
	H. Peter Anvin, arjanvandeven, Ingo Molnar, Paul E. McKenney,
	Linus Torvalds, Thomas Gleixner
In-Reply-To: <4F3A2DFB.5000209@linux.vnet.ibm.com>

On 02/14/2012 03:18 PM, Srivatsa S. Bhat wrote:

> On 02/14/2012 01:47 PM, Srivatsa S. Bhat wrote:
> 
>> On 01/31/2012 09:54 PM, Arjan van de Ven wrote:
>>
>>> From ee65be59057c920042747d46dc174c5a5a56c744 Mon Sep 17 00:00:00 2001
>>> From: Arjan van de Ven <arjan@linux.intel.com>
>>> Date: Mon, 30 Jan 2012 20:44:51 -0800
>>> Subject: [PATCH] smp: Start up non-boot CPUs asynchronously
>>>
>>> The starting of the "not first" CPUs actually takes a lot of boot time
>>> of the kernel... upto "minutes" on some of the bigger SGI boxes.
>>> Right now, this is a fully sequential operation with the rest of the kernel
>>> boot.
>>>
>>> This patch turns this bringup of the other cpus into an asynchronous operation.
>>> With some other changes (not in this patch) this can save significant kernel
>>> boot time (upto 40% on my laptop!!).
>>> Basically now CPUs could get brought up in parallel to disk enumeration, graphic
>>> mode bringup etc etc etc.
>>>
>>> Note that the implementation in this patch still waits for all CPUs to
>>> be brought up before starting userspace; I would love to remove that
>>> restriction over time (technically that is simple), but that becomes
>>> then a change in behavior... I'd like to see more discussion on that
>>> being a good idea before I write that patch.
>>>
>>> Second note on version 2 of the patch:
>>> This patch does currently not save any boot time, due to a situation
>>> where the cpu hotplug lock gets taken for write by the cpu bringup code,
>>> which starves out readers of this lock throughout the kernel.
>>> Ingo specifically requested this behavior to expose this lock problem.
>>>
>>> CC: Milton Miller <miltonm@bga.com>
>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>> CC: Ingo Molnar <mingo@elte.hu>
>>>
>>> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
>>> ---
>>>  kernel/smp.c |   21 ++++++++++++++++++++-
>>>  1 files changed, 20 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/kernel/smp.c b/kernel/smp.c
>>> index db197d6..ea48418 100644
>>> --- a/kernel/smp.c
>>> +++ b/kernel/smp.c
>>> @@ -12,6 +12,8 @@
>>>  #include <linux/gfp.h>
>>>  #include <linux/smp.h>
>>>  #include <linux/cpu.h>
>>> +#include <linux/async.h>
>>> +#include <linux/delay.h>
>>>
>>>  #ifdef CONFIG_USE_GENERIC_SMP_HELPERS
>>>  static struct {
>>> @@ -664,17 +666,34 @@ void __init setup_nr_cpu_ids(void)
>>>  	nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
>>>  }
>>>
>>> +void __init async_cpu_up(void *data, async_cookie_t cookie)
>>> +{
>>> +	unsigned long nr = (unsigned long) data;
>>> +	/*
>>> +	 * we can only up one cpu at a time, as enforced by the hotplug
>>> +	 * lock; it's better to wait for all earlier CPUs to be done before
>>> +	 * we bring up ours, so that the bring up order is predictable.
>>> +	 */
>>> +	async_synchronize_cookie(cookie);
>>> +	cpu_up(nr);
>>> +}
>>> +
>>>  /* Called by boot processor to activate the rest. */
>>>  void __init smp_init(void)
>>>  {
>>>  	unsigned int cpu;
>>>
>>>  	/* FIXME: This should be done in userspace --RR */
>>> +
>>> +	/*
>>> +	 * But until we do this in userspace, we're going to do this
>>> +	 * in parallel to the rest of the kernel boot up.-- Arjan
>>> +	 */
>>>  	for_each_present_cpu(cpu) {
>>>  		if (num_online_cpus() >= setup_max_cpus)
>>>  			break;
>>>  		if (!cpu_online(cpu))
>>> -			cpu_up(cpu);
>>> +			async_schedule(async_cpu_up, (void *) cpu);
>>>  	}
>>>
>>>  	/* Any cleanup work */
>>
>>
>> If I understand correctly, with this patch, the booting of non-boot CPUs
>> will happen in parallel with the rest of the kernel boot, but bringing up
>> of individual CPU is still serialized (due to hotplug lock).
>>
>> If that is correct, I see several issues with this patch:
>>
>> 1. In smp_init(), after the comment "Any cleanup work" (see above), we print:
>> printk(KERN_INFO "Brought up %ld CPUs\n", (long)num_online_cpus());
>> So this can potentially print less than expected number of CPUs and might
>> surprise users.
>>
>> 2. Just below that we have smp_cpus_done(setup_max_cpus); and this translates
>> to native_smp_cpus_done() under x86, which calls impress_friends().
>> And that means, the bogosum calculation and the total activated processor
>> count which is printed, may get messed up.
>>
>> 3. sched_init_smp() is called immediately after smp_init(). And that calls
>> init_sched_domains(cpu_active_mask). Of course, it registers a hotplug
>> notifier callback to handle hot-added cpus.. but with this patch, boot up can
>> actually become unnecessarily slow at this point - what could have been done
>> in one go with an appropriately filled up cpu_active_mask, needs to be done
>> again and again using notifier callbacks. IOW, building sched domains can
>> potentially become a bottleneck, especially if there are lots and lots of
>> cpus in the machine.
>>
>> 4. There is an unhandled race condition (tiny window) in sched_init_smp():
>>
>> get_online_cpus();
>> ...
>> init_sched_domains(cpu_active_mask);
>> ...
>> put_online_cpus();
>>                                            <<<<<<<<<<<<<<<<<<<<<<<< There!
>>
>> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>
>> At the point shown above, some non-boot cpus can get booted up, without
>> being noticed by the scheduler.
>>
>> 5. And in powerpc, it creates a new race condition, as explained in
>> https://lkml.org/lkml/2012/2/13/383
>> (Of course, we can fix it trivially by using get/put_online_cpus().)
>>
> 
> 
> Actually, this one is trickier than that, to get it perfectly right.
> [see point 8 below].
> 
> 6. I also observed that in powerpc, a distinction is made implicitly between
> a cpu booting for the first time vs a soft CPU online event. That is, for
> freshly booted cpus, the following 3 functions are called:
> (Refer arch/powerpc/kernel/sysfs.c: topology_init)
> 
> 	register_cpu(c, cpu);
> 	device_create_file(&c->dev, &dev_attr_physical_id);
> 	register_cpu_online(cpu);
> 
> However, for a soft CPU Online event, only the last function is called.
> (And that looks correct because it matches properly with what is done
> upon CPU offline - only unregister_cpu_online() is called).
> 
> IOW, with this patch it becomes necessary to carefully examine all code
> with such implicit assumptions and modify them to handle the async boot up
> properly.
> 
> 7. And whichever code between smp_init() and async_synchronize_full() didn't
> care about CPU hotplug till today but depended on all cpus being online must
> suddenly start worrying about CPU Hotplug. They must register a cpu notifier
> and handle callbacks etc etc.. Or if they are not worth that complexity, they
> should atleast be redesigned or moved around - like the print statements that
> tell how many cpus came up, for example.
> 
> 8. And we should provide a way in which a piece of code can easily "catch" all
> CPU_ONLINE/UP_PREPARE events without missing any of them due to race
> conditions. Of course register_cpu_notifier() and friends are provided for
> that purpose, but they can't be used as it is in this boot up code..
> And calling register_cpu_notifier() within get/put_online_cpus() would be a
> disaster since that could lead to ABBA deadlock between cpu_add_remove_lock
> and cpu_hotplug.lock
> 


9. With this patch, the second statement below in Documentation/cpu-hotplug.txt
won't be true anymore:

Init functions could be of two types:
1. early init (init function called when only the boot processor is online).
2. late init (init function called _after_ all the CPUs are online).

And hence, those parts of the code which depend on this will have to be revisited.

10. Further down, in Documentation/cpu-hotplug.txt, we see:
(referring to early init as first case and late init as second case)

"For the first case, you should add the following to your init function

        register_cpu_notifier(&foobar_cpu_notifier);

For the second case, you should add the following to your init function

        register_hotcpu_notifier(&foobar_cpu_notifier); "

And as of now, hotcpu notifiers are nothing but regular cpu notifiers.
I wonder why we even have something called hotcpu notifiers, when they do
nothing different.. rather, the distinction between "hotcpu add" vs "just
normal booting" was implicitly handled by choosing when we register our
callbacks:
register at early init => "normal booting" + "hotcpu, including soft online"
register at late init => "hotcpu, including soft online"

So, earlier we had some control over which CPU hotplug events we wanted to be
notified of, by choosing when we register the notifiers. But with this patch,
"careful placement" of our callback registration doesn't make any difference
anymore because late initcalls could run in parallel with smp boot up...

The point I am making is, what was already bad with respect to callback
registration, is made even worse by this patch.
(Btw, this issue is in the light of point 6 above).
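
To make the registration-timing argument in point 10 concrete, here is a
small userspace model (a sketch only, not kernel code). All names in it
(cpu_notifier_model, model_register(), model_cpu_online(), run_boot()) are
hypothetical and merely mimic the idea of a notifier chain; register_after
stands for how many secondary CPUs are already online when the late init
function gets to register its callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum { MAX_NOTIFIERS = 4 };

struct cpu_notifier_model {
	int seen;	/* "CPU online" events delivered to this notifier */
};

static struct cpu_notifier_model *chain[MAX_NOTIFIERS];
static size_t chain_len;

/* Add a notifier to the chain; it only sees events raised from now on. */
static void model_register(struct cpu_notifier_model *nb)
{
	assert(chain_len < MAX_NOTIFIERS);
	nb->seen = 0;
	chain[chain_len++] = nb;
}

/* One secondary CPU comes online: notify everyone registered so far. */
static void model_cpu_online(void)
{
	for (size_t i = 0; i < chain_len; i++)
		chain[i]->seen++;
}

/*
 * Boot nr_secondary CPUs. The "early" notifier registers before any of
 * them; the "late" one registers once register_after CPUs are online.
 */
static void run_boot(int nr_secondary, int register_after,
		     struct cpu_notifier_model *early,
		     struct cpu_notifier_model *late)
{
	int registered_late = 0;

	chain_len = 0;
	model_register(early);
	for (int cpu = 0; cpu < nr_secondary; cpu++) {
		if (cpu == register_after) {
			model_register(late);
			registered_late = 1;
		}
		model_cpu_online();
	}
	if (!registered_late)
		model_register(late);
}
```

With register_after equal to nr_secondary (the synchronous boot we have
today), the late notifier sees no boot-time events at all. With any smaller
value it sees some tail of them, and which tail depends purely on timing.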

>> There could be many more things that this patch breaks.. I haven't checked
>> thoroughly.
>>


 
Regards,
Srivatsa S. Bhat


* Re: smp: Start up non-boot CPUs asynchronously
From: Srivatsa S. Bhat @ 2012-02-14 19:57 UTC
  To: Arjan van de Ven
  Cc: Stephen Rothwell, mikey, Peter Zijlstra, gregkh, Ingo Molnar,
	linux-kernel, Milton Miller, Srivatsa Vaddagiri, Linus Torvalds,
	Arjan van de Ven, H. Peter Anvin, Thomas Gleixner,
	Paul E. McKenney, ppc-dev, Andrew Morton, Arjan van de Ven
In-Reply-To: <CADyApD0o4UYsTkqf2H2yJZ-d05NAyRAEc6z+m1gJEogc=cZLqQ@mail.gmail.com>

[Small note, it appears as if the last 2 of your replies to this thread
 didn't reach LKML.]

On 02/14/2012 08:01 PM, Arjan van de Ven wrote:

> one comment; will comment more when I get to work
> 
> On Tue, Feb 14, 2012 at 1:48 AM, Srivatsa S. Bhat 
> 
>     7. And whichever code between smp_init() and async_synchronize_full()
>     didn't care about CPU hotplug till today but depended on all cpus
>     being online must suddenly start worrying about CPU Hotplug. They
>     must register a cpu notifier and handle callbacks etc etc.. Or if
>     they are not worth that complexity, they should at least be
>     redesigned or moved around - like the print statements that tell
>     how many cpus came up, for example.
> 
> 
> frankly, such code HAS to worry about cpus going online and offline even
> today; the firmware, at least on X86, can start taking cores
> offline/online once ACPI is initialized....
> (as controlled by a data center manager from outside the box, usually
> done based on thermal or power conditions on a datacenter level).
> Now, no doubt that we have bugs in this space, since this only happened
> very rarely before.
> 
> Question is what to do from a longer term strategy:
> Either we declare the number of online CPUs invariant during a certain
> phase of the boot (and make ACPI and co honor this as well somehow)
> or
> We decide to go about fixing these (maybe with the help of lockdep?)
> 
> In addition to this, the reality is that the whole "bring cpus up"
> sequence needs to be changed; the current one is very messy and requires
> the hotplug lock for the whole bring up of each individual cpu... which
> is a very unfortunate design; a much better design would be to only take
> the lock for the actual registration of the newly brought up CPU to the
> kernel, while running the physical bringup without the global lock.
> If/when that change gets made, we can do the physical bring up in
> parallel (with each other, but also with the rest of the kernel boot),
> and do the registration en-mass at some convenient time in the boot,
> potentially late.
> 


Sounds like a good idea, but how will we take care of CPU_UP_PREPARE and
CPU_STARTING callbacks then? Because, CPU_UP_PREPARE callbacks are run
before bringing up the cpu and CPU_STARTING is called from the cpu that is
coming up. Also, CPU_UP_PREPARE callbacks can be failed, which can lead
to that particular cpu boot getting aborted. With the "late commissioning
of CPUs" idea you proposed above, retaining such semantics could become
very challenging.
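
To illustrate the semantics in question, here is a rough userspace sketch
(not kernel code; prepare_fn, model_cpu_up(), allow_all() and veto_cpu2()
are made-up names). It assumes a "prepare" callback returns 0 to allow the
bringup and non-zero to veto it, and that a veto must take effect before
the CPU is touched at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a prepare callback returns 0 to allow, non-zero to veto. */
typedef int (*prepare_fn)(unsigned int cpu);

/*
 * Run every prepare callback first; only if all of them agree is the CPU
 * marked online. A veto aborts the bringup of this one CPU.
 */
static int model_cpu_up(unsigned int cpu, const prepare_fn *callbacks,
			size_t n, unsigned long *online_mask)
{
	assert(cpu < 8 * sizeof(*online_mask));

	for (size_t i = 0; i < n; i++) {
		int err = callbacks[i](cpu);

		if (err)
			return err;	/* aborted before the CPU was started */
	}
	*online_mask |= 1UL << cpu;
	return 0;
}

/* Example callbacks, purely illustrative. */
static int allow_all(unsigned int cpu)
{
	(void)cpu;
	return 0;
}

static int veto_cpu2(unsigned int cpu)
{
	return cpu == 2 ? -1 : 0;
}
```

If the physical bringup already happened earlier and in parallel, there is
no longer a point "before the CPU was started" at which such a veto could
be honoured.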

Regards,
Srivatsa S. Bhat


* Re: smp: Start up non-boot CPUs asynchronously
From: Peter Zijlstra @ 2012-02-14 20:00 UTC
  To: Srivatsa S. Bhat
  Cc: Stephen Rothwell, mikey, gregkh, Ingo Molnar, linux-kernel,
	Milton Miller, Srivatsa Vaddagiri, Linus Torvalds,
	Arjan van de Ven, H. Peter Anvin, Arjan van de Ven,
	Thomas Gleixner, Paul E. McKenney, ppc-dev, Andrew Morton,
	Arjan van de Ven
In-Reply-To: <4F3ABCC1.5020000@linux.vnet.ibm.com>

On Wed, 2012-02-15 at 01:27 +0530, Srivatsa S. Bhat wrote:
> [Small note, it appears as if the last 2 of your replies to this
> thread
>  didn't reach LKML.]

because he used html mail, LKML drops those.. IIRC you can tell K-9 not
to use html cruft, but then I stopped trying to pretend you can email
using phones, it's all too painful.


* Re: smp: Start up non-boot CPUs asynchronously
From: Arjan van de Ven @ 2012-02-14 21:02 UTC
  To: Srivatsa S. Bhat
  Cc: Stephen Rothwell, mikey, Peter Zijlstra, gregkh, Ingo Molnar,
	linux-kernel, Milton Miller, Srivatsa Vaddagiri, Linus Torvalds,
	H. Peter Anvin, Arjan van de Ven, Thomas Gleixner,
	Paul E. McKenney, ppc-dev, Andrew Morton, Arjan van de Ven
In-Reply-To: <4F3ABCC1.5020000@linux.vnet.ibm.com>

On 2/14/2012 11:57 AM, Srivatsa S. Bhat wrote:

>> In addition to this, the reality is that the whole "bring cpus up"
>> sequence needs to be changed; the current one is very messy and requires
>> the hotplug lock for the whole bring up of each individual cpu... which
>> is a very unfortunate design; a much better design would be to only take
>> the lock for the actual registration of the newly brought up CPU to the
>> kernel, while running the physical bringup without the global lock.
>> If/when that change gets made, we can do the physical bring up in
>> parallel (with each other, but also with the rest of the kernel boot),
>> and do the registration en-mass at some convenient time in the boot,
>> potentially late.
>>
> 
> 
> Sounds like a good idea, but how will we take care of CPU_UP_PREPARE and
> CPU_STARTING callbacks then? Because, CPU_UP_PREPARE callbacks are run
> before bringing up the cpu and CPU_STARTING is called from the cpu that is
> coming up. Also, CPU_UP_PREPARE callbacks can be failed, which can lead
> to that particular cpu boot getting aborted. With the "late commissioning
> of CPUs" idea you proposed above, retaining such semantics could become
> very challenging.

some of these callbacks may need to be redesigned as well; or at least,
we may need to decouple the "physical" state of the CPU that's getting
brought up from the "logical" OS visible one.
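
A minimal sketch of that decoupling, as a userspace model rather than
kernel code: the struct and function names below are invented for
illustration, and lock_acquisitions merely stands in for taking the hotplug
lock. Physical bringup only sets a bit in physical_mask; the OS-visible
online_mask is updated later, for all pending CPUs in one go.

```c
#include <assert.h>

/* Hypothetical model, not kernel code. */
struct cpu_model {
	unsigned long physical_mask;	/* CPUs whose hardware bringup finished */
	unsigned long online_mask;	/* CPUs the OS has registered */
	int lock_acquisitions;		/* how often the "hotplug lock" was taken */
};

/* Physical bringup: no global lock, nothing becomes OS visible yet. */
static void physical_bringup(struct cpu_model *m, unsigned int cpu)
{
	assert(cpu < 8 * sizeof(m->physical_mask));
	m->physical_mask |= 1UL << cpu;
}

/*
 * Registration: take the lock once and make every physically-up CPU
 * logically online. Returns how many CPUs were registered.
 */
static unsigned int register_pending_cpus(struct cpu_model *m)
{
	unsigned long pending = m->physical_mask & ~m->online_mask;
	unsigned int count = 0;

	m->lock_acquisitions++;		/* stands in for the hotplug lock */
	m->online_mask |= pending;
	for (; pending; pending &= pending - 1)
		count++;		/* one iteration per set bit */
	return count;
}
```

In this model three CPUs can finish physical bringup while online_mask is
still empty, and a single register_pending_cpus() call then commits all of
them under one lock acquisition.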


* Re: smp: Start up non-boot CPUs asynchronously
From: Benjamin Herrenschmidt @ 2012-02-14 21:28 UTC
  To: Arjan van de Ven
  Cc: Stephen Rothwell, mikey, Peter Zijlstra, gregkh, ppc-dev,
	linux-kernel, Milton Miller, Srivatsa Vaddagiri, Andrew Morton,
	Srivatsa S. Bhat, H. Peter Anvin, arjanvandeven, Ingo Molnar,
	Paul E. McKenney, Linus Torvalds, Thomas Gleixner
In-Reply-To: <4F3A2DFB.5000209@linux.vnet.ibm.com>

On Tue, 2012-02-14 at 15:18 +0530, Srivatsa S. Bhat wrote:

> > 2. Just below that we have smp_cpus_done(setup_max_cpus); and this translates
> > to native_smp_cpus_done() under x86, which calls impress_friends().
> > And that means, the bogosum calculation and the total activated processor
> > count which is printed, may get messed up.

We also have code on powerpc that relies on the bringup having been
completed in smp_cpus_done(), especially on platforms that don't support
CPU hotplug (or fake it using sleep loops).

In some cases we unmap MMIO space or close access to components (i2c for
example) that we use during the bringup for things like hard synchro of
CPU timebases, etc... on some G5s we disable the elastic interface on
the northbridge for CPUs that weren't brought up, that sort of thing...

So this patch will break a LOT of stuff for us, it must at least be a
config option for now, until we find another way to fix these things.

Cheers,
Ben.


* Re: [PATCH v3] powerpc: Rework lazy-interrupt handling
From: Benjamin Herrenschmidt @ 2012-02-15  0:01 UTC
  To: linuxppc-dev
  Cc: Scott Wood, Stuart Yoder, Anton Blanchard, Laurentiu Tudor,
	Paul Mackerras
In-Reply-To: <1329201969.3772.7.camel@pasglop>

On Tue, 2012-02-14 at 17:46 +1100, Benjamin Herrenschmidt wrote:
> The current implementation of lazy interrupts handling has some
> issues that this tries to address.
> 
> Except on iSeries, we don't do the various workarounds we need to
> do on re-enable when returning from an interrupt, which can do an
> implicit re-enable, and thus we may still lose or get delayed
> decrementer or doorbell interrupts.

 .../...

So we were still losing doorbells on BookE due to a st00pid thinko on my
part :-) New patch following, hopefully that one's good !

Cheers,
Ben.


* [PATCH v4] powerpc: Rework lazy-interrupt handling
From: Benjamin Herrenschmidt @ 2012-02-15  0:02 UTC
  To: linuxppc-dev
  Cc: Scott Wood, Stuart Yoder, Anton Blanchard, Laurentiu Tudor,
	Paul Mackerras

The current implementation of lazy interrupts handling has some
issues that this tries to address.

Except on iSeries, we don't do the various workarounds we need to
do on re-enable when returning from an interrupt, which can do an
implicit re-enable, and thus we may still lose or get delayed
decrementer or doorbell interrupts.

The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.

We also hard mask on decrementer interrupts which is sub-optimal.

This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling.

The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.

When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field. We then implement re-emitting of the lost interrupts via either
a re-use of the existing exception frame (exception exit case) or via
the creation of a new one from assembly code (arch_local_irq_enable),
without the need to trigger a fake one using set_dec() or similar.
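
As a rough illustration of the scheme just described, here is a userspace
model in C (a sketch only; the real work is done in assembly on exception
frames). The struct and the model_*() functions are hypothetical. The flag
values follow the BookS convention stated in the patch, where the vector
shifted right by 8 is ORed into irq_happened (0x500 gives 0x05, 0x900
gives 0x09).

```c
#include <assert.h>

#define HAPPENED	0x01	/* "something happened" */
#define HAPPENED_EE	0x04	/* external interrupt, vector 0x500 */
#define HAPPENED_DEC	0x08	/* decrementer, vector 0x900 */

/* Hypothetical stand-in for the relevant per-CPU (PACA) fields. */
struct paca_model {
	unsigned char soft_enabled;
	unsigned char irq_happened;
	unsigned int handled_ee;	/* external interrupts handled */
	unsigned int handled_dec;	/* decrementer interrupts handled */
};

static void model_handle(struct paca_model *p, unsigned int vector)
{
	if (vector == 0x500)
		p->handled_ee++;
	else if (vector == 0x900)
		p->handled_dec++;
}

/* An interrupt arrives: handle it, or just record it if soft-disabled. */
static void model_interrupt(struct paca_model *p, unsigned int vector)
{
	assert(vector == 0x500 || vector == 0x900);
	if (p->soft_enabled) {
		model_handle(p, vector);
		return;
	}
	p->irq_happened |= (unsigned char)(vector >> 8);
}

/* Restore the soft-enable state, replaying whatever was recorded. */
static void model_irq_restore(struct paca_model *p, unsigned char enable)
{
	if (!enable) {
		p->soft_enabled = 0;
		return;
	}

	unsigned char happened = p->irq_happened;

	p->irq_happened = 0;
	/* Replay while still soft-disabled, then soft-enable. */
	if (happened & HAPPENED_EE)
		model_handle(p, 0x500);
	if (happened & HAPPENED_DEC)
		model_handle(p, 0x900);
	p->soft_enabled = 1;
}
```

In this model, a decrementer and an external interrupt taken while
soft-disabled leave irq_happened at 0x0d, and the next enabling restore
handles each of them exactly once.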

In addition, this adds a few refinements:

 - We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.

 - Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.

 - On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appears to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter).

There are additional refinements that we can do on top of this patch:

 - We could remove the ps3 workaround from arch_local_irq_enable(),
I believe that it should no longer be necessary

 - We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.

 - There are additional simplifications of the exception entry/exit path
that I've spotted along the way, such as merging fast_exception_return
with the normal code path.

This patch needs a LOT more testing & review than it had so far !!!

Not-signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

v2:

- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
  to retrigger an interrupt without preventing hard-enable

v3:

 - Fix or vs. ori bug on Book3E
 - Fix enabling of interrupts for some exceptions on Book3E

v4:

 - Fix resend of doorbells on return from interrupt on Book3E
---
 arch/powerpc/include/asm/exception-64s.h        |   21 ++-
 arch/powerpc/include/asm/hw_irq.h               |   51 +++++-
 arch/powerpc/include/asm/irqflags.h             |   13 +-
 arch/powerpc/include/asm/paca.h                 |    2 +-
 arch/powerpc/kernel/asm-offsets.c               |    2 +-
 arch/powerpc/kernel/dbell.c                     |   12 ++
 arch/powerpc/kernel/entry_64.S                  |  106 +++++++-----
 arch/powerpc/kernel/exceptions-64e.S            |  209 ++++++++++++++++-------
 arch/powerpc/kernel/exceptions-64s.S            |   90 ++++++----
 arch/powerpc/kernel/head_64.S                   |    9 -
 arch/powerpc/kernel/idle_book3e.S               |    8 +-
 arch/powerpc/kernel/idle_power4.S               |   17 ++-
 arch/powerpc/kernel/idle_power7.S               |   20 ++-
 arch/powerpc/kernel/irq.c                       |  187 ++++++++++++++------
 arch/powerpc/kernel/time.c                      |   15 ++-
 arch/powerpc/platforms/iseries/Makefile         |    2 +-
 arch/powerpc/platforms/iseries/exception.S      |   11 +-
 arch/powerpc/platforms/iseries/misc.S           |   26 ---
 arch/powerpc/platforms/pseries/processor_idle.c |   24 +++-
 19 files changed, 549 insertions(+), 276 deletions(-)
 delete mode 100644 arch/powerpc/platforms/iseries/misc.S

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 8057f4f..b3f42e9 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -232,23 +232,24 @@ label##_hv:						\
 	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, label##_common,	\
 				 EXC_HV, KVMTEST, vec)
 
-#define __SOFTEN_TEST(h)						\
+#define __SOFTEN_TEST(h, vec)						\
 	lbz	r10,PACASOFTIRQEN(r13);					\
 	cmpwi	r10,0;							\
+	li	r10,(vec)>>8;						\
 	beq	masked_##h##interrupt
-#define _SOFTEN_TEST(h)	__SOFTEN_TEST(h)
+#define _SOFTEN_TEST(h, vec)	__SOFTEN_TEST(h, vec)
 
 #define SOFTEN_TEST_PR(vec)						\
 	KVMTEST_PR(vec);						\
-	_SOFTEN_TEST(EXC_STD)
+	_SOFTEN_TEST(EXC_STD, vec)
 
 #define SOFTEN_TEST_HV(vec)						\
 	KVMTEST(vec);							\
-	_SOFTEN_TEST(EXC_HV)
+	_SOFTEN_TEST(EXC_HV, vec)
 
 #define SOFTEN_TEST_HV_201(vec)						\
 	KVMTEST(vec);							\
-	_SOFTEN_TEST(EXC_STD)
+	_SOFTEN_TEST(EXC_STD, vec)
 
 #define __MASKABLE_EXCEPTION_PSERIES(vec, label, h, extra)		\
 	HMT_MEDIUM;							\
@@ -276,9 +277,9 @@ label##_hv:								\
 #define DISABLE_INTS				\
 	li	r11,0;				\
 	stb	r11,PACASOFTIRQEN(r13);		\
-BEGIN_FW_FTR_SECTION;				\
-	stb	r11,PACAHARDIRQEN(r13);		\
-END_FW_FTR_SECTION_IFCLR(FW_FEATURE_ISERIES);	\
+	lbz	r11,PACAIRQHAPPENED(r13);	\
+	ori	r11,r11,PACA_HAPPENED;		\
+	stb	r11,PACAIRQHAPPENED(r13);	\
 	TRACE_DISABLE_INTS;			\
 BEGIN_FW_FTR_SECTION;				\
 	mfmsr	r10;				\
@@ -289,7 +290,9 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 #define DISABLE_INTS				\
 	li	r11,0;				\
 	stb	r11,PACASOFTIRQEN(r13);		\
-	stb	r11,PACAHARDIRQEN(r13);		\
+	lbz	r11,PACAIRQHAPPENED(r13);	\
+	ori	r11,r11,PACA_HAPPENED;		\
+	stb	r11,PACAIRQHAPPENED(r13);	\
 	TRACE_DISABLE_INTS
 #endif /* CONFIG_PPC_ISERIES */
 
diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index bb712c9..bd33843 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -11,11 +11,50 @@
 #include <asm/ptrace.h>
 #include <asm/processor.h>
 
+#ifdef CONFIG_PPC64
+
+/*
+ * PACA flags in paca->irq_happened. On BookS these flags are set
+ * by oring in the interrupt vector shifted right by 8, so what
+ * we actually have in there is:
+ *
+ * EE  :  0x50x >> 8 = 0x05
+ * DEC :  0x90x >> 8 = 0x09
+ *
+ * The bits we test are thus 0x4 and 0x8 respectively, with bit
+ * 0x1 always set when "something happened".
+ *
+ * On BookE, we just arbitrarily use the values defined below.
+ *
+ * Note: That "something happened" bit is important as we set it
+ * when manually hard-disabling, for example in the exception
+ * entry path.
+ *
+ * This allows a subsequent arch_local_irq_restore() to "know"
+ * that it can't just return and has to actually hard enable.
+ *
+ * The PACA_HAPPENED_LEVEL mask is a bit mask of what values
+ * can correspond to a "level" sensitive interrupt, ie, for
+ * such values, we must not hard-enable in timer_interrupt
+ * do_IRQ or doorbell interrupts if one of these bits is set
+ */
+#define PACA_HAPPENED		0x01
+#define PACA_HAPPENED_DBELL	0x02
+#define PACA_HAPPENED_EE	0x04
+#define PACA_HAPPENED_DEC	0x08 /* Or FIT */
+#define PACA_HAPPENED_EE_EDGE	0x10 /* BookE only */
+
+#endif /* CONFIG_PPC64 */
+
+#ifndef __ASSEMBLY__
+
 extern void timer_interrupt(struct pt_regs *);
 
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
 
+extern void __reemit_interrupt(unsigned int vector);
+
 static inline unsigned long arch_local_save_flags(void)
 {
 	unsigned long flags;
@@ -42,7 +81,6 @@ static inline unsigned long arch_local_irq_disable(void)
 }
 
 extern void arch_local_irq_restore(unsigned long);
-extern void iseries_handle_interrupts(void);
 
 static inline void arch_local_irq_enable(void)
 {
@@ -72,11 +110,11 @@ static inline bool arch_irqs_disabled(void)
 #define __hard_irq_disable()	__mtmsrd(mfmsr() & ~MSR_EE, 1)
 #endif
 
-#define  hard_irq_disable()			\
-	do {					\
-		__hard_irq_disable();		\
-		get_paca()->soft_enabled = 0;	\
-		get_paca()->hard_enabled = 0;	\
+#define  hard_irq_disable()					\
+	do {							\
+		__hard_irq_disable();				\
+		get_paca()->soft_enabled = 0;			\
+		get_paca()->irq_happened |= PACA_HAPPENED;	\
 	} while(0)
 
 #else /* CONFIG_PPC64 */
@@ -149,5 +187,6 @@ static inline bool arch_irqs_disabled(void)
  */
 struct irq_chip;
 
+#endif  /* __ASSEMBLY__ */
 #endif	/* __KERNEL__ */
 #endif	/* _ASM_POWERPC_HW_IRQ_H */
diff --git a/arch/powerpc/include/asm/irqflags.h b/arch/powerpc/include/asm/irqflags.h
index b0b06d8..4bfbf0a 100644
--- a/arch/powerpc/include/asm/irqflags.h
+++ b/arch/powerpc/include/asm/irqflags.h
@@ -47,16 +47,15 @@
 	b	skip;					\
 95:	TRACE_WITH_FRAME_BUFFER(.trace_hardirqs_on)	\
 	li	en,1;
-#define TRACE_AND_RESTORE_IRQ(en)		\
-	TRACE_AND_RESTORE_IRQ_PARTIAL(en,96f);	\
-	stb	en,PACASOFTIRQEN(r13);		\
-96:
 #else
 #define TRACE_ENABLE_INTS
 #define TRACE_DISABLE_INTS
-#define TRACE_AND_RESTORE_IRQ_PARTIAL(en,skip)
-#define TRACE_AND_RESTORE_IRQ(en)		\
-	stb	en,PACASOFTIRQEN(r13)
+#define TRACE_AND_RESTORE_IRQ_PARTIAL(en,skip)		\
+	cmpdi	en,0;					\
+	bne	95f;					\
+	stb	en,PACASOFTIRQEN(r13);			\
+	b	skip;					\
+95:
 #endif
 #endif
 
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 269c05a..daf813f 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -132,7 +132,7 @@ struct paca_struct {
 	u64 saved_msr;			/* MSR saved here by enter_rtas */
 	u16 trap_save;			/* Used when bad stack is encountered */
 	u8 soft_enabled;		/* irq soft-enable flag */
-	u8 hard_enabled;		/* set if irqs are enabled in MSR */
+	u8 irq_happened;		/* irq happened while soft-disabled */
 	u8 io_sync;			/* writel() needs spin_unlock sync */
 	u8 irq_work_pending;		/* IRQ_WORK interrupt while soft-disable */
 	u8 nap_state_lost;		/* NV GPR values lost in power7_idle */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 04caee7..cdd0d26 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -147,7 +147,7 @@ int main(void)
 	DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase));
 	DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
 	DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
-	DEFINE(PACAHARDIRQEN, offsetof(struct paca_struct, hard_enabled));
+	DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened));
 	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
 #ifdef CONFIG_PPC_MM_SLICES
 	DEFINE(PACALOWSLICESPSIZE, offsetof(struct paca_struct,
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index 2cc451a..16f0e5e 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -37,6 +37,18 @@ void doorbell_exception(struct pt_regs *regs)
 
 	irq_enter();
 
+#ifdef CONFIG_PPC64
+	/* Let's hard enable interrupts now that we have reset
+	 * the DEC (or acked it on BookE)
+	 *
+	 * We skip that if there's a pending EE "level" interrupt
+	 * as an optimization
+	 */
+	get_paca()->irq_happened &= ~PACA_HAPPENED;
+	if (!(get_paca()->irq_happened & PACA_HAPPENED_EE))
+		__hard_irq_enable();
+#endif /* CONFIG_PPC64 */
+
 	smp_ipi_demux();
 
 	irq_exit();
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d834425..3cc258b 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -31,6 +31,7 @@
 #include <asm/bug.h>
 #include <asm/ptrace.h>
 #include <asm/irqflags.h>
+#include <asm/hw_irq.h>
 #include <asm/ftrace.h>
 
 /*
@@ -125,19 +126,7 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 #endif /* CONFIG_TRACE_IRQFLAGS */
 	li	r10,1
 	stb	r10,PACASOFTIRQEN(r13)
-	stb	r10,PACAHARDIRQEN(r13)
 	std	r10,SOFTE(r1)
-#ifdef CONFIG_PPC_ISERIES
-BEGIN_FW_FTR_SECTION
-	/* Hack for handling interrupts when soft-enabling on iSeries */
-	cmpdi	cr1,r0,0x5555		/* syscall 0x5555 */
-	andi.	r10,r12,MSR_PR		/* from kernel */
-	crand	4*cr0+eq,4*cr1+eq,4*cr0+eq
-	bne	2f
-	b	hardware_interrupt_entry
-2:
-END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
-#endif /* CONFIG_PPC_ISERIES */
 
 	/* Hard enable interrupts */
 #ifdef CONFIG_PPC_BOOK3E
@@ -593,23 +582,33 @@ _GLOBAL(ret_from_except_lite)
 	bne	do_work
 #endif
 
+_GLOBAL(fast_exception_return_irq)
 restore:
-BEGIN_FW_FTR_SECTION
 	ld	r5,SOFTE(r1)
-FW_FTR_SECTION_ELSE
-	b	.Liseries_check_pending_irqs
-ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
-2:
-	TRACE_AND_RESTORE_IRQ(r5);
+	TRACE_AND_RESTORE_IRQ_PARTIAL(r5, 3f);
 
-	/* extract EE bit and use it to restore paca->hard_enabled */
-	ld	r3,_MSR(r1)
-	rldicl	r4,r3,49,63		/* r0 = (r3 >> 15) & 1 */
-	stb	r4,PACAHARDIRQEN(r13)
+	/*
+	 * We are about to soft-enable interrupts (we are hard disabled
+	 * at this point). We check if there's anything that needs to
+	 * be replayed first
+	 */
+	lbz	r0,PACAIRQHAPPENED(r13)
+	cmpwi	cr0,r0,0
+	bne-	4f
+
+	/*
+	 * Get here when nothing happened while soft-disabled, just
+	 * soft-enable and move-on. We will hard-enable as a side
+	 * effect of rfi
+	 */
+2:	li	r0,1
+	stb	r0,PACASOFTIRQEN(r13);
+3:
 
 #ifdef CONFIG_PPC_BOOK3E
 	b	.exception_return_book3e
 #else
+	ld	r3,_MSR(r1)
 	ld	r4,_CTR(r1)
 	ld	r0,_LINK(r1)
 	mtctr	r4
@@ -644,7 +643,8 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 
 	/*
 	 * r13 is our per cpu area, only restore it if we are returning to
-	 * userspace
+	 * userspace the value stored in the stack frame may belong to
+	 * another CPU.
 	 */
 	andi.	r0,r3,MSR_PR
 	beq	1f
@@ -669,29 +669,46 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 
 #endif /* CONFIG_PPC_BOOK3E */
 
-.Liseries_check_pending_irqs:
-#ifdef CONFIG_PPC_ISERIES
-	ld	r5,SOFTE(r1)
-	cmpdi	0,r5,0
+	/*
+	 * Something did happen, check if a re-emit is needed
+	 * (this also clears paca->irq_happened)
+	 */
+4:	bl	.__check_irq_reemit
+	cmpwi	cr0,r3,0
 	beq	2b
-	/* Check for pending interrupts (iSeries) */
-	ld	r3,PACALPPACAPTR(r13)
-	ld	r3,LPPACAANYINT(r3)
-	cmpdi	r3,0
-	beq+	2b			/* skip do_IRQ if no interrupts */
 
-	li	r3,0
-	stb	r3,PACASOFTIRQEN(r13)	/* ensure we are soft-disabled */
-#ifdef CONFIG_TRACE_IRQFLAGS
-	bl	.trace_hardirqs_off
-	mfmsr	r10
-#endif
-	ori	r10,r10,MSR_EE
-	mtmsrd	r10			/* hard-enable again */
-	addi	r3,r1,STACK_FRAME_OVERHEAD
+	/*
+	 * We need to re-emit an interrupt. We do so by re-using our
+	 * existing exception frame. We first change the trap value,
+	 * but we need to ensure we preserve the low nibble of it
+	 */
+	ld	r4,_TRAP(r1)
+	clrldi	r4,r4,60
+	or	r4,r4,r3
+	std	r4,_TRAP(r1)
+
+	/*
+	 * Then find the right handler and call it. Interrupts are
+	 * still soft-disabled and we keep them that way.
+	 */
+	cmpwi	cr0,r3,0x500
+	bne	1f
+	addi	r3,r1,STACK_FRAME_OVERHEAD;
 	bl	.do_IRQ
-	b	.ret_from_except_lite		/* loop back and handle more */
-#endif
+	b	.ret_from_except
+1:	cmpwi	cr0,r3,0x900
+	bne	1f
+	addi	r3,r1,STACK_FRAME_OVERHEAD;
+	bl	.timer_interrupt
+	b	.ret_from_except
+#ifdef CONFIG_PPC_BOOK3E
+1:	cmpwi	cr0,r3,0x280
+	bne	1f
+	addi	r3,r1,STACK_FRAME_OVERHEAD;
+	bl	.doorbell_exception
+	b	.ret_from_except
+#endif /* CONFIG_PPC_BOOK3E */
+1:	b	.ret_from_except /* What else to do here ? */
 
 do_work:
 #ifdef CONFIG_PREEMPT
@@ -713,7 +730,6 @@ do_work:
 	 */
 	li	r0,0
 	stb	r0,PACASOFTIRQEN(r13)
-	stb	r0,PACAHARDIRQEN(r13)
 	TRACE_DISABLE_INTS
 
 	/* Call the scheduler with soft IRQs off */
@@ -728,8 +744,6 @@ do_work:
 	rotldi	r10,r10,16
 	mtmsrd	r10,1
 #endif /* CONFIG_PPC_BOOK3E */
-	li	r0,0
-	stb	r0,PACAHARDIRQEN(r13)
 
 	/* Re-test flags and eventually loop */
 	clrrdi	r9,r1,THREAD_SHIFT
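A note on the re-emit path in the restore code earlier in this file: the
existing exception frame is reused, and only the upper bits of the saved
trap value are replaced while the low nibble is kept. A minimal C sketch of
that clrldi/or pair follows. The helper name retarget_trap is hypothetical
and is not part of the patch; it only models the arithmetic.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch). Models the sequence
 *	clrldi	r4,r4,60
 *	or	r4,r4,r3
 * applied to the saved _TRAP word of the exception frame.
 */
static uint64_t retarget_trap(uint64_t saved_trap, uint64_t vector)
{
	/* clrldi ...,60 clears the high 60 bits, keeping the low nibble */
	uint64_t low_nibble = saved_trap & 0xfULL;

	/* or in the vector of the interrupt being re-emitted */
	return low_nibble | vector;
}
```

With these assumptions, a frame saved with trap value 0x901 and re-targeted
to the external interrupt vector 0x500 ends up with 0x501.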
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 429983c..6a19be0 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -21,6 +21,7 @@
 #include <asm/exception-64e.h>
 #include <asm/bug.h>
 #include <asm/irqflags.h>
+#include <asm/hw_irq.h>
 #include <asm/ptrace.h>
 #include <asm/ppc-opcode.h>
 #include <asm/mmu.h>
@@ -77,59 +78,55 @@
 #define SPRN_MC_SRR1	SPRN_MCSRR1
 
 #define NORMAL_EXCEPTION_PROLOG(n, addition)				    \
-	EXCEPTION_PROLOG(n, GEN, addition##_GEN)
+	EXCEPTION_PROLOG(n, GEN, addition##_GEN(n))
 
 #define CRIT_EXCEPTION_PROLOG(n, addition)				    \
-	EXCEPTION_PROLOG(n, CRIT, addition##_CRIT)
+	EXCEPTION_PROLOG(n, CRIT, addition##_CRIT(n))
 
 #define DBG_EXCEPTION_PROLOG(n, addition)				    \
-	EXCEPTION_PROLOG(n, DBG, addition##_DBG)
+	EXCEPTION_PROLOG(n, DBG, addition##_DBG(n))
 
 #define MC_EXCEPTION_PROLOG(n, addition)				    \
-	EXCEPTION_PROLOG(n, MC, addition##_MC)
+	EXCEPTION_PROLOG(n, MC, addition##_MC(n))
 
 
 /* Variants of the "addition" argument for the prolog
  */
-#define PROLOG_ADDITION_NONE_GEN
-#define PROLOG_ADDITION_NONE_CRIT
-#define PROLOG_ADDITION_NONE_DBG
-#define PROLOG_ADDITION_NONE_MC
+#define PROLOG_ADDITION_NONE_GEN(n)
+#define PROLOG_ADDITION_NONE_CRIT(n)
+#define PROLOG_ADDITION_NONE_DBG(n)
+#define PROLOG_ADDITION_NONE_MC(n)
 
-#define PROLOG_ADDITION_MASKABLE_GEN					    \
+#define PROLOG_ADDITION_MASKABLE_GEN(n)					    \
 	lbz	r11,PACASOFTIRQEN(r13); /* are irqs soft-disabled ? */	    \
 	cmpwi	cr0,r11,0;		/* yes -> go out of line */	    \
-	beq	masked_interrupt_book3e;
+	beq	masked_interrupt_book3e_##n;
 
-#define PROLOG_ADDITION_2REGS_GEN					    \
+#define PROLOG_ADDITION_2REGS_GEN(n)					    \
 	std	r14,PACA_EXGEN+EX_R14(r13);				    \
 	std	r15,PACA_EXGEN+EX_R15(r13)
 
-#define PROLOG_ADDITION_1REG_GEN					    \
+#define PROLOG_ADDITION_1REG_GEN(n)					    \
 	std	r14,PACA_EXGEN+EX_R14(r13);
 
-#define PROLOG_ADDITION_2REGS_CRIT					    \
+#define PROLOG_ADDITION_2REGS_CRIT(n)					    \
 	std	r14,PACA_EXCRIT+EX_R14(r13);				    \
 	std	r15,PACA_EXCRIT+EX_R15(r13)
 
-#define PROLOG_ADDITION_2REGS_DBG					    \
+#define PROLOG_ADDITION_2REGS_DBG(n)					    \
 	std	r14,PACA_EXDBG+EX_R14(r13);				    \
 	std	r15,PACA_EXDBG+EX_R15(r13)
 
-#define PROLOG_ADDITION_2REGS_MC					    \
+#define PROLOG_ADDITION_2REGS_MC(n)					    \
 	std	r14,PACA_EXMC+EX_R14(r13);				    \
 	std	r15,PACA_EXMC+EX_R15(r13)
 
-#define PROLOG_ADDITION_DOORBELL_GEN					    \
-	lbz	r11,PACASOFTIRQEN(r13); /* are irqs soft-disabled ? */	    \
-	cmpwi	cr0,r11,0;		/* yes -> go out of line */	    \
-	beq	masked_doorbell_book3e
-
 
 /* Core exception code for all exceptions except TLB misses.
  * XXX: Needs to make SPRN_SPRG_GEN depend on exception type
  */
 #define EXCEPTION_COMMON(n, excf, ints)					    \
+exc_##n##_common:							    \
 	std	r0,GPR0(r1);		/* save r0 in stackframe */	    \
 	std	r2,GPR2(r1);		/* save r2 in stackframe */	    \
 	SAVE_4GPRS(3, r1);		/* save r3 - r6 in stackframe */    \
@@ -167,19 +164,25 @@
 	std	r0,RESULT(r1);		/* clear regs->result */	    \
 	ints;
 
-/* Variants for the "ints" argument */
+/* Variants for the "ints" argument. This one does nothing when we want
+ * to keep interrupts in their original state
+ */
 #define INTS_KEEP
-#define INTS_DISABLE_SOFT						    \
+
+/* This second version is meant for exceptions that don't immediately
+ * hard-enable. We set a bit in paca->irq_happened to ensure that
+ * a subsequent call to arch_local_irq_restore() will properly
+ * hard-enable and avoid the fast-path
+ */
+#define INTS_DISABLE							    \
 	stb	r0,PACASOFTIRQEN(r13);	/* mark interrupts soft-disabled */ \
+	lbz	r0,PACAIRQHAPPENED(r13);				    \
+	ori	r0,r0,PACA_HAPPENED;					    \
+	stb	r0,PACAIRQHAPPENED(r13);				    \
 	TRACE_DISABLE_INTS;
-#define INTS_DISABLE_HARD						    \
-	stb	r0,PACAHARDIRQEN(r13); /* and hard disabled */
-#define INTS_DISABLE_ALL						    \
-	INTS_DISABLE_SOFT						    \
-	INTS_DISABLE_HARD
-
-/* This is called by exceptions that used INTS_KEEP (that is did not clear
- * neither soft nor hard IRQ indicators in the PACA. This will restore MSR:EE
+
+/* This is called by some exceptions that used INTS_KEEP (that is, did
+ * not set hard IRQ indicators in the PACA). This will restore MSR:EE
  * to it's previous value
  *
  * XXX In the long run, we may want to open-code it in order to separate the
@@ -238,7 +241,7 @@ exc_##n##_bad_stack:							    \
 #define MASKABLE_EXCEPTION(trapnum, label, hdlr, ack)			\
 	START_EXCEPTION(label);						\
 	NORMAL_EXCEPTION_PROLOG(trapnum, PROLOG_ADDITION_MASKABLE)	\
-	EXCEPTION_COMMON(trapnum, PACA_EXGEN, INTS_DISABLE_ALL)		\
+	EXCEPTION_COMMON(trapnum, PACA_EXGEN, INTS_DISABLE)		\
 	ack(r8);							\
 	CHECK_NAPPING();						\
 	addi	r3,r1,STACK_FRAME_OVERHEAD;				\
@@ -289,7 +292,7 @@ interrupt_end_book3e:
 /* Critical Input Interrupt */
 	START_EXCEPTION(critical_input);
 	CRIT_EXCEPTION_PROLOG(0x100, PROLOG_ADDITION_NONE)
-//	EXCEPTION_COMMON(0x100, PACA_EXCRIT, INTS_DISABLE_ALL)
+//	EXCEPTION_COMMON(0x100, PACA_EXCRIT, INTS_DISABLE)
 //	bl	special_reg_save_crit
 //	CHECK_NAPPING();
 //	addi	r3,r1,STACK_FRAME_OVERHEAD
@@ -300,7 +303,7 @@ interrupt_end_book3e:
 /* Machine Check Interrupt */
 	START_EXCEPTION(machine_check);
 	CRIT_EXCEPTION_PROLOG(0x200, PROLOG_ADDITION_NONE)
-//	EXCEPTION_COMMON(0x200, PACA_EXMC, INTS_DISABLE_ALL)
+//	EXCEPTION_COMMON(0x200, PACA_EXMC, INTS_DISABLE)
 //	bl	special_reg_save_mc
 //	addi	r3,r1,STACK_FRAME_OVERHEAD
 //	CHECK_NAPPING();
@@ -339,12 +342,11 @@ interrupt_end_book3e:
 	START_EXCEPTION(program);
 	NORMAL_EXCEPTION_PROLOG(0x700, PROLOG_ADDITION_1REG)
 	mfspr	r14,SPRN_ESR
-	EXCEPTION_COMMON(0x700, PACA_EXGEN, INTS_DISABLE_SOFT)
+	EXCEPTION_COMMON(0x700, PACA_EXGEN, INTS_DISABLE)
 	std	r14,_DSISR(r1)
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	ld	r14,PACA_EXGEN+EX_R14(r13)
 	bl	.save_nvgprs
-	INTS_RESTORE_HARD
 	bl	.program_check_exception
 	b	.ret_from_except
 
@@ -372,7 +374,7 @@ interrupt_end_book3e:
 /* Watchdog Timer Interrupt */
 	START_EXCEPTION(watchdog);
 	CRIT_EXCEPTION_PROLOG(0x9f0, PROLOG_ADDITION_NONE)
-//	EXCEPTION_COMMON(0x9f0, PACA_EXCRIT, INTS_DISABLE_ALL)
+//	EXCEPTION_COMMON(0x9f0, PACA_EXCRIT, INTS_DISABLE)
 //	bl	special_reg_save_crit
 //	CHECK_NAPPING();
 //	addi	r3,r1,STACK_FRAME_OVERHEAD
@@ -450,7 +452,7 @@ interrupt_end_book3e:
 	mfspr	r15,SPRN_SPRG_CRIT_SCRATCH
 	mtspr	SPRN_SPRG_GEN_SCRATCH,r15
 	mfspr	r14,SPRN_DBSR
-	EXCEPTION_COMMON(0xd00, PACA_EXCRIT, INTS_DISABLE_ALL)
+	EXCEPTION_COMMON(0xd00, PACA_EXCRIT, INTS_DISABLE)
 	std	r14,_DSISR(r1)
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	mr	r4,r14
@@ -515,7 +517,7 @@ kernel_dbg_exc:
 	mfspr	r15,SPRN_SPRG_DBG_SCRATCH
 	mtspr	SPRN_SPRG_GEN_SCRATCH,r15
 	mfspr	r14,SPRN_DBSR
-	EXCEPTION_COMMON(0xd00, PACA_EXDBG, INTS_DISABLE_ALL)
+	EXCEPTION_COMMON(0xd08, PACA_EXDBG, INTS_DISABLE)
 	std	r14,_DSISR(r1)
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	mr	r4,r14
@@ -525,21 +527,22 @@ kernel_dbg_exc:
 	bl	.DebugException
 	b	.ret_from_except
 
-	MASKABLE_EXCEPTION(0x260, perfmon, .performance_monitor_exception, ACK_NONE)
+	START_EXCEPTION(perfmon);
+	NORMAL_EXCEPTION_PROLOG(0x260, PROLOG_ADDITION_NONE)
+	EXCEPTION_COMMON(0x260, PACA_EXGEN, INTS_DISABLE)
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	ld	r14,PACA_EXGEN+EX_R14(r13)
+	bl	.save_nvgprs
+	bl	.performance_monitor_exception
+	b	.ret_from_except
 
 /* Doorbell interrupt */
-	START_EXCEPTION(doorbell)
-	NORMAL_EXCEPTION_PROLOG(0x2070, PROLOG_ADDITION_DOORBELL)
-	EXCEPTION_COMMON(0x2070, PACA_EXGEN, INTS_DISABLE_ALL)
-	CHECK_NAPPING()
-	addi	r3,r1,STACK_FRAME_OVERHEAD
-	bl	.doorbell_exception
-	b	.ret_from_except_lite
+	MASKABLE_EXCEPTION(0x280, doorbell, .doorbell_exception, ACK_NONE)
 
 /* Doorbell critical Interrupt */
 	START_EXCEPTION(doorbell_crit);
-	CRIT_EXCEPTION_PROLOG(0x2080, PROLOG_ADDITION_NONE)
-//	EXCEPTION_COMMON(0x2080, PACA_EXCRIT, INTS_DISABLE_ALL)
+	CRIT_EXCEPTION_PROLOG(0x2a0, PROLOG_ADDITION_NONE)
+//	EXCEPTION_COMMON(0x280, PACA_EXCRIT, INTS_DISABLE)
 //	bl	special_reg_save_crit
 //	CHECK_NAPPING();
 //	addi	r3,r1,STACK_FRAME_OVERHEAD
@@ -547,38 +550,116 @@ kernel_dbg_exc:
 //	b	ret_from_crit_except
 	b	.
 
+/* Guest Doorbell */
 	MASKABLE_EXCEPTION(0x2c0, guest_doorbell, .unknown_exception, ACK_NONE)
-	MASKABLE_EXCEPTION(0x2e0, guest_doorbell_crit, .unknown_exception, ACK_NONE)
-	MASKABLE_EXCEPTION(0x310, hypercall, .unknown_exception, ACK_NONE)
-	MASKABLE_EXCEPTION(0x320, ehpriv, .unknown_exception, ACK_NONE)
+
+/* Guest Doorbell critical Interrupt */
+	START_EXCEPTION(guest_doorbell_crit);
+	CRIT_EXCEPTION_PROLOG(0x2e0, PROLOG_ADDITION_NONE)
+//	EXCEPTION_COMMON(0x2e0, PACA_EXCRIT, INTS_DISABLE)
+//	bl	special_reg_save_crit
+//	CHECK_NAPPING();
+//	addi	r3,r1,STACK_FRAME_OVERHEAD
+//	bl	.guest_doorbell_critical_exception
+//	b	ret_from_crit_except
+	b	.
+
+/* Hypervisor call */
+	START_EXCEPTION(hypercall);
+	NORMAL_EXCEPTION_PROLOG(0x310, PROLOG_ADDITION_NONE)
+	EXCEPTION_COMMON(0x310, PACA_EXGEN, INTS_KEEP)
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.save_nvgprs
+	INTS_RESTORE_HARD
+	bl	.unknown_exception
+	b	.ret_from_except
+
+/* Embedded Hypervisor privileged */
+	START_EXCEPTION(ehpriv);
+	NORMAL_EXCEPTION_PROLOG(0x320, PROLOG_ADDITION_NONE)
+	EXCEPTION_COMMON(0x320, PACA_EXGEN, INTS_KEEP)
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.save_nvgprs
+	INTS_RESTORE_HARD
+	bl	.unknown_exception
+	b	.ret_from_except
 
 
 /*
- * An interrupt came in while soft-disabled; clear EE in SRR1,
- * clear paca->hard_enabled and return.
+ * An interrupt came in while soft-disabled; we mark paca->irq_happened
+ * accordingly and, if the interrupt is level sensitive, we hard disable.
  */
-masked_doorbell_book3e:
-	mtcr	r10
-	/* Resend the doorbell to fire again when ints enabled */
-	mfspr	r10,SPRN_PIR
-	PPC_MSGSND(r10)
-	b	masked_interrupt_book3e_common
 
-masked_interrupt_book3e:
+masked_interrupt_book3e_0x500:
+	li	r11,PACA_HAPPENED_EE
+	b	masked_interrupt_book3e_full_mask
+
+masked_interrupt_book3e_0x900:
+	ACK_DEC(r11);
+	li	r11,PACA_HAPPENED_DEC
+	b	masked_interrupt_book3e_no_mask
+masked_interrupt_book3e_0x980:
+	ACK_FIT(r11);
+	li	r11,PACA_HAPPENED_DEC
+	b	masked_interrupt_book3e_no_mask
+masked_interrupt_book3e_0x280:
+masked_interrupt_book3e_0x2c0:
+	li	r11,PACA_HAPPENED_DBELL
+	b	masked_interrupt_book3e_no_mask
+
+masked_interrupt_book3e_no_mask:
+	mtcr	r10
+	lbz	r10,PACAIRQHAPPENED(r13)
+	or	r10,r10,r11
+	stb	r10,PACAIRQHAPPENED(r13)
+	b	1f
+masked_interrupt_book3e_full_mask:
 	mtcr	r10
-masked_interrupt_book3e_common:
-	stb	r11,PACAHARDIRQEN(r13)
+	lbz	r10,PACAIRQHAPPENED(r13)
+	or	r10,r10,r11
+	stb	r10,PACAIRQHAPPENED(r13)
 	mfspr	r10,SPRN_SRR1
 	rldicl	r11,r10,48,1		/* clear MSR_EE */
 	rotldi	r10,r11,16
 	mtspr	SPRN_SRR1,r10
-	ld	r10,PACA_EXGEN+EX_R10(r13);	/* restore registers */
+1:	ld	r10,PACA_EXGEN+EX_R10(r13);
 	ld	r11,PACA_EXGEN+EX_R11(r13);
 	mfspr	r13,SPRN_SPRG_GEN_SCRATCH;
 	rfi
 	b	.
 
 /*
+ * Called from arch_local_irq_restore when an interrupt needs
+ * to be resent. r3 contains either 0x500, 0x900 or 0x280
+ * to indicate the kind of interrupt. MSR:EE is already off.
+ * We generate a stackframe as if a real interrupt had happened.
+ *
+ * Note: While MSR:EE is off, we need to make sure that _MSR
+ * in the generated frame has EE set to 1 or the exception
+ * handler will not properly re-enable them.
+ */
+_GLOBAL(__reemit_interrupt)
+	/* We are going to jump to the exception common code which
+	 * will retrieve various register values from the PACA which
+	 * we don't give a damn about.
+	 */
+	mflr	r10
+	mfmsr	r11
+	mfcr	r4
+	mtspr	SPRN_SPRG_GEN_SCRATCH,r13;
+	std	r1,PACA_EXGEN+EX_R1(r13);
+	stw	r4,PACA_EXGEN+EX_CR(r13);
+	ori	r11,r11,MSR_EE
+	subi	r1,r1,INT_FRAME_SIZE;
+	cmpwi	cr0,r3,0x500
+	beq	exc_0x500_common
+	cmpwi	cr0,r3,0x900
+	beq	exc_0x900_common
+	cmpwi	cr0,r3,0x280
+	beq	exc_0x280_common
+	blr
+
+/*
  * This is called from 0x300 and 0x400 handlers after the prologs with
  * r14 and r15 containing the fault address and error code, with the
  * original values stashed away in the PACA
@@ -680,6 +761,8 @@ BAD_STACK_TRAMPOLINE(0x000)
 BAD_STACK_TRAMPOLINE(0x100)
 BAD_STACK_TRAMPOLINE(0x200)
 BAD_STACK_TRAMPOLINE(0x260)
+BAD_STACK_TRAMPOLINE(0x280)
+BAD_STACK_TRAMPOLINE(0x2a0)
 BAD_STACK_TRAMPOLINE(0x2c0)
 BAD_STACK_TRAMPOLINE(0x2e0)
 BAD_STACK_TRAMPOLINE(0x300)
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3844ca7..58cc1ee 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -12,6 +12,7 @@
  *
  */
 
+#include <asm/hw_irq.h>
 #include <asm/exception-64s.h>
 #include <asm/ptrace.h>
 
@@ -356,34 +357,60 @@ do_stab_bolted_pSeries:
 	KVM_HANDLER_PR(PACA_EXGEN, EXC_STD, 0xf40)
 
 /*
- * An interrupt came in while soft-disabled; clear EE in SRR1,
- * clear paca->hard_enabled and return.
+ * An interrupt came in while soft-disabled. We set paca->irq_happened,
+ * then, if it was a decrementer interrupt, we bump the dec to max
+ * and return, else we hard disable and return.
  */
-masked_interrupt:
-	stb	r10,PACAHARDIRQEN(r13)
-	mtcrf	0x80,r9
-	ld	r9,PACA_EXGEN+EX_R9(r13)
-	mfspr	r10,SPRN_SRR1
-	rldicl	r10,r10,48,1		/* clear MSR_EE */
-	rotldi	r10,r10,16
-	mtspr	SPRN_SRR1,r10
-	ld	r10,PACA_EXGEN+EX_R10(r13)
-	GET_SCRATCH0(r13)
-	rfid
-	b	.
 
-masked_Hinterrupt:
-	stb	r10,PACAHARDIRQEN(r13)
-	mtcrf	0x80,r9
-	ld	r9,PACA_EXGEN+EX_R9(r13)
-	mfspr	r10,SPRN_HSRR1
-	rldicl	r10,r10,48,1		/* clear MSR_EE */
-	rotldi	r10,r10,16
-	mtspr	SPRN_HSRR1,r10
-	ld	r10,PACA_EXGEN+EX_R10(r13)
-	GET_SCRATCH0(r13)
-	hrfid
+#define MASKED_INTERRUPT(_H)				\
+masked_##_H##interrupt:					\
+	std	r11,PACA_EXGEN+EX_R11(r13);		\
+	lbz	r11,PACAIRQHAPPENED(r13);		\
+	or	r11,r11,r10;				\
+	stb	r11,PACAIRQHAPPENED(r13);		\
+	andi.	r10,r10,PACA_HAPPENED_DEC;		\
+	beq	1f;					\
+	lis	r10,0x7fff;				\
+	ori	r10,r10,0xffff;				\
+	mtspr	SPRN_DEC,r10;				\
+	b	2f;					\
+1:	mfspr	r10,SPRN_##_H##SRR1;			\
+	rldicl	r10,r10,48,1; /* clear MSR_EE */	\
+	rotldi	r10,r10,16;				\
+	mtspr	SPRN_##_H##SRR1,r10;			\
+2:	mtcrf	0x80,r9;				\
+	ld	r9,PACA_EXGEN+EX_R9(r13);		\
+	ld	r10,PACA_EXGEN+EX_R10(r13);		\
+	ld	r11,PACA_EXGEN+EX_R11(r13);		\
+	GET_SCRATCH0(r13);				\
+	##_H##rfid;					\
 	b	.
+	
+	MASKED_INTERRUPT()
+	MASKED_INTERRUPT(H)
+
+/*
+ * Called from arch_local_irq_restore when an interrupt needs
+ * to be resent. r3 contains 0x500 or 0x900 to indicate which
+ * kind of interrupt. MSR:EE is already off. We generate a
+ * stackframe as if a real interrupt had happened.
+ *
+ * Note: While MSR:EE is off, we need to make sure that _MSR
+ * in the generated frame has EE set to 1 or the exception
+ * handler will not properly re-enable them.
+ */
+_GLOBAL(__reemit_interrupt)
+	/* We are going to jump to the exception common code which
+	 * will retrieve various register values from the PACA which
+	 * we don't give a damn about, so we don't bother storing them.
+	 */
+	mfmsr	r12
+	mflr	r11
+	mfcr	r9
+	ori	r12,r12,MSR_EE
+	andi.	r3,r3,0x0800
+	bne	decrementer_common
+	b	hardware_interrupt_common
 
 #ifdef CONFIG_PPC_PSERIES
 /*
@@ -838,18 +865,10 @@ __end_handlers:
  * any task or sent any task a signal, you should use
  * ret_from_except or ret_from_except_lite instead of this.
  */
-fast_exc_return_irq:			/* restores irq state too */
-	ld	r3,SOFTE(r1)
-	TRACE_AND_RESTORE_IRQ(r3);
-	ld	r12,_MSR(r1)
-	rldicl	r4,r12,49,63		/* get MSR_EE to LSB */
-	stb	r4,PACAHARDIRQEN(r13)	/* restore paca->hard_enabled */
-	b	1f
-
 	.globl	fast_exception_return
 fast_exception_return:
 	ld	r12,_MSR(r1)
-1:	ld	r11,_NIP(r1)
+	ld	r11,_NIP(r1)
 	andi.	r3,r12,MSR_RI		/* check if RI is set */
 	beq-	unrecov_fer
 
@@ -973,7 +992,7 @@ BEGIN_FW_FTR_SECTION
 	 * Here we have interrupts hard-disabled, so it is sufficient
 	 * to restore paca->{soft,hard}_enable and get out.
 	 */
-	beq	fast_exc_return_irq	/* Return from exception on success */
+	beq	14f
 END_FW_FTR_SECTION_IFCLR(FW_FEATURE_ISERIES)
 
 	/* For a hash failure, we don't bother re-enabling interrupts */
@@ -1015,6 +1034,7 @@ handle_page_fault:
 	b	.ret_from_except
 
 13:	b	.ret_from_except_lite
+14:	b	.fast_exception_return_irq
 
 /* We have a page fault that hash_page could handle but HV refused
  * the PTE insertion
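The masked-interrupt paths in this file clear MSR_EE in the saved SRR1 with a
two-instruction rotate idiom (rldicl ...,48,1 followed by rotldi ...,16). A
small C model of that idiom is sketched below for readers who do not read
rotate-and-mask instructions fluently. The names rotl64 and clear_msr_ee are
hypothetical and only illustrate the bit manipulation; they are not kernel
functions.

```c
#include <stdint.h>

/* Hypothetical helper: 64-bit rotate left, safe for n == 0. */
static uint64_t rotl64(uint64_t x, unsigned int n)
{
	n &= 63;
	return n ? (x << n) | (x >> (64 - n)) : x;
}

/*
 * Hypothetical model of
 *	rldicl	r10,r10,48,1
 *	rotldi	r10,r10,16
 * MSR_EE is 0x8000. Rotating left by 48 moves it to the most
 * significant bit, the mask clears that bit, and the second rotate
 * (48 + 16 = 64) puts every other bit back where it started.
 */
static uint64_t clear_msr_ee(uint64_t msr)
{
	uint64_t t = rotl64(msr, 48) & 0x7fffffffffffffffULL;

	return rotl64(t, 16);
}
```

The net effect is the same as msr & ~0x8000; the rotate form is used in the
assembly because it needs no extra register for the mask.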
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 06c7251..ffe08a6 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -564,7 +564,6 @@ _GLOBAL(pmac_secondary_start)
 	 */
 	li	r0,0
 	stb	r0,PACASOFTIRQEN(r13)
-	stb	r0,PACAHARDIRQEN(r13)
 
 	/* Create a temp kernel stack for use before relocation is on.	*/
 	ld	r1,PACAEMERGSP(r13)
@@ -621,13 +620,8 @@ __secondary_start:
 #ifdef CONFIG_PPC_ISERIES
 BEGIN_FW_FTR_SECTION
 	ori	r4,r4,MSR_EE
-	li	r8,1
-	stb	r8,PACAHARDIRQEN(r13)
 END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 #endif
-BEGIN_FW_FTR_SECTION
-	stb	r7,PACAHARDIRQEN(r13)
-END_FW_FTR_SECTION_IFCLR(FW_FEATURE_ISERIES)
 	stb	r7,PACASOFTIRQEN(r13)
 
 	mtspr	SPRN_SRR0,r3
@@ -782,11 +776,8 @@ BEGIN_FW_FTR_SECTION
 	mfmsr	r5
 	ori	r5,r5,MSR_EE		/* Hard Enabled on iSeries*/
 	mtmsrd	r5
-	li	r5,1
 END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 #endif
-	stb	r5,PACAHARDIRQEN(r13)	/* Hard Disabled on others */
-
 	bl	.start_kernel
 
 	/* Not reached */
diff --git a/arch/powerpc/kernel/idle_book3e.S b/arch/powerpc/kernel/idle_book3e.S
index 16c002d..b1199f8 100644
--- a/arch/powerpc/kernel/idle_book3e.S
+++ b/arch/powerpc/kernel/idle_book3e.S
@@ -32,11 +32,11 @@ _GLOBAL(book3e_idle)
 	 * since we may otherwise lose it (doorbells etc...). We know
 	 * that since PACAHARDIRQEN will have been cleared in that case.
 	 */
-	lbz	r3,PACAHARDIRQEN(r13)
+	lbz	r3,PACAIRQHAPPENED(r13)
 	cmpwi	cr0,r3,0
-	beqlr
+	bnelr
 
-	/* Now we are going to mark ourselves as soft and hard enables in
+	/* Now we are going to mark ourselves as soft and hard enabled in
 	 * order to be able to take interrupts while asleep. We inform lockdep
 	 * of that. We don't actually turn interrupts on just yet tho.
 	 */
@@ -46,7 +46,6 @@ _GLOBAL(book3e_idle)
 #endif
 	li	r0,1
 	stb	r0,PACASOFTIRQEN(r13)
-	stb	r0,PACAHARDIRQEN(r13)
 	
 	/* Interrupts will make use return to LR, so get something we want
 	 * in there
@@ -59,7 +58,6 @@ _GLOBAL(book3e_idle)
 	/* Mark them off again in the PACA as well */
 	li	r0,0
 	stb	r0,PACASOFTIRQEN(r13)
-	stb	r0,PACAHARDIRQEN(r13)
 
 	/* Tell lockdep about it */
 #ifdef CONFIG_TRACE_IRQFLAGS
diff --git a/arch/powerpc/kernel/idle_power4.S b/arch/powerpc/kernel/idle_power4.S
index ba31954..c30af92 100644
--- a/arch/powerpc/kernel/idle_power4.S
+++ b/arch/powerpc/kernel/idle_power4.S
@@ -29,14 +29,27 @@ END_FTR_SECTION_IFCLR(CPU_FTR_CAN_NAP)
 	cmpwi	0,r4,0
 	beqlr
 
-	/* Go to NAP now */
+	/* Hard disable */
 	mfmsr	r7
 	rldicl	r0,r7,48,1
 	rotldi	r0,r0,16
 	mtmsrd	r0,1			/* hard-disable interrupts */
+
+	/* Check if something happened while soft-disabled */
+	lbz	r0,PACAIRQHAPPENED(r13)
+	cmpwi	cr0,r0,0
+	bnelr
+
+	/*
+	 * Here we mark ourselves soft-enabled. We should probably
+	 * tell lockdep about it, but the interrupt will re-disable
+	 * immediately so it shouldn't be a big issue. If it becomes
+	 * one, then we should implement things the way we do on
+	 * book3e.
+	 */
 	li	r0,1
 	stb	r0,PACASOFTIRQEN(r13)	/* we'll hard-enable shortly */
-	stb	r0,PACAHARDIRQEN(r13)
+
 BEGIN_FTR_SECTION
 	DSSALL
 	sync
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index fcdff19..61f8cac 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -1,5 +1,5 @@
 /*
- *  This file contains the power_save function for 970-family CPUs.
+ *  This file contains the power_save function for Power7 CPUs
  *
  *  This program is free software; you can redistribute it and/or
  *  modify it under the terms of the GNU General Public License
@@ -51,9 +51,23 @@ _GLOBAL(power7_idle)
 	rldicl	r9,r9,48,1
 	rotldi	r9,r9,16
 	mtmsrd	r9,1			/* hard-disable interrupts */
-	li	r0,0
+
+	/* Check if something happened while soft-disabled */
+	lbz	r0,PACAIRQHAPPENED(r13)
+	cmpwi	cr0,r0,0
+	beq	1f
+	addi	r1,r1,INT_FRAME_SIZE
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+
+	/*
+	 * Here we mark ourselves soft-disabled (we should already be
+	 * actually...). The interrupt is only going to happen after
+	 * we return to the caller and it does a local_irq_enable()
+	 */
+1:	li	r0,0
 	stb	r0,PACASOFTIRQEN(r13)	/* we'll hard-enable shortly */
-	stb	r0,PACAHARDIRQEN(r13)
 	stb	r0,PACA_NAPSTATELOST(r13)
 
 	/* Continue saving state */
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 701d4ac..7e5b94b 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -72,6 +72,7 @@
 #include <asm/paca.h>
 #include <asm/firmware.h>
 #include <asm/lv1call.h>
+#include <asm/irqflags.h>
 #endif
 #define CREATE_TRACE_POINTS
 #include <asm/trace.h>
@@ -99,14 +100,14 @@ EXPORT_SYMBOL(irq_desc);
 
 int distribute_irqs = 1;
 
-static inline notrace unsigned long get_hard_enabled(void)
+static inline notrace unsigned long get_irq_happened(void)
 {
-	unsigned long enabled;
+	unsigned long happened;
 
 	__asm__ __volatile__("lbz %0,%1(13)"
-	: "=r" (enabled) : "i" (offsetof(struct paca_struct, hard_enabled)));
+	: "=r" (happened) : "i" (offsetof(struct paca_struct, irq_happened)));
 
-	return enabled;
+	return happened;
 }
 
 static inline notrace void set_soft_enabled(unsigned long enable)
@@ -115,81 +116,140 @@ static inline notrace void set_soft_enabled(unsigned long enable)
 	: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled)));
 }
 
-static inline notrace void decrementer_check_overflow(void)
+static inline int decrementer_check_overflow(void)
 {
 	u64 now = get_tb_or_rtc();
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
 
-	if (now >= *next_tb)
-		set_dec(1);
+	return now >= *next_tb;
 }
 
-notrace void arch_local_irq_restore(unsigned long en)
+/* This is called whenever we are re-enabling interrupts
+ * and returns either 0 (nothing to do) or 0x500/0x900 if there's
+ * either an EE or a DEC to generate.
+ *
+ * This is called in two contexts: From arch_local_irq_restore()
+ * before soft-enabling interrupts, and from the exception exit
+ * path when returning from an interrupt from a soft-disabled to
+ * a soft-enabled context. In both cases we have interrupts hard
+ * disabled.
+ *
+ * We take care of only clearing the bits we handled in the
+ * PACA irq_happened field since we can only re-emit one at a
+ * time and we don't want to "lose" one.
+ */
+notrace unsigned int __check_irq_reemit(void)
 {
 	/*
-	 * get_paca()->soft_enabled = en;
-	 * Is it ever valid to use local_irq_restore(0) when soft_enabled is 1?
-	 * That was allowed before, and in such a case we do need to take care
-	 * that gcc will set soft_enabled directly via r13, not choose to use
-	 * an intermediate register, lest we're preempted to a different cpu.
+	 * We use local_paca rather than get_paca() to avoid all
+	 * the debug_smp_processor_id() business in this low level
+	 * function
 	 */
-	set_soft_enabled(en);
-	if (!en)
-		return;
+	unsigned char happened = local_paca->irq_happened;
+
+	/* Clear bit 0 which we wouldn't clear otherwise */
+	local_paca->irq_happened &= ~1;
 
-#ifdef CONFIG_PPC_STD_MMU_64
-	if (firmware_has_feature(FW_FEATURE_ISERIES)) {
-		/*
-		 * Do we need to disable preemption here?  Not really: in the
-		 * unlikely event that we're preempted to a different cpu in
-		 * between getting r13, loading its lppaca_ptr, and loading
-		 * its any_int, we might call iseries_handle_interrupts without
-		 * an interrupt pending on the new cpu, but that's no disaster,
-		 * is it?  And the business of preempting us off the old cpu
-		 * would itself involve a local_irq_restore which handles the
-		 * interrupt to that cpu.
-		 *
-		 * But use "local_paca->lppaca_ptr" instead of "get_lppaca()"
-		 * to avoid any preemption checking added into get_paca().
-		 */
-		if (local_paca->lppaca_ptr->int_dword.any_int)
-			iseries_handle_interrupts();
+	/*
+	 * Force the delivery of pending soft-disabled interrupts on PS3.
+	 * Any HV call will have this side effect.
+	 */
+	if (firmware_has_feature(FW_FEATURE_PS3_LV1)) {
+		u64 tmp, tmp2;
+		lv1_get_version_info(&tmp, &tmp2);
 	}
-#endif /* CONFIG_PPC_STD_MMU_64 */
 
 	/*
-	 * if (get_paca()->hard_enabled) return;
-	 * But again we need to take care that gcc gets hard_enabled directly
-	 * via r13, not choose to use an intermediate register, lest we're
-	 * preempted to a different cpu in between the two instructions.
+	 * We may have missed a decrementer interrupt. We check the
+	 * decrementer itself rather than the paca irq_happened field
+	 * in case we also had a rollover while hard disabled
 	 */
-	if (get_hard_enabled())
-		return;
+	local_paca->irq_happened &= ~PACA_HAPPENED_DEC;
+	if (decrementer_check_overflow())
+		return 0x900;
+
+	/* Finally check if an external interrupt happened */
+	local_paca->irq_happened &= ~PACA_HAPPENED_EE;
+	if (happened & PACA_HAPPENED_EE)
+		return 0x500;
+
+#ifdef CONFIG_PPC_BOOK3E
+	/* Finally check if an EPR external interrupt happened;
+	 * this bit is typically set if we need to handle another
+	 * "edge" interrupt from within the MPIC "EPR" handler
+	 */
+	local_paca->irq_happened &= ~PACA_HAPPENED_EE_EDGE;
+	if (happened & PACA_HAPPENED_EE_EDGE)
+		return 0x500;
+
+	local_paca->irq_happened &= ~PACA_HAPPENED_DBELL;
+	if (happened & PACA_HAPPENED_DBELL)
+		return 0x280;
+#endif /* CONFIG_PPC_BOOK3E */
+
+	/* There should be nothing left ! */
+	BUG_ON(local_paca->irq_happened != 0);
+
+	return 0;
+}
 
+notrace void arch_local_irq_restore(unsigned long en)
+{
+	unsigned int reemit;
+
+	/* Write the new soft-enabled value */
+	set_soft_enabled(en);
+	if (!en)
+		return;
 	/*
-	 * Need to hard-enable interrupts here.  Since currently disabled,
-	 * no need to take further asm precautions against preemption; but
-	 * use local_paca instead of get_paca() to avoid preemption checking.
+	 * From this point onward, we can take interrupts, preempt,
+	 * etc... unless we got hard-disabled. We check if an event
+	 * happened. If none happened, we know we can just return.
+	 *
+	 * We may have preempted before the check below, in which case
+	 * we are checking the "new" CPU instead of the old one. This
+	 * is only a problem if an event happened on the "old" CPU.
+	 *
+	 * External interrupt events on non-iseries will have caused
+	 * interrupts to be hard-disabled, so there is no problem, we
+	 * cannot have preempted.
+	 *
+	 * That leaves us with EEs on iSeries or decrementer interrupts,
+	 * which I decided to safely ignore. The preemption would have
+	 * itself been the result of an interrupt, upon which return we
+	 * will have checked for pending events on the old CPU.
 	 */
-	local_paca->hard_enabled = en;
+	if (!get_irq_happened())
+		return;
+	/*
+	 * We need to hard disable to get a trusted value from
+	 * __check_irq_reemit(). We also need to soft-disable
+	 * again to avoid warnings in there due to the use of
+	 * per-cpu variables.
+	 */
+	__hard_irq_disable();
+	set_soft_enabled(0);
 
 	/*
-	 * Trigger the decrementer if we have a pending event. Some processors
-	 * only trigger on edge transitions of the sign bit. We might also
-	 * have disabled interrupts long enough that the decrementer wrapped
-	 * to positive.
+	 * Check if anything needs to be re-emitted. We haven't
+	 * soft-enabled yet to avoid warnings in decrementer_check_overflow
+	 * accessing per-cpu variables
 	 */
-	decrementer_check_overflow();
+	reemit = __check_irq_reemit();
+
+	/* We can soft-enable now */
+	set_soft_enabled(1);
 
 	/*
-	 * Force the delivery of pending soft-disabled interrupts on PS3.
-	 * Any HV call will have this side effect.
+	 * And re-emit if we have to. This will return with interrupts
+	 * hard-enabled.
 	 */
-	if (firmware_has_feature(FW_FEATURE_PS3_LV1)) {
-		u64 tmp, tmp2;
-		lv1_get_version_info(&tmp, &tmp2);
+	if (reemit) {
+		__reemit_interrupt(reemit);
+		return;
 	}
 
+	/* Finally, let's ensure we are hard enabled */
 	__hard_irq_enable();
 }
 EXPORT_SYMBOL(arch_local_irq_restore);
@@ -360,8 +420,27 @@ void do_IRQ(struct pt_regs *regs)
 
 	check_stack_overflow();
 
+	/* Query the platform PIC for the interrupt & ack it */
 	irq = ppc_md.get_irq();
 
+#ifdef CONFIG_PPC64
+	/*
+	 * At this point, we are soft-disabled and hard-disabled.
+	 *
+	 * get_irq() will have caused the PIC to lower the EE line
+	 * so we can improve the quality of perf samples by hard
+	 * enabling in order to let performance interrupts through.
+	 *
+	 * In the event where we might have another interrupt pending
+	 * the worst case is that we take it and hard-disable again
+	 * after setting irq_happened, which will cause us to come
+	 * back when the interrupt exit tests paca->irq_happened again.
+	 */
+	get_paca()->irq_happened &= ~PACA_HAPPENED;
+	__hard_irq_enable();
+#endif /* CONFIG_PPC64 */
+
+	/* Process the interrupt */
 	if (irq != NO_IRQ && irq != NO_IRQ_IGNORE)
 		handle_one_irq(irq);
 	else if (irq != NO_IRQ_IGNORE)
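To make the "only clear the bits we handled" rule of __check_irq_reemit()
easier to follow, here is a stand-alone C model of its control flow. It is a
sketch under stated assumptions, not the kernel code: the bit values below
are made up for illustration (the real PACA_HAPPENED_* definitions live in
asm/hw_irq.h, which is not shown here), struct model_paca stands in for the
PACA, the decrementer check is passed in as a flag, and the PS3 and EE_EDGE
cases are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values, for illustration only. */
enum {
	HAPPENED	= 0x01,	/* "something happened" marker bit */
	HAPPENED_EE	= 0x02,	/* external interrupt */
	HAPPENED_DEC	= 0x04,	/* decrementer */
	HAPPENED_DBELL	= 0x08,	/* doorbell (Book3E) */
};

/* Hypothetical stand-in for the per-CPU PACA. */
struct model_paca {
	uint8_t irq_happened;
};

/*
 * Returns the vector to re-emit (0x900, 0x500, 0x280) or 0.
 * Each call handles at most one event and leaves the bits of
 * lower-priority events set, so they are picked up next time.
 */
static unsigned int model_check_irq_reemit(struct model_paca *paca,
					   int dec_overflowed)
{
	uint8_t happened = paca->irq_happened;	/* snapshot */

	paca->irq_happened &= (uint8_t)~HAPPENED;

	/* The decrementer state itself decides, not the DEC bit */
	paca->irq_happened &= (uint8_t)~HAPPENED_DEC;
	if (dec_overflowed)
		return 0x900;

	paca->irq_happened &= (uint8_t)~HAPPENED_EE;
	if (happened & HAPPENED_EE)
		return 0x500;

	paca->irq_happened &= (uint8_t)~HAPPENED_DBELL;
	if (happened & HAPPENED_DBELL)
		return 0x280;

	/* Mirrors the BUG_ON(): nothing should be left */
	assert(paca->irq_happened == 0);
	return 0;
}
```

In this model, a PACA with both the EE and DEC bits set and an overflowed
decrementer yields 0x900 on the first call and keeps the EE bit, so the
following call (on return from the re-emitted timer interrupt) yields 0x500.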
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 567dd7c..39f201f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -259,7 +259,6 @@ void accumulate_stolen_time(void)
 	u64 sst, ust;
 
 	u8 save_soft_enabled = local_paca->soft_enabled;
-	u8 save_hard_enabled = local_paca->hard_enabled;
 
 	/* We are called early in the exception entry, before
 	 * soft/hard_enabled are sync'ed to the expected state
@@ -268,7 +267,6 @@ void accumulate_stolen_time(void)
 	 * complain
 	 */
 	local_paca->soft_enabled = 0;
-	local_paca->hard_enabled = 0;
 
 	sst = scan_dispatch_log(local_paca->starttime_user);
 	ust = scan_dispatch_log(local_paca->starttime);
@@ -277,7 +275,6 @@ void accumulate_stolen_time(void)
 	local_paca->stolen_time += ust + sst;
 
 	local_paca->soft_enabled = save_soft_enabled;
-	local_paca->hard_enabled = save_hard_enabled;
 }
 
 static inline u64 calculate_stolen_time(u64 stop_tb)
@@ -589,6 +586,18 @@ void timer_interrupt(struct pt_regs * regs)
 		do_IRQ(regs);
 #endif
 
+#ifdef CONFIG_PPC64
+	/* Let's hard enable interrupts now that we have reset
+	 * the DEC (or acked it on BookE)
+	 *
+	 * We skip that if there's a pending EE "level" interrupt
+	 * as an optimization
+	 */
+	get_paca()->irq_happened &= ~PACA_HAPPENED;
+	if (!(get_paca()->irq_happened & PACA_HAPPENED_EE))
+		__hard_irq_enable();	
+#endif /* CONFIG_PPC64 */
+
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
diff --git a/arch/powerpc/platforms/iseries/Makefile b/arch/powerpc/platforms/iseries/Makefile
index a7602b1..7208589 100644
--- a/arch/powerpc/platforms/iseries/Makefile
+++ b/arch/powerpc/platforms/iseries/Makefile
@@ -2,7 +2,7 @@ ccflags-y	:= -mno-minimal-toc
 
 obj-y += exception.o
 obj-y += hvlog.o hvlpconfig.o lpardata.o setup.o dt.o mf.o lpevents.o \
-	hvcall.o proc.o htab.o iommu.o misc.o irq.o
+	hvcall.o proc.o htab.o iommu.o irq.o
 obj-$(CONFIG_PCI) += pci.o
 obj-$(CONFIG_SMP) += smp.o
 obj-$(CONFIG_VIOPATH) += viopath.o vio.o
diff --git a/arch/powerpc/platforms/iseries/exception.S b/arch/powerpc/platforms/iseries/exception.S
index f519ee1..508f863 100644
--- a/arch/powerpc/platforms/iseries/exception.S
+++ b/arch/powerpc/platforms/iseries/exception.S
@@ -32,6 +32,7 @@
 #include <asm/ptrace.h>
 #include <asm/cputable.h>
 #include <asm/mmu.h>
+#include <asm/hw_irq.h>
 
 #include "exception.h"
 
@@ -261,16 +262,20 @@ system_call_iSeries:
 
 decrementer_iSeries_masked:
 	/* We may not have a valid TOC pointer in here. */
-	li	r11,1
+	li	r11,PACA_HAPPENED_DEC
 	ld	r12,PACALPPACAPTR(r13)
 	stb	r11,LPPACADECRINT(r12)
 	li	r12,-1
 	clrldi	r12,r12,33	/* set DEC to 0x7fffffff */
 	mtspr	SPRN_DEC,r12
-	/* fall through */
+	b	1f
 
 hardware_interrupt_iSeries_masked:
-	mtcrf	0x80,r9		/* Restore regs */
+	li	r11,PACA_HAPPENED_EE
+1:	mtcrf	0x80,r9		/* Restore regs */
+	lbz	r10,PACAIRQHAPPENED(r13)
+	or	r11,r10,r11
+	stb	r11,PACAIRQHAPPENED(r13)
 	ld	r12,PACALPPACAPTR(r13)
 	ld	r11,LPPACASRR0(r12)
 	ld	r12,LPPACASRR1(r12)
diff --git a/arch/powerpc/platforms/iseries/misc.S b/arch/powerpc/platforms/iseries/misc.S
deleted file mode 100644
index 2c6ff0f..0000000
--- a/arch/powerpc/platforms/iseries/misc.S
+++ /dev/null
@@ -1,26 +0,0 @@
-/*
- * This file contains miscellaneous low-level functions.
- *    Copyright (C) 1995-2005 IBM Corp
- *
- * Largely rewritten by Cort Dougan (cort@cs.nmt.edu)
- * and Paul Mackerras.
- * Adapted for iSeries by Mike Corrigan (mikejc@us.ibm.com)
- * PPC64 updates by Dave Engebretsen (engebret@us.ibm.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include <asm/processor.h>
-#include <asm/asm-offsets.h>
-#include <asm/ppc_asm.h>
-
-	.text
-
-/* Handle pending interrupts in interrupt context */
-_GLOBAL(iseries_handle_interrupts)
-	li	r0,0x5555
-	sc
-	blr
diff --git a/arch/powerpc/platforms/pseries/processor_idle.c b/arch/powerpc/platforms/pseries/processor_idle.c
index 085fd3f..019a529 100644
--- a/arch/powerpc/platforms/pseries/processor_idle.c
+++ b/arch/powerpc/platforms/pseries/processor_idle.c
@@ -96,6 +96,25 @@ out:
 	return index;
 }
 
+static void check_and_cede_processor(void)
+{
+	/*
+	 * Interrupts are soft-disabled at this point,
+	 * but not hard disabled. So an interrupt might have
+	 * occurred before entering NAP, and would be potentially
+	 * lost (edge events, decrementer events, etc...) unless
+	 * we first hard disable then check.
+	 *
+	 * We must use the low level __hard_irq_disable() and not
+	 * hard_irq_disable() as the latter will set a bit in
+	 * paca->irq_happened (to force re-enable later) which we
+	 * don't need nor want here.
+	 */
+	__hard_irq_disable();
+	if (get_paca()->irq_happened == 0)
+		cede_processor();
+}
+
 static int dedicated_cede_loop(struct cpuidle_device *dev,
 				struct cpuidle_driver *drv,
 				int index)
@@ -108,7 +127,8 @@ static int dedicated_cede_loop(struct cpuidle_device *dev,
 
 	ppc64_runlatch_off();
 	HMT_medium();
-	cede_processor();
+
+	check_and_cede_processor();
 
 	get_lppaca()->donate_dedicated_cpu = 0;
 	dev->last_residency =
@@ -132,7 +152,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 	 * processor. When returning here, external interrupts
 	 * are enabled.
 	 */
-	cede_processor();
+	check_and_cede_processor();
 
 	dev->last_residency =
 		(int)idle_loop_epilog(in_purr, kt_before);
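
The ordering that check_and_cede_processor() relies on (hard-disable first, then test the pending flags, and only then cede) can be sketched in user space. This is a minimal model, not kernel code: struct fake_cpu, fake_cede() and fake_check_and_cede() are hypothetical names, and the "interrupts are on again after the cede" behaviour is taken from the comment in shared_cede_loop().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; fields only loosely mirror the paca. */
struct fake_cpu {
	bool hard_enabled;     /* models the MSR[EE] bit */
	uint8_t irq_happened;  /* models paca->irq_happened */
	int cede_calls;        /* how often the CPU actually went idle */
};

/* Models cede_processor(): the CPU idles, then resumes with interrupts on. */
static void fake_cede(struct fake_cpu *cpu)
{
	cpu->cede_calls++;
	cpu->hard_enabled = true;
}

/*
 * Hard-disable first, then test the pending flags. With interrupts
 * hard-disabled, no new event can be latched between the test and the
 * cede, so an event recorded while soft-disabled is never slept through.
 * Returns true if the CPU was ceded.
 */
static bool fake_check_and_cede(struct fake_cpu *cpu)
{
	assert(cpu != NULL);
	cpu->hard_enabled = false;
	if (cpu->irq_happened != 0)
		return false;	/* something is pending: do not go idle */
	fake_cede(cpu);
	return true;
}
```

In the model, a CPU with a pending bit set returns immediately with interrupts still hard-disabled, which is the state the idle loop is then expected to unwind.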

^ permalink raw reply related

* [PATCH] powerpc: remove legacy iSeries from ppc64_defconfig
From: Stephen Rothwell @ 2012-02-15  2:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: ppc-dev

Since we are heading towards removing the Legacy iSeries platform, start
by no longer building it for ppc64_defconfig.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/powerpc/configs/ppc64_defconfig |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 2156e07..1acf650 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -24,10 +24,6 @@ CONFIG_PPC_SPLPAR=y
 CONFIG_SCANLOG=m
 CONFIG_PPC_SMLPAR=y
 CONFIG_DTL=y
-CONFIG_PPC_ISERIES=y
-CONFIG_VIODASD=y
-CONFIG_VIOCD=m
-CONFIG_VIOTAPE=m
 CONFIG_PPC_MAPLE=y
 CONFIG_PPC_PASEMI=y
 CONFIG_PPC_PASEMI_IOMMU=y
@@ -259,7 +255,6 @@ CONFIG_PASEMI_MAC=y
 CONFIG_MLX4_EN=m
 CONFIG_QLGE=m
 CONFIG_BE2NET=m
-CONFIG_ISERIES_VETH=m
 CONFIG_PPP=m
 CONFIG_PPP_ASYNC=m
 CONFIG_PPP_SYNC_TTY=m
-- 
1.7.9

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

^ permalink raw reply related

* RE: [PATCH] powerpc/usb: fix issue of CPU halt when missing USB PHY clock
From: Benjamin Herrenschmidt @ 2012-02-15  2:31 UTC (permalink / raw)
  To: Liu Shengzhou-B36685
  Cc: linuxppc-dev@lists.ozlabs.org, linux-usb@vger.kernel.org,
	Pavan Kondeti
In-Reply-To: <3F453DDFF675A64A89321A1F352810216EDA8B@039-SN1MPN1-005.039d.mgd.msft.net>


> > > +	if (!(in_be32(non_ehci + FSL_SOC_USB_CTRL) & CTRL_PHY_CLK_VALID))
> > {
> > > +		printk(KERN_WARNING "fsl-ehci: USB PHY clock invalid\n");
> > > +		return -1;
> > 
> > Please return a proper error code. -ENODEV ?
> 
> [Shengzhou] Ok, updated in v2, thanks.
> >

Note that I just got a p5020ds from FSL, and with its default
configuration, when I build & boot current upstream with FSL USB
support (64-bit kernel) it hangs when initializing USB.

With or without this patch.

It complains about invalid dr-mode (there are two USB nodes in the .dts
coming from uboot, an "mph" and a "dr"; the former has no dr-mode property
in the device-tree).

Is the current kernel incompatible with old device-trees? (That would
be a shame...)

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC PATCH v6 00/10] fadump: Firmware-assisted dump support for Powerpc.
From: Paul Mackerras @ 2012-02-15  4:16 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: Amerigo Wang, Kexec-ml, Linux Kernel, Milton Miller, linuxppc-dev,
	Randy Dunlap, Anton Blanchard, Vivek Goyal, Eric W. Biederman
In-Reply-To: <20111210064301.10195.3344.stgit@mars.in.ibm.com>

On Sat, Dec 10, 2011 at 12:19:59PM +0530, Mahesh J Salgaonkar wrote:

> The most of the code implementation has been adapted from phyp assisted dump
> implementation written by Linas Vepstas and Manish Ahuja.

When you repost the series, please be explicit about what the
relationship between the new fadump facility and the old phyp-dump
is, both in the documentation you're adding and in the patch
descriptions.

I gather that fadump uses the same firmware interfaces as phyp-dump,
and can be characterised as a rewrite of phyp-dump.  It would be good
if you would explicitly mention:

- What advantages fadump has over phyp-dump
- Whether there are any capabilities that phyp-dump does that fadump
  doesn't
- What is different between fadump and phyp-dump in the interface to
  usermode code
- Any user-visible differences between how fadump operates compared to
  phyp-dump.  For example, will users see a difference in how much
  memory is available to the kernel?

Paul.

^ permalink raw reply

* [PATCH 1/3] powerpc/fsl/pci: Fix PCIe fixup regression
From: Benjamin Herrenschmidt @ 2012-02-15  4:22 UTC (permalink / raw)
  To: linuxppc-dev

Upstream changes to the way PHB resources are registered
broke the resource fixup for FSL boards.

We can no longer rely on the resource pointer array for the PHB's
pci_bus structure, so let's leave it alone and go straight for
the PHB resources instead. This also makes the code generally
more readable.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/sysdev/fsl_pci.c |   47 ++++++++++++++++++++++++----------------
 1 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 30eb17e..b392144 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -385,26 +385,35 @@ static void __init setup_pci_cmd(struct pci_controller *hose)
 void fsl_pcibios_fixup_bus(struct pci_bus *bus)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
-	int i;
-
-	if ((bus->parent == hose->bus) &&
-	    ((fsl_pcie_bus_fixup &&
-	      early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP)) ||
-	     (hose->indirect_type & PPC_INDIRECT_TYPE_NO_PCIE_LINK)))
-	{
-		for (i = 0; i < 4; ++i) {
+	int i, is_pcie = 0, no_link;
+
+	/* The root complex bridge comes up with bogus resources,
+	 * we copy the PHB ones in.
+	 *
+	 * With the current generic PCI code, the PHB bus no longer
+	 * has bus->resource[0..4] set, so things are a bit more
+	 * tricky.
+	 */
+
+	if (fsl_pcie_bus_fixup)
+		is_pcie = early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP);
+	no_link = !!(hose->indirect_type & PPC_INDIRECT_TYPE_NO_PCIE_LINK);
+
+	if (bus->parent == hose->bus && (is_pcie || no_link)) {
+		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; ++i) {
 			struct resource *res = bus->resource[i];
-			struct resource *par = bus->parent->resource[i];
-			if (res) {
-				res->start = 0;
-				res->end   = 0;
-				res->flags = 0;
-			}
-			if (res && par) {
-				res->start = par->start;
-				res->end   = par->end;
-				res->flags = par->flags;
-			}
+			struct resource *par;
+
+			if (!res)
+				continue;
+			if (!i)
+				par = &hose->io_resource;
+			else
+				par = &hose->mem_resources[i-1];
+
+			res->start = par->start;
+			res->end   = par->end;
+			res->flags = par->flags;
 		}
 	}
 }
-- 
1.7.7.3
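
The index mapping used by the new loop (slot 0 takes the PHB I/O window, slot i takes memory window i-1, NULL slots are skipped) can be sketched outside the kernel as follows. The fake_* types are hypothetical stand-ins for struct resource, struct pci_controller and struct pci_bus, and the bridge slot count of 4 is an assumption about PCI_BRIDGE_RESOURCE_NUM.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_BRIDGE_RESOURCE_NUM 4  /* assumed value of PCI_BRIDGE_RESOURCE_NUM */

struct fake_resource { unsigned long start, end, flags; };

/* Stand-in for the PHB: one I/O window plus the memory windows. */
struct fake_hose {
	struct fake_resource io_resource;
	struct fake_resource mem_resources[FAKE_BRIDGE_RESOURCE_NUM - 1];
};

/* Stand-in for the bus below the root complex bridge. */
struct fake_bus {
	struct fake_resource *resource[FAKE_BRIDGE_RESOURCE_NUM];
};

/* Copy the PHB windows into the bridge slots; returns how many were filled. */
static int fake_copy_phb_resources(struct fake_bus *bus,
				   const struct fake_hose *hose)
{
	int i, copied = 0;

	assert(bus != NULL && hose != NULL);
	for (i = 0; i < FAKE_BRIDGE_RESOURCE_NUM; ++i) {
		struct fake_resource *res = bus->resource[i];
		const struct fake_resource *par;

		if (!res)
			continue;	/* slot not populated on this bridge */
		/* slot 0 is I/O, slots 1..3 map to mem_resources[0..2] */
		par = (i == 0) ? &hose->io_resource : &hose->mem_resources[i - 1];
		*res = *par;
		copied++;
	}
	return copied;
}
```

Reading the windows from the host structure rather than from the parent bus is the point of the patch: the copy no longer depends on bus->parent->resource[] being populated.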

^ permalink raw reply related

* [PATCH 2/3] powerpc/fsl/pci: Improve PCIe host bridge quirks
From: Benjamin Herrenschmidt @ 2012-02-15  4:22 UTC (permalink / raw)
  To: linuxppc-dev

The current quirk is a header quirk which runs after
pci_setup_device(). This means that we can get warnings
from the latter due to an unrecognized class code,
and some parts of pci_dev aren't set up properly.

On the other hand, we somewhat rely on that later on
as this causes "pci_read_bases" to not be called, thus
not bringing in the bogus BAR 0 of the root complex.

This breaks up the quirk in two. One is an early quirk
and fixes up the class code. The other one remains a
header quirk and removes the bogus resource.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/sysdev/fsl_pci.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index b392144..a1da340 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -36,23 +36,48 @@
 
 static int fsl_pcie_bus_fixup, is_mpc83xx_pci;
 
-static void __init quirk_fsl_pcie_header(struct pci_dev *dev)
+static int __devinit __quirk_is_fsl_pcie_host(struct pci_dev *dev)
 {
 	u8 progif;
 
+	/* We aren't a root complex, don't bother */
+	if (dev->bus->self)
+		return 0;
+
 	/* if we aren't a PCIe don't bother */
 	if (!pci_find_capability(dev, PCI_CAP_ID_EXP))
-		return;
+		return 0;
 
 	/* if we aren't in host mode don't bother */
 	pci_read_config_byte(dev, PCI_CLASS_PROG, &progif);
 	if (progif & 0x1)
+		return 0;
+
+	return 1;
+}
+
+static void __devinit quirk_fsl_pcie_early(struct pci_dev *dev)
+{
+
+	if (!__quirk_is_fsl_pcie_host(dev))
 		return;
 
+	/* Fix incorrect class code */
 	dev->class = PCI_CLASS_BRIDGE_PCI << 8;
 	fsl_pcie_bus_fixup = 1;
-	return;
 }
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, quirk_fsl_pcie_early);
+
+static void __devinit quirk_fsl_pcie_header(struct pci_dev *dev)
+{
+	if (!__quirk_is_fsl_pcie_host(dev))
+		return;
+
+	/* Hide resource 0 */
+	dev->resource[0].start = dev->resource[0].end = 0;
+	dev->resource[0].flags = 0;
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, quirk_fsl_pcie_header);
 
 static int __init fsl_pcie_check_link(struct pci_controller *hose)
 {
@@ -81,6 +106,7 @@ static int fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask)
 	 */
 	if ((dev->bus == &pci_bus_type) &&
 	    dma_mask >= DMA_BIT_MASK(MAX_PHYS_ADDR_BITS)) {
+		dev_info(dev, "Switching PCI device to direct DMA ops\n");
 		set_dma_ops(dev, &dma_direct_ops);
 		set_dma_offset(dev, pci64_dma_offset);
 	}
@@ -496,8 +522,6 @@ int __init fsl_add_bridge(struct device_node *dev, int is_primary)
 }
 #endif /* CONFIG_FSL_SOC_BOOKE || CONFIG_PPC_86xx */
 
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, quirk_fsl_pcie_header);
-
 #if defined(CONFIG_PPC_83xx) || defined(CONFIG_PPC_MPC512x)
 struct mpc83xx_pcie_priv {
 	void __iomem *cfg_type0;
-- 
1.7.7.3
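
The split into an early and a header quirk that share one predicate can be sketched as below. This is a user-space model only: struct fake_pci_dev and the fake_* functions are hypothetical, the booleans stand in for dev->bus->self and pci_find_capability(), and 0x0604 is used as the value of PCI_CLASS_BRIDGE_PCI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_CLASS_BRIDGE_PCI 0x0604u  /* value of PCI_CLASS_BRIDGE_PCI */

struct fake_pci_dev {
	bool behind_bridge;   /* models dev->bus->self != NULL */
	bool has_pcie_cap;    /* models pci_find_capability(dev, PCI_CAP_ID_EXP) */
	uint8_t progif;       /* PCI_CLASS_PROG byte; bit 0 set = not host mode */
	uint32_t class_code;  /* models dev->class */
	unsigned long bar0_start, bar0_end, bar0_flags;
};

/* Shared predicate: root bus, PCIe capable, and in host mode. */
static bool fake_is_fsl_pcie_host(const struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	if (dev->behind_bridge)
		return false;
	if (!dev->has_pcie_cap)
		return false;
	return (dev->progif & 0x1) == 0;
}

/* Early stage: fix the class code before the generic setup looks at it. */
static void fake_quirk_early(struct fake_pci_dev *dev)
{
	if (fake_is_fsl_pcie_host(dev))
		dev->class_code = FAKE_CLASS_BRIDGE_PCI << 8;
}

/* Header stage: hide the bogus BAR 0 once the BARs have been read. */
static void fake_quirk_header(struct fake_pci_dev *dev)
{
	if (!fake_is_fsl_pcie_host(dev))
		return;
	dev->bar0_start = dev->bar0_end = 0;
	dev->bar0_flags = 0;
}
```

Devices that fail the predicate (for example an endpoint-mode controller with progif bit 0 set) pass through both stages untouched.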

^ permalink raw reply related

* [PATCH 3/3] powerpc/pci: Make pci_probe_only default to 0
From: Benjamin Herrenschmidt @ 2012-02-15  4:23 UTC (permalink / raw)
  To: linuxppc-dev

pci_probe_only is set on ppc64 to prevent resource re-allocation
by the core. It's meant to be used in very specific circumstances
such as when operating under a hypervisor that may prevent such
re-allocation.

Instead of defaulting to 1, we make it default to 0 and explicitly
set it in the few cases where we need it.

This fixes FSL PCI which wants it clear among others.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/kernel/pci_64.c           |    2 +-
 arch/powerpc/platforms/powermac/pci.c  |    3 ---
 arch/powerpc/platforms/powernv/pci.c   |    3 ---
 arch/powerpc/platforms/pseries/setup.c |    3 +++
 arch/powerpc/platforms/wsp/wsp_pci.c   |    1 -
 5 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 3318d39..f627eb7 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -33,7 +33,7 @@
 #include <asm/machdep.h>
 #include <asm/ppc-pci.h>
 
-unsigned long pci_probe_only = 1;
+unsigned long pci_probe_only = 0;
 
 /* pci_io_base -- the base address from which io bars are offsets.
  * This is the lowest I/O base address (so bar values are always positive),
diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c
index 31a7d3a..43bbe1b 100644
--- a/arch/powerpc/platforms/powermac/pci.c
+++ b/arch/powerpc/platforms/powermac/pci.c
@@ -1059,9 +1059,6 @@ void __init pmac_pci_init(void)
 	}
 	/* pmac_check_ht_link(); */
 
-	/* We can allocate missing resources if any */
-	pci_probe_only = 0;
-
 #else /* CONFIG_PPC64 */
 	init_p2pbridge();
 	init_second_ohare();
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a70bc1e..a053f4f 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -557,9 +557,6 @@ void __init pnv_pci_init(void)
 
 	pci_set_flags(PCI_CAN_SKIP_ISA_ALIGN);
 
-	/* We do not want to just probe */
-	pci_probe_only = 0;
-
 	/* OPAL absent, try POPAL first then RTAS detection of PHBs */
 	if (!firmware_has_feature(FW_FEATURE_OPAL)) {
 #ifdef CONFIG_PPC_POWERNV_RTAS
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index f79f127..386e265 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -380,6 +380,9 @@ static void __init pSeries_setup_arch(void)
 
 	fwnmi_init();
 
+	/* By default, only probe PCI (can be overridden by rtas_pci) */
+	pci_probe_only = 1;
+
 	/* Find and initialize PCI host bridges */
 	init_pci_config_tokens();
 	find_and_init_phbs();
diff --git a/arch/powerpc/platforms/wsp/wsp_pci.c b/arch/powerpc/platforms/wsp/wsp_pci.c
index d24b3ac..763014c 100644
--- a/arch/powerpc/platforms/wsp/wsp_pci.c
+++ b/arch/powerpc/platforms/wsp/wsp_pci.c
@@ -682,7 +682,6 @@ static int __init wsp_setup_one_phb(struct device_node *np)
 	/* XXX Force re-assigning of everything for now */
 	pci_add_flags(PCI_REASSIGN_ALL_BUS | PCI_REASSIGN_ALL_RSRC |
 		      PCI_ENABLE_PROC_DOMAINS);
-	pci_probe_only = 0;
 
 	/* Calculate how the TCE space is divided */
 	phb->dma32_base		= 0;
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH 6/12] arch/powerpc: remove references to cpu_*_map.
From: Rusty Russell @ 2012-02-15  4:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev; +Cc: linux-kernel

From: Rusty Russell <rusty@rustcorp.com.au>

This has been obsolescent for a while; time for the final push.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/platforms/wsp/smp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/wsp/smp.c b/arch/powerpc/platforms/wsp/smp.c
--- a/arch/powerpc/platforms/wsp/smp.c
+++ b/arch/powerpc/platforms/wsp/smp.c
@@ -71,7 +71,7 @@ int __devinit smp_a2_kick_cpu(int nr)
 
 static int __init smp_a2_probe(void)
 {
-	return cpus_weight(cpu_possible_map);
+	return num_possible_cpus();
 }
 
 static struct smp_ops_t a2_smp_ops = {
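
The equivalence behind this one-line change is that the number of possible CPUs is simply the population count of the possible-CPU mask. A minimal user-space sketch, assuming a hypothetical 64-bit mask type in place of the kernel's cpumask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-CPU mask standing in for cpu_possible_map. */
typedef uint64_t fake_cpumask_t;

static fake_cpumask_t fake_possible_mask;

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned int fake_mask_weight(fake_cpumask_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;
		n++;
	}
	return n;
}

static void fake_set_cpu_possible(unsigned int cpu)
{
	assert(cpu < 64);
	fake_possible_mask |= (fake_cpumask_t)1 << cpu;
}

/* Accessor in the style of num_possible_cpus(): callers never see the mask. */
static unsigned int fake_num_possible_cpus(void)
{
	return fake_mask_weight(fake_possible_mask);
}
```

Going through the accessor keeps callers independent of how the mask is stored, which is what allows the old cpu_*_map names to be removed.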

^ permalink raw reply

* [PATCH V2 RESEND] fsl-sata: I/O load balancing
From: Qiang Liu @ 2012-02-15  5:49 UTC (permalink / raw)
  To: jgarzik, linux-ide; +Cc: Qiang Liu, linuxppc-dev, linux-kernel

Reduce the number of interrupts by programming the Interrupt Coalescing
Control (ICC) register.
Provide a dynamic way to adjust the interrupt count and timer tick
thresholds through sysfs. The best setting is a tradeoff that differs
between applications.

Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
---
Changes for V2:
	support dynamic configuration of the interrupt coalescing register via sysfs
	test random small files with iometer
	adjust the kernel source baseline so the patch applies upstream
Description:
  1. the fsl-sata interrupt is raised 130 thousand times when writing an 8G file
    (dd if=/dev/zero of=/dev/sda2 bs=128K count=65536);
  2. most of these interrupts are raised after only 1-4 commands have completed;
  3. only 30 thousand interrupts are raised after setting the maximum interrupt
    threshold, because more completions are coalesced, as described for ICC;
Test methods and results:
  1. test sequential large file performance,
  [root@p2020ds root]# echo 31 524287 > \
	/sys/devices/soc.0/ffe18000.sata/intr_coalescing
  [root@p2020ds root]# dd if=/dev/zero of=/dev/sda2 bs=128K count=65536 &
  [root@p2020ds root]# top

   CPU %  |  dd   |  flush-8:0 | softirq
   ---------------------------------------
   before | 20-22 |    17-19   |    7
   ---------------------------------------
   after  | 18-21 |    15-16   |    5
   ---------------------------------------
   2. test random small files with iometer,
   iometer parameters:
   4 I/Os burst length, 1MB transfer request size, 100% write, 2MB file size
     with the default configuration of the interrupt coalescing register
   (1 interrupt, no timeout), total write performance is 119MB per second,
     after configuring the maximum values, write performance is 110MB per second.

   Comparing the test results, a configurable interrupt coalescing setting
   is the better choice, because the best values depend on the workload.


 drivers/ata/sata_fsl.c |  111 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/drivers/ata/sata_fsl.c b/drivers/ata/sata_fsl.c
index 0120b0d..d6577b9 100644
--- a/drivers/ata/sata_fsl.c
+++ b/drivers/ata/sata_fsl.c
@@ -6,7 +6,7 @@
  * Author: Ashish Kalra <ashish.kalra@freescale.com>
  * Li Yang <leoli@freescale.com>
  *
- * Copyright (c) 2006-2007, 2011 Freescale Semiconductor, Inc.
+ * Copyright (c) 2006-2007, 2011-2012 Freescale Semiconductor, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -26,6 +26,15 @@
 #include <asm/io.h>
 #include <linux/of_platform.h>

+static unsigned int intr_coalescing_count;
+module_param(intr_coalescing_count, uint, S_IRUGO);
+MODULE_PARM_DESC(intr_coalescing_count,
+				 "INT coalescing count threshold (1..31)");
+
+static unsigned int intr_coalescing_ticks;
+module_param(intr_coalescing_ticks, uint, S_IRUGO);
+MODULE_PARM_DESC(intr_coalescing_ticks,
+				 "INT coalescing timer threshold in AHB ticks");
 /* Controller information */
 enum {
 	SATA_FSL_QUEUE_DEPTH	= 16,
@@ -83,6 +92,16 @@ enum {
 };

 /*
+ * Interrupt Coalescing Control Register bitdefs  */
+enum {
+	ICC_MIN_INT_COUNT_THRESHOLD	= 1,
+	ICC_MAX_INT_COUNT_THRESHOLD	= ((1 << 5) - 1),
+	ICC_MIN_INT_TICKS_THRESHOLD	= 0,
+	ICC_MAX_INT_TICKS_THRESHOLD	= ((1 << 19) - 1),
+	ICC_SAFE_INT_TICKS		= 1,
+};
+
+/*
 * Host Controller command register set - per port
 */
 enum {
@@ -263,8 +282,65 @@ struct sata_fsl_host_priv {
 	void __iomem *csr_base;
 	int irq;
 	int data_snoop;
+	struct device_attribute intr_coalescing;
 };

+static void fsl_sata_set_irq_coalescing(struct ata_host *host,
+		unsigned int count, unsigned int ticks)
+{
+	struct sata_fsl_host_priv *host_priv = host->private_data;
+	void __iomem *hcr_base = host_priv->hcr_base;
+
+	if (count > ICC_MAX_INT_COUNT_THRESHOLD)
+		count = ICC_MAX_INT_COUNT_THRESHOLD;
+	else if (count < ICC_MIN_INT_COUNT_THRESHOLD)
+		count = ICC_MIN_INT_COUNT_THRESHOLD;
+
+	if (ticks > ICC_MAX_INT_TICKS_THRESHOLD)
+		ticks = ICC_MAX_INT_TICKS_THRESHOLD;
+	else if ((ICC_MIN_INT_TICKS_THRESHOLD == ticks) &&
+			(count > ICC_MIN_INT_COUNT_THRESHOLD))
+		ticks = ICC_SAFE_INT_TICKS;
+
+	spin_lock(&host->lock);
+	iowrite32((count << 24 | ticks), hcr_base + ICC);
+
+	intr_coalescing_count = count;
+	intr_coalescing_ticks = ticks;
+	spin_unlock(&host->lock);
+
+	DPRINTK("interrupt coalescing, count = 0x%x, ticks = %x\n",
+			intr_coalescing_count, intr_coalescing_ticks);
+	DPRINTK("ICC register status: (hcr base: %p) = 0x%x\n",
+			hcr_base, ioread32(hcr_base + ICC));
+}
+
+static ssize_t fsl_sata_intr_coalescing_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u	%u\n",
+			intr_coalescing_count, intr_coalescing_ticks);
+}
+
+static ssize_t fsl_sata_intr_coalescing_store(struct device *dev,
+		struct device_attribute *attr,
+		const char *buf, size_t count)
+{
+	unsigned int coalescing_count,	coalescing_ticks;
+
+	if (sscanf(buf, "%u%u",
+				&coalescing_count,
+				&coalescing_ticks) != 2) {
+		printk(KERN_ERR "fsl-sata: wrong parameter format.\n");
+		return -EINVAL;
+	}
+
+	fsl_sata_set_irq_coalescing(dev_get_drvdata(dev),
+			coalescing_count, coalescing_ticks);
+
+	return strlen(buf);
+}
+
 static inline unsigned int sata_fsl_tag(unsigned int tag,
 					void __iomem *hcr_base)
 {
@@ -346,10 +422,10 @@ static unsigned int sata_fsl_fill_sg(struct ata_queued_cmd *qc, void *cmd_desc,
 			(unsigned long long)sg_addr, sg_len);

 		/* warn if each s/g element is not dword aligned */
-		if (sg_addr & 0x03)
+		if (unlikely(sg_addr & 0x03))
 			ata_port_err(qc->ap, "s/g addr unaligned : 0x%llx\n",
 				     (unsigned long long)sg_addr);
-		if (sg_len & 0x03)
+		if (unlikely(sg_len & 0x03))
 			ata_port_err(qc->ap, "s/g len unaligned : 0x%x\n",
 				     sg_len);

@@ -1245,6 +1321,13 @@ static int sata_fsl_init_controller(struct ata_host *host)
 	iowrite32(0x00000FFFF, hcr_base + CE);
 	iowrite32(0x00000FFFF, hcr_base + DE);

+	/*
+	 * reset the number of command complete bits which will cause the
+	 * interrupt to be signaled
+	 */
+	fsl_sata_set_irq_coalescing(host, intr_coalescing_count,
+			intr_coalescing_ticks);
+
 	/*
 	 * host controller will be brought on-line, during xx_port_start()
 	 * callback, that should also initiate the OOB, COMINIT sequence
@@ -1309,7 +1392,7 @@ static int sata_fsl_probe(struct platform_device *ofdev)
 	void __iomem *csr_base = NULL;
 	struct sata_fsl_host_priv *host_priv = NULL;
 	int irq;
-	struct ata_host *host;
+	struct ata_host *host = NULL;
 	u32 temp;

 	struct ata_port_info pi = sata_fsl_port_info[0];
@@ -1356,6 +1439,10 @@ static int sata_fsl_probe(struct platform_device *ofdev)

 	/* allocate host structure */
 	host = ata_host_alloc_pinfo(&ofdev->dev, ppi, SATA_FSL_MAX_PORTS);
+	if (!host) {
+		retval = -ENOMEM;
+		goto error_exit_with_cleanup;
+	}

 	/* host->iomap is not used currently */
 	host->private_data = host_priv;
@@ -1373,10 +1460,24 @@ static int sata_fsl_probe(struct platform_device *ofdev)

 	dev_set_drvdata(&ofdev->dev, host);

+	host_priv->intr_coalescing.show = fsl_sata_intr_coalescing_show;
+	host_priv->intr_coalescing.store = fsl_sata_intr_coalescing_store;
+	sysfs_attr_init(&host_priv->intr_coalescing.attr);
+	host_priv->intr_coalescing.attr.name = "intr_coalescing";
+	host_priv->intr_coalescing.attr.mode = S_IRUGO | S_IWUSR;
+	retval = device_create_file(host->dev, &host_priv->intr_coalescing);
+	if (retval)
+		goto error_exit_with_cleanup;
+
 	return 0;

 error_exit_with_cleanup:

+	if (host) {
+		dev_set_drvdata(&ofdev->dev, NULL);
+		ata_host_detach(host);
+	}
+
 	if (hcr_base)
 		iounmap(hcr_base);
 	if (host_priv)
@@ -1390,6 +1491,8 @@ static int sata_fsl_remove(struct platform_device *ofdev)
 	struct ata_host *host = dev_get_drvdata(&ofdev->dev);
 	struct sata_fsl_host_priv *host_priv = host->private_data;

+	device_remove_file(&ofdev->dev, &host_priv->intr_coalescing);
+
 	ata_host_detach(host);

 	dev_set_drvdata(&ofdev->dev, NULL);
--
1.6.4
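
The clamping and packing done by fsl_sata_set_irq_coalescing() can be checked in isolation. The sketch below reuses the ICC_* limits from the patch and the (count << 24 | ticks) layout it writes; icc_pack() is a hypothetical helper name, and the locking and the iowrite32() to hcr_base + ICC are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Limits copied from the patch's ICC bit definitions. */
enum {
	ICC_MIN_INT_COUNT_THRESHOLD	= 1,
	ICC_MAX_INT_COUNT_THRESHOLD	= ((1 << 5) - 1),
	ICC_MIN_INT_TICKS_THRESHOLD	= 0,
	ICC_MAX_INT_TICKS_THRESHOLD	= ((1 << 19) - 1),
	ICC_SAFE_INT_TICKS		= 1,
};

/* Clamp both thresholds and return the value that would be written to ICC. */
static uint32_t icc_pack(unsigned int count, unsigned int ticks)
{
	if (count > ICC_MAX_INT_COUNT_THRESHOLD)
		count = ICC_MAX_INT_COUNT_THRESHOLD;
	else if (count < ICC_MIN_INT_COUNT_THRESHOLD)
		count = ICC_MIN_INT_COUNT_THRESHOLD;

	if (ticks > ICC_MAX_INT_TICKS_THRESHOLD)
		ticks = ICC_MAX_INT_TICKS_THRESHOLD;
	else if (ticks == ICC_MIN_INT_TICKS_THRESHOLD &&
		 count > ICC_MIN_INT_COUNT_THRESHOLD)
		ticks = ICC_SAFE_INT_TICKS;	/* avoid "coalesce forever" */

	assert(count <= 31 && ticks <= 0x7ffff);
	return ((uint32_t)count << 24) | ticks;
}
```

With the values from the test description, "echo 31 524287" corresponds to icc_pack(31, 524287), i.e. 0x1f07ffff, while a count above 1 with zero ticks is bumped to one tick so that a partially filled batch still raises an interrupt.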

^ permalink raw reply related

* [PATCH 2/2 v5] powerpc/85xx: Abstract common define of signal multiplex control for qe
From: Zhicheng Fan @ 2012-02-15  6:58 UTC (permalink / raw)
  To: linuxppc-dev, galak; +Cc: Zhicheng Fan
In-Reply-To: <1329289091-26231-1-git-send-email-B32736@freescale.com>

From: Zhicheng Fan <b32736@freescale.com>

The mpc85xx_rdb and mpc85xx_mds share the same definitions of the signal
multiplex control bits for QE, so they belong in a common header; this
patch moves them to fsl_guts.h.

Signed-off-by: Zhicheng Fan <b32736@freescale.com>
---
 arch/powerpc/include/asm/fsl_guts.h       |   20 +++++++++++++++++++-
 arch/powerpc/platforms/85xx/mpc85xx_mds.c |   27 ++++++++++++---------------
 2 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/fsl_guts.h b/arch/powerpc/include/asm/fsl_guts.h
index bebd124..dcd5b70 100644
--- a/arch/powerpc/include/asm/fsl_guts.h
+++ b/arch/powerpc/include/asm/fsl_guts.h
@@ -4,7 +4,7 @@
  * Authors: Jeff Brown
  *          Timur Tabi <timur@freescale.com>
  *
- * Copyright 2004,2007 Freescale Semiconductor, Inc
+ * Copyright 2004,2007,2012 Freescale Semiconductor, Inc
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -114,6 +114,24 @@ struct ccsr_guts_86xx {
 	__be32	srds2cr1;	/* 0x.0f44 - SerDes2 Control Register 0 */
 } __attribute__ ((packed));
 
+#ifdef CONFIG_PPC_85xx
+
+/* Alternate function signal multiplex control */
+#define MPC85xx_PMUXCR_QE0              0x00008000
+#define MPC85xx_PMUXCR_QE2              0x00002000
+#define MPC85xx_PMUXCR_QE3              0x00001000
+#define MPC85xx_PMUXCR_QE4              0x00000800
+#define MPC85xx_PMUXCR_QE5              0x00000400
+#define MPC85xx_PMUXCR_QE6              0x00000200
+#define MPC85xx_PMUXCR_QE7              0x00000100
+#define MPC85xx_PMUXCR_QE8              0x00000080
+#define MPC85xx_PMUXCR_QE9              0x00000040
+#define MPC85xx_PMUXCR_QE10             0x00000020
+#define MPC85xx_PMUXCR_QE11             0x00000010
+#define MPC85xx_PMUXCR_QE12             0x00000008
+
+#endif
+
 #ifdef CONFIG_PPC_86xx
 
 #define CCSR_GUTS_DMACR_DEV_SSI	0	/* DMA controller/channel set to SSI */
diff --git a/arch/powerpc/platforms/85xx/mpc85xx_mds.c b/arch/powerpc/platforms/85xx/mpc85xx_mds.c
index 1d15a0c..d55f869 100644
--- a/arch/powerpc/platforms/85xx/mpc85xx_mds.c
+++ b/arch/powerpc/platforms/85xx/mpc85xx_mds.c
@@ -1,5 +1,6 @@
 /*
- * Copyright (C) Freescale Semicondutor, Inc. 2006-2010. All rights reserved.
+ * Copyright (C) 2006-2010, 2012 Freescale Semicondutor, Inc.
+ * All rights reserved.
  *
  * Author: Andy Fleming <afleming@freescale.com>
  *
@@ -51,6 +52,7 @@
 #include <asm/qe_ic.h>
 #include <asm/mpic.h>
 #include <asm/swiotlb.h>
+#include <asm/fsl_guts.h>
 #include "smp.h"
 
 #include "mpc85xx.h"
@@ -268,34 +270,29 @@ static void __init mpc85xx_mds_qe_init(void)
 	mpc85xx_mds_reset_ucc_phys();
 
 	if (machine_is(p1021_mds)) {
-#define MPC85xx_PMUXCR_OFFSET           0x60
-#define MPC85xx_PMUXCR_QE0              0x00008000
-#define MPC85xx_PMUXCR_QE3              0x00001000
-#define MPC85xx_PMUXCR_QE9              0x00000040
-#define MPC85xx_PMUXCR_QE12             0x00000008
-		static __be32 __iomem *pmuxcr;
+
+		struct ccsr_guts_85xx __iomem *guts;
 
 		np = of_find_node_by_name(NULL, "global-utilities");
 
 		if (np) {
-			pmuxcr = of_iomap(np, 0) + MPC85xx_PMUXCR_OFFSET;
+			guts = of_iomap(np, 0);
 
-			if (!pmuxcr)
-				printk(KERN_EMERG "Error: Alternate function"
-					" signal multiplex control register not"
-					" mapped!\n");
-			else
+			if (!guts)
+				pr_err("mpc85xx-mds: could not map global utilities register!\n");
+			else {
 			/* P1021 has pins muxed for QE and other functions. To
 			 * enable QE UEC mode, we need to set bit QE0 for UCC1
 			 * in Eth mode, QE0 and QE3 for UCC5 in Eth mode, QE9
 			 * and QE12 for QE MII management signals in PMUXCR
 			 * register.
 			 */
-				setbits32(pmuxcr, MPC85xx_PMUXCR_QE0 |
+				setbits32(&guts->pmuxcr, MPC85xx_PMUXCR_QE0 |
 						  MPC85xx_PMUXCR_QE3 |
 						  MPC85xx_PMUXCR_QE9 |
 						  MPC85xx_PMUXCR_QE12);
-
+				iounmap(guts);
+			}
 			of_node_put(np);
 		}
 
-- 
1.7.0.4
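
The PMUXCR update amounts to OR-ing four of the new bit definitions into the register. A minimal sketch on plain memory, assuming a hypothetical fake_setbits32() in place of the kernel's setbits32() (which operates on a big-endian MMIO register) and a hypothetical qe_uec_pmux_mask() helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as added to fsl_guts.h by the patch. */
#define MPC85xx_PMUXCR_QE0	0x00008000u
#define MPC85xx_PMUXCR_QE3	0x00001000u
#define MPC85xx_PMUXCR_QE9	0x00000040u
#define MPC85xx_PMUXCR_QE12	0x00000008u

/* Plain-memory stand-in for setbits32(): read, OR in the bits, write back. */
static void fake_setbits32(uint32_t *reg, uint32_t bits)
{
	assert(reg != NULL);
	*reg |= bits;
}

/* Bits needed for QE UEC mode and the QE MII management signals. */
static uint32_t qe_uec_pmux_mask(void)
{
	return MPC85xx_PMUXCR_QE0 | MPC85xx_PMUXCR_QE3 |
	       MPC85xx_PMUXCR_QE9 | MPC85xx_PMUXCR_QE12;
}
```

Because the update is a read-modify-write, pin functions already selected by firmware in other PMUXCR bits are preserved.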

^ permalink raw reply related

* [PATCH 1/2 v5] powerpc/85xx: Add Quicc Engine support for p1025rdb
From: Zhicheng Fan @ 2012-02-15  6:58 UTC (permalink / raw)
  To: linuxppc-dev, galak; +Cc: Zhicheng Fan

From: Zhicheng Fan <b32736@freescale.com>

Signed-off-by: Zhicheng Fan <b32736@freescale.com>
---
 arch/powerpc/platforms/85xx/mpc85xx_rdb.c |   78 ++++++++++++++++++++++++++++-
 1 files changed, 77 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
index e95aef7..b85180e 100644
--- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
+++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
@@ -26,6 +26,9 @@
 #include <asm/prom.h>
 #include <asm/udbg.h>
 #include <asm/mpic.h>
+#include <asm/qe.h>
+#include <asm/qe_ic.h>
+#include <asm/fsl_guts.h>
 
 #include <sysdev/fsl_soc.h>
 #include <sysdev/fsl_pci.h>
@@ -47,6 +50,10 @@ void __init mpc85xx_rdb_pic_init(void)
 	struct mpic *mpic;
 	unsigned long root = of_get_flat_dt_root();
 
+#ifdef CONFIG_QUICC_ENGINE
+	struct device_node *np;
+#endif
+
 	if (of_flat_dt_is_compatible(root, "fsl,MPC85XXRDB-CAMP")) {
 		mpic = mpic_alloc(NULL, 0,
 			MPIC_BIG_ENDIAN | MPIC_BROKEN_FRR_NIRQS |
@@ -62,6 +69,18 @@ void __init mpc85xx_rdb_pic_init(void)
 
 	BUG_ON(mpic == NULL);
 	mpic_init(mpic);
+
+#ifdef CONFIG_QUICC_ENGINE
+	np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic");
+	if (np) {
+		qe_ic_init(np, 0, qe_ic_cascade_low_mpic,
+				qe_ic_cascade_high_mpic);
+		of_node_put(np);
+
+	} else
+		pr_err("%s: Could not find qe-ic node\n", __func__);
+#endif
+
 }
 
 /*
@@ -69,7 +88,7 @@ void __init mpc85xx_rdb_pic_init(void)
  */
 static void __init mpc85xx_rdb_setup_arch(void)
 {
-#ifdef CONFIG_PCI
+#if defined(CONFIG_PCI) || defined(CONFIG_QUICC_ENGINE)
 	struct device_node *np;
 #endif
 
@@ -85,6 +104,63 @@ static void __init mpc85xx_rdb_setup_arch(void)
 #endif
 
 	mpc85xx_smp_init();
+
+#ifdef CONFIG_QUICC_ENGINE
+	np = of_find_compatible_node(NULL, NULL, "fsl,qe");
+	if (!np) {
+		pr_err("%s: Could not find Quicc Engine node\n", __func__);
+		goto qe_fail;
+	}
+
+	qe_reset();
+	of_node_put(np);
+
+	np = of_find_node_by_name(NULL, "par_io");
+	if (np) {
+		struct device_node *ucc;
+
+		par_io_init(np);
+		of_node_put(np);
+
+		for_each_node_by_name(ucc, "ucc")
+			par_io_of_config(ucc);
+
+	}
+#if defined(CONFIG_UCC_GETH) || defined(CONFIG_SERIAL_QE)
+	if (machine_is(p1025_rdb)) {
+
+		struct ccsr_guts_85xx __iomem *guts;
+
+		np = of_find_node_by_name(NULL, "global-utilities");
+		if (np) {
+
+			guts = of_iomap(np, 0);
+			if (!guts) {
+
+				pr_err("mpc85xx-rdb: could not map global utilities register!\n");
+
+			} else {
+			/* P1025 has pins muxed for QE and other functions. To
+			 * enable QE UEC mode, we need to set bit QE0 for UCC1
+			 * in Eth mode, QE0 and QE3 for UCC5 in Eth mode, QE9
+			 * and QE12 for QE MII management signals in PMUXCR
+			 * register.
+			 */
+				setbits32(&guts->pmuxcr, MPC85xx_PMUXCR_QE0 |
+						MPC85xx_PMUXCR_QE3 |
+						MPC85xx_PMUXCR_QE9 |
+						MPC85xx_PMUXCR_QE12);
+				iounmap(guts);
+			}
+			of_node_put(np);
+		}
+
+	}
+#endif
+
+qe_fail:
+#endif	/* CONFIG_QUICC_ENGINE */
+
 	printk(KERN_INFO "MPC85xx RDB board from Freescale Semiconductor\n");
 }
 
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH V2 RESEND] fsl-sata: I/O load balancing
From: Li Yang @ 2012-02-15  7:22 UTC (permalink / raw)
  To: Qiang Liu; +Cc: linux-ide, jgarzik, linuxppc-dev, linux-kernel
In-Reply-To: <1329284944-17943-1-git-send-email-qiang.liu@freescale.com>

On Wed, Feb 15, 2012 at 1:49 PM, Qiang Liu <qiang.liu@freescale.com> wrote:

Hi Liu Qiang,

The patch is fine except for the title and comment.  It's too vague to
say I/O load balancing.  You need explicit description like "add
interrupt coalescing support" as title.

> Reduce interrupt signals through reset Interrupt Coalescing Control Reg.
> Provide dynamic method to adjust interrupt signals and timer ticks by sysfs.
> It is a tradeoff for different applications.

How about:

Adds support for interrupt coalescing feature to reduce interrupt
events.  Provides mechanism of adjusting coalescing count and timeout
tick by sysfs on runtime, so that tradeoff of latency and CPU load can
be made depending on applications.

- Leo

^ permalink raw reply

* RE: [PATCH V2 RESEND] fsl-sata: I/O load balancing
From: Liu Qiang-B32616 @ 2012-02-15  7:29 UTC (permalink / raw)
  To: Li Yang-R58472
  Cc: linux-ide@vger.kernel.org, jgarzik@pobox.com,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
In-Reply-To: <CADRPPNQ8LUr13J4k8fsBYSSxXVuxFdEvKyL4abxZpY32Meh=uQ@mail.gmail.com>

> -----Original Message-----
> From: linux-ide-owner@vger.kernel.org [mailto:linux-ide-
> owner@vger.kernel.org] On Behalf Of Li Yang
> Sent: Wednesday, February 15, 2012 3:22 PM
> To: Liu Qiang-B32616
> Cc: jgarzik@pobox.com; linux-ide@vger.kernel.org; linuxppc-
> dev@lists.ozlabs.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 RESEND] fsl-sata: I/O load balancing
> 
> On Wed, Feb 15, 2012 at 1:49 PM, Qiang Liu <qiang.liu@freescale.com>
> wrote:
> 
> Hi Liu Qiang,
> 
> The patch is fine except for the title and comment.  It's too vague to
> say I/O load balancing.  You need explicit description like "add
> interrupt coalescing support" as title.
> 
> > Reduce interrupt signals through reset Interrupt Coalescing Control Reg.
> > Provide dynamic method to adjust interrupt signals and timer ticks by
> sysfs.
> > It is a tradeoff for different applications.
> 
> How about:
> 
> Adds support for interrupt coalescing feature to reduce interrupt events.
> Provides mechanism of adjusting coalescing count and timeout tick by
> sysfs on runtime, so that tradeoff of latency and CPU load can be made
> depending on applications.
> 
Ok, I will change the title and comments more specifically.

> - Leo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH V3] fsl-sata: add support for interrupt coalsecing feature
From: Qiang Liu @ 2012-02-15  7:40 UTC (permalink / raw)
  To: jgarzik, linux-ide; +Cc: Qiang Liu, linuxppc-dev, linux-kernel

Adds support for interrupt coalescing feature to reduce interrupt events.
Provides a mechanism of adjusting coalescing count and timeout tick by sysfs
at runtime, so that tradeoff of latency and CPU load can be made depending
on different applications.

Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
---
Changes in V3:
	change the title and comments according to the feedback
	support dynamic configuration of the interrupt coalescing register via sysfs
	test random small files with iometer
	adjust the kernel source baseline so the patch applies upstream
Description:
  1. the fsl-sata interrupt is raised 130 thousand times when writing an 8G file
    (dd if=/dev/zero of=/dev/sda2 bs=128K count=65536);
  2. most of these interrupts are raised after only 1-4 commands complete;
  3. only 30 thousand interrupts are raised after setting the maximum interrupt
    threshold, as more interrupts are coalesced as described for the ICC register;
Test methods and results:
  1. test sequential large file performance,
  [root@p2020ds root]# echo 31 524287 > \
	/sys/devices/soc.0/ffe18000.sata/intr_coalescing
  [root@p2020ds root]# dd if=/dev/zero of=/dev/sda2 bs=128K count=65536 &
  [root@p2020ds root]# top

   CPU %  |  dd   |  flush-8:0 | softirq
   ---------------------------------------
   before | 20-22 |    17-19   |    7
   ---------------------------------------
   after  | 18-21 |    15-16   |    5
   ---------------------------------------
   2. test random small files with iometer,
   iometer parameters:
   4 I/Os burst length, 1MB transfer request size, 100% write, 2MB file size
     with the default configuration of the interrupt coalescing register
   (1 interrupt, no timeout), total write performance is 119MB per second;
     after configuring the maximum values, write performance is 110MB per second.

   Comparing the test results, a configurable interrupt coalescing setting should
   cope better with varying workloads.
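
For readers following the register layout, the clamping and packing done by
fsl_sata_set_irq_coalescing() in the diff below can be sketched as a small
user-space C function. This is only an illustration under stated assumptions:
icc_pack() and icc_self_check() are hypothetical names that do not exist in
the driver, and the driver writes the value to hcr_base + ICC under the host
lock instead of returning it.

```c
#include <assert.h>
#include <stdint.h>

/* Limits copied from the enum this patch adds to sata_fsl.c. */
enum {
	ICC_MIN_INT_COUNT_THRESHOLD	= 1,
	ICC_MAX_INT_COUNT_THRESHOLD	= ((1 << 5) - 1),
	ICC_MIN_INT_TICKS_THRESHOLD	= 0,
	ICC_MAX_INT_TICKS_THRESHOLD	= ((1 << 19) - 1),
	ICC_SAFE_INT_TICKS		= 1,
};

/* Hypothetical helper: clamp both values, then pack (count << 24) | ticks. */
static uint32_t icc_pack(unsigned int count, unsigned int ticks)
{
	if (count > ICC_MAX_INT_COUNT_THRESHOLD)
		count = ICC_MAX_INT_COUNT_THRESHOLD;
	else if (count < ICC_MIN_INT_COUNT_THRESHOLD)
		count = ICC_MIN_INT_COUNT_THRESHOLD;

	if (ticks > ICC_MAX_INT_TICKS_THRESHOLD)
		ticks = ICC_MAX_INT_TICKS_THRESHOLD;
	else if (ticks == ICC_MIN_INT_TICKS_THRESHOLD &&
		 count > ICC_MIN_INT_COUNT_THRESHOLD)
		ticks = ICC_SAFE_INT_TICKS;	/* coalescing without a timeout is avoided */

	return ((uint32_t)count << 24) | ticks;
}

/* Hypothetical self-check covering the values used in the test above. */
static void icc_self_check(void)
{
	assert(icc_pack(31, 524287) == 0x1F07FFFFu);	/* "echo 31 524287" */
	assert(icc_pack(0, 0) == 0x01000000u);		/* count raised to 1 */
	assert(icc_pack(100, 0) == 0x1F000001u);	/* count capped, safe tick */
}
```

With the maximum settings from the sequential test (31 and 524287) the packed
value is 0x1F07FFFF; with both module parameters left at zero the count is
raised to 1 and ticks stay 0, which matches the "1 interrupt, no timeout"
default described above.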


 drivers/ata/sata_fsl.c |  111 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/drivers/ata/sata_fsl.c b/drivers/ata/sata_fsl.c
index 0120b0d..d6577b9 100644
--- a/drivers/ata/sata_fsl.c
+++ b/drivers/ata/sata_fsl.c
@@ -6,7 +6,7 @@
  * Author: Ashish Kalra <ashish.kalra@freescale.com>
  * Li Yang <leoli@freescale.com>
  *
- * Copyright (c) 2006-2007, 2011 Freescale Semiconductor, Inc.
+ * Copyright (c) 2006-2007, 2011-2012 Freescale Semiconductor, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -26,6 +26,15 @@
 #include <asm/io.h>
 #include <linux/of_platform.h>

+static unsigned int intr_coalescing_count;
+module_param(intr_coalescing_count, uint, S_IRUGO);
+MODULE_PARM_DESC(intr_coalescing_count,
+				 "INT coalescing count threshold (1..31)");
+
+static unsigned int intr_coalescing_ticks;
+module_param(intr_coalescing_ticks, uint, S_IRUGO);
+MODULE_PARM_DESC(intr_coalescing_ticks,
+				 "INT coalescing timer threshold in AHB ticks");
 /* Controller information */
 enum {
 	SATA_FSL_QUEUE_DEPTH	= 16,
@@ -83,6 +92,16 @@ enum {
 };

 /*
+ * Interrupt Coalescing Control Register bitdefs  */
+enum {
+	ICC_MIN_INT_COUNT_THRESHOLD	= 1,
+	ICC_MAX_INT_COUNT_THRESHOLD	= ((1 << 5) - 1),
+	ICC_MIN_INT_TICKS_THRESHOLD	= 0,
+	ICC_MAX_INT_TICKS_THRESHOLD	= ((1 << 19) - 1),
+	ICC_SAFE_INT_TICKS		= 1,
+};
+
+/*
 * Host Controller command register set - per port
 */
 enum {
@@ -263,8 +282,65 @@ struct sata_fsl_host_priv {
 	void __iomem *csr_base;
 	int irq;
 	int data_snoop;
+	struct device_attribute intr_coalescing;
 };

+static void fsl_sata_set_irq_coalescing(struct ata_host *host,
+		unsigned int count, unsigned int ticks)
+{
+	struct sata_fsl_host_priv *host_priv = host->private_data;
+	void __iomem *hcr_base = host_priv->hcr_base;
+
+	if (count > ICC_MAX_INT_COUNT_THRESHOLD)
+		count = ICC_MAX_INT_COUNT_THRESHOLD;
+	else if (count < ICC_MIN_INT_COUNT_THRESHOLD)
+		count = ICC_MIN_INT_COUNT_THRESHOLD;
+
+	if (ticks > ICC_MAX_INT_TICKS_THRESHOLD)
+		ticks = ICC_MAX_INT_TICKS_THRESHOLD;
+	else if ((ICC_MIN_INT_TICKS_THRESHOLD == ticks) &&
+			(count > ICC_MIN_INT_COUNT_THRESHOLD))
+		ticks = ICC_SAFE_INT_TICKS;
+
+	spin_lock(&host->lock);
+	iowrite32((count << 24 | ticks), hcr_base + ICC);
+
+	intr_coalescing_count = count;
+	intr_coalescing_ticks = ticks;
+	spin_unlock(&host->lock);
+
+	DPRINTK("interrupt coalescing, count = 0x%x, ticks = 0x%x\n",
+			intr_coalescing_count, intr_coalescing_ticks);
+	DPRINTK("ICC register status: (hcr base: %p) = 0x%x\n",
+			hcr_base, ioread32(hcr_base + ICC));
+}
+
+static ssize_t fsl_sata_intr_coalescing_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\t%u\n",
+			intr_coalescing_count, intr_coalescing_ticks);
+}
+
+static ssize_t fsl_sata_intr_coalescing_store(struct device *dev,
+		struct device_attribute *attr,
+		const char *buf, size_t count)
+{
+	unsigned int coalescing_count, coalescing_ticks;
+
+	if (sscanf(buf, "%u%u",
+				&coalescing_count,
+				&coalescing_ticks) != 2) {
+		printk(KERN_ERR "fsl-sata: wrong parameter format.\n");
+		return -EINVAL;
+	}
+
+	fsl_sata_set_irq_coalescing(dev_get_drvdata(dev),
+			coalescing_count, coalescing_ticks);
+
+	return count;
+}
+
 static inline unsigned int sata_fsl_tag(unsigned int tag,
 					void __iomem *hcr_base)
 {
@@ -346,10 +422,10 @@ static unsigned int sata_fsl_fill_sg(struct ata_queued_cmd *qc, void *cmd_desc,
 			(unsigned long long)sg_addr, sg_len);

 		/* warn if each s/g element is not dword aligned */
-		if (sg_addr & 0x03)
+		if (unlikely(sg_addr & 0x03))
 			ata_port_err(qc->ap, "s/g addr unaligned : 0x%llx\n",
 				     (unsigned long long)sg_addr);
-		if (sg_len & 0x03)
+		if (unlikely(sg_len & 0x03))
 			ata_port_err(qc->ap, "s/g len unaligned : 0x%x\n",
 				     sg_len);

@@ -1245,6 +1321,13 @@ static int sata_fsl_init_controller(struct ata_host *host)
 	iowrite32(0x00000FFFF, hcr_base + CE);
 	iowrite32(0x00000FFFF, hcr_base + DE);

+ 	/*
+	 * reset the number of command complete bits which will cause the
+	 * interrupt to be signaled
+	 */
+	fsl_sata_set_irq_coalescing(host, intr_coalescing_count,
+			intr_coalescing_ticks);
+
 	/*
 	 * host controller will be brought on-line, during xx_port_start()
 	 * callback, that should also initiate the OOB, COMINIT sequence
@@ -1309,7 +1392,7 @@ static int sata_fsl_probe(struct platform_device *ofdev)
 	void __iomem *csr_base = NULL;
 	struct sata_fsl_host_priv *host_priv = NULL;
 	int irq;
-	struct ata_host *host;
+	struct ata_host *host = NULL;
 	u32 temp;

 	struct ata_port_info pi = sata_fsl_port_info[0];
@@ -1356,6 +1439,10 @@ static int sata_fsl_probe(struct platform_device *ofdev)

 	/* allocate host structure */
 	host = ata_host_alloc_pinfo(&ofdev->dev, ppi, SATA_FSL_MAX_PORTS);
+	if (!host) {
+		retval = -ENOMEM;
+		goto error_exit_with_cleanup;
+	}

 	/* host->iomap is not used currently */
 	host->private_data = host_priv;
@@ -1373,10 +1460,24 @@ static int sata_fsl_probe(struct platform_device *ofdev)

 	dev_set_drvdata(&ofdev->dev, host);

+	host_priv->intr_coalescing.show = fsl_sata_intr_coalescing_show;
+	host_priv->intr_coalescing.store = fsl_sata_intr_coalescing_store;
+	sysfs_attr_init(&host_priv->intr_coalescing.attr);
+	host_priv->intr_coalescing.attr.name = "intr_coalescing";
+	host_priv->intr_coalescing.attr.mode = S_IRUGO | S_IWUSR;
+	retval = device_create_file(host->dev, &host_priv->intr_coalescing);
+	if (retval)
+		goto error_exit_with_cleanup;
+
 	return 0;

 error_exit_with_cleanup:

+	if (host) {
+		dev_set_drvdata(&ofdev->dev, NULL);
+		ata_host_detach(host);
+	}
+
 	if (hcr_base)
 		iounmap(hcr_base);
 	if (host_priv)
@@ -1390,6 +1491,8 @@ static int sata_fsl_remove(struct platform_device *ofdev)
 	struct ata_host *host = dev_get_drvdata(&ofdev->dev);
 	struct sata_fsl_host_priv *host_priv = host->private_data;

+	device_remove_file(&ofdev->dev, &host_priv->intr_coalescing);
+
 	ata_host_detach(host);

 	dev_set_drvdata(&ofdev->dev, NULL);
--
1.6.4

^ permalink raw reply related

* Re: [PATCH V3] fsl-sata: add support for interrupt coalsecing feature
From: Li Yang @ 2012-02-15  7:51 UTC (permalink / raw)
  To: Qiang Liu; +Cc: linux-ide, linuxppc-dev, jgarzik, linux-kernel
In-Reply-To: <1329291634-883-1-git-send-email-qiang.liu@freescale.com>

On Wed, Feb 15, 2012 at 3:40 PM, Qiang Liu <qiang.liu@freescale.com> wrote:
> Adds support for interrupt coalescing feature to reduce interrupt events.
> Provides a mechanism of adjusting coalescing count and timeout tick by sysfs
> at runtime, so that tradeoff of latency and CPU load can be made depending
> on different applications.
>
> Signed-off-by: Qiang Liu <qiang.liu@freescale.com>

Acked-by: Li Yang <leoli@freescale.com>

- Leo

^ permalink raw reply

* RE: [RFC] usb: Fix build error due to dma_mask is not at pdev_archdata at ARM
From: Mehresh Ramneek-B31383 @ 2012-02-15  8:47 UTC (permalink / raw)
  To: Chen Peter-B29397, stern@rowland.harvard.edu, agust@denx.de
  Cc: Estevam Fabio-R49496, linux-usb@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, kernel@pengutronix.de
In-Reply-To: <1329210695-14492-1-git-send-email-peter.chen@freescale.com>



-----Original Message-----
From: Chen Peter-B29397
Sent: Tuesday, February 14, 2012 2:42 PM
To: stern@rowland.harvard.edu; agust@denx.de
Cc: kernel@pengutronix.de; linuxppc-dev@lists.ozlabs.org; Mehresh Ramneek-B31383; Estevam Fabio-R49496; linux-usb@vger.kernel.org
Subject: [RFC] usb: Fix build error due to dma_mask is not at pdev_archdata at ARM

When building the i.MX platform with imx_v6_v7_defconfig after adding USB
Gadget support, the following build error occurs:

CC      drivers/usb/host/fsl-mph-dr-of.o
drivers/usb/host/fsl-mph-dr-of.c: In function 'fsl_usb2_device_register':
drivers/usb/host/fsl-mph-dr-of.c:97: error: 'struct pdev_archdata'
has no member named 'dma_mask'

It has been discussed at: http://www.spinics.net/lists/linux-usb/msg57302.html

For PowerPC, there is a dma_mask in struct pdev_archdata, but there is no
dma_mask in struct pdev_archdata for ARM. The pdev_archdata is
platform-specific; it should NOT be accessed by cross-platform drivers such
as USB.

The pdev_archdata assignment should be unnecessary: on PowerPC,
pdev->dev.dma_mask has already been set by arch_setup_pdev_archdata() in
arch/powerpc/kernel/setup-common.c.

Could anyone who has PowerPC hardware with USB host enabled and uses this
code help me test it? Thank you

[Ramneek]: Hi Peter, the code is working for the Host stack on PowerPC.

Signed-off-by: Peter Chen <peter.chen@freescale.com>
---
 drivers/usb/host/fsl-mph-dr-of.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/host/fsl-mph-dr-of.c b/drivers/usb/host/fsl-mph-dr-of.c
index 7916e56..ab333ac 100644
--- a/drivers/usb/host/fsl-mph-dr-of.c
+++ b/drivers/usb/host/fsl-mph-dr-of.c
@@ -94,7 +94,6 @@ struct platform_device * __devinit fsl_usb2_device_register(
 	pdev->dev.parent = &ofdev->dev;
 
 	pdev->dev.coherent_dma_mask = ofdev->dev.coherent_dma_mask;
-	pdev->dev.dma_mask = &pdev->archdata.dma_mask;
 	*pdev->dev.dma_mask = *ofdev->dev.dma_mask;
 
 	retval = platform_device_add_data(pdev, pdata, sizeof(*pdata));
--
1.7.0.4

^ permalink raw reply related

* Re: [PATCH 6/12] arch/powerpc: remove references to cpu_*_map.
From: Srivatsa S. Bhat @ 2012-02-15  9:21 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Venkatesh Pallipadi, linux-kernel, Paul Mackerras,
	akpm@linux-foundation.org, linuxppc-dev
In-Reply-To: <1329281884.20466.rusty@rustcorp.com.au>

On 02/15/2012 10:28 AM, Rusty Russell wrote:

> From: Rusty Russell <rusty@rustcorp.com.au>
> 
> This has been obsolescent for a while; time for the final push.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> ---


Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v3 24/25] irq_domain: remove "hint" when allocating irq numbers
From: Nicolas Ferre @ 2012-02-15 15:04 UTC (permalink / raw)
  To: Grant Likely, devicetree-discuss, Rob Herring
  Cc: Stephen Rothwell, linux-kernel, Milton Miller, Thomas Gleixner,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <4F316848.4060100@atmel.com>

On 02/07/2012 07:07 PM, Nicolas Ferre :
> On 01/27/2012 10:36 PM, Grant Likely :
>> The 'hint' used to try and line up irq numbers with hw irq numbers is
>> rather a hack and not very useful.  Now that /proc/interrupts also outputs
>> the hwirq number, it is even less useful to keep around the 'hint' heuristic.
>>
>> This patch removes it.
> 
> Grant,
> 
> While trying your patch series in conjunction with Rob one, I do not
> find this patch in your irqdomain/next branch (and a couple of others).
> Can you tell me if this v3 series is available as a git tree?

I am still interested in patches 24-25 of this series, but I still cannot
find them in your irqdomain/next branch.
Are they also expected to be part of the 3.4 merge window material?

Bye,
-- 
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH v3 25/25] irq_domain: mostly eliminate slow-path revmap lookups
From: Nicolas Ferre @ 2012-02-15 16:36 UTC (permalink / raw)
  To: Grant Likely, devicetree-discuss
  Cc: Stephen Rothwell, linux-kernel, Rob Herring, Milton Miller,
	Thomas Gleixner, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1327700179-17454-26-git-send-email-grant.likely@secretlab.ca>

Grant,

I do not know if it is the latest revision but I have identified some
issues on error/slow paths...


On 01/27/2012 10:36 PM, Grant Likely :
> With the current state of irq_domain, the reverse map is always used when
> new IRQs get mapped.  This means that the irq_find_mapping() function
> can be simplified to always call out to the revmap-specific lookup function.
> 
> This patch adds lookup functions for the revmaps that don't yet have one
> and removes the slow path lookup from most of the code paths.  The only
> place where the slow path legitimately remains is when the linear map
> is used with a hwirq number larger than the revmap size.
> 
> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Milton Miller <miltonm@bga.com>
> ---
>  arch/powerpc/sysdev/xics/xics-common.c |    3 -
>  include/linux/irqdomain.h              |    3 +-
>  kernel/irq/irqdomain.c                 |   94 +++++++++++++++++---------------
>  3 files changed, 51 insertions(+), 49 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c
> index ea5e204..1d7067d 100644
> --- a/arch/powerpc/sysdev/xics/xics-common.c
> +++ b/arch/powerpc/sysdev/xics/xics-common.c
> @@ -330,9 +330,6 @@ static int xics_host_map(struct irq_domain *h, unsigned int virq,
>  
>  	pr_devel("xics: map virq %d, hwirq 0x%lx\n", virq, hw);
>  
> -	/* Insert the interrupt mapping into the radix tree for fast lookup */
> -	irq_radix_revmap_insert(xics_host, virq, hw);
> -
>  	/* They aren't all level sensitive but we just don't really know */
>  	irq_set_status_flags(virq, IRQ_LEVEL);
>  
> diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
> index 0b00f83..38314f2 100644
> --- a/include/linux/irqdomain.h
> +++ b/include/linux/irqdomain.h
> @@ -93,6 +93,7 @@ struct irq_domain {
>  	struct list_head link;
>  
>  	/* type of reverse mapping_technique */
> +	unsigned int (*revmap)(struct irq_domain *host, irq_hw_number_t hwirq);
>  	unsigned int revmap_type;
>  	union {
>  		struct {
> @@ -155,8 +156,6 @@ extern void irq_dispose_mapping(unsigned int virq);
>  extern unsigned int irq_find_mapping(struct irq_domain *host,
>  				     irq_hw_number_t hwirq);
>  extern unsigned int irq_create_direct_mapping(struct irq_domain *host);
> -extern void irq_radix_revmap_insert(struct irq_domain *host, unsigned int virq,
> -				    irq_hw_number_t hwirq);
>  extern unsigned int irq_radix_revmap_lookup(struct irq_domain *host,
>  					    irq_hw_number_t hwirq);
>  extern unsigned int irq_linear_revmap(struct irq_domain *host,
> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> index 5b4fc4d..91c1cb7 100644
> --- a/kernel/irq/irqdomain.c
> +++ b/kernel/irq/irqdomain.c
> @@ -104,6 +104,7 @@ struct irq_domain *irq_domain_add_legacy(struct device_node *of_node,
>  	domain->revmap_data.legacy.first_irq = first_irq;
>  	domain->revmap_data.legacy.first_hwirq = first_hwirq;
>  	domain->revmap_data.legacy.size = size;
> +	domain->revmap = irq_domain_legacy_revmap;
>  
>  	mutex_lock(&irq_domain_mutex);
>  	/* Verify that all the irqs are available */
> @@ -174,18 +175,35 @@ struct irq_domain *irq_domain_add_linear(struct device_node *of_node,
>  	}
>  	domain->revmap_data.linear.size = size;
>  	domain->revmap_data.linear.revmap = revmap;
> +	domain->revmap = irq_linear_revmap;
>  	irq_domain_add(domain);
>  	return domain;
>  }
>  
> +static unsigned int irq_domain_nomap_revmap(struct irq_domain *domain,
> +					    irq_hw_number_t hwirq)
> +{
> +	struct irq_data *data = irq_get_irq_data(hwirq);
> +
> +	if (WARN_ON_ONCE(domain->revmap_type != IRQ_DOMAIN_MAP_NOMAP))
> +		return irq_find_mapping(domain, hwirq);

Should be:
		return irq_find_mapping_slow(domain, hwirq);

Recursion otherwise...
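
A toy user-space model may make the recursion easier to see. It is a sketch
under stated assumptions, not kernel code: struct toy_domain and the toy_*
functions are hypothetical stand-ins for struct irq_domain,
irq_find_mapping(), irq_find_mapping_slow() and irq_linear_revmap(). Once the
generic lookup always dispatches through domain->revmap, a revmap function
that falls back to the generic lookup re-enters itself; falling back to the
slow path terminates.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_IRQS	8
#define TOY_LINEAR_SIZE	4
#define TOY_UNMAPPED	(~0UL)

struct toy_domain {
	unsigned int (*revmap)(struct toy_domain *d, unsigned long hwirq);
	unsigned long hwirq_of[TOY_NR_IRQS];	/* indexed by virq; virq 0 means "none" */
	unsigned int linear[TOY_LINEAR_SIZE];	/* hwirq -> virq cache */
};

/* Slow path: linear scan that never dispatches through d->revmap. */
static unsigned int toy_find_mapping_slow(struct toy_domain *d,
					  unsigned long hwirq)
{
	unsigned int virq;

	for (virq = 1; virq < TOY_NR_IRQS; virq++)
		if (d->hwirq_of[virq] == hwirq)
			return virq;
	return 0;
}

/* Generic entry point: always dispatches through d->revmap. */
static unsigned int toy_find_mapping(struct toy_domain *d, unsigned long hwirq)
{
	return d ? d->revmap(d, hwirq) : 0;
}

static unsigned int toy_linear_revmap(struct toy_domain *d, unsigned long hwirq)
{
	/*
	 * Calling toy_find_mapping() here would call d->revmap, which is
	 * this function again, and never return. Use the slow path instead.
	 */
	if (hwirq >= TOY_LINEAR_SIZE)
		return toy_find_mapping_slow(d, hwirq);

	/* Fill the cache from the slow path on a miss. */
	if (!d->linear[hwirq])
		d->linear[hwirq] = toy_find_mapping_slow(d, hwirq);
	return d->linear[hwirq];
}

static void toy_domain_init(struct toy_domain *d)
{
	size_t i;

	for (i = 0; i < TOY_NR_IRQS; i++)
		d->hwirq_of[i] = TOY_UNMAPPED;
	for (i = 0; i < TOY_LINEAR_SIZE; i++)
		d->linear[i] = 0;
	d->revmap = toy_linear_revmap;
}
```

In this model a lookup of an out-of-range hwirq goes toy_find_mapping() ->
toy_linear_revmap() -> toy_find_mapping_slow() and stops there.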


> +
> +	/* Verify that the map has actually been established */
> +	if (data && (data->domain == domain) && (data->hwirq == hwirq))
> +		return hwirq;
> +	return 0;
> +}
> +
>  struct irq_domain *irq_domain_add_nomap(struct device_node *of_node,
>  					 const struct irq_domain_ops *ops,
>  					 void *host_data)
>  {
>  	struct irq_domain *domain = irq_domain_alloc(of_node,
>  					IRQ_DOMAIN_MAP_NOMAP, ops, host_data);
> -	if (domain)
> +	if (domain) {
> +		domain->revmap = irq_domain_nomap_revmap;
>  		irq_domain_add(domain);
> +	}
>  	return domain;
>  }
>  
> @@ -205,6 +223,7 @@ struct irq_domain *irq_domain_add_tree(struct device_node *of_node,
>  					IRQ_DOMAIN_MAP_TREE, ops, host_data);
>  	if (domain) {
>  		INIT_RADIX_TREE(&domain->revmap_data.tree, GFP_KERNEL);
> +		domain->revmap = irq_radix_revmap_lookup;
>  		irq_domain_add(domain);
>  	}
>  	return domain;
> @@ -378,6 +397,19 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>  		return 0;
>  	}
>  
> +	switch(domain->revmap_type) {
> +	case IRQ_DOMAIN_MAP_LINEAR:
> +		if (hwirq < domain->revmap_data.linear.size)
> +			domain->revmap_data.linear.revmap[hwirq] = irq;
> +		break;
> +	case IRQ_DOMAIN_MAP_TREE:
> +		mutex_lock(&revmap_trees_mutex);
> +		radix_tree_insert(&domain->revmap_data.tree, hwirq,
> +				  irq_get_irq_data(irq));
> +		mutex_unlock(&revmap_trees_mutex);
> +
> +		break;
> +	}
>  	pr_debug("irq: irq %lu on domain %s mapped to virtual irq %u\n",
>  		hwirq, domain->of_node ? domain->of_node->full_name : "null", virq);
>  
> @@ -478,25 +510,27 @@ EXPORT_SYMBOL_GPL(irq_dispose_mapping);
>   * irq_find_mapping() - Find a linux irq from an hw irq number.
>   * @domain: domain owning this hardware interrupt
>   * @hwirq: hardware irq number in that domain space
> - *
> - * This is a slow path, for use by generic code. It's expected that an
> - * irq controller implementation directly calls the appropriate low level
> - * mapping function.
>   */
>  unsigned int irq_find_mapping(struct irq_domain *domain,
>  			      irq_hw_number_t hwirq)
>  {
> -	unsigned int i;
> -
> -	/* Look for default domain if nececssary */
> -	if (domain == NULL)
> +	if (!domain)
>  		domain = irq_default_domain;
> -	if (domain == NULL)
> -		return 0;
> +	return domain ? domain->revmap(domain, hwirq) : 0;
> +}
> +EXPORT_SYMBOL_GPL(irq_find_mapping);
>  
> -	/* legacy -> bail early */
> -	if (domain->revmap_type == IRQ_DOMAIN_MAP_LEGACY)
> -		return irq_domain_legacy_revmap(domain, hwirq);
> +/**
> + * irq_find_mapping_slow() - slow path for finding the irq mapped to a hwirq
> + *
> + * This is the failsafe slow path for finding an irq mapping.  The only time
> + * this will reasonably get called is when the linear map is used with a
> + * hwirq number larger than the size of the reverse map.
> + */
> +static unsigned int irq_find_mapping_slow(struct irq_domain *domain,
> +					  irq_hw_number_t hwirq)
> +{
> +	int i;
>  
>  	/* Slow path does a linear search of the map */
>  	for (i = 0; i < irq_virq_count; i++)  {
> @@ -506,7 +540,6 @@ unsigned int irq_find_mapping(struct irq_domain *domain,
>  	}
>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(irq_find_mapping);
>  
>  /**
>   * irq_radix_revmap_lookup() - Find a linux irq from a hw irq number.
> @@ -537,31 +570,7 @@ unsigned int irq_radix_revmap_lookup(struct irq_domain *domain,
>  	 * Else fallback to linear lookup - this should not happen in practice
>  	 * as it means that we failed to insert the node in the radix tree.
>  	 */
> -	return irq_data ? irq_data->irq : irq_find_mapping(domain, hwirq);
> -}
> -
> -/**
> - * irq_radix_revmap_insert() - Insert a hw irq to linux irq number mapping.
> - * @domain: domain owning this hardware interrupt
> - * @virq: linux irq number
> - * @hwirq: hardware irq number in that domain space
> - *
> - * This is for use by irq controllers that use a radix tree reverse
> - * mapping for fast lookup.
> - */
> -void irq_radix_revmap_insert(struct irq_domain *domain, unsigned int virq,
> -			     irq_hw_number_t hwirq)
> -{
> -	struct irq_data *irq_data = irq_get_irq_data(virq);
> -
> -	if (WARN_ON(domain->revmap_type != IRQ_DOMAIN_MAP_TREE))
> -		return;
> -
> -	if (virq) {
> -		mutex_lock(&revmap_trees_mutex);
> -		radix_tree_insert(&domain->revmap_data.tree, hwirq, irq_data);
> -		mutex_unlock(&revmap_trees_mutex);
> -	}
> +	return irq_data ? irq_data->irq : irq_find_mapping_slow(domain, hwirq);
>  }
>  
>  /**
> @@ -585,14 +594,11 @@ unsigned int irq_linear_revmap(struct irq_domain *domain,
>  	if (unlikely(hwirq >= domain->revmap_data.linear.size))
>  		return irq_find_mapping(domain, hwirq);

Ditto here, and the same with the previous one in this function. Check
irq_radix_revmap_lookup() too: it has the same issue on the "WARN_ON_ONCE"
path...

>  
> -	/* Check if revmap was allocated */
>  	revmap = domain->revmap_data.linear.revmap;
> -	if (unlikely(revmap == NULL))
> -		return irq_find_mapping(domain, hwirq);
>  
>  	/* Fill up revmap with slow path if no mapping found */
>  	if (unlikely(!revmap[hwirq]))
> -		revmap[hwirq] = irq_find_mapping(domain, hwirq);
> +		revmap[hwirq] = irq_find_mapping_slow(domain, hwirq);
>  
>  	return revmap[hwirq];
>  }

Bye,
-- 
Nicolas Ferre

^ permalink raw reply

