LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
From: Darren Hart @ 2010-09-01 15:10 UTC (permalink / raw)
  To: michael
  Cc: Stephen Rothwell, Gautham R Shenoy, Josh Triplett, Steven Rostedt,
	linuxppc-dev, Will Schmidt, Paul Mackerras, Ankita Garg,
	Thomas Gleixner
In-Reply-To: <1283320481.32679.32.camel@concordia>

On 08/31/2010 10:54 PM, Michael Ellerman wrote:
> On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
> ..
>>
>> When running with the function plugin I had to stop the trace
>> immediately before entering start_secondary after an online or my traces
>> would not include the pseries_mach_cpu_die function, nor the tracing I
>> added there (possibly buffer size, I am using 2048). The following trace
>> was collected using "trace-cmd record -p function -e irq -e sched" and
>> has been filtered to only show CPU [001] (the CPU undergoing the
>> offline/online test, and the one seeing preempt_count (pcnt) go to
>> ffffffff after cede. The function tracer does not indicate anything
>> running on the CPU other than the HCALL - unless the __trace_hcall*
>> commands might be to blame. 
> 
> It's not impossible. Though normally they're disabled right, so the only
> reason they're running is because you're tracing. So if they are causing
> the bug then that doesn't explain why you see it normally.
> 
> Still, might be worth disabling just the hcall tracepoints just to be
> 100% sure.

A couple of updates. I was working from tip/rt/head, which has been
stale for some months now. I switched to tip/rt/2.6.33 and the
preempt_count() change over cede went away. I now see the live hang that
Will has reported.

Before I dive into the live hang, I want to understand what fixed the
preempt_count() change. That might start pointing us in the right
direction for the live hang.

I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to
try and locate the fix. I've narrowed it down to the 2.6.33.6 merge:

# git show 7e1af1172bbd4109d09ac515c5d376f633da7cff
commit 7e1af1172bbd4109d09ac515c5d376f633da7cff
Merge: d8e94db 9666790
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Jul 13 16:01:16 2010 +0200

    Merge
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y

    Conflicts:
        Makefile

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


Visual inspection yields two patches of interest:

f8b67691828321f5c85bb853283aa101ae673130
powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

I'm going to try reverting these today and see if they addressed the
issue indirectly.


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply

* Re: [PATCH] gpiolib: Add 'struct gpio_chip' forward declaration for !GPIOLIB case
From: Grant Likely @ 2010-09-01 15:26 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: David Brownell, Andrew Morton, Benjamin Herrenschmidt,
	linuxppc-dev, linux-kernel
In-Reply-To: <20100831170756.GA24887@oksana.dev.rtsoft.ru>

On Tue, Aug 31, 2010 at 09:07:56PM +0400, Anton Vorontsov wrote:
> On Tue, Aug 31, 2010 at 10:44:14AM -0600, Grant Likely wrote:
> > On Tue, Aug 31, 2010 at 2:37 AM, Anton Vorontsov <cbouatmailru@gmail.com> wrote:
> > > On Tue, Aug 31, 2010 at 02:03:44AM -0600, Grant Likely wrote:
> > >> On Tue, Aug 24, 2010 at 01:26:23PM +0400, Anton Vorontsov wrote:
> > >> > With CONFIG_GPIOLIB=n, the 'struct gpio_chip' is not declared,
> > >> > so the following pops up on PowerPC:
> > >> >
> > >> >   cc1: warnings being treated as errors
> > >> >   In file included from arch/powerpc/platforms/52xx/mpc52xx_common.c:19:
> > >> >   include/linux/of_gpio.h:74: warning: 'struct gpio_chip' declared
> > >> >                               inside parameter list
> > >> >   include/linux/of_gpio.h:74: warning: its scope is only this definition
> > >> >                               or declaration, which is probably not what
> > >> >                           you want
> > >> >   include/linux/of_gpio.h:75: warning: 'struct gpio_chip' declared
> > >> >                               inside parameter list
> > >> >   make[2]: *** [arch/powerpc/platforms/52xx/mpc52xx_common.o] Error 1
> > >> >
> > >> > This patch fixes the issue by providing the proper forward declaration.
> > >> >
> > >> > Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
> > >>
> > >> This doesn't actually solve the problem, and gpiochip should
> > >> remain undefined when CONFIG_GPIOLIB=n to catch exactly these
> > >> build failures.  The real problem is that I merged a change
> > >> into the mpc5200 code that required CONFIG_GPIOLIB be enabled
> > >> without reflecting that requirement in Kconfig.
> > >
> > > No, look closer. The error is in of_gpio.h, and it's perfectly
> > > fine to include it w/o GPIOLIB=y.
> > 
> > Looking even closer, we're both wrong.  You're right I didn't look
> > carefully enough, and the error is in of_gpio.h, not the .c file.
> > 
> > However, it is not okay to get the definitions from of_gpio.h when
> > CONFIG_GPIOLIB=n.  If GPIOLIB, or more specifically OF_GPIO isn't set,
> > then the of_gpio.h definitions should either not be defined, or should
> > be defined as empty stubs (where appropriate).
> 
> Grrr. Grant, look again, even closer than you did.

Gah.  Very bad day yesterday.  Applied and sorry for the noise.

g.

^ permalink raw reply

* Re: ERR_PTR pattern in phylib
From: David Miller @ 2010-09-01 15:27 UTC (permalink / raw)
  To: grant.likely; +Cc: netdev, afleming, linuxppc-dev
In-Reply-To: <AANLkTimjVZTNeLdm4qptU1MqREODLASdGGanUyxqxfub@mail.gmail.com>

From: Grant Likely <grant.likely@secretlab.ca>
Date: Wed, 1 Sep 2010 08:42:49 -0600

> It seems to me that phylib is one of the cases where the users (the
> network drivers) don't actually care about the specific error code
> when calling phylib functions.  The drivers only seem to care whether
> or not the function failed, and if it did then bail out.  I've also
> noticed that using the "if (!ptr)" test on phylib return values is a
> common error for driver writers.
> 
> In the interest of making driver code easier to write and review,
> would you be opposed to a set of patches to remove the ERR_PTR()
> pattern from phylib and its users?

I'm opposed to it because it means that if code actually does
care about the error code it will no longer be able to obtain
it.

Please don't make this change, thanks.

^ permalink raw reply

* Re: [PATCH] powerpc/512x: fix clk_get() return value
From: Grant Likely @ 2010-09-01 15:32 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linuxppc-dev
In-Reply-To: <1283244231-23659-1-git-send-email-akinobu.mita@gmail.com>

On Tue, Aug 31, 2010 at 05:43:51PM +0900, Akinobu Mita wrote:
> clk_get() should return an ERR_PTR value on error, not NULL.
> 
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Grant Likely <grant.likely@secretlab.ca>
> Cc: linuxppc-dev@lists.ozlabs.org

applied, thanks.

g.

> ---
>  arch/powerpc/platforms/512x/clock.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/512x/clock.c b/arch/powerpc/platforms/512x/clock.c
> index 5b243bd..3dc2a8d 100644
> --- a/arch/powerpc/platforms/512x/clock.c
> +++ b/arch/powerpc/platforms/512x/clock.c
> @@ -57,7 +57,7 @@ static struct clk *mpc5121_clk_get(struct device *dev, const char *id)
>  	int id_match = 0;
>  
>  	if (dev == NULL || id == NULL)
> -		return NULL;
> +		return clk;
>  
>  	mutex_lock(&clocks_mutex);
>  	list_for_each_entry(p, &clocks, node) {
> -- 
> 1.7.1.231.gd0b16
> 

^ permalink raw reply

* Re: ERR_PTR pattern in phylib
From: Fleming Andy-AFLEMING @ 2010-09-01 15:50 UTC (permalink / raw)
  To: Grant Likely; +Cc: netdev, linuxppc-dev
In-Reply-To: <AANLkTimjVZTNeLdm4qptU1MqREODLASdGGanUyxqxfub@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 681 bytes --]

On Sep 1, 2010, at 9:43, Grant Likely <grant.likely@secretlab.ca> wrote:

> 
> In the interest of making driver code easier to write and review,
> would you be opposed to a set of patches to remove the ERR_PTR()
> pattern from phylib and its users?

Not at all.  I was trying to make sure all the information was available, and I'm not really a big fan of passing in a **ptr to get set.  But I don't really have a strong opinion.  This is just the design i settled on.  Heck, 99.9% of the errors are unlikely to ever happen, a they're caused by hardware malfunction!

Andy

> 
> g.
> 
> -- 
> Grant Likely, B.Sc., P.Eng.
> Secret Lab Technologies Ltd.
> 

[-- Attachment #2: Type: text/html, Size: 1236 bytes --]

^ permalink raw reply

* question about arch/powerpc/platforms/cell/celleb_scc_uhc.c
From: Julia Lawall @ 2010-09-01 15:51 UTC (permalink / raw)
  To: Arnd Bergmann, Benjamin Herrenschmidt
  Cc: cbe-oss-dev, linuxppc-dev, Paul Mackerras, joe

The file arch/powerpc/platforms/cell/celleb_scc_uhc.c contains the 
following function:

static inline int uhc_clkctrl_ready(u32 val)
{
        const u32 mask = SCC_UHC_USBCEN | SCC_UHC_USBCEN;
        return((val & mask) == mask);
}

The variable mask is a bit or of two identical constants.  Later in the 
same file in the function enable_scc_uhc, I see the code:

val |= (SCC_UHC_USBCEN | SCC_UHC_USBEN);

Should the code in uhc_clkctrl_ready also be SCC_UHC_USBCEN | SCC_UHC_USBEN?
Or something else?

thanks,
julia

^ permalink raw reply

* Re: question about arch/powerpc/platforms/cell/celleb_scc_uhc.c
From: Joe Perches @ 2010-09-01 16:02 UTC (permalink / raw)
  To: Julia Lawall; +Cc: cbe-oss-dev, Paul Mackerras, linuxppc-dev, Arnd Bergmann
In-Reply-To: <Pine.LNX.4.64.1009011748050.32566@ask.diku.dk>

On Wed, 2010-09-01 at 17:51 +0200, Julia Lawall wrote:
> The file arch/powerpc/platforms/cell/celleb_scc_uhc.c contains the 
> following function:
> static inline int uhc_clkctrl_ready(u32 val)
> {
>         const u32 mask = SCC_UHC_USBCEN | SCC_UHC_USBCEN;
>         return((val & mask) == mask);
> }
> The variable mask is a bit or of two identical constants.  Later in the 
> same file in the function enable_scc_uhc, I see the code:
> val |= (SCC_UHC_USBCEN | SCC_UHC_USBEN);
> Should the code in uhc_clkctrl_ready also be SCC_UHC_USBCEN | SCC_UHC_USBEN?
> Or something else?

Thanks Julia.

This was also noticed a couple of years ago and not applied
http://lkml.indiana.edu/hypermail/linux/kernel/0808.2/1428.html

One day...

^ permalink raw reply

* Re: question about arch/powerpc/platforms/cell/celleb_scc_uhc.c
From: Julia Lawall @ 2010-09-01 16:05 UTC (permalink / raw)
  To: Joe Perches; +Cc: cbe-oss-dev, Paul Mackerras, linuxppc-dev, Arnd Bergmann
In-Reply-To: <1283356950.1797.159.camel@Joe-Laptop>

On Wed, 1 Sep 2010, Joe Perches wrote:

> On Wed, 2010-09-01 at 17:51 +0200, Julia Lawall wrote:
> > The file arch/powerpc/platforms/cell/celleb_scc_uhc.c contains the 
> > following function:
> > static inline int uhc_clkctrl_ready(u32 val)
> > {
> >         const u32 mask = SCC_UHC_USBCEN | SCC_UHC_USBCEN;
> >         return((val & mask) == mask);
> > }
> > The variable mask is a bit or of two identical constants.  Later in the 
> > same file in the function enable_scc_uhc, I see the code:
> > val |= (SCC_UHC_USBCEN | SCC_UHC_USBEN);
> > Should the code in uhc_clkctrl_ready also be SCC_UHC_USBCEN | SCC_UHC_USBEN?
> > Or something else?
> 
> Thanks Julia.
> 
> This was also noticed a couple of years ago and not applied
> http://lkml.indiana.edu/hypermail/linux/kernel/0808.2/1428.html
> 
> One day...

OK, thanks, I will send a patch shortly.

julia

^ permalink raw reply

* Re: Request review of device tree documentation
From: Grant Likely @ 2010-09-01 16:19 UTC (permalink / raw)
  To: devicetree-discuss, Jeremy Kerr, John Rigby, linuxppc-dev,
	microblaze-uclinux, Olof Johansson
In-Reply-To: <20100805044325.GC25458@yookeroo>

On Thu, Aug 05, 2010 at 02:43:25PM +1000, David Gibson wrote:
> On Fri, Jun 11, 2010 at 04:59:46PM -0600, Grant Likely wrote:
> > I've been doing a bit of work on some introductory level documentation
> > of the flattened device tree.  I've got a rough copy up on the
> > devicetree.org wiki, and I could use some feedback.  If anyone has
> > some time to look at it, you can find it here:
> > 
> > http://devicetree.org/Device_Tree_Usage
> 
> Sorry I haven't replied sooner, I've been away, then sick and
> generally preoccupied.  Still here are some comments now.

Thanks David.  Reworked as per comments.  You can see the diff here:

http://www.devicetree.org/mediawiki/index.php?title=Device_Tree_Usage&diff=228&oldid=227

g.

> 
> How Addressing Works:
> 
>  * Small inconsistency you use "address1", "address2" then "unit-address3".
> 
>  * Perhaps re-emphasise that a parent's #*-cells properties govern the
>    children's reg properties, not its own, since this is a common
>    misunderstanding..
> 
> Non Memory Mapped Devices:
> 
>  * Your phrasing here suggests that non-memory-maped == zero
>    size-cells, which is not always true.
> 
> Ranges (Address Translation):
> 
>  * Third paragraph, first sentence is a grammatical dogs' breakfast,
> 
> How Interrupts Work:
> 
>  * Bogus paragraph break partway through first sentence.
> 
>  * At the end you say the second cell indicates the interrupt's
>    polarity, but you don't specify how this is encoded.  It might be
>    worth emphasising that while most interrupt specifiers do include
>    trigger and polarity type information, the encoding of it can and
>    does vary between interrupt controllers.
> 
> Advanced Sample Machine:
> 
>  * The unit address in the name shouldn't have a "0x" prefix
> 
> Advanced Interrupt Mapping:
> 
>  * Perhaps worth noting that while a PCI *card* will use INTA..INTD,
>    on-board PCI devices can, and frequently do, have interrupts wired
>    side-band to the PCI bus, directly to the main interrupt
>    controller.
> 
>  * In your example, you're muddying the waters of your previous usage
>    of interrupt-parent.  The PCI child nodes have the PCI top-level
>    node as their implicit interrupt parent, because its their first
>    ancestor with an interrupt-map, and we hit that before the
>    interrupt-parent property specified at the very top level.  This
>    means amongst other things that if there are PCI devices with
>    seperately wired interrupts, they must explicitly set
>    interrupt-parent to bypass the normal PCI interrupt mapping.
> 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [PATCH] powerpc: mtmsrd not defined
From: Sean MacLennan @ 2010-09-01 16:40 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
In-Reply-To: <20100901080214.GA4722@brick.ozlabs.ibm.com>

On Wed, 1 Sep 2010 18:02:14 +1000
Paul Mackerras <paulus@samba.org> wrote:

> Ah, those references would be from arch/powerpc/lib/sstep.c.
> Evidently we need #ifdef CONFIG_PPC_FPU around the emulation of the
> floating-point loads and stores.

I tried that yesterday and it didn't help, although maybe I forgot to
mask out a few. I will check again.

Cheers,
   Sean

^ permalink raw reply

* Re: [PATCH] powerpc: mtmsrd not defined
From: Sean MacLennan @ 2010-09-01 17:21 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <04187CA6-D0DE-45B3-819F-DE26BFD9FF16@kernel.crashing.org>

Ok, I didn't catch all the cases the first time :( Since the errors
seemed to come from copy_32.S, I didn't spend much time on sstep.c.

Here is an updated patch the fixes the mtmsrd problem and maps out the
floating point.

Cheers,
   Sean

Replace the BOOK3S_64 specific mtmsrd with the generic MTMSRD macro.
Only enable ldstfp when CONFIG_PPC_FPU is set.

Signed-off-by: Sean MacLennan <smaclennan@pikatech.com>
---
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 74a7f41..55f19f9 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -62,7 +62,7 @@
 
 	.text
 	.stabs	"arch/powerpc/lib/",N_SO,0,0,0f
-	.stabs	"copy32.S",N_SO,0,0,0f
+	.stabs	"copy_32.S",N_SO,0,0,0f
 0:
 
 CACHELINE_BYTES = L1_CACHE_BYTES
diff --git a/arch/powerpc/lib/ldstfp.S b/arch/powerpc/lib/ldstfp.S
index f644863..6a85380 100644
--- a/arch/powerpc/lib/ldstfp.S
+++ b/arch/powerpc/lib/ldstfp.S
@@ -17,6 +17,8 @@
 #include <asm/asm-offsets.h>
 #include <linux/errno.h>
 
+#ifdef CONFIG_PPC_FPU
+
 #define STKFRM	(PPC_MIN_STKFRM + 16)
 
 	.macro	extab	instr,handler
@@ -81,7 +83,7 @@ _GLOBAL(do_lfs)
 	mfmsr	r6
 	ori	r7,r6,MSR_FP
 	cmpwi	cr7,r3,0
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	stfd	fr0,STKFRM-16(r1)
@@ -93,7 +95,7 @@ _GLOBAL(do_lfs)
 	lfd	fr0,STKFRM-16(r1)
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -108,7 +110,7 @@ _GLOBAL(do_lfd)
 	mfmsr	r6
 	ori	r7,r6,MSR_FP
 	cmpwi	cr7,r3,0
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	stfd	fr0,STKFRM-16(r1)
@@ -120,7 +122,7 @@ _GLOBAL(do_lfd)
 	lfd	fr0,STKFRM-16(r1)
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -135,7 +137,7 @@ _GLOBAL(do_stfs)
 	mfmsr	r6
 	ori	r7,r6,MSR_FP
 	cmpwi	cr7,r3,0
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	stfd	fr0,STKFRM-16(r1)
@@ -147,7 +149,7 @@ _GLOBAL(do_stfs)
 	lfd	fr0,STKFRM-16(r1)
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -162,7 +164,7 @@ _GLOBAL(do_stfd)
 	mfmsr	r6
 	ori	r7,r6,MSR_FP
 	cmpwi	cr7,r3,0
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	stfd	fr0,STKFRM-16(r1)
@@ -174,7 +176,7 @@ _GLOBAL(do_stfd)
 	lfd	fr0,STKFRM-16(r1)
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -229,7 +231,7 @@ _GLOBAL(do_lvx)
 	oris	r7,r6,MSR_VEC@h
 	cmpwi	cr7,r3,0
 	li	r8,STKFRM-16
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	stvx	vr0,r1,r8
@@ -241,7 +243,7 @@ _GLOBAL(do_lvx)
 	lvx	vr0,r1,r8
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -257,7 +259,7 @@ _GLOBAL(do_stvx)
 	oris	r7,r6,MSR_VEC@h
 	cmpwi	cr7,r3,0
 	li	r8,STKFRM-16
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	stvx	vr0,r1,r8
@@ -269,7 +271,7 @@ _GLOBAL(do_stvx)
 	lvx	vr0,r1,r8
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -325,7 +327,7 @@ _GLOBAL(do_lxvd2x)
 	oris	r7,r6,MSR_VSX@h
 	cmpwi	cr7,r3,0
 	li	r8,STKFRM-16
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	STXVD2X(0,r1,r8)
@@ -337,7 +339,7 @@ _GLOBAL(do_lxvd2x)
 	LXVD2X(0,r1,r8)
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -353,7 +355,7 @@ _GLOBAL(do_stxvd2x)
 	oris	r7,r6,MSR_VSX@h
 	cmpwi	cr7,r3,0
 	li	r8,STKFRM-16
-	mtmsrd	r7
+	MTMSRD(r7)
 	isync
 	beq	cr7,1f
 	STXVD2X(0,r1,r8)
@@ -365,7 +367,7 @@ _GLOBAL(do_stxvd2x)
 	LXVD2X(0,r1,r8)
 4:	PPC_LL	r0,STKFRM+PPC_LR_STKOFF(r1)
 	mtlr	r0
-	mtmsrd	r6
+	MTMSRD(r6)
 	isync
 	mr	r3,r9
 	addi	r1,r1,STKFRM
@@ -373,3 +375,5 @@ _GLOBAL(do_stxvd2x)
 	extab	2b,3b
 
 #endif /* CONFIG_VSX */
+
+#endif	/* CONFIG_PPC_FPU */
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index e0a9858..ae5189a 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -30,6 +30,7 @@ extern char system_call_common[];
 #define XER_OV		0x40000000U
 #define XER_CA		0x20000000U
 
+#ifdef CONFIG_PPC_FPU
 /*
  * Functions in ldstfp.S
  */
@@ -41,6 +42,7 @@ extern int do_lvx(int rn, unsigned long ea);
 extern int do_stvx(int rn, unsigned long ea);
 extern int do_lxvd2x(int rn, unsigned long ea);
 extern int do_stxvd2x(int rn, unsigned long ea);
+#endif
 
 /*
  * Determine whether a conditional branch instruction would branch.
@@ -290,6 +292,7 @@ static int __kprobes write_mem(unsigned long val, unsigned long ea, int nb,
 	return write_mem_unaligned(val, ea, nb, regs);
 }
 
+#ifdef CONFIG_PPC_FPU
 /*
  * Check the address and alignment, and call func to do the actual
  * load or store.
@@ -351,6 +354,7 @@ static int __kprobes do_fp_store(int rn, int (*func)(int, unsigned long),
 	}
 	return err;
 }
+#endif
 
 #ifdef CONFIG_ALTIVEC
 /* For Altivec/VMX, no need to worry about alignment */
@@ -1393,6 +1397,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 				regs->gpr[rd] = byterev_4(val);
 			goto ldst_done;
 
+#ifdef CONFIG_PPC_CPU
 		case 535:	/* lfsx */
 		case 567:	/* lfsux */
 			if (!(regs->msr & MSR_FP))
@@ -1424,6 +1429,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 			ea = xform_ea(instr, regs, u);
 			err = do_fp_store(rd, do_stfd, ea, 8, regs);
 			goto ldst_done;
+#endif
 
 #ifdef __powerpc64__
 		case 660:	/* stdbrx */
@@ -1534,6 +1540,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		} while (++rd < 32);
 		goto instr_done;
 
+#ifdef CONFIG_PPC_FPU
 	case 48:	/* lfs */
 	case 49:	/* lfsu */
 		if (!(regs->msr & MSR_FP))
@@ -1565,6 +1572,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		ea = dform_ea(instr, regs);
 		err = do_fp_store(rd, do_stfd, ea, 8, regs);
 		goto ldst_done;
+#endif
 
 #ifdef __powerpc64__
 	case 58:	/* ld[u], lwa */

^ permalink raw reply related

* Re: [PATCH v2] powerpc/fsl_soc: Search all global-utilities nodes for rstccr
From: Timur Tabi @ 2010-09-01 18:15 UTC (permalink / raw)
  To: Matthew McClintock; +Cc: kumar.gala, linuxppc-dev
In-Reply-To: <1283294691-18765-1-git-send-email-msm@freescale.com>

Matthew McClintock wrote:
> The first global-utilities node might not contain the rstcr
> property, so we should search all the nodes
> 
> Signed-off-by: Matthew McClintock <msm@freescale.com>

Acked-by: Timur Tabi <timur@freescale.com>

^ permalink raw reply

* Re: [PATCH 1/4] drivers/serial/ucc_uart.c: Add of_node_put to avoid memory leak
From: Timur Tabi @ 2010-09-01 18:40 UTC (permalink / raw)
  To: Julia Lawall
  Cc: kernel-janitors, devicetree-discuss, linuxppc-dev, linux-kernel
In-Reply-To: <1283269738-14612-2-git-send-email-julia@diku.dk>

On Tue, Aug 31, 2010 at 10:48 AM, Julia Lawall <julia@diku.dk> wrote:
> Add a call to of_node_put in the error handling code following a call to
> of_find_compatible_node or of_find_node_by_type.
>
> This patch also substantially reorganizes the error handling code in the
> function, to that it is possible first to jump to code that frees qe_port
> and then to jump to code that also puts np.

Acked-by: Timur Tabi <timur@freescale.com>

Thanks, Julia.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
From: Darren Hart @ 2010-09-01 18:47 UTC (permalink / raw)
  To: michael
  Cc: Stephen Rothwell, Michael Neuling, Gautham R Shenoy,
	Josh Triplett, Steven Rostedt, linuxppc-dev, Will Schmidt,
	Paul Mackerras, Ankita Garg, Thomas Gleixner
In-Reply-To: <4C7E6CCC.8090700@us.ibm.com>

On 09/01/2010 08:10 AM, Darren Hart wrote:
> On 08/31/2010 10:54 PM, Michael Ellerman wrote:
>> On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
>> ..
>>>
>>> When running with the function plugin I had to stop the trace
>>> immediately before entering start_secondary after an online or my traces
>>> would not include the pseries_mach_cpu_die function, nor the tracing I
>>> added there (possibly buffer size, I am using 2048). The following trace
>>> was collected using "trace-cmd record -p function -e irq -e sched" and
>>> has been filtered to only show CPU [001] (the CPU undergoing the
>>> offline/online test, and the one seeing preempt_count (pcnt) go to
>>> ffffffff after cede. The function tracer does not indicate anything
>>> running on the CPU other than the HCALL - unless the __trace_hcall*
>>> commands might be to blame. 
>>
>> It's not impossible. Though normally they're disabled right, so the only
>> reason they're running is because you're tracing. So if they are causing
>> the bug then that doesn't explain why you see it normally.
>>
>> Still, might be worth disabling just the hcall tracepoints just to be
>> 100% sure.
> 
> A couple of updates. I was working from tip/rt/head, which has been
> stale for some months now. I switched to tip/rt/2.6.33 and the
> preempt_count() change over cede went away. I now see the live hang that
> Will has reported.
> 
> Before I dive into the live hang, I want to understand what fixed the
> preempt_count() change. That might start pointing us in the right
> direction for the live hang.
> 
> I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to
> try and locate the fix. I've narrowed it down to the 2.6.33.6 merge:
> 
> # git show 7e1af1172bbd4109d09ac515c5d376f633da7cff
> commit 7e1af1172bbd4109d09ac515c5d376f633da7cff
> Merge: d8e94db 9666790
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Tue Jul 13 16:01:16 2010 +0200
> 
>     Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y
> 
>     Conflicts:
>         Makefile
> 
>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> 
> Visual inspection yields two patches of interest:
> 
> f8b67691828321f5c85bb853283aa101ae673130
> powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu
> 
> aef40e87d866355ffd279ab21021de733242d0d5
> powerpc/pseries: Only call start-cpu when a CPU is stopped
> 
> I'm going to try reverting these today and see if they addressed the
> issue indirectly.

Removing:

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
 
        pcpu = get_hard_smp_processor_id(lcpu);
 
+       /* Check to see if the CPU out of FW already for kexec */

This comment is really confusing to me. I _think_ it is saying that this test
determines if the CPU is done executing firmware code and has begun executing
OS code.... Is that right?

+       if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
+               cpu_set(lcpu, of_spin_map);
+               return 1;
+       }
+
        /* Fixup atomic count: it exited inside IRQ handler. */
        task_thread_info(paca[lcpu].__current)->preempt_count   = 0;

from tip/rt/2.6.33 causes the preempt_count() to change across the cede
call.  This patch appears to prevents the proxy preempt_count assignment
from happening. This non-local-cpu assignment to 0 would cause an
underrun of preempt_count() if the local-cpu had disabled preemption
prior to the assignment and then later tried to enable it. This appears
to be the case with the stack of __trace_hcall* calls preceeding the
return from extended_cede_processor() in the latency format trace-cmd
report:

  <idle>-0       1d....   201.252737: function:             .cpu_die
  <idle>-0       1d....   201.252738: function:                .pseries_mach_cpu_die
  <idle>-0       1d....   201.252740: function:                   .idle_task_exit
  <idle>-0       1d....   201.252741: function:                      .switch_slb
  <idle>-0       1d....   201.252742: function:                   .xics_teardown_cpu
  <idle>-0       1d....   201.252743: function:                      .xics_set_cpu_priority
  <idle>-0       1d....   201.252744: function:             .__trace_hcall_entry
  <idle>-0       1d..1.   201.252745: function:                .probe_hcall_entry
  <idle>-0       1d..1.   201.252746: function:             .__trace_hcall_exit
  <idle>-0       1d..2.   201.252747: function:                .probe_hcall_exit
  <idle>-0       1d....   201.252748: function:             .__trace_hcall_entry
  <idle>-0       1d..1.   201.252748: function:                .probe_hcall_entry
  <idle>-0       1d..1.   201.252750: function:             .__trace_hcall_exit
  <idle>-0       1d..2.   201.252751: function:                .probe_hcall_exit
  <idle>-0       1d....   201.252752: function:             .__trace_hcall_entry
  <idle>-0       1d..1.   201.252753: function:                .probe_hcall_entry
  offon.sh-3684  6.....   201.466488: bprint:               .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1

preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
QCSS_NOT_STOPPED check from the patch above.

  <idle>-0       1d....   201.466503: function:             .__trace_hcall_exit
  <idle>-0       1d..1.   201.466505: function:                .probe_hcall_exit
  <idle>-0       1d.Hff.   201.466507: bprint:               .pseries_mach_cpu_die : after cede: ffffffff

With the preempt_count() being one less than it should be, the final
preempt_enable() in the trace_hcall path drops preempt_count to
0xffffffff, which of course is an illegal value and leads to a crash.

I don't know if the patch above is the "right fix" for this or not as
don't yet understand why this occurs only with PREEMPT_RT and not in
mainline. Once we sort that out, we may need a more specific fix.

Many thanks to Steven Rostedt for pouring over the trace and the trace
code with me.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply

* Re: ERR_PTR pattern in phylib
From: Grant Likely @ 2010-09-01 18:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, afleming, linuxppc-dev
In-Reply-To: <20100901.082729.135524471.davem@davemloft.net>

On Wed, Sep 1, 2010 at 9:27 AM, David Miller <davem@davemloft.net> wrote:
> From: Grant Likely <grant.likely@secretlab.ca>
> Date: Wed, 1 Sep 2010 08:42:49 -0600
>
>> It seems to me that phylib is one of the cases where the users (the
>> network drivers) don't actually care about the specific error code
>> when calling phylib functions. =A0The drivers only seem to care whether
>> or not the function failed, and if it did then bail out. =A0I've also
>> noticed that using the "if (!ptr)" test on phylib return values is a
>> common error for driver writers.
>>
>> In the interest of making driver code easier to write and review,
>> would you be opposed to a set of patches to remove the ERR_PTR()
>> pattern from phylib and its users?
>
> I'm opposed to it because it means that if code actually does
> care about the error code it will no longer be able to obtain
> it.

The error codes in phylib are almost arbitrary and don't really give
enough information about where the a failure lies.  dev_err() is more
useful for debugging.

My experience has been that the encoding of error numbers into a
pointer return is a source of bugs for driver writers and should be
strongly avoided unless the return codes are actually important (ie.
userspace depends on them).  Especially so when the use-case for
keeping them is merely theoretical.  I've looked through the phylib
usage for ERR_PTR(), and it all is related to whether or not a
phy_device pointer can be located.  The error code is pretty much
irrelevant if a phy cannot be obtained.

How about this as a compromise:  I'll investigate all the users of
phylib and if I find even one situation where the specific return code
is actually important to a driver, then I'll back off.  phylib has
been around for 5 years now which should be enough time for that use
case to bubble to the surface.

g.

^ permalink raw reply

* Re: ERR_PTR pattern in phylib
From: David Miller @ 2010-09-01 19:03 UTC (permalink / raw)
  To: grant.likely; +Cc: netdev, afleming, linuxppc-dev
In-Reply-To: <AANLkTimkh6ytDmiCQY-gZLHpLzeux1RwMFyr=HyfcDaJ@mail.gmail.com>

From: Grant Likely <grant.likely@secretlab.ca>
Date: Wed, 1 Sep 2010 12:56:29 -0600

> The error codes in phylib are almost arbitrary and don't really give
> enough information about where the a failure lies.  dev_err() is more
> useful for debugging.

If it's using bad error codes, that's only indicative of a bug in
phylib.

> How about this as a compromise:  I'll investigate all the users of
> phylib and if I find even one situation where the specific return code
> is actually important to a driver, then I'll back off.  phylib has
> been around for 5 years now which should be enough time for that use
> case to bubble to the surface.

Sorry, I'm not agreeable to this.

If there are bugs where drivers are not checking the return
pointers correctly, please just fix those bugs.  Otherwise
I see things fine as they are.

^ permalink raw reply

* Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
From: Steven Rostedt @ 2010-09-01 19:59 UTC (permalink / raw)
  To: Darren Hart
  Cc: Stephen Rothwell, Michael Neuling, Gautham R Shenoy,
	Josh Triplett, linuxppc-dev, Will Schmidt, Paul Mackerras,
	Ankita Garg, Thomas Gleixner
In-Reply-To: <4C7E9FC1.60004@us.ibm.com>

On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:

> from tip/rt/2.6.33 causes the preempt_count() to change across the cede
> call.  This patch appears to prevents the proxy preempt_count assignment
> from happening. This non-local-cpu assignment to 0 would cause an
> underrun of preempt_count() if the local-cpu had disabled preemption
> prior to the assignment and then later tried to enable it. This appears
> to be the case with the stack of __trace_hcall* calls preceeding the
> return from extended_cede_processor() in the latency format trace-cmd
> report:
> 
>   <idle>-0       1d....   201.252737: function:             .cpu_die

Note, the above 1d.... is a series of values. The first being the CPU,
the next if interrupts are disabled, the next if the NEED_RESCHED flag
is set, the next is softirqs enabled or disabled, next the
preempt_count, and finally the lockdepth count.

Here we only care about the preempt_count, which is zero when '.' and a
number if it is something else. It is the second to last field in that
list.


>   <idle>-0       1d....   201.252738: function:                .pseries_mach_cpu_die
>   <idle>-0       1d....   201.252740: function:                   .idle_task_exit
>   <idle>-0       1d....   201.252741: function:                      .switch_slb
>   <idle>-0       1d....   201.252742: function:                   .xics_teardown_cpu
>   <idle>-0       1d....   201.252743: function:                      .xics_set_cpu_priority
>   <idle>-0       1d....   201.252744: function:             .__trace_hcall_entry
>   <idle>-0       1d..1.   201.252745: function:                .probe_hcall_entry

                       ^
                preempt_count set to 1

>   <idle>-0       1d..1.   201.252746: function:             .__trace_hcall_exit
>   <idle>-0       1d..2.   201.252747: function:                .probe_hcall_exit
>   <idle>-0       1d....   201.252748: function:             .__trace_hcall_entry
>   <idle>-0       1d..1.   201.252748: function:                .probe_hcall_entry
>   <idle>-0       1d..1.   201.252750: function:             .__trace_hcall_exit
>   <idle>-0       1d..2.   201.252751: function:                .probe_hcall_exit
>   <idle>-0       1d....   201.252752: function:             .__trace_hcall_entry
>   <idle>-0       1d..1.   201.252753: function:                .probe_hcall_entry
                   ^   ^
                  CPU  preempt_count

Entering the function probe_hcall_entry() the preempt_count is 1 (see
below). But probe_hcall_entry does:

	h = &get_cpu_var(hcall_stats)[opcode / 4];

Without doing the put (which it does in probe_hcall_exit())

So exiting the probe_hcall_entry() the prempt_count is 2.
The trace_hcall_entry() will do a preempt_enable() making it leave as 1.


>   offon.sh-3684  6.....   201.466488: bprint:               .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1

This is CPU 6, changing the preempt count from 1 to 0.

> 
> preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
> QCSS_NOT_STOPPED check from the patch above.
> 
>   <idle>-0       1d....   201.466503: function:             .__trace_hcall_exit

Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:

 preempt_disable();
 call probe();
 preempt_enable();


>   <idle>-0       1d..1.   201.466505: function:                .probe_hcall_exit

The preempt_count of 1 entering the probe_hcall_exit() is because of the
preempt_disable() shown above. It should have been entered as a 2.

But then it does:


	put_cpu_var(hcall_stats);

making preempt_count 0.

But the preempt_enable() in the trace_hcall_exit() causes this to be -1.


>   <idle>-0       1d.Hff.   201.466507: bprint:               .pseries_mach_cpu_die : after cede: ffffffff
> 
> With the preempt_count() being one less than it should be, the final
> preempt_enable() in the trace_hcall path drops preempt_count to
> 0xffffffff, which of course is an illegal value and leads to a crash.

I'm confused to how this works in mainline?

-- Steve

> 
> I don't know if the patch above is the "right fix" for this or not as
> don't yet understand why this occurs only with PREEMPT_RT and not in
> mainline. Once we sort that out, we may need a more specific fix.

^ permalink raw reply

* Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
From: Darren Hart @ 2010-09-01 20:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Stephen Rothwell, Michael Neuling, Gautham R Shenoy,
	Josh Triplett, linuxppc-dev, Will Schmidt, Paul Mackerras,
	Ankita Garg, Thomas Gleixner
In-Reply-To: <1283371161.2356.53.camel@gandalf.stny.rr.com>

On 09/01/2010 12:59 PM, Steven Rostedt wrote:
> On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:
> 
>> from tip/rt/2.6.33 causes the preempt_count() to change across the cede
>> call.  This patch appears to prevents the proxy preempt_count assignment
>> from happening. This non-local-cpu assignment to 0 would cause an
>> underrun of preempt_count() if the local-cpu had disabled preemption
>> prior to the assignment and then later tried to enable it. This appears
>> to be the case with the stack of __trace_hcall* calls preceeding the
>> return from extended_cede_processor() in the latency format trace-cmd
>> report:
>>
>>   <idle>-0       1d....   201.252737: function:             .cpu_die
> 
> Note, the above 1d.... is a series of values. The first being the CPU,
> the next if interrupts are disabled, the next if the NEED_RESCHED flag
> is set, the next is softirqs enabled or disabled, next the
> preempt_count, and finally the lockdepth count.
> 
> Here we only care about the preempt_count, which is zero when '.' and a
> number if it is something else. It is the second to last field in that
> list.
> 
> 
>>   <idle>-0       1d....   201.252738: function:                .pseries_mach_cpu_die
>>   <idle>-0       1d....   201.252740: function:                   .idle_task_exit
>>   <idle>-0       1d....   201.252741: function:                      .switch_slb
>>   <idle>-0       1d....   201.252742: function:                   .xics_teardown_cpu
>>   <idle>-0       1d....   201.252743: function:                      .xics_set_cpu_priority
>>   <idle>-0       1d....   201.252744: function:             .__trace_hcall_entry
>>   <idle>-0       1d..1.   201.252745: function:                .probe_hcall_entry
> 
>                        ^
>                 preempt_count set to 1
> 
>>   <idle>-0       1d..1.   201.252746: function:             .__trace_hcall_exit
>>   <idle>-0       1d..2.   201.252747: function:                .probe_hcall_exit
>>   <idle>-0       1d....   201.252748: function:             .__trace_hcall_entry
>>   <idle>-0       1d..1.   201.252748: function:                .probe_hcall_entry
>>   <idle>-0       1d..1.   201.252750: function:             .__trace_hcall_exit
>>   <idle>-0       1d..2.   201.252751: function:                .probe_hcall_exit
>>   <idle>-0       1d....   201.252752: function:             .__trace_hcall_entry
>>   <idle>-0       1d..1.   201.252753: function:                .probe_hcall_entry
>                    ^   ^
>                   CPU  preempt_count
> 
> Entering the function probe_hcall_entry() the preempt_count is 1 (see
> below). But probe_hcall_entry does:
> 
> 	h = &get_cpu_var(hcall_stats)[opcode / 4];
> 
> Without doing the put (which it does in probe_hcall_exit())
> 
> So exiting the probe_hcall_entry() the prempt_count is 2.
> The trace_hcall_entry() will do a preempt_enable() making it leave as 1.
> 
> 
>>   offon.sh-3684  6.....   201.466488: bprint:               .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1
> 
> This is CPU 6, changing the preempt count from 1 to 0.
> 
>>
>> preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
>> QCSS_NOT_STOPPED check from the patch above.
>>
>>   <idle>-0       1d....   201.466503: function:             .__trace_hcall_exit
> 
> Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:
> 
>  preempt_disable();
>  call probe();
>  preempt_enable();
> 
> 
>>   <idle>-0       1d..1.   201.466505: function:                .probe_hcall_exit
> 
> The preempt_count of 1 entering the probe_hcall_exit() is because of the
> preempt_disable() shown above. It should have been entered as a 2.
> 
> But then it does:
> 
> 
> 	put_cpu_var(hcall_stats);
> 
> making preempt_count 0.
> 
> But the preempt_enable() in the trace_hcall_exit() causes this to be -1.
> 
> 
>>   <idle>-0       1d.Hff.   201.466507: bprint:               .pseries_mach_cpu_die : after cede: ffffffff
>>
>> With the preempt_count() being one less than it should be, the final
>> preempt_enable() in the trace_hcall path drops preempt_count to
>> 0xffffffff, which of course is an illegal value and leads to a crash.
> 
> I'm confused to how this works in mainline?

Turns out it didn't. 2.6.33.5 with CONFIG_PREEMPT=y sees this exact same
behavior. The following, part of the 2.6.33.6 stable release, prevents
this from happening:

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned
int lcpu)

        pcpu = get_hard_smp_processor_id(lcpu);

+       /* Check to see if the CPU out of FW already for kexec */
+       if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
+               cpu_set(lcpu, of_spin_map);
+               return 1;
+       }
+
        /* Fixup atomic count: it exited inside IRQ handler. */
        task_thread_info(paca[lcpu].__current)->preempt_count   = 0;

The question is now, Is this the right fix? If so, perhaps we can update
the comment to be a bit more clear and not refer solely to kexec.

Michael Neuling, can you offer any thoughts here? We hit this EVERY
TIME, which makes me wonder if the offline/online path could do this
without calling smp_startup_cpu at all.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply

* Memory allocation modifications in ibm_newemac driver
From: Jonathan Haws @ 2010-09-01 20:41 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org

[-- Attachment #1: Type: text/plain, Size: 5555 bytes --]

I am not sure which list this is best addressed to, so I hope I find someone who can help me out.

I am new to kernel development and am working on a network driver.  The AppliedMicro 405EX chip provides dual EMAC ports and the driver for those is the ibm_newemac driver.  I discovered an issue last year with said driver that I am trying to fix now.

The problem is this - when I enable a large MTU (larger than a single page) and run data through the EMAC and also am reading and writing data to a disk, memory becomes so fragmented that allocating a new SKB fails.

I have modified the driver to only ever deal with single pages, since the problem was not that there was a memory leak or that I was just plain out of memory - just that I had a whole ton of free single pages, but no buffers of order 2.  However, now I am getting the following BUG in the kernel:


PING 172.31.22.1 (172.31.22.1): 56 data bytes
64 bytes from 172.31.22.1: seq=0 ttl=128 time=1.147 ms
64 bytes from 172.31.22.1: seq=1 ttl=128 time=0.466 ms
64 bytes from 172.31.22.1: seq=2 ttl=128 time=0.448 ms
64 bytes from 172.31.22.1: seq=3 ttl=128 time=0.444 ms
64 bytes from 172.31.22.1: seq=4 ttl=128 time=0.443 ms
64 bytes from 172.31.22.1: seq=5 ttl=128 time=0.452 ms
64 bytes from 172.31.22.1: seq=6 ttl=128 time=0.445 ms
64 bytes from 172.31.22.1: seq=7 ttl=128 time=0.447 ms
64 bytes from 172.31.22.1: seq=8 ttl=128 time=0.443 ms
64 bytes from 172.31.22.1: seq=9 ttl=128 time=0.452 ms
64 bytes from 172.31.22.1: seq=10 ttl=128 time=0.444 ms
64 bytes from 172.31.22.1: seq=11 ttl=128 time=0.444 ms
64 bytes from 172.31.22.1: seq=12 ttl=128 time=0.448 ms
64 bytes from 172.31.22.1: seq=13 ttl=128 time=0.454 ms
64 bytes from 172.31.22.1: seq=14 ttl=128 time=0.445 ms
------------[ cut here ]------------
kernel BUG at mm/slub.c:2925!
Oops: Exception in kernel mode, sig: 5 [#1]
PREEMPT PowerPC 40x Platform
Modules linked in:
NIP: c0094024 LR: c01c86e4 CTR: 00000000
REGS: cc9679c0 TRAP: 0700   Not tainted  (2.6.31-rc5-walle-01329-geea77be-dirty)
MSR: 00029030 <EE,ME,CE,IR,DR>  CR: 22004024  XER: 2000005f
TASK = ce442940[291] 'ping' THREAD: cc966000
GPR00: 00080000 cc967a70 ce442940 cf0b0000 0000000e 00000000 cf011000 cf0b0024
GPR08: 003a97b1 00000001 c0348000 c0330000 22004022 100dc4d4 c0330000 c02c49fc
GPR16: c02c4a1c c02c4a3c c02c4a54 c02c38cc c02c49b4 cc966000 cf011310 00000001
GPR24: 0000000f 00000000 cca1c220 c01c86e4 cf0b0000 c0335ba4 cca1c180 c0529600
NIP [c0094024] kfree+0xdc/0xec
LR [c01c86e4] skb_release_data+0x78/0xd4
Call Trace:
[cc967a70] [cc966000] 0xcc966000 (unreliable)
[cc967a90] [c01c86e4] skb_release_data+0x78/0xd4
[cc967aa0] [c01c8334] __kfree_skb+0x18/0xe8
[cc967ab0] [c01d1968] netif_receive_skb+0x368/0x378
[cc967ae0] [c01a7394] emac_poll_rx+0x150/0x7b0
[cc967b40] [c01a2abc] mal_poll+0xe4/0x29c
[cc967b80] [c01d4a50] net_rx_action+0x9c/0x1b4
[cc967bb0] [c003b3c0] __do_softirq+0xc4/0x148
[cc967bf0] [c0004d18] do_softirq+0x78/0x80
[cc967c00] [c003b67c] local_bh_enable+0xc0/0xd8
[cc967c10] [c01d5ed8] dev_queue_xmit+0xfc/0x3e4
[cc967c30] [c01f2cd8] ip_finish_output+0xfc/0x31c
[cc967c50] [c01f2f7c] ip_local_out+0x34/0x48
[cc967c60] [c01f3228] ip_push_pending_frames+0x298/0x3d8
[cc967c80] [c0210980] raw_sendmsg+0x6e8/0x74c
[cc967d20] [c0219f44] inet_sendmsg+0x4c/0x78
[cc967d40] [c01c1684] sock_sendmsg+0xac/0xe4
[cc967e30] [c01c19fc] sys_sendto+0xbc/0xf0
[cc967f00] [c01c2450] sys_socketcall+0x140/0x1f8
[cc967f40] [c000f434] ret_from_syscall+0x0/0x3c
Instruction dump:
8009000c 813e0080 5400103a 7d3c012e 939e0080 4bffffc8 83ff000c 4bffff78
801f0000 7009c000 7d200026 55291ffe <0f090000> 7fe3fb78 4bfdf905 4bffffa4
Kernel panic - not syncing: Fatal exception in interrupt
Call Trace:
[cc967810] [c0006ef0] show_stack+0x44/0x16c (unreliable)
[cc967850] [c0034d78] panic+0x94/0x170
[cc9678a0] [c000cdc0] die+0x17c/0x190
[cc9678c0] [c000d08c] _exception+0x174/0x1c4
[cc9679b0] [c000fa30] ret_from_except_full+0x0/0x4c
[cc967a70] [cc966000] 0xcc966000
[cc967a90] [c01c86e4] skb_release_data+0x78/0xd4
[cc967aa0] [c01c8334] __kfree_skb+0x18/0xe8
[cc967ab0] [c01d1968] netif_receive_skb+0x368/0x378
[cc967ae0] [c01a7394] emac_poll_rx+0x150/0x7b0
[cc967b40] [c01a2abc] mal_poll+0xe4/0x29c
[cc967b80] [c01d4a50] net_rx_action+0x9c/0x1b4
[cc967bb0] [c003b3c0] __do_softirq+0xc4/0x148
[cc967bf0] [c0004d18] do_softirq+0x78/0x80
[cc967c00] [c003b67c] local_bh_enable+0xc0/0xd8
[cc967c10] [c01d5ed8] dev_queue_xmit+0xfc/0x3e4
[cc967c30] [c01f2cd8] ip_finish_output+0xfc/0x31c
[cc967c50] [c01f2f7c] ip_local_out+0x34/0x48
[cc967c60] [c01f3228] ip_push_pending_frames+0x298/0x3d8
[cc967c80] [c0210980] raw_sendmsg+0x6e8/0x74c
[cc967d20] [c0219f44] inet_sendmsg+0x4c/0x78
[cc967d40] [c01c1684] sock_sendmsg+0xac/0xe4
[cc967e30] [c01c19fc] sys_sendto+0xbc/0xf0
[cc967f00] [c01c2450] sys_socketcall+0x140/0x1f8
[cc967f40] [c000f434] ret_from_syscall+0x0/0x3c
Rebooting in 180 seconds..


I cannot see why this is occurring.  I have made sure that I have the pages allocated when the driver starts up.  I outlined the driver according to how the e1000e driver accomplished the same task.

Has anyone seen behavior such as this before?  Can anyone point me in the direction I need to go to get this to work?  Any help is appreciated.  I have attached the modified source if anyone wants to take a look.  All the modifications are preceded by a comment such as /* JRH - some comment */.  Just grep JRH and you can see all my changes.

Thanks,

Jonathan


[-- Attachment #2: core.c --]
[-- Type: text/plain, Size: 85590 bytes --]

/*
 * drivers/net/ibm_newemac/core.c
 *
 * Driver for PowerPC 4xx on-chip ethernet controller.
 *
 * Copyright 2007 Benjamin Herrenschmidt, IBM Corp.
 *                <benh@kernel.crashing.org>
 *
 * Based on the arch/ppc version of the driver:
 *
 * Copyright (c) 2004, 2005 Zultys Technologies.
 * Eugene Surovegin <eugene.surovegin@zultys.com> or <ebs@ebshome.net>
 *
 * Based on original work by
 * 	Matt Porter <mporter@kernel.crashing.org>
 *	(c) 2003 Benjamin Herrenschmidt <benh@kernel.crashing.org>
 *      Armin Kuster <akuster@mvista.com>
 * 	Johnnie Peters <jpeters@mvista.com>
 *
 * This program is free software; you can redistribute  it and/or modify it
 * under  the terms of  the GNU General  Public License as published by the
 * Free Software Foundation;  either version 2 of the  License, or (at your
 * option) any later version.
 *
 */

#include <linux/sched.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/delay.h>
#include <linux/types.h>
#include <linux/pci.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <linux/crc32.h>
#include <linux/ethtool.h>
#include <linux/mii.h>
#include <linux/bitops.h>
#include <linux/workqueue.h>
#include <linux/of.h>
#include <linux/highmem.h>

#include <asm/processor.h>
#include <asm/io.h>
#include <asm/dma.h>
#include <asm/uaccess.h>
#include <asm/dcr.h>
#include <asm/dcr-regs.h>

#include "core.h"

/*
 * Lack of dma_unmap_???? calls is intentional.
 *
 * API-correct usage requires additional support state information to be
 * maintained for every RX and TX buffer descriptor (BD). Unfortunately, due to
 * EMAC design (e.g. TX buffer passed from network stack can be split into
 * several BDs, dma_map_single/dma_map_page can be used to map particular BD),
 * maintaining such information will add additional overhead.
 * Current DMA API implementation for 4xx processors only ensures cache coherency
 * and dma_unmap_???? routines are empty and are likely to stay this way.
 * I decided to omit dma_unmap_??? calls because I don't want to add additional
 * complexity just for the sake of following some abstract API, when it doesn't
 * add any real benefit to the driver. I understand that this decision maybe
 * controversial, but I really tried to make code API-correct and efficient
 * at the same time and didn't come up with code I liked :(.                --ebs
 */

#define DRV_NAME        "emac"
#define DRV_VERSION     "3.54"
#define DRV_DESC        "PPC 4xx OCP EMAC driver"

MODULE_DESCRIPTION(DRV_DESC);
MODULE_AUTHOR
("Eugene Surovegin <eugene.surovegin@zultys.com> or <ebs@ebshome.net>");
MODULE_LICENSE("GPL");

/*
 * PPC64 doesn't (yet) have a cacheable_memcpy
 */
#ifdef CONFIG_PPC64
#define cacheable_memcpy(d,s,n) memcpy((d),(s),(n))
#endif

/* minimum number of free TX descriptors required to wake up TX process */
#define EMAC_TX_WAKEUP_THRESH		(NUM_TX_BUFF / 4)

/* If packet size is less than this number, we allocate small skb and copy packet
 * contents into it instead of just sending original big skb up
 */
#define EMAC_RX_COPY_THRESH		CONFIG_IBM_NEW_EMAC_RX_COPY_THRESHOLD

/* Since multiple EMACs share MDIO lines in various ways, we need
 * to avoid re-using the same PHY ID in cases where the arch didn't
 * setup precise phy_map entries
 *
 * XXX This is something that needs to be reworked as we can have multiple
 * EMAC "sets" (multiple ASICs containing several EMACs) though we can
 * probably require in that case to have explicit PHY IDs in the device-tree
 */
static u32 busy_phy_map;
static DEFINE_MUTEX(emac_phy_map_lock);

/* This is the wait queue used to wait on any event related to probe, that
 * is discovery of MALs, other EMACs, ZMII/RGMIIs, etc...
 */
static DECLARE_WAIT_QUEUE_HEAD( emac_probe_wait);

/* Having stable interface names is a doomed idea. However, it would be nice
 * if we didn't have completely random interface names at boot too :-) It's
 * just a matter of making everybody's life easier. Since we are doing
 * threaded probing, it's a bit harder though. The base idea here is that
 * we make up a list of all emacs in the device-tree before we register the
 * driver. Every emac will then wait for the previous one in the list to
 * initialize before itself. We should also keep that list ordered by
 * cell_index.
 * That list is only 4 entries long, meaning that additional EMACs don't
 * get ordering guarantees unless EMAC_BOOT_LIST_SIZE is increased.
 */

#define EMAC_BOOT_LIST_SIZE	4
static struct device_node *emac_boot_list[EMAC_BOOT_LIST_SIZE];

/* How long should I wait for dependent devices ? */
#define EMAC_PROBE_DEP_TIMEOUT	(HZ * 5)

/* I don't want to litter system log with timeout errors
 * when we have brain-damaged PHY.
 */
static inline void emac_report_timeout_error(struct emac_instance *dev,
		const char *error)
{
if (emac_has_feature(dev, EMAC_FTR_440GX_PHY_CLK_FIX |
				EMAC_FTR_460EX_PHY_CLK_FIX |
				EMAC_FTR_440EP_PHY_CLK_FIX))
DBG(dev, "%s" NL, error);
else if (net_ratelimit())
printk(KERN_ERR "%s: %s\n", dev->ofdev->node->full_name, error);
}

/* EMAC PHY clock workaround:
 * 440EP/440GR has more sane SDR0_MFR register implementation than 440GX,
 * which allows controlling each EMAC clock
 */
static inline void emac_rx_clk_tx(struct emac_instance *dev)
{
#ifdef CONFIG_PPC_DCR_NATIVE
	if (emac_has_feature(dev, EMAC_FTR_440EP_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_MFR,
			0, SDR0_MFR_ECS >> dev->cell_index);
#endif
}

static inline void emac_rx_clk_default(struct emac_instance *dev)
{
#ifdef CONFIG_PPC_DCR_NATIVE
	if (emac_has_feature(dev, EMAC_FTR_440EP_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_MFR,
			SDR0_MFR_ECS >> dev->cell_index, 0);
#endif
}

/* PHY polling intervals */
#define PHY_POLL_LINK_ON	HZ
#define PHY_POLL_LINK_OFF	(HZ / 5)

/* Graceful stop timeouts in us.
 * We should allow up to 1 frame time (full-duplex, ignoring collisions)
 */
#define STOP_TIMEOUT_10		1230
#define STOP_TIMEOUT_100	124
#define STOP_TIMEOUT_1000	13
#define STOP_TIMEOUT_1000_JUMBO	73

static unsigned char default_mcast_addr[] =
{ 0x01, 0x80, 0xC2, 0x00, 0x00, 0x01 };

/* Please, keep in sync with struct ibm_emac_stats/ibm_emac_error_stats */
static const char emac_stats_keys[EMAC_ETHTOOL_STATS_COUNT][ETH_GSTRING_LEN] =
{ "rx_packets", "rx_bytes", "tx_packets", "tx_bytes", "rx_packets_csum",
		"tx_packets_csum", "tx_undo", "rx_dropped_stack", "rx_dropped_oom",
		"rx_dropped_error", "rx_dropped_resize", "rx_dropped_mtu",
		"rx_stopped", "rx_bd_errors", "rx_bd_overrun", "rx_bd_bad_packet",
		"rx_bd_runt_packet", "rx_bd_short_event", "rx_bd_alignment_error",
		"rx_bd_bad_fcs", "rx_bd_packet_too_long", "rx_bd_out_of_range",
		"rx_bd_in_range", "rx_parity", "rx_fifo_overrun", "rx_overrun",
		"rx_bad_packet", "rx_runt_packet", "rx_short_event",
		"rx_alignment_error", "rx_bad_fcs", "rx_packet_too_long",
		"rx_out_of_range", "rx_in_range", "tx_dropped", "tx_bd_errors",
		"tx_bd_bad_fcs", "tx_bd_carrier_loss", "tx_bd_excessive_deferral",
		"tx_bd_excessive_collisions", "tx_bd_late_collision",
		"tx_bd_multple_collisions", "tx_bd_single_collision", "tx_bd_underrun",
		"tx_bd_sqe", "tx_parity", "tx_underrun", "tx_sqe", "tx_errors" };

static irqreturn_t emac_irq(int irq, void *dev_instance);
static void emac_clean_tx_ring(struct emac_instance *dev);
static void __emac_set_multicast_list(struct emac_instance *dev);

static inline int emac_phy_supports_gige(int phy_mode)
{
	return phy_mode == PHY_MODE_GMII || phy_mode == PHY_MODE_RGMII || phy_mode
			== PHY_MODE_SGMII || phy_mode == PHY_MODE_TBI || phy_mode
			== PHY_MODE_RTBI;
}

static inline int emac_phy_gpcs(int phy_mode)
{
	return phy_mode == PHY_MODE_SGMII || phy_mode == PHY_MODE_TBI || phy_mode
			== PHY_MODE_RTBI;
}

static inline void emac_tx_enable(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r;

	DBG(dev, "tx_enable" NL);

	r = in_be32(&p->mr0);
	if (!(r & EMAC_MR0_TXE))
		out_be32(&p->mr0, r | EMAC_MR0_TXE);
}

static void emac_tx_disable(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r;

	DBG(dev, "tx_disable" NL);

	r = in_be32(&p->mr0);
	if (r & EMAC_MR0_TXE)
	{
		int n = dev->stop_timeout;
		out_be32(&p->mr0, r & ~EMAC_MR0_TXE);
		while (!(in_be32(&p->mr0) & EMAC_MR0_TXI) && n)
		{
			udelay(1);
			--n;
		}
		if (unlikely(!n))
			emac_report_timeout_error(dev, "TX disable timeout");
	}
}

static void emac_rx_enable(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r;

	if (unlikely(test_bit(MAL_COMMAC_RX_STOPPED, &dev->commac.flags)))
		goto out;

	DBG(dev, "rx_enable" NL);

	r = in_be32(&p->mr0);
	if (!(r & EMAC_MR0_RXE))
	{
		if (unlikely(!(r & EMAC_MR0_RXI)))
		{
			/* Wait if previous async disable is still in progress */
			int n = dev->stop_timeout;
			while (!(r = in_be32(&p->mr0) & EMAC_MR0_RXI) && n)
			{
				udelay(1);
				--n;
			}
			if (unlikely(!n))
				emac_report_timeout_error(dev, "RX disable timeout");
		}
		out_be32(&p->mr0, r | EMAC_MR0_RXE);
	}
	out: ;
}

static void emac_rx_disable(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r;

	DBG(dev, "rx_disable" NL);

	r = in_be32(&p->mr0);
	if (r & EMAC_MR0_RXE)
	{
		int n = dev->stop_timeout;
		out_be32(&p->mr0, r & ~EMAC_MR0_RXE);
		while (!(in_be32(&p->mr0) & EMAC_MR0_RXI) && n)
		{
			udelay(1);
			--n;
		}
		if (unlikely(!n))
			emac_report_timeout_error(dev, "RX disable timeout");
	}
}

static inline void emac_netif_stop(struct emac_instance *dev)
{
	netif_tx_lock_bh(dev->ndev);
	netif_addr_lock(dev->ndev);
	dev->no_mcast = 1;
	netif_addr_unlock(dev->ndev);
	netif_tx_unlock_bh(dev->ndev);
	dev->ndev->trans_start = jiffies; /* prevent tx timeout */
	mal_poll_disable(dev->mal, &dev->commac);
	netif_tx_disable(dev->ndev);
}

static inline void emac_netif_start(struct emac_instance *dev)
{
	netif_tx_lock_bh(dev->ndev);
	netif_addr_lock(dev->ndev);
	dev->no_mcast = 0;
	if (dev->mcast_pending && netif_running(dev->ndev))
		__emac_set_multicast_list(dev);
	netif_addr_unlock(dev->ndev);
	netif_tx_unlock_bh(dev->ndev);

	netif_wake_queue(dev->ndev);

	/* NOTE: unconditional netif_wake_queue is only appropriate
	 * so long as all callers are assured to have free tx slots
	 * (taken from tg3... though the case where that is wrong is
	 *  not terribly harmful)
	 */
	mal_poll_enable(dev->mal, &dev->commac);
}

static inline void emac_rx_disable_async(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r;

	DBG(dev, "rx_disable_async" NL);

	r = in_be32(&p->mr0);
	if (r & EMAC_MR0_RXE)
		out_be32(&p->mr0, r & ~EMAC_MR0_RXE);
}

static int emac_reset(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	int n = 20;

	DBG(dev, "reset" NL);

	if (!dev->reset_failed)
	{
		/* 40x erratum suggests stopping RX channel before reset,
		 * we stop TX as well
		 */
		emac_rx_disable(dev);
		emac_tx_disable(dev);
	}

#ifdef CONFIG_PPC_DCR_NATIVE
	/* Enable internal clock source */
	if (emac_has_feature(dev, EMAC_FTR_460EX_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_ETH_CFG,
			0, SDR0_ETH_CFG_ECS << dev->cell_index);
#endif

	out_be32(&p->mr0, EMAC_MR0_SRST);
	while ((in_be32(&p->mr0) & EMAC_MR0_SRST) && n)
		--n;

#ifdef CONFIG_PPC_DCR_NATIVE
	/* Enable external clock source */
	if (emac_has_feature(dev, EMAC_FTR_460EX_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_ETH_CFG,
			SDR0_ETH_CFG_ECS << dev->cell_index, 0);
#endif

	if (n)
	{
		dev->reset_failed = 0;
		return 0;
	}
	else
	{
		emac_report_timeout_error(dev, "reset timeout");
		dev->reset_failed = 1;
		return -ETIMEDOUT;
	}
}

static void emac_hash_mc(struct emac_instance *dev)
{
	const int regs = EMAC_XAHT_REGS(dev);
	u32 *gaht_base = emac_gaht_base(dev);
	u32 gaht_temp[regs];
	struct dev_mc_list *dmi;
	int i;

	DBG(dev, "hash_mc %d" NL, dev->ndev->mc_count);

	memset(gaht_temp, 0, sizeof(gaht_temp));

	for (dmi = dev->ndev->mc_list; dmi; dmi = dmi->next)
	{
		int slot, reg, mask;
		DBG2(dev, "mc %pM" NL, dmi->dmi_addr);

		slot = EMAC_XAHT_CRC_TO_SLOT(dev, ether_crc(ETH_ALEN, dmi->dmi_addr));
		reg = EMAC_XAHT_SLOT_TO_REG(dev, slot);
		mask = EMAC_XAHT_SLOT_TO_MASK(dev, slot);

		gaht_temp[reg] |= mask;
	}

	for (i = 0; i < regs; i++)
		out_be32(gaht_base + i, gaht_temp[i]);
}

static inline u32 emac_iff2rmr(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	u32 r;

	r = EMAC_RMR_SP | EMAC_RMR_SFCS | EMAC_RMR_IAE | EMAC_RMR_BAE;

	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		r |= EMAC4_RMR_BASE;
	else
		r |= EMAC_RMR_BASE;

	if (ndev->flags & IFF_PROMISC)
		r |= EMAC_RMR_PME;
	else if (ndev->flags & IFF_ALLMULTI || (ndev->mc_count
			> EMAC_XAHT_SLOTS(dev)))
		r |= EMAC_RMR_PMME;
	else if (ndev->mc_count > 0)
		r |= EMAC_RMR_MAE;

	return r;
}

static u32 __emac_calc_base_mr1(struct emac_instance *dev, int tx_size,
		int rx_size)
{
	u32 ret = EMAC_MR1_VLE | EMAC_MR1_IST | EMAC_MR1_TR0_MULT;

	DBG2(dev, "__emac_calc_base_mr1" NL);

	switch (tx_size)
	{
	case 2048:
		ret |= EMAC_MR1_TFS_2K;
		break;
	default:
printk	(KERN_WARNING "%s: Unknown Rx FIFO size %d\n",
			dev->ndev->name, tx_size);
}

switch(rx_size)
{
	case 16384:
	ret |= EMAC_MR1_RFS_16K;
	break;
	case 4096:
	ret |= EMAC_MR1_RFS_4K;
	break;
	default:
	printk(KERN_WARNING "%s: Unknown Rx FIFO size %d\n",
			dev->ndev->name, rx_size);
}

return ret;
}

static u32 __emac4_calc_base_mr1(struct emac_instance *dev, int tx_size,
		int rx_size)
{
	u32 ret = EMAC_MR1_VLE | EMAC_MR1_IST | EMAC4_MR1_TR
			| EMAC4_MR1_OBCI(dev->opb_bus_freq / 1000000);

	DBG2(dev, "__emac4_calc_base_mr1" NL);

	switch (tx_size)
	{
	case 4096:
		ret |= EMAC4_MR1_TFS_4K;
		break;
	case 2048:
		ret |= EMAC4_MR1_TFS_2K;
		break;
	default:
printk	(KERN_WARNING "%s: Unknown Rx FIFO size %d\n",
			dev->ndev->name, tx_size);
}

switch(rx_size)
{
	case 16384:
	ret |= EMAC4_MR1_RFS_16K;
	break;
	case 4096:
	ret |= EMAC4_MR1_RFS_4K;
	break;
	case 2048:
	ret |= EMAC4_MR1_RFS_2K;
	break;
	default:
	printk(KERN_WARNING "%s: Unknown Rx FIFO size %d\n",
			dev->ndev->name, rx_size);
}

return ret;
}

static u32 emac_calc_base_mr1(struct emac_instance *dev, int tx_size,
		int rx_size)
{
	return emac_has_feature(dev, EMAC_FTR_EMAC4) ? __emac4_calc_base_mr1(dev,
			tx_size, rx_size) : __emac_calc_base_mr1(dev, tx_size, rx_size);
}

static inline u32 emac_calc_trtr(struct emac_instance *dev, unsigned int size)
{
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		return ((size >> 6) - 1) << EMAC_TRTR_SHIFT_EMAC4;
	else
		return ((size >> 6) - 1) << EMAC_TRTR_SHIFT;
}

static inline u32 emac_calc_rwmr(struct emac_instance *dev, unsigned int low,
		unsigned int high)
{
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		return (low << 22) | ((high & 0x3ff) << 6);
	else
		return (low << 23) | ((high & 0x1ff) << 7);
}

static int emac_configure(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	struct net_device *ndev = dev->ndev;
	int tx_size, rx_size, link = netif_carrier_ok(dev->ndev);
	u32 r, mr1 = 0;

	DBG(dev, "configure" NL);

	if (!link)
	{
		out_be32(&p->mr1, in_be32(&p->mr1) | EMAC_MR1_FDE | EMAC_MR1_ILE);
		udelay(100);
	}
	else if (emac_reset(dev) < 0)
		return -ETIMEDOUT;

	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
		tah_reset(dev->tah_dev);

	DBG(dev, " link = %d duplex = %d, pause = %d, asym_pause = %d\n",
			link, dev->phy.duplex, dev->phy.pause, dev->phy.asym_pause);

	/* Default fifo sizes */
	tx_size = dev->tx_fifo_size;
	rx_size = dev->rx_fifo_size;

	/* No link, force loopback */
	if (!link)
		mr1 = EMAC_MR1_FDE | EMAC_MR1_ILE;

	/* Check for full duplex */
	else if (dev->phy.duplex == DUPLEX_FULL)
		mr1 |= EMAC_MR1_FDE | EMAC_MR1_MWSW_001;

	/* Adjust fifo sizes, mr1 and timeouts based on link speed */
	dev->stop_timeout = STOP_TIMEOUT_10;
	switch (dev->phy.speed)
	{
	case SPEED_1000:
		if (emac_phy_gpcs(dev->phy.mode))
		{
			mr1 |= EMAC_MR1_MF_1000GPCS | EMAC_MR1_MF_IPPA(
					(dev->phy.gpcs_address != 0xffffffff) ?
					dev->phy.gpcs_address : dev->phy.address);

			/* Put some arbitrary OUI, Manuf & Rev IDs so we can
			 * identify this GPCS PHY later.
			 */
			out_be32(&p->u1.emac4.ipcr, 0xdeadbeef);
		}
		else
			mr1 |= EMAC_MR1_MF_1000;

		/* Extended fifo sizes */
		tx_size = dev->tx_fifo_size_gige;
		rx_size = dev->rx_fifo_size_gige;

		if (dev->ndev->mtu > ETH_DATA_LEN)
		{
			if (emac_has_feature(dev, EMAC_FTR_EMAC4))
				mr1 |= EMAC4_MR1_JPSM;
			else
				mr1 |= EMAC_MR1_JPSM;
			dev->stop_timeout = STOP_TIMEOUT_1000_JUMBO;
		}
		else
			dev->stop_timeout = STOP_TIMEOUT_1000;
		break;
	case SPEED_100:
		mr1 |= EMAC_MR1_MF_100;
		dev->stop_timeout = STOP_TIMEOUT_100;
		break;
	default: /* make gcc happy */
		break;
	}

	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		rgmii_set_speed(dev->rgmii_dev, dev->rgmii_port,
				dev->phy.speed);
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		zmii_set_speed(dev->zmii_dev, dev->zmii_port, dev->phy.speed);

	/* on 40x erratum forces us to NOT use integrated flow control,
	 * let's hope it works on 44x ;)
	 */
	if (!emac_has_feature(dev, EMAC_FTR_NO_FLOW_CONTROL_40x) && dev->phy.duplex
			== DUPLEX_FULL)
	{
		if (dev->phy.pause)
			mr1 |= EMAC_MR1_EIFC | EMAC_MR1_APP;
		else if (dev->phy.asym_pause)
			mr1 |= EMAC_MR1_APP;
	}

	/* Add base settings & fifo sizes & program MR1 */
	mr1 |= emac_calc_base_mr1(dev, tx_size, rx_size);
	out_be32(&p->mr1, mr1);

	/* Set individual MAC address */
	out_be32(&p->iahr, (ndev->dev_addr[0] << 8) | ndev->dev_addr[1]);
	out_be32(&p->ialr, (ndev->dev_addr[2] << 24) | (ndev->dev_addr[3] << 16)
			| (ndev->dev_addr[4] << 8) | ndev->dev_addr[5]);

	/* VLAN Tag Protocol ID */
	out_be32(&p->vtpid, 0x8100);

	/* Receive mode register */
	r = emac_iff2rmr(ndev);
	if (r & EMAC_RMR_MAE)
		emac_hash_mc(dev);
	out_be32(&p->rmr, r);

	/* FIFOs thresholds */
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		r = EMAC4_TMR1((dev->mal_burst_size / dev->fifo_entry_size) + 1,
				tx_size / 2 / dev->fifo_entry_size);
	else
		r = EMAC_TMR1((dev->mal_burst_size / dev->fifo_entry_size) + 1,
				tx_size / 2 / dev->fifo_entry_size);
	out_be32(&p->tmr1, r);
	out_be32(&p->trtr, emac_calc_trtr(dev, tx_size / 2));

	/* PAUSE frame is sent when RX FIFO reaches its high-water mark,
	 there should be still enough space in FIFO to allow the our link
	 partner time to process this frame and also time to send PAUSE
	 frame itself.

	 Here is the worst case scenario for the RX FIFO "headroom"
	 (from "The Switch Book") (100Mbps, without preamble, inter-frame gap):

	 1) One maximum-length frame on TX                    1522 bytes
	 2) One PAUSE frame time                                64 bytes
	 3) PAUSE frame decode time allowance                   64 bytes
	 4) One maximum-length frame on RX                    1522 bytes
	 5) Round-trip propagation delay of the link (100Mb)    15 bytes
	 ----------
	 3187 bytes

	 I chose to set high-water mark to RX_FIFO_SIZE / 4 (1024 bytes)
	 low-water mark  to RX_FIFO_SIZE / 8 (512 bytes)
	 */
	r = emac_calc_rwmr(dev, rx_size / 8 / dev->fifo_entry_size, rx_size / 4
			/ dev->fifo_entry_size);
	out_be32(&p->rwmr, r);

	/* Set PAUSE timer to the maximum */
	out_be32(&p->ptr, 0xffff);

	/* IRQ sources */
	r = EMAC_ISR_OVR | EMAC_ISR_BP | EMAC_ISR_SE | EMAC_ISR_ALE | EMAC_ISR_BFCS
			| EMAC_ISR_PTLE | EMAC_ISR_ORE | EMAC_ISR_IRE | EMAC_ISR_TE;
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		r |= EMAC4_ISR_TXPE | EMAC4_ISR_RXPE /* | EMAC4_ISR_TXUE |
		 EMAC4_ISR_RXOE | */;
	out_be32(&p->iser, r);

	/* We need to take GPCS PHY out of isolate mode after EMAC reset */
	if (emac_phy_gpcs(dev->phy.mode))
	{
		if (dev->phy.gpcs_address != 0xffffffff)
			emac_mii_reset_gpcs(&dev->phy);
		else
			emac_mii_reset_phy(&dev->phy);
	}

	return 0;
}

static void emac_reinitialize(struct emac_instance *dev)
{
	DBG(dev, "reinitialize" NL);

	emac_netif_stop(dev);
	if (!emac_configure(dev))
	{
		emac_tx_enable(dev);
		emac_rx_enable(dev);
	}
	emac_netif_start(dev);
}

static void emac_full_tx_reset(struct emac_instance *dev)
{
	DBG(dev, "full_tx_reset" NL);

	emac_tx_disable(dev);
	mal_disable_tx_channel(dev->mal, dev->mal_tx_chan);
	emac_clean_tx_ring(dev);
	dev->tx_cnt = dev->tx_slot = dev->ack_slot = 0;

	emac_configure(dev);

	mal_enable_tx_channel(dev->mal, dev->mal_tx_chan);
	emac_tx_enable(dev);
	emac_rx_enable(dev);
}

static void emac_reset_work(struct work_struct *work)
{
struct	emac_instance *dev = container_of(work, struct emac_instance, reset_work);

	DBG(dev, "reset_work" NL);

	mutex_lock(&dev->link_lock);
	if (dev->opened)
	{
		emac_netif_stop(dev);
		emac_full_tx_reset(dev);
		emac_netif_start(dev);
	}
	mutex_unlock(&dev->link_lock);
}

static void emac_tx_timeout(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);

	DBG(dev, "tx_timeout" NL);

	schedule_work(&dev->reset_work);
}

static inline int emac_phy_done(struct emac_instance *dev, u32 stacr)
{
	int done = !!(stacr & EMAC_STACR_OC);

	if (emac_has_feature(dev, EMAC_FTR_STACR_OC_INVERT))
		done = !done;

	return done;
}
;

static int __emac_mdio_read(struct emac_instance *dev, u8 id, u8 reg)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r = 0;
	int n, err = -ETIMEDOUT;

	mutex_lock(&dev->mdio_lock);

	DBG2(dev, "mdio_read(%02x,%02x)" NL, id, reg);

	/* Enable proper MDIO port */
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		zmii_get_mdio(dev->zmii_dev, dev->zmii_port);
	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		rgmii_get_mdio(dev->rgmii_dev, dev->rgmii_port);

	/* Wait for management interface to become idle */
	n = 20;
	while (!emac_phy_done(dev, in_be32(&p->stacr)))
	{
		udelay(1);
		if (!--n)
		{
			DBG2(dev, " -> timeout wait idle\n");
			goto bail;
		}
	}

	/* Issue read command */
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		r = EMAC4_STACR_BASE(dev->opb_bus_freq);
	else
		r = EMAC_STACR_BASE(dev->opb_bus_freq);
	if (emac_has_feature(dev, EMAC_FTR_STACR_OC_INVERT))
		r |= EMAC_STACR_OC;
	if (emac_has_feature(dev, EMAC_FTR_HAS_NEW_STACR))
		r |= EMACX_STACR_STAC_READ;
	else
		r |= EMAC_STACR_STAC_READ;
	r |= (reg & EMAC_STACR_PRA_MASK) | ((id & EMAC_STACR_PCDA_MASK)
			<< EMAC_STACR_PCDA_SHIFT);
	out_be32(&p->stacr, r);

	/* Wait for read to complete */
	n = 200;
	while (!emac_phy_done(dev, (r = in_be32(&p->stacr))))
	{
		udelay(1);
		if (!--n)
		{
			DBG2(dev, " -> timeout wait complete\n");
			goto bail;
		}
	}

	if (unlikely(r & EMAC_STACR_PHYE))
	{
		DBG(dev, "mdio_read(%02x, %02x) failed" NL, id, reg);
		err = -EREMOTEIO;
		goto bail;
	}

	r = ((r >> EMAC_STACR_PHYD_SHIFT) & EMAC_STACR_PHYD_MASK);

	DBG2(dev, "mdio_read -> %04x" NL, r);
	err = 0;
	bail: if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		rgmii_put_mdio(dev->rgmii_dev, dev->rgmii_port);
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		zmii_put_mdio(dev->zmii_dev, dev->zmii_port);
	mutex_unlock(&dev->mdio_lock);

	return err == 0 ? r : err;
}

static void __emac_mdio_write(struct emac_instance *dev, u8 id, u8 reg, u16 val)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 r = 0;
	int n, err = -ETIMEDOUT;

	mutex_lock(&dev->mdio_lock);

	DBG2(dev, "mdio_write(%02x,%02x,%04x)" NL, id, reg, val);

	/* Enable proper MDIO port */
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		zmii_get_mdio(dev->zmii_dev, dev->zmii_port);
	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		rgmii_get_mdio(dev->rgmii_dev, dev->rgmii_port);

	/* Wait for management interface to be idle */
	n = 20;
	while (!emac_phy_done(dev, in_be32(&p->stacr)))
	{
		udelay(1);
		if (!--n)
		{
			DBG2(dev, " -> timeout wait idle\n");
			goto bail;
		}
	}

	/* Issue write command */
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		r = EMAC4_STACR_BASE(dev->opb_bus_freq);
	else
		r = EMAC_STACR_BASE(dev->opb_bus_freq);
	if (emac_has_feature(dev, EMAC_FTR_STACR_OC_INVERT))
		r |= EMAC_STACR_OC;
	if (emac_has_feature(dev, EMAC_FTR_HAS_NEW_STACR))
		r |= EMACX_STACR_STAC_WRITE;
	else
		r |= EMAC_STACR_STAC_WRITE;
	r |= (reg & EMAC_STACR_PRA_MASK) | ((id & EMAC_STACR_PCDA_MASK)
			<< EMAC_STACR_PCDA_SHIFT) | (val << EMAC_STACR_PHYD_SHIFT);
	out_be32(&p->stacr, r);

	/* Wait for write to complete */
	n = 200;
	while (!emac_phy_done(dev, in_be32(&p->stacr)))
	{
		udelay(1);
		if (!--n)
		{
			DBG2(dev, " -> timeout wait complete\n");
			goto bail;
		}
	}
	err = 0;
	bail: if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		rgmii_put_mdio(dev->rgmii_dev, dev->rgmii_port);
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		zmii_put_mdio(dev->zmii_dev, dev->zmii_port);
	mutex_unlock(&dev->mdio_lock);
}

static int emac_mdio_read(struct net_device *ndev, int id, int reg)
{
	struct emac_instance *dev = netdev_priv(ndev);
	int res;

	res
			= __emac_mdio_read((dev->mdio_instance && dev->phy.gpcs_address
					!= id) ? dev->mdio_instance : dev, (u8) id, (u8) reg);
	return res;
}

static void emac_mdio_write(struct net_device *ndev, int id, int reg, int val)
{
	struct emac_instance *dev = netdev_priv(ndev);

	__emac_mdio_write(
			(dev->mdio_instance && dev->phy.gpcs_address != id) ? dev->mdio_instance
					: dev, (u8) id, (u8) reg, (u16) val);
}

/* Tx lock BH */
static void __emac_set_multicast_list(struct emac_instance *dev)
{
	struct emac_regs __iomem *p = dev->emacp;
	u32 rmr = emac_iff2rmr(dev->ndev);

	DBG(dev, "__multicast %08x" NL, rmr);

	/* I decided to relax register access rules here to avoid
	 * full EMAC reset.
	 *
	 * There is a real problem with EMAC4 core if we use MWSW_001 bit
	 * in MR1 register and do a full EMAC reset.
	 * One TX BD status update is delayed and, after EMAC reset, it
	 * never happens, resulting in TX hung (it'll be recovered by TX
	 * timeout handler eventually, but this is just gross).
	 * So we either have to do full TX reset or try to cheat here :)
	 *
	 * The only required change is to RX mode register, so I *think* all
	 * we need is just to stop RX channel. This seems to work on all
	 * tested SoCs.                                                --ebs
	 *
	 * If we need the full reset, we might just trigger the workqueue
	 * and do it async... a bit nasty but should work --BenH
	 */
	dev->mcast_pending = 0;
	emac_rx_disable(dev);
	if (rmr & EMAC_RMR_MAE)
		emac_hash_mc(dev);
	out_be32(&p->rmr, rmr);
	emac_rx_enable(dev);
}

/* Tx lock BH */
static void emac_set_multicast_list(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);

	DBG(dev, "multicast" NL);

	BUG_ON(!netif_running(dev->ndev));

	if (dev->no_mcast)
	{
		dev->mcast_pending = 1;
		return;
	}
	__emac_set_multicast_list(dev);
}

static int emac_resize_rx_ring(struct emac_instance *dev, int new_mtu)
{
	int rx_sync_size = emac_rx_sync_size(new_mtu);
	int rx_skb_size = emac_rx_skb_size(new_mtu);
	int i, ret = 0;

	mutex_lock(&dev->link_lock);
	emac_netif_stop(dev);
	emac_rx_disable(dev);
	mal_disable_rx_channel(dev->mal, dev->mal_rx_chan);

	if (dev->rx_sg_skb)
	{
		++dev->estats.rx_dropped_resize;
		dev_kfree_skb(dev->rx_sg_skb);
		dev->rx_sg_skb = NULL;
	}

	/* Make a first pass over RX ring and mark BDs ready, dropping
	 * non-processed packets on the way. We need this as a separate pass
	 * to simplify error recovery in the case of allocation failure later.
	 */
	for (i = 0; i < NUM_RX_BUFF; ++i)
	{
		if (dev->rx_desc[i].ctrl & MAL_RX_CTRL_FIRST)
			++dev->estats.rx_dropped_resize;

		dev->rx_desc[i].data_len = 0;
		dev->rx_desc[i].ctrl = MAL_RX_CTRL_EMPTY
				| (i == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);
	}

	/* Reallocate RX ring only if bigger skb buffers are required */
	if (rx_skb_size <= dev->rx_skb_size)
		goto skip;

	/* Second pass, allocate new skbs */
	for (i = 0; i < NUM_RX_BUFF; ++i)
	{
		struct sk_buff *skb = alloc_skb(rx_skb_size, GFP_ATOMIC);
		if (!skb)
		{
			ret = -ENOMEM;
			goto oom;
		}

		BUG_ON(!dev->rx_skb[i]);
		dev_kfree_skb(dev->rx_skb[i]);

		skb_reserve(skb, EMAC_RX_SKB_HEADROOM + 2);
		/*JRH - map a page */
		/*dev->rx_desc[i].data_ptr = dma_map_single(&dev->ofdev->dev,
				skb->data - 2, rx_sync_size, DMA_FROM_DEVICE) + 2;*/
		dev->rx_desc[i].data_ptr = dma_map_page(&dev->ofdev->dev, dev->page[i],
				0, rx_sync_size, DMA_FROM_DEVICE);
		dev->rx_skb[i] = skb;
	}
	skip:
	/* Check if we need to change "Jumbo" bit in MR1 */
	if ((new_mtu > ETH_DATA_LEN) ^ (dev->ndev->mtu > ETH_DATA_LEN))
	{
		/* This is to prevent starting RX channel in emac_rx_enable() */
		set_bit(MAL_COMMAC_RX_STOPPED, &dev->commac.flags);

		dev->ndev->mtu = new_mtu;
		emac_full_tx_reset(dev);
	}

	mal_set_rcbs(dev->mal, dev->mal_rx_chan, emac_rx_size(new_mtu));
	oom:
	/* Restart RX */
	clear_bit(MAL_COMMAC_RX_STOPPED, &dev->commac.flags);
	dev->rx_slot = 0;
	mal_enable_rx_channel(dev->mal, dev->mal_rx_chan);
	emac_rx_enable(dev);
	emac_netif_start(dev);
	mutex_unlock(&dev->link_lock);

	return ret;
}

/* Process ctx, rtnl_lock semaphore */
static int emac_change_mtu(struct net_device *ndev, int new_mtu)
{
	struct emac_instance *dev = netdev_priv(ndev);
	int ret = 0;

	if (new_mtu < EMAC_MIN_MTU || new_mtu > dev->max_mtu)
		return -EINVAL;

	DBG(dev, "change_mtu(%d)" NL, new_mtu);

	if (netif_running(ndev))
	{
		/* Check if we really need to reinitalize RX ring */
		if (emac_rx_skb_size(ndev->mtu) != emac_rx_skb_size(new_mtu))
			ret = emac_resize_rx_ring(dev, new_mtu);
	}

	if (!ret)
	{
		ndev->mtu = new_mtu;
		dev->rx_skb_size = emac_rx_skb_size(new_mtu);
		dev->rx_sync_size = emac_rx_sync_size(new_mtu);
	}

	return ret;
}

static void emac_clean_tx_ring(struct emac_instance *dev)
{
	int i;

	for (i = 0; i < NUM_TX_BUFF; ++i)
	{
		if (dev->tx_skb[i])
		{
			dev_kfree_skb(dev->tx_skb[i]);
			dev->tx_skb[i] = NULL;
			if (dev->tx_desc[i].ctrl & MAL_TX_CTRL_READY)
				++dev->estats.tx_dropped;
		}
		dev->tx_desc[i].ctrl = 0;
		dev->tx_desc[i].data_ptr = 0;
	}
}

static void emac_clean_rx_ring(struct emac_instance *dev)
{
	int i;

	for (i = 0; i < NUM_RX_BUFF; ++i)
		if (dev->rx_skb[i])
		{
			dev->rx_desc[i].ctrl = 0;
			dev_kfree_skb(dev->rx_skb[i]);
			dev->rx_skb[i] = NULL;
			dev->rx_desc[i].data_ptr = 0;
		}

	if (dev->rx_sg_skb)
	{
		dev_kfree_skb(dev->rx_sg_skb);
		dev->rx_sg_skb = NULL;
	}
}

static inline int emac_alloc_rx_skb(struct emac_instance *dev, int slot,
		gfp_t flags)
{
	struct sk_buff *skb = alloc_skb(dev->rx_skb_size, flags);
	if (unlikely(!skb))
		return -ENOMEM;

	dev->rx_skb[slot] = skb;
	dev->rx_desc[slot].data_len = 0;

	skb_reserve(skb, EMAC_RX_SKB_HEADROOM + 2);
	/*JRH - map a page */
	/*dev->rx_desc[slot].data_ptr = dma_map_single(&dev->ofdev->dev,
			skb->data - 2, dev->rx_sync_size, DMA_FROM_DEVICE) + 2;*/
	dev->rx_desc[slot].data_ptr = dma_map_page(&dev->ofdev->dev,
			dev->page[slot], 0, dev->rx_sync_size, DMA_FROM_DEVICE) + 2;
	wmb();
	dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY
			| (slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);

	return 0;
}

static void emac_print_link_status(struct emac_instance *dev)
{
if (netif_carrier_ok(dev->ndev))
printk(KERN_INFO "%s: link is up, %d %s%s\n",
		dev->ndev->name, dev->phy.speed,
		dev->phy.duplex == DUPLEX_FULL ? "FDX" : "HDX",
		dev->phy.pause ? ", pause enabled" :
		dev->phy.asym_pause ? ", asymmetric pause enabled" : "");
else
printk(KERN_INFO "%s: link is down\n", dev->ndev->name);
}

/* Process ctx, rtnl_lock semaphore */
static int emac_open(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	int err, i;

	DBG(dev, "open" NL);

	/* Setup error IRQ handler */
	err = request_irq(dev->emac_irq, emac_irq, 0, "EMAC", dev);
	if (err)
	{
		printk(KERN_ERR "%s: failed to request IRQ %d\n",
				ndev->name, dev->emac_irq);
		return err;
	}

	/* Allocate RX ring */
	for (i = 0; i < NUM_RX_BUFF; ++i)
	{
		if(!(dev->page[i] = alloc_page(GFP_ATOMIC)))
		{
			printk(KERN_ERR "%s: failed to allocate pages in RX ring\n",
					ndev->name);
		}
		if (emac_alloc_rx_skb(dev, i, GFP_KERNEL))
		{
			printk(KERN_ERR "%s: failed to allocate RX ring\n",
					ndev->name);
			goto oom;
		}
	}

	dev->tx_cnt = dev->tx_slot = dev->ack_slot = dev->rx_slot = 0;
	clear_bit(MAL_COMMAC_RX_STOPPED, &dev->commac.flags);
	dev->rx_sg_skb = NULL;

	mutex_lock(&dev->link_lock);
	dev->opened = 1;

	/* Start PHY polling now.
	 */
	if (dev->phy.address >= 0)
	{
		int link_poll_interval;
		if (dev->phy.def->ops->poll_link(&dev->phy))
		{
			dev->phy.def->ops->read_link(&dev->phy);
			emac_rx_clk_default(dev);
			netif_carrier_on(dev->ndev);
			link_poll_interval = PHY_POLL_LINK_ON;
		}
		else
		{
			emac_rx_clk_tx(dev);
			netif_carrier_off(dev->ndev);
			link_poll_interval = PHY_POLL_LINK_OFF;
		}
		dev->link_polling = 1;
		wmb();
		schedule_delayed_work(&dev->link_work, link_poll_interval);
		emac_print_link_status(dev);
	}
	else
		netif_carrier_on(dev->ndev);

	/* Required for Pause packet support in EMAC */
	dev_mc_add(ndev, default_mcast_addr, sizeof(default_mcast_addr), 1);

	emac_configure(dev);
	mal_poll_add(dev->mal, &dev->commac);
	mal_enable_tx_channel(dev->mal, dev->mal_tx_chan);
	mal_set_rcbs(dev->mal, dev->mal_rx_chan, emac_rx_size(ndev->mtu));
	mal_enable_rx_channel(dev->mal, dev->mal_rx_chan);
	emac_tx_enable(dev);
	emac_rx_enable(dev);
	emac_netif_start(dev);

	mutex_unlock(&dev->link_lock);

	return 0;
	oom: emac_clean_rx_ring(dev);
	free_irq(dev->emac_irq, dev);

	return -ENOMEM;
}

/* BHs disabled */
#if 0
static int emac_link_differs(struct emac_instance *dev)
{
	u32 r = in_be32(&dev->emacp->mr1);

	int duplex = r & EMAC_MR1_FDE ? DUPLEX_FULL : DUPLEX_HALF;
	int speed, pause, asym_pause;

	if (r & EMAC_MR1_MF_1000)
	speed = SPEED_1000;
	else if (r & EMAC_MR1_MF_100)
	speed = SPEED_100;
	else
	speed = SPEED_10;

	switch (r & (EMAC_MR1_EIFC | EMAC_MR1_APP))
	{
		case (EMAC_MR1_EIFC | EMAC_MR1_APP):
		pause = 1;
		asym_pause = 0;
		break;
		case EMAC_MR1_APP:
		pause = 0;
		asym_pause = 1;
		break;
		default:
		pause = asym_pause = 0;
	}
	return speed != dev->phy.speed || duplex != dev->phy.duplex ||
	pause != dev->phy.pause || asym_pause != dev->phy.asym_pause;
}
#endif

static void emac_link_timer(struct work_struct *work)
{
	struct emac_instance *dev =
	container_of(to_delayed_work(work),
			struct emac_instance, link_work);
	int link_poll_interval;

	mutex_lock(&dev->link_lock);
	DBG2(dev, "link timer" NL);

	if (!dev->opened)
		goto bail;

	if (dev->phy.def->ops->poll_link(&dev->phy))
	{
		if (!netif_carrier_ok(dev->ndev))
		{
			emac_rx_clk_default(dev);
			/* Get new link parameters */
			dev->phy.def->ops->read_link(&dev->phy);

			netif_carrier_on(dev->ndev);
			emac_netif_stop(dev);
			emac_full_tx_reset(dev);
			emac_netif_start(dev);
			emac_print_link_status(dev);
		}
		link_poll_interval = PHY_POLL_LINK_ON;
	}
	else
	{
		if (netif_carrier_ok(dev->ndev))
		{
			emac_rx_clk_tx(dev);
			netif_carrier_off(dev->ndev);
			netif_tx_disable(dev->ndev);
			emac_reinitialize(dev);
			emac_print_link_status(dev);
		}
		link_poll_interval = PHY_POLL_LINK_OFF;
	}
	schedule_delayed_work(&dev->link_work, link_poll_interval);
	bail: mutex_unlock(&dev->link_lock);
}

static void emac_force_link_update(struct emac_instance *dev)
{
	netif_carrier_off(dev->ndev);
	smp_rmb();
	if (dev->link_polling)
	{
		cancel_rearming_delayed_work(&dev->link_work);
		if (dev->link_polling)
			schedule_delayed_work(&dev->link_work, PHY_POLL_LINK_OFF);
	}
}

/* Process ctx, rtnl_lock semaphore */
static int emac_close(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);

	DBG(dev, "close" NL);

	if (dev->phy.address >= 0)
	{
		dev->link_polling = 0;
		cancel_rearming_delayed_work(&dev->link_work);
	}
	mutex_lock(&dev->link_lock);
	emac_netif_stop(dev);
	dev->opened = 0;
	mutex_unlock(&dev->link_lock);

	emac_rx_disable(dev);
	emac_tx_disable(dev);
	mal_disable_rx_channel(dev->mal, dev->mal_rx_chan);
	mal_disable_tx_channel(dev->mal, dev->mal_tx_chan);
	mal_poll_del(dev->mal, &dev->commac);

	emac_clean_tx_ring(dev);
	emac_clean_rx_ring(dev);

	free_irq(dev->emac_irq, dev);

	return 0;
}

static inline u16 emac_tx_csum(struct emac_instance *dev, struct sk_buff *skb)
{
	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH) && (skb->ip_summed
			== CHECKSUM_PARTIAL))
	{
		++dev->stats.tx_packets_csum;
		return EMAC_TX_CTRL_TAH_CSUM;
	}
	return 0;
}

static inline int emac_xmit_finish(struct emac_instance *dev, int len)
{
	struct emac_regs __iomem *p = dev->emacp;
	struct net_device *ndev = dev->ndev;

	/* Send the packet out. If the if makes a significant perf
	 * difference, then we can store the TMR0 value in "dev"
	 * instead
	 */
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		out_be32(&p->tmr0, EMAC4_TMR0_XMIT);
	else
		out_be32(&p->tmr0, EMAC_TMR0_XMIT);

	if (unlikely(++dev->tx_cnt == NUM_TX_BUFF))
	{
		netif_stop_queue(ndev);
		DBG2(dev, "stopped TX queue" NL);
	}

	ndev->trans_start = jiffies;
	++dev->stats.tx_packets;
	dev->stats.tx_bytes += len;

	return 0;
}

/* Tx lock BH */
static int emac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	unsigned int len = skb->len;
	int slot;

	u16 ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY
			| MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb);

	slot = dev->tx_slot++;
	if (dev->tx_slot == NUM_TX_BUFF)
	{
		dev->tx_slot = 0;
		ctrl |= MAL_TX_CTRL_WRAP;
	}

	DBG2(dev, "xmit(%u) %d" NL, len, slot);

	dev->tx_skb[slot] = skb;
	dev->tx_desc[slot].data_ptr = dma_map_single(&dev->ofdev->dev, skb->data,
			len, DMA_TO_DEVICE);
	dev->tx_desc[slot].data_len = (u16) len;
	wmb();
	dev->tx_desc[slot].ctrl = ctrl;

	return emac_xmit_finish(dev, len);
}

static inline int emac_xmit_split(struct emac_instance *dev, int slot, u32 pd,
		int len, int last, u16 base_ctrl)
{
	while (1)
	{
		u16 ctrl = base_ctrl;
		int chunk = min(len, MAL_MAX_TX_SIZE);
		len -= chunk;

		slot = (slot + 1) % NUM_TX_BUFF;

		if (last && !len)
			ctrl |= MAL_TX_CTRL_LAST;
		if (slot == NUM_TX_BUFF - 1)
			ctrl |= MAL_TX_CTRL_WRAP;

		dev->tx_skb[slot] = NULL;
		dev->tx_desc[slot].data_ptr = pd;
		dev->tx_desc[slot].data_len = (u16) chunk;
		dev->tx_desc[slot].ctrl = ctrl;
		++dev->tx_cnt;

		if (!len)
			break;

		pd += chunk;
	}
	return slot;
}

/* Tx lock BH disabled (SG version for TAH equipped EMACs) */
static int emac_start_xmit_sg(struct sk_buff *skb, struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	int nr_frags = skb_shinfo(skb)->nr_frags;
	int len = skb->len, chunk;
	int slot, i;
	u16 ctrl;
	u32 pd;

	/* This is common "fast" path */
	if (likely(!nr_frags && len <= MAL_MAX_TX_SIZE))
		return emac_start_xmit(skb, ndev);

	len -= skb->data_len;

	/* Note, this is only an *estimation*, we can still run out of empty
	 * slots because of the additional fragmentation into
	 * MAL_MAX_TX_SIZE-sized chunks
	 */
	if (unlikely(dev->tx_cnt + nr_frags + mal_tx_chunks(len) > NUM_TX_BUFF))
		goto stop_queue;

	ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY
			| emac_tx_csum(dev, skb);
	slot = dev->tx_slot;

	/* skb data */
	dev->tx_skb[slot] = NULL;
	chunk = min(len, MAL_MAX_TX_SIZE);
	dev->tx_desc[slot].data_ptr = pd = dma_map_single(&dev->ofdev->dev,
			skb->data, len, DMA_TO_DEVICE);
	dev->tx_desc[slot].data_len = (u16) chunk;
	len -= chunk;
	if (unlikely(len))
		slot = emac_xmit_split(dev, slot, pd + chunk, len, !nr_frags, ctrl);
	/* skb fragments */
	for (i = 0; i < nr_frags; ++i)
	{
		struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i];
		len = frag->size;

		if (unlikely(dev->tx_cnt + mal_tx_chunks(len) >= NUM_TX_BUFF))
			goto undo_frame;

		pd = dma_map_page(&dev->ofdev->dev, frag->page, frag->page_offset, len,
				DMA_TO_DEVICE);

		slot = emac_xmit_split(dev, slot, pd, len, i == nr_frags - 1, ctrl);
	}

	DBG2(dev, "xmit_sg(%u) %d - %d" NL, skb->len, dev->tx_slot, slot);

	/* Attach skb to the last slot so we don't release it too early */
	dev->tx_skb[slot] = skb;

	/* Send the packet out */
	if (dev->tx_slot == NUM_TX_BUFF - 1)
		ctrl |= MAL_TX_CTRL_WRAP;
	wmb();
	dev->tx_desc[dev->tx_slot].ctrl = ctrl;
	dev->tx_slot = (slot + 1) % NUM_TX_BUFF;

	return emac_xmit_finish(dev, skb->len);

	undo_frame:
	/* Well, too bad. Our previous estimation was overly optimistic.
	 * Undo everything.
	 */
	while (slot != dev->tx_slot)
	{
		dev->tx_desc[slot].ctrl = 0;
		--dev->tx_cnt;
		if (--slot < 0)
			slot = NUM_TX_BUFF - 1;
	}
	++dev->estats.tx_undo;

	stop_queue: netif_stop_queue(ndev);
	DBG2(dev, "stopped TX queue" NL);
	return NETDEV_TX_BUSY;
}

/* Tx lock BHs */
static void emac_parse_tx_error(struct emac_instance *dev, u16 ctrl)
{
	struct emac_error_stats *st = &dev->estats;

	DBG(dev, "BD TX error %04x" NL, ctrl);

	++st->tx_bd_errors;
	if (ctrl & EMAC_TX_ST_BFCS)
		++st->tx_bd_bad_fcs;
	if (ctrl & EMAC_TX_ST_LCS)
		++st->tx_bd_carrier_loss;
	if (ctrl & EMAC_TX_ST_ED)
		++st->tx_bd_excessive_deferral;
	if (ctrl & EMAC_TX_ST_EC)
		++st->tx_bd_excessive_collisions;
	if (ctrl & EMAC_TX_ST_LC)
		++st->tx_bd_late_collision;
	if (ctrl & EMAC_TX_ST_MC)
		++st->tx_bd_multple_collisions;
	if (ctrl & EMAC_TX_ST_SC)
		++st->tx_bd_single_collision;
	if (ctrl & EMAC_TX_ST_UR)
		++st->tx_bd_underrun;
	if (ctrl & EMAC_TX_ST_SQE)
		++st->tx_bd_sqe;
}

static void emac_poll_tx(void *param)
{
	struct emac_instance *dev = param;
	u32 bad_mask;

	DBG2(dev, "poll_tx, %d %d" NL, dev->tx_cnt, dev->ack_slot);

	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
		bad_mask = EMAC_IS_BAD_TX_TAH;
	else
		bad_mask = EMAC_IS_BAD_TX;

	netif_tx_lock_bh(dev->ndev);
	if (dev->tx_cnt)
	{
		u16 ctrl;
		int slot = dev->ack_slot, n = 0;
		again: ctrl = dev->tx_desc[slot].ctrl;
		if (!(ctrl & MAL_TX_CTRL_READY))
		{
			struct sk_buff *skb = dev->tx_skb[slot];
			++n;

			if (skb)
			{
				dev_kfree_skb(skb);
				dev->tx_skb[slot] = NULL;
			}
			slot = (slot + 1) % NUM_TX_BUFF;

			if (unlikely(ctrl & bad_mask))
				emac_parse_tx_error(dev, ctrl);

			if (--dev->tx_cnt)
				goto again;
		}
		if (n)
		{
			dev->ack_slot = slot;
			if (netif_queue_stopped(dev->ndev) && dev->tx_cnt
					< EMAC_TX_WAKEUP_THRESH)
				netif_wake_queue(dev->ndev);

			DBG2(dev, "tx %d pkts" NL, n);
		}
	}
	netif_tx_unlock_bh(dev->ndev);
}

static inline void emac_recycle_rx_skb(struct emac_instance *dev, int slot,
		int len)
{
	/*struct sk_buff *skb = dev->rx_skb[slot];*/

	DBG2(dev, "recycle %d %d" NL, slot, len);

	/*JRH - map a page */
	if (len)
		dma_map_page(&dev->ofdev->dev, dev->page[slot], 0,
				EMAC_DMA_ALIGN(len + 2), DMA_FROM_DEVICE);
		/*dma_map_single(&dev->ofdev->dev, skb->data - 2,
				EMAC_DMA_ALIGN(len + 2), DMA_FROM_DEVICE);*/

	dev->rx_desc[slot].data_len = 0;
	wmb();
	dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY
			| (slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);
}

static void emac_parse_rx_error(struct emac_instance *dev, u16 ctrl)
{
	struct emac_error_stats *st = &dev->estats;

	DBG(dev, "BD RX error %04x" NL, ctrl);

	++st->rx_bd_errors;
	if (ctrl & EMAC_RX_ST_OE)
		++st->rx_bd_overrun;
	if (ctrl & EMAC_RX_ST_BP)
		++st->rx_bd_bad_packet;
	if (ctrl & EMAC_RX_ST_RP)
		++st->rx_bd_runt_packet;
	if (ctrl & EMAC_RX_ST_SE)
		++st->rx_bd_short_event;
	if (ctrl & EMAC_RX_ST_AE)
		++st->rx_bd_alignment_error;
	if (ctrl & EMAC_RX_ST_BFCS)
		++st->rx_bd_bad_fcs;
	if (ctrl & EMAC_RX_ST_PTL)
		++st->rx_bd_packet_too_long;
	if (ctrl & EMAC_RX_ST_ORE)
		++st->rx_bd_out_of_range;
	if (ctrl & EMAC_RX_ST_IRE)
		++st->rx_bd_in_range;
}

static inline void emac_rx_csum(struct emac_instance *dev, struct sk_buff *skb,
		u16 ctrl)
{
#ifdef CONFIG_IBM_NEW_EMAC_TAH
	if (!ctrl && dev->tah_dev)
	{
		skb->ip_summed = CHECKSUM_UNNECESSARY;
		++dev->stats.rx_packets_csum;
	}
#endif
}

static inline int emac_rx_sg_append(struct emac_instance *dev, int slot)
{
	if (likely(dev->rx_sg_skb != NULL))
	{
		int len = dev->rx_desc[slot].data_len;
		int tot_len = dev->rx_sg_skb->len + len;

		if (unlikely(tot_len + 2 > dev->rx_skb_size))
		{
			++dev->estats.rx_dropped_mtu;
			dev_kfree_skb(dev->rx_sg_skb);
			dev->rx_sg_skb = NULL;
		}
		else
		{
			cacheable_memcpy(skb_tail_pointer(dev->rx_sg_skb),
					dev->rx_skb[slot]->data, len);
			skb_put(dev->rx_sg_skb, len);
			emac_recycle_rx_skb(dev, slot, len);
			return 0;
		}
	}
	emac_recycle_rx_skb(dev, slot, 0);
	return -1;
}

/* JRH - Consume an allocated page */
static void emac_consume_page(struct emac_instance *dev, int slot,
		struct sk_buff *skb, int len)
{
	dev->page[slot] = NULL;
	skb->len += len;
	skb->data_len += len;
	skb->truesize += len;

	/* Allocate a new page to replace the one consumed */
	if(!(dev->page[slot] = alloc_page(GFP_ATOMIC)))
	{
		printk(KERN_ERR "%s: allocating new page after consumption\n", __FUNCTION__);
		return;
	}

	/* Map the page */
	dev->rx_desc[slot].data_ptr = dma_map_page(&dev->ofdev->dev,
			dev->page[slot], 0, dev->rx_sync_size, DMA_FROM_DEVICE);

	return;
}

/* JRH - rewrote append routine to deal with single pages only
 *     - moved away from linear SKBs and using paged SKBs */
static inline int emac_rx_sg_append2(struct emac_instance *dev, int slot)
{
	if (likely(dev->rx_sg_skb != NULL))
	{
		int len = dev->rx_desc[slot].data_len;
		int idx = skb_shinfo(dev->rx_sg_skb)->nr_frags;
		struct page *page = dev->page[slot];
		struct sk_buff *skb = dev->rx_sg_skb;

		/* Append the page to the end of the SG chain */
		skb_fill_page_desc(skb, idx, page, 0, len);
		emac_consume_page(dev, slot, skb, len);

		return 0;
	}
	emac_recycle_rx_skb(dev, slot, 0);
	return -1;
}

/* NAPI poll context */
static int emac_poll_rx(void *param, int budget)
{
	struct emac_instance *dev = param;
	int slot = dev->rx_slot, received = 0;
	int len;
	struct sk_buff *skb;

	DBG2(dev, "poll_rx(%d)" NL, budget);

again:
	while (budget > 0)
	{
		u16 ctrl = dev->rx_desc[slot].ctrl;

		if (ctrl & MAL_RX_CTRL_EMPTY)
			break;

		skb = dev->rx_skb[slot];
		mb();
		len = dev->rx_desc[slot].data_len;

		if (unlikely(!MAL_IS_SINGLE_RX(ctrl)))
			goto sg;

		ctrl &= EMAC_BAD_RX_MASK;
		if (unlikely(ctrl && ctrl != EMAC_RX_TAH_BAD_CSUM))
		{
			DBG(dev, "bad checksum\n");
			emac_parse_rx_error(dev, ctrl);
			++dev->estats.rx_dropped_error;
			emac_recycle_rx_skb(dev, slot, 0);
			len = 0;
			goto next;
		}

		if (len < ETH_HLEN)
		{
			DBG(dev, "length less than header length\n");
			++dev->estats.rx_dropped_stack;
			emac_recycle_rx_skb(dev, slot, len);
			goto next;
		}

		if (len && len < EMAC_RX_COPY_THRESH)
		{
			/* JRH - move to pages */
			/*struct sk_buff *copy_skb = alloc_skb(
					len + EMAC_RX_SKB_HEADROOM + 2, GFP_ATOMIC);
			if (unlikely(!copy_skb))
				goto oom;

			skb_reserve(copy_skb, EMAC_RX_SKB_HEADROOM + 2);
			cacheable_memcpy(copy_skb->data - 2, skb->data - 2, len + 2);
			emac_recycle_rx_skb(dev, slot, len);
			skb = copy_skb;*/
			u8 *vaddr;
			DBG(dev, "Copying page to skb\n");
			vaddr = (u8*)kmap_atomic(dev->page[slot], KM_SKB_DATA_SOFTIRQ);
			cacheable_memcpy(skb_tail_pointer(skb), vaddr, len);
			kunmap_atomic(vaddr, KM_SKB_DATA_SOFTIRQ);
			skb_put(skb,len);
		}
		else /*if (unlikely(emac_alloc_rx_skb(dev, slot, GFP_ATOMIC)))
			goto oom;*/
		{
			DBG(dev, "placing page in skb\n");
			skb_fill_page_desc(skb, 0, dev->page[slot], 0, len);
			emac_consume_page(dev, slot, skb, len);
		}

		/* JRH - do not call skb_put() on a paged SKB! */
		/*skb_put(skb, len);*/

push_packet:
		skb->dev = dev->ndev;
		skb->protocol = eth_type_trans(skb, dev->ndev);
		emac_rx_csum(dev, skb, ctrl);

		if (unlikely(netif_receive_skb(skb) == NET_RX_DROP))
			++dev->estats.rx_dropped_stack;
next:
		++dev->stats.rx_packets;
skip:
		dev->stats.rx_bytes += len;
		slot = (slot + 1) % NUM_RX_BUFF;
		--budget;
		++received;
		continue;
sg:
		if (ctrl & MAL_RX_CTRL_FIRST)
		{
			BUG_ON(dev->rx_sg_skb);
			if (unlikely(emac_alloc_rx_skb(dev, slot, GFP_ATOMIC)))
			{
				DBG(dev, "rx OOM %d" NL, slot);
				++dev->estats.rx_dropped_oom;
				emac_recycle_rx_skb(dev, slot, 0);
			}
			else
			{
				/* JRH - changing to SG chain, instead of linear SKB */
				dev->rx_sg_skb = skb;
				/*skb_put(skb, len);*/
				skb_fill_page_desc(dev->rx_sg_skb, 0, dev->page[slot], 0, len);
				emac_consume_page(dev, slot, dev->rx_sg_skb, len);
			}
		}
		else if (!emac_rx_sg_append2(dev, slot) && (ctrl & MAL_RX_CTRL_LAST))
		{

			skb = dev->rx_sg_skb;
			dev->rx_sg_skb = NULL;

			ctrl &= EMAC_BAD_RX_MASK;
			if (unlikely(ctrl && ctrl != EMAC_RX_TAH_BAD_CSUM))
			{
				emac_parse_rx_error(dev, ctrl);
				++dev->estats.rx_dropped_error;
				dev_kfree_skb(skb);
				len = 0;
			}
			else
			{
				/* JRH - added pskb_may_pull for eth_type_trans() call above */
				/* eth_type_trans needs skb->data to point at something */
				if (!pskb_may_pull(skb, ETH_HLEN))
				{
					DBG(dev, "pskb_may_pull failed.\n");
					dev_kfree_skb(skb);
					goto skip;
				}
				goto push_packet;
			}
		}
		goto skip;
/*oom:*/
		DBG(dev, "rx OOM %d" NL, slot);
		/* Drop the packet and recycle skb */
		++dev->estats.rx_dropped_oom;
		emac_recycle_rx_skb(dev, slot, 0);
		goto next;
	}

	if (received)
	{
		DBG2(dev, "rx %d BDs" NL, received);
		dev->rx_slot = slot;
	}

	if (unlikely(budget && test_bit(MAL_COMMAC_RX_STOPPED, &dev->commac.flags)))
	{
		mb();
		if (!(dev->rx_desc[slot].ctrl & MAL_RX_CTRL_EMPTY))
		{
			DBG2(dev, "rx restart" NL);
			received = 0;
			goto again;
		}

		if (dev->rx_sg_skb)
		{
			DBG2(dev, "dropping partial rx packet" NL);
			++dev->estats.rx_dropped_error;
			dev_kfree_skb(dev->rx_sg_skb);
			dev->rx_sg_skb = NULL;
		}

		clear_bit(MAL_COMMAC_RX_STOPPED, &dev->commac.flags);
		mal_enable_rx_channel(dev->mal, dev->mal_rx_chan);
		emac_rx_enable(dev);
		dev->rx_slot = 0;
	}
	return received;
}

/* NAPI poll context */
static int emac_peek_rx(void *param)
{
	struct emac_instance *dev = param;

	return !(dev->rx_desc[dev->rx_slot].ctrl & MAL_RX_CTRL_EMPTY);
}

/* NAPI poll context */
static int emac_peek_rx_sg(void *param)
{
	struct emac_instance *dev = param;

	int slot = dev->rx_slot;
	while (1)
	{
		u16 ctrl = dev->rx_desc[slot].ctrl;
		if (ctrl & MAL_RX_CTRL_EMPTY)
			return 0;
		else if (ctrl & MAL_RX_CTRL_LAST)
			return 1;

		slot = (slot + 1) % NUM_RX_BUFF;

		/* I'm just being paranoid here :) */
		if (unlikely(slot == dev->rx_slot))
			return 0;
	}
}

/* Hard IRQ */
static void emac_rxde(void *param)
{
	struct emac_instance *dev = param;

	++dev->estats.rx_stopped;
	emac_rx_disable_async(dev);
}

/* Hard IRQ */
static irqreturn_t emac_irq(int irq, void *dev_instance)
{
	struct emac_instance *dev = dev_instance;
	struct emac_regs __iomem *p = dev->emacp;
	struct emac_error_stats *st = &dev->estats;
	u32 isr;

	spin_lock(&dev->lock);

	isr = in_be32(&p->isr);
	out_be32(&p->isr, isr);

	DBG(dev, "isr = %08x" NL, isr);

	if (isr & EMAC4_ISR_TXPE)
		++st->tx_parity;
	if (isr & EMAC4_ISR_RXPE)
		++st->rx_parity;
	if (isr & EMAC4_ISR_TXUE)
		++st->tx_underrun;
	if (isr & EMAC4_ISR_RXOE)
		++st->rx_fifo_overrun;
	if (isr & EMAC_ISR_OVR)
		++st->rx_overrun;
	if (isr & EMAC_ISR_BP)
		++st->rx_bad_packet;
	if (isr & EMAC_ISR_RP)
		++st->rx_runt_packet;
	if (isr & EMAC_ISR_SE)
		++st->rx_short_event;
	if (isr & EMAC_ISR_ALE)
		++st->rx_alignment_error;
	if (isr & EMAC_ISR_BFCS)
		++st->rx_bad_fcs;
	if (isr & EMAC_ISR_PTLE)
		++st->rx_packet_too_long;
	if (isr & EMAC_ISR_ORE)
		++st->rx_out_of_range;
	if (isr & EMAC_ISR_IRE)
		++st->rx_in_range;
	if (isr & EMAC_ISR_SQE)
		++st->tx_sqe;
	if (isr & EMAC_ISR_TE)
		++st->tx_errors;

	spin_unlock(&dev->lock);

	return IRQ_HANDLED;
}

static struct net_device_stats *emac_stats(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	struct emac_stats *st = &dev->stats;
	struct emac_error_stats *est = &dev->estats;
	struct net_device_stats *nst = &dev->nstats;
	unsigned long flags;

	DBG2(dev, "stats" NL);

	/* Compute "legacy" statistics */
	spin_lock_irqsave(&dev->lock, flags);
	nst->rx_packets = (unsigned long) st->rx_packets;
	nst->rx_bytes = (unsigned long) st->rx_bytes;
	nst->tx_packets = (unsigned long) st->tx_packets;
	nst->tx_bytes = (unsigned long) st->tx_bytes;
	nst->rx_dropped = (unsigned long) (est->rx_dropped_oom
			+ est->rx_dropped_error + est->rx_dropped_resize
			+ est->rx_dropped_mtu);
	nst->tx_dropped = (unsigned long) est->tx_dropped;

	nst->rx_errors = (unsigned long) est->rx_bd_errors;
	nst->rx_fifo_errors = (unsigned long) (est->rx_bd_overrun
			+ est->rx_fifo_overrun + est->rx_overrun);
	nst->rx_frame_errors = (unsigned long) (est->rx_bd_alignment_error
			+ est->rx_alignment_error);
	nst->rx_crc_errors = (unsigned long) (est->rx_bd_bad_fcs + est->rx_bad_fcs);
	nst->rx_length_errors
			= (unsigned long) (est->rx_bd_runt_packet + est->rx_bd_short_event
					+ est->rx_bd_packet_too_long + est->rx_bd_out_of_range
					+ est->rx_bd_in_range + est->rx_runt_packet
					+ est->rx_short_event + est->rx_packet_too_long
					+ est->rx_out_of_range + est->rx_in_range);

	nst->tx_errors = (unsigned long) (est->tx_bd_errors + est->tx_errors);
	nst->tx_fifo_errors = (unsigned long) (est->tx_bd_underrun
			+ est->tx_underrun);
	nst->tx_carrier_errors = (unsigned long) est->tx_bd_carrier_loss;
	nst->collisions = (unsigned long) (est->tx_bd_excessive_deferral
			+ est->tx_bd_excessive_collisions + est->tx_bd_late_collision
			+ est->tx_bd_multple_collisions);
	spin_unlock_irqrestore(&dev->lock, flags);
	return nst;
}

static struct mal_commac_ops emac_commac_ops =
{ .poll_tx = &emac_poll_tx, .poll_rx = &emac_poll_rx, .peek_rx = &emac_peek_rx,
		.rxde = &emac_rxde, };

static struct mal_commac_ops emac_commac_sg_ops =
{ .poll_tx = &emac_poll_tx, .poll_rx = &emac_poll_rx,
		.peek_rx = &emac_peek_rx_sg, .rxde = &emac_rxde, };

/* Ethtool support */
static int emac_ethtool_get_settings(struct net_device *ndev,
		struct ethtool_cmd *cmd)
{
	struct emac_instance *dev = netdev_priv(ndev);

	cmd->supported = dev->phy.features;
	cmd->port = PORT_MII;
	cmd->phy_address = dev->phy.address;
	cmd->transceiver = dev->phy.address >= 0 ? XCVR_EXTERNAL : XCVR_INTERNAL;

	mutex_lock(&dev->link_lock);
	cmd->advertising = dev->phy.advertising;
	cmd->autoneg = dev->phy.autoneg;
	cmd->speed = dev->phy.speed;
	cmd->duplex = dev->phy.duplex;
	mutex_unlock(&dev->link_lock);

	return 0;
}

static int emac_ethtool_set_settings(struct net_device *ndev,
		struct ethtool_cmd *cmd)
{
	struct emac_instance *dev = netdev_priv(ndev);
	u32 f = dev->phy.features;

	DBG(dev, "set_settings(%d, %d, %d, 0x%08x)" NL,
			cmd->autoneg, cmd->speed, cmd->duplex, cmd->advertising);

	/* Basic sanity checks */
	if (dev->phy.address < 0)
		return -EOPNOTSUPP;
	if (cmd->autoneg != AUTONEG_ENABLE && cmd->autoneg != AUTONEG_DISABLE)
		return -EINVAL;
	if (cmd->autoneg == AUTONEG_ENABLE && cmd->advertising == 0)
		return -EINVAL;
	if (cmd->duplex != DUPLEX_HALF && cmd->duplex != DUPLEX_FULL)
		return -EINVAL;

	if (cmd->autoneg == AUTONEG_DISABLE)
	{
		switch (cmd->speed)
		{
		case SPEED_10:
			if (cmd->duplex == DUPLEX_HALF && !(f & SUPPORTED_10baseT_Half))
				return -EINVAL;
			if (cmd->duplex == DUPLEX_FULL && !(f & SUPPORTED_10baseT_Full))
				return -EINVAL;
			break;
		case SPEED_100:
			if (cmd->duplex == DUPLEX_HALF && !(f & SUPPORTED_100baseT_Half))
				return -EINVAL;
			if (cmd->duplex == DUPLEX_FULL && !(f & SUPPORTED_100baseT_Full))
				return -EINVAL;
			break;
		case SPEED_1000:
			if (cmd->duplex == DUPLEX_HALF && !(f & SUPPORTED_1000baseT_Half))
				return -EINVAL;
			if (cmd->duplex == DUPLEX_FULL && !(f & SUPPORTED_1000baseT_Full))
				return -EINVAL;
			break;
		default:
			return -EINVAL;
		}

		mutex_lock(&dev->link_lock);
		dev->phy.def->ops->setup_forced(&dev->phy, cmd->speed, cmd->duplex);
		mutex_unlock(&dev->link_lock);

	}
	else
	{
		if (!(f & SUPPORTED_Autoneg))
			return -EINVAL;

		mutex_lock(&dev->link_lock);
		dev->phy.def->ops->setup_aneg(&dev->phy, (cmd->advertising & f)
				| (dev->phy.advertising & (ADVERTISED_Pause
						| ADVERTISED_Asym_Pause)));
		mutex_unlock(&dev->link_lock);
	}
	emac_force_link_update(dev);

	return 0;
}

static void emac_ethtool_get_ringparam(struct net_device *ndev,
		struct ethtool_ringparam *rp)
{
	rp->rx_max_pending = rp->rx_pending = NUM_RX_BUFF;
	rp->tx_max_pending = rp->tx_pending = NUM_TX_BUFF;
}

static void emac_ethtool_get_pauseparam(struct net_device *ndev,
		struct ethtool_pauseparam *pp)
{
	struct emac_instance *dev = netdev_priv(ndev);

	mutex_lock(&dev->link_lock);
	if ((dev->phy.features & SUPPORTED_Autoneg) && (dev->phy.advertising
			& (ADVERTISED_Pause | ADVERTISED_Asym_Pause)))
		pp->autoneg = 1;

	if (dev->phy.duplex == DUPLEX_FULL)
	{
		if (dev->phy.pause)
			pp->rx_pause = pp->tx_pause = 1;
		else if (dev->phy.asym_pause)
			pp->tx_pause = 1;
	}
	mutex_unlock(&dev->link_lock);
}

static u32 emac_ethtool_get_rx_csum(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);

	return dev->tah_dev != NULL;
}

static int emac_get_regs_len(struct emac_instance *dev)
{
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
		return sizeof(struct emac_ethtool_regs_subhdr)
				+ EMAC4_ETHTOOL_REGS_SIZE(dev);
	else
		return sizeof(struct emac_ethtool_regs_subhdr)
				+ EMAC_ETHTOOL_REGS_SIZE(dev);
}

static int emac_ethtool_get_regs_len(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	int size;

	size = sizeof(struct emac_ethtool_regs_hdr) + emac_get_regs_len(dev)
			+ mal_get_regs_len(dev->mal);
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		size += zmii_get_regs_len(dev->zmii_dev);
	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		size += rgmii_get_regs_len(dev->rgmii_dev);
	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
		size += tah_get_regs_len(dev->tah_dev);

	return size;
}

static void *emac_dump_regs(struct emac_instance *dev, void *buf)
{
	struct emac_ethtool_regs_subhdr *hdr = buf;

	hdr->index = dev->cell_index;
	if (emac_has_feature(dev, EMAC_FTR_EMAC4))
	{
		hdr->version = EMAC4_ETHTOOL_REGS_VER;
		memcpy_fromio(hdr + 1, dev->emacp, EMAC4_ETHTOOL_REGS_SIZE(dev));
		return ((void *) (hdr + 1) + EMAC4_ETHTOOL_REGS_SIZE(dev));
	}
	else
	{
		hdr->version = EMAC_ETHTOOL_REGS_VER;
		memcpy_fromio(hdr + 1, dev->emacp, EMAC_ETHTOOL_REGS_SIZE(dev));
		return ((void *) (hdr + 1) + EMAC_ETHTOOL_REGS_SIZE(dev));
	}
}

static void emac_ethtool_get_regs(struct net_device *ndev,
		struct ethtool_regs *regs, void *buf)
{
	struct emac_instance *dev = netdev_priv(ndev);
	struct emac_ethtool_regs_hdr *hdr = buf;

	hdr->components = 0;
	buf = hdr + 1;

	buf = mal_dump_regs(dev->mal, buf);
	buf = emac_dump_regs(dev, buf);
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
	{
		hdr->components |= EMAC_ETHTOOL_REGS_ZMII;
		buf = zmii_dump_regs(dev->zmii_dev, buf);
	}
	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
	{
		hdr->components |= EMAC_ETHTOOL_REGS_RGMII;
		buf = rgmii_dump_regs(dev->rgmii_dev, buf);
	}
	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
	{
		hdr->components |= EMAC_ETHTOOL_REGS_TAH;
		buf = tah_dump_regs(dev->tah_dev, buf);
	}
}

static int emac_ethtool_nway_reset(struct net_device *ndev)
{
	struct emac_instance *dev = netdev_priv(ndev);
	int res = 0;

	DBG(dev, "nway_reset" NL);

	if (dev->phy.address < 0)
		return -EOPNOTSUPP;

	mutex_lock(&dev->link_lock);
	if (!dev->phy.autoneg)
	{
		res = -EINVAL;
		goto out;
	}

	dev->phy.def->ops->setup_aneg(&dev->phy, dev->phy.advertising);
	out: mutex_unlock(&dev->link_lock);
	emac_force_link_update(dev);
	return res;
}

static int emac_ethtool_get_stats_count(struct net_device *ndev)
{
	return EMAC_ETHTOOL_STATS_COUNT;
}

static void emac_ethtool_get_strings(struct net_device *ndev, u32 stringset,
		u8 * buf)
{
	if (stringset == ETH_SS_STATS)
		memcpy(buf, &emac_stats_keys, sizeof(emac_stats_keys));
}

static void emac_ethtool_get_ethtool_stats(struct net_device *ndev,
		struct ethtool_stats *estats, u64 * tmp_stats)
{
	struct emac_instance *dev = netdev_priv(ndev);

	memcpy(tmp_stats, &dev->stats, sizeof(dev->stats));
	tmp_stats += sizeof(dev->stats) / sizeof(u64);
	memcpy(tmp_stats, &dev->estats, sizeof(dev->estats));
}

static void emac_ethtool_get_drvinfo(struct net_device *ndev,
		struct ethtool_drvinfo *info)
{
	struct emac_instance *dev = netdev_priv(ndev);

	strcpy(info->driver, "ibm_emac");
	strcpy(info->version, DRV_VERSION);
	info->fw_version[0] = '\0';
	sprintf(info->bus_info, "PPC 4xx EMAC-%d %s", dev->cell_index,
			dev->ofdev->node->full_name);
	info->n_stats = emac_ethtool_get_stats_count(ndev);
	info->regdump_len = emac_ethtool_get_regs_len(ndev);
}

static const struct ethtool_ops emac_ethtool_ops =
{ .get_settings = emac_ethtool_get_settings,
		.set_settings = emac_ethtool_set_settings,
		.get_drvinfo = emac_ethtool_get_drvinfo,

		.get_regs_len = emac_ethtool_get_regs_len,
		.get_regs = emac_ethtool_get_regs,

		.nway_reset = emac_ethtool_nway_reset,

		.get_ringparam = emac_ethtool_get_ringparam,
		.get_pauseparam = emac_ethtool_get_pauseparam,

		.get_rx_csum = emac_ethtool_get_rx_csum,

		.get_strings = emac_ethtool_get_strings,
		.get_stats_count = emac_ethtool_get_stats_count,
		.get_ethtool_stats = emac_ethtool_get_ethtool_stats,

		.get_link = ethtool_op_get_link, .get_tx_csum = ethtool_op_get_tx_csum,
		.get_sg = ethtool_op_get_sg, };

static int emac_ioctl(struct net_device *ndev, struct ifreq *rq, int cmd)
{
	struct emac_instance *dev = netdev_priv(ndev);
	uint16_t *data = (uint16_t *) &rq->ifr_ifru;

	DBG(dev, "ioctl %08x" NL, cmd);

	if (dev->phy.address < 0)
		return -EOPNOTSUPP;

	switch (cmd)
	{
	case SIOCGMIIPHY:
	case SIOCDEVPRIVATE:
		data[0] = dev->phy.address;
		/* Fall through */
	case SIOCGMIIREG:
	case SIOCDEVPRIVATE + 1:
		data[3] = emac_mdio_read(ndev, dev->phy.address, data[1]);
		return 0;

	case SIOCSMIIREG:
	case SIOCDEVPRIVATE + 2:
		if (!capable(CAP_NET_ADMIN))
			return -EPERM;
		emac_mdio_write(ndev, dev->phy.address, data[1], data[2]);
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}

struct emac_depentry
{
	u32 phandle;
	struct device_node *node;
	struct of_device *ofdev;
	void *drvdata;
};

#define	EMAC_DEP_MAL_IDX	0
#define	EMAC_DEP_ZMII_IDX	1
#define	EMAC_DEP_RGMII_IDX	2
#define	EMAC_DEP_TAH_IDX	3
#define	EMAC_DEP_MDIO_IDX	4
#define	EMAC_DEP_PREV_IDX	5
#define	EMAC_DEP_COUNT		6

static int __devinit emac_check_deps(struct emac_instance *dev,
		struct emac_depentry *deps)
{
	int i, there = 0;
	struct device_node *np;

	for (i = 0; i < EMAC_DEP_COUNT; i++)
	{
		/* no dependency on that item, allright */
		if (deps[i].phandle == 0)
		{
			there++;
			continue;
		}
		/* special case for blist as the dependency might go away */
		if (i == EMAC_DEP_PREV_IDX)
		{
			np = *(dev->blist - 1);
			if (np == NULL)
			{
				deps[i].phandle = 0;
				there++;
				continue;
			}
			if (deps[i].node == NULL)
				deps[i].node = of_node_get(np);
		}
		if (deps[i].node == NULL)
			deps[i].node = of_find_node_by_phandle(deps[i].phandle);
		if (deps[i].node == NULL)
			continue;
		if (deps[i].ofdev == NULL)
			deps[i].ofdev = of_find_device_by_node(deps[i].node);
		if (deps[i].ofdev == NULL)
			continue;
		if (deps[i].drvdata == NULL)
			deps[i].drvdata = dev_get_drvdata(&deps[i].ofdev->dev);
		if (deps[i].drvdata != NULL)
			there++;
	}
	return (there == EMAC_DEP_COUNT);
}

static void emac_put_deps(struct emac_instance *dev)
{
	if (dev->mal_dev)
		of_dev_put(dev->mal_dev);
	if (dev->zmii_dev)
		of_dev_put(dev->zmii_dev);
	if (dev->rgmii_dev)
		of_dev_put(dev->rgmii_dev);
	if (dev->mdio_dev)
		of_dev_put(dev->mdio_dev);
	if (dev->tah_dev)
		of_dev_put(dev->tah_dev);
}

static int __devinit emac_of_bus_notify(struct notifier_block *nb,
		unsigned long action, void *data)
{
	/* We are only intereted in device addition */
	if (action == BUS_NOTIFY_BOUND_DRIVER)
		wake_up_all(&emac_probe_wait);
	return 0;
}

static struct notifier_block emac_of_bus_notifier __devinitdata =
{ .notifier_call = emac_of_bus_notify };

static int __devinit emac_wait_deps(struct emac_instance *dev)
{
	struct emac_depentry deps[EMAC_DEP_COUNT];
	int i, err;

	memset(&deps, 0, sizeof(deps));

	deps[EMAC_DEP_MAL_IDX].phandle = dev->mal_ph;
	deps[EMAC_DEP_ZMII_IDX].phandle = dev->zmii_ph;
	deps[EMAC_DEP_RGMII_IDX].phandle = dev->rgmii_ph;
	if (dev->tah_ph)
		deps[EMAC_DEP_TAH_IDX].phandle = dev->tah_ph;
	if (dev->mdio_ph)
		deps[EMAC_DEP_MDIO_IDX].phandle = dev->mdio_ph;
	if (dev->blist && dev->blist > emac_boot_list)
		deps[EMAC_DEP_PREV_IDX].phandle = 0xffffffffu;
	bus_register_notifier(&of_platform_bus_type, &emac_of_bus_notifier);
	wait_event_timeout(emac_probe_wait, emac_check_deps(dev, deps),
			EMAC_PROBE_DEP_TIMEOUT);
	bus_unregister_notifier(&of_platform_bus_type, &emac_of_bus_notifier);
	err = emac_check_deps(dev, deps) ? 0 : -ENODEV;
	for (i = 0; i < EMAC_DEP_COUNT; i++)
	{
		if (deps[i].node)
			of_node_put(deps[i].node);
		if (err && deps[i].ofdev)
			of_dev_put(deps[i].ofdev);
	}
	if (err == 0)
	{
		dev->mal_dev = deps[EMAC_DEP_MAL_IDX].ofdev;
		dev->zmii_dev = deps[EMAC_DEP_ZMII_IDX].ofdev;
		dev->rgmii_dev = deps[EMAC_DEP_RGMII_IDX].ofdev;
		dev->tah_dev = deps[EMAC_DEP_TAH_IDX].ofdev;
		dev->mdio_dev = deps[EMAC_DEP_MDIO_IDX].ofdev;
	}
	if (deps[EMAC_DEP_PREV_IDX].ofdev)
		of_dev_put(deps[EMAC_DEP_PREV_IDX].ofdev);
	return err;
}

static int __devinit emac_read_uint_prop(struct device_node *np,
		const char *name, u32 *val, int fatal)
{
	int len;
	const u32 *prop = of_get_property(np, name, &len);
	if (prop == NULL || len < sizeof(u32))
	{
		if (fatal)
		printk(KERN_ERR "%s: missing %s property\n",
				np->full_name, name);
		return -ENODEV;
	}
	*val = *prop;
	return 0;
}

static int __devinit emac_init_phy(struct emac_instance *dev)
{
	struct device_node *np = dev->ofdev->node;
	struct net_device *ndev = dev->ndev;
	u32 phy_map, adv;
	int i;

	dev->phy.dev = ndev;
	dev->phy.mode = dev->phy_mode;

	/* PHY-less configuration.
	 * XXX I probably should move these settings to the dev tree
	 */
	if (dev->phy_address == 0xffffffff && dev->phy_map == 0xffffffff)
	{
		emac_reset(dev);

		/* PHY-less configuration.
		 * XXX I probably should move these settings to the dev tree
		 */
		dev->phy.address = -1;
		dev->phy.features = SUPPORTED_MII;
		if (emac_phy_supports_gige(dev->phy_mode))
			dev->phy.features |= SUPPORTED_1000baseT_Full;
		else
			dev->phy.features |= SUPPORTED_100baseT_Full;
		dev->phy.pause = 1;

		return 0;
	}

	mutex_lock(&emac_phy_map_lock);
	phy_map = dev->phy_map | busy_phy_map;

	DBG(dev, "PHY maps %08x %08x" NL, dev->phy_map, busy_phy_map);

	dev->phy.mdio_read = emac_mdio_read;
	dev->phy.mdio_write = emac_mdio_write;

	/* Enable internal clock source */
#ifdef CONFIG_PPC_DCR_NATIVE
	if (emac_has_feature(dev, EMAC_FTR_440GX_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_MFR, 0, SDR0_MFR_ECS);
#endif
	/* PHY clock workaround */
	emac_rx_clk_tx(dev);

	/* Enable internal clock source on 440GX*/
#ifdef CONFIG_PPC_DCR_NATIVE
	if (emac_has_feature(dev, EMAC_FTR_440GX_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_MFR, 0, SDR0_MFR_ECS);
#endif
	/* Configure EMAC with defaults so we can at least use MDIO
	 * This is needed mostly for 440GX
	 */
	if (emac_phy_gpcs(dev->phy.mode))
	{
		/* XXX
		 * Make GPCS PHY address equal to EMAC index.
		 * We probably should take into account busy_phy_map
		 * and/or phy_map here.
		 *
		 * Note that the busy_phy_map is currently global
		 * while it should probably be per-ASIC...
		 */
		dev->phy.gpcs_address = dev->gpcs_address;
		if (dev->phy.gpcs_address == 0xffffffff)
			dev->phy.address = dev->cell_index;
	}

	emac_configure(dev);

	if (dev->phy_address != 0xffffffff)
		phy_map = ~(1 << dev->phy_address);

	for (i = 0; i < 0x20; phy_map >>= 1, ++i)
		if (!(phy_map & 1))
		{
			int r;
			busy_phy_map |= 1 << i;

			/* Quick check if there is a PHY at the address */
			r = emac_mdio_read(dev->ndev, i, MII_BMCR);
			if (r == 0xffff || r < 0)
				continue;
			if (!emac_mii_phy_probe(&dev->phy, i))
				break;
		}

	/* Enable external clock source */
#ifdef CONFIG_PPC_DCR_NATIVE
	if (emac_has_feature(dev, EMAC_FTR_440GX_PHY_CLK_FIX))
	dcri_clrset(SDR0, SDR0_MFR, SDR0_MFR_ECS, 0);
#endif
	mutex_unlock(&emac_phy_map_lock);
	if (i == 0x20)
	{
		printk(KERN_WARNING "%s: can't find PHY!\n", np->full_name);
		return -ENXIO;
	}

	/* Init PHY */
	if (dev->phy.def->ops->init)
		dev->phy.def->ops->init(&dev->phy);

	/* Disable any PHY features not supported by the platform */
	dev->phy.def->features &= ~dev->phy_feat_exc;

	/* Setup initial link parameters */
	if (dev->phy.features & SUPPORTED_Autoneg)
	{
		adv = dev->phy.features;
		if (!emac_has_feature(dev, EMAC_FTR_NO_FLOW_CONTROL_40x))
			adv |= ADVERTISED_Pause | ADVERTISED_Asym_Pause;
		/* Restart autonegotiation */
		dev->phy.def->ops->setup_aneg(&dev->phy, adv);
	}
	else
	{
		u32 f = dev->phy.def->features;
		int speed = SPEED_10, fd = DUPLEX_HALF;

		/* Select highest supported speed/duplex */
		if (f & SUPPORTED_1000baseT_Full)
		{
			speed = SPEED_1000;
			fd = DUPLEX_FULL;
		}
		else if (f & SUPPORTED_1000baseT_Half)
			speed = SPEED_1000;
		else if (f & SUPPORTED_100baseT_Full)
		{
			speed = SPEED_100;
			fd = DUPLEX_FULL;
		}
		else if (f & SUPPORTED_100baseT_Half)
			speed = SPEED_100;
		else if (f & SUPPORTED_10baseT_Full)
			fd = DUPLEX_FULL;

		/* Force link parameters */
		dev->phy.def->ops->setup_forced(&dev->phy, speed, fd);
	}
	return 0;
}

static int __devinit emac_init_config(struct emac_instance *dev)
{
	struct device_node *np = dev->ofdev->node;
	const void *p;
	unsigned int plen;
	const char *pm, *phy_modes[] =
	{ [PHY_MODE_NA] = "", [PHY_MODE_MII] = "mii", [PHY_MODE_RMII] = "rmii",
			[PHY_MODE_SMII] = "smii", [PHY_MODE_RGMII] = "rgmii",
			[PHY_MODE_TBI] = "tbi", [PHY_MODE_GMII] = "gmii",
			[PHY_MODE_RTBI] = "rtbi", [PHY_MODE_SGMII] = "sgmii", };

	/* Read config from device-tree */
	if (emac_read_uint_prop(np, "mal-device", &dev->mal_ph, 1))
		return -ENXIO;
	if (emac_read_uint_prop(np, "mal-tx-channel", &dev->mal_tx_chan, 1))
		return -ENXIO;
	if (emac_read_uint_prop(np, "mal-rx-channel", &dev->mal_rx_chan, 1))
		return -ENXIO;
	if (emac_read_uint_prop(np, "cell-index", &dev->cell_index, 1))
		return -ENXIO;
	if (emac_read_uint_prop(np, "max-frame-size", &dev->max_mtu, 0))
		dev->max_mtu = 1500;
	if (emac_read_uint_prop(np, "rx-fifo-size", &dev->rx_fifo_size, 0))
		dev->rx_fifo_size = 2048;
	if (emac_read_uint_prop(np, "tx-fifo-size", &dev->tx_fifo_size, 0))
		dev->tx_fifo_size = 2048;
	if (emac_read_uint_prop(np, "rx-fifo-size-gige", &dev->rx_fifo_size_gige, 0))
		dev->rx_fifo_size_gige = dev->rx_fifo_size;
	if (emac_read_uint_prop(np, "tx-fifo-size-gige", &dev->tx_fifo_size_gige, 0))
		dev->tx_fifo_size_gige = dev->tx_fifo_size;
	if (emac_read_uint_prop(np, "phy-address", &dev->phy_address, 0))
		dev->phy_address = 0xffffffff;
	if (emac_read_uint_prop(np, "phy-map", &dev->phy_map, 0))
		dev->phy_map = 0xffffffff;
	if (emac_read_uint_prop(np, "gpcs-address", &dev->gpcs_address, 0))
		dev->gpcs_address = 0xffffffff;
	if (emac_read_uint_prop(np->parent, "clock-frequency", &dev->opb_bus_freq,
			1))
		return -ENXIO;
	if (emac_read_uint_prop(np, "tah-device", &dev->tah_ph, 0))
		dev->tah_ph = 0;
	if (emac_read_uint_prop(np, "tah-channel", &dev->tah_port, 0))
		dev->tah_port = 0;
	if (emac_read_uint_prop(np, "mdio-device", &dev->mdio_ph, 0))
		dev->mdio_ph = 0;
	if (emac_read_uint_prop(np, "zmii-device", &dev->zmii_ph, 0))
		dev->zmii_ph = 0;
	;
	if (emac_read_uint_prop(np, "zmii-channel", &dev->zmii_port, 0))
		dev->zmii_port = 0xffffffff;
	;
	if (emac_read_uint_prop(np, "rgmii-device", &dev->rgmii_ph, 0))
		dev->rgmii_ph = 0;
	;
	if (emac_read_uint_prop(np, "rgmii-channel", &dev->rgmii_port, 0))
		dev->rgmii_port = 0xffffffff;
	;
	if (emac_read_uint_prop(np, "fifo-entry-size", &dev->fifo_entry_size, 0))
		dev->fifo_entry_size = 16;
	if (emac_read_uint_prop(np, "mal-burst-size", &dev->mal_burst_size, 0))
		dev->mal_burst_size = 256;

	/* PHY mode needs some decoding */
	dev->phy_mode = PHY_MODE_NA;
	pm = of_get_property(np, "phy-mode", &plen);
	if (pm != NULL)
	{
		int i;
		for (i = 0; i < ARRAY_SIZE(phy_modes); i++)
			if (!strcasecmp(pm, phy_modes[i]))
			{
				dev->phy_mode = i;
				break;
			}
	}

	/* Backward compat with non-final DT */
	if (dev->phy_mode == PHY_MODE_NA && pm != NULL && plen == 4)
	{
		u32 nmode = *(const u32 *) pm;
		if (nmode > PHY_MODE_NA && nmode <= PHY_MODE_SGMII)
			dev->phy_mode = nmode;
	}

	/* Check EMAC version */
	if (of_device_is_compatible(np, "ibm,emac4sync"))
	{
		dev->features |= (EMAC_FTR_EMAC4 | EMAC_FTR_EMAC4SYNC);
		if (of_device_is_compatible(np, "ibm,emac-460ex")
				|| of_device_is_compatible(np, "ibm,emac-460gt"))
			dev->features |= EMAC_FTR_460EX_PHY_CLK_FIX;
		if (of_device_is_compatible(np, "ibm,emac-405ex")
				|| of_device_is_compatible(np, "ibm,emac-405exr"))
			dev->features |= EMAC_FTR_440EP_PHY_CLK_FIX;
	}
	else if (of_device_is_compatible(np, "ibm,emac4"))
	{
		dev->features |= EMAC_FTR_EMAC4;
		if (of_device_is_compatible(np, "ibm,emac-440gx"))
			dev->features |= EMAC_FTR_440GX_PHY_CLK_FIX;
	}
	else
	{
		if (of_device_is_compatible(np, "ibm,emac-440ep")
				|| of_device_is_compatible(np, "ibm,emac-440gr"))
			dev->features |= EMAC_FTR_440EP_PHY_CLK_FIX;
		if (of_device_is_compatible(np, "ibm,emac-405ez"))
		{
#ifdef CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL
			dev->features |= EMAC_FTR_NO_FLOW_CONTROL_40x;
#else
			printk(KERN_ERR "%s: Flow control not disabled!\n",
					np->full_name);
			return -ENXIO;
#endif
		}

	}

	/* Fixup some feature bits based on the device tree */
	if (of_get_property(np, "has-inverted-stacr-oc", NULL))
		dev->features |= EMAC_FTR_STACR_OC_INVERT;
	if (of_get_property(np, "has-new-stacr-staopc", NULL))
		dev->features |= EMAC_FTR_HAS_NEW_STACR;

	/* CAB lacks the appropriate properties */
	if (of_device_is_compatible(np, "ibm,emac-axon"))
		dev->features |= EMAC_FTR_HAS_NEW_STACR | EMAC_FTR_STACR_OC_INVERT;

	/* Enable TAH/ZMII/RGMII features as found */
	if (dev->tah_ph != 0)
	{
#ifdef CONFIG_IBM_NEW_EMAC_TAH
		dev->features |= EMAC_FTR_HAS_TAH;
#else
		printk(KERN_ERR "%s: TAH support not enabled !\n",
				np->full_name);
		return -ENXIO;
#endif
	}

	if (dev->zmii_ph != 0)
	{
#ifdef CONFIG_IBM_NEW_EMAC_ZMII
		dev->features |= EMAC_FTR_HAS_ZMII;
#else
		printk(KERN_ERR "%s: ZMII support not enabled !\n",
				np->full_name);
		return -ENXIO;
#endif
	}

	if (dev->rgmii_ph != 0)
	{
#ifdef CONFIG_IBM_NEW_EMAC_RGMII
		dev->features |= EMAC_FTR_HAS_RGMII;
#else
		printk(KERN_ERR "%s: RGMII support not enabled !\n",
				np->full_name);
		return -ENXIO;
#endif
	}

	/* Read MAC-address */
	p = of_get_property(np, "local-mac-address", NULL);
	if (p == NULL)
	{
		printk(KERN_ERR "%s: Can't find local-mac-address property\n",
				np->full_name);
		return -ENXIO;
	}
	memcpy(dev->ndev->dev_addr, p, 6);

	/* IAHT and GAHT filter parameterization */
	if (emac_has_feature(dev, EMAC_FTR_EMAC4SYNC))
	{
		dev->xaht_slots_shift = EMAC4SYNC_XAHT_SLOTS_SHIFT;
		dev->xaht_width_shift = EMAC4SYNC_XAHT_WIDTH_SHIFT;
	}
	else
	{
		dev->xaht_slots_shift = EMAC4_XAHT_SLOTS_SHIFT;
		dev->xaht_width_shift = EMAC4_XAHT_WIDTH_SHIFT;
	}

	DBG(dev, "features     : 0x%08x / 0x%08x\n", dev->features, EMAC_FTRS_POSSIBLE);
	DBG(dev, "tx_fifo_size : %d (%d gige)\n", dev->tx_fifo_size, dev->tx_fifo_size_gige);
	DBG(dev, "rx_fifo_size : %d (%d gige)\n", dev->rx_fifo_size, dev->rx_fifo_size_gige);
	DBG(dev, "max_mtu      : %d\n", dev->max_mtu);
	DBG(dev, "OPB freq     : %d\n", dev->opb_bus_freq);

	return 0;
}

static const struct net_device_ops emac_netdev_ops =
{ .ndo_open = emac_open, .ndo_stop = emac_close, .ndo_get_stats = emac_stats,
		.ndo_set_multicast_list = emac_set_multicast_list,
		.ndo_do_ioctl = emac_ioctl, .ndo_tx_timeout = emac_tx_timeout,
		.ndo_validate_addr = eth_validate_addr,
		.ndo_set_mac_address = eth_mac_addr, .ndo_start_xmit = emac_start_xmit,
		.ndo_change_mtu = eth_change_mtu, };

static const struct net_device_ops
		emac_gige_netdev_ops =
		{ .ndo_open = emac_open, .ndo_stop = emac_close,
				.ndo_get_stats = emac_stats,
				.ndo_set_multicast_list = emac_set_multicast_list,
				.ndo_do_ioctl = emac_ioctl, .ndo_tx_timeout = emac_tx_timeout,
				.ndo_validate_addr = eth_validate_addr,
				.ndo_set_mac_address = eth_mac_addr,
				.ndo_start_xmit = emac_start_xmit_sg,
				.ndo_change_mtu = emac_change_mtu, };

static int __devinit emac_probe(struct of_device *ofdev,
		const struct of_device_id *match)
{
	struct net_device *ndev;
	struct emac_instance *dev;
	struct device_node *np = ofdev->node;
	struct device_node **blist = NULL;
	int err, i;

	/* Skip unused/unwired EMACS.  We leave the check for an unused
	 * property here for now, but new flat device trees should set a
	 * status property to "disabled" instead.
	 */
	if (of_get_property(np, "unused", NULL) || !of_device_is_available(np))
		return -ENODEV;

	/* Find ourselves in the bootlist if we are there */
	for (i = 0; i < EMAC_BOOT_LIST_SIZE; i++)
		if (emac_boot_list[i] == np)
			blist = &emac_boot_list[i];

	/* Allocate our net_device structure */
	err = -ENOMEM;
	ndev = alloc_etherdev(sizeof(struct emac_instance));
	if (!ndev)
	{
		printk(KERN_ERR "%s: could not allocate ethernet device!\n",
				np->full_name);
		goto err_gone;
	}
	dev = netdev_priv(ndev);
	dev->ndev = ndev;
	dev->ofdev = ofdev;
	dev->blist = blist;
	SET_NETDEV_DEV(ndev, &ofdev->dev);

	/* Initialize some embedded data structures */
	mutex_init(&dev->mdio_lock);
	mutex_init(&dev->link_lock);
	spin_lock_init(&dev->lock);
	INIT_WORK(&dev->reset_work, emac_reset_work);

	/* Init various config data based on device-tree */
	err = emac_init_config(dev);
	if (err != 0)
		goto err_free;

	/* Get interrupts. EMAC irq is mandatory, WOL irq is optional */
	dev->emac_irq = irq_of_parse_and_map(np, 0);
	dev->wol_irq = irq_of_parse_and_map(np, 1);
	if (dev->emac_irq == NO_IRQ)
	{
		printk(KERN_ERR "%s: Can't map main interrupt\n", np->full_name);
		goto err_free;
	}
	ndev->irq = dev->emac_irq;

	/* Map EMAC regs */
	if (of_address_to_resource(np, 0, &dev->rsrc_regs))
	{
		printk(KERN_ERR "%s: Can't get registers address\n",
				np->full_name);
		goto err_irq_unmap;
	}
	// TODO : request_mem_region
	dev->emacp = ioremap(dev->rsrc_regs.start, dev->rsrc_regs.end
			- dev->rsrc_regs.start + 1);
	if (dev->emacp == NULL)
	{
		printk(KERN_ERR "%s: Can't map device registers!\n",
				np->full_name);
		err = -ENOMEM;
		goto err_irq_unmap;
	}

	/* Wait for dependent devices */
	err = emac_wait_deps(dev);
	if (err)
	{
		printk(KERN_ERR
				"%s: Timeout waiting for dependent devices\n",
				np->full_name);
		/*  display more info about what's missing ? */
		goto err_reg_unmap;
	}
	dev->mal = dev_get_drvdata(&dev->mal_dev->dev);
	if (dev->mdio_dev != NULL)
		dev->mdio_instance = dev_get_drvdata(&dev->mdio_dev->dev);

	/* Register with MAL */
	dev->commac.ops = &emac_commac_ops;
	dev->commac.dev = dev;
	dev->commac.tx_chan_mask = MAL_CHAN_MASK(dev->mal_tx_chan);
	dev->commac.rx_chan_mask = MAL_CHAN_MASK(dev->mal_rx_chan);
	err = mal_register_commac(dev->mal, &dev->commac);
	if (err)
	{
		printk(KERN_ERR "%s: failed to register with mal %s!\n",
				np->full_name, dev->mal_dev->node->full_name);
		goto err_rel_deps;
	}
	dev->rx_skb_size = emac_rx_skb_size(ndev->mtu);
	dev->rx_sync_size = emac_rx_sync_size(ndev->mtu);

	if (dev->rx_skb_size > PAGE_SIZE)
	{
		dev->rx_skb_size = PAGE_SIZE;
	}

	/* Get pointers to BD rings */
	dev->tx_desc = dev->mal->bd_virt + mal_tx_bd_offset(dev->mal,
			dev->mal_tx_chan);
	dev->rx_desc = dev->mal->bd_virt + mal_rx_bd_offset(dev->mal,
			dev->mal_rx_chan);

	DBG(dev, "tx_desc %p" NL, dev->tx_desc);
	DBG(dev, "rx_desc %p" NL, dev->rx_desc);

	/* Clean rings */
	memset(dev->tx_desc, 0, NUM_TX_BUFF * sizeof(struct mal_descriptor));
	memset(dev->rx_desc, 0, NUM_RX_BUFF * sizeof(struct mal_descriptor));
	memset(dev->tx_skb, 0, NUM_TX_BUFF * sizeof(struct sk_buff *));
	memset(dev->rx_skb, 0, NUM_RX_BUFF * sizeof(struct sk_buff *));

	/* Attach to ZMII, if needed */
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII) && (err
			= zmii_attach(dev->zmii_dev, dev->zmii_port, &dev->phy_mode)) != 0)
		goto err_unreg_commac;

	/* Attach to RGMII, if needed */
	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII) && (err
			= rgmii_attach(dev->rgmii_dev, dev->rgmii_port, dev->phy_mode))
			!= 0)
		goto err_detach_zmii;

	/* Attach to TAH, if needed */
	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH) && (err
			= tah_attach(dev->tah_dev, dev->tah_port)) != 0)
		goto err_detach_rgmii;

	/* Set some link defaults before we can find out real parameters */
	dev->phy.speed = SPEED_100;
	dev->phy.duplex = DUPLEX_FULL;
	dev->phy.autoneg = AUTONEG_DISABLE;
	dev->phy.pause = dev->phy.asym_pause = 0;
	dev->stop_timeout = STOP_TIMEOUT_100;
	INIT_DELAYED_WORK(&dev->link_work, emac_link_timer);

	/* Find PHY if any */
	err = emac_init_phy(dev);
	if (err != 0)
		goto err_detach_tah;

	printk(KERN_INFO "ENABLING HIGHDMA!\n");
	ndev->features |= NETIF_F_HIGHDMA;

	if (dev->tah_dev)
		ndev->features |= NETIF_F_IP_CSUM | NETIF_F_SG;
	ndev->watchdog_timeo = 5 * HZ;
	if (emac_phy_supports_gige(dev->phy_mode))
	{
		ndev->netdev_ops = &emac_gige_netdev_ops;
		dev->commac.ops = &emac_commac_sg_ops;
	}
	else
		ndev->netdev_ops = &emac_netdev_ops;
	SET_ETHTOOL_OPS(ndev, &emac_ethtool_ops);

	netif_carrier_off(ndev);
	netif_stop_queue(ndev);

	err = register_netdev(ndev);
	if (err)
	{
		printk(KERN_ERR "%s: failed to register net device (%d)!\n",
				np->full_name, err);
		goto err_detach_tah;
	}

	/* Set our drvdata last as we don't want them visible until we are
	 * fully initialized
	 */
	wmb();
	dev_set_drvdata(&ofdev->dev, dev);

	/* There's a new kid in town ! Let's tell everybody */
	wake_up_all(&emac_probe_wait);

	printk(KERN_INFO "%s: EMAC-%d %s, MAC %pM\n",
			ndev->name, dev->cell_index, np->full_name, ndev->dev_addr);

	if (dev->phy_mode == PHY_MODE_SGMII)
	printk(KERN_NOTICE "%s: in SGMII mode\n", ndev->name);

	if (dev->phy.address >= 0)
		printk("%s: found %s PHY (0x%02x)\n", ndev->name, dev->phy.def->name,
				dev->phy.address);

	emac_dbg_register(dev);

	/* Life is good */
	return 0;

	/* I have a bad feeling about this ... */

	err_detach_tah: if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
		tah_detach(dev->tah_dev, dev->tah_port);
	err_detach_rgmii: if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
		rgmii_detach(dev->rgmii_dev, dev->rgmii_port);
	err_detach_zmii: if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
		zmii_detach(dev->zmii_dev, dev->zmii_port);
	err_unreg_commac: mal_unregister_commac(dev->mal, &dev->commac);
	err_rel_deps: emac_put_deps(dev);
	err_reg_unmap: iounmap(dev->emacp);
	err_irq_unmap: if (dev->wol_irq != NO_IRQ)
		irq_dispose_mapping(dev->wol_irq);
	if (dev->emac_irq != NO_IRQ)
		irq_dispose_mapping(dev->emac_irq);
	err_free: kfree(ndev);
	err_gone:
	/* if we were on the bootlist, remove us as we won't show up and
	 * wake up all waiters to notify them in case they were waiting
	 * on us
	 */
	if (blist)
	{
		*blist = NULL;
		wake_up_all(&emac_probe_wait);
	}
	return err;
}

static int __devexit emac_remove(struct of_device *ofdev)
{
	struct emac_instance *dev = dev_get_drvdata(&ofdev->dev);

	DBG(dev, "remove" NL);

	dev_set_drvdata(&ofdev->dev, NULL);

	unregister_netdev(dev->ndev);

	flush_scheduled_work();

	if (emac_has_feature(dev, EMAC_FTR_HAS_TAH))
	tah_detach(dev->tah_dev, dev->tah_port);
	if (emac_has_feature(dev, EMAC_FTR_HAS_RGMII))
	rgmii_detach(dev->rgmii_dev, dev->rgmii_port);
	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
	zmii_detach(dev->zmii_dev, dev->zmii_port);

	mal_unregister_commac(dev->mal, &dev->commac);
	emac_put_deps(dev);

	emac_dbg_unregister(dev);
	iounmap(dev->emacp);

	if (dev->wol_irq != NO_IRQ)
	irq_dispose_mapping(dev->wol_irq);
	if (dev->emac_irq != NO_IRQ)
	irq_dispose_mapping(dev->emac_irq);

	kfree(dev->ndev);

	return 0;
}

/* XXX Features in here should be replaced by properties... */
static struct of_device_id emac_match[] =
{
{ .type = "network", .compatible = "ibm,emac", },
{ .type = "network", .compatible = "ibm,emac4", },
{ .type = "network", .compatible = "ibm,emac4sync", },
{ }, };

static struct of_platform_driver emac_driver =
{ .name = "emac", .match_table = emac_match,

.probe = emac_probe, .remove = emac_remove, };

static void __init emac_make_bootlist(void)
{
	struct device_node *np = NULL;
	int j, max, i = 0, k;
	int cell_indices[EMAC_BOOT_LIST_SIZE];

	/* Collect EMACs */
	while((np = of_find_all_nodes(np)) != NULL)
	{
		const u32 *idx;

		if (of_match_node(emac_match, np) == NULL)
		continue;
		if (of_get_property(np, "unused", NULL))
		continue;
		idx = of_get_property(np, "cell-index", NULL);
		if (idx == NULL)
		continue;
		cell_indices[i] = *idx;
		emac_boot_list[i++] = of_node_get(np);
		if (i >= EMAC_BOOT_LIST_SIZE)
		{
			of_node_put(np);
			break;
		}
	}
	max = i;

	/* Bubble sort them (doh, what a creative algorithm :-) */
	for (i = 0; max > 1 && (i < (max - 1)); i++)
	for (j = i; j < max; j++)
	{
		if (cell_indices[i] > cell_indices[j])
		{
			np = emac_boot_list[i];
			emac_boot_list[i] = emac_boot_list[j];
			emac_boot_list[j] = np;
			k = cell_indices[i];
			cell_indices[i] = cell_indices[j];
			cell_indices[j] = k;
		}
	}
}

static int __init emac_init(void)
{
	int rc;

	printk(KERN_INFO DRV_DESC ", version " DRV_VERSION "\n");

	/* Init debug stuff */
	emac_init_debug();

	/* Build EMAC boot list */
	emac_make_bootlist();

	/* Init submodules */
	rc = mal_init();
	if (rc)
	goto err;
	rc = zmii_init();
	if (rc)
	goto err_mal;
	rc = rgmii_init();
	if (rc)
	goto err_zmii;
	rc = tah_init();
	if (rc)
	goto err_rgmii;
	rc = of_register_platform_driver(&emac_driver);
	if (rc)
	goto err_tah;

	return 0;

	err_tah:
	tah_exit();
	err_rgmii:
	rgmii_exit();
	err_zmii:
	zmii_exit();
	err_mal:
	mal_exit();
	err:
	return rc;
}

static void __exit emac_exit(void)
{
	int i;

	of_unregister_platform_driver(&emac_driver);

	tah_exit();
	rgmii_exit();
	zmii_exit();
	mal_exit();
	emac_fini_debug();

	/* Destroy EMAC boot list */
	for (i = 0; i < EMAC_BOOT_LIST_SIZE; i++)
	if (emac_boot_list[i])
	of_node_put(emac_boot_list[i]);
}

module_init(emac_init);
module_exit(emac_exit);

^ permalink raw reply

* RE: Memory allocation modifications in ibm_newemac driver
From: Jonathan Haws @ 2010-09-01 21:18 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5ABE8A5E96C0364CAF2B4DC1DEB7CEAC127A2F09@Mercury.usurf.usu.edu>

I found out what was causing the crash, but still am not there and could us=
e some direction:=0A=
=0A=
What was happening was that I was not allocating a new SKB to replace the o=
ne in the ring that was being passed up the stack.  I have remedied that an=
d am now having another issue:=0A=
=0A=
Once the ring index rolls over (it does so at 64) I start to lose packets b=
ecause they are not being handled correctly (or do not contain the correct =
headers or something of that sort).  Here is a simple ping test showing wha=
t is happening:=0A=
=0A=
64 bytes from 172.31.22.21: seq=3D29 ttl=3D128 time=3D10.826 ms=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 54 0X1800 110 0XC04E6B80=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 55 0X1800 110 0XC04E6EC0=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 56 0X1800 98 0XC04E70C0=0A=
64 bytes from 172.31.22.21: seq=3D30 ttl=3D128 time=3D10.839 ms=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 57 0X1800 110 0XC04E6580=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 58 0X1800 98 0XC04E5B80=0A=
64 bytes from 172.31.22.21: seq=3D31 ttl=3D128 time=3D10.832 ms=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 59 0X1800 219 0XC04E5740=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 60 0X1800 249 0XC04E5000=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 61 0X1800 92 0XC04E5340=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 62 0X1800 98 0XC04E4A00=0A=
64 bytes from 172.31.22.21: seq=3D32 ttl=3D128 time=3D10.825 ms=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 63 0X5800 92 0XC04E4E00=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 0 0X1800 98 0XC04EF520=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 1 0X1800 92 0XC04D8340=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 2 0X1800 98 0XC04D8260=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 3 0X1800 60 0XC04E7AA0=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 4 0X1800 92 0XC04E74A0=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 5 0X1800 92 0XC04E6BC0=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 6 0X1800 98 0XC04E6F80=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 7 0X1800 92 0XC04E64E0=0A=
emac/plb/opb/ethernet@ef600900: PACKET: 8 0X1800 98 0XC04E5BC0=0A=
=0A=
The first number in my debug print statement is what the driver calls the s=
lot number (the ring index).  When it rolls over I start losing the ping re=
plies.  What have I done wrong to cause that?  The data is coming in and is=
 there.  Have any other network device developers seen similar behavior?=0A=
=0A=
Thanks,=0A=
=0A=
Jonathan=0A=
=0A=

^ permalink raw reply

* RE: Memory allocation modifications in ibm_newemac driver
From: Jonathan Haws @ 2010-09-01 22:04 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5ABE8A5E96C0364CAF2B4DC1DEB7CEAC127A2FBE@Mercury.usurf.usu.edu>

Okay, I think I have all the issues worked out and can now send and receive=
 any size packet without a hiccup.  I have tested this in our system setup =
as well with data being sent out to disk and did not see any problems there=
 either (since it only ever allocates a single page, never more).=0A=
=0A=
Is this something that may be wanted in the mainline?  I have not run full =
benchmarks, but I anticipate that my modified driver is slightly slower tha=
n the mainline driver because we keep track of an SKB ring, as well as a ri=
ng of pages and allocate both on each packet received.=0A=
=0A=
Thanks,=0A=
=0A=
Jonathan=0A=
=0A=

^ permalink raw reply

* RE: Memory allocation modifications in ibm_newemac driver
From: Jonathan Haws @ 2010-09-01 22:46 UTC (permalink / raw)
  To: Jonathan Haws, linux-kernel@vger.kernel.org,
	linux-net@vger.kernel.org, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5ABE8A5E96C0364CAF2B4DC1DEB7CEAC127A301B@Mercury.usurf.usu.edu>

Apparently I spoke too soon - sorry about that.  I am still getting the err=
or when I try to write to disk and receive on the network at the same time.=
  Here is the output:=0A=
=0A=
blastee: page allocation failure. order:1, mode:0x4020=0A=
Call Trace:=0A=
[ccea9a40] [c0006ef0] show_stack+0x44/0x16c (unreliable)=0A=
[ccea9a80] [c006f9f0] __alloc_pages_nodemask+0x38c/0x4f8=0A=
[ccea9b00] [c0095008] __slab_alloc+0x594/0x5e0=0A=
[ccea9b40] [c0095a08] __kmalloc_track_caller+0xe8/0xf0=0A=
[ccea9b60] [c01c848c] __alloc_skb+0x60/0x140=0A=
[ccea9b80] [c01a7df8] emac_poll_rx+0x568/0x768=0A=
[ccea9bc0] [c01a28e4] mal_poll+0xa8/0x1ec=0A=
[ccea9bf0] [c01d3eec] net_rx_action+0x9c/0x1b4=0A=
[ccea9c20] [c003b3c0] __do_softirq+0xc4/0x148=0A=
[ccea9c60] [c0004d18] do_softirq+0x78/0x80=0A=
[ccea9c70] [c003b67c] local_bh_enable+0xc0/0xd8=0A=
[ccea9c80] [c01c29bc] lock_sock_nested+0xc0/0xdc=0A=
[ccea9cc0] [c0212cb4] udp_recvmsg+0x318/0x3a4=0A=
[ccea9d10] [c01c2334] sock_common_recvmsg+0x3c/0x60=0A=
[ccea9d30] [c01c06c4] sock_recvmsg+0xb8/0xf0=0A=
[ccea9e20] [c01c09b0] sys_recvfrom+0x8c/0xfc=0A=
[ccea9f00] [c01c18d4] sys_socketcall+0x128/0x1f8=0A=
[ccea9f40] [c000f434] ret_from_syscall+0x0/0x3c=0A=
Mem-Info:=0A=
DMA per-cpu:=0A=
CPU    0: hi:   90, btch:  15 usd:  48=0A=
Active_anon:28 active_file:807 inactive_anon:85=0A=
 inactive_file:171 unevictable:0 dirty:0 writeback:0 unstable:0=0A=
 free:506 slab:53530 mapped:362 pagetables:19 bounce:0=0A=
DMA free:2024kB min:2036kB low:2544kB high:3052kB active_anon:112kB inactiv=
e_anon:340kB active_file:3228kB inactive_file:684kB unevictable:0kB present=
:260096kB pages_scanned:0 all_unreclaimable? no=0A=
lowmem_reserve[]: 0 0 0=0A=
DMA: 410*4kB 28*8kB 4*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0=
*2048kB 0*4096kB =3D 2024kB=0A=
978 total pagecache pages=0A=
0 pages in swap cache=0A=
Swap cache stats: add 0, delete 0, find 0/0=0A=
Free swap  =3D 0kB=0A=
Total swap =3D 0kB=0A=
65536 pages RAM=0A=
1400 pages reserved=0A=
1001 pages shared=0A=
62828 pages non-shared=0A=
SLUB: Unable to allocate memory on node -1 (gfp=3D0x20)=0A=
  cache: kmalloc-8192, object size: 8192, buffer size: 8192, default order:=
 3, min order: 1=0A=
  node 0: slabs: 7140, objs: 25809, free: 0=0A=
=0A=
Can anyone explain to me why I would be getting this error in the first pla=
ce?  Why is it failing to allocate a page when there are pages available?  =
That does not make any sense to me.=0A=
=0A=
Thanks,=0A=
=0A=
Jonathan=

^ permalink raw reply

* Re: [v1 PATCH] ucc_geth: fix ethtool set ring param bug
From: Liang Li @ 2010-09-02  0:50 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, avorontsov, davem, linuxppc-dev
In-Reply-To: <1283348550.2268.7.camel@achroite.uk.solarflarecom.com>

On Wed, Sep 01, 2010 at 02:42:30PM +0100, Ben Hutchings wrote:
> On Wed, 2010-09-01 at 09:43 +0800, Liang Li wrote:
> > It's common sense that when we should do change to driver ring
> > desc/buffer etc only after 'stop/shutdown' the device. When we
> > do change while devices/driver is running, kernel oops occur:
> [...]
> > -	ug_info->bdRingLenRx[queue] = ring->rx_pending;
> > -	ug_info->bdRingLenTx[queue] = ring->tx_pending;
> > -
> >  	if (netif_running(netdev)) {
> > -		/* FIXME: restart automatically */
> > -		printk(KERN_INFO
> > -			"Please re-open the interface.\n");
> > +		u16 rx_t;
> > +		u16 tx_t;
> > +		printk(KERN_INFO "Stopping interface %s.\n", netdev->name);
> > +		ucc_geth_close(netdev);
> > +
> > +		rx_t = ug_info->bdRingLenRx[queue];
> > +		tx_t = ug_info->bdRingLenTx[queue];
> > +
> > +		ug_info->bdRingLenRx[queue] = ring->rx_pending;
> > +		ug_info->bdRingLenTx[queue] = ring->tx_pending;
> > +
> > +		printk(KERN_INFO "Reactivating interface %s.\n", netdev->name);
> > +		ret = ucc_geth_open(netdev);
> > +		if (ret) {
> > +			printk(KERN_WARNING "uec_set_ringparam: set ring param for running"
> > +					" interface %s failed. Please try to make the interface "
> > +					" down, then try again.\n", netdev->name);
> > +			ug_info->bdRingLenRx[queue] = rx_t;
> > +			ug_info->bdRingLenTx[queue] = tx_t;
> > +		}
> [...]
> 
> Bringing the interface down will call ucc_geth_close(), which will try
> to free resources that have not been allocated!

Sorry, I did not understand you on this point. There is no
ucc_geth_close when 'open fail'. What you mean here exactly?

> 
> If you cannot roll back a failed change then at least use dev_close()
> and dev_open() rather than calling ucc_geth_{close,open}() directly, so
> that the interface state is updated correctly.  Or just refuse to make
> the change if the interface is up.

That make things more simply but I do not think that is necessary.
Since there is no such restriction exist in most NIC drivers. :)
Actually I did not see the 'fail of reopen' case. So I assume you
may witnessed similar 'open fail' case in some rare cases. May you
please give more input on this then I can 'make re-open safer' here.

Thanks,
				-Liang Li

> 
> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
From: Michael Neuling @ 2010-09-02  1:02 UTC (permalink / raw)
  To: Darren Hart
  Cc: Stephen Rothwell, Gautham R Shenoy, Josh Triplett, Steven Rostedt,
	linuxppc-dev, Will Schmidt, Paul Mackerras, Ankita Garg,
	Thomas Gleixner
In-Reply-To: <4C7EBAA8.7030601@us.ibm.com>



In message <4C7EBAA8.7030601@us.ibm.com> you wrote:
> On 09/01/2010 12:59 PM, Steven Rostedt wrote:
> > On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:
> > 
> >> from tip/rt/2.6.33 causes the preempt_count() to change across the cede
> >> call.  This patch appears to prevents the proxy preempt_count assignment
> >> from happening. This non-local-cpu assignment to 0 would cause an
> >> underrun of preempt_count() if the local-cpu had disabled preemption
> >> prior to the assignment and then later tried to enable it. This appears
> >> to be the case with the stack of __trace_hcall* calls preceeding the
> >> return from extended_cede_processor() in the latency format trace-cmd
> >> report:
> >>
> >>   <idle>-0       1d....   201.252737: function:             .cpu_die
> > 
> > Note, the above 1d.... is a series of values. The first being the CPU,
> > the next if interrupts are disabled, the next if the NEED_RESCHED flag
> > is set, the next is softirqs enabled or disabled, next the
> > preempt_count, and finally the lockdepth count.
> > 
> > Here we only care about the preempt_count, which is zero when '.' and a
> > number if it is something else. It is the second to last field in that
> > list.
> > 
> > 
> >>   <idle>-0       1d....   201.252738: function:                .pseries_ma
ch_cpu_die
> >>   <idle>-0       1d....   201.252740: function:                   .idle_ta
sk_exit
> >>   <idle>-0       1d....   201.252741: function:                      .swit
ch_slb
> >>   <idle>-0       1d....   201.252742: function:                   .xics_te
ardown_cpu
> >>   <idle>-0       1d....   201.252743: function:                      .xics
_set_cpu_priority
> >>   <idle>-0       1d....   201.252744: function:             .__trace_hcall
_entry
> >>   <idle>-0       1d..1.   201.252745: function:                .probe_hcal
l_entry
> > 
> >                        ^
> >                 preempt_count set to 1
> > 
> >>   <idle>-0       1d..1.   201.252746: function:             .__trace_hcall
_exit
> >>   <idle>-0       1d..2.   201.252747: function:                .probe_hcal
l_exit
> >>   <idle>-0       1d....   201.252748: function:             .__trace_hcall
_entry
> >>   <idle>-0       1d..1.   201.252748: function:                .probe_hcal
l_entry
> >>   <idle>-0       1d..1.   201.252750: function:             .__trace_hcall
_exit
> >>   <idle>-0       1d..2.   201.252751: function:                .probe_hcal
l_exit
> >>   <idle>-0       1d....   201.252752: function:             .__trace_hcall
_entry
> >>   <idle>-0       1d..1.   201.252753: function:                .probe_hcal
l_entry
> >                    ^   ^
> >                   CPU  preempt_count
> > 
> > Entering the function probe_hcall_entry() the preempt_count is 1 (see
> > below). But probe_hcall_entry does:
> > 
> > 	h = &get_cpu_var(hcall_stats)[opcode / 4];
> > 
> > Without doing the put (which it does in probe_hcall_exit())
> > 
> > So exiting the probe_hcall_entry() the prempt_count is 2.
> > The trace_hcall_entry() will do a preempt_enable() making it leave as 1.
> > 
> > 
> >>   offon.sh-3684  6.....   201.466488: bprint:               .smp_pSeries_k
ick_cpu : resetting pcnt to 0 for cpu 1
> > 
> > This is CPU 6, changing the preempt count from 1 to 0.
> > 
> >>
> >> preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
> >> QCSS_NOT_STOPPED check from the patch above.
> >>
> >>   <idle>-0       1d....   201.466503: function:             .__trace_hcall
_exit
> > 
> > Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:
> > 
> >  preempt_disable();
> >  call probe();
> >  preempt_enable();
> > 
> > 
> >>   <idle>-0       1d..1.   201.466505: function:                .probe_hcal
l_exit
> > 
> > The preempt_count of 1 entering the probe_hcall_exit() is because of the
> > preempt_disable() shown above. It should have been entered as a 2.
> > 
> > But then it does:
> > 
> > 
> > 	put_cpu_var(hcall_stats);
> > 
> > making preempt_count 0.
> > 
> > But the preempt_enable() in the trace_hcall_exit() causes this to be -1.
> > 
> > 
> >>   <idle>-0       1d.Hff.   201.466507: bprint:               .pseries_mach
_cpu_die : after cede: ffffffff
> >>
> >> With the preempt_count() being one less than it should be, the final
> >> preempt_enable() in the trace_hcall path drops preempt_count to
> >> 0xffffffff, which of course is an illegal value and leads to a crash.
> > 
> > I'm confused to how this works in mainline?
> 
> Turns out it didn't. 2.6.33.5 with CONFIG_PREEMPT=y sees this exact same
> behavior. The following, part of the 2.6.33.6 stable release, prevents
> this from happening:
> 
> aef40e87d866355ffd279ab21021de733242d0d5
> powerpc/pseries: Only call start-cpu when a CPU is stopped
> 
> --- a/arch/powerpc/platforms/pseries/smp.c
> +++ b/arch/powerpc/platforms/pseries/smp.c
> @@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned
> int lcpu)
> 
>         pcpu = get_hard_smp_processor_id(lcpu);
> 
> +       /* Check to see if the CPU out of FW already for kexec */
> +       if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
> +               cpu_set(lcpu, of_spin_map);
> +               return 1;
> +       }
> +
>         /* Fixup atomic count: it exited inside IRQ handler. */
>         task_thread_info(paca[lcpu].__current)->preempt_count   = 0;
> 
> The question is now, Is this the right fix? If so, perhaps we can update
> the comment to be a bit more clear and not refer solely to kexec.
> 
> Michael Neuling, can you offer any thoughts here? We hit this EVERY
> TIME, which makes me wonder if the offline/online path could do this
> without calling smp_startup_cpu at all.

We need to call smp_startup_cpu on boot when we the cpus are still in
FW.  smp_startup_cpu does this for us on boot.

I'm wondering if we just need to move the test down a bit to make sure
the preempt_count is set.  I've not been following this thread, but
maybe this might work?

Untested patch below...

Mikey

diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 0317cce..3afaba4 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
 
 	pcpu = get_hard_smp_processor_id(lcpu);
 
-	/* Check to see if the CPU out of FW already for kexec */
-	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
-		cpumask_set_cpu(lcpu, of_spin_mask);
-		return 1;
-	}
-
 	/* Fixup atomic count: it exited inside IRQ handler. */
 	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
 
 	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
 		goto out;
 
+	/* Check to see if the CPU out of FW already for kexec */
+	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
+		cpumask_set_cpu(lcpu, of_spin_mask);
+		return 1;
+	}
+
 	/* 
 	 * If the RTAS start-cpu token does not exist then presume the
 	 * cpu is already spinning.

^ permalink raw reply related

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2010-09-02  1:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list
In-Reply-To: <1283234198.2151.15.camel@pasglop>

On Tue, 2010-08-31 at 15:56 +1000, Benjamin Herrenschmidt wrote:
> Hi Linus !
> 
> Here's a few small fixes, one is an important fix for a nasty regression
> breaking pseries machine running under hypervisor... oops !

I updated this with some embedded fixed from Kumar and a fix for a new
deadlock in the pseries dlpar code.

New log below.

Cheers,
Ben.

The following changes since commit 2bfc96a127bc1cc94d26bfaa40159966064f9c8c:
  Linus Torvalds (1):
        Linux 2.6.36-rc3

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge

Alexander Graf (1):
      powerpc/85xx: Fix compilation of mpc85xx_mds.c

Anton Vorontsov (1):
      powerpc/85xx: Add P1021 PCI IDs and quirks

Julia Lawall (2):
      arch/powerpc/platforms/83xx/mpc837x_mds.c: Add missing iounmap
      arch/powerpc/sysdev/qe_lib/qe.c: Add of_node_put to avoid memory leak

Kumar Gala (1):
      powerpc/85xx: Fix compile issue with p1022_ds due to lmb rename to memblock

Li Yang (1):
      fsl_rio: fix compile errors

Matthew McClintock (1):
      powerpc/kexec: Adds correct calling convention for kexec purgatory

Michael Neuling (1):
      powerpc: Don't use kernel stack with translation off

Nathan Fontenot (1):
      powerpc/pseries: Correct rtas_data_buf locking in dlpar code

Paul Mackerras (1):
      powerpc/perf_event: Reduce latency of calling perf_event_do_pending

 arch/powerpc/kernel/head_64.S             |   12 ++++++--
 arch/powerpc/kernel/misc_32.S             |    3 ++
 arch/powerpc/kernel/time.c                |   23 +++++++--------
 arch/powerpc/platforms/83xx/mpc837x_mds.c |    9 ++++--
 arch/powerpc/platforms/85xx/mpc85xx_mds.c |    1 +
 arch/powerpc/platforms/85xx/p1022_ds.c    |    4 +-
 arch/powerpc/platforms/pseries/dlpar.c    |   42 ++++++++++++++++++++---------
 arch/powerpc/sysdev/fsl_pci.c             |    2 +
 arch/powerpc/sysdev/fsl_rio.c             |    6 +++-
 arch/powerpc/sysdev/qe_lib/qe.c           |    1 +
 include/linux/pci_ids.h                   |    2 +
 11 files changed, 71 insertions(+), 34 deletions(-)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox