LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 00/14] Consolidation of 83xx/85xx board files
From: Kumar Gala @ 2011-07-19 14:29 UTC (permalink / raw)
  To: Dmitry Eremin-Solenikov; +Cc: Linux PPC Development, Paul Mackerras
In-Reply-To: <1311065631-3429-1-git-send-email-dbaryshkov@gmail.com>


On Jul 19, 2011, at 3:53 AM, Dmitry Eremin-Solenikov wrote:

> I think it's already too late for this merge window, so this should =
stay
> for 3.2 merge window. Board files for mpc83xx platforms show lots of =
common
> code. Same goes for mpc85xx boards. This patchset is an initial =
attempt
> to merge some (most) of the common code. Based on the tree by Kumar =
Gala.
>=20
> The following changes since commit =
6471fc6630a507fd54fdaceceee1ddaf3c917cde:
>=20
>  powerpc: Dont require a dma_ops struct to set dma mask (2011-07-08 =
00:21:36 -0500)
>=20
> Dmitry Eremin-Solenikov (14):
>      83xx: consolidate init_IRQ functions
>      83xx: consolidate of_platform_bus_probe calls
>      mpc8349emitx: mark localbus as compatible with simple-bus
>      83xx/mpc834x_itx: drop pq2pro-localbus-specific code
>      83xx: headers cleanup
>      85xx/sbc8560: correct compilation if CONFIG_PHYS_ADDR_T_64BIT is =
set
>      85xx/ksi8560: declare that localbus is compatbile with simple-bus
>      85xx/sbc8560: declare that localbus is compatbile with simple-bus
>      85xx/sbc8548: read hardware revision when it's required for first =
time
>      85xx/mpc85xx_rdb: merge p1020_rdb and p2020_rdb machine entries
>      85xx: merge 32-bit QorIQ with DPA boards support
>      85xx/mpc85xx_ds,ads,cds: move .pci_exclude_device setting to =
machine definitions
>      85xx: consolidate of_platform_bus_probe calls
>      85xx: separate cpm2 pic init

So for next round can you split this into two groups.  One for 83xx and =
one for 85xx.

- k=

^ permalink raw reply

* [PATCH] powerpc: Fix build dependencies for epapr.c which needs libfdt.h
From: Matthew McClintock @ 2011-07-19 16:22 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Matthew McClintock

Currently, the build can (very rarely) fail to build because libfdt.h has
not been created or is in the process of being copied.

Signed-off-by: Matthew McClintock <msm@freescale.com>
---
I think this fixes this build error. Please comment as it's really hard to
reproduce this build error. I've seen this happen a few times now on our
automated build server.

 BOOTCC  arch/powerpc/boot/ep8248e.o
 BOOTCC  arch/powerpc/boot/cuboot-warp.o
 BOOTCC  arch/powerpc/boot/cuboot-85xx-cpm2.o
In file included from arch/powerpc/boot/epapr.c:20:0:
arch/powerpc/boot/libfdt.h:382:1: error: unterminated comment
arch/powerpc/boot/libfdt.h:1:0: error: unterminated #ifndef
 BOOTCC  arch/powerpc/boot/cuboot-yosemite.o
make[1]: *** [arch/powerpc/boot/epapr.o] Error 1
make[1]: *** Waiting for unfinished jobs....
 BOOTCC  arch/powerpc/boot/simpleboot.o
make: *** [uImage] Error 2
Build step 'Execute shell' marked build as failure

 arch/powerpc/boot/Makefile |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index c26200b..ac6705e 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -58,7 +58,7 @@ $(addprefix $(obj)/,$(zlib) cuboot-c2k.o gunzip_util.o main.o prpmc2800.o): \
 libfdt       := fdt.c fdt_ro.c fdt_wip.c fdt_sw.c fdt_rw.c fdt_strerror.c
 libfdtheader := fdt.h libfdt.h libfdt_internal.h

-$(addprefix $(obj)/,$(libfdt) libfdt-wrapper.o simpleboot.o): \
+$(addprefix $(obj)/,$(libfdt) libfdt-wrapper.o simpleboot.o epapr.o): \
 	$(addprefix $(obj)/,$(libfdtheader))

 src-wlib := string.S crt0.S crtsavres.S stdio.c main.c \
-- 
1.7.5

^ permalink raw reply related

* Re: [PATCH 04/14] 83xx/mpc834x_itx: drop pq2pro-localbus-specific code
From: Scott Wood @ 2011-07-19 16:55 UTC (permalink / raw)
  To: Dmitry Eremin-Solenikov; +Cc: Paul Mackerras, Linux PPC Development
In-Reply-To: <1311065631-3429-5-git-send-email-dbaryshkov@gmail.com>

On Tue, 19 Jul 2011 12:53:41 +0400
Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> wrote:

> As localbus on mpc8349e-mitx now provides simple-bus compatibility, we
> can drop code asking for pq2pro-localbus devices on mpc834x_itx boards.

Do we have a good reason for breaking compatibility with older device trees?

-Scott

^ permalink raw reply

* Re: [PATCH 13/14] 85xx: consolidate of_platform_bus_probe calls
From: Scott Wood @ 2011-07-19 17:54 UTC (permalink / raw)
  To: Dmitry Eremin-Solenikov; +Cc: Paul Mackerras, Linux PPC Development
In-Reply-To: <1311065631-3429-14-git-send-email-dbaryshkov@gmail.com>

On Tue, 19 Jul 2011 12:53:50 +0400
Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> wrote:

> +static struct of_device_id __initdata mpc85xx_common_ids[] = {
> +	{ .type = "soc", },
> +	{ .compatible = "soc", },
> +	{ .compatible = "simple-bus", },
> +	{ .compatible = "gianfar", },
> +	{ .compatible = "fsl,qe", },
> +	{ .compatible = "fsl,cpm2", },
> +	{},
> +};

Same comment as for 83xx regarding localbus and compatibility with old
device trees.

-Scott

^ permalink raw reply

* Re: setbat() in udbg_init_cpm() required to avoid driver lockup
From: Scott Wood @ 2011-07-19 18:24 UTC (permalink / raw)
  To: Daniel Ng2; +Cc: linuxppc-dev
In-Reply-To: <32088424.post@talk.nabble.com>

On Mon, 18 Jul 2011 22:39:01 -0700
Daniel Ng2 <daniel.ng1234@gmail.com> wrote:

> 
> Our USB Device Controller (UDC) driver seems to get stuck in a loop waiting
> for the CPM Command Register to indicate that the CPM has finished executing
> a command. (It should do this by setting the cpmcr 'Command Done' bit).
>
> This only happens if I disable the 'Early Debug' Kernel Hacking .config
> parameter. If Early Debug is enabled, then the problem goes away.
> 
> I've narrowed it down to this line in udbg_init_cpm(void):
> 
> setbat(1, 0xf0000000, 0xf0000000, 0x40000, PAGE_KERNEL_NCG);
> 
> -without this line, the driver gets stuck in the loop.
> 
> Can anyone suggest why?

Is your USB driver accessing effective addresses from 0xf0000000 to
0xf0040000?  It should be using ioremap().

> Also, what undesireable effects might there be of keeping the above call to
> setbat()?

It's squatting on a chunk of virtual address space without properly
reserving it.  This is bad enough for a debug hack (and should be fixed).
Don't extend it to normal operation -- especially not as a substitute for
understanding the root cause of your problem.

-Scott

^ permalink raw reply

* Re: [PATCH 4/4] powerpc: Enable lockup and hung task detectors in pseriesand ppc64 defeconfigs
From: Anton Blanchard @ 2011-07-19 23:46 UTC (permalink / raw)
  To: David Laight; +Cc: linuxppc-dev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8ADE3@saturn3.aculab.com>


Hi David,

> > As a result of changes to Kconfig files, we no longer enable
> > the lockup and hung task detectors. Both are very light weight
> > and provide useful information in the event of a hang, so
> > reenable them.
> ...
> > +CONFIG_LOCKUP_DETECTOR=y
> > +CONFIG_DETECT_HUNG_TASK=y
> 
> Is one of thise responsible for generating a kernel stack traceback
> when a process has been sleeping uninterruptably for a 'long' time?
> 
> We have a kernel subsystem that has several 'worker' threads,
> these always sleep uninterruptable (they are shut down by explicit
> request) and, at times, can be idle for long periods.
> 
> Perhaps it should be possible to disable the check either on
> a per-process of per sleep basis?

I don't see any runtime options other than disabling it completely via:

/proc/sys/kernel/hung_task_timeout_secs

Anton

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Anton Blanchard @ 2011-07-20  2:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <1311070894.13765.180.camel@twins>


Hi,

> That looks very strange indeed.. up to node 23 there is the normal
> symmetric matrix with all the trace elements on 10 (as we would expect
> for local access), and some 4x4 sub-matrix stacked around the trace
> with 20, suggesting a single hop distance, and the rest on 40 being
> out-there.
> 
> But row 24-27 and column 28-31 are way weird, how can that ever be?
> Aren't the inter-connects symmetric and thus mandating a fully
> symmetric matrix? That is, how can traffic from node 23 (row) to node
> 28 (column) have inf bandwidth (0) yet traffic from node 28 (row) to
> node 23 (column) have a multi-hop distance of 40.

Good point, it definitely makes no sense. It looks like a bug in
numactl, the raw data looks reasonable:

# cat /sys/devices/system/node/node?/distance node??/distance
10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10

Yet another bug to track down :(

> So the idea I had to generate numa sched domains from the node
> distance ( http://marc.info/?l=linux-kernel&m=130218515520540 ),
> would that still work for you? [it does assume a symmetric matrix ]

It should work for us and it makes our NUMA memory and scheduler
domains more consistent. Nice!

Anton

^ permalink raw reply

* Re: [PATCH] net: ibm_newemac: Don't start autonegotiation when disabled in BMCR (genmii)
From: Stefan Roese @ 2011-07-20  7:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, netdev
In-Reply-To: <1311078570.25044.421.camel@pasglop>

On Tuesday 19 July 2011 14:29:30 Benjamin Herrenschmidt wrote:
> > I feel that this BMCR_ANENABLE bit should be evaluated, but I have no
> > strong preference here. If you prefer that this should be handled via a
> > new dt property (phy-aneg = "disabled" ?), I can implement it this way.
> > Just let me know.
> 
> Don't we already have some bindings for PHY with a fixed setting ? I
> don't remember off hand, we need to dbl check.

The only related PHY property I found is "fixed-link" used in fs_enet-main.c. 
None in the emac driver. Here the description for "fixed-link":

Documentation/devicetree/bindings/net/fsl-tsec-phy.txt:

  - fixed-link : <a b c d e> where a is emulated phy id - choose any,
    but unique to the all specified fixed-links, b is duplex - 0 half,
    1 full, c is link speed - d#10/d#100/d#1000, d is pause - 0 no
    pause, 1 pause, e is asym_pause - 0 no asym_pause, 1 asym_pause.

But what I really want to achieve, is to skip auto-negotiation (use the 
strapped configuration). And not to define this fixed configuration (again) in 
the device-tree. So I would prefer something like phy-aneg = "disabled".

What do you think?

Thanks,
Stefan

^ permalink raw reply

* RE: [PATCH 2/3] eSDHC: Fix errors when booting kernel with fsl esdhc
From: Zang Roy-R61911 @ 2011-07-20  9:29 UTC (permalink / raw)
  To: S, Venkatraman; +Cc: Xu Lei-B33228, linux-mmc, akpm, linuxppc-dev
In-Reply-To: <CANfBPZ-V3+NFDVuftXfpTwUGwmXC-sFiUhG8LKCLROse6xT3=Q@mail.gmail.com>



> -----Original Message-----
> From: S, Venkatraman [mailto:svenkatr@ti.com]
> Sent: Tuesday, July 19, 2011 23:58 PM
> To: Zang Roy-R61911
> Cc: linux-mmc; linuxppc-dev; cbouatmailru; akpm; Xu Lei-B33228; Kumar Gal=
a
> Subject: Re: [PATCH 2/3] eSDHC: Fix errors when booting kernel with fsl e=
sdhc
>=20
> On Mon, Jul 18, 2011 at 11:31 AM, Zang Roy-R61911 <r61911@freescale.com> =
wrote:
> >
> >
> >> -----Original Message-----
> >> From: S, Venkatraman [mailto:svenkatr@ti.com]
> >> Sent: Tuesday, July 05, 2011 14:17 PM
> >> To: Zang Roy-R61911
> >> Cc: linux-mmc; linuxppc-dev; cbouatmailru; akpm; Xu Lei-B33228; Kumar =
Gala
> >> Subject: Re: [PATCH 2/3] eSDHC: Fix errors when booting kernel with fs=
l esdhc
> >>
> >> On Tue, Jul 5, 2011 at 9:49 AM, Roy Zang <tie-fei.zang@freescale.com> =
wrote:
> >> > From: Xu lei <B33228@freescale.com>
> >> >
> >> > When esdhc module was enabled in p5020, there were following errors:
> >> >
> >> > mmc0: Timeout waiting for hardware interrupt.
> >> > mmc0: error -110 whilst initialising SD card
> >> > mmc0: Unexpected interrupt 0x02000000.
> >> > mmc0: Timeout waiting for hardware interrupt.
> >> > mmc0: error -110 whilst initialising SD card
> >> > mmc0: Unexpected interrupt 0x02000000.
> >> >
> >> > It is because ESDHC controller has different bit setting for PROCTL
> >> > register, when kernel sets Power Control Register by method for stan=
dard
> >> > SD Host Specification, it would overwritten FSL ESDHC PROCTL[DMAS];
> >> > when it set Host Control Registers[DMAS], it sets PROCTL[EMODE] and
> >> > PROCTL[D3CD]. These operations will set bad bits for PROCTL Register
> >> > on FSL ESDHC Controller and cause errors, so this patch will make es=
dhc
> >> > driver access FSL PROCTL Register according to block guide instead o=
f
> >> > standard SD Host Specification.
> >> >
> >> > For some FSL chips, such as MPC8536/P2020, PROCTL[VOLT_SEL] and PROC=
TL[DMAS]
> >> > bits are reserved and even if they are set to wrong bits there is no=
 error.
> >> > But considering that all FSL ESDHC Controller register map is not fu=
lly
> >> > compliant to standard SD Host Specification, we put the patch to all=
 of
> >> > FSL ESDHC Controllers.
> >> >
> >> > Signed-off-by: Lei Xu <B33228@freescale.com>
> >> > Signed-off-by: Roy Zang <tie-fei.zang@freescale.com>
> >> > Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
> >> > ---
> >> > =A0drivers/mmc/host/sdhci-of-core.c | =A0 =A03 ++
> >> > =A0drivers/mmc/host/sdhci.c =A0 =A0 =A0 =A0 | =A0 62 +++++++++++++++=
+++++++++++++++----
> -
> >> --
> >> > =A0include/linux/mmc/sdhci.h =A0 =A0 =A0 =A0| =A0 =A06 ++-
> >> > =A03 files changed, 57 insertions(+), 14 deletions(-)
> > [snip]
> >
> >> > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> >> > index 58d5436..77174e5 100644
> >> > --- a/drivers/mmc/host/sdhci.c
> >> > +++ b/drivers/mmc/host/sdhci.c
> >> > @@ -674,7 +674,7 @@ static void sdhci_set_transfer_irqs(struct sdhci=
_host
> >> *host)
> >> > =A0static void sdhci_prepare_data(struct sdhci_host *host, struct mm=
c_command
> >> *cmd)
> >> > =A0{
> >> > =A0 =A0 =A0 =A0u8 count;
> >> > - =A0 =A0 =A0 u8 ctrl;
> >> > + =A0 =A0 =A0 u32 ctrl;
> >> > =A0 =A0 =A0 =A0struct mmc_data *data =3D cmd->data;
> >> > =A0 =A0 =A0 =A0int ret;
> >> >
> >> > @@ -807,14 +807,28 @@ static void sdhci_prepare_data(struct sdhci_ho=
st
> *host,
> >> struct mmc_command *cmd)
> >> > =A0 =A0 =A0 =A0 * is ADMA.
> >> > =A0 =A0 =A0 =A0 */
> >> > =A0 =A0 =A0 =A0if (host->version >=3D SDHCI_SPEC_200) {
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 ctrl =3D sdhci_readb(host, SDHCI_HOST_=
CONTROL);
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 ctrl &=3D ~SDHCI_CTRL_DMA_MASK;
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((host->flags & SDHCI_REQ_USE_DMA) =
&&
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (host->flags & SDHCI_U=
SE_ADMA))
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ctrl |=3D SDHCI_CTRL_A=
DMA32;
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 else
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ctrl |=3D SDHCI_CTRL_S=
DMA;
> >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 sdhci_writeb(host, ctrl, SDHCI_HOST_CO=
NTROL);
> >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (host->quirks & SDHCI_QUIRK_QORIQ_P=
ROCTL_WEIRD) {
> >> > +#define ESDHCI_PROCTL_DMAS_MASK =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A00x00=
000300
> >> > +#define ESDHCI_PROCTL_ADMA32 =A0 =A0 =A0 =A0 =A0 0x00000200
> >> > +#define ESDHCI_PROCTL_SDMA =A0 =A0 =A0 =A0 =A0 =A0 0x00000000
> >>
> >> Breaks the code flow / readability. Can be moved to top of the file ?
> > The defines are only used in the following section. Why it will break
> > the readability?
> > I can also see this kind of define in the file
> > ...
> > #define SAMPLE_COUNT =A0 =A05
> >
> > static int sdhci_get_ro(struct mmc_host *mmc)
> > ...
> >
> > Any rule should follow?
> >
> >
> > [snip]
> >> > @@ -1162,6 +1189,17 @@ static void sdhci_set_power(struct sdhci_host=
 *host,
> >> unsigned short power)
> >> >
> >> > =A0 =A0 =A0 =A0host->pwr =3D pwr;
> >> >
> >> > + =A0 =A0 =A0 /* Now FSL ESDHC Controller has no Bus Power bit,
> >> > + =A0 =A0 =A0 =A0* and PROCTL[21] bit is for voltage selection */
> >>
> >> Multiline comment style needed..
> > Will update.
> > please help to explain your previous comment.
> > Thanks.
> > Roy
>=20
> There aren't very hard rules on this. Simple #defines are good, as a
> one off usage.
> These bit mask fields are very verbose, and they tend to grow more
> than a screenful.
> The remaining bits will never be defined ?
The mask fields and bits are only for some WERID defines. I do not think
they will grow more than a screen.
The remain bits will have little chance to be defined.
Thanks for the suggestion. I will update the comment style and post again.
Roy

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Anton Blanchard @ 2011-07-20 10:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <1311070894.13765.180.camel@twins>


Hi Peter,

> That looks very strange indeed.. up to node 23 there is the normal
> symmetric matrix with all the trace elements on 10 (as we would expect
> for local access), and some 4x4 sub-matrix stacked around the trace
> with 20, suggesting a single hop distance, and the rest on 40 being
> out-there.

I retested with the latest version of numactl, and get correct results.

I worked out why the patches don't boot, we weren't allocating any
space for the cpumask and ran off the end of the allocation.

Should we also use cpumask_copy instead of open coding it? I added that
too.

Anton

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2011-07-20 01:54:08.191668781 -0500
+++ linux-2.6/kernel/sched.c	2011-07-20 04:45:36.203750525 -0500
@@ -7020,8 +7020,8 @@
 		if (cpumask_test_cpu(i, covered))
 			continue;
 
-		sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL,
-				cpu_to_node(i));
+		sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+				  GFP_KERNEL, cpu_to_node(i));
 
 		if (!sg)
 			goto fail;
@@ -7031,7 +7031,7 @@
 		child = *per_cpu_ptr(sdd->sd, i);
 		if (child->child) {
 			child = child->child;
-			*sg_span = *sched_domain_span(child);
+			cpumask_copy(sg_span, sched_domain_span(child));
 		} else
 			cpumask_set_cpu(i, sg_span);
 

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra @ 2011-07-20 10:45 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <20110720201436.19e9689a@kryten>

On Wed, 2011-07-20 at 20:14 +1000, Anton Blanchard wrote:

> > That looks very strange indeed.. up to node 23 there is the normal
> > symmetric matrix with all the trace elements on 10 (as we would expect
> > for local access), and some 4x4 sub-matrix stacked around the trace
> > with 20, suggesting a single hop distance, and the rest on 40 being
> > out-there.
>=20
> I retested with the latest version of numactl, and get correct results.

One less thing to worry about ;-)

> I worked out why the patches don't boot, we weren't allocating any
> space for the cpumask and ran off the end of the allocation.

Gah! that's not the first time I made that particular mistake :/

> Should we also use cpumask_copy instead of open coding it? I added that
> too.

Probably, I looked for cpumask_assign() and on failing to find that used
the direct assignment.

So with that fix the patch makes the machine happy again?

Thanks!

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Anton Blanchard @ 2011-07-20 12:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <1311158708.5345.12.camel@twins>


Hi Peter,

> So with that fix the patch makes the machine happy again?

Yes, the machine looks fine with the patches applied. Thanks!

Anton

^ permalink raw reply

* [v3 PATCH 0/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Ayman Elkhashab @ 2011-07-20 13:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Tony Breeds, linuxppc-dev,
	linux-kernel
  Cc: Ayman El-Khashab
In-Reply-To: <1310603611-8960-2-git-send-email-ayman@elkhashab.com>

From: Ayman El-Khashab <ayman@elkhashab.com>

Changes from v1->v2
Added definitions for the bits in the OMRxMSKL registers
Refactored the setting of the OMRxMSKL registers for the 460SX
Added bit defines for the PECFG_460SX_DLLSTA register

Changes from v2->v3
Fixed commit message to be more clear as to what was done

Ayman El-Khashab (1):
  powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx

 arch/powerpc/sysdev/ppc4xx_pci.c |   89 ++++++++++++++++++++++++++++++-------
 arch/powerpc/sysdev/ppc4xx_pci.h |   12 +++++
 2 files changed, 84 insertions(+), 17 deletions(-)

-- 
1.7.4.3

^ permalink raw reply

* [v3 PATCH 1/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Ayman Elkhashab @ 2011-07-20 13:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Tony Breeds, linuxppc-dev,
	linux-kernel
  Cc: Ayman El-Khashab
In-Reply-To: <1311166949-2543-1-git-send-email-aymane@elkhashab.com>

From: Ayman El-Khashab <ayman@elkhashab.com>

Adds a register to the config space for the 460sx.  Changes the vc0
detect to a pll detect.  maps configuration space to test the link
status.  changes the setup to enable gen2 devices to operate at gen2
speeds.  fixes mapping that was not correct for the 460sx.  added
bit definitions for the OMRxMSKL registers.  Removed reserved bit
that was set incorrectly in the OMR2MSKL register.

tested on the 460sx eiger and custom board

Signed-off-by: Ayman El-Khashab <ayman@elkhashab.com>
---
 arch/powerpc/sysdev/ppc4xx_pci.c |   89 ++++++++++++++++++++++++++++++-------
 arch/powerpc/sysdev/ppc4xx_pci.h |   12 +++++
 2 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/sysdev/ppc4xx_pci.c b/arch/powerpc/sysdev/ppc4xx_pci.c
index ad330fe..eeeeb12 100644
--- a/arch/powerpc/sysdev/ppc4xx_pci.c
+++ b/arch/powerpc/sysdev/ppc4xx_pci.c
@@ -1092,6 +1092,10 @@ static int __init ppc460sx_pciex_core_init(struct device_node *np)
 	mtdcri(SDR0, PESDR1_460SX_HSSSLEW, 0xFFFF0000);
 	mtdcri(SDR0, PESDR2_460SX_HSSSLEW, 0xFFFF0000);
 
+	/* Set HSS PRBS enabled */
+	mtdcri(SDR0, PESDR0_460SX_HSSCTLSET, 0x00001130);
+	mtdcri(SDR0, PESDR2_460SX_HSSCTLSET, 0x00001130);
+
 	udelay(100);
 
 	/* De-assert PLLRESET */
@@ -1132,9 +1136,6 @@ static int ppc460sx_pciex_init_port_hw(struct ppc4xx_pciex_port *port)
 		dcri_clrset(SDR0, port->sdr_base + PESDRn_UTLSET2,
 				0, 0x01000000);
 
-	/*Gen-1*/
-	mtdcri(SDR0, port->sdr_base + PESDRn_460SX_RCEI, 0x08000000);
-
 	dcri_clrset(SDR0, port->sdr_base + PESDRn_RCSSET,
 			(PESDRx_RCSSET_RSTGU | PESDRx_RCSSET_RSTDL),
 			PESDRx_RCSSET_RSTPYN);
@@ -1148,14 +1149,42 @@ static int ppc460sx_pciex_init_utl(struct ppc4xx_pciex_port *port)
 {
 	/* Max 128 Bytes */
 	out_be32 (port->utl_base + PEUTL_PBBSZ,   0x00000000);
+	/* Assert VRB and TXE - per datasheet turn off addr validation */
+	out_be32(port->utl_base + PEUTL_PCTL,  0x80800000);
 	return 0;
 }
 
+static void __init ppc460sx_pciex_check_link(struct ppc4xx_pciex_port *port)
+{
+	void __iomem *mbase;
+	int attempt = 50;
+
+	port->link = 0;
+
+	mbase = ioremap(port->cfg_space.start + 0x10000000, 0x1000);
+	if (mbase == NULL) {
+		printk(KERN_ERR "%s: Can't map internal config space !",
+			port->node->full_name);
+		goto done;
+	}
+
+	while (attempt && (0 == (in_le32(mbase + PECFG_460SX_DLLSTA)
+			& PECFG_460SX_DLLSTA_LINKUP))) {
+		attempt--;
+		mdelay(10);
+	}
+	if (attempt)
+		port->link = 1;
+done:
+	iounmap(mbase);
+
+}
+
 static struct ppc4xx_pciex_hwops ppc460sx_pcie_hwops __initdata = {
 	.core_init	= ppc460sx_pciex_core_init,
 	.port_init_hw	= ppc460sx_pciex_init_port_hw,
 	.setup_utl	= ppc460sx_pciex_init_utl,
-	.check_link	= ppc4xx_pciex_check_link_sdr,
+	.check_link	= ppc460sx_pciex_check_link,
 };
 
 #endif /* CONFIG_44x */
@@ -1338,15 +1367,15 @@ static int __init ppc4xx_pciex_port_init(struct ppc4xx_pciex_port *port)
 	if (rc != 0)
 		return rc;
 
-	if (ppc4xx_pciex_hwops->check_link)
-		ppc4xx_pciex_hwops->check_link(port);
-
 	/*
 	 * Initialize mapping: disable all regions and configure
 	 * CFG and REG regions based on resources in the device tree
 	 */
 	ppc4xx_pciex_port_init_mapping(port);
 
+	if (ppc4xx_pciex_hwops->check_link)
+		ppc4xx_pciex_hwops->check_link(port);
+
 	/*
 	 * Map UTL
 	 */
@@ -1360,13 +1389,23 @@ static int __init ppc4xx_pciex_port_init(struct ppc4xx_pciex_port *port)
 		ppc4xx_pciex_hwops->setup_utl(port);
 
 	/*
-	 * Check for VC0 active and assert RDY.
+	 * Check for VC0 active or PLL Locked and assert RDY.
 	 */
 	if (port->sdr_base) {
-		if (port->link &&
-		    ppc4xx_pciex_wait_on_sdr(port, PESDRn_RCSSTS,
-					     1 << 16, 1 << 16, 5000)) {
-			printk(KERN_INFO "PCIE%d: VC0 not active\n", port->index);
+		if (of_device_is_compatible(port->node,
+				"ibm,plb-pciex-460sx")){
+			if (port->link && ppc4xx_pciex_wait_on_sdr(port,
+					PESDRn_RCSSTS,
+					1 << 12, 1 << 12, 5000)) {
+				printk(KERN_INFO "PCIE%d: PLL not locked\n",
+						port->index);
+				port->link = 0;
+			}
+		} else if (port->link &&
+			ppc4xx_pciex_wait_on_sdr(port, PESDRn_RCSSTS,
+				1 << 16, 1 << 16, 5000)) {
+			printk(KERN_INFO "PCIE%d: VC0 not active\n",
+					port->index);
 			port->link = 0;
 		}
 
@@ -1573,8 +1612,15 @@ static int __init ppc4xx_setup_one_pciex_POM(struct ppc4xx_pciex_port	*port,
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR1BAH, lah);
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR1BAL, lal);
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR1MSKH, 0x7fffffff);
-		/* Note that 3 here means enabled | single region */
-		dcr_write(port->dcrs, DCRO_PEGPL_OMR1MSKL, sa | 3);
+		/*Enabled and single region */
+		if (of_device_is_compatible(port->node, "ibm,plb-pciex-460sx"))
+			dcr_write(port->dcrs, DCRO_PEGPL_OMR1MSKL,
+				sa | DCRO_PEGPL_460SX_OMR1MSKL_UOT
+					| DCRO_PEGPL_OMRxMSKL_VAL);
+		else
+			dcr_write(port->dcrs, DCRO_PEGPL_OMR1MSKL,
+				sa | DCRO_PEGPL_OMR1MSKL_UOT
+					| DCRO_PEGPL_OMRxMSKL_VAL);
 		break;
 	case 1:
 		out_le32(mbase + PECFG_POM1LAH, pciah);
@@ -1582,8 +1628,8 @@ static int __init ppc4xx_setup_one_pciex_POM(struct ppc4xx_pciex_port	*port,
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR2BAH, lah);
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR2BAL, lal);
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR2MSKH, 0x7fffffff);
-		/* Note that 3 here means enabled | single region */
-		dcr_write(port->dcrs, DCRO_PEGPL_OMR2MSKL, sa | 3);
+		dcr_write(port->dcrs, DCRO_PEGPL_OMR2MSKL,
+				sa | DCRO_PEGPL_OMRxMSKL_VAL);
 		break;
 	case 2:
 		out_le32(mbase + PECFG_POM2LAH, pciah);
@@ -1592,7 +1638,9 @@ static int __init ppc4xx_setup_one_pciex_POM(struct ppc4xx_pciex_port	*port,
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR3BAL, lal);
 		dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKH, 0x7fffffff);
 		/* Note that 3 here means enabled | IO space !!! */
-		dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKL, sa | 3);
+		dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKL,
+				sa | DCRO_PEGPL_OMR3MSKL_IO
+					| DCRO_PEGPL_OMRxMSKL_VAL);
 		break;
 	}
 
@@ -1693,6 +1741,9 @@ static void __init ppc4xx_configure_pciex_PIMs(struct ppc4xx_pciex_port *port,
 		if (res->flags & IORESOURCE_PREFETCH)
 			sa |= 0x8;
 
+		if (of_device_is_compatible(port->node, "ibm,plb-pciex-460sx"))
+			sa |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+
 		out_le32(mbase + PECFG_BAR0HMPA, RES_TO_U32_HIGH(sa));
 		out_le32(mbase + PECFG_BAR0LMPA, RES_TO_U32_LOW(sa));
 
@@ -1854,6 +1905,10 @@ static void __init ppc4xx_pciex_port_setup_hose(struct ppc4xx_pciex_port *port)
 	}
 	out_le16(mbase + 0x202, val);
 
+	/* Enable Bus master, memory, and io space */
+	if (of_device_is_compatible(port->node, "ibm,plb-pciex-460sx"))
+		out_le16(mbase + 0x204, 0x7);
+
 	if (!port->endpoint) {
 		/* Set Class Code to PCI-PCI bridge and Revision Id to 1 */
 		out_le32(mbase + 0x208, 0x06040001);
diff --git a/arch/powerpc/sysdev/ppc4xx_pci.h b/arch/powerpc/sysdev/ppc4xx_pci.h
index 56d9e5d..61b3659 100644
--- a/arch/powerpc/sysdev/ppc4xx_pci.h
+++ b/arch/powerpc/sysdev/ppc4xx_pci.h
@@ -464,6 +464,18 @@
 #define PECFG_POM2LAL		0x390
 #define PECFG_POM2LAH		0x394
 
+/* 460sx only */
+#define PECFG_460SX_DLLSTA     0x3f8
+
+/* 460sx Bit Mappings */
+#define PECFG_460SX_DLLSTA_LINKUP	 0x00000010
+#define DCRO_PEGPL_460SX_OMR1MSKL_UOT	 0x00000004
+
+/* PEGPL Bit Mappings */
+#define DCRO_PEGPL_OMRxMSKL_VAL	 0x00000001
+#define DCRO_PEGPL_OMR1MSKL_UOT	 0x00000002
+#define DCRO_PEGPL_OMR3MSKL_IO	 0x00000002
+
 /* SDR Bit Mappings */
 #define PESDRx_RCSSET_HLDPLB	0x10000000
 #define PESDRx_RCSSET_RSTGU	0x01000000
-- 
1.7.4.3

^ permalink raw reply related

* Re: Please pull 'next' branch of 4xx tree
From: Josh Boyer @ 2011-07-20 13:18 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <20110712204156.GD4203@zod.rchland.ibm.com>

On Tue, Jul 12, 2011 at 4:41 PM, Josh Boyer <jwboyer@linux.vnet.ibm.com> wr=
ote:
> Hi Ben,
>
> A few fixes from Tony/Dave, a DTS update from Stefan, and a MAINTAINERS
> update.
>
> josh
>
> The following changes since commit af9719c3062dfe216a0c3de3fa52be6d22b445=
6c:
>
> =A0powerpc: Use -mtraceback=3Dno (2011-07-01 13:49:27 +1000)
>
> are available in the git repository at:
> =A0ssh://master.kernel.org/pub/scm/linux/kernel/git/jwboyer/powerpc-4xx.g=
it next

Ben, ping?

josh

^ permalink raw reply

* Re: [PATCH] powerpc: Fix build dependencies for epapr.c which needs libfdt.h
From: David Gibson @ 2011-07-20 13:49 UTC (permalink / raw)
  To: Matthew McClintock; +Cc: linuxppc-dev
In-Reply-To: <1311092564-10102-1-git-send-email-msm@freescale.com>

On Tue, Jul 19, 2011 at 11:22:44AM -0500, Matthew McClintock wrote:
> Currently, the build can (very rarely) fail to build because libfdt.h has
> not been created or is in the process of being copied.
> 
> Signed-off-by: Matthew McClintock <msm@freescale.com>

Looks sane to me.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Linus Torvalds @ 2011-07-20 14:40 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Peter Zijlstra, mahesh, linux-kernel, mingo, linuxppc-dev
In-Reply-To: <20110720221420.153b0830@kryten>

On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard <anton@samba.org> wrote:
>
>> So with that fix the patch makes the machine happy again?
>
> Yes, the machine looks fine with the patches applied. Thanks!

Ok, so what's the situation for 3.0 (I'm waiting for some RCU
resolution now)? Anton's patch may be small, but that's just the tiny
fixup patch to Peter's much scarier one ;)

                            Linus

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Darren Hart @ 2011-07-20 14:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Peter Zijlstra, Shan Hai, Peter Zijlstra, linux-kernel,
	cmetcalf, dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1311049762.25044.392.camel@pasglop>

Obviously no objection from the futex side of things, looks good. Couple
nits on the function comment:

On 07/18/2011 09:29 PM, Benjamin Herrenschmidt wrote:
...
> -/**
> +/*
> + * fixup_user_fault() - manually resolve a user page  fault

s/  fault/ fault/

> + * @tsk:	the task_struct to use for page fault accounting, or
> + *		NULL if faults are not to be recorded.
> + * @mm:		mm_struct of target mm
> + * @address:	user address
> + * @fault_flags:flags to pass down to handle_mm_fault()
> + *
> + * This is meant to be called in the specific scenario where for
> + * locking reasons we try to access user memory in atomic context
> + * (within a pagefault_disable() section), this returns -EFAULT,
> + * and we want to resolve the user fault before trying again.
> + *
> + * Typically this is meant to be used by the futex code.
> + *
> + * The main difference with get_user_pages() is that this function
> + * will unconditionally call handle_mm_fault() which will in turn
> + * perform all the necessary SW fixup of the dirty and young bits
> + * in the PTE, while handle_mm_fault() only guarantees to update
> + * these in the struct page.
> + *
> + * This is important for some architectures where those bits also
> + * gate the access permission to the page because their are

s/their/they/

Thanks,

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra @ 2011-07-20 14:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mahesh, linux-kernel, Anton Blanchard, mingo, linuxppc-dev
In-Reply-To: <CA+55aFxzaaMj8OUaust90c_hYKzg8NpRfmX4SzJ9SMwXzg5ocA@mail.gmail.com>

On Wed, 2011-07-20 at 07:40 -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 5:14 AM, Anton Blanchard <anton@samba.org> wrote:
> >
> >> So with that fix the patch makes the machine happy again?
> >
> > Yes, the machine looks fine with the patches applied. Thanks!
>=20
> Ok, so what's the situation for 3.0 (I'm waiting for some RCU
> resolution now)? Anton's patch may be small, but that's just the tiny
> fixup patch to Peter's much scarier one ;)

Right, so we can either merge my scary patches now and have 3.0 boot on
16+ node machines (and risk breaking something), or delay them until
3.0.1 and have 16+ node machines suffer a little.

The alternative quick hack is simply to disable the node domain, but
that'll be detrimental to regular machines in that the top domain used
to have NODE sd_flags will now have ALL_NODE sd_flags which are much
less aggressive.

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Linus Torvalds @ 2011-07-20 16:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linux-kernel, Anton Blanchard, mingo, linuxppc-dev
In-Reply-To: <1311173910.5345.94.camel@twins>

On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> Right, so we can either merge my scary patches now and have 3.0 boot on
> 16+ node machines (and risk breaking something), or delay them until
> 3.0.1 and have 16+ node machines suffer a little.

So how much impact does your scary patch have on machines that don't
have multiple nodes? If it's a "the code isn't even called by normal
machines" kind of setup, I don't think I care a lot.

                  Linus

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Ingo Molnar @ 2011-07-20 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, mahesh, linux-kernel, Anton Blanchard,
	linuxppc-dev
In-Reply-To: <CA+55aFxpH9-4rkH6CRXbjFCQMGNYGRMKgdgHUEthefQPumFKVg@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > Right, so we can either merge my scary patches now and have 3.0 
> > boot on 16+ node machines (and risk breaking something), or delay 
> > them until 3.0.1 and have 16+ node machines suffer a little.
> 
> So how much impact does your scary patch have on machines that 
> don't have multiple nodes? If it's a "the code isn't even called by 
> normal machines" kind of setup, I don't think I care a lot.

NUMA systems will trigger the new code - not just 'weird NUMA 
systems' - but i still think we could try the patches, the code looks 
straightforward and i booted them on NUMA systems and it all seems 
fine so far.

Anyway, i'll push the new sched/urgent branch out in a few minutes 
and then you'll see the full patches in the commit notifications.

Thanks,

	Ingo

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra @ 2011-07-20 16:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mahesh, linux-kernel, Anton Blanchard, mingo, linuxppc-dev
In-Reply-To: <CA+55aFxpH9-4rkH6CRXbjFCQMGNYGRMKgdgHUEthefQPumFKVg@mail.gmail.com>

On Wed, 2011-07-20 at 09:04 -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 7:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> =
wrote:
> >
> > Right, so we can either merge my scary patches now and have 3.0 boot on
> > 16+ node machines (and risk breaking something), or delay them until
> > 3.0.1 and have 16+ node machines suffer a little.
>=20
> So how much impact does your scary patch have on machines that don't
> have multiple nodes? If it's a "the code isn't even called by normal
> machines" kind of setup, I don't think I care a lot.

Hmm, it does get called, but it looks relatively straight forward to
make it so that it doesn't. Let me try that.

Yes, the below works nicely (on top of the previous two).

Built and boot tested on a single-node and multi-node x86_64.

---
Subject: sched: Avoid creating superfluous domains
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Jul 20 18:34:30 CEST 2011

When creating sched_domains, stop when we've covered the entire target
span instead of continuing to create domains, only to later find
they're redundant and throw them away again.

This avoids single node systems from touching funny NUMA sched_domain
creation code.

Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/kernel/sched.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7436,6 +7436,8 @@ static int build_sched_domains(const str
 			sd =3D build_sched_domain(tl, &d, cpu_map, attr, sd, i);
 			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 				sd->flags |=3D SD_OVERLAP;
+			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
+				break;
 		}
=20
 		while (sd->child)

^ permalink raw reply

* Fwd: perf PPC: kernel panic with callchains and context switch events
From: David Ahern @ 2011-07-20 22:13 UTC (permalink / raw)
  To: linuxppc-dev

[suggestion to try this mailing list as well]

-------- Original Message --------
Subject: perf PPC: kernel panic with callchains and context switch events
Date: Wed, 20 Jul 2011 15:57:51 -0600
From: David Ahern <dsahern@gmail.com>
To: Anton Blanchard <anton@samba.org>, Paul Mackerras
<paulus@samba.org>,  linux-perf-users@vger.kernel.org, LKML
<linux-kernel@vger.kernel.org>

I am hoping someone familiar with PPC can help understand a panic that
is generated when capturing callchains with context switch events.

Call trace is below. The short of it is that walking the callchain
generates a page fault. To handle the page fault the mmap_sem is needed,
but it is currently held by setup_arg_pages. setup_arg_pages calls
shift_arg_pages with the mmap_sem held. shift_arg_pages then calls
move_page_tables which has a cond_resched at the top of its for loop. If
the cond_resched() is removed from move_page_tables everything works
beautifully - no panics.

So, the question: is it normal for walking the stack to trigger a page
fault on PPC? The panic is not seen on x86 based systems.

 [<b0180e00>]rb_erase+0x1b4/0x3e8
 [<b00430f4>]__dequeue_entity+0x50/0xe8
 [<b0043304>]set_next_entity+0x178/0x1bc
 [<b0043440>]pick_next_task_fair+0xb0/0x118
 [<b02ada80>]schedule+0x500/0x614
 [<b02afaa8>]rwsem_down_failed_common+0xf0/0x264
 [<b02afca0>]rwsem_down_read_failed+0x34/0x54
 [<b02aed4c>]down_read+0x3c/0x54
 [<b0023b58>]do_page_fault+0x114/0x5e8
 [<b001e350>]handle_page_fault+0xc/0x80
 [<b0022dec>]perf_callchain+0x224/0x31c
 [<b009ba70>]perf_prepare_sample+0x240/0x2fc
 [<b009d760>]__perf_event_overflow+0x280/0x398
 [<b009d914>]perf_swevent_overflow+0x9c/0x10c
 [<b009db54>]perf_swevent_ctx_event+0x1d0/0x230
 [<b009dc38>]do_perf_sw_event+0x84/0xe4
 [<b009dde8>]perf_sw_event_context_switch+0x150/0x1b4
 [<b009de90>]perf_event_task_sched_out+0x44/0x2d4
 [<b02ad840>]schedule+0x2c0/0x614
 [<b0047dc0>]__cond_resched+0x34/0x90
 [<b02adcc8>]_cond_resched+0x4c/0x68
 [<b00bccf8>]move_page_tables+0xb0/0x418
 [<b00d7ee0>]setup_arg_pages+0x184/0x2a0
 [<b0110914>]load_elf_binary+0x394/0x1208
 [<b00d6e28>]search_binary_handler+0xe0/0x2c4
 [<b00d834c>]do_execve+0x1bc/0x268
 [<b0015394>]sys_execve+0x84/0xc8
 [<b001df10>]ret_from_syscall+0x0/0x3c

Thanks,
David

^ permalink raw reply

* Re: [v3 PATCH 1/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Tony Breeds @ 2011-07-21  0:59 UTC (permalink / raw)
  To: Ayman Elkhashab
  Cc: Ayman El-Khashab, Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <1311166949-2543-2-git-send-email-aymane@elkhashab.com>

On Wed, Jul 20, 2011 at 08:02:29AM -0500, Ayman Elkhashab wrote:
> From: Ayman El-Khashab <ayman@elkhashab.com>
> 
> Adds a register to the config space for the 460sx.  Changes the vc0
> detect to a pll detect.  maps configuration space to test the link
> status.  changes the setup to enable gen2 devices to operate at gen2
> speeds.  fixes mapping that was not correct for the 460sx.  added
> bit definitions for the OMRxMSKL registers.  Removed reserved bit
> that was set incorrectly in the OMR2MSKL register.

FWIW Looks good to me.

Yours Tony

^ permalink raw reply

* [PATCH v3] net: filter: BPF 'JIT' compiler for PPC64
From: Matt Evans @ 2011-07-21  1:51 UTC (permalink / raw)
  To: linuxppc-dev, netdev
In-Reply-To: <4E24E867.9050909@ozlabs.org>

An implementation of a code generator for BPF programs to speed up packet
filtering on PPC64, inspired by Eric Dumazet's x86-64 version.

Filter code is generated as an ABI-compliant function in module_alloc()'d mem
with stackframe & prologue/epilogue generated if required (simple filters don't
need anything more than an li/blr).  The filter's local variables, M[], live in
registers.  Supports all BPF opcodes, although "complicated" loads from negative
packet offsets (e.g. SKF_LL_OFF) are not yet supported.

There are a couple of further optimisations left for future work; many-pass
assembly with branch-reach reduction and a register allocator to push M[]
variables into volatile registers would improve the code quality further.

This currently supports big-endian 64-bit PowerPC only (but is fairly simple
to port to PPC32 or LE!).

Enabled in the same way as x86-64:

	echo 1 > /proc/sys/net/core/bpf_jit_enable

Or, enabled with extra debug output:

	echo 2 > /proc/sys/net/core/bpf_jit_enable

Signed-off-by: Matt Evans <matt@ozlabs.org>
---

V3: Added BUILD_BUG_ON to assert PACA CPU ID is 16bits, made a comment (in
    LD_MSH) a bit clearer, ratelimited "Unknown opcode" error and moved
    bpf_jit.S to bpf_jit_64.S (it doesn't make sense to rename bpf_jit_comp.c as
    small portions will eventually get split out into _32/_64.c files when we do
    32bit support).

 arch/powerpc/Kconfig                  |    1 +
 arch/powerpc/Makefile                 |    3 +-
 arch/powerpc/include/asm/ppc-opcode.h |   40 ++
 arch/powerpc/net/Makefile             |    4 +
 arch/powerpc/net/bpf_jit.h            |  227 +++++++++++
 arch/powerpc/net/bpf_jit_64.S         |  138 +++++++
 arch/powerpc/net/bpf_jit_comp.c       |  694 +++++++++++++++++++++++++++++++++
 7 files changed, 1106 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2729c66..39860fc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -134,6 +134,7 @@ config PPC
 	select GENERIC_IRQ_SHOW_LEVEL
 	select HAVE_RCU_TABLE_FREE if SMP
 	select HAVE_SYSCALL_TRACEPOINTS
+	select HAVE_BPF_JIT if PPC64
 
 config EARLY_PRINTK
 	bool
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index b7212b6..b94740f 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -154,7 +154,8 @@ core-y				+= arch/powerpc/kernel/ \
 				   arch/powerpc/lib/ \
 				   arch/powerpc/sysdev/ \
 				   arch/powerpc/platforms/ \
-				   arch/powerpc/math-emu/
+				   arch/powerpc/math-emu/ \
+				   arch/powerpc/net/
 core-$(CONFIG_XMON)		+= arch/powerpc/xmon/
 core-$(CONFIG_KVM) 		+= arch/powerpc/kvm/
 
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index e472659..e980faa 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -71,6 +71,42 @@
 #define PPC_INST_ERATSX			0x7c000126
 #define PPC_INST_ERATSX_DOT		0x7c000127
 
+/* Misc instructions for BPF compiler */
+#define PPC_INST_LD			0xe8000000
+#define PPC_INST_LHZ			0xa0000000
+#define PPC_INST_LWZ			0x80000000
+#define PPC_INST_STD			0xf8000000
+#define PPC_INST_STDU			0xf8000001
+#define PPC_INST_MFLR			0x7c0802a6
+#define PPC_INST_MTLR			0x7c0803a6
+#define PPC_INST_CMPWI			0x2c000000
+#define PPC_INST_CMPDI			0x2c200000
+#define PPC_INST_CMPLW			0x7c000040
+#define PPC_INST_CMPLWI			0x28000000
+#define PPC_INST_ADDI			0x38000000
+#define PPC_INST_ADDIS			0x3c000000
+#define PPC_INST_ADD			0x7c000214
+#define PPC_INST_SUB			0x7c000050
+#define PPC_INST_BLR			0x4e800020
+#define PPC_INST_BLRL			0x4e800021
+#define PPC_INST_MULLW			0x7c0001d6
+#define PPC_INST_MULHWU			0x7c000016
+#define PPC_INST_MULLI			0x1c000000
+#define PPC_INST_DIVWU			0x7c0003d6
+#define PPC_INST_RLWINM			0x54000000
+#define PPC_INST_RLDICR			0x78000004
+#define PPC_INST_SLW			0x7c000030
+#define PPC_INST_SRW			0x7c000430
+#define PPC_INST_AND			0x7c000038
+#define PPC_INST_ANDDOT			0x7c000039
+#define PPC_INST_OR			0x7c000378
+#define PPC_INST_ANDI			0x70000000
+#define PPC_INST_ORI			0x60000000
+#define PPC_INST_ORIS			0x64000000
+#define PPC_INST_NEG			0x7c0000d0
+#define PPC_INST_BRANCH			0x48000000
+#define PPC_INST_BRANCH_COND		0x40800000
+
 /* macros to insert fields into opcodes */
 #define __PPC_RA(a)	(((a) & 0x1f) << 16)
 #define __PPC_RB(b)	(((b) & 0x1f) << 11)
@@ -83,6 +119,10 @@
 #define __PPC_T_TLB(t)	(((t) & 0x3) << 21)
 #define __PPC_WC(w)	(((w) & 0x3) << 21)
 #define __PPC_WS(w)	(((w) & 0x1f) << 11)
+#define __PPC_SH(s)	__PPC_WS(s)
+#define __PPC_MB(s)	(((s) & 0x1f) << 6)
+#define __PPC_ME(s)	(((s) & 0x1f) << 1)
+#define __PPC_BI(s)	(((s) & 0x1f) << 16)
 
 /*
  * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
diff --git a/arch/powerpc/net/Makefile b/arch/powerpc/net/Makefile
new file mode 100644
index 0000000..266b395
--- /dev/null
+++ b/arch/powerpc/net/Makefile
@@ -0,0 +1,4 @@
+#
+# Arch-specific network modules
+#
+obj-$(CONFIG_BPF_JIT) += bpf_jit_64.o bpf_jit_comp.o
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
new file mode 100644
index 0000000..af1ab5e
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit.h
@@ -0,0 +1,227 @@
+/* bpf_jit.h: BPF JIT compiler for PPC64
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#ifndef _BPF_JIT_H
+#define _BPF_JIT_H
+
+#define BPF_PPC_STACK_LOCALS	32
+#define BPF_PPC_STACK_BASIC	(48+64)
+#define BPF_PPC_STACK_SAVE	(18*8)
+#define BPF_PPC_STACKFRAME	(BPF_PPC_STACK_BASIC+BPF_PPC_STACK_LOCALS+ \
+				 BPF_PPC_STACK_SAVE)
+#define BPF_PPC_SLOWPATH_FRAME	(48+64)
+
+/*
+ * Generated code register usage:
+ *
+ * As normal PPC C ABI (e.g. r1=sp, r2=TOC), with:
+ *
+ * skb		r3	(Entry parameter)
+ * A register	r4
+ * X register	r5
+ * addr param	r6
+ * r7-r10	scratch
+ * skb->data	r14
+ * skb headlen	r15	(skb->len - skb->data_len)
+ * m[0]		r16
+ * m[...]	...
+ * m[15]	r31
+ */
+#define r_skb		3
+#define r_ret		3
+#define r_A		4
+#define r_X		5
+#define r_addr		6
+#define r_scratch1	7
+#define r_D		14
+#define r_HL		15
+#define r_M		16
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Assembly helpers from arch/powerpc/net/bpf_jit.S:
+ */
+extern u8 sk_load_word[], sk_load_half[], sk_load_byte[], sk_load_byte_msh[];
+
+#define FUNCTION_DESCR_SIZE	24
+
+/*
+ * 16-bit immediate helper macros: HA() is for use with sign-extending instrs
+ * (e.g. LD, ADDI).  If the bottom 16 bits is "-ve", add another bit into the
+ * top half to negate the effect (i.e. 0xffff + 1 = 0x(1)0000).
+ */
+#define IMM_H(i)		((uintptr_t)(i)>>16)
+#define IMM_HA(i)		(((uintptr_t)(i)>>16) +			      \
+				 (((uintptr_t)(i) & 0x8000) >> 15))
+#define IMM_L(i)		((uintptr_t)(i) & 0xffff)
+
+#define PLANT_INSTR(d, idx, instr)					      \
+	do { if (d) { (d)[idx] = instr; } idx++; } while (0)
+#define EMIT(instr)		PLANT_INSTR(image, ctx->idx, instr)
+
+#define PPC_NOP()		EMIT(PPC_INST_NOP)
+#define PPC_BLR()		EMIT(PPC_INST_BLR)
+#define PPC_BLRL()		EMIT(PPC_INST_BLRL)
+#define PPC_MTLR(r)		EMIT(PPC_INST_MTLR | __PPC_RT(r))
+#define PPC_ADDI(d, a, i)	EMIT(PPC_INST_ADDI | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | IMM_L(i))
+#define PPC_MR(d, a)		PPC_OR(d, a, a)
+#define PPC_LI(r, i)		PPC_ADDI(r, 0, i)
+#define PPC_ADDIS(d, a, i)	EMIT(PPC_INST_ADDIS |			      \
+				     __PPC_RS(d) | __PPC_RA(a) | IMM_L(i))
+#define PPC_LIS(r, i)		PPC_ADDIS(r, 0, i)
+#define PPC_STD(r, base, i)	EMIT(PPC_INST_STD | __PPC_RS(r) |	      \
+				     __PPC_RA(base) | ((i) & 0xfffc))
+
+#define PPC_LD(r, base, i)	EMIT(PPC_INST_LD | __PPC_RT(r) |	      \
+				     __PPC_RA(base) | IMM_L(i))
+#define PPC_LWZ(r, base, i)	EMIT(PPC_INST_LWZ | __PPC_RT(r) |	      \
+				     __PPC_RA(base) | IMM_L(i))
+#define PPC_LHZ(r, base, i)	EMIT(PPC_INST_LHZ | __PPC_RT(r) |	      \
+				     __PPC_RA(base) | IMM_L(i))
+/* Convenience helpers for the above with 'far' offsets: */
+#define PPC_LD_OFFS(r, base, i) do { if ((i) < 32768) PPC_LD(r, base, i);     \
+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
+			PPC_LD(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_LWZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LWZ(r, base, i);   \
+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
+			PPC_LWZ(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_LHZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LHZ(r, base, i);   \
+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
+			PPC_LHZ(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_CMPWI(a, i)		EMIT(PPC_INST_CMPWI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPDI(a, i)		EMIT(PPC_INST_CMPDI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPLWI(a, i)	EMIT(PPC_INST_CMPLWI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPLW(a, b)		EMIT(PPC_INST_CMPLW | __PPC_RA(a) | __PPC_RB(b))
+
+#define PPC_SUB(d, a, b)	EMIT(PPC_INST_SUB | __PPC_RT(d) |	      \
+				     __PPC_RB(a) | __PPC_RA(b))
+#define PPC_ADD(d, a, b)	EMIT(PPC_INST_ADD | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MUL(d, a, b)	EMIT(PPC_INST_MULLW | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MULHWU(d, a, b)	EMIT(PPC_INST_MULHWU | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MULI(d, a, i)	EMIT(PPC_INST_MULLI | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | IMM_L(i))
+#define PPC_DIVWU(d, a, b)	EMIT(PPC_INST_DIVWU | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_AND(d, a, b)	EMIT(PPC_INST_AND | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(b))
+#define PPC_ANDI(d, a, i)	EMIT(PPC_INST_ANDI | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | IMM_L(i))
+#define PPC_AND_DOT(d, a, b)	EMIT(PPC_INST_ANDDOT | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(b))
+#define PPC_OR(d, a, b)		EMIT(PPC_INST_OR | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(b))
+#define PPC_ORI(d, a, i)	EMIT(PPC_INST_ORI | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | IMM_L(i))
+#define PPC_ORIS(d, a, i)	EMIT(PPC_INST_ORIS | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | IMM_L(i))
+#define PPC_SLW(d, a, s)	EMIT(PPC_INST_SLW | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(s))
+#define PPC_SRW(d, a, s)	EMIT(PPC_INST_SRW | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(s))
+/* slwi = rlwinm Rx, Ry, n, 0, 31-n */
+#define PPC_SLWI(d, a, i)	EMIT(PPC_INST_RLWINM | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_SH(i) |	      \
+				     __PPC_MB(0) | __PPC_ME(31-(i)))
+/* srwi = rlwinm Rx, Ry, 32-n, n, 31 */
+#define PPC_SRWI(d, a, i)	EMIT(PPC_INST_RLWINM | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_SH(32-(i)) |	      \
+				     __PPC_MB(i) | __PPC_ME(31))
+/* sldi = rldicr Rx, Ry, n, 63-n */
+#define PPC_SLDI(d, a, i)	EMIT(PPC_INST_RLDICR | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_SH(i) |	      \
+				     __PPC_MB(63-(i)) | (((i) & 0x20) >> 4))
+#define PPC_NEG(d, a)		EMIT(PPC_INST_NEG | __PPC_RT(d) | __PPC_RA(a))
+
+/* Long jump; (unconditional 'branch') */
+#define PPC_JMP(dest)		EMIT(PPC_INST_BRANCH |			      \
+				     (((dest) - (ctx->idx * 4)) & 0x03fffffc))
+/* "cond" here covers BO:BI fields. */
+#define PPC_BCC_SHORT(cond, dest)	EMIT(PPC_INST_BRANCH_COND |	      \
+					     (((cond) & 0x3ff) << 16) |	      \
+					     (((dest) - (ctx->idx * 4)) &     \
+					      0xfffc))
+#define PPC_LI32(d, i)		do { PPC_LI(d, IMM_L(i));		      \
+		if ((u32)(uintptr_t)(i) >= 32768) {			      \
+			PPC_ADDIS(d, d, IMM_HA(i));			      \
+		} } while(0)
+#define PPC_LI64(d, i)		do {					      \
+		if (!((uintptr_t)(i) & 0xffffffff00000000ULL))		      \
+			PPC_LI32(d, i);					      \
+		else {							      \
+			PPC_LIS(d, ((uintptr_t)(i) >> 48));		      \
+			if ((uintptr_t)(i) & 0x0000ffff00000000ULL)	      \
+				PPC_ORI(d, d,				      \
+					((uintptr_t)(i) >> 32) & 0xffff);     \
+			PPC_SLDI(d, d, 32);				      \
+			if ((uintptr_t)(i) & 0x00000000ffff0000ULL)	      \
+				PPC_ORIS(d, d,				      \
+					 ((uintptr_t)(i) >> 16) & 0xffff);    \
+			if ((uintptr_t)(i) & 0x000000000000ffffULL)	      \
+				PPC_ORI(d, d, (uintptr_t)(i) & 0xffff);	      \
+		} } while (0);
+
+static inline bool is_nearbranch(int offset)
+{
+	return (offset < 32768) && (offset >= -32768);
+}
+
+/*
+ * The fly in the ointment of code size changing from pass to pass is
+ * avoided by padding the short branch case with a NOP.	 If code size differs
+ * with different branch reaches we will have the issue of code moving from
+ * one pass to the next and will need a few passes to converge on a stable
+ * state.
+ */
+#define PPC_BCC(cond, dest)	do {					      \
+		if (is_nearbranch((dest) - (ctx->idx * 4))) {		      \
+			PPC_BCC_SHORT(cond, dest);			      \
+			PPC_NOP();					      \
+		} else {						      \
+			/* Flip the 'T or F' bit to invert comparison */      \
+			PPC_BCC_SHORT(cond ^ COND_CMP_TRUE, (ctx->idx+2)*4);  \
+			PPC_JMP(dest);					      \
+		} } while(0)
+
+/* To create a branch condition, select a bit of cr0... */
+#define CR0_LT		0
+#define CR0_GT		1
+#define CR0_EQ		2
+/* ...and modify BO[3] */
+#define COND_CMP_TRUE	0x100
+#define COND_CMP_FALSE	0x000
+/* Together, they make all required comparisons: */
+#define COND_GT		(CR0_GT | COND_CMP_TRUE)
+#define COND_GE		(CR0_LT | COND_CMP_FALSE)
+#define COND_EQ		(CR0_EQ | COND_CMP_TRUE)
+#define COND_NE		(CR0_EQ | COND_CMP_FALSE)
+#define COND_LT		(CR0_LT | COND_CMP_TRUE)
+
+#define SEEN_DATAREF 0x10000 /* might call external helpers */
+#define SEEN_XREG    0x20000 /* X reg is used */
+#define SEEN_MEM     0x40000 /* SEEN_MEM+(1<<n) = use mem[n] for temporary
+			      * storage */
+#define SEEN_MEM_MSK 0x0ffff
+
+struct codegen_context {
+	unsigned int seen;
+	unsigned int idx;
+	int pc_ret0; /* bpf index of first RET #0 instruction (if any) */
+};
+
+#endif
+
+#endif
diff --git a/arch/powerpc/net/bpf_jit_64.S b/arch/powerpc/net/bpf_jit_64.S
new file mode 100644
index 0000000..ff4506e
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit_64.S
@@ -0,0 +1,138 @@
+/* bpf_jit.S: Packet/header access helper functions
+ * for PPC64 BPF compiler.
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <asm/ppc_asm.h>
+#include "bpf_jit.h"
+
+/*
+ * All of these routines are called directly from generated code,
+ * whose register usage is:
+ *
+ * r3		skb
+ * r4,r5	A,X
+ * r6		*** address parameter to helper ***
+ * r7-r10	scratch
+ * r14		skb->data
+ * r15		skb headlen
+ * r16-31	M[]
+ */
+
+/*
+ * To consider: These helpers are so small it could be better to just
+ * generate them inline.  Inline code can do the simple headlen check
+ * then branch directly to slow_path_XXX if required.  (In fact, could
+ * load a spare GPR with the address of slow_path_generic and pass size
+ * as an argument, making the call site a mtlr, li and bllr.)
+ *
+ * Technically, the "is addr < 0" check is unnecessary & slowing down
+ * the ABS path, as it's statically checked on generation.
+ */
+	.globl	sk_load_word
+sk_load_word:
+	cmpdi	r_addr, 0
+	blt	bpf_error
+	/* Are we accessing past headlen? */
+	subi	r_scratch1, r_HL, 4
+	cmpd	r_scratch1, r_addr
+	blt	bpf_slow_path_word
+	/* Nope, just hitting the header.  cr0 here is eq or gt! */
+	lwzx	r_A, r_D, r_addr
+	/* When big endian we don't need to byteswap. */
+	blr	/* Return success, cr0 != LT */
+
+	.globl	sk_load_half
+sk_load_half:
+	cmpdi	r_addr, 0
+	blt	bpf_error
+	subi	r_scratch1, r_HL, 2
+	cmpd	r_scratch1, r_addr
+	blt	bpf_slow_path_half
+	lhzx	r_A, r_D, r_addr
+	blr
+
+	.globl	sk_load_byte
+sk_load_byte:
+	cmpdi	r_addr, 0
+	blt	bpf_error
+	cmpd	r_HL, r_addr
+	ble	bpf_slow_path_byte
+	lbzx	r_A, r_D, r_addr
+	blr
+
+/*
+ * BPF_S_LDX_B_MSH: ldxb  4*([offset]&0xf)
+ * r_addr is the offset value, already known positive
+ */
+	.globl sk_load_byte_msh
+sk_load_byte_msh:
+	cmpd	r_HL, r_addr
+	ble	bpf_slow_path_byte_msh
+	lbzx	r_X, r_D, r_addr
+	rlwinm	r_X, r_X, 2, 32-4-2, 31-2
+	blr
+
+bpf_error:
+	/* Entered with cr0 = lt */
+	li	r3, 0
+	/* Generated code will 'blt epilogue', returning 0. */
+	blr
+
+/* Call out to skb_copy_bits:
+ * We'll need to back up our volatile regs first; we have
+ * local variable space at r1+(BPF_PPC_STACK_BASIC).
+ * Allocate a new stack frame here to remain ABI-compliant in
+ * stashing LR.
+ */
+#define bpf_slow_path_common(SIZE)				\
+	mflr	r0;						\
+	std	r0, 16(r1);					\
+	/* R3 goes in parameter space of caller's frame */	\
+	std	r_skb, (BPF_PPC_STACKFRAME+48)(r1);		\
+	std	r_A, (BPF_PPC_STACK_BASIC+(0*8))(r1);		\
+	std	r_X, (BPF_PPC_STACK_BASIC+(1*8))(r1);		\
+	addi	r5, r1, BPF_PPC_STACK_BASIC+(2*8);		\
+	stdu	r1, -BPF_PPC_SLOWPATH_FRAME(r1);		\
+	/* R3 = r_skb, as passed */				\
+	mr	r4, r_addr;					\
+	li	r6, SIZE;					\
+	bl	skb_copy_bits;					\
+	/* R3 = 0 on success */					\
+	addi	r1, r1, BPF_PPC_SLOWPATH_FRAME;			\
+	ld	r0, 16(r1);					\
+	ld	r_A, (BPF_PPC_STACK_BASIC+(0*8))(r1);		\
+	ld	r_X, (BPF_PPC_STACK_BASIC+(1*8))(r1);		\
+	mtlr	r0;						\
+	cmpdi	r3, 0;						\
+	blt	bpf_error;	/* cr0 = LT */			\
+	ld	r_skb, (BPF_PPC_STACKFRAME+48)(r1);		\
+	/* Great success! */
+
+bpf_slow_path_word:
+	bpf_slow_path_common(4)
+	/* Data value is on stack, and cr0 != LT */
+	lwz	r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	blr
+
+bpf_slow_path_half:
+	bpf_slow_path_common(2)
+	lhz	r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	blr
+
+bpf_slow_path_byte:
+	bpf_slow_path_common(1)
+	lbz	r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	blr
+
+bpf_slow_path_byte_msh:
+	bpf_slow_path_common(1)
+	lbz	r_X, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	rlwinm	r_X, r_X, 2, 32-4-2, 31-2
+	blr
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
new file mode 100644
index 0000000..73619d3
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -0,0 +1,694 @@
+/* bpf_jit_comp.c: BPF JIT compiler for PPC64
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * Based on the x86 BPF compiler, by Eric Dumazet (eric.dumazet@gmail.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#include <linux/moduleloader.h>
+#include <asm/cacheflush.h>
+#include <linux/netdevice.h>
+#include <linux/filter.h>
+#include "bpf_jit.h"
+
+#ifndef __BIG_ENDIAN
+/* There are endianness assumptions herein. */
+#error "Little-endian PPC not supported in BPF compiler"
+#endif
+
+int bpf_jit_enable __read_mostly;
+
+
+static inline void bpf_flush_icache(void *start, void *end)
+{
+	smp_wmb();
+	flush_icache_range((unsigned long)start, (unsigned long)end);
+}
+
+static void bpf_jit_build_prologue(struct sk_filter *fp, u32 *image,
+				   struct codegen_context *ctx)
+{
+	int i;
+	const struct sock_filter *filter = fp->insns;
+
+	if (ctx->seen & (SEEN_MEM | SEEN_DATAREF)) {
+		/* Make stackframe */
+		if (ctx->seen & SEEN_DATAREF) {
+			/* If we call any helpers (for loads), save LR */
+			EMIT(PPC_INST_MFLR | __PPC_RT(0));
+			PPC_STD(0, 1, 16);
+
+			/* Back up non-volatile regs. */
+			PPC_STD(r_D, 1, -(8*(32-r_D)));
+			PPC_STD(r_HL, 1, -(8*(32-r_HL)));
+		}
+		if (ctx->seen & SEEN_MEM) {
+			/*
+			 * Conditionally save regs r15-r31 as some will be used
+			 * for M[] data.
+			 */
+			for (i = r_M; i < (r_M+16); i++) {
+				if (ctx->seen & (1 << (i-r_M)))
+					PPC_STD(i, 1, -(8*(32-i)));
+			}
+		}
+		EMIT(PPC_INST_STDU | __PPC_RS(1) | __PPC_RA(1) |
+		     (-BPF_PPC_STACKFRAME & 0xfffc));
+	}
+
+	if (ctx->seen & SEEN_DATAREF) {
+		/*
+		 * If this filter needs to access skb data,
+		 * prepare r_D and r_HL:
+		 *  r_HL = skb->len - skb->data_len
+		 *  r_D	 = skb->data
+		 */
+		PPC_LWZ_OFFS(r_scratch1, r_skb, offsetof(struct sk_buff,
+							 data_len));
+		PPC_LWZ_OFFS(r_HL, r_skb, offsetof(struct sk_buff, len));
+		PPC_SUB(r_HL, r_HL, r_scratch1);
+		PPC_LD_OFFS(r_D, r_skb, offsetof(struct sk_buff, data));
+	}
+
+	if (ctx->seen & SEEN_XREG) {
+		/*
+		 * TODO: Could also detect whether first instr. sets X and
+		 * avoid this (as below, with A).
+		 */
+		PPC_LI(r_X, 0);
+	}
+
+	switch (filter[0].code) {
+	case BPF_S_RET_K:
+	case BPF_S_LD_W_LEN:
+	case BPF_S_ANC_PROTOCOL:
+	case BPF_S_ANC_IFINDEX:
+	case BPF_S_ANC_MARK:
+	case BPF_S_ANC_RXHASH:
+	case BPF_S_ANC_CPU:
+	case BPF_S_ANC_QUEUE:
+	case BPF_S_LD_W_ABS:
+	case BPF_S_LD_H_ABS:
+	case BPF_S_LD_B_ABS:
+		/* first instruction sets A register (or is RET 'constant') */
+		break;
+	default:
+		/* make sure we dont leak kernel information to user */
+		PPC_LI(r_A, 0);
+	}
+}
+
+static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
+{
+	int i;
+
+	if (ctx->seen & (SEEN_MEM | SEEN_DATAREF)) {
+		PPC_ADDI(1, 1, BPF_PPC_STACKFRAME);
+		if (ctx->seen & SEEN_DATAREF) {
+			PPC_LD(0, 1, 16);
+			PPC_MTLR(0);
+			PPC_LD(r_D, 1, -(8*(32-r_D)));
+			PPC_LD(r_HL, 1, -(8*(32-r_HL)));
+		}
+		if (ctx->seen & SEEN_MEM) {
+			/* Restore any saved non-vol registers */
+			for (i = r_M; i < (r_M+16); i++) {
+				if (ctx->seen & (1 << (i-r_M)))
+					PPC_LD(i, 1, -(8*(32-i)));
+			}
+		}
+	}
+	/* The RETs have left a return value in R3. */
+
+	PPC_BLR();
+}
+
+/* Assemble the body code between the prologue & epilogue. */
+static int bpf_jit_build_body(struct sk_filter *fp, u32 *image,
+			      struct codegen_context *ctx,
+			      unsigned int *addrs)
+{
+	const struct sock_filter *filter = fp->insns;
+	int flen = fp->len;
+	u8 *func;
+	unsigned int true_cond;
+	int i;
+
+	/* Start of epilogue code */
+	unsigned int exit_addr = addrs[flen];
+
+	for (i = 0; i < flen; i++) {
+		unsigned int K = filter[i].k;
+
+		/*
+		 * addrs[] maps a BPF bytecode address into a real offset from
+		 * the start of the body code.
+		 */
+		addrs[i] = ctx->idx * 4;
+
+		switch (filter[i].code) {
+			/*** ALU ops ***/
+		case BPF_S_ALU_ADD_X: /* A += X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_ADD(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_ADD_K: /* A += K; */
+			if (!K)
+				break;
+			PPC_ADDI(r_A, r_A, IMM_L(K));
+			if (K >= 32768)
+				PPC_ADDIS(r_A, r_A, IMM_HA(K));
+			break;
+		case BPF_S_ALU_SUB_X: /* A -= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_SUB(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_SUB_K: /* A -= K */
+			if (!K)
+				break;
+			PPC_ADDI(r_A, r_A, IMM_L(-K));
+			if (K >= 32768)
+				PPC_ADDIS(r_A, r_A, IMM_HA(-K));
+			break;
+		case BPF_S_ALU_MUL_X: /* A *= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_MUL(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_MUL_K: /* A *= K */
+			if (K < 32768)
+				PPC_MULI(r_A, r_A, K);
+			else {
+				PPC_LI32(r_scratch1, K);
+				PPC_MUL(r_A, r_A, r_scratch1);
+			}
+			break;
+		case BPF_S_ALU_DIV_X: /* A /= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_CMPWI(r_X, 0);
+			if (ctx->pc_ret0 != -1) {
+				PPC_BCC(COND_EQ, addrs[ctx->pc_ret0]);
+			} else {
+				/*
+				 * Exit, returning 0; first pass hits here
+				 * (longer worst-case code size).
+				 */
+				PPC_BCC_SHORT(COND_NE, (ctx->idx*4)+12);
+				PPC_LI(r_ret, 0);
+				PPC_JMP(exit_addr);
+			}
+			PPC_DIVWU(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
+			PPC_LI32(r_scratch1, K);
+			/* Top 32 bits of 64bit result -> A */
+			PPC_MULHWU(r_A, r_A, r_scratch1);
+			break;
+		case BPF_S_ALU_AND_X:
+			ctx->seen |= SEEN_XREG;
+			PPC_AND(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_AND_K:
+			if (!IMM_H(K))
+				PPC_ANDI(r_A, r_A, K);
+			else {
+				PPC_LI32(r_scratch1, K);
+				PPC_AND(r_A, r_A, r_scratch1);
+			}
+			break;
+		case BPF_S_ALU_OR_X:
+			ctx->seen |= SEEN_XREG;
+			PPC_OR(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_OR_K:
+			if (IMM_L(K))
+				PPC_ORI(r_A, r_A, IMM_L(K));
+			if (K >= 65536)
+				PPC_ORIS(r_A, r_A, IMM_H(K));
+			break;
+		case BPF_S_ALU_LSH_X: /* A <<= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_SLW(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_LSH_K:
+			if (K == 0)
+				break;
+			else
+				PPC_SLWI(r_A, r_A, K);
+			break;
+		case BPF_S_ALU_RSH_X: /* A >>= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_SRW(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_RSH_K: /* A >>= K; */
+			if (K == 0)
+				break;
+			else
+				PPC_SRWI(r_A, r_A, K);
+			break;
+		case BPF_S_ALU_NEG:
+			PPC_NEG(r_A, r_A);
+			break;
+		case BPF_S_RET_K:
+			PPC_LI32(r_ret, K);
+			if (!K) {
+				if (ctx->pc_ret0 == -1)
+					ctx->pc_ret0 = i;
+			}
+			/*
+			 * If this isn't the very last instruction, branch to
+			 * the epilogue if we've stuff to clean up.  Otherwise,
+			 * if there's nothing to tidy, just return.  If we /are/
+			 * the last instruction, we're about to fall through to
+			 * the epilogue to return.
+			 */
+			if (i != flen - 1) {
+				/*
+				 * Note: 'seen' is properly valid only on pass
+				 * #2.	Both parts of this conditional are the
+				 * same instruction size though, meaning the
+				 * first pass will still correctly determine the
+				 * code size/addresses.
+				 */
+				if (ctx->seen)
+					PPC_JMP(exit_addr);
+				else
+					PPC_BLR();
+			}
+			break;
+		case BPF_S_RET_A:
+			PPC_MR(r_ret, r_A);
+			if (i != flen - 1) {
+				if (ctx->seen)
+					PPC_JMP(exit_addr);
+				else
+					PPC_BLR();
+			}
+			break;
+		case BPF_S_MISC_TAX: /* X = A */
+			PPC_MR(r_X, r_A);
+			break;
+		case BPF_S_MISC_TXA: /* A = X */
+			ctx->seen |= SEEN_XREG;
+			PPC_MR(r_A, r_X);
+			break;
+
+			/*** Constant loads/M[] access ***/
+		case BPF_S_LD_IMM: /* A = K */
+			PPC_LI32(r_A, K);
+			break;
+		case BPF_S_LDX_IMM: /* X = K */
+			PPC_LI32(r_X, K);
+			break;
+		case BPF_S_LD_MEM: /* A = mem[K] */
+			PPC_MR(r_A, r_M + (K & 0xf));
+			ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_LDX_MEM: /* X = mem[K] */
+			PPC_MR(r_X, r_M + (K & 0xf));
+			ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_ST: /* mem[K] = A */
+			PPC_MR(r_M + (K & 0xf), r_A);
+			ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_STX: /* mem[K] = X */
+			PPC_MR(r_M + (K & 0xf), r_X);
+			ctx->seen |= SEEN_XREG | SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_LD_W_LEN: /*	A = skb->len; */
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+			PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff, len));
+			break;
+		case BPF_S_LDX_W_LEN: /* X = skb->len; */
+			PPC_LWZ_OFFS(r_X, r_skb, offsetof(struct sk_buff, len));
+			break;
+
+			/*** Ancillary info loads ***/
+
+			/* None of the BPF_S_ANC* codes appear to be passed by
+			 * sk_chk_filter().  The interpreter and the x86 BPF
+			 * compiler implement them so we do too -- they may be
+			 * planted in future.
+			 */
+		case BPF_S_ANC_PROTOCOL: /* A = ntohs(skb->protocol); */
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+						  protocol) != 2);
+			PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  protocol));
+			/* ntohs is a NOP with BE loads. */
+			break;
+		case BPF_S_ANC_IFINDEX:
+			PPC_LD_OFFS(r_scratch1, r_skb, offsetof(struct sk_buff,
+								dev));
+			PPC_CMPDI(r_scratch1, 0);
+			if (ctx->pc_ret0 != -1) {
+				PPC_BCC(COND_EQ, addrs[ctx->pc_ret0]);
+			} else {
+				/* Exit, returning 0; first pass hits here. */
+				PPC_BCC_SHORT(COND_NE, (ctx->idx*4)+12);
+				PPC_LI(r_ret, 0);
+				PPC_JMP(exit_addr);
+			}
+			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
+						  ifindex) != 4);
+			PPC_LWZ_OFFS(r_A, r_scratch1,
+				     offsetof(struct net_device, ifindex));
+			break;
+		case BPF_S_ANC_MARK:
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+			PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  mark));
+			break;
+		case BPF_S_ANC_RXHASH:
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, rxhash) != 4);
+			PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  rxhash));
+			break;
+		case BPF_S_ANC_QUEUE:
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+						  queue_mapping) != 2);
+			PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  queue_mapping));
+			break;
+		case BPF_S_ANC_CPU:
+#ifdef CONFIG_SMP
+			/*
+			 * PACA ptr is r13:
+			 * raw_smp_processor_id() = local_paca->paca_index
+			 */
+			BUILD_BUG_ON(FIELD_SIZEOF(struct paca_struct,
+						  paca_index) != 2);
+			PPC_LHZ_OFFS(r_A, 13,
+				     offsetof(struct paca_struct, paca_index));
+#else
+			PPC_LI(r_A, 0);
+#endif
+			break;
+
+			/*** Absolute loads from packet header/data ***/
+		case BPF_S_LD_W_ABS:
+			func = sk_load_word;
+			goto common_load;
+		case BPF_S_LD_H_ABS:
+			func = sk_load_half;
+			goto common_load;
+		case BPF_S_LD_B_ABS:
+			func = sk_load_byte;
+		common_load:
+			/*
+			 * Load from [K].  Reference with the (negative)
+			 * SKF_NET_OFF/SKF_LL_OFF offsets is unsupported.
+			 */
+			ctx->seen |= SEEN_DATAREF;
+			if ((int)K < 0)
+				return -ENOTSUPP;
+			PPC_LI64(r_scratch1, func);
+			PPC_MTLR(r_scratch1);
+			PPC_LI32(r_addr, K);
+			PPC_BLRL();
+			/*
+			 * Helper returns 'lt' condition on error, and an
+			 * appropriate return value in r3
+			 */
+			PPC_BCC(COND_LT, exit_addr);
+			break;
+
+			/*** Indirect loads from packet header/data ***/
+		case BPF_S_LD_W_IND:
+			func = sk_load_word;
+			goto common_load_ind;
+		case BPF_S_LD_H_IND:
+			func = sk_load_half;
+			goto common_load_ind;
+		case BPF_S_LD_B_IND:
+			func = sk_load_byte;
+		common_load_ind:
+			/*
+			 * Load from [X + K].  Negative offsets are tested for
+			 * in the helper functions, and result in a 'ret 0'.
+			 */
+			ctx->seen |= SEEN_DATAREF | SEEN_XREG;
+			PPC_LI64(r_scratch1, func);
+			PPC_MTLR(r_scratch1);
+			PPC_ADDI(r_addr, r_X, IMM_L(K));
+			if (K >= 32768)
+				PPC_ADDIS(r_addr, r_addr, IMM_HA(K));
+			PPC_BLRL();
+			/* If error, cr0.LT set */
+			PPC_BCC(COND_LT, exit_addr);
+			break;
+
+		case BPF_S_LDX_B_MSH:
+			/*
+			 * x86 version drops packet (RET 0) when K<0, whereas
+			 * interpreter does allow K<0 (__load_pointer, special
+			 * ancillary data).  common_load returns ENOTSUPP if K<0,
+			 * so we fall back to interpreter & filter works.
+			 */
+			func = sk_load_byte_msh;
+			goto common_load;
+			break;
+
+			/*** Jump and branches ***/
+		case BPF_S_JMP_JA:
+			if (K != 0)
+				PPC_JMP(addrs[i + 1 + K]);
+			break;
+
+		case BPF_S_JMP_JGT_K:
+		case BPF_S_JMP_JGT_X:
+			true_cond = COND_GT;
+			goto cond_branch;
+		case BPF_S_JMP_JGE_K:
+		case BPF_S_JMP_JGE_X:
+			true_cond = COND_GE;
+			goto cond_branch;
+		case BPF_S_JMP_JEQ_K:
+		case BPF_S_JMP_JEQ_X:
+			true_cond = COND_EQ;
+			goto cond_branch;
+		case BPF_S_JMP_JSET_K:
+		case BPF_S_JMP_JSET_X:
+			true_cond = COND_NE;
+			/* Fall through */
+		cond_branch:
+			/* same targets, can avoid doing the test :) */
+			if (filter[i].jt == filter[i].jf) {
+				if (filter[i].jt > 0)
+					PPC_JMP(addrs[i + 1 + filter[i].jt]);
+				break;
+			}
+
+			switch (filter[i].code) {
+			case BPF_S_JMP_JGT_X:
+			case BPF_S_JMP_JGE_X:
+			case BPF_S_JMP_JEQ_X:
+				ctx->seen |= SEEN_XREG;
+				PPC_CMPLW(r_A, r_X);
+				break;
+			case BPF_S_JMP_JSET_X:
+				ctx->seen |= SEEN_XREG;
+				PPC_AND_DOT(r_scratch1, r_A, r_X);
+				break;
+			case BPF_S_JMP_JEQ_K:
+			case BPF_S_JMP_JGT_K:
+			case BPF_S_JMP_JGE_K:
+				if (K < 32768)
+					PPC_CMPLWI(r_A, K);
+				else {
+					PPC_LI32(r_scratch1, K);
+					PPC_CMPLW(r_A, r_scratch1);
+				}
+				break;
+			case BPF_S_JMP_JSET_K:
+				if (K < 32768)
+					/* PPC_ANDI is /only/ dot-form */
+					PPC_ANDI(r_scratch1, r_A, K);
+				else {
+					PPC_LI32(r_scratch1, K);
+					PPC_AND_DOT(r_scratch1, r_A,
+						    r_scratch1);
+				}
+				break;
+			}
+			/* Sometimes branches are constructed "backward", with
+			 * the false path being the branch and true path being
+			 * a fallthrough to the next instruction.
+			 */
+			if (filter[i].jt == 0)
+				/* Swap the sense of the branch */
+				PPC_BCC(true_cond ^ COND_CMP_TRUE,
+					addrs[i + 1 + filter[i].jf]);
+			else {
+				PPC_BCC(true_cond, addrs[i + 1 + filter[i].jt]);
+				if (filter[i].jf != 0)
+					PPC_JMP(addrs[i + 1 + filter[i].jf]);
+			}
+			break;
+		default:
+			/* The filter contains something cruel & unusual.
+			 * We don't handle it, but also there shouldn't be
+			 * anything missing from our list.
+			 */
+			if (printk_ratelimit())
+				pr_err("BPF filter opcode %04x (@%d) unsupported\n",
+				       filter[i].code, i);
+			return -ENOTSUPP;
+		}
+
+	}
+	/* Set end-of-body-code address for exit. */
+	addrs[i] = ctx->idx * 4;
+
+	return 0;
+}
+
+void bpf_jit_compile(struct sk_filter *fp)
+{
+	unsigned int proglen;
+	unsigned int alloclen;
+	u32 *image = NULL;
+	u32 *code_base;
+	unsigned int *addrs;
+	struct codegen_context cgctx;
+	int pass;
+	int flen = fp->len;
+
+	if (!bpf_jit_enable)
+		return;
+
+	addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL);
+	if (addrs == NULL)
+		return;
+
+	/*
+	 * There are multiple assembly passes as the generated code will change
+	 * size as it settles down, figuring out the max branch offsets/exit
+	 * paths required.
+	 *
+	 * The range of standard conditional branches is +/- 32Kbytes.	Since
+	 * BPF_MAXINSNS = 4096, we can only jump from (worst case) start to
+	 * finish with 8 bytes/instruction.  Not feasible, so long jumps are
+	 * used, distinct from short branches.
+	 *
+	 * Current:
+	 *
+	 * For now, both branch types assemble to 2 words (short branches padded
+	 * with a NOP); this is less efficient, but assembly will always complete
+	 * after exactly 3 passes:
+	 *
+	 * First pass: No code buffer; Program is "faux-generated" -- no code
+	 * emitted but maximum size of output determined (and addrs[] filled
+	 * in).	 Also, we note whether we use M[], whether we use skb data, etc.
+	 * All generation choices assumed to be 'worst-case', e.g. branches all
+	 * far (2 instructions), return path code reduction not available, etc.
+	 *
+	 * Second pass: Code buffer allocated with size determined previously.
+	 * Prologue generated to support features we have seen used.  Exit paths
+	 * determined and addrs[] is filled in again, as code may be slightly
+	 * smaller as a result.
+	 *
+	 * Third pass: Code generated 'for real', and branch destinations
+	 * determined from now-accurate addrs[] map.
+	 *
+	 * Ideal:
+	 *
+	 * If we optimise this, near branches will be shorter.	On the
+	 * first assembly pass, we should err on the side of caution and
+	 * generate the biggest code.  On subsequent passes, branches will be
+	 * generated short or long and code size will reduce.  With smaller
+	 * code, more branches may fall into the short category, and code will
+	 * reduce more.
+	 *
+	 * Finally, if we see one pass generate code the same size as the
+	 * previous pass we have converged and should now generate code for
+	 * real.  Allocating at the end will also save the memory that would
+	 * otherwise be wasted by the (small) current code shrinkage.
+	 * Preferably, we should do a small number of passes (e.g. 5) and if we
+	 * haven't converged by then, get impatient and force code to generate
+	 * as-is, even if the odd branch would be left long.  The chances of a
+	 * long jump are tiny with all but the most enormous of BPF filter
+	 * inputs, so we should usually converge on the third pass.
+	 */
+
+	cgctx.idx = 0;
+	cgctx.seen = 0;
+	cgctx.pc_ret0 = -1;
+	/* Scouting faux-generate pass 0 */
+	if (bpf_jit_build_body(fp, 0, &cgctx, addrs))
+		/* We hit something illegal or unsupported. */
+		goto out;
+
+	/*
+	 * Pretend to build prologue, given the features we've seen.  This will
+	 * update ctgtx.idx as it pretends to output instructions, then we can
+	 * calculate total size from idx.
+	 */
+	bpf_jit_build_prologue(fp, 0, &cgctx);
+	bpf_jit_build_epilogue(0, &cgctx);
+
+	proglen = cgctx.idx * 4;
+	alloclen = proglen + FUNCTION_DESCR_SIZE;
+	image = module_alloc(max_t(unsigned int, alloclen,
+				   sizeof(struct work_struct)));
+	if (!image)
+		goto out;
+
+	code_base = image + (FUNCTION_DESCR_SIZE/4);
+
+	/* Code generation passes 1-2 */
+	for (pass = 1; pass < 3; pass++) {
+		/* Now build the prologue, body code & epilogue for real. */
+		cgctx.idx = 0;
+		bpf_jit_build_prologue(fp, code_base, &cgctx);
+		bpf_jit_build_body(fp, code_base, &cgctx, addrs);
+		bpf_jit_build_epilogue(code_base, &cgctx);
+
+		if (bpf_jit_enable > 1)
+			pr_info("Pass %d: shrink = %d, seen = 0x%x\n", pass,
+				proglen - (cgctx.idx * 4), cgctx.seen);
+	}
+
+	if (bpf_jit_enable > 1)
+		pr_info("flen=%d proglen=%u pass=%d image=%p\n",
+		       flen, proglen, pass, image);
+
+	if (image) {
+		if (bpf_jit_enable > 1)
+			print_hex_dump(KERN_ERR, "JIT code: ",
+				       DUMP_PREFIX_ADDRESS,
+				       16, 1, code_base,
+				       proglen, false);
+
+		bpf_flush_icache(code_base, code_base + (proglen/4));
+		/* Function descriptor nastiness: Address + TOC */
+		((u64 *)image)[0] = (u64)code_base;
+		((u64 *)image)[1] = local_paca->kernel_toc;
+		fp->bpf_func = (void *)image;
+	}
+out:
+	kfree(addrs);
+	return;
+}
+
+static void jit_free_defer(struct work_struct *arg)
+{
+	module_free(NULL, arg);
+}
+
+/* run from softirq, we must use a work_struct to call
+ * module_free() from process context
+ */
+void bpf_jit_free(struct sk_filter *fp)
+{
+	if (fp->bpf_func != sk_run_filter) {
+		struct work_struct *work = (struct work_struct *)fp->bpf_func;
+
+		INIT_WORK(work, jit_free_defer);
+		schedule_work(work);
+	}
+}

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox