LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH] net: filter: BPF 'JIT' compiler for PPC64
From: Eric Dumazet @ 2011-07-18  8:39 UTC (permalink / raw)
  To: Matt Evans; +Cc: netdev, linuxppc-dev
In-Reply-To: <4E23E5C3.1070209@ozlabs.org>

On Monday 18 July 2011 at 17:50 +1000, Matt Evans wrote:
> An implementation of a code generator for BPF programs to speed up packet
> filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
> 
> Filter code is generated as an ABI-compliant function in module_alloc()'d mem
> with stackframe & prologue/epilogue generated if required (simple filters don't
> need anything more than an li/blr).  The filter's local variables, M[], live in
> registers.  Supports all BPF opcodes, although "complicated" loads from negative
> packet offsets (e.g. SKF_LL_OFF) are not yet supported.
> 
> There are a couple of further optimisations left for future work; many-pass
> assembly with branch-reach reduction and a register allocator to push M[]
> variables into volatile registers would improve the code quality further.
> 
> This currently supports big-endian 64-bit PowerPC only (but is fairly simple
> to port to PPC32 or LE!).
> 
> Enabled in the same way as x86-64:
> 
> 	echo 1 > /proc/sys/net/core/bpf_jit_enable
> 
> Or, enabled with extra debug output:
> 
> 	echo 2 > /proc/sys/net/core/bpf_jit_enable
> 
> Signed-off-by: Matt Evans <matt@ozlabs.org>
> ---
> 
> Since the RFC post, this has incorporated the bugfixes/tidies from review plus a
> couple more found in further testing, plus some general/comment tidies.

Hi Matt

A small note about SEEN_XREG usage on PPC compared with x86_64:

On x86_64, XREG is stored in EBX, so I had to save/restore it in the
function prologue/epilogue, and set it to zero in the prologue to avoid
leaking kernel information.

On PPC, you chose a scratch register, so you only have to zero it in the
function prologue, and only if X is ever read.

So on PPC, SEEN_XREG only needs to be set if X is read, not if it is written.

So you don't have to set the SEEN_XREG bit in this part:

> +		case BPF_S_MISC_TAX: /* X = A */
> +			ctx->seen |= SEEN_XREG;
> +			PPC_MR(r_X, r_A);
> +			break;

and in this part :

> +		case BPF_S_LDX_IMM: /* X = K */
> +			ctx->seen |= SEEN_XREG;
> +			PPC_LI32(r_X, K);
> +			break;

and :

> +		case BPF_S_LDX_MEM: /* X = mem[K] */
> +			PPC_MR(r_X, r_M + (K & 0xf));
> +			ctx->seen |= SEEN_XREG | SEEN_MEM | (1<<(K & 0xf));
> +			break;

and :

> +		case BPF_S_LDX_W_LEN: /* X = skb->len; */
> +			ctx->seen |= SEEN_XREG;
> +			PPC_LWZ_OFFS(r_X, r_skb, offsetof(struct sk_buff, len));
> +			break;
> +
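
To illustrate the point, here is a minimal, self-contained sketch in plain
C. It is not the JIT code from the patch: the opcode enum, the SEEN_XREG
value and the helper names are hypothetical stand-ins, chosen only to show
that writes to X need not set the flag while reads must.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature opcode set standing in for the BPF_S_* codes. */
enum op {
    OP_TAX,        /* X = A     : writes X */
    OP_LDX_IMM,    /* X = K     : writes X */
    OP_TXA,        /* A = X     : reads X  */
    OP_ALU_ADD_X,  /* A += X    : reads X  */
    OP_RET_A       /* return A  : does not touch X */
};

#define SEEN_XREG 0x1u  /* illustrative value, not the kernel's */

/* True when the opcode consumes the current value of X. */
static bool op_reads_x(enum op o)
{
    return o == OP_TXA || o == OP_ALU_ADD_X;
}

/* Scan a filter; only instructions that read X set SEEN_XREG. */
static unsigned int scan_filter(const enum op *prog, size_t len)
{
    unsigned int seen = 0;
    size_t i;

    for (i = 0; i < len; i++)
        if (op_reads_x(prog[i]))
            seen |= SEEN_XREG;
    return seen;
}

/* With X in a scratch register, the prologue zeroes it only if it is read. */
static bool prologue_must_zero_x(unsigned int seen)
{
    return (seen & SEEN_XREG) != 0;
}
```

A filter made only of TAX/LDX-style writes therefore gets no zeroing
instruction in its prologue, while any filter that reads X does.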

^ permalink raw reply

* linux-next: build failure after merge of the final tree
From: Stephen Rothwell @ 2011-07-18  9:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev
  Cc: linux-next, linux-kernel, Avi Kivity, Marcelo Tosatti

[-- Attachment #1: Type: text/plain, Size: 489 bytes --]

Hi all,

After merging the final tree, today's linux-next build (powerpc
allyesconfig) failed like this:

arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:1151: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1160: Error: attempt to move .org backwards

This is probably powerpc or kvm tree related.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* [UPDATED PATCH v2] powerpc32: Kexec support for PPC440X chipsets
From: Suzuki K. Poulose @ 2011-07-18 13:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Kumar Gala
  Cc: Suzuki Poulose, Sebastian Andrzej Siewior, kexec, lkml,
	Josh Boyer, Paul Mackerras, linux ppc dev, Vivek Goyal

UPDATE: Minor update in Copyright assignment in misc_32.S
        Added requirement of upstream kexec-tools.

Changes from v1: Uses a tmp mapping in the other address space to setup
                 the 1:1 mapping (suggested by Sebastian Andrzej Siewior).

Note 1: Should we do the same for kernel entry code for PPC44x ?

This patch adds kexec support for PPC440 based chipsets. This work is based
on the KEXEC patches for FSL BookE.

The FSL BookE patch and the code flow can be found at the link below:

	http://patchwork.ozlabs.org/patch/49359/

Steps:

1) Invalidate all the TLB entries except the one this code is run from
2) Create a tmp mapping for our code in the other address space and jump to it
3) Invalidate the entry we used
4) Create a 1:1 mapping for 0-2GiB in blocks of 256M
5) Jump to the new 1:1 mapping and invalidate the tmp mapping
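
The arithmetic behind step 4 can be sketched in plain C as follows. This is
only an illustration of how eight 256M blocks cover 0-2GiB with EPN == RPN;
the struct and function names are made up for the example, and no TLB
attribute or size bits are modelled.

```c
#include <stdint.h>

#define MAP_CHUNK_SHIFT 28  /* 256 MiB = 1 << 28 */
#define MAP_CHUNKS      8   /* 8 * 256 MiB = 2 GiB */

/* Hypothetical container for one identity-mapped block. */
struct ident_map {
    uint32_t epn;  /* effective (virtual) base address */
    uint32_t rpn;  /* real (physical) base address */
};

/* Identity mapping for block n, where n is 0..MAP_CHUNKS-1. */
static struct ident_map ident_map_for(unsigned int n)
{
    struct ident_map m;

    m.epn = (uint32_t)n << MAP_CHUNK_SHIFT;
    m.rpn = m.epn;  /* 1:1, so physical == virtual */
    return m;
}

/* Total number of bytes covered by all blocks. */
static uint64_t ident_map_span(void)
{
    return (uint64_t)MAP_CHUNKS << MAP_CHUNK_SHIFT;
}
```

Block 7 starts at 0x70000000, so the last mapped byte is 0x7fffffff.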

I have tested this patch on Ebony and Sequoia boards, and on Virtex on QEMU.
It would be great if somebody could test this on the other boards.

You need the latest snapshot of kexec-tools for ppc440x support, available at

 git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git

Signed-off-by: 	Suzuki Poulose <suzuki@in.ibm.com>
Cc:	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

 arch/powerpc/Kconfig             |    2 
 arch/powerpc/include/asm/kexec.h |    2 
 arch/powerpc/kernel/misc_32.S    |  171 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 173 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 423145a6..d04fae0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -349,7 +349,7 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
 
 config KEXEC
 	bool "kexec system call (EXPERIMENTAL)"
-	depends on (PPC_BOOK3S || FSL_BOOKE) && EXPERIMENTAL
+	depends on (PPC_BOOK3S || FSL_BOOKE || (44x && !SMP && !47x)) && EXPERIMENTAL
 	help
 	  kexec is a system call that implements the ability to shutdown your
 	  current kernel, and to start another kernel.  It is like a reboot
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 8a33698..f921eb1 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -2,7 +2,7 @@
 #define _ASM_POWERPC_KEXEC_H
 #ifdef __KERNEL__
 
-#ifdef CONFIG_FSL_BOOKE
+#if defined(CONFIG_FSL_BOOKE) || defined(CONFIG_44x)
 
 /*
  * On FSL-BookE we setup a 1:1 mapping which covers the first 2GiB of memory
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 998a100..f7d760a 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -8,6 +8,8 @@
  * kexec bits:
  * Copyright (C) 2002-2003 Eric Biederman  <ebiederm@xmission.com>
  * GameCube/ppc32 port Copyright (C) 2004 Albert Herranz
+ * PPC44x port. Copyright (C) 2011,  IBM Corporation
+ * 		Author: Suzuki Poulose <suzuki@in.ibm.com>
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -736,6 +738,175 @@ relocate_new_kernel:
 	mr      r5, r31
 
 	li	r0, 0
+#elif defined(CONFIG_44x)  && !defined(CONFIG_47x)
+
+/*
+ * Code for setting up 1:1 mapping for PPC440x for KEXEC
+ *
+ * We cannot switch off the MMU on PPC44x.
+ * So we:
+ * 1) Invalidate all the mappings except the one we are running from.
+ * 2) Create a tmp mapping for our code in the other address space(TS) and
+ *    jump to it. Invalidate the entry we started in.
+ * 3) Create a 1:1 mapping for 0-2GiB in chunks of 256M in original TS.
+ * 4) Jump to the 1:1 mapping in original TS.
+ * 5) Invalidate the tmp mapping.
+ *
+ * - Based on the kexec support code for FSL BookE
+ * - Doesn't support 47x yet.
+ *
+ */
+	/* Save our parameters */
+	mr	r29, r3
+	mr	r30, r4
+	mr	r31, r5
+
+	/* Load our MSR_IS and TID to MMUCR for TLB search */
+	mfspr	r3,SPRN_PID
+	mfmsr	r4
+	andi.	r4,r4,MSR_IS@l
+	beq	wmmucr
+	oris	r3,r3,PPC44x_MMUCR_STS@h
+wmmucr:
+	mtspr	SPRN_MMUCR,r3
+	sync
+
+	/*
+	 * Invalidate all the TLB entries except the current entry
+	 * where we are running from
+	 */
+	bl	0f				/* Find our address */
+0:	mflr	r5				/* Make it accessible */
+	tlbsx	r23,0,r5			/* Find entry we are in */
+	li	r4,0				/* Start at TLB entry 0 */
+	li	r3,0				/* Set PAGEID inval value */
+1:	cmpw	r23,r4				/* Is this our entry? */
+	beq	skip				/* If so, skip the inval */
+	tlbwe	r3,r4,PPC44x_TLB_PAGEID		/* If not, inval the entry */
+skip:
+	addi	r4,r4,1				/* Increment */
+	cmpwi	r4,64				/* Are we done?	*/
+	bne	1b				/* If not, repeat */
+	isync
+
+	/* Create a temp mapping and jump to it */
+	andi.	r6, r23, 1		/* Find the index to use */
+	addi	r24, r6, 1		/* r24 will contain 1 or 2 */
+
+	mfmsr	r9			/* get the MSR */
+	rlwinm	r5, r9, 27, 31, 31	/* Extract the MSR[IS] */
+	xori	r7, r5, 1		/* Use the other address space */
+
+	/* Read the current mapping entries */
+	tlbre	r3, r23, PPC44x_TLB_PAGEID
+	tlbre	r4, r23, PPC44x_TLB_XLAT
+	tlbre	r5, r23, PPC44x_TLB_ATTRIB
+
+	/* Save our current XLAT entry */
+	mr	r25, r4
+
+	/* Extract the TLB PageSize */
+	li	r10, 1 			/* r10 will hold PageSize */
+	rlwinm	r11, r3, 0, 24, 27	/* bits 24-27 */
+
+	/* XXX: As of now we use 256M, 4K pages */
+	cmpwi	r11, PPC44x_TLB_256M
+	bne	tlb_4k
+	rotlwi	r10, r10, 28		/* r10 = 256M */
+	b	write_out
+tlb_4k:
+	cmpwi	r11, PPC44x_TLB_4K
+	bne	default
+	rotlwi	r10, r10, 12		/* r10 = 4K */
+	b	write_out
+default:
+	rotlwi	r10, r10, 10		/* r10 = 1K */
+
+write_out:
+	/*
+	 * Write out the tmp 1:1 mapping for this code in other address space
+	 * Fixup  EPN = RPN , TS=other address space
+	 */
+	insrwi	r3, r7, 1, 23		/* Bit 23 is TS for PAGEID field */
+
+	/* Write out the tmp mapping entries */
+	tlbwe	r3, r24, PPC44x_TLB_PAGEID
+	tlbwe	r4, r24, PPC44x_TLB_XLAT
+	tlbwe	r5, r24, PPC44x_TLB_ATTRIB
+
+	subi	r11, r10, 1		/* PageOffset Mask = PageSize - 1 */
+	not	r10, r11		/* Mask for PageNum */
+
+	/* Switch to other address space in MSR */
+	insrwi	r9, r7, 1, 26		/* Set MSR[IS] = r7 */
+
+	bl	1f
+1:	mflr	r8
+	addi	r8, r8, (2f-1b)		/* Find the target offset */
+
+	/* Jump to the tmp mapping */
+	mtspr	SPRN_SRR0, r8
+	mtspr	SPRN_SRR1, r9
+	rfi
+
+2:
+	/* Invalidate the entry we were executing from */
+	li	r3, 0
+	tlbwe	r3, r23, PPC44x_TLB_PAGEID
+
+	/* attribute fields. rwx for SUPERVISOR mode */
+	li	r5, 0
+	ori	r5, r5, (PPC44x_TLB_SW | PPC44x_TLB_SR | PPC44x_TLB_SX | PPC44x_TLB_G)
+
+	/* Create 1:1 mapping in 256M pages */
+	xori	r7, r7, 1			/* Revert back to Original TS */
+
+	li	r8, 0				/* PageNumber */
+	li	r6, 3				/* TLB Index, start at 3  */
+
+next_tlb:
+	rotlwi	r3, r8, 28			/* Create EPN (bits 0-3) */
+	mr	r4, r3				/* RPN = EPN  */
+	ori	r3, r3, (PPC44x_TLB_VALID | PPC44x_TLB_256M) /* SIZE = 256M, Valid */
+	insrwi	r3, r7, 1, 23			/* Set TS from r7 */
+
+	tlbwe	r3, r6, PPC44x_TLB_PAGEID	/* PageID field : EPN, V, SIZE */
+	tlbwe	r4, r6, PPC44x_TLB_XLAT		/* Address translation : RPN   */
+	tlbwe	r5, r6, PPC44x_TLB_ATTRIB	/* Attributes */
+
+	addi	r8, r8, 1			/* Increment PN */
+	addi	r6, r6, 1			/* Increment TLB Index */
+	cmpwi	r8, 8				/* Are we done ? */
+	bne	next_tlb
+	isync
+
+	/* Jump to the new mapping 1:1 */
+	li	r9,0
+	insrwi	r9, r7, 1, 26			/* Set MSR[IS] = r7 */
+
+	bl	1f
+1:	mflr	r8
+	and	r8, r8, r11			/* Get our offset within page */
+	addi	r8, r8, (2f-1b)
+
+	and	r5, r25, r10			/* Get our target PageNum */
+	or	r8, r8, r5			/* Target jump address */
+
+	mtspr	SPRN_SRR0, r8
+	mtspr	SPRN_SRR1, r9
+	rfi
+2:
+	/* Invalidate the tmp entry we used */
+	li	r3, 0
+	tlbwe	r3, r24, PPC44x_TLB_PAGEID
+	sync
+
+	/* Restore the parameters */
+	mr	r3, r29
+	mr	r4, r30
+	mr	r5, r31
+
+	li	r0, 0
 #else
 	li	r0, 0
 

^ permalink raw reply related
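
The final jump in the PPC44x kexec patch above builds its target from the
page offset of the running code and the real page number saved from the
XLAT word. A minimal C sketch of that address arithmetic follows. The
function and parameter names are hypothetical, the page size is assumed to
be a power of two, and the code is assumed not to cross a page boundary
between the reference label and the target.

```c
#include <stdint.h>

/*
 * pc        : address of the reference label (what mflr returns)
 * delta     : distance from that label to the jump target
 * xlat      : saved XLAT word; its upper bits hold the real page number
 * page_size : size of the page the code runs from (power of two)
 */
static uint32_t one_to_one_target(uint32_t pc, uint32_t delta,
                                  uint32_t xlat, uint32_t page_size)
{
    uint32_t off_mask = page_size - 1;  /* PageOffset mask = PageSize - 1 */
    uint32_t pn_mask = ~off_mask;       /* mask for the page number */
    uint32_t offset = (pc & off_mask) + delta;

    /* Any low-order bits of the XLAT word are masked off here. */
    return (xlat & pn_mask) | offset;
}
```

For code at virtual 0xC0001234 in a 256M page whose real page starts at
0x10000000, a target 0x10 bytes ahead resolves to 0x10001244.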

* Re: [v2 PATCH 1/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Ayman El-Khashab @ 2011-07-18 13:31 UTC (permalink / raw)
  To: Tony Breeds; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <20110718040115.GK20597@ozlabs.org>

On Mon, Jul 18, 2011 at 02:01:15PM +1000, Tony Breeds wrote:
> On Fri, Jul 15, 2011 at 11:40:27AM -0500, Ayman Elkhashab wrote:
> 
> > @@ -1582,8 +1628,8 @@ static int __init ppc4xx_setup_one_pciex_POM(struct ppc4xx_pciex_port	*port,
> >  		dcr_write(port->dcrs, DCRO_PEGPL_OMR2BAH, lah);
> >  		dcr_write(port->dcrs, DCRO_PEGPL_OMR2BAL, lal);
> >  		dcr_write(port->dcrs, DCRO_PEGPL_OMR2MSKH, 0x7fffffff);
> > -		/* Note that 3 here means enabled | single region */
> > -		dcr_write(port->dcrs, DCRO_PEGPL_OMR2MSKL, sa | 3);
> > +		dcr_write(port->dcrs, DCRO_PEGPL_OMR2MSKL,
> > +				sa | DCRO_PEGPL_OMRxMSKL_VAL);
> 
> Didn't you just change "sa | 3" to "sa | 1" ?
> 

Yes, but I think that is correct for it to be "1".  The data
sheets for these parts that I checked had bit 1 marked as
reserved.  Only OMR1MSKL and OMR3MSKL had extra definitions
such as the _IO and _UOT.  The parts I checked which were
the sheets for the EX and SX (which cover another 6 or 7
parts) all had it with just a single bit defined on that
register.
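
As a small illustration in plain C: the macro and function names below are
stand-ins for this example only, and just the "valid" bit value of 1 comes
from the discussion above. Writing "sa | 3" also sets the bit with value
0x2, which the data sheets mark as reserved for OMR2MSKL.

```c
#include <stdbool.h>
#include <stdint.h>

#define OMR_MSKL_VAL   0x1u  /* region valid/enabled */
#define OMR_MSKL_RSVD  0x2u  /* reserved on OMR2MSKL per the sheets checked */

/* Old value written to OMR2MSKL: size mask plus the literal 3. */
static uint32_t omr2mskl_old(uint32_t sa)
{
    return sa | 3u;
}

/* New value: size mask plus only the valid bit. */
static uint32_t omr2mskl_new(uint32_t sa)
{
    return sa | OMR_MSKL_VAL;
}

/* True if the value would write a 1 into the reserved bit. */
static bool sets_reserved_bit(uint32_t v)
{
    return (v & OMR_MSKL_RSVD) != 0;
}
```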

Ayman

^ permalink raw reply

* [PATCH] powerpc/44x: Add NOR flash device to Yosemite dts
From: Stefan Roese @ 2011-07-18 13:49 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: jwboyer

Signed-off-by: Stefan Roese <sr@denx.de>
---
 arch/powerpc/boot/dts/yosemite.dts |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/boot/dts/yosemite.dts b/arch/powerpc/boot/dts/yosemite.dts
index 6492324..30bb475 100644
--- a/arch/powerpc/boot/dts/yosemite.dts
+++ b/arch/powerpc/boot/dts/yosemite.dts
@@ -138,6 +138,42 @@
 				clock-frequency = <0>; /* Filled in by zImage */
 				interrupts = <0x5 0x1>;
 				interrupt-parent = <&UIC1>;
+
+				nor_flash@0,0 {
+					compatible = "amd,s29gl256n", "cfi-flash";
+					bank-width = <2>;
+					reg = <0x00000000 0x00000000 0x04000000>;
+					#address-cells = <1>;
+					#size-cells = <1>;
+					partition@0 {
+						label = "kernel";
+						reg = <0x00000000 0x001e0000>;
+					};
+					partition@1e0000 {
+						label = "dtb";
+						reg = <0x001e0000 0x00020000>;
+					};
+					partition@200000 {
+						label = "ramdisk";
+						reg = <0x00200000 0x01400000>;
+					};
+					partition@1600000 {
+						label = "jffs2";
+						reg = <0x01600000 0x00400000>;
+					};
+					partition@1a00000 {
+						label = "user";
+						reg = <0x01a00000 0x02540000>;
+					};
+					partition@3f40000 {
+						label = "env";
+						reg = <0x03f40000 0x00040000>;
+					};
+					partition@3f80000 {
+						label = "u-boot";
+						reg = <0x03f80000 0x00080000>;
+					};
+				};
 			};
 
 			UART0: serial@ef600300 {
-- 
1.7.6
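
The partition table in this patch can be sanity-checked with a few lines of
C. The sketch below copies the offsets and sizes from the dts node above
and checks that the partitions sit back to back and end exactly at the
0x04000000 bank size; the struct and function names are invented for the
example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct part {
    const char *label;
    uint32_t offset;
    uint32_t size;
};

/* Values copied from the nor_flash@0,0 node in the patch. */
static const struct part yosemite_nor[] = {
    { "kernel",  0x00000000u, 0x001e0000u },
    { "dtb",     0x001e0000u, 0x00020000u },
    { "ramdisk", 0x00200000u, 0x01400000u },
    { "jffs2",   0x01600000u, 0x00400000u },
    { "user",    0x01a00000u, 0x02540000u },
    { "env",     0x03f40000u, 0x00040000u },
    { "u-boot",  0x03f80000u, 0x00080000u },
};

/* True if the partitions are contiguous from 0 and end at 'total'. */
static bool layout_is_contiguous(const struct part *p, size_t n,
                                 uint32_t total)
{
    uint32_t next = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (p[i].offset != next)
            return false;
        next += p[i].size;
    }
    return next == total;
}
```

Calling layout_is_contiguous(yosemite_nor, 7, 0x04000000) returns true for
the table as posted, so there are no gaps or overlaps in the 64MiB bank.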

^ permalink raw reply related

* Busy waits take much longer in driver code vs. application code
From: Matias Garcia @ 2011-07-18 15:35 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <CAKTLLVSZ1ju5e-_Hqi9i4bYMsd1nxS3gTP9Q8ecoe-zxdZDOAA@mail.gmail.com>

Not sure if this is arch-dependent. If not, I'll register on the kernel
mailing list.

I'm working with some legacy driver code (an IOCTL) that busy waits
after requesting an operation from an FPGA. It reads a register in a
loop until the FPGA has finished the operation. The operation is
supposed to take about 1ms, but the driver code is returning after 11ms
or more. If I instead run the same busy loop inside the application that
calls the IOCTL, it takes <1ms. Instrumenting the driver loop shows that
the loop is only executed a couple of times about 5ms apart (though
printk may skew this slightly). The application loop is run MUCH more
often.

Platform is P2020 with kernel 2.6.37 running two applications. The one
in question is run with FIFO scheduler at priority 10, the other one is
run with FIFO scheduler at priority 50. All other processes are
regular/default priority.

The only difference in interface between calling the IOCTL and
reading/writing registers through the device file is that read/writes
use copy to/from user while the IOCTL calls io[read|write]32be. No data
needs to go to user-space for this operation.

A couple of questions:

1. Why is the application code running WAY faster than the IOCTL call? I
thought driver code was executed in process context at the same priority
as the calling process.

2. Can you suggest a better way to implement the busy wait in the driver
code if I can get the priorities right? Should I even leave it in there?
I'm doing a lot of reading, but could really use an expert opinion.

Any light shed will be toasted heartily.

Thanks,
Matias Garcia

^ permalink raw reply

* Re: [v3 PATCH 1/1] booke/kprobe: make program exception to use one dedicated exception stack
From: Scott Wood @ 2011-07-18 15:56 UTC (permalink / raw)
  To: Chen, Tiejun; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <82C960D7DF4A1F47B94FC1C67A29BEE384D577@ALA-MBA.corp.ad.wrs.com>

On Sat, 16 Jul 2011 03:25:47 +0000
"Chen, Tiejun" <Tiejun.Chen@windriver.com> wrote:

> > -----Original Message-----
> > From: Scott Wood [mailto:scottwood@freescale.com] 
> > Sent: Saturday, July 16, 2011 2:43 AM
> > To: Chen, Tiejun
> > Cc: Kumar Gala; linuxppc-dev@ozlabs.org
> > Subject: Re: [v3 PATCH 1/1] booke/kprobe: make program 
> > exception to use one dedicated exception stack
> > 
> > On Fri, 15 Jul 2011 13:28:15 +0800
> > tiejun.chen <tiejun.chen@windriver.com> wrote:
> > 
> > > Kumar Gala wrote:
> > > > I'm still very confused why we need a unique stack frame 
> > for kprobe/program exceptions on book-e devices.
> > > 
> > > It's a bug at least for Book-E.
> > 
> > But why only booke?  There's nothing booke-specific about the 
> 
> I don't mean this reproduces only on booke; I said 'at least' deliberately, to note that we definitely see this problem on booke.
> 
> > stwu instruction.
> 
> Please note the root cause of this bug is not how the stwu instruction is emulated. It comes from the overlap between an exception frame and the kprobed function's stack frame on booke. Would you like to see the example I showed?

As I understand it, the problem comes from the fact that stwu combines the
creation of a stack frame with storing into that stack frame.  If they were
separate instructions you'd have a new exception frame at a lower address
by the time you actually store to the non-exception frame.
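
A simplified model of this in plain C, as a sketch only: it assumes the
exception frame is carved out directly below whatever r1 held when the
trap was taken, and the frame size and helper names are made up for the
example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of "stwu r1, d(r1)": store the old r1 at r1 + d, then set r1 to
 * that address. Returns the address that was stored to.
 */
static uint32_t stwu_r1(uint32_t *r1, int32_t d, uint32_t *stored)
{
    uint32_t ea = *r1 + (uint32_t)d;

    *stored = *r1;  /* back-chain word written at ea */
    *r1 = ea;       /* stack pointer update */
    return ea;
}

/*
 * Does a store at ea land inside an exception frame occupying
 * [r1_at_trap - frame_size, r1_at_trap)?
 */
static bool store_hits_exception_frame(uint32_t ea, uint32_t r1_at_trap,
                                       uint32_t frame_size)
{
    return ea >= r1_at_trap - frame_size && ea < r1_at_trap;
}
```

With r1 = 0x1000 at the trap and d = -16, the store goes to 0xff0, which is
inside a frame placed just below 0x1000. Had r1 already been lowered to
0xff0 by a separate instruction, the frame would sit below 0xff0 and the
store would miss it.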

-Scott

^ permalink raw reply

* Re: [PATCH] net: filter: BPF 'JIT' compiler for PPC64
From: David Miller @ 2011-07-18 19:42 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, linuxppc-dev, matt
In-Reply-To: <1310978375.5756.7.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 18 Jul 2011 10:39:35 +0200

> So on PPC, SEEN_XREG only needs to be set if X is read, not if it is written.
> 
> So you don't have to set the SEEN_XREG bit in this part:

Matt, do you want to integrate changes based upon Eric's feedback
here or do you want me to apply your patch as-is for now?

Thanks.

^ permalink raw reply

* Re: [PATCH] net: filter: BPF 'JIT' compiler for PPC64
From: Eric Dumazet @ 2011-07-18 20:05 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linuxppc-dev, matt
In-Reply-To: <20110718.124248.1462465498024218250.davem@davemloft.net>

On Monday 18 July 2011 at 12:42 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 18 Jul 2011 10:39:35 +0200
> 
> > So on PPC, SEEN_XREG only needs to be set if X is read, not if it is written.
> > 
> > So you don't have to set the SEEN_XREG bit in this part:
> 
> Matt, do you want to integrate changes based upon Eric's feedback
> here or do you want me to apply your patch as-is for now?
> 

This was a really minor point, so Matt, feel free to ask for immediate
inclusion ;)

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra @ 2011-07-18 21:35 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <20110715104547.29c3c509@kryten>

[-- Attachment #1: Type: text/plain, Size: 442 bytes --]

Anton, could you test the below two patches on that machine?

It should make things boot again. I don't have a machine nearly big
enough to trigger any of this, so I tested the new code paths by setting
FORCE_SD_OVERLAP in /debug/sched_features. Any review of the error paths
would be much appreciated.

Also, could you send me the node_distance table for that machine? I'm
curious what the interconnects look like on that thing.

[-- Attachment #2: sched-domain-foo-1.patch --]
[-- Type: text/x-patch, Size: 9787 bytes --]

Subject: sched: Break out cpu_power from the sched_group structure
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Jul 14 13:00:06 CEST 2011

In order to prepare for non-unique sched_groups per domain, we need to
carry the cpu_power elsewhere, so put a level of indirection in.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
---
 include/linux/sched.h |   14 +++++++++-----
 kernel/sched.c        |   32 ++++++++++++++++++++++++++------
 kernel/sched_fair.c   |   46 +++++++++++++++++++++++-----------------------
 3 files changed, 58 insertions(+), 34 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6550,7 +6550,7 @@ static int sched_domain_debug_one(struct
 			break;
 		}
 
-		if (!group->cpu_power) {
+		if (!group->sgp->power) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: domain->cpu_power not "
 					"set\n");
@@ -6574,9 +6574,9 @@ static int sched_domain_debug_one(struct
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->cpu_power != SCHED_POWER_SCALE) {
+		if (group->sgp->power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
-				group->cpu_power);
+				group->sgp->power);
 		}
 
 		group = group->next;
@@ -6770,8 +6770,10 @@ static struct root_domain *alloc_rootdom
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref))
+	if (atomic_dec_and_test(&sd->groups->ref)) {
+		kfree(sd->groups->sgp);
 		kfree(sd->groups);
+	}
 	kfree(sd);
 }
 
@@ -6938,6 +6940,7 @@ int sched_smt_power_savings = 0, sched_m
 struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
 };
 
 struct s_data {
@@ -6974,8 +6977,10 @@ static int get_group(int cpu, struct sd_
 	if (child)
 		cpu = cpumask_first(sched_domain_span(child));
 
-	if (sg)
+	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
+		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+	}
 
 	return cpu;
 }
@@ -7013,7 +7018,7 @@ build_sched_groups(struct sched_domain *
 			continue;
 
 		cpumask_clear(sched_group_cpus(sg));
-		sg->cpu_power = 0;
+		sg->sgp->power = 0;
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)
@@ -7178,6 +7183,7 @@ static void claim_allocations(int cpu, s
 	if (cpu == cpumask_first(sched_group_cpus(sg))) {
 		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 	}
 }
 
@@ -7227,9 +7233,14 @@ static int __sdt_alloc(const struct cpum
 		if (!sdd->sg)
 			return -ENOMEM;
 
+		sdd->sgp = alloc_percpu(struct sched_group_power *);
+		if (!sdd->sgp)
+			return -ENOMEM;
+
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
 			struct sched_group *sg;
+			struct sched_group_power *sgp;
 
 		       	sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -7244,6 +7255,13 @@ static int __sdt_alloc(const struct cpum
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
+
+			sgp = kzalloc_node(sizeof(struct sched_group_power),
+					GFP_KERNEL, cpu_to_node(j));
+			if (!sgp)
+				return -ENOMEM;
+
+			*per_cpu_ptr(sdd->sgp, j) = sgp;
 		}
 	}
 
@@ -7261,9 +7279,11 @@ static void __sdt_free(const struct cpum
 		for_each_cpu(j, cpu_map) {
 			kfree(*per_cpu_ptr(sdd->sd, j));
 			kfree(*per_cpu_ptr(sdd->sg, j));
+			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
 		free_percpu(sdd->sd);
 		free_percpu(sdd->sg);
+		free_percpu(sdd->sgp);
 	}
 }
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1583,7 +1583,7 @@ find_idlest_group(struct sched_domain *s
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_POWER_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -2629,7 +2629,7 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_POWER_SHIFT;
 	}
 
-	sdg->cpu_power_orig = power;
+	sdg->sgp->power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
 		power *= arch_scale_freq_power(sd, cpu);
@@ -2645,7 +2645,7 @@ static void update_cpu_power(struct sche
 		power = 1;
 
 	cpu_rq(cpu)->cpu_power = power;
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 static void update_group_power(struct sched_domain *sd, int cpu)
@@ -2663,11 +2663,11 @@ static void update_group_power(struct sc
 
 	group = child->groups;
 	do {
-		power += group->cpu_power;
+		power += group->sgp->power;
 		group = group->next;
 	} while (group != child->groups);
 
-	sdg->cpu_power = power;
+	sdg->sgp->power = power;
 }
 
 /*
@@ -2689,7 +2689,7 @@ fix_small_capacity(struct sched_domain *
 	/*
 	 * If ~90% of the cpu_power is still there, we're good.
 	 */
-	if (group->cpu_power * 32 > group->cpu_power_orig * 29)
+	if (group->sgp->power * 32 > group->sgp->power_orig * 29)
 		return 1;
 
 	return 0;
@@ -2769,7 +2769,7 @@ static inline void update_sg_lb_stats(st
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2786,7 +2786,7 @@ static inline void update_sg_lb_stats(st
 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power,
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
 						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
@@ -2875,7 +2875,7 @@ static inline void update_sd_lb_stats(st
 			return;
 
 		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->cpu_power;
+		sds->total_pwr += sg->sgp->power;
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -2960,7 +2960,7 @@ static int check_asym_packing(struct sch
 	if (this_cpu > busiest_cpu)
 		return 0;
 
-	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->sgp->power,
 				       SCHED_POWER_SCALE);
 	return 1;
 }
@@ -2991,7 +2991,7 @@ static inline void fix_small_imbalance(s
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
 					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->cpu_power;
+	scaled_busy_load_per_task /= sds->busiest->sgp->power;
 
 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
 			(scaled_busy_load_per_task * imbn)) {
@@ -3005,28 +3005,28 @@ static inline void fix_small_imbalance(s
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->cpu_power *
+	pwr_now += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load);
-	pwr_now += sds->this->cpu_power *
+	pwr_now += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
 	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->cpu_power;
+		sds->busiest->sgp->power;
 	if (sds->max_load > tmp)
-		pwr_move += sds->busiest->cpu_power *
+		pwr_move += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->cpu_power <
+	if (sds->max_load * sds->busiest->sgp->power <
 		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->cpu_power) /
-			sds->this->cpu_power;
+		tmp = (sds->max_load * sds->busiest->sgp->power) /
+			sds->this->sgp->power;
 	else
 		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->cpu_power;
-	pwr_move += sds->this->cpu_power *
+			sds->this->sgp->power;
+	pwr_move += sds->this->sgp->power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
@@ -3072,7 +3072,7 @@ static inline void calculate_imbalance(s
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
-		load_above_capacity /= sds->busiest->cpu_power;
+		load_above_capacity /= sds->busiest->sgp->power;
 	}
 
 	/*
@@ -3088,8 +3088,8 @@ static inline void calculate_imbalance(s
 	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min(max_pull * sds->busiest->cpu_power,
-		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
+	*imbalance = min(max_pull * sds->busiest->sgp->power,
+		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
 			/ SCHED_POWER_SCALE;
 
 	/*
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -893,16 +893,20 @@ static inline int sd_power_saving_flags(
 	return 0;
 }
 
-struct sched_group {
-	struct sched_group *next;	/* Must be a circular list */
-	atomic_t ref;
-
+struct sched_group_power {
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int cpu_power, cpu_power_orig;
+	unsigned int power, power_orig;
+};
+
+struct sched_group {
+	struct sched_group *next;	/* Must be a circular list */
+	atomic_t ref;
+
 	unsigned int group_weight;
+	struct sched_group_power *sgp;
 
 	/*
 	 * The CPUs this group covers.
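
The effect of the indirection can be shown with a tiny C sketch. The types
below are cut-down stand-ins for sched_group and sched_group_power, not the
kernel definitions, and the accessor names are hypothetical. The point is
that two group objects can point at one power structure, so an update made
through one is visible through the other.

```c
#include <stddef.h>

/* Cut-down stand-in for struct sched_group_power. */
struct group_power {
    unsigned int power, power_orig;
};

/* Cut-down stand-in for struct sched_group. */
struct group {
    struct group *next;
    struct group_power *sgp;  /* power now lives behind a pointer */
};

/* Update the power value through the group's pointer. */
static void set_group_power(struct group *g, unsigned int power)
{
    g->sgp->power = power;
}

/* Read the power value through the group's pointer. */
static unsigned int get_group_power(const struct group *g)
{
    return g->sgp->power;
}
```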

[-- Attachment #3: sched-domain-foo-2.patch --]
[-- Type: text/x-patch, Size: 8956 bytes --]

Subject: sched: Allow for overlapping sched_domain spans
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Jul 15 10:35:52 CEST 2011

Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each-other.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-yr71izj2souh2dbifdh6j68y@git.kernel.org
---
 include/linux/sched.h   |    2 
 kernel/sched.c          |  157 +++++++++++++++++++++++++++++++++++++++---------
 kernel/sched_features.h |    2 
 3 files changed, 132 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -844,6 +844,7 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
 enum powersavings_balance_level {
 	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
@@ -894,6 +895,7 @@ static inline int sd_power_saving_flags(
 }
 
 struct sched_group_power {
+	atomic_t ref;
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6767,10 +6767,36 @@ static struct root_domain *alloc_rootdom
 	return rd;
 }
 
+static void free_sched_groups(struct sched_group *sg, int free_sgp)
+{
+	struct sched_group *tmp, *first;
+
+	if (!sg)
+		return;
+
+	first = sg;
+	do {
+		tmp = sg->next;
+
+		if (free_sgp && atomic_dec_and_test(&sg->sgp->ref))
+			kfree(sg->sgp);
+
+		kfree(sg);
+		sg = tmp;
+	} while (sg != first);
+}
+
 static void free_sched_domain(struct rcu_head *rcu)
 {
 	struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-	if (atomic_dec_and_test(&sd->groups->ref)) {
+
+	/*
+	 * If it's an overlapping domain it has private groups, iterate and
+	 * nuke them all.
+	 */
+	if (sd->flags & SD_OVERLAP) {
+		free_sched_groups(sd->groups, 1);
+	} else if (atomic_dec_and_test(&sd->groups->ref)) {
 		kfree(sd->groups->sgp);
 		kfree(sd->groups);
 	}
@@ -6960,15 +6986,73 @@ struct sched_domain_topology_level;
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
+#define SDTL_OVERLAP	0x01
+
 struct sched_domain_topology_level {
 	sched_domain_init_f init;
 	sched_domain_mask_f mask;
+	int		    flags;
 	struct sd_data      data;
 };
 
-/*
- * Assumes the sched_domain tree is fully constructed
- */
+static int
+build_overlap_sched_groups(struct sched_domain *sd, int cpu)
+{
+	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	const struct cpumask *span = sched_domain_span(sd);
+	struct cpumask *covered = sched_domains_tmpmask;
+	struct sd_data *sdd = sd->private;
+	struct sched_domain *child;
+	int i;
+
+	cpumask_clear(covered);
+
+	for_each_cpu(i, span) {
+		struct cpumask *sg_span;
+
+		if (cpumask_test_cpu(i, covered))
+			continue;
+
+		sg = kzalloc_node(sizeof(struct sched_group), GFP_KERNEL,
+				cpu_to_node(i));
+
+		if (!sg)
+			goto fail;
+
+		sg_span = sched_group_cpus(sg);
+
+		child = *per_cpu_ptr(sdd->sd, i);
+		if (child->child) {
+			child = child->child;
+			*sg_span = *sched_domain_span(child);
+		} else
+			cpumask_set_cpu(i, sg_span);
+
+		cpumask_or(covered, covered, sg_span);
+
+		sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span));
+		atomic_inc(&sg->sgp->ref);
+
+		if (cpumask_test_cpu(cpu, sg_span))
+			groups = sg;
+
+		if (!first)
+			first = sg;
+		if (last)
+			last->next = sg;
+		last = sg;
+		last->next = first;
+	}
+	sd->groups = groups;
+
+	return 0;
+
+fail:
+	free_sched_groups(first, 0);
+
+	return -ENOMEM;
+}
+
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@@ -6980,23 +7064,21 @@ static int get_group(int cpu, struct sd_
 	if (sg) {
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
 		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
+		atomic_set(&(*sg)->sgp->ref, 1); /* for claim_allocations */
 	}
 
 	return cpu;
 }
 
 /*
- * build_sched_groups takes the cpumask we wish to span, and a pointer
- * to a function which identifies what group(along with sched group) a CPU
- * belongs to. The return value of group_fn must be a >= 0 and < nr_cpu_ids
- * (due to the fact that we keep track of groups covered with a struct cpumask).
- *
  * build_sched_groups will build a circular linked list of the groups
  * covered by the given span, and will set each group's ->cpumask correctly,
  * and ->cpu_power to 0.
+ *
+ * Assumes the sched_domain tree is fully constructed
  */
-static void
-build_sched_groups(struct sched_domain *sd)
+static int
+build_sched_groups(struct sched_domain *sd, int cpu)
 {
 	struct sched_group *first = NULL, *last = NULL;
 	struct sd_data *sdd = sd->private;
@@ -7004,6 +7086,12 @@ build_sched_groups(struct sched_domain *
 	struct cpumask *covered;
 	int i;
 
+	get_group(cpu, sdd, &sd->groups);
+	atomic_inc(&sd->groups->ref);
+
+	if (cpu != cpumask_first(sched_domain_span(sd)))
+		return 0;
+
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
 
@@ -7035,6 +7123,8 @@ build_sched_groups(struct sched_domain *
 		last = sg;
 	}
 	last->next = first;
+
+	return 0;
 }
 
 /*
@@ -7049,12 +7139,17 @@ build_sched_groups(struct sched_domain *
  */
 static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 {
-	WARN_ON(!sd || !sd->groups);
+	struct sched_group *sg = sd->groups;
 
-	if (cpu != group_first_cpu(sd->groups))
-		return;
+	WARN_ON(!sd || !sg);
 
-	sd->groups->group_weight = cpumask_weight(sched_group_cpus(sd->groups));
+	do {
+		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	if (cpu != group_first_cpu(sg))
+		return;
 
 	update_group_power(sd, cpu);
 }
@@ -7175,16 +7270,15 @@ static enum s_alloc __visit_domain_alloc
 static void claim_allocations(int cpu, struct sched_domain *sd)
 {
 	struct sd_data *sdd = sd->private;
-	struct sched_group *sg = sd->groups;
 
 	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 	*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-	if (cpu == cpumask_first(sched_group_cpus(sg))) {
-		WARN_ON_ONCE(*per_cpu_ptr(sdd->sg, cpu) != sg);
+	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+
+	if (atomic_read(&(*per_cpu_ptr(sdd->sgp, cpu))->ref))
 		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
-	}
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -7209,7 +7303,7 @@ static struct sched_domain_topology_leve
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+	{ sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },
@@ -7277,7 +7371,9 @@ static void __sdt_free(const struct cpum
 		struct sd_data *sdd = &tl->data;
 
 		for_each_cpu(j, cpu_map) {
-			kfree(*per_cpu_ptr(sdd->sd, j));
+			struct sched_domain *sd = *per_cpu_ptr(sdd->sd, j);
+			if (sd && (sd->flags & SD_OVERLAP))
+				free_sched_groups(sd->groups, 0);
 			kfree(*per_cpu_ptr(sdd->sg, j));
 			kfree(*per_cpu_ptr(sdd->sgp, j));
 		}
@@ -7329,8 +7425,11 @@ static int build_sched_domains(const str
 		struct sched_domain_topology_level *tl;
 
 		sd = NULL;
-		for (tl = sched_domain_topology; tl->init; tl++)
+		for (tl = sched_domain_topology; tl->init; tl++) {
 			sd = build_sched_domain(tl, &d, cpu_map, attr, sd, i);
+			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
+				sd->flags |= SD_OVERLAP;
+		}
 
 		while (sd->child)
 			sd = sd->child;
@@ -7342,13 +7441,13 @@ static int build_sched_domains(const str
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
 			sd->span_weight = cpumask_weight(sched_domain_span(sd));
-			get_group(i, sd->private, &sd->groups);
-			atomic_inc(&sd->groups->ref);
-
-			if (i != cpumask_first(sched_domain_span(sd)))
-				continue;
-
-			build_sched_groups(sd);
+			if (sd->flags & SD_OVERLAP) {
+				if (build_overlap_sched_groups(sd, i))
+					goto error;
+			} else {
+				if (build_sched_groups(sd, i))
+					goto error;
+			}
 		}
 	}
 
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -70,3 +70,5 @@ SCHED_FEAT(NONIRQ_POWER, 1)
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, 1)
+
+SCHED_FEAT(FORCE_SD_OVERLAP, 0)

* Re: [PATCH] net: filter: BPF 'JIT' compiler for PPC64
From: Matt Evans @ 2011-07-19  1:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linuxppc-dev, eric.dumazet
In-Reply-To: <20110718.124248.1462465498024218250.davem@davemloft.net>

On 19/07/11 05:42, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 18 Jul 2011 10:39:35 +0200
> 
>> So in PPC SEEN_XREG only is to be set of X is read, not if written.
>>
>> So you dont have to set SEEN_XREG bit in this part :
> 
> Matt, do you want to integrate changes based upon Eric's feedback
> here or do you want me to apply your patch as-is for now?

Thanks, but no worries; I will send a v2 in a sec.  Eric's comments are spot-on,
and there are a couple of other areas that the brainfart should really be
polished out of, too.  :-)


Cheers,


Matt
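
A minimal sketch of the point under discussion, assuming a reduced, hypothetical opcode set (`enum op`) and helper names (`scan_seen()`, `prologue_must_zero_x()`) that do not exist in the patch. Only the SEEN_XREG value is taken from bpf_jit.h. Because X lives in a scratch register on PPC, the flag only has to be raised by instructions that read X:

```c
#include <assert.h>
#include <stddef.h>

#define SEEN_XREG 0x20000	/* same value as in the patch's bpf_jit.h */

/* Hypothetical, reduced opcode set used only for this illustration. */
enum op { OP_LDX_IMM, OP_ALU_ADD_X, OP_TAX, OP_TXA, OP_RET_K };

/* Collect 'seen' flags for a program: X counts as seen only when read. */
static unsigned int scan_seen(const enum op *prog, size_t len)
{
	unsigned int seen = 0;
	size_t i;

	assert(prog != NULL || len == 0);

	for (i = 0; i < len; i++) {
		switch (prog[i]) {
		case OP_ALU_ADD_X:	/* A += X: reads X */
		case OP_TXA:		/* A = X:  reads X */
			seen |= SEEN_XREG;
			break;
		case OP_LDX_IMM:	/* X = K:  writes X only */
		case OP_TAX:		/* X = A:  writes X only */
		case OP_RET_K:		/* does not touch X */
			break;
		}
	}
	return seen;
}

/* The prologue needs to zero X only if some instruction reads it. */
static int prologue_must_zero_x(unsigned int seen)
{
	return (seen & SEEN_XREG) != 0;
}
```

A filter that only ever writes X therefore gets no extra prologue instruction, which is the saving Eric's comment is after.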

* Re: [v2 PATCH 1/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Tony Breeds @ 2011-07-19  1:23 UTC (permalink / raw)
  To: Ayman El-Khashab; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <20110718133101.GB26701@crust.elkhashab.com>

On Mon, Jul 18, 2011 at 08:31:01AM -0500, Ayman El-Khashab wrote:

> Yes, but I think that is correct for it to be "1".  The data
> sheets for these parts that I checked had bit 1 marked as
> reserved.  Only OMR1MSKL and OMR3MSKL had extra definitions
> such as the _IO and _UOT.  The parts I checked which were
> the sheets for the EX and SX (which cover another 6 or 7
> parts) all had it with just a single bit defined on that
> register.

Ahh okay.  I kind of think that this may need to be a separate change.  At the
very least it needs to be explicitly mentioned in the change log.

Yours Tony

* [PATCH v2] net: filter: BPF 'JIT' compiler for PPC64
From: Matt Evans @ 2011-07-19  2:13 UTC (permalink / raw)
  To: linuxppc-dev, netdev
In-Reply-To: <4E23E5C3.1070209@ozlabs.org>

An implementation of a code generator for BPF programs to speed up packet
filtering on PPC64, inspired by Eric Dumazet's x86-64 version.

Filter code is generated as an ABI-compliant function in module_alloc()'d mem
with stackframe & prologue/epilogue generated if required (simple filters don't
need anything more than an li/blr).  The filter's local variables, M[], live in
registers.  Supports all BPF opcodes, although "complicated" loads from negative
packet offsets (e.g. SKF_LL_OFF) are not yet supported.

There are a couple of further optimisations left for future work; many-pass
assembly with branch-reach reduction and a register allocator to push M[]
variables into volatile registers would improve the code quality further.

This currently supports big-endian 64-bit PowerPC only (but is fairly simple
to port to PPC32 or LE!).

Enabled in the same way as x86-64:

	echo 1 > /proc/sys/net/core/bpf_jit_enable

Or, enabled with extra debug output:

	echo 2 > /proc/sys/net/core/bpf_jit_enable

Signed-off-by: Matt Evans <matt@ozlabs.org>
---

V2: Removed some cut/paste woe in setting SEEN_X even on writes.
    Merci for le review, Eric!
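
For readers checking the 16-bit immediate helpers in bpf_jit.h, a small worked sketch of why the "high adjusted" half is needed. The function names (`imm_l()`, `imm_ha()`, `rebuild()`) are hypothetical, and `rebuild()` is only a 32-bit model of an addis/addi pair, not the generated code: addi sign-extends its immediate, so when bit 15 of the low half is set, the high half must be bumped by one to cancel the borrow.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits, as in the patch's IMM_L(). */
static uint16_t imm_l(uint32_t v)
{
	return (uint16_t)(v & 0xffff);
}

/* High 16 bits plus one if the low half will be sign-extended negative. */
static uint16_t imm_ha(uint32_t v)
{
	return (uint16_t)((v >> 16) + ((v & 0x8000) >> 15));
}

/* 32-bit model of "addis d, 0, ha" followed by "addi d, d, lo". */
static uint32_t rebuild(uint16_t ha, uint16_t lo)
{
	uint32_t high = (uint32_t)ha << 16;
	uint32_t low = lo;

	assert(ha == (high >> 16));

	if (lo & 0x8000)
		low |= 0xffff0000u;	/* addi sign-extends its immediate */

	return high + low;		/* wraps modulo 2^32 */
}
```

For example, 0x12348000 splits into a high half of 0x1235 and a low half of 0x8000. The sign-extended low half subtracts 0x8000 from 0x12350000 and gives back the original constant.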

 arch/powerpc/Kconfig                  |    1 +
 arch/powerpc/Makefile                 |    3 +-
 arch/powerpc/include/asm/ppc-opcode.h |   40 ++
 arch/powerpc/net/Makefile             |    4 +
 arch/powerpc/net/bpf_jit.S            |  138 +++++++
 arch/powerpc/net/bpf_jit.h            |  227 +++++++++++
 arch/powerpc/net/bpf_jit_comp.c       |  690 +++++++++++++++++++++++++++++++++
 7 files changed, 1102 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2729c66..39860fc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -134,6 +134,7 @@ config PPC
 	select GENERIC_IRQ_SHOW_LEVEL
 	select HAVE_RCU_TABLE_FREE if SMP
 	select HAVE_SYSCALL_TRACEPOINTS
+	select HAVE_BPF_JIT if PPC64
 
 config EARLY_PRINTK
 	bool
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index b7212b6..b94740f 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -154,7 +154,8 @@ core-y				+= arch/powerpc/kernel/ \
 				   arch/powerpc/lib/ \
 				   arch/powerpc/sysdev/ \
 				   arch/powerpc/platforms/ \
-				   arch/powerpc/math-emu/
+				   arch/powerpc/math-emu/ \
+				   arch/powerpc/net/
 core-$(CONFIG_XMON)		+= arch/powerpc/xmon/
 core-$(CONFIG_KVM) 		+= arch/powerpc/kvm/
 
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index e472659..e980faa 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -71,6 +71,42 @@
 #define PPC_INST_ERATSX			0x7c000126
 #define PPC_INST_ERATSX_DOT		0x7c000127
 
+/* Misc instructions for BPF compiler */
+#define PPC_INST_LD			0xe8000000
+#define PPC_INST_LHZ			0xa0000000
+#define PPC_INST_LWZ			0x80000000
+#define PPC_INST_STD			0xf8000000
+#define PPC_INST_STDU			0xf8000001
+#define PPC_INST_MFLR			0x7c0802a6
+#define PPC_INST_MTLR			0x7c0803a6
+#define PPC_INST_CMPWI			0x2c000000
+#define PPC_INST_CMPDI			0x2c200000
+#define PPC_INST_CMPLW			0x7c000040
+#define PPC_INST_CMPLWI			0x28000000
+#define PPC_INST_ADDI			0x38000000
+#define PPC_INST_ADDIS			0x3c000000
+#define PPC_INST_ADD			0x7c000214
+#define PPC_INST_SUB			0x7c000050
+#define PPC_INST_BLR			0x4e800020
+#define PPC_INST_BLRL			0x4e800021
+#define PPC_INST_MULLW			0x7c0001d6
+#define PPC_INST_MULHWU			0x7c000016
+#define PPC_INST_MULLI			0x1c000000
+#define PPC_INST_DIVWU			0x7c0003d6
+#define PPC_INST_RLWINM			0x54000000
+#define PPC_INST_RLDICR			0x78000004
+#define PPC_INST_SLW			0x7c000030
+#define PPC_INST_SRW			0x7c000430
+#define PPC_INST_AND			0x7c000038
+#define PPC_INST_ANDDOT			0x7c000039
+#define PPC_INST_OR			0x7c000378
+#define PPC_INST_ANDI			0x70000000
+#define PPC_INST_ORI			0x60000000
+#define PPC_INST_ORIS			0x64000000
+#define PPC_INST_NEG			0x7c0000d0
+#define PPC_INST_BRANCH			0x48000000
+#define PPC_INST_BRANCH_COND		0x40800000
+
 /* macros to insert fields into opcodes */
 #define __PPC_RA(a)	(((a) & 0x1f) << 16)
 #define __PPC_RB(b)	(((b) & 0x1f) << 11)
@@ -83,6 +119,10 @@
 #define __PPC_T_TLB(t)	(((t) & 0x3) << 21)
 #define __PPC_WC(w)	(((w) & 0x3) << 21)
 #define __PPC_WS(w)	(((w) & 0x1f) << 11)
+#define __PPC_SH(s)	__PPC_WS(s)
+#define __PPC_MB(s)	(((s) & 0x1f) << 6)
+#define __PPC_ME(s)	(((s) & 0x1f) << 1)
+#define __PPC_BI(s)	(((s) & 0x1f) << 16)
 
 /*
  * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
diff --git a/arch/powerpc/net/Makefile b/arch/powerpc/net/Makefile
new file mode 100644
index 0000000..90568c3
--- /dev/null
+++ b/arch/powerpc/net/Makefile
@@ -0,0 +1,4 @@
+#
+# Arch-specific network modules
+#
+obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
diff --git a/arch/powerpc/net/bpf_jit.S b/arch/powerpc/net/bpf_jit.S
new file mode 100644
index 0000000..ff4506e
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit.S
@@ -0,0 +1,138 @@
+/* bpf_jit.S: Packet/header access helper functions
+ * for PPC64 BPF compiler.
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <asm/ppc_asm.h>
+#include "bpf_jit.h"
+
+/*
+ * All of these routines are called directly from generated code,
+ * whose register usage is:
+ *
+ * r3		skb
+ * r4,r5	A,X
+ * r6		*** address parameter to helper ***
+ * r7-r10	scratch
+ * r14		skb->data
+ * r15		skb headlen
+ * r16-31	M[]
+ */
+
+/*
+ * To consider: These helpers are so small it could be better to just
+ * generate them inline.  Inline code can do the simple headlen check
+ * then branch directly to slow_path_XXX if required.  (In fact, could
+ * load a spare GPR with the address of slow_path_generic and pass size
+ * as an argument, making the call site a mtlr, li and blrl.)
+ *
+ * Technically, the "is addr < 0" check is unnecessary & slowing down
+ * the ABS path, as it's statically checked on generation.
+ */
+	.globl	sk_load_word
+sk_load_word:
+	cmpdi	r_addr, 0
+	blt	bpf_error
+	/* Are we accessing past headlen? */
+	subi	r_scratch1, r_HL, 4
+	cmpd	r_scratch1, r_addr
+	blt	bpf_slow_path_word
+	/* Nope, just hitting the header.  cr0 here is eq or gt! */
+	lwzx	r_A, r_D, r_addr
+	/* When big endian we don't need to byteswap. */
+	blr	/* Return success, cr0 != LT */
+
+	.globl	sk_load_half
+sk_load_half:
+	cmpdi	r_addr, 0
+	blt	bpf_error
+	subi	r_scratch1, r_HL, 2
+	cmpd	r_scratch1, r_addr
+	blt	bpf_slow_path_half
+	lhzx	r_A, r_D, r_addr
+	blr
+
+	.globl	sk_load_byte
+sk_load_byte:
+	cmpdi	r_addr, 0
+	blt	bpf_error
+	cmpd	r_HL, r_addr
+	ble	bpf_slow_path_byte
+	lbzx	r_A, r_D, r_addr
+	blr
+
+/*
+ * BPF_S_LDX_B_MSH: ldxb  4*([offset]&0xf)
+ * r_addr is the offset value, already known positive
+ */
+	.globl sk_load_byte_msh
+sk_load_byte_msh:
+	cmpd	r_HL, r_addr
+	ble	bpf_slow_path_byte_msh
+	lbzx	r_X, r_D, r_addr
+	rlwinm	r_X, r_X, 2, 32-4-2, 31-2
+	blr
+
+bpf_error:
+	/* Entered with cr0 = lt */
+	li	r3, 0
+	/* Generated code will 'blt epilogue', returning 0. */
+	blr
+
+/* Call out to skb_copy_bits:
+ * We'll need to back up our volatile regs first; we have
+ * local variable space at r1+(BPF_PPC_STACK_BASIC).
+ * Allocate a new stack frame here to remain ABI-compliant in
+ * stashing LR.
+ */
+#define bpf_slow_path_common(SIZE)				\
+	mflr	r0;						\
+	std	r0, 16(r1);					\
+	/* R3 goes in parameter space of caller's frame */	\
+	std	r_skb, (BPF_PPC_STACKFRAME+48)(r1);		\
+	std	r_A, (BPF_PPC_STACK_BASIC+(0*8))(r1);		\
+	std	r_X, (BPF_PPC_STACK_BASIC+(1*8))(r1);		\
+	addi	r5, r1, BPF_PPC_STACK_BASIC+(2*8);		\
+	stdu	r1, -BPF_PPC_SLOWPATH_FRAME(r1);		\
+	/* R3 = r_skb, as passed */				\
+	mr	r4, r_addr;					\
+	li	r6, SIZE;					\
+	bl	skb_copy_bits;					\
+	/* R3 = 0 on success */					\
+	addi	r1, r1, BPF_PPC_SLOWPATH_FRAME;			\
+	ld	r0, 16(r1);					\
+	ld	r_A, (BPF_PPC_STACK_BASIC+(0*8))(r1);		\
+	ld	r_X, (BPF_PPC_STACK_BASIC+(1*8))(r1);		\
+	mtlr	r0;						\
+	cmpdi	r3, 0;						\
+	blt	bpf_error;	/* cr0 = LT */			\
+	ld	r_skb, (BPF_PPC_STACKFRAME+48)(r1);		\
+	/* Great success! */
+
+bpf_slow_path_word:
+	bpf_slow_path_common(4)
+	/* Data value is on stack, and cr0 != LT */
+	lwz	r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	blr
+
+bpf_slow_path_half:
+	bpf_slow_path_common(2)
+	lhz	r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	blr
+
+bpf_slow_path_byte:
+	bpf_slow_path_common(1)
+	lbz	r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	blr
+
+bpf_slow_path_byte_msh:
+	bpf_slow_path_common(1)
+	lbz	r_X, BPF_PPC_STACK_BASIC+(2*8)(r1)
+	rlwinm	r_X, r_X, 2, 32-4-2, 31-2
+	blr
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
new file mode 100644
index 0000000..af1ab5e
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit.h
@@ -0,0 +1,227 @@
+/* bpf_jit.h: BPF JIT compiler for PPC64
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#ifndef _BPF_JIT_H
+#define _BPF_JIT_H
+
+#define BPF_PPC_STACK_LOCALS	32
+#define BPF_PPC_STACK_BASIC	(48+64)
+#define BPF_PPC_STACK_SAVE	(18*8)
+#define BPF_PPC_STACKFRAME	(BPF_PPC_STACK_BASIC+BPF_PPC_STACK_LOCALS+ \
+				 BPF_PPC_STACK_SAVE)
+#define BPF_PPC_SLOWPATH_FRAME	(48+64)
+
+/*
+ * Generated code register usage:
+ *
+ * As normal PPC C ABI (e.g. r1=sp, r2=TOC), with:
+ *
+ * skb		r3	(Entry parameter)
+ * A register	r4
+ * X register	r5
+ * addr param	r6
+ * r7-r10	scratch
+ * skb->data	r14
+ * skb headlen	r15	(skb->len - skb->data_len)
+ * m[0]		r16
+ * m[...]	...
+ * m[15]	r31
+ */
+#define r_skb		3
+#define r_ret		3
+#define r_A		4
+#define r_X		5
+#define r_addr		6
+#define r_scratch1	7
+#define r_D		14
+#define r_HL		15
+#define r_M		16
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Assembly helpers from arch/powerpc/net/bpf_jit.S:
+ */
+extern u8 sk_load_word[], sk_load_half[], sk_load_byte[], sk_load_byte_msh[];
+
+#define FUNCTION_DESCR_SIZE	24
+
+/*
+ * 16-bit immediate helper macros: HA() is for use with sign-extending instrs
+ * (e.g. LD, ADDI).  If the bottom 16 bits is "-ve", add another bit into the
+ * top half to negate the effect (i.e. 0xffff + 1 = 0x(1)0000).
+ */
+#define IMM_H(i)		((uintptr_t)(i)>>16)
+#define IMM_HA(i)		(((uintptr_t)(i)>>16) +			      \
+				 (((uintptr_t)(i) & 0x8000) >> 15))
+#define IMM_L(i)		((uintptr_t)(i) & 0xffff)
+
+#define PLANT_INSTR(d, idx, instr)					      \
+	do { if (d) { (d)[idx] = instr; } idx++; } while (0)
+#define EMIT(instr)		PLANT_INSTR(image, ctx->idx, instr)
+
+#define PPC_NOP()		EMIT(PPC_INST_NOP)
+#define PPC_BLR()		EMIT(PPC_INST_BLR)
+#define PPC_BLRL()		EMIT(PPC_INST_BLRL)
+#define PPC_MTLR(r)		EMIT(PPC_INST_MTLR | __PPC_RT(r))
+#define PPC_ADDI(d, a, i)	EMIT(PPC_INST_ADDI | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | IMM_L(i))
+#define PPC_MR(d, a)		PPC_OR(d, a, a)
+#define PPC_LI(r, i)		PPC_ADDI(r, 0, i)
+#define PPC_ADDIS(d, a, i)	EMIT(PPC_INST_ADDIS |			      \
+				     __PPC_RS(d) | __PPC_RA(a) | IMM_L(i))
+#define PPC_LIS(r, i)		PPC_ADDIS(r, 0, i)
+#define PPC_STD(r, base, i)	EMIT(PPC_INST_STD | __PPC_RS(r) |	      \
+				     __PPC_RA(base) | ((i) & 0xfffc))
+
+#define PPC_LD(r, base, i)	EMIT(PPC_INST_LD | __PPC_RT(r) |	      \
+				     __PPC_RA(base) | IMM_L(i))
+#define PPC_LWZ(r, base, i)	EMIT(PPC_INST_LWZ | __PPC_RT(r) |	      \
+				     __PPC_RA(base) | IMM_L(i))
+#define PPC_LHZ(r, base, i)	EMIT(PPC_INST_LHZ | __PPC_RT(r) |	      \
+				     __PPC_RA(base) | IMM_L(i))
+/* Convenience helpers for the above with 'far' offsets: */
+#define PPC_LD_OFFS(r, base, i) do { if ((i) < 32768) PPC_LD(r, base, i);     \
+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
+			PPC_LD(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_LWZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LWZ(r, base, i);   \
+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
+			PPC_LWZ(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_LHZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LHZ(r, base, i);   \
+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
+			PPC_LHZ(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_CMPWI(a, i)		EMIT(PPC_INST_CMPWI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPDI(a, i)		EMIT(PPC_INST_CMPDI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPLWI(a, i)	EMIT(PPC_INST_CMPLWI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPLW(a, b)		EMIT(PPC_INST_CMPLW | __PPC_RA(a) | __PPC_RB(b))
+
+#define PPC_SUB(d, a, b)	EMIT(PPC_INST_SUB | __PPC_RT(d) |	      \
+				     __PPC_RB(a) | __PPC_RA(b))
+#define PPC_ADD(d, a, b)	EMIT(PPC_INST_ADD | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MUL(d, a, b)	EMIT(PPC_INST_MULLW | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MULHWU(d, a, b)	EMIT(PPC_INST_MULHWU | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MULI(d, a, i)	EMIT(PPC_INST_MULLI | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | IMM_L(i))
+#define PPC_DIVWU(d, a, b)	EMIT(PPC_INST_DIVWU | __PPC_RT(d) |	      \
+				     __PPC_RA(a) | __PPC_RB(b))
+#define PPC_AND(d, a, b)	EMIT(PPC_INST_AND | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(b))
+#define PPC_ANDI(d, a, i)	EMIT(PPC_INST_ANDI | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | IMM_L(i))
+#define PPC_AND_DOT(d, a, b)	EMIT(PPC_INST_ANDDOT | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(b))
+#define PPC_OR(d, a, b)		EMIT(PPC_INST_OR | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(b))
+#define PPC_ORI(d, a, i)	EMIT(PPC_INST_ORI | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | IMM_L(i))
+#define PPC_ORIS(d, a, i)	EMIT(PPC_INST_ORIS | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | IMM_L(i))
+#define PPC_SLW(d, a, s)	EMIT(PPC_INST_SLW | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(s))
+#define PPC_SRW(d, a, s)	EMIT(PPC_INST_SRW | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_RB(s))
+/* slwi = rlwinm Rx, Ry, n, 0, 31-n */
+#define PPC_SLWI(d, a, i)	EMIT(PPC_INST_RLWINM | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_SH(i) |	      \
+				     __PPC_MB(0) | __PPC_ME(31-(i)))
+/* srwi = rlwinm Rx, Ry, 32-n, n, 31 */
+#define PPC_SRWI(d, a, i)	EMIT(PPC_INST_RLWINM | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_SH(32-(i)) |	      \
+				     __PPC_MB(i) | __PPC_ME(31))
+/* sldi = rldicr Rx, Ry, n, 63-n */
+#define PPC_SLDI(d, a, i)	EMIT(PPC_INST_RLDICR | __PPC_RA(d) |	      \
+				     __PPC_RS(a) | __PPC_SH(i) |	      \
+				     __PPC_MB(63-(i)) | (((i) & 0x20) >> 4))
+#define PPC_NEG(d, a)		EMIT(PPC_INST_NEG | __PPC_RT(d) | __PPC_RA(a))
+
+/* Long jump; (unconditional 'branch') */
+#define PPC_JMP(dest)		EMIT(PPC_INST_BRANCH |			      \
+				     (((dest) - (ctx->idx * 4)) & 0x03fffffc))
+/* "cond" here covers BO:BI fields. */
+#define PPC_BCC_SHORT(cond, dest)	EMIT(PPC_INST_BRANCH_COND |	      \
+					     (((cond) & 0x3ff) << 16) |	      \
+					     (((dest) - (ctx->idx * 4)) &     \
+					      0xfffc))
+#define PPC_LI32(d, i)		do { PPC_LI(d, IMM_L(i));		      \
+		if ((u32)(uintptr_t)(i) >= 32768) {			      \
+			PPC_ADDIS(d, d, IMM_HA(i));			      \
+		} } while(0)
+#define PPC_LI64(d, i)		do {					      \
+		if (!((uintptr_t)(i) & 0xffffffff00000000ULL))		      \
+			PPC_LI32(d, i);					      \
+		else {							      \
+			PPC_LIS(d, ((uintptr_t)(i) >> 48));		      \
+			if ((uintptr_t)(i) & 0x0000ffff00000000ULL)	      \
+				PPC_ORI(d, d,				      \
+					((uintptr_t)(i) >> 32) & 0xffff);     \
+			PPC_SLDI(d, d, 32);				      \
+			if ((uintptr_t)(i) & 0x00000000ffff0000ULL)	      \
+				PPC_ORIS(d, d,				      \
+					 ((uintptr_t)(i) >> 16) & 0xffff);    \
+			if ((uintptr_t)(i) & 0x000000000000ffffULL)	      \
+				PPC_ORI(d, d, (uintptr_t)(i) & 0xffff);	      \
+		} } while (0)
+
+static inline bool is_nearbranch(int offset)
+{
+	return (offset < 32768) && (offset >= -32768);
+}
+
+/*
+ * The fly in the ointment of code size changing from pass to pass is
+ * avoided by padding the short branch case with a NOP.	 If code size differs
+ * with different branch reaches we will have the issue of code moving from
+ * one pass to the next and will need a few passes to converge on a stable
+ * state.
+ */
+#define PPC_BCC(cond, dest)	do {					      \
+		if (is_nearbranch((dest) - (ctx->idx * 4))) {		      \
+			PPC_BCC_SHORT(cond, dest);			      \
+			PPC_NOP();					      \
+		} else {						      \
+			/* Flip the 'T or F' bit to invert comparison */      \
+			PPC_BCC_SHORT(cond ^ COND_CMP_TRUE, (ctx->idx+2)*4);  \
+			PPC_JMP(dest);					      \
+		} } while(0)
+
+/* To create a branch condition, select a bit of cr0... */
+#define CR0_LT		0
+#define CR0_GT		1
+#define CR0_EQ		2
+/* ...and modify BO[3] */
+#define COND_CMP_TRUE	0x100
+#define COND_CMP_FALSE	0x000
+/* Together, they make all required comparisons: */
+#define COND_GT		(CR0_GT | COND_CMP_TRUE)
+#define COND_GE		(CR0_LT | COND_CMP_FALSE)
+#define COND_EQ		(CR0_EQ | COND_CMP_TRUE)
+#define COND_NE		(CR0_EQ | COND_CMP_FALSE)
+#define COND_LT		(CR0_LT | COND_CMP_TRUE)
+
+#define SEEN_DATAREF 0x10000 /* might call external helpers */
+#define SEEN_XREG    0x20000 /* X reg is used */
+#define SEEN_MEM     0x40000 /* SEEN_MEM+(1<<n) = use mem[n] for temporary
+			      * storage */
+#define SEEN_MEM_MSK 0x0ffff
+
+struct codegen_context {
+	unsigned int seen;
+	unsigned int idx;
+	int pc_ret0; /* bpf index of first RET #0 instruction (if any) */
+};
+
+#endif
+
+#endif
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
new file mode 100644
index 0000000..2cb2566
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -0,0 +1,690 @@
+/* bpf_jit_comp.c: BPF JIT compiler for PPC64
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * Based on the x86 BPF compiler, by Eric Dumazet (eric.dumazet@gmail.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#include <linux/moduleloader.h>
+#include <asm/cacheflush.h>
+#include <linux/netdevice.h>
+#include <linux/filter.h>
+#include "bpf_jit.h"
+
+#ifndef __BIG_ENDIAN
+/* There are endianness assumptions herein. */
+#error "Little-endian PPC not supported in BPF compiler"
+#endif
+
+int bpf_jit_enable __read_mostly;
+
+
+static inline void bpf_flush_icache(void *start, void *end)
+{
+	smp_wmb();
+	flush_icache_range((unsigned long)start, (unsigned long)end);
+}
+
+static void bpf_jit_build_prologue(struct sk_filter *fp, u32 *image,
+				   struct codegen_context *ctx)
+{
+	int i;
+	const struct sock_filter *filter = fp->insns;
+
+	if (ctx->seen & (SEEN_MEM | SEEN_DATAREF)) {
+		/* Make stackframe */
+		if (ctx->seen & SEEN_DATAREF) {
+			/* If we call any helpers (for loads), save LR */
+			EMIT(PPC_INST_MFLR | __PPC_RT(0));
+			PPC_STD(0, 1, 16);
+
+			/* Back up non-volatile regs. */
+			PPC_STD(r_D, 1, -(8*(32-r_D)));
+			PPC_STD(r_HL, 1, -(8*(32-r_HL)));
+		}
+		if (ctx->seen & SEEN_MEM) {
+			/*
+			 * Conditionally save regs r16-r31 as some will be used
+			 * for M[] data.
+			 */
+			for (i = r_M; i < (r_M+16); i++) {
+				if (ctx->seen & (1 << (i-r_M)))
+					PPC_STD(i, 1, -(8*(32-i)));
+			}
+		}
+		EMIT(PPC_INST_STDU | __PPC_RS(1) | __PPC_RA(1) |
+		     (-BPF_PPC_STACKFRAME & 0xfffc));
+	}
+
+	if (ctx->seen & SEEN_DATAREF) {
+		/*
+		 * If this filter needs to access skb data,
+		 * prepare r_D and r_HL:
+		 *  r_HL = skb->len - skb->data_len
+		 *  r_D	 = skb->data
+		 */
+		PPC_LWZ_OFFS(r_scratch1, r_skb, offsetof(struct sk_buff,
+							 data_len));
+		PPC_LWZ_OFFS(r_HL, r_skb, offsetof(struct sk_buff, len));
+		PPC_SUB(r_HL, r_HL, r_scratch1);
+		PPC_LD_OFFS(r_D, r_skb, offsetof(struct sk_buff, data));
+	}
+
+	if (ctx->seen & SEEN_XREG) {
+		/*
+		 * TODO: Could also detect whether first instr. sets X and
+		 * avoid this (as below, with A).
+		 */
+		PPC_LI(r_X, 0);
+	}
+
+	switch (filter[0].code) {
+	case BPF_S_RET_K:
+	case BPF_S_LD_W_LEN:
+	case BPF_S_ANC_PROTOCOL:
+	case BPF_S_ANC_IFINDEX:
+	case BPF_S_ANC_MARK:
+	case BPF_S_ANC_RXHASH:
+	case BPF_S_ANC_CPU:
+	case BPF_S_ANC_QUEUE:
+	case BPF_S_LD_W_ABS:
+	case BPF_S_LD_H_ABS:
+	case BPF_S_LD_B_ABS:
+		/* first instruction sets A register (or is RET 'constant') */
+		break;
+	default:
+		/* make sure we don't leak kernel information to user */
+		PPC_LI(r_A, 0);
+	}
+}
+
+static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
+{
+	int i;
+
+	if (ctx->seen & (SEEN_MEM | SEEN_DATAREF)) {
+		PPC_ADDI(1, 1, BPF_PPC_STACKFRAME);
+		if (ctx->seen & SEEN_DATAREF) {
+			PPC_LD(0, 1, 16);
+			PPC_MTLR(0);
+			PPC_LD(r_D, 1, -(8*(32-r_D)));
+			PPC_LD(r_HL, 1, -(8*(32-r_HL)));
+		}
+		if (ctx->seen & SEEN_MEM) {
+			/* Restore any saved non-vol registers */
+			for (i = r_M; i < (r_M+16); i++) {
+				if (ctx->seen & (1 << (i-r_M)))
+					PPC_LD(i, 1, -(8*(32-i)));
+			}
+		}
+	}
+	/* The RETs have left a return value in R3. */
+
+	PPC_BLR();
+}
+
+/* Assemble the body code between the prologue & epilogue. */
+static int bpf_jit_build_body(struct sk_filter *fp, u32 *image,
+			      struct codegen_context *ctx,
+			      unsigned int *addrs)
+{
+	const struct sock_filter *filter = fp->insns;
+	int flen = fp->len;
+	u8 *func;
+	unsigned int true_cond;
+	int i;
+
+	/* Start of epilogue code */
+	unsigned int exit_addr = addrs[flen];
+
+	for (i = 0; i < flen; i++) {
+		unsigned int K = filter[i].k;
+
+		/*
+		 * addrs[] maps a BPF bytecode address into a real offset from
+		 * the start of the body code.
+		 */
+		addrs[i] = ctx->idx * 4;
+
+		switch (filter[i].code) {
+			/*** ALU ops ***/
+		case BPF_S_ALU_ADD_X: /* A += X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_ADD(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_ADD_K: /* A += K; */
+			if (!K)
+				break;
+			PPC_ADDI(r_A, r_A, IMM_L(K));
+			if (K >= 32768)
+				PPC_ADDIS(r_A, r_A, IMM_HA(K));
+			break;
+		case BPF_S_ALU_SUB_X: /* A -= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_SUB(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_SUB_K: /* A -= K */
+			if (!K)
+				break;
+			PPC_ADDI(r_A, r_A, IMM_L(-K));
+			if (K >= 32768)
+				PPC_ADDIS(r_A, r_A, IMM_HA(-K));
+			break;
+		case BPF_S_ALU_MUL_X: /* A *= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_MUL(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_MUL_K: /* A *= K */
+			if (K < 32768)
+				PPC_MULI(r_A, r_A, K);
+			else {
+				PPC_LI32(r_scratch1, K);
+				PPC_MUL(r_A, r_A, r_scratch1);
+			}
+			break;
+		case BPF_S_ALU_DIV_X: /* A /= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_CMPWI(r_X, 0);
+			if (ctx->pc_ret0 != -1) {
+				PPC_BCC(COND_EQ, addrs[ctx->pc_ret0]);
+			} else {
+				/*
+				 * Exit, returning 0; first pass hits here
+				 * (longer worst-case code size).
+				 */
+				PPC_BCC_SHORT(COND_NE, (ctx->idx*4)+12);
+				PPC_LI(r_ret, 0);
+				PPC_JMP(exit_addr);
+			}
+			PPC_DIVWU(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
+			PPC_LI32(r_scratch1, K);
+			/* Top 32 bits of 64bit result -> A */
+			PPC_MULHWU(r_A, r_A, r_scratch1);
+			break;
+		case BPF_S_ALU_AND_X:
+			ctx->seen |= SEEN_XREG;
+			PPC_AND(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_AND_K:
+			if (!IMM_H(K))
+				PPC_ANDI(r_A, r_A, K);
+			else {
+				PPC_LI32(r_scratch1, K);
+				PPC_AND(r_A, r_A, r_scratch1);
+			}
+			break;
+		case BPF_S_ALU_OR_X:
+			ctx->seen |= SEEN_XREG;
+			PPC_OR(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_OR_K:
+			if (IMM_L(K))
+				PPC_ORI(r_A, r_A, IMM_L(K));
+			if (K >= 65536)
+				PPC_ORIS(r_A, r_A, IMM_H(K));
+			break;
+		case BPF_S_ALU_LSH_X: /* A <<= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_SLW(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_LSH_K:
+			if (K == 0)
+				break;
+			else
+				PPC_SLWI(r_A, r_A, K);
+			break;
+		case BPF_S_ALU_RSH_X: /* A >>= X; */
+			ctx->seen |= SEEN_XREG;
+			PPC_SRW(r_A, r_A, r_X);
+			break;
+		case BPF_S_ALU_RSH_K: /* A >>= K; */
+			if (K == 0)
+				break;
+			else
+				PPC_SRWI(r_A, r_A, K);
+			break;
+		case BPF_S_ALU_NEG:
+			PPC_NEG(r_A, r_A);
+			break;
+		case BPF_S_RET_K:
+			PPC_LI32(r_ret, K);
+			if (!K) {
+				if (ctx->pc_ret0 == -1)
+					ctx->pc_ret0 = i;
+			}
+			/*
+			 * If this isn't the very last instruction, branch to
+			 * the epilogue if we've stuff to clean up.  Otherwise,
+			 * if there's nothing to tidy, just return.  If we /are/
+			 * the last instruction, we're about to fall through to
+			 * the epilogue to return.
+			 */
+			if (i != flen - 1) {
+				/*
+				 * Note: 'seen' is properly valid only on pass
+				 * #2.	Both parts of this conditional are the
+				 * same instruction size though, meaning the
+				 * first pass will still correctly determine the
+				 * code size/addresses.
+				 */
+				if (ctx->seen)
+					PPC_JMP(exit_addr);
+				else
+					PPC_BLR();
+			}
+			break;
+		case BPF_S_RET_A:
+			PPC_MR(r_ret, r_A);
+			if (i != flen - 1) {
+				if (ctx->seen)
+					PPC_JMP(exit_addr);
+				else
+					PPC_BLR();
+			}
+			break;
+		case BPF_S_MISC_TAX: /* X = A */
+			PPC_MR(r_X, r_A);
+			break;
+		case BPF_S_MISC_TXA: /* A = X */
+			ctx->seen |= SEEN_XREG;
+			PPC_MR(r_A, r_X);
+			break;
+
+			/*** Constant loads/M[] access ***/
+		case BPF_S_LD_IMM: /* A = K */
+			PPC_LI32(r_A, K);
+			break;
+		case BPF_S_LDX_IMM: /* X = K */
+			PPC_LI32(r_X, K);
+			break;
+		case BPF_S_LD_MEM: /* A = mem[K] */
+			PPC_MR(r_A, r_M + (K & 0xf));
+			ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_LDX_MEM: /* X = mem[K] */
+			PPC_MR(r_X, r_M + (K & 0xf));
+			ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_ST: /* mem[K] = A */
+			PPC_MR(r_M + (K & 0xf), r_A);
+			ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_STX: /* mem[K] = X */
+			PPC_MR(r_M + (K & 0xf), r_X);
+			ctx->seen |= SEEN_XREG | SEEN_MEM | (1<<(K & 0xf));
+			break;
+		case BPF_S_LD_W_LEN: /*	A = skb->len; */
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+			PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff, len));
+			break;
+		case BPF_S_LDX_W_LEN: /* X = skb->len; */
+			PPC_LWZ_OFFS(r_X, r_skb, offsetof(struct sk_buff, len));
+			break;
+
+			/*** Ancillary info loads ***/
+
+			/* None of the BPF_S_ANC* codes appear to be passed by
+			 * sk_chk_filter().  The interpreter and the x86 BPF
+			 * compiler implement them so we do too -- they may be
+			 * planted in future.
+			 */
+		case BPF_S_ANC_PROTOCOL: /* A = ntohs(skb->protocol); */
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+						  protocol) != 2);
+			PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  protocol));
+			/* ntohs is a NOP with BE loads. */
+			break;
+		case BPF_S_ANC_IFINDEX:
+			PPC_LD_OFFS(r_scratch1, r_skb, offsetof(struct sk_buff,
+								dev));
+			PPC_CMPDI(r_scratch1, 0);
+			if (ctx->pc_ret0 != -1) {
+				PPC_BCC(COND_EQ, addrs[ctx->pc_ret0]);
+			} else {
+				/* Exit, returning 0; first pass hits here. */
+				PPC_BCC_SHORT(COND_NE, (ctx->idx*4)+12);
+				PPC_LI(r_ret, 0);
+				PPC_JMP(exit_addr);
+			}
+			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
+						  ifindex) != 4);
+			PPC_LWZ_OFFS(r_A, r_scratch1,
+				     offsetof(struct net_device, ifindex));
+			break;
+		case BPF_S_ANC_MARK:
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+			PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  mark));
+			break;
+		case BPF_S_ANC_RXHASH:
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, rxhash) != 4);
+			PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  rxhash));
+			break;
+		case BPF_S_ANC_QUEUE:
+			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+						  queue_mapping) != 2);
+			PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+							  queue_mapping));
+			break;
+		case BPF_S_ANC_CPU:
+#ifdef CONFIG_SMP
+			/*
+			 * PACA ptr is r13:
+			 * raw_smp_processor_id() = local_paca->paca_index
+			 */
+			PPC_LHZ_OFFS(r_A, 13,
+				     offsetof(struct paca_struct, paca_index));
+#else
+			PPC_LI(r_A, 0);
+#endif
+			break;
+
+			/*** Absolute loads from packet header/data ***/
+		case BPF_S_LD_W_ABS:
+			func = sk_load_word;
+			goto common_load;
+		case BPF_S_LD_H_ABS:
+			func = sk_load_half;
+			goto common_load;
+		case BPF_S_LD_B_ABS:
+			func = sk_load_byte;
+		common_load:
+			/*
+			 * Load from [K].  Reference with the (negative)
+			 * SKF_NET_OFF/SKF_LL_OFF offsets is unsupported.
+			 */
+			ctx->seen |= SEEN_DATAREF;
+			if ((int)K < 0)
+				return -ENOTSUPP;
+			PPC_LI64(r_scratch1, func);
+			PPC_MTLR(r_scratch1);
+			PPC_LI32(r_addr, K);
+			PPC_BLRL();
+			/*
+			 * Helper returns 'lt' condition on error, and an
+			 * appropriate return value in r3
+			 */
+			PPC_BCC(COND_LT, exit_addr);
+			break;
+
+			/*** Indirect loads from packet header/data ***/
+		case BPF_S_LD_W_IND:
+			func = sk_load_word;
+			goto common_load_ind;
+		case BPF_S_LD_H_IND:
+			func = sk_load_half;
+			goto common_load_ind;
+		case BPF_S_LD_B_IND:
+			func = sk_load_byte;
+		common_load_ind:
+			/*
+			 * Load from [X + K].  Negative offsets are tested for
+			 * in the helper functions, and result in a 'ret 0'.
+			 */
+			ctx->seen |= SEEN_DATAREF | SEEN_XREG;
+			PPC_LI64(r_scratch1, func);
+			PPC_MTLR(r_scratch1);
+			PPC_ADDI(r_addr, r_X, IMM_L(K));
+			if (K >= 32768)
+				PPC_ADDIS(r_addr, r_addr, IMM_HA(K));
+			PPC_BLRL();
+			/* If error, cr0.LT set */
+			PPC_BCC(COND_LT, exit_addr);
+			break;
+
+		case BPF_S_LDX_B_MSH:
+			/*
+			 * x86 version drops packet (RET 0) when K<0, whereas
+			 * interpreter does allow K<0 (__load_pointer, special
+			 * ancillary data).
+			 */
+			func = sk_load_byte_msh;
+			goto common_load;
+			break;
+
+			/*** Jump and branches ***/
+		case BPF_S_JMP_JA:
+			if (K != 0)
+				PPC_JMP(addrs[i + 1 + K]);
+			break;
+
+		case BPF_S_JMP_JGT_K:
+		case BPF_S_JMP_JGT_X:
+			true_cond = COND_GT;
+			goto cond_branch;
+		case BPF_S_JMP_JGE_K:
+		case BPF_S_JMP_JGE_X:
+			true_cond = COND_GE;
+			goto cond_branch;
+		case BPF_S_JMP_JEQ_K:
+		case BPF_S_JMP_JEQ_X:
+			true_cond = COND_EQ;
+			goto cond_branch;
+		case BPF_S_JMP_JSET_K:
+		case BPF_S_JMP_JSET_X:
+			true_cond = COND_NE;
+			/* Fall through */
+		cond_branch:
+			/* same targets, can avoid doing the test :) */
+			if (filter[i].jt == filter[i].jf) {
+				if (filter[i].jt > 0)
+					PPC_JMP(addrs[i + 1 + filter[i].jt]);
+				break;
+			}
+
+			switch (filter[i].code) {
+			case BPF_S_JMP_JGT_X:
+			case BPF_S_JMP_JGE_X:
+			case BPF_S_JMP_JEQ_X:
+				ctx->seen |= SEEN_XREG;
+				PPC_CMPLW(r_A, r_X);
+				break;
+			case BPF_S_JMP_JSET_X:
+				ctx->seen |= SEEN_XREG;
+				PPC_AND_DOT(r_scratch1, r_A, r_X);
+				break;
+			case BPF_S_JMP_JEQ_K:
+			case BPF_S_JMP_JGT_K:
+			case BPF_S_JMP_JGE_K:
+				if (K < 32768)
+					PPC_CMPLWI(r_A, K);
+				else {
+					PPC_LI32(r_scratch1, K);
+					PPC_CMPLW(r_A, r_scratch1);
+				}
+				break;
+			case BPF_S_JMP_JSET_K:
+				if (K < 32768)
+					/* PPC_ANDI is /only/ dot-form */
+					PPC_ANDI(r_scratch1, r_A, K);
+				else {
+					PPC_LI32(r_scratch1, K);
+					PPC_AND_DOT(r_scratch1, r_A,
+						    r_scratch1);
+				}
+				break;
+			}
+			/* Sometimes branches are constructed "backward", with
+			 * the false path being the branch and true path being
+			 * a fallthrough to the next instruction.
+			 */
+			if (filter[i].jt == 0)
+				/* Swap the sense of the branch */
+				PPC_BCC(true_cond ^ COND_CMP_TRUE,
+					addrs[i + 1 + filter[i].jf]);
+			else {
+				PPC_BCC(true_cond, addrs[i + 1 + filter[i].jt]);
+				if (filter[i].jf != 0)
+					PPC_JMP(addrs[i + 1 + filter[i].jf]);
+			}
+			break;
+		default:
+			/* The filter contains something cruel & unusual.
+			 * We don't handle it, but also there shouldn't be
+			 * anything missing from our list.
+			 */
+			pr_err("BPF filter opcode %04x (@%d) unsupported\n",
+			       filter[i].code, i);
+			return -ENOTSUPP;
+		}
+
+	}
+	/* Set end-of-body-code address for exit. */
+	addrs[i] = ctx->idx * 4;
+
+	return 0;
+}
+
+void bpf_jit_compile(struct sk_filter *fp)
+{
+	unsigned int proglen;
+	unsigned int alloclen;
+	u32 *image = NULL;
+	u32 *code_base;
+	unsigned int *addrs;
+	struct codegen_context cgctx;
+	int pass;
+	int flen = fp->len;
+
+	if (!bpf_jit_enable)
+		return;
+
+	addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL);
+	if (addrs == NULL)
+		return;
+
+	/*
+	 * There are multiple assembly passes as the generated code will change
+	 * size as it settles down, figuring out the max branch offsets/exit
+	 * paths required.
+	 *
+	 * The range of standard conditional branches is +/- 32Kbytes.	Since
+	 * BPF_MAXINSNS = 4096, we can only jump from (worst case) start to
+	 * finish with 8 bytes/instruction.  Not feasible, so long jumps are
+	 * used, distinct from short branches.
+	 *
+	 * Current:
+	 *
+	 * For now, both branch types assemble to 2 words (short branches padded
+	 * with a NOP); this is less efficient, but assembly will always complete
+	 * after exactly 3 passes:
+	 *
+	 * First pass: No code buffer; Program is "faux-generated" -- no code
+	 * emitted but maximum size of output determined (and addrs[] filled
+	 * in).	 Also, we note whether we use M[], whether we use skb data, etc.
+	 * All generation choices assumed to be 'worst-case', e.g. branches all
+	 * far (2 instructions), return path code reduction not available, etc.
+	 *
+	 * Second pass: Code buffer allocated with size determined previously.
+	 * Prologue generated to support features we have seen used.  Exit paths
+	 * determined and addrs[] is filled in again, as code may be slightly
+	 * smaller as a result.
+	 *
+	 * Third pass: Code generated 'for real', and branch destinations
+	 * determined from now-accurate addrs[] map.
+	 *
+	 * Ideal:
+	 *
+	 * If we optimise this, near branches will be shorter.	On the
+	 * first assembly pass, we should err on the side of caution and
+	 * generate the biggest code.  On subsequent passes, branches will be
+	 * generated short or long and code size will reduce.  With smaller
+	 * code, more branches may fall into the short category, and code will
+	 * reduce more.
+	 *
+	 * Finally, if we see one pass generate code the same size as the
+	 * previous pass we have converged and should now generate code for
+	 * real.  Allocating at the end will also save the memory that would
+	 * otherwise be wasted by the (small) current code shrinkage.
+	 * Preferably, we should do a small number of passes (e.g. 5) and if we
+	 * haven't converged by then, get impatient and force code to generate
+	 * as-is, even if the odd branch would be left long.  The chances of a
+	 * long jump are tiny with all but the most enormous of BPF filter
+	 * inputs, so we should usually converge on the third pass.
+	 */
+
+	cgctx.idx = 0;
+	cgctx.seen = 0;
+	cgctx.pc_ret0 = -1;
+	/* Scouting faux-generate pass 0 */
+	if (bpf_jit_build_body(fp, 0, &cgctx, addrs))
+		/* We hit something illegal or unsupported. */
+		goto out;
+
+	/*
+	 * Pretend to build prologue, given the features we've seen.  This will
+	 * update cgctx.idx as it pretends to output instructions, then we can
+	 * calculate total size from idx.
+	 */
+	bpf_jit_build_prologue(fp, 0, &cgctx);
+	bpf_jit_build_epilogue(0, &cgctx);
+
+	proglen = cgctx.idx * 4;
+	alloclen = proglen + FUNCTION_DESCR_SIZE;
+	image = module_alloc(max_t(unsigned int, alloclen,
+				   sizeof(struct work_struct)));
+	if (!image)
+		goto out;
+
+	code_base = image + (FUNCTION_DESCR_SIZE/4);
+
+	/* Code generation passes 1-2 */
+	for (pass = 1; pass < 3; pass++) {
+		/* Now build the prologue, body code & epilogue for real. */
+		cgctx.idx = 0;
+		bpf_jit_build_prologue(fp, code_base, &cgctx);
+		bpf_jit_build_body(fp, code_base, &cgctx, addrs);
+		bpf_jit_build_epilogue(code_base, &cgctx);
+
+		if (bpf_jit_enable > 1)
+			pr_info("Pass %d: shrink = %d, seen = 0x%x\n", pass,
+				proglen - (cgctx.idx * 4), cgctx.seen);
+	}
+
+	if (bpf_jit_enable > 1)
+		pr_info("flen=%d proglen=%u pass=%d image=%p\n",
+		       flen, proglen, pass, image);
+
+	if (image) {
+		if (bpf_jit_enable > 1)
+			print_hex_dump(KERN_ERR, "JIT code: ",
+				       DUMP_PREFIX_ADDRESS,
+				       16, 1, code_base,
+				       proglen, false);
+
+		bpf_flush_icache(code_base, code_base + (proglen/4));
+		/* Function descriptor nastiness: Address + TOC */
+		((u64 *)image)[0] = (u64)code_base;
+		((u64 *)image)[1] = local_paca->kernel_toc;
+		fp->bpf_func = (void *)image;
+	}
+out:
+	kfree(addrs);
+	return;
+}
+
+static void jit_free_defer(struct work_struct *arg)
+{
+	module_free(NULL, arg);
+}
+
+/* run from softirq, we must use a work_struct to call
+ * module_free() from process context
+ */
+void bpf_jit_free(struct sk_filter *fp)
+{
+	if (fp->bpf_func != sk_run_filter) {
+		struct work_struct *work = (struct work_struct *)fp->bpf_func;
+
+		INIT_WORK(work, jit_free_defer);
+		schedule_work(work);
+	}
+}

^ permalink raw reply related

* [PATCH] powerpc: Copy back TIF flags on return from softirq stack
From: Benjamin Herrenschmidt @ 2011-07-19  3:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Peter Zijlstra

We already did it for hard IRQs but it looks like we forgot
to do it for softirqs. Without this, we would lose flags
such as TIF_NEED_RESCHED set using current_thread_info()
by something running off a softirq.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/kernel/irq.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 164fb6c..4e7f1aa 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -451,11 +451,18 @@ static inline void do_softirq_onstack(void)
 	curtp = current_thread_info();
 	irqtp = softirq_ctx[smp_processor_id()];
 	irqtp->task = curtp->task;
+	irqtp->flags = 0;
 	current->thread.ksp_limit = (unsigned long)irqtp +
 				    _ALIGN_UP(sizeof(struct thread_info), 16);
 	call_do_softirq(irqtp);
 	current->thread.ksp_limit = saved_sp_limit;
 	irqtp->task = NULL;
+
+	/* Set any flag that may have been set on the
+	 * alternate stack
+	 */
+	if (irqtp->flags)
+		set_bits(irqtp->flags, &curtp->flags);
 }
 
 void do_softirq(void)

^ permalink raw reply related

* Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
From: Shan Hai @ 2011-07-19  3:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1310974591.25044.298.camel@pasglop>

On 07/18/2011 03:36 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-07-18 at 15:26 +0800, Shan Hai wrote:
>> I am sorry I hadn't tried your newer patch, I tried it but it still
>> could not work in my test environment, I will dig into and tell you
>> why that failed later.
> Ok, please let me know what you find !
>

I have not yet found out why it failed.
I tried the following, based on your code:
(1)
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
  {
         struct mm_struct *mm = current->mm;
         int ret;
+       int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;

         down_read(&mm->mmap_sem);
-       ret = get_user_pages(current, mm, (unsigned long)uaddr,
-                            1, 1, 0, NULL, NULL);
+       ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+                              flags, NULL, NULL, NULL);
         up_read(&mm->mmap_sem);

         return ret < 0 ? ret : 0;

(2)
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..f7ba26e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
...
+
+       if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+               handle_pte_sw_young_dirty(vma, address, ptep,
+                                          FAULT_FLAG_WRITE);
...

Everything looks good, but it still does not work; this needs more
investigation.

>> Yep, I know holding lots of ifdef's everywhere is not so good,
>> but if we have some other way(I don't know how till now) to
>> figure out the arch has the need to fixup up the write permission
>> we could eradicate the ugly ifdef's here.
>>
>> I think the handle_mm_fault could do all dirty/young tracking,
>> because the purpose of making follow_page return NULL to
>> its caller is that want to the handle_mm_fault to be called
>> on write permission protection fault.
> I see your point. Rather than factoring the fixup code out, we could
> force gup to call handle_mm_fault()... that makes sense.
>
> However, I don't think we should special case archs. There's plenty of
> cases where we don't care about this fixup even on archs that do SW
> tracking of dirty and young. For example when gup is using for
> subsequent DMA.
>
> Only the (rare ?) cases where it's used as a mean to fixup a failing
> "atomic" user access are relevant.
>
> So I believe we should still pass an explicit flag to __get_user_pages()
> as I propose to activate that behaviour.
>

How about the following one?
The write permission fixup behaviour is triggered explicitly by
the troublemaking callers, such as futex, as you suggested.

This way, follow_page() mimics exactly how the MMU faults on
atomic access to user pages, and we can handle the fault with the
existing handle_mm_fault(), which also does the dirty/young
tracking properly.


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..8a76694 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *, 
unsigned long address,
  #define FOLL_MLOCK    0x40    /* mark page as mlocked */
  #define FOLL_SPLIT    0x80    /* don't return transhuge pages, split 
them */
  #define FOLL_HWPOISON    0x100    /* check page is hwpoisoned */
+#define FOLL_FIXFAULT    0x200    /* fixup after a fault (PTE 
dirty/young upd) */

  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
              void *data);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
  {
      struct mm_struct *mm = current->mm;
      int ret;
+    int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;

      down_read(&mm->mmap_sem);
-    ret = get_user_pages(current, mm, (unsigned long)uaddr,
-                 1, 1, 0, NULL, NULL);
+    ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+                   flags, NULL, NULL, NULL);
      up_read(&mm->mmap_sem);

      return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..5682501 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct 
*vma, unsigned long address,
      spinlock_t *ptl;
      struct page *page;
      struct mm_struct *mm = vma->vm_mm;
+    int fix_write_permission = 0;

      page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
      if (!IS_ERR(page)) {
@@ -1519,6 +1520,9 @@ split_fallthrough:
          if ((flags & FOLL_WRITE) &&
              !pte_dirty(pte) && !PageDirty(page))
              set_page_dirty(page);
+
+        if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+            fix_write_permission = 1;
          /*
           * pte_mkyoung() would be more correct here, but atomic care
           * is needed to avoid losing the dirty bit: it is easier to use
@@ -1551,7 +1555,7 @@ split_fallthrough:
  unlock:
      pte_unmap_unlock(ptep, ptl);
  out:
-    return page;
+    return (fix_write_permission) ? NULL : page;

  bad_page:
      pte_unmap_unlock(ptep, ptl);

> At this point, since we have isolated the special case callers, I think
> we are pretty much in a situation where there's no point trying to
> optimize the x86 case more, it's a fairly slow path anyway, and so no
> ifdef should be needed (and x86 already #define out the TLB flush for
> spurious faults in handle_pte_fault today).
>
> We don't even need to change follow_page()... we just don't call it the
> first time around.
>
> I'll cook up another patch later but first we need to find out why the
> one you have doesn't work. There might be another problem lurking (or I
> just made a stupid mistake).
>
> BTW. Can you give me some details about how you reproduce the problem ?
> I should setup something on a booke machine here to verify things.
>
> Cheers,
> Ben.
>

^ permalink raw reply related

* Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
From: Benjamin Herrenschmidt @ 2011-07-19  4:20 UTC (permalink / raw)
  To: Shan Hai
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <4E24FA51.70602@gmail.com>

On Tue, 2011-07-19 at 11:30 +0800, Shan Hai wrote:
> On 07/18/2011 03:36 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2011-07-18 at 15:26 +0800, Shan Hai wrote:
> >> I am sorry I hadn't tried your newer patch, I tried it but it still
> >> could not work in my test environment, I will dig into and tell you
> >> why that failed later.
> > Ok, please let me know what you find !
> >
> 
> I have not yet found out why it failed.
> I tried the following, based on your code:

Ok, looks like we'll need to dig more, though the original findings
still stand, which means we might be chasing two different bugs :-)

I haven't had time to try to reproduce today and may not this week,
so I'll have to let you toy around with it until I get a chance to
try to track it down myself unless somebody else gets into it... Kumar ?
Anybody on FSL side feels like having a look ?
 
> How about the following one?
> The write permission fixup behaviour is triggered explicitly by
> the troublemaking callers, such as futex, as you suggested.
> 
> This way, follow_page() mimics exactly how the MMU faults on
> atomic access to user pages, and we can handle the fault with the
> existing handle_mm_fault(), which also does the dirty/young
> tracking properly.

So you say this still doesn't fix your problem right ?

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9670f71..8a76694 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *, 
> unsigned long address,
>   #define FOLL_MLOCK    0x40    /* mark page as mlocked */
>   #define FOLL_SPLIT    0x80    /* don't return transhuge pages, split 
> them */
>   #define FOLL_HWPOISON    0x100    /* check page is hwpoisoned */
> +#define FOLL_FIXFAULT    0x200    /* fixup after a fault (PTE 
> dirty/young upd) */

Badly wrapped it seems :-) And totally whitespace damaged...

>   typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>               void *data);
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..820556d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
>   {
>       struct mm_struct *mm = current->mm;
>       int ret;
> +    int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;

You don't want TOUCH -and- FIXFAULT do you ? Also you don't want GET
since you aren't passing a page array or vma array anyway.

>       down_read(&mm->mmap_sem);
> -    ret = get_user_pages(current, mm, (unsigned long)uaddr,
> -                 1, 1, 0, NULL, NULL);
> +    ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
> +                   flags, NULL, NULL, NULL);
>       up_read(&mm->mmap_sem);
> 
>       return ret < 0 ? ret : 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 9b8a01d..5682501 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct 
> *vma, unsigned long address,
>       spinlock_t *ptl;
>       struct page *page;
>       struct mm_struct *mm = vma->vm_mm;
> +    int fix_write_permission = 0;

Don't do that.

>       page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
>       if (!IS_ERR(page)) {
> @@ -1519,6 +1520,9 @@ split_fallthrough:
>           if ((flags & FOLL_WRITE) &&
>               !pte_dirty(pte) && !PageDirty(page))
>               set_page_dirty(page);
> +
> +        if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
> +            fix_write_permission = 1;

No, you missed my point completely. If FOLL_FIXFAULT is set, you don't
even need to call follow_page() to begin with... you -always- want to
force a call to handle_mm_fault (and only one, no loop), regardless
of whether the PTE is dirty or not, since you need to also address
the lack of a young bit.

(That might explain why your patch doesn't work if your problem is
caused by a missing young bit).

What about the patch in my next email...

Ben.

^ permalink raw reply

* [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Benjamin Herrenschmidt @ 2011-07-19  4:29 UTC (permalink / raw)
  To: Shan Hai
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <4E24FA51.70602@gmail.com>

The futex code currently attempts to write to user memory within
a pagefault disabled section, and if that fails, tries to fix it
up using get_user_pages().

This doesn't work on archs where the dirty and young bits are
maintained by software, since they will gate access permission
in the TLB, and will not be updated by gup().

In addition, there's an expectation on some archs that a
spurious write fault triggers a local TLB flush, and that is
missing from the picture as well.

I decided that adding those "features" to gup() would be too much
for this already too complex function, and instead added a new
simpler fixup_user_fault() which is essentially a wrapper around
handle_mm_fault() which the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Shan, can you test this ? It might not fix the problem since I'm
starting to have the nasty feeling that you are hitting what is
somewhat a subtly different issue or my previous patch should
have worked (but then I might have done a stupid mistake as well)
but let us know anyway.

Cheers,
Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long address, unsigned int fault_flags);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @address:	user address
+ * @fault_flags:flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while get_user_pages() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software. On such architectures, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mmap_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long address, unsigned int fault_flags)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	vma = find_extend_vma(mm, address);
+	if (!vma || address < vma->vm_start)
+		return -EFAULT;
+	
+	ret = handle_mm_fault(mm, vma, address, fault_flags);
+	if (ret & VM_FAULT_ERROR) {
+		if (ret & VM_FAULT_OOM)
+			return -ENOMEM;
+		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+			return -EHWPOISON;
+		if (ret & VM_FAULT_SIGBUS)
+			return -EFAULT;
+		BUG();
+	}
+	if (tsk) {
+		if (ret & VM_FAULT_MAJOR)
+			tsk->maj_flt++;
+		else
+			tsk->min_flt++;
+	}
+	return 0;
+}
+
+/*
  * get_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.

^ permalink raw reply related

* Re: [PATCH 1/5] fs/hugetlbfs/inode.c: Fix pgoff alignment checking on 32-bit
From: Benjamin Herrenschmidt @ 2011-07-19  4:43 UTC (permalink / raw)
  To: linux-mm@kvack.org; +Cc: linux-kernel, Andrew Morton, linuxppc-dev, david
In-Reply-To: <13092909493748-git-send-email-beckyb@kernel.crashing.org>

Andrew, Anybody ? Can I have an -mm ack for this ?

Cheers,
Ben.

On Tue, 2011-06-28 at 14:54 -0500, Becky Bruce wrote:
> From: Becky Bruce <beckyb@kernel.crashing.org>
> 
> This:
> 
> vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT)
> 
> is incorrect on 32-bit.  It causes us to & the pgoff with
> something that looks like this (for a 4m hugepage): 0xfff003ff.
> The mask should be flipped and *then* shifted, to give you
> 0x0000_03ff.
> 
> Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
> ---
>  fs/hugetlbfs/inode.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 7aafeb8..537a209 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -94,7 +94,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
>  	vma->vm_ops = &hugetlb_vm_ops;
>  
> -	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
> +	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
>  		return -EINVAL;
>  
>  	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
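
The arithmetic in the commit message can be checked with 32-bit unsigned
values in user space. This is a sketch under stated assumptions: 4 KiB base
pages, a 4 MiB huge page, and hypothetical sk_* helpers standing in for
huge_page_mask(h) and the two forms of the alignment check.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT  12u	/* 4 KiB base pages (assumption) */
#define SK_HPAGE_SHIFT 22u	/* 4 MiB huge pages, as in the example */

/* 32-bit stand-in for huge_page_mask(h). */
static uint32_t sk_huge_page_mask(void)
{
	assert(SK_HPAGE_SHIFT > SK_PAGE_SHIFT);
	return (uint32_t)~((UINT32_C(1) << SK_HPAGE_SHIFT) - 1); /* 0xffc00000 */
}

/* Old form: shift first, then flip. The high bits become ones. */
static uint32_t sk_old_check_mask(void)
{
	return (uint32_t)~(sk_huge_page_mask() >> SK_PAGE_SHIFT);
}

/* New form: flip first, then shift. Only the low offset bits remain. */
static uint32_t sk_new_check_mask(void)
{
	return (uint32_t)~sk_huge_page_mask() >> SK_PAGE_SHIFT;
}

/* Nonzero when pgoff (counted in base pages) fails the alignment check. */
static int sk_pgoff_rejected(uint32_t pgoff, uint32_t mask)
{
	return (pgoff & mask) != 0;
}
```

With these values the old mask is 0xfff003ff and the new one is 0x000003ff,
so a huge-page-aligned pgoff such as 0x100000 is wrongly rejected by the old
check and accepted by the new one.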

^ permalink raw reply

* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Anton Blanchard @ 2011-07-19  4:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <1311024956.2309.22.camel@laptop>

On Mon, 18 Jul 2011 23:35:56 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Anton, could you test the below two patches on that machine?
> 
> It should make things boot again, while I don't have a machine nearly
> big enough to trigger any of this, I tested the new code paths by
> setting FORCE_SD_OVERLAP in /debug/sched_features. Although any review
> of the error paths would be much appreciated.

I get an oops in slub code:

NIP [c000000000197d30] .deactivate_slab+0x1b0/0x200
LR [c000000000199d94] .__slab_alloc+0xb4/0x5a0
[c000000000199d94] .__slab_alloc+0xb4/0x5a0
[c00000000019ac98] .kmem_cache_alloc_node_trace+0xa8/0x260
[c00000000007eb70] .build_sched_domains+0xa60/0xb90
[c000000000a16a98] .sched_init_smp+0xa8/0x228
[c000000000a00274] .kernel_init+0x10c/0x1fc
[c00000000002324c] .kernel_thread+0x54/0x70

I'm guessing it's a result of some nodes not having any local memory,
but I'm a bit surprised I'm not seeing it elsewhere.

Investigating.

> Also, could you send me the node_distance table for that machine? I'm
> curious what the interconnects look like on that thing.

Our node distances are a bit arbitrary (I make them up based on
information given to us in the device tree). In terms of memory we have
a maximum of three levels. To give some gross estimates, on-chip memory
might be 30GB/sec, on-node memory 10-15GB/sec and off-node memory
5GB/sec.

The only thing we tweak with node distances is to make sure we go into
node reclaim before going off node:

/*
 * Before going off node we want the VM to try and reclaim from the local
 * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
 * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
 * 20, we never reclaim and go off node straight away.
 *
 * To fix this we choose a smaller value of RECLAIM_DISTANCE.
 */
#define RECLAIM_DISTANCE 10
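
The effect of the threshold described in that comment can be modelled in a
few lines. This is only a sketch of the comparison the comment describes
(reclaim locally when the remote distance is strictly larger than the
threshold); the sk_* names are hypothetical and this is not the VM code
itself.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_LOCAL_DISTANCE  10	/* distance of a node to itself */
#define SK_REMOTE_DISTANCE 20	/* default remote distance from the comment */

/*
 * Returns true when the VM would try local reclaim before going off node,
 * using the strict "larger than" comparison described in the comment.
 */
static bool sk_reclaim_before_off_node(int distance, int reclaim_distance)
{
	assert(distance >= SK_LOCAL_DISTANCE);
	return distance > reclaim_distance;
}
```

With the default threshold of 20, a remote distance of 20 gives false, so
allocation goes off node straight away. With the threshold lowered to 10,
the same distance gives true.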

Anton

node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 
  0:  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  1:  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  2:  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  3:  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  4:  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  5:  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  6:  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  7:  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  8:  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
  9:  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 10:  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 11:  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 12:  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 13:  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 14:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 15:  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40  40  40  40  40   0   0   0   0 
 16:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40  40  40  40  40   0   0   0   0 
 17:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40  40  40  40  40   0   0   0   0 
 18:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40  40  40  40  40   0   0   0   0 
 19:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40  40  40  40  40   0   0   0   0 
 20:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20  40  40  40  40   0   0   0   0 
 21:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20  40  40  40  40   0   0   0   0 
 22:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20  40  40  40  40   0   0   0   0 
 23:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10  40  40  40  40   0   0   0   0 
 24:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 25:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 26:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 27:   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
 28:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  10  20  20  20   0   0   0   0 
 29:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  10  20  20   0   0   0   0 
 30:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  10  20   0   0   0   0 
 31:  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  40  20  20  20  10   0   0   0   0 

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Shan Hai @ 2011-07-19  4:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1311049762.25044.392.camel@pasglop>

On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
>
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
>
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.
>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>
> Shan, can you test this ? It might not fix the problem since I'm
> starting to have the nasty feeling that you are hitting what is
> somewhat a subtly different issue or my previous patch should
> have worked (but then I might have done a stupid mistake as well)
> but let us know anyway.
>

OK, I will test the patch. I think it should work because it is
similar to my first posted patch. The difference is that I tried to
do the fixup in futex_atomic_cmpxchg_inatomic(), in the ppc-specific
path, at a lower level than yours in fault_in_user_writeable() :-)

Anyway, I will let you know the test result.

Thanks
Shan Hai

> Cheers,
> Ben.
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9670f71..1036614 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   			struct page **pages);
>   struct page *get_dump_page(unsigned long addr);
> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long address, unsigned int fault_flags);
>
>   extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>   extern void do_invalidatepage(struct page *page, unsigned long offset);
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..7a0a4ed 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
>   	int ret;
>
>   	down_read(&mm->mmap_sem);
> -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
> -			     1, 1, 0, NULL, NULL);
> +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
> +			       FAULT_FLAG_WRITE);
>   	up_read(&mm->mmap_sem);
>
>   	return ret < 0 ? ret : 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 40b7531..b967fb0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,7 +1815,64 @@ next_page:
>   }
>   EXPORT_SYMBOL(__get_user_pages);
>
> -/**
> +/*
> + * fixup_user_fault() - manually resolve a user page  fault
> + * @tsk:	the task_struct to use for page fault accounting, or
> + *		NULL if faults are not to be recorded.
> + * @mm:		mm_struct of target mm
> + * @address:	user address
> + * @fault_flags:flags to pass down to handle_mm_fault()
> + *
> + * This is meant to be called in the specific scenario where for
> + * locking reasons we try to access user memory in atomic context
> + * (within a pagefault_disable() section), this returns -EFAULT,
> + * and we want to resolve the user fault before trying again.
> + *
> + * Typically this is meant to be used by the futex code.
> + *
> + * The main difference with get_user_pages() is that this function
> + * will unconditionally call handle_mm_fault() which will in turn
> + * perform all the necessary SW fixup of the dirty and young bits
> + * in the PTE, while get_user_pages() only guarantees to update
> + * these in the struct page.
> + *
> + * This is important for some architectures where those bits also
> + * gate the access permission to the page because they are
> + * maintained in software. On such architectures, gup() will not
> + * be enough to make a subsequent access succeed.
> + *
> + * This should be called with the mmap_sem held for read.
> + */
> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +		     unsigned long address, unsigned int fault_flags)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	vma = find_extend_vma(mm, address);
> +	if (!vma || address < vma->vm_start)
> +		return -EFAULT;
> +
> +	ret = handle_mm_fault(mm, vma, address, fault_flags);
> +	if (ret & VM_FAULT_ERROR) {
> +		if (ret & VM_FAULT_OOM)
> +			return -ENOMEM;
> +		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> +			return -EHWPOISON;
> +		if (ret & VM_FAULT_SIGBUS)
> +			return -EFAULT;
> +		BUG();
> +	}
> +	if (tsk) {
> +		if (ret & VM_FAULT_MAJOR)
> +			tsk->maj_flt++;
> +		else
> +			tsk->min_flt++;
> +	}
> +	return 0;
> +}
> +
> +/*
>    * get_user_pages() - pin user pages in memory
>    * @tsk:	the task_struct to use for page fault accounting, or
>    *		NULL if faults are not to be recorded.
>
>

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Shan Hai @ 2011-07-19  5:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1311049762.25044.392.camel@pasglop>

On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
>
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
>
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.
>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>
> Shan, can you test this ? It might not fix the problem since I'm
> starting to have the nasty feeling that you are hitting what is
> somewhat a subtly different issue or my previous patch should
> have worked (but then I might have done a stupid mistake as well)
> but let us know anyway.
>

The patch works, but I have a couple of questions:
- Do we want to call handle_mm_fault() on each futex_lock_pi()
     even though in most cases no write permission fixup is
     needed?
- How about letting the archs do their own write permission
     fixup, as I did in my original
     "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
     (I will fix the stupid errors in my original patch if the
     concept is acceptable.)
     That way we could avoid the overhead of handle_mm_fault()
     in the path which does not need a write permission fixup.

Thanks
Shan Hai

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Benjamin Herrenschmidt @ 2011-07-19  5:24 UTC (permalink / raw)
  To: Shan Hai
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <4E251365.9090004@gmail.com>

On Tue, 2011-07-19 at 13:17 +0800, Shan Hai wrote:

> The patch works, but I have certain confusions,
> - Do we want to handle_mm_fault on each futex_lock_pi
>      even though in most cases there is no write permission
>      fixup's needed?

Don't we only ever call this when futex_atomic_op_inuser() failed ?
Which means a fixup -is- needed .... The fast path is still there.

> - How about let the archs do their own write permission
>      fixup as what I did in my original

Why ? This is generic and will fix all archs at once with generic code
which is a significant improvement in my book and a lot more
maintainable :-)

>      "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
>      (I will fix the stupid errors in my original patch if the concept
>      is acceptable)
>      in this way we could decrease the overhead of handle_mm_fault
>      in the path which does not need write permission fixup.

Which overhead ? gup does handle_mm_fault() as well if needed.

What I do is I replace what is arguably an abuse of gup() in the case
where a fixup -is- needed with a dedicated function designed to perform
the said fixup ... and do it properly which gup() didn't :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Shan Hai @ 2011-07-19  5:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Peter Zijlstra, Peter Zijlstra, linux-kernel, cmetcalf,
	dhowells, paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1311053063.25044.397.camel@pasglop>

On 07/19/2011 01:24 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2011-07-19 at 13:17 +0800, Shan Hai wrote:
>
>> The patch works, but I have certain confusions,
>> - Do we want to handle_mm_fault on each futex_lock_pi
>>       even though in most cases there is no write permission
>>       fixup's needed?
> Don't we only ever call this when futex_atomic_op_inuser() failed ?
> Which means a fixup -is- needed .... The fast path is still there.
>

What you describe is another path, namely futex_wake_op(). But what
about futex_lock_pi(), where my test case failed? Your patch will call
handle_mm_fault() on every futex contention in the futex_lock_pi()
path:

futex_lock_pi()
     ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
         case -EFAULT:
                         goto uaddr_faulted;

     ...
uaddr_faulted:
     ret = fault_in_user_writeable(uaddr);
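
The control flow under discussion can be modelled outside the kernel. The
sketch below is a toy, not kernel code: struct sk_page, sk_atomic_op() and
sk_fault_in_writeable() are hypothetical stand-ins for the PTE state, the
atomic access done with page faults disabled, and fault_in_user_writeable().
It counts how often the fixup runs compared with how often the atomic
attempt is made.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a user page; "writable" models dirty/young being set. */
struct sk_page {
	bool writable;
};

struct sk_stats {
	unsigned int attempts;	/* atomic attempts made */
	unsigned int fixups;	/* times the slow fixup path ran */
};

/* Models the access made inside the pagefault-disabled section. */
static int sk_atomic_op(const struct sk_page *pg, struct sk_stats *st)
{
	st->attempts++;
	return pg->writable ? 0 : -EFAULT;
}

/* Models fault_in_user_writeable(): resolve the fault, then report success. */
static int sk_fault_in_writeable(struct sk_page *pg, struct sk_stats *st)
{
	st->fixups++;
	pg->writable = true;
	return 0;
}

/* Try the fast path; only on -EFAULT run the fixup and retry. */
static int sk_lock_pi(struct sk_page *pg, struct sk_stats *st)
{
	assert(pg && st);
	for (;;) {
		int ret = sk_atomic_op(pg, st);

		if (ret != -EFAULT)
			return ret;
		ret = sk_fault_in_writeable(pg, st);
		if (ret)
			return ret;
	}
}
```

In this model a page that is already writable never reaches the fixup, and
a page that needs it pays for one fixup and one extra attempt. Whether real
contention in futex_lock_pi() behaves like this is exactly the question
raised above, so the model illustrates the argument rather than settling it.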


>> - How about let the archs do their own write permission
>>       fixup as what I did in my original
> Why ? This is generic and will fix all archs at once with generic code
> which is a significant improvement in my book and a lot more
> maintainable :-)
>

If the overhead in the futex_lock_pi() path is not significant then
yes, fixing it generically is nice :-)

>>       "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
>>       (I will fix the stupid errors in my original patch if the concept
>>       is acceptable)
>>       in this way we could decrease the overhead of handle_mm_fault
>>       in the path which does not need write permission fixup.
> Which overhead ? gup does handle_mm_fault() as well if needed.

It does it *if needed*, and in my opinion that is rarely the case.


Thanks
Shan Hai

^ permalink raw reply

* setbat() in udbg_init_cpm() required to avoid driver lockup
From: Daniel Ng2 @ 2011-07-19  5:39 UTC (permalink / raw)
  To: linuxppc-dev


Our USB Device Controller (UDC) driver seems to get stuck in a loop waiting
for the CPM Command Register to indicate that the CPM has finished executing
a command. (It should do this by setting the cpmcr 'Command Done' bit).

This only happens if I disable the 'Early Debug' Kernel Hacking .config
parameter. If Early Debug is enabled, then the problem goes away.

I've narrowed it down to this line in udbg_init_cpm(void):

setbat(1, 0xf0000000, 0xf0000000, 0x40000, PAGE_KERNEL_NCG);

Without this line, the driver gets stuck in the loop.

Can anyone suggest why?

Also, what undesirable effects might there be from keeping the above call
to setbat()?

System:
- MPC8272 (CPM2)
- Kernel 2.6.30.3

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH v2] net: filter: BPF 'JIT' compiler for PPC64
From: Eric Dumazet @ 2011-07-19  6:51 UTC (permalink / raw)
  To: Matt Evans; +Cc: netdev, linuxppc-dev
In-Reply-To: <4E24E867.9050909@ozlabs.org>

On Tuesday 19 July 2011 at 12:13 +1000, Matt Evans wrote:
> An implementation of a code generator for BPF programs to speed up packet
> filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
> 
> Filter code is generated as an ABI-compliant function in module_alloc()'d mem
> with stackframe & prologue/epilogue generated if required (simple filters don't
> need anything more than an li/blr).  The filter's local variables, M[], live in
> registers.  Supports all BPF opcodes, although "complicated" loads from negative
> packet offsets (e.g. SKF_LL_OFF) are not yet supported.
> 
> There are a couple of further optimisations left for future work; many-pass
> assembly with branch-reach reduction and a register allocator to push M[]
> variables into volatile registers would improve the code quality further.
> 
> This currently supports big-endian 64-bit PowerPC only (but is fairly simple
> to port to PPC32 or LE!).
> 
> Enabled in the same way as x86-64:
> 
> 	echo 1 > /proc/sys/net/core/bpf_jit_enable
> 
> Or, enabled with extra debug output:
> 
> 	echo 2 > /proc/sys/net/core/bpf_jit_enable
> 
> Signed-off-by: Matt Evans <matt@ozlabs.org>
> ---
> 
> V2: Removed some cut/paste woe in setting SEEN_X even on writes.
>     Merci for le review, Eric!
> 
>  arch/powerpc/Kconfig                  |    1 +
>  arch/powerpc/Makefile                 |    3 +-
>  arch/powerpc/include/asm/ppc-opcode.h |   40 ++
>  arch/powerpc/net/Makefile             |    4 +
>  arch/powerpc/net/bpf_jit.S            |  138 +++++++
>  arch/powerpc/net/bpf_jit.h            |  227 +++++++++++
>  arch/powerpc/net/bpf_jit_comp.c       |  690 +++++++++++++++++++++++++++++++++
>  7 files changed, 1102 insertions(+), 1 deletions(-)
> 

> +		case BPF_S_ANC_CPU:
> +#ifdef CONFIG_SMP
> +			/*
> +			 * PACA ptr is r13:
> +			 * raw_smp_processor_id() = local_paca->paca_index
> +			 */

This could break if one day Linux supports more than 65536 CPUs :)
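
To illustrate the concern, here is a small user-space sketch of what a 16-bit index field combined with a halfword load implies. struct fake_paca is a hypothetical stand-in, not the kernel's struct paca_struct; the only assumption carried over is that the index is 16 bits wide and is read with a zero-extending halfword load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the per-CPU area; only the field width matters. */
struct fake_paca {
	uint64_t other_state;
	uint16_t paca_index;	/* 16 bits: can represent CPUs 0..65535 only */
};

/*
 * Mimics a halfword load at a fixed offset from the base pointer:
 * the result is the 16-bit field, zero-extended to 32 bits.
 */
uint32_t load_cpu_halfword(const struct fake_paca *paca)
{
	uint16_t half;

	memcpy(&half,
	       (const char *)paca + offsetof(struct fake_paca, paca_index),
	       sizeof(half));
	return half;
}

/* Stores a CPU number into the 16-bit field and reads it back. */
uint32_t roundtrip_cpu(uint32_t cpu)
{
	struct fake_paca paca = { 0 };

	paca.paca_index = (uint16_t)cpu;	/* wraps modulo 65536 */
	return load_cpu_halfword(&paca);
}

/* Usage example: the last representable CPU and the first one that wraps. */
void demo_cpu_index(void)
{
	assert(roundtrip_cpu(65535) == 65535);
	assert(roundtrip_cpu(65536) == 0);
}
```

In this sketch CPU 65536 reads back as 0, so a filter comparing against the CPU number would match the wrong CPU once the index no longer fits in 16 bits.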

> +			PPC_LHZ_OFFS(r_A, 13,
> +				     offsetof(struct paca_struct, paca_index));
> +#else
> +			PPC_LI(r_A, 0);
> +#endif
> +			break;
> +
> +
> +		case BPF_S_LDX_B_MSH:
> +			/*
> +			 * x86 version drops packet (RET 0) when K<0, whereas
> +			 * interpreter does allow K<0 (__load_pointer, special
> +			 * ancillary data).
> +			 */

Hmm, thanks, I'll take a look at this.

> +			func = sk_load_byte_msh;
> +			goto common_load;
> +			break;
> +
> +			/*** Jump and branches ***/

> +		default:
> +			/* The filter contains something cruel & unusual.
> +			 * We don't handle it, but also there shouldn't be
> +			 * anything missing from our list.
> +			 */
> +			pr_err("BPF filter opcode %04x (@%d) unsupported\n",
> +			       filter[i].code, i);

You should at least ratelimit this message?

On x86_64 I chose to silently fall back to the interpreter for a "complex
filter" or an "unsupported opcode".
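
A rough user-space sketch of that silent-fallback idea, under loud assumptions: the instruction format, the opcodes and all function names below are made up for illustration and are not the kernel's BPF definitions or either JIT's real code. The point is only the control flow: if the fast path cannot handle an opcode, nothing is logged and the interpreter stays in charge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified instruction format and opcodes. */
struct insn {
	uint16_t code;
	uint32_t k;
};

enum {
	OP_LD_IMM = 1,	/* A = k */
	OP_RET_A  = 2,	/* return A */
	OP_ADD_K  = 3,	/* A += k; deliberately not handled by the fast path */
};

typedef uint32_t (*filter_fn)(const struct insn *prog, size_t len);

/* Reference interpreter: handles every opcode in this toy set. */
uint32_t run_interpreter(const struct insn *prog, size_t len)
{
	uint32_t a = 0;

	for (size_t i = 0; i < len; i++) {
		switch (prog[i].code) {
		case OP_LD_IMM: a = prog[i].k; break;
		case OP_ADD_K:  a += prog[i].k; break;
		case OP_RET_A:  return a;
		default:        return 0;	/* unknown opcode: drop */
		}
	}
	return 0;
}

/* Stand-in for generated code: only understands OP_LD_IMM and OP_RET_A. */
uint32_t run_fast_path(const struct insn *prog, size_t len)
{
	uint32_t a = 0;

	for (size_t i = 0; i < len; i++) {
		if (prog[i].code == OP_LD_IMM)
			a = prog[i].k;
		else if (prog[i].code == OP_RET_A)
			return a;
	}
	return 0;
}

/* Returns 0 if the fast path supports every opcode, -1 otherwise (no logging). */
int fast_path_supports(const struct insn *prog, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (prog[i].code != OP_LD_IMM && prog[i].code != OP_RET_A)
			return -1;
	}
	return 0;
}

/* Picks the fast path when possible, otherwise quietly keeps the interpreter. */
filter_fn select_runner(const struct insn *prog, size_t len)
{
	return fast_path_supports(prog, len) == 0 ? run_fast_path
						  : run_interpreter;
}

/* Usage example: a supported program and one that forces the fallback. */
void demo_fallback(void)
{
	const struct insn simple[] = { { OP_LD_IMM, 7 }, { OP_RET_A, 0 } };
	const struct insn exotic[] = {
		{ OP_LD_IMM, 7 }, { OP_ADD_K, 1 }, { OP_RET_A, 0 },
	};

	assert(select_runner(simple, 2) == run_fast_path);
	assert(select_runner(exotic, 3) == run_interpreter);
}
```

Either way the filter still runs and gives the same answer; the only difference a user sees is speed, which is why a log message per unsupported filter may be more noise than help.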

> +			return -ENOTSUPP;
> +		}
> +
> +	}
> +	/* Set end-of-body-code address for exit. */
> +	addrs[i] = ctx->idx * 4;
> +
> +	return 0;
> +}
> +
