LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH] powerpc: Fix "attempt to move .org backwards" error
From: Benjamin Herrenschmidt @ 2013-12-09 23:26 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Mahesh J Salgaonkar, linux-next, Paul Mackerras, Linux Kernel,
	linuxppc-dev
In-Reply-To: <20131210101031.82b02468d32ddb89481b24b9@canb.auug.org.au>

On Tue, 2013-12-10 at 10:10 +1100, Stephen Rothwell wrote:
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
> 
> Works for me.  Thanks.  I will add this to linux-next today if Ben
> doesn't add it to his tree.

I will, but probably not soon enough for your cut today.

Cheers,
Ben.


* Re: [PATCH] powerpc: Fix "attempt to move .org backwards" error
From: Stephen Rothwell @ 2013-12-09 23:10 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: linuxppc-dev, linux-next, Paul Mackerras, Linux Kernel
In-Reply-To: <20131209191015.22974.31604.stgit@mars>


Hi,

On Tue, 10 Dec 2013 00:40:15 +0530 Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
>
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> With the recent machine check patch series changes, the exception vectors
> starting from 0x4300 now overflow with allyesconfig. Fix that by
> moving the machine_check_common and machine_check_handle_early code out of
> that region to make enough room for the exception vector area.
> 
> Fixes this build error reported by Stephen:
> 
> arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
> arch/powerpc/kernel/exceptions-64s.S:958: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:959: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:983: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:984: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1003: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1013: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1014: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1015: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1016: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1017: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1018: Error: attempt to move .org backwards
> 
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>

Works for me.  Thanks.  I will add this to linux-next today if Ben
doesn't add it to his tree.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au



* Re: questions: second of the 2 pcie controllers does not scan the bus.
From: Scott Wood @ 2013-12-09 22:50 UTC (permalink / raw)
  To: Ruchika; +Cc: linuxppc-dev
In-Reply-To: <52A2706A.7030205@servergy.com>

On Fri, 2013-12-06 at 18:48 -0600, Ruchika wrote:
> Hi,
> I am working with a p4080-based board. I am trying to get 2 PCIE
> controllers probed properly.
> 
> In uboot I have no problems scanning and discovering what is connected 
> to both controllers/PCI bridges.
> 
> For both PCIE1/2 uboot sets up the Primary, secondary and Subordinate 
> bus numbers to 0,1,1 respectively.
> 
> When Linux boots and probes the controllers, PCIE1 is probed and the
> bridge is scanned properly, but PCIE2 is probed at the bridge and no
> scan is attempted.
> I see this message:
> "pci 0001:02:00.0: bridge configuration invalid ([bus 01-01]), reconfiguring"
> 
> I updated uboot to set the secondary and subordinate numbers to 2 (leaving
> the primary number at 0) and a subsequent kernel boot scanned the bus
> for PCIE2 successfully.
> I found these numbers to be very critical since the device tree blob 
> (bus-range) for pci is also based off these.
> 
> I'd like to get a good fix rather than the uboot hack and get better 
> understanding of the problem. If there are any pointers someone could 
> provide it would be awesome.

This is the code that prints that:

        /* Check if setup is sensible at all */
        if (!pass &&
            (primary != bus->number || secondary <= bus->number ||
             secondary > subordinate)) {
                dev_info(&dev->dev, "bridge configuration invalid ([bus %02x-%0
                         secondary, subordinate);
                broken = 1;
        }

Start by printing out more information to determine which of those
checks is failing (e.g. what is primary and bus->number).  If it turns
out that U-Boot is configuring the PCI bus incorrectly, send e-mail to
the U-Boot list.  Be sure to mention what version of Linux and U-Boot
you're using.
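
One way to follow that suggestion is to evaluate each sub-condition of the check on its own. The sketch below is a standalone userspace model, not kernel code: the names bridge_config_failures, report_bridge_config and the FAIL_* flags are made up for illustration, and the model ignores the `pass` argument of the real check.

```c
#include <stdio.h>

/* One flag per sub-condition of the "Check if setup is sensible" test. */
#define FAIL_PRIMARY      0x1u  /* primary != bus->number   */
#define FAIL_SECONDARY    0x2u  /* secondary <= bus->number */
#define FAIL_SUBORDINATE  0x4u  /* secondary > subordinate  */

/*
 * Hypothetical helper: returns a bitmask of the sub-conditions that
 * would mark the bridge configuration as invalid, or 0 if all pass.
 */
unsigned int bridge_config_failures(unsigned int bus_number,
                                    unsigned int primary,
                                    unsigned int secondary,
                                    unsigned int subordinate)
{
        unsigned int failed = 0;

        if (primary != bus_number)
                failed |= FAIL_PRIMARY;
        if (secondary <= bus_number)
                failed |= FAIL_SECONDARY;
        if (secondary > subordinate)
                failed |= FAIL_SUBORDINATE;
        return failed;
}

/* Print one line per failing sub-condition. */
void report_bridge_config(unsigned int bus_number, unsigned int primary,
                          unsigned int secondary, unsigned int subordinate)
{
        unsigned int failed = bridge_config_failures(bus_number, primary,
                                                     secondary, subordinate);

        if (failed & FAIL_PRIMARY)
                printf("primary %u != parent bus %u\n", primary, bus_number);
        if (failed & FAIL_SECONDARY)
                printf("secondary %u <= parent bus %u\n", secondary, bus_number);
        if (failed & FAIL_SUBORDINATE)
                printf("secondary %u > subordinate %u\n", secondary, subordinate);
        if (!failed)
                printf("bridge configuration looks sane\n");
}
```

As a purely hypothetical input, a parent bus numbered 1 with the bridge programmed to 0/1/1 would trip the first two conditions but not the third. In the kernel itself the equivalent would be extra dev_info() output next to the existing message, showing primary and bus->number.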

-Scott


* Please pull 'merge' branch of 5xxx tree
From: Anatolij Gustschin @ 2013-12-09 21:55 UTC (permalink / raw)
  To: linuxppc-dev list, Benjamin Herrenschmidt

Hi Ben !

Please pull a device tree fix for v3.13. Booting on mpc512x has been
broken since v3.13-rc1; this patch repairs it.

Thanks,
Anatolij

The following changes since commit 721cb59e9d95eb7f47ec73711ed35ef85e1ea1ca:

  powerpc/windfarm: Fix XServe G5 fan control Makefile issue (2013-11-27 11:35:47 +1100)

are available in the git repository at:

  git://git.denx.de/linux-2.6-agust.git merge

for you to fetch changes up to c65ec135960e4555f65d8c9243f65b2fb88ac071:

  powerpc/512x: dts: remove misplaced IRQ spec from 'soc' node (2013-12-07 09:43:28 +0100)

----------------------------------------------------------------
Gerhard Sittig (1):
      powerpc/512x: dts: remove misplaced IRQ spec from 'soc' node

 arch/powerpc/boot/dts/mpc5121.dtsi |    1 -
 1 file changed, 1 deletion(-)


* [PATCH] powerpc/52xx: re-enable bestcomm driver in defconfigs
From: Anatolij Gustschin @ 2013-12-09 21:15 UTC (permalink / raw)
  To: linuxppc-dev

The bestcomm driver has been moved to drivers/dma, so selecting it
by default now also requires CONFIG_DMADEVICES to be enabled. That
option is currently missing from the defconfigs, so the driver is not
built even though they contain CONFIG_PPC_BESTCOMM=y. Fix it.

Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/configs/52xx/cm5200_defconfig    |    3 ++-
 arch/powerpc/configs/52xx/lite5200b_defconfig |    3 ++-
 arch/powerpc/configs/52xx/motionpro_defconfig |    3 ++-
 arch/powerpc/configs/52xx/pcm030_defconfig    |    3 ++-
 arch/powerpc/configs/52xx/tqm5200_defconfig   |    3 ++-
 arch/powerpc/configs/mpc5200_defconfig        |    3 ++-
 6 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/configs/52xx/cm5200_defconfig b/arch/powerpc/configs/52xx/cm5200_defconfig
index 69b57da..0b88c7b 100644
--- a/arch/powerpc/configs/52xx/cm5200_defconfig
+++ b/arch/powerpc/configs/52xx/cm5200_defconfig
@@ -12,7 +12,6 @@ CONFIG_EXPERT=y
 CONFIG_PPC_MPC52xx=y
 CONFIG_PPC_MPC5200_SIMPLE=y
 # CONFIG_PPC_PMAC is not set
-CONFIG_PPC_BESTCOMM=y
 CONFIG_SPARSE_IRQ=y
 CONFIG_PM=y
 # CONFIG_PCI is not set
@@ -71,6 +70,8 @@ CONFIG_USB_DEVICEFS=y
 CONFIG_USB_OHCI_HCD=y
 CONFIG_USB_OHCI_HCD_PPC_OF_BE=y
 CONFIG_USB_STORAGE=y
+CONFIG_DMADEVICES=y
+CONFIG_PPC_BESTCOMM=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT3_FS=y
 # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
diff --git a/arch/powerpc/configs/52xx/lite5200b_defconfig b/arch/powerpc/configs/52xx/lite5200b_defconfig
index f3638ae..104a332 100644
--- a/arch/powerpc/configs/52xx/lite5200b_defconfig
+++ b/arch/powerpc/configs/52xx/lite5200b_defconfig
@@ -15,7 +15,6 @@ CONFIG_PPC_MPC52xx=y
 CONFIG_PPC_MPC5200_SIMPLE=y
 CONFIG_PPC_LITE5200=y
 # CONFIG_PPC_PMAC is not set
-CONFIG_PPC_BESTCOMM=y
 CONFIG_NO_HZ=y
 CONFIG_HIGH_RES_TIMERS=y
 CONFIG_SPARSE_IRQ=y
@@ -59,6 +58,8 @@ CONFIG_I2C_CHARDEV=y
 CONFIG_I2C_MPC=y
 # CONFIG_HWMON is not set
 CONFIG_VIDEO_OUTPUT_CONTROL=m
+CONFIG_DMADEVICES=y
+CONFIG_PPC_BESTCOMM=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT3_FS=y
 # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
diff --git a/arch/powerpc/configs/52xx/motionpro_defconfig b/arch/powerpc/configs/52xx/motionpro_defconfig
index 0c7de96..0d13ad7 100644
--- a/arch/powerpc/configs/52xx/motionpro_defconfig
+++ b/arch/powerpc/configs/52xx/motionpro_defconfig
@@ -12,7 +12,6 @@ CONFIG_EXPERT=y
 CONFIG_PPC_MPC52xx=y
 CONFIG_PPC_MPC5200_SIMPLE=y
 # CONFIG_PPC_PMAC is not set
-CONFIG_PPC_BESTCOMM=y
 CONFIG_SPARSE_IRQ=y
 CONFIG_PM=y
 # CONFIG_PCI is not set
@@ -84,6 +83,8 @@ CONFIG_LEDS_TRIGGERS=y
 CONFIG_LEDS_TRIGGER_TIMER=y
 CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_DS1307=y
+CONFIG_DMADEVICES=y
+CONFIG_PPC_BESTCOMM=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT3_FS=y
 # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
diff --git a/arch/powerpc/configs/52xx/pcm030_defconfig b/arch/powerpc/configs/52xx/pcm030_defconfig
index 22e7195..430aa18 100644
--- a/arch/powerpc/configs/52xx/pcm030_defconfig
+++ b/arch/powerpc/configs/52xx/pcm030_defconfig
@@ -21,7 +21,6 @@ CONFIG_MODULE_UNLOAD=y
 CONFIG_PPC_MPC52xx=y
 CONFIG_PPC_MPC5200_SIMPLE=y
 # CONFIG_PPC_PMAC is not set
-CONFIG_PPC_BESTCOMM=y
 CONFIG_NO_HZ=y
 CONFIG_HIGH_RES_TIMERS=y
 CONFIG_HZ_100=y
@@ -87,6 +86,8 @@ CONFIG_USB_OHCI_HCD_PPC_OF_BE=y
 CONFIG_USB_STORAGE=m
 CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_PCF8563=m
+CONFIG_DMADEVICES=y
+CONFIG_PPC_BESTCOMM=y
 CONFIG_EXT2_FS=m
 CONFIG_EXT3_FS=m
 # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
diff --git a/arch/powerpc/configs/52xx/tqm5200_defconfig b/arch/powerpc/configs/52xx/tqm5200_defconfig
index 716a37b..7af4c5b 100644
--- a/arch/powerpc/configs/52xx/tqm5200_defconfig
+++ b/arch/powerpc/configs/52xx/tqm5200_defconfig
@@ -17,7 +17,6 @@ CONFIG_PPC_MPC52xx=y
 CONFIG_PPC_MPC5200_SIMPLE=y
 CONFIG_PPC_MPC5200_BUGFIX=y
 # CONFIG_PPC_PMAC is not set
-CONFIG_PPC_BESTCOMM=y
 CONFIG_PM=y
 # CONFIG_PCI is not set
 CONFIG_NET=y
@@ -86,6 +85,8 @@ CONFIG_USB_STORAGE=y
 CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_DS1307=y
 CONFIG_RTC_DRV_DS1374=y
+CONFIG_DMADEVICES=y
+CONFIG_PPC_BESTCOMM=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT3_FS=y
 # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
diff --git a/arch/powerpc/configs/mpc5200_defconfig b/arch/powerpc/configs/mpc5200_defconfig
index 6640a35..8b682d1c 100644
--- a/arch/powerpc/configs/mpc5200_defconfig
+++ b/arch/powerpc/configs/mpc5200_defconfig
@@ -15,7 +15,6 @@ CONFIG_PPC_MEDIA5200=y
 CONFIG_PPC_MPC5200_BUGFIX=y
 CONFIG_PPC_MPC5200_LPBFIFO=m
 # CONFIG_PPC_PMAC is not set
-CONFIG_PPC_BESTCOMM=y
 CONFIG_SIMPLE_GPIO=y
 CONFIG_NO_HZ=y
 CONFIG_HIGH_RES_TIMERS=y
@@ -125,6 +124,8 @@ CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_DS1307=y
 CONFIG_RTC_DRV_DS1374=y
 CONFIG_RTC_DRV_PCF8563=m
+CONFIG_DMADEVICES=y
+CONFIG_PPC_BESTCOMM=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT3_FS=y
 # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
-- 
1.7.9.5


* [PATCH] powerpc: Fix "attempt to move .org backwards" error
From: Mahesh J Salgaonkar @ 2013-12-09 19:10 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt
  Cc: Stephen Rothwell, linux-next, Paul Mackerras, Linux Kernel

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

With the recent machine check patch series changes, the exception vectors
starting from 0x4300 now overflow with allyesconfig. Fix that by
moving the machine_check_common and machine_check_handle_early code out of
that region to make enough room for the exception vector area.

Fixes this build error reported by Stephen:

arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:958: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:959: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:983: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:984: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1003: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1013: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1014: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1015: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1016: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1017: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1018: Error: attempt to move .org backwards
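
For readers unfamiliar with this error: the assembler's .org directive can only advance the location counter, so when the code emitted ahead of a fixed-address slot grows past that slot's address, the next .org would have to move backwards and the assembler rejects it. The sketch below models that rule in plain C. The struct chunk layout, the function name and any sizes fed to it are illustrative assumptions, not values taken from exceptions-64s.S.

```c
#include <stddef.h>

/* A run of code that must start at a fixed address, like ". = 0x4300". */
struct chunk {
        unsigned long org;   /* required start address */
        unsigned long size;  /* bytes of code emitted there */
};

/*
 * Walk the chunks in source order, tracking the location counter the
 * way an assembler would. Returns the index of the first chunk whose
 * start address lies behind the counter, or -1 if the layout fits.
 */
int first_org_backwards(const struct chunk *c, size_t n)
{
        unsigned long here = 0;
        size_t i;

        for (i = 0; i < n; i++) {
                if (c[i].org < here)
                        return (int)i;
                here = c[i].org + c[i].size;
        }
        return -1;
}
```

In terms of this model, relocating handlers to a later point in the file corresponds to shrinking the size of the code that precedes the fixed slots until every org is reachable again.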

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/exceptions-64s.S |  280 +++++++++++++++++-----------------
 1 file changed, 140 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 862b9dd..b5c3313 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -768,146 +768,6 @@ kvmppc_skip_Hinterrupt:
 
 	STD_EXCEPTION_COMMON(0x100, system_reset, .system_reset_exception)
 
-	/*
-	 * Machine check is different because we use a different
-	 * save area: PACA_EXMC instead of PACA_EXGEN.
-	 */
-	.align	7
-	.globl machine_check_common
-machine_check_common:
-
-	mfspr	r10,SPRN_DAR
-	std	r10,PACA_EXGEN+EX_DAR(r13)
-	mfspr	r10,SPRN_DSISR
-	stw	r10,PACA_EXGEN+EX_DSISR(r13)
-	EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
-	FINISH_NAP
-	DISABLE_INTS
-	ld	r3,PACA_EXGEN+EX_DAR(r13)
-	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
-	std	r3,_DAR(r1)
-	std	r4,_DSISR(r1)
-	bl	.save_nvgprs
-	addi	r3,r1,STACK_FRAME_OVERHEAD
-	bl	.machine_check_exception
-	b	.ret_from_except
-
-#define MACHINE_CHECK_HANDLER_WINDUP			\
-	/* Clear MSR_RI before setting SRR0 and SRR1. */\
-	li	r0,MSR_RI;				\
-	mfmsr	r9;		/* get MSR value */	\
-	andc	r9,r9,r0;				\
-	mtmsrd	r9,1;		/* Clear MSR_RI */	\
-	/* Move original SRR0 and SRR1 into the respective regs */	\
-	ld	r9,_MSR(r1);				\
-	mtspr	SPRN_SRR1,r9;				\
-	ld	r3,_NIP(r1);				\
-	mtspr	SPRN_SRR0,r3;				\
-	ld	r9,_CTR(r1);				\
-	mtctr	r9;					\
-	ld	r9,_XER(r1);				\
-	mtxer	r9;					\
-	ld	r9,_LINK(r1);				\
-	mtlr	r9;					\
-	REST_GPR(0, r1);				\
-	REST_8GPRS(2, r1);				\
-	REST_GPR(10, r1);				\
-	ld	r11,_CCR(r1);				\
-	mtcr	r11;					\
-	/* Decrement paca->in_mce. */			\
-	lhz	r12,PACA_IN_MCE(r13);			\
-	subi	r12,r12,1;				\
-	sth	r12,PACA_IN_MCE(r13);			\
-	REST_GPR(11, r1);				\
-	REST_2GPRS(12, r1);				\
-	/* restore original r1. */			\
-	ld	r1,GPR1(r1)
-
-	/*
-	 * Handle machine check early in real mode. We come here with
-	 * ME=1, MMU (IR=0 and DR=0) off and using MC emergency stack.
-	 */
-	.align	7
-	.globl machine_check_handle_early
-machine_check_handle_early:
-BEGIN_FTR_SECTION
-	std	r0,GPR0(r1)	/* Save r0 */
-	EXCEPTION_PROLOG_COMMON_3(0x200)
-	bl	.save_nvgprs
-	addi	r3,r1,STACK_FRAME_OVERHEAD
-	bl	.machine_check_early
-	ld	r12,_MSR(r1)
-#ifdef	CONFIG_PPC_P7_NAP
-	/*
-	 * Check if thread was in power saving mode. We come here when any
-	 * of the following is true:
-	 * a. thread wasn't in power saving mode
-	 * b. thread was in power saving mode with no state loss or
-	 *    supervisor state loss
-	 *
-	 * Go back to nap again if (b) is true.
-	 */
-	rlwinm.	r11,r12,47-31,30,31	/* Was it in power saving mode? */
-	beq	4f			/* No, it wasn;t */
-	/* Thread was in power saving mode. Go back to nap again. */
-	cmpwi	r11,2
-	bne	3f
-	/* Supervisor state loss */
-	li	r0,1
-	stb	r0,PACA_NAPSTATELOST(r13)
-3:	bl	.machine_check_queue_event
-	MACHINE_CHECK_HANDLER_WINDUP
-	GET_PACA(r13)
-	ld	r1,PACAR1(r13)
-	b	.power7_enter_nap_mode
-4:
-#endif
-	/*
-	 * Check if we are coming from hypervisor userspace. If yes then we
-	 * continue in host kernel in V mode to deliver the MC event.
-	 */
-	rldicl.	r11,r12,4,63		/* See if MC hit while in HV mode. */
-	beq	5f
-	andi.	r11,r12,MSR_PR		/* See if coming from user. */
-	bne	9f			/* continue in V mode if we are. */
-
-5:
-#ifdef CONFIG_KVM_BOOK3S_64_HV
-	/*
-	 * We are coming from kernel context. Check if we are coming from
-	 * guest. if yes, then we can continue. We will fall through
-	 * do_kvm_200->kvmppc_interrupt to deliver the MC event to guest.
-	 */
-	lbz	r11,HSTATE_IN_GUEST(r13)
-	cmpwi	r11,0			/* Check if coming from guest */
-	bne	9f			/* continue if we are. */
-#endif
-	/*
-	 * At this point we are not sure about what context we come from.
-	 * Queue up the MCE event and return from the interrupt.
-	 * But before that, check if this is an un-recoverable exception.
-	 * If yes, then stay on emergency stack and panic.
-	 */
-	andi.	r11,r12,MSR_RI
-	bne	2f
-1:	addi	r3,r1,STACK_FRAME_OVERHEAD
-	bl	.unrecoverable_exception
-	b	1b
-2:
-	/*
-	 * Return from MC interrupt.
-	 * Queue up the MCE event so that we can log it later, while
-	 * returning from kernel or opal call.
-	 */
-	bl	.machine_check_queue_event
-	MACHINE_CHECK_HANDLER_WINDUP
-	rfid
-9:
-	/* Deliver the machine check to host kernel in V mode. */
-	MACHINE_CHECK_HANDLER_WINDUP
-	b	machine_check_pSeries
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
-
 	STD_EXCEPTION_COMMON_ASYNC(0x500, hardware_interrupt, do_IRQ)
 	STD_EXCEPTION_COMMON_ASYNC(0x900, decrementer, .timer_interrupt)
 	STD_EXCEPTION_COMMON(0x980, hdecrementer, .hdec_interrupt)
@@ -1458,6 +1318,146 @@ _GLOBAL(opal_mc_secondary_handler)
 	b	machine_check_pSeries
 #endif /* CONFIG_PPC_POWERNV */
 
+	/*
+	 * Machine check is different because we use a different
+	 * save area: PACA_EXMC instead of PACA_EXGEN.
+	 */
+	.align	7
+	.globl machine_check_common
+machine_check_common:
+
+	mfspr	r10,SPRN_DAR
+	std	r10,PACA_EXGEN+EX_DAR(r13)
+	mfspr	r10,SPRN_DSISR
+	stw	r10,PACA_EXGEN+EX_DSISR(r13)
+	EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
+	FINISH_NAP
+	DISABLE_INTS
+	ld	r3,PACA_EXGEN+EX_DAR(r13)
+	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
+	std	r3,_DAR(r1)
+	std	r4,_DSISR(r1)
+	bl	.save_nvgprs
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.machine_check_exception
+	b	.ret_from_except
+
+#define MACHINE_CHECK_HANDLER_WINDUP			\
+	/* Clear MSR_RI before setting SRR0 and SRR1. */\
+	li	r0,MSR_RI;				\
+	mfmsr	r9;		/* get MSR value */	\
+	andc	r9,r9,r0;				\
+	mtmsrd	r9,1;		/* Clear MSR_RI */	\
+	/* Move original SRR0 and SRR1 into the respective regs */	\
+	ld	r9,_MSR(r1);				\
+	mtspr	SPRN_SRR1,r9;				\
+	ld	r3,_NIP(r1);				\
+	mtspr	SPRN_SRR0,r3;				\
+	ld	r9,_CTR(r1);				\
+	mtctr	r9;					\
+	ld	r9,_XER(r1);				\
+	mtxer	r9;					\
+	ld	r9,_LINK(r1);				\
+	mtlr	r9;					\
+	REST_GPR(0, r1);				\
+	REST_8GPRS(2, r1);				\
+	REST_GPR(10, r1);				\
+	ld	r11,_CCR(r1);				\
+	mtcr	r11;					\
+	/* Decrement paca->in_mce. */			\
+	lhz	r12,PACA_IN_MCE(r13);			\
+	subi	r12,r12,1;				\
+	sth	r12,PACA_IN_MCE(r13);			\
+	REST_GPR(11, r1);				\
+	REST_2GPRS(12, r1);				\
+	/* restore original r1. */			\
+	ld	r1,GPR1(r1)
+
+	/*
+	 * Handle machine check early in real mode. We come here with
+	 * ME=1, MMU (IR=0 and DR=0) off and using MC emergency stack.
+	 */
+	.align	7
+	.globl machine_check_handle_early
+machine_check_handle_early:
+BEGIN_FTR_SECTION
+	std	r0,GPR0(r1)	/* Save r0 */
+	EXCEPTION_PROLOG_COMMON_3(0x200)
+	bl	.save_nvgprs
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.machine_check_early
+	ld	r12,_MSR(r1)
+#ifdef	CONFIG_PPC_P7_NAP
+	/*
+	 * Check if thread was in power saving mode. We come here when any
+	 * of the following is true:
+	 * a. thread wasn't in power saving mode
+	 * b. thread was in power saving mode with no state loss or
+	 *    supervisor state loss
+	 *
+	 * Go back to nap again if (b) is true.
+	 */
+	rlwinm.	r11,r12,47-31,30,31	/* Was it in power saving mode? */
+	beq	4f			/* No, it wasn;t */
+	/* Thread was in power saving mode. Go back to nap again. */
+	cmpwi	r11,2
+	bne	3f
+	/* Supervisor state loss */
+	li	r0,1
+	stb	r0,PACA_NAPSTATELOST(r13)
+3:	bl	.machine_check_queue_event
+	MACHINE_CHECK_HANDLER_WINDUP
+	GET_PACA(r13)
+	ld	r1,PACAR1(r13)
+	b	.power7_enter_nap_mode
+4:
+#endif
+	/*
+	 * Check if we are coming from hypervisor userspace. If yes then we
+	 * continue in host kernel in V mode to deliver the MC event.
+	 */
+	rldicl.	r11,r12,4,63		/* See if MC hit while in HV mode. */
+	beq	5f
+	andi.	r11,r12,MSR_PR		/* See if coming from user. */
+	bne	9f			/* continue in V mode if we are. */
+
+5:
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+	/*
+	 * We are coming from kernel context. Check if we are coming from
+	 * guest. if yes, then we can continue. We will fall through
+	 * do_kvm_200->kvmppc_interrupt to deliver the MC event to guest.
+	 */
+	lbz	r11,HSTATE_IN_GUEST(r13)
+	cmpwi	r11,0			/* Check if coming from guest */
+	bne	9f			/* continue if we are. */
+#endif
+	/*
+	 * At this point we are not sure about what context we come from.
+	 * Queue up the MCE event and return from the interrupt.
+	 * But before that, check if this is an un-recoverable exception.
+	 * If yes, then stay on emergency stack and panic.
+	 */
+	andi.	r11,r12,MSR_RI
+	bne	2f
+1:	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.unrecoverable_exception
+	b	1b
+2:
+	/*
+	 * Return from MC interrupt.
+	 * Queue up the MCE event so that we can log it later, while
+	 * returning from kernel or opal call.
+	 */
+	bl	.machine_check_queue_event
+	MACHINE_CHECK_HANDLER_WINDUP
+	rfid
+9:
+	/* Deliver the machine check to host kernel in V mode. */
+	MACHINE_CHECK_HANDLER_WINDUP
+	b	machine_check_pSeries
+END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
+
 
 /*
  * r13 points to the PACA, r9 contains the saved CR,


* [PATCH] powernv: fix VFIO support with PHB3
From: Thadeu Lima de Souza Cascardo @ 2013-12-09 16:41 UTC (permalink / raw)
  To: benh
  Cc: shangw, aik, linux-kernel, paulus, Thadeu Lima de Souza Cascardo,
	linuxppc-dev

I have recently found out that no iommu_groups could be found under
/sys/ on a P8. That prevents PCI passthrough from working.

During my investigation, I found out there seems to be a missing
iommu_register_group for PHB3. The following patch seems to fix the
problem. After applying it, I see iommu_groups under
/sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter,
which gives me a device at /dev/vfio/.
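
The symptom described above can be checked from userspace by counting the entries under the iommu_groups directory. The following is a minimal sketch using only POSIX directory calls; the function name count_iommu_groups is made up for illustration, and the path passed to it is whatever the caller wants to inspect.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count the entries under an iommu_groups directory. Returns the
 * number of groups found, or -1 if the directory cannot be opened.
 */
int count_iommu_groups(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *ent;
        int count = 0;

        if (!dir)
                return -1;
        while ((ent = readdir(dir)) != NULL) {
                /* Skip the "." and ".." pseudo-entries. */
                if (strcmp(ent->d_name, ".") == 0 ||
                    strcmp(ent->d_name, "..") == 0)
                        continue;
                count++;
        }
        closedir(dir);
        return count;
}
```

Calling it with "/sys/kernel/iommu_groups" and getting 0 back would match the situation before the patch; a positive count would match the situation after it.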

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
---

This is now applied on top of benh's tree, branch next.

Alexey, is this now OK for you?

Thanks.
Cascardo.

---
 arch/powerpc/platforms/powernv/pci-ioda.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 614356c..f0e6871 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
 	}
 	iommu_init_table(tbl, phb->hose->node);
+	iommu_register_group(tbl, pci_domain_nr(pe->pbus), pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-- 
1.7.1


* Re: [PATCH] powerpc/44x: fix ocm_block allocation
From: Ilia Mirkin @ 2013-12-09 15:28 UTC (permalink / raw)
  To: Vinh Huu Tuong Nguyen; +Cc: linuxppc-dev
In-Reply-To: <CAM9eBokeHwgDnddrH5PVNK9bCT7b1_7GA98DsJaUwKPJfLCvgw@mail.gmail.com>

On Mon, Dec 9, 2013 at 3:38 AM, Vinh Huu Tuong Nguyen <vhtnguyen@apm.com> wrote:
>
> Hi Ilia Mirkin,
> Thanks for your info. I investigated why our test didn't detect it and found out that
> the struct ocm_block is only used in the ocm_debugfs_show function, when we want to
> know information about the OCM, and that is available only when debugfs is enabled. But our test
> only tried to use the OCM block functions and didn't care about the OCM information.
> So I think we should apply your patch to solve this issue instead of removing the OCM part.
>

OK, perhaps there's something clever going on. However on my git tree
(updated as of a few days ago):

$ git grep ppc4xx_ocm_alloc
arch/powerpc/include/asm/ppc4xx_ocm.h:void
*ppc4xx_ocm_alloc(phys_addr_t *phys, int size, int align,
arch/powerpc/include/asm/ppc4xx_ocm.h:#define ppc4xx_ocm_alloc(phys,
size, align, flags, owner) NULL
arch/powerpc/sysdev/ppc4xx_ocm.c:void *ppc4xx_ocm_alloc(phys_addr_t
*phys, int size, int align,

So... no users. Unless there's macro-related cleverness going on (I'll
freely admit to not having read/understood the full code, so could
well be.) Perhaps it was meant to be used but the call got lost?
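
Whether or not the function has callers, the defect addressed by the patch quoted below is easy to show in isolation: sizeof applied to a pointer type yields the size of the pointer, not of the structure it points to. In this sketch struct demo_block is a stand-in with illustrative fields, not the real struct ocm_block, and the two function names are hypothetical.

```c
#include <stddef.h>

/* Stand-in for the real structure; the fields here are illustrative. */
struct demo_block {
        void *next;
        void *prev;
        void *addr;
        int size;
        const char *owner;
};

/* Size requested by the buggy call: room for one pointer only. */
size_t bytes_requested_before(void)
{
        return sizeof(struct demo_block *);
}

/* Size requested after the fix: room for the whole structure. */
size_t bytes_requested_after(void)
{
        return sizeof(struct demo_block);
}
```

The kernel coding style's preferred form, sizeof(*ptr) on the variable being assigned, avoids this class of mistake because the size always follows the pointer's declared type.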

  -ilia

>
>
>
> On Sat, Dec 7, 2013 at 7:43 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>>
>> Allocate enough memory for the ocm_block structure, not just a pointer
>> to it.
>>
>> Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
>> ---
>>
>> I have neither the hardware to test nor the toolchain to even build-test
>> this. However this seems like a fairly obvious fix (and I have to wonder how
>> this ever worked at all). Found with spatch.
>>
>> Actually further investigation makes it seem like this function is never
>> called, perhaps it should just be removed? If it is kept around though, would
>> be nice to apply this patch so that tools don't trip over this wrong code.
>>
>>  arch/powerpc/sysdev/ppc4xx_ocm.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/sysdev/ppc4xx_ocm.c b/arch/powerpc/sysdev/ppc4xx_ocm.c
>> index b7c4345..85d9e37 100644
>> --- a/arch/powerpc/sysdev/ppc4xx_ocm.c
>> +++ b/arch/powerpc/sysdev/ppc4xx_ocm.c
>> @@ -339,7 +339,7 @@ void *ppc4xx_ocm_alloc(phys_addr_t *phys, int size, int align,
>>                 if (IS_ERR_VALUE(offset))
>>                         continue;
>>
>> -               ocm_blk = kzalloc(sizeof(struct ocm_block *), GFP_KERNEL);
>> +               ocm_blk = kzalloc(sizeof(struct ocm_block), GFP_KERNEL);
>>                 if (!ocm_blk) {
>>                         printk(KERN_ERR "PPC4XX OCM: could not allocate ocm block");
>>                         rh_free(ocm_reg->rh, offset);
>> --
>> 1.8.3.2
>>
>
>
>
> --
>
> Vinh Nguyen Huu Tuong | Staff SW Engineer
>
> C: 090.335.7841 | O: 083.770.0640 ext: 3719
>
> F: 083.770.0641 | vhtnguyen@apm.com
>


* [PATCH] powerpc: Fix up the kdump base cap to 128M
From: Mahesh J Salgaonkar @ 2013-12-09 10:03 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt; +Cc: Anton Blanchard

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

The current logic sets the kdump base to the minimum of 2G and ppc64_rma_size/2.
On a PowerNV kernel the first memory block 'memory@0' can be very large,
equal to the DIMM size, with the ppc64_rma_size value capped at 1G. Hence on
PowerNV the kdump base is set to 512M, which causes kdump to fail while
allocating the paca array. This is because the paca needs its memory from the
RMA region, capped at 256M (see allocate_pacas()).

This patch lowers the kdump base cap to 128M so that the kdump kernel can
successfully get memory below 256M for paca allocation.
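
The arithmetic can be sketched as follows. The helper name kdump_base and the MIB macro are illustrative; the function simply mirrors the min(cap, ppc64_rma_size / 2) expression in reserve_crashkernel().

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Mirror of min(cap, ppc64_rma_size / 2) from reserve_crashkernel(). */
uint64_t kdump_base(uint64_t cap, uint64_t rma_size)
{
        uint64_t half = rma_size / 2;

        return cap < half ? cap : half;
}
```

With ppc64_rma_size at 1G, the old cap 0x80000000 (2G) gives a base of 512M, which is above the 256M paca limit, while the new cap 0x8000000 (128M) gives a base of 128M, which is below it.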

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/machine_kexec.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
index 88a7fb4..75d4f73 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -148,7 +148,7 @@ void __init reserve_crashkernel(void)
 		 * a small SLB (128MB) since the crash kernel needs to place
 		 * itself and some stacks to be in the first segment.
 		 */
-		crashk_res.start = min(0x80000000ULL, (ppc64_rma_size / 2));
+		crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
 		crashk_res.start = KDUMP_KERNELBASE;
 #endif


* Re: [PATCH] powerpc/44x: fix ocm_block allocation
From: Vinh Huu Tuong Nguyen @ 2013-12-09  8:38 UTC (permalink / raw)
  To: Ilia Mirkin; +Cc: linuxppc-dev
In-Reply-To: <1386377017-909-1-git-send-email-imirkin@alum.mit.edu>


Hi Ilia Mirkin,
Thanks for your info. I investigated why our test didn't detect it and
found out that the struct ocm_block is only used in the ocm_debugfs_show
function, when we want to know information about the OCM, and that is
available only when debugfs is enabled. But our test only tried to use the
OCM block functions and didn't care about the OCM information. So I think we
should apply your patch to solve this issue instead of removing the OCM part.




On Sat, Dec 7, 2013 at 7:43 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:

> Allocate enough memory for the ocm_block structure, not just a pointer
> to it.
>
> Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
> ---
>
> I have neither the hardware to test nor the toolchain to even build-test
> this. However this seems like a fairly obvious fix (and I have to wonder
> how
> this ever worked at all). Found with spatch.
>
> Actually further investigation makes it seem like this function is never
> called, perhaps it should just be removed? If it is kept around though,
> would
> be nice to apply this patch so that tools don't trip over this wrong code.
>
>  arch/powerpc/sysdev/ppc4xx_ocm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/sysdev/ppc4xx_ocm.c
> b/arch/powerpc/sysdev/ppc4xx_ocm.c
> index b7c4345..85d9e37 100644
> --- a/arch/powerpc/sysdev/ppc4xx_ocm.c
> +++ b/arch/powerpc/sysdev/ppc4xx_ocm.c
> @@ -339,7 +339,7 @@ void *ppc4xx_ocm_alloc(phys_addr_t *phys, int size,
> int align,
>                 if (IS_ERR_VALUE(offset))
>                         continue;
>
> -               ocm_blk = kzalloc(sizeof(struct ocm_block *), GFP_KERNEL);
> +               ocm_blk = kzalloc(sizeof(struct ocm_block), GFP_KERNEL);
>                 if (!ocm_blk) {
>                         printk(KERN_ERR "PPC4XX OCM: could not allocate
> ocm block");
>                         rh_free(ocm_reg->rh, offset);
> --
> 1.8.3.2
>
>


-- 

 *Vinh Nguyen Huu Tuong **|** Staff SW Engineer*

C: 090.335.7841 | O: 083.770.0640 ext: 3719

F: 083.770.0641 | vhtnguyen@apm.com



* MPC8641 BASED Custom designed Board Linux  stucks after Mounting cache hash table entries
From: Ashish @ 2013-12-09  8:44 UTC (permalink / raw)
  To: linuxppc-dev

Hi all,
   I am trying to port Linux 2.6.34 to an MPC8641D-based custom-designed
board, but I get a kernel oops right after the mount-cache hash table
entries message. Has anybody faced this kind of issue while porting, or can
anybody shed some light on it? Any pointer or direction would be very helpful.
Below is the boot log showing the problem.

bootm 1600000 600000 1400000
## Booting kernel from Legacy Image at 01600000 ...
    Image Name:   Linux-2.6.34
    Image Type:   PowerPC Linux Kernel Image (gzip compressed)
    Data Size:    2615699 Bytes = 2.5 MiB
    Load Address: 00000000
    Entry Point:  00000000
    Verifying Checksum ... OK
## Loading init Ramdisk from Legacy Image at 00600000 ...
    Image Name:   rootfs
    Image Type:   PowerPC Linux RAMDisk Image (gzip compressed)
    Data Size:    8043648 Bytes = 7.7 MiB
    Load Address: 00000000
    Entry Point:  00000000
    Verifying Checksum ... OK
## Flattened Device Tree blob at 01400000
    Booting using the fdt blob at 0x01400000
    Uncompressing Kernel Image ... OK
    Loading Ramdisk to 1f6fb000, end 1fea6c80 ... OK
    Loading Device Tree to 007fb000, end 007ff919 ... OK
Using MPC86xx HPCN machine description
Total memory = 512MB; using 1024kB for hash table (at cff00000)
Linux version 2.6.34 (ashish@ashish-virtual-machine) (gcc version 4.7.2 (GCC) ) #1 Fri Dec 6 10:38:44 IST 2013
Found initrd at 0xdf6fb000:0xdfea6c80
bootconsole [udbg0] enabled
setup_arch: bootmem
mpc86xx_hpcn_setup_arch()
MPC86xx HPCN board from Freescale Semiconductor
arch: exit
Zone PFN ranges:
   DMA      0x00000000 -> 0x00020000
   Normal   empty
   HighMem  empty

Movable zone start PFN for each node
early_node_map[1] active PFN ranges
     0: 0x00000000 -> 0x00020000
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 130048
Kernel command line: mem=512m root=/dev/ram console=ttyS0,115200
PID hash table entries: 2048 (order: 1, 8192 bytes)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 505540k/524288k available (4976k kernel code, 18748k reserved, 196k data, 160k bss, 192k init)
Kernel virtual memory layout:
   * 0xfffd0000..0xfffff000  : fixmap
   * 0xff800000..0xffc00000  : highmem PTEs
   * 0xff7fe000..0xff800000  : early ioremap
   * 0xe1000000..0xff7fe000  : vmalloc & ioremap
SLUB: Genslabs=13, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Hierarchical RCU implementation.
NR_IRQS:512 nr_irqs:512
mpic: Setting up MPIC " MPIC     " version 1.2 at f8040000, max 2 CPUs
mpic: ISU size: 256, shift: 8, mask: ff
mpic: Initializing for 256 sources
clocksource: timebase mult[2000000] shift[22] registered
Console: colour dummy device 80x25
Mount-cache hash table entries: 512
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc00179c8
Oops: Kernel access of bad area, sig: 11 [#1]
MPC86xx HPCN
last sysfs file:
Modules linked in:
NIP: c00179c8 LR: c00d69ec CTR: 00000008
REGS: c0507e40 TRAP: 0300   Not tainted  (2.6.34)
MSR: 00009032 <EE,ME,IR,DR>  CR: 84000028  XER: 00000000
DAR: 00000000, DSISR: 40000000
TASK = c04de410[0] 'swapper' THREAD: c0506000
GPR00: c00d6da0 c0507ef0 c04de410 00000064 ffffffff df01d0e0 00000000 00000000
GPR08: 00000000 0000000d 00000000 c0503194 44000022 f8afffff ffffffff 200c9000
GPR16: fbbfdffb 00000000 00000000 00000024 00000000 1fea8b08 1fea8d24 00000000
GPR24: 00000000 1fffa2e4 40000000 1ffcd66c dfffed30 00000000 00000000 df01d080
NIP [c00179c8] strcmp+0x10/0x24
LR [c00d69ec] duplicate_name+0x3c/0x74
Call Trace:
[c0507ef0] [c07fc440] 0xc07fc440 (unreliable)
[c0507f00] [c00d6da0] proc_device_tree_add_node+0xfc/0x144
[c0507f20] [c00d6ce4] proc_device_tree_add_node+0x40/0x144
[c0507f40] [c00d6ce4] proc_device_tree_add_node+0x40/0x144
[c0507f60] [c00d6ce4] proc_device_tree_add_node+0x40/0x144
[c0507f80] [c00d6ce4] proc_device_tree_add_node+0x40/0x144
[c0507fa0] [c04bdcd4] proc_device_tree_init+0x68/0x94
[c0507fb0] [c04bd6f8] proc_root_init+0xd0/0x108
[c0507fc0] [c04ac728] start_kernel+0x2b4/0x2cc
[c0507ff0] [00003444] 0x3444
Instruction dump:
2c000000 4082fff8 38a5ffff 8c040001 2c000000 9c050001 4082fff4 4e800020
38a3ffff 3884ffff 8c650001 2c830000 <8c040001> 7c601851 4d860020 4182ffec
---[ end trace 31fd0ba7d8756001 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Rebooting in 180 seconds..
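
For reference, the faulting word marked with <...> in the instruction dump
can be decoded by hand. The sketch below is a hypothetical userspace helper
(decode_dform and effective_address are illustrative names, not kernel
functions). It assumes the usual PowerPC D-form layout: a 6-bit primary
opcode, two 5-bit register fields and a 16-bit signed displacement, with
primary opcode 35 being lbzu and 14 being addi.

```c
#include <stdint.h>

/* Fields of a PowerPC D-form instruction (lbzu, addi, ...). */
struct dform {
	unsigned int opcode;	/* primary opcode, top 6 bits */
	unsigned int rt;	/* target register number */
	unsigned int ra;	/* base register number */
	int32_t d;		/* sign-extended 16-bit displacement */
};

static struct dform decode_dform(uint32_t insn)
{
	struct dform f;

	f.opcode = insn >> 26;
	f.rt = (insn >> 21) & 0x1f;
	f.ra = (insn >> 16) & 0x1f;
	f.d = (int32_t)(insn & 0xffff);
	if (f.d & 0x8000)	/* sign-extend the displacement */
		f.d -= 0x10000;
	return f;
}

/* Effective address of a D-form access with RA != 0, wrapping at 32 bits. */
static uint32_t effective_address(uint32_t ra_value, int32_t d)
{
	return ra_value + (uint32_t)d;
}
```

Under those assumptions 0x8c040001 decodes as lbzu r0,1(r4), and with GPR04
= 0xffffffff from the register dump the effective address wraps to 0, which
matches DAR. The earlier word 3884ffff decodes as addi r4,r4,-1, so this
would be consistent with strcmp() having been called with a NULL second
argument from duplicate_name().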



Regards
Ashish Kumar Khetan

^ permalink raw reply

* Re: [PATCH v3] powerpc: Fix PTE page address mismatch in pgtable ctor/dtor
From: Aneesh Kumar K.V @ 2013-12-09  8:42 UTC (permalink / raw)
  To: Hong H. Pham, linux-rt-users, linuxppc-dev
  Cc: Paul Mackerras, Hong H. Pham, linux-stable
In-Reply-To: <1386425193-24015-1-git-send-email-hong.pham@windriver.com>

"Hong H. Pham" <hong.pham@windriver.com> writes:

> From: "Hong H. Pham" <hong.pham@windriver.com>
>
> In pte_alloc_one(), pgtable_page_ctor() is passed an address that has
> not been converted by page_address() to the newly allocated PTE page.
>
> When the PTE is freed, __pte_free_tlb() calls pgtable_page_dtor()
> with an address to the PTE page that has been converted by page_address().
> The mismatch in the PTE's page address causes pgtable_page_dtor() to access
> invalid memory, so resources for that PTE (such as the page lock) are not
> properly cleaned up.
>
> On PPC32, only SMP kernels are affected.
>
> On PPC64, only SMP kernels with 4K page size are affected.
>
> This bug was introduced by commit d614bb041209fd7cb5e4b35e11a7b2f6ee8f62b8
> "powerpc: Move the pte free routines from common header".
>
> On a preempt-rt kernel, a spinlock is dynamically allocated for each
> PTE in pgtable_page_ctor().  When the PTE is freed, calling
> pgtable_page_dtor() with a mismatched page address causes a memory leak,
> as the pointer to the PTE's spinlock is bogus.
>
> On mainline, there aren't any immediately obvious symptoms, but the
> problem still exists here.
>
> Fixes: d614bb041209fd7c "powerpc: Move the pte free routines from common header"
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: linux-stable <stable@vger.kernel.org> # v3.10+
> Signed-off-by: Hong H. Pham <hong.pham@windriver.com>


Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
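
To illustrate the pairing that the fix restores, here is a minimal userspace
model, not kernel code. All names (model_page, model_ctor, model_dtor,
model_page_address, model_pte_alloc, model_pte_free) are hypothetical
stand-ins. The only point is that the constructor and the destructor have to
be handed the same page descriptor, while the routine that releases the
backing memory takes the address derived from that descriptor.

```c
#include <stdlib.h>

/* Stand-in for struct page: a descriptor distinct from the memory it describes. */
struct model_page {
	void *mem;	/* backing memory of the PTE page */
	int *ptl;	/* per-page lock, allocated by the constructor */
};

static int live_locks;	/* locks allocated and not yet freed */

/* Stand-in for page_address(): descriptor -> address of the memory. */
static void *model_page_address(struct model_page *page)
{
	return page->mem;
}

/* Stand-in for pgtable_page_ctor(): operates on the descriptor. */
static int model_ctor(struct model_page *page)
{
	page->ptl = malloc(sizeof(*page->ptl));
	if (!page->ptl)
		return -1;
	live_locks++;
	return 0;
}

/* Stand-in for pgtable_page_dtor(): must see the same descriptor. */
static void model_dtor(struct model_page *page)
{
	free(page->ptl);
	page->ptl = NULL;
	live_locks--;
}

/* Allocation path: the constructor receives the descriptor. */
static struct model_page *model_pte_alloc(void)
{
	struct model_page *page = calloc(1, sizeof(*page));

	if (!page)
		return NULL;
	page->mem = calloc(1, 4096);
	if (!page->mem || model_ctor(page)) {
		free(page->mem);
		free(page);
		return NULL;
	}
	return page;
}

/* Free path in the order used by the fix: dtor on the descriptor first,
 * then release the memory found through the address conversion. */
static void model_pte_free(struct model_page *table)
{
	model_dtor(table);
	free(model_page_address(table));
	free(table);
}
```

In this model, handing model_dtor() the result of model_page_address()
instead of the descriptor would make it look for the lock in the wrong
object, which is the kind of mismatch the commit message describes.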


> ---
>  arch/powerpc/include/asm/pgalloc-32.h | 6 ++----
>  arch/powerpc/include/asm/pgalloc-64.h | 6 ++----
>  2 files changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgalloc-32.h b/arch/powerpc/include/asm/pgalloc-32.h
> index 27b2386..842846c 100644
> --- a/arch/powerpc/include/asm/pgalloc-32.h
> +++ b/arch/powerpc/include/asm/pgalloc-32.h
> @@ -84,10 +84,8 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
>  				  unsigned long address)
>  {
> -	struct page *page = page_address(table);
> -
>  	tlb_flush_pgtable(tlb, address);
> -	pgtable_page_dtor(page);
> -	pgtable_free_tlb(tlb, page, 0);
> +	pgtable_page_dtor(table);
> +	pgtable_free_tlb(tlb, page_address(table), 0);
>  }
>  #endif /* _ASM_POWERPC_PGALLOC_32_H */
> diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
> index f65e27b..256d6f8 100644
> --- a/arch/powerpc/include/asm/pgalloc-64.h
> +++ b/arch/powerpc/include/asm/pgalloc-64.h
> @@ -144,11 +144,9 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
>  				  unsigned long address)
>  {
> -	struct page *page = page_address(table);
> -
>  	tlb_flush_pgtable(tlb, address);
> -	pgtable_page_dtor(page);
> -	pgtable_free_tlb(tlb, page, 0);
> +	pgtable_page_dtor(table);
> +	pgtable_free_tlb(tlb, page_address(table), 0);
>  }
>
>  #else /* if CONFIG_PPC_64K_PAGES */
> -- 
> 1.8.3.2

^ permalink raw reply

* Re: [PATCH v3] powerpc: Fix PTE page address mismatch in pgtable ctor/dtor
From: Aneesh Kumar K.V @ 2013-12-09  8:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Hong H. Pham
  Cc: Paul Mackerras, linuxppc-dev, linux-rt-users, linux-stable
In-Reply-To: <1386448079.21910.105.camel@pasglop>

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Sat, 2013-12-07 at 09:06 -0500, Hong H. Pham wrote:
>
>> diff --git a/arch/powerpc/include/asm/pgalloc-32.h b/arch/powerpc/include/asm/pgalloc-32.h
>> index 27b2386..842846c 100644
>> --- a/arch/powerpc/include/asm/pgalloc-32.h
>> +++ b/arch/powerpc/include/asm/pgalloc-32.h
>> @@ -84,10 +84,8 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
>>  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
>>  				  unsigned long address)
>>  {
>> -	struct page *page = page_address(table);
>> -
>>  	tlb_flush_pgtable(tlb, address);
>> -	pgtable_page_dtor(page);
>> -	pgtable_free_tlb(tlb, page, 0);
>> +	pgtable_page_dtor(table);
>> +	pgtable_free_tlb(tlb, page_address(table), 0);
>>  }
>
> Ok so your description of the problem confused me a bit, but I see that
> in the !64K page, pgtable_t is already a struct page so yes, the
> page_address() call here is bogus.
>
> However, I also noticed that in the 64k page case, we don't call the dtor
> at all. Is that a problem?
>
> Also, Aneesh, shouldn't we just fix the disconnect here and have
> pgtable_t always be the same type ? The way this is now is confusing
> and error prone...

With PTE page fragments that may not be possible, right? With PTE fragments,
we share the allocated page among multiple pmd entries.

Commit 5c1f6ee9a31cbdac90bbb8ae1ba4475031ac74b4 should have more details.

>
>>  #endif /* _ASM_POWERPC_PGALLOC_32_H */
>> diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
>> index f65e27b..256d6f8 100644
>> --- a/arch/powerpc/include/asm/pgalloc-64.h
>> +++ b/arch/powerpc/include/asm/pgalloc-64.h
>> @@ -144,11 +144,9 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
>>  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
>>  				  unsigned long address)
>>  {
>> -	struct page *page = page_address(table);
>> -
>>  	tlb_flush_pgtable(tlb, address);
>> -	pgtable_page_dtor(page);
>> -	pgtable_free_tlb(tlb, page, 0);
>> +	pgtable_page_dtor(table);
>> +	pgtable_free_tlb(tlb, page_address(table), 0);
>>  }
>>  
>>  #else /* if CONFIG_PPC_64K_PAGES */
>
> Ben.

-aneesh

^ permalink raw reply

* [PATCH V2 3/3] powerpc iommu: Update the generic code to use dynamic iommu page sizes
From: Alistair Popple @ 2013-12-09  7:17 UTC (permalink / raw)
  To: benh, linuxppc-dev; +Cc: Alistair Popple
In-Reply-To: <1386573423-7989-1-git-send-email-alistair@popple.id.au>

This patch updates the generic iommu backend code to use the
it_page_shift field to determine the iommu page size instead of
using hardcoded values.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
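
Not part of the patch: a small userspace sketch of the arithmetic the new
helpers perform, for checking values by hand. struct shift_table and the
shift_* functions are hypothetical stand-ins for struct iommu_table and for
IOMMU_PAGE_SIZE(), IOMMU_PAGE_MASK(), IOMMU_PAGE_ALIGN() and
get_iommu_order(). shift_ilog2() stands in for __ilog2() and is assumed to
return the index of the highest set bit, or -1 for zero.

```c
/* Only the field that matters for the page-size arithmetic. */
struct shift_table {
	unsigned long it_page_shift;	/* log2 of the IOMMU page size */
};

static unsigned long shift_page_size(const struct shift_table *tbl)
{
	return 1UL << tbl->it_page_shift;
}

static unsigned long shift_page_mask(const struct shift_table *tbl)
{
	return ~(shift_page_size(tbl) - 1);
}

/* Round addr up to the next IOMMU page boundary. */
static unsigned long shift_page_align(unsigned long addr,
				      const struct shift_table *tbl)
{
	return (addr + shift_page_size(tbl) - 1) & shift_page_mask(tbl);
}

/* Index of the highest set bit, -1 when x is zero (assumed behaviour). */
static int shift_ilog2(unsigned long x)
{
	int bit = -1;

	while (x) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/* Order, in IOMMU pages, of the smallest 2^n block that holds size bytes. */
static int shift_get_iommu_order(unsigned long size,
				 const struct shift_table *tbl)
{
	return shift_ilog2((size - 1) >> tbl->it_page_shift) + 1;
}
```

With it_page_shift = 12 an 8 KiB request has order 1 (two IOMMU pages). The
same request against a table with it_page_shift = 16 fits in a single IOMMU
page and has order 0.
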
 arch/powerpc/include/asm/iommu.h     |   19 +++++---
 arch/powerpc/kernel/dma-iommu.c      |    4 +-
 arch/powerpc/kernel/iommu.c          |   88 ++++++++++++++++++----------------
 arch/powerpc/kernel/vio.c            |   25 +++++++---
 arch/powerpc/platforms/powernv/pci.c |    2 -
 drivers/net/ethernet/ibm/ibmveth.c   |   15 +++---
 6 files changed, 88 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7c92834..f7a8036 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -35,17 +35,14 @@
 #define IOMMU_PAGE_MASK_4K       (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
 #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
 
+#define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
+#define IOMMU_PAGE_MASK(tblptr) (~((1 << (tblptr)->it_page_shift) - 1))
+#define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))
+
 /* Boot time flags */
 extern int iommu_is_off;
 extern int iommu_force_on;
 
-/* Pure 2^n version of get_order */
-static __inline__ __attribute_const__ int get_iommu_order(unsigned long size)
-{
-	return __ilog2((size - 1) >> IOMMU_PAGE_SHIFT_4K) + 1;
-}
-
-
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -82,6 +79,14 @@ struct iommu_table {
 #endif
 };
 
+/* Pure 2^n version of get_order */
+static inline __attribute_const__
+int get_iommu_order(unsigned long size, struct iommu_table *tbl)
+{
+	return __ilog2((size - 1) >> tbl->it_page_shift) + 1;
+}
+
+
 struct scatterlist;
 
 static inline void set_iommu_table_base(struct device *dev, void *base)
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 5cfe3db..54d0116 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -83,10 +83,10 @@ static int dma_iommu_dma_supported(struct device *dev, u64 mask)
 		return 0;
 	}
 
-	if (tbl->it_offset > (mask >> IOMMU_PAGE_SHIFT_4K)) {
+	if (tbl->it_offset > (mask >> tbl->it_page_shift)) {
 		dev_info(dev, "Warning: IOMMU offset too big for device mask\n");
 		dev_info(dev, "mask: 0x%08llx, table offset: 0x%08lx\n",
-				mask, tbl->it_offset << IOMMU_PAGE_SHIFT_4K);
+				mask, tbl->it_offset << tbl->it_page_shift);
 		return 0;
 	} else
 		return 1;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index df4a7f1..f58d813 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -251,14 +251,13 @@ again:
 
 	if (dev)
 		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
-				      1 << IOMMU_PAGE_SHIFT_4K);
+				      1 << tbl->it_page_shift);
 	else
-		boundary_size = ALIGN(1UL << 32, 1 << IOMMU_PAGE_SHIFT_4K);
+		boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
 	/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
 
-	n = iommu_area_alloc(tbl->it_map, limit, start, npages,
-			tbl->it_offset, boundary_size >> IOMMU_PAGE_SHIFT_4K,
-			align_mask);
+	n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
+			     boundary_size >> tbl->it_page_shift, align_mask);
 	if (n == -1) {
 		if (likely(pass == 0)) {
 			/* First try the pool from the start */
@@ -320,12 +319,12 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 		return DMA_ERROR_CODE;
 
 	entry += tbl->it_offset;	/* Offset into real TCE table */
-	ret = entry << IOMMU_PAGE_SHIFT_4K;	/* Set the return dma address */
+	ret = entry << tbl->it_page_shift;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
 	build_fail = ppc_md.tce_build(tbl, entry, npages,
-				(unsigned long)page & IOMMU_PAGE_MASK_4K,
-				direction, attrs);
+				      (unsigned long)page &
+				      IOMMU_PAGE_MASK(tbl), direction, attrs);
 
 	/* ppc_md.tce_build() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
@@ -352,7 +351,7 @@ static bool iommu_free_check(struct iommu_table *tbl, dma_addr_t dma_addr,
 {
 	unsigned long entry, free_entry;
 
-	entry = dma_addr >> IOMMU_PAGE_SHIFT_4K;
+	entry = dma_addr >> tbl->it_page_shift;
 	free_entry = entry - tbl->it_offset;
 
 	if (((free_entry + npages) > tbl->it_size) ||
@@ -401,7 +400,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	unsigned long flags;
 	struct iommu_pool *pool;
 
-	entry = dma_addr >> IOMMU_PAGE_SHIFT_4K;
+	entry = dma_addr >> tbl->it_page_shift;
 	free_entry = entry - tbl->it_offset;
 
 	pool = get_pool(tbl, free_entry);
@@ -468,13 +467,13 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		}
 		/* Allocate iommu entries for that segment */
 		vaddr = (unsigned long) sg_virt(s);
-		npages = iommu_num_pages(vaddr, slen, IOMMU_PAGE_SIZE_4K);
+		npages = iommu_num_pages(vaddr, slen, IOMMU_PAGE_SIZE(tbl));
 		align = 0;
-		if (IOMMU_PAGE_SHIFT_4K < PAGE_SHIFT && slen >= PAGE_SIZE &&
+		if (tbl->it_page_shift < PAGE_SHIFT && slen >= PAGE_SIZE &&
 		    (vaddr & ~PAGE_MASK) == 0)
-			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT_4K;
+			align = PAGE_SHIFT - tbl->it_page_shift;
 		entry = iommu_range_alloc(dev, tbl, npages, &handle,
-					  mask >> IOMMU_PAGE_SHIFT_4K, align);
+					  mask >> tbl->it_page_shift, align);
 
 		DBG("  - vaddr: %lx, size: %lx\n", vaddr, slen);
 
@@ -489,16 +488,16 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 
 		/* Convert entry to a dma_addr_t */
 		entry += tbl->it_offset;
-		dma_addr = entry << IOMMU_PAGE_SHIFT_4K;
-		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK_4K);
+		dma_addr = entry << tbl->it_page_shift;
+		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
 
 		DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
 		build_fail = ppc_md.tce_build(tbl, entry, npages,
-					vaddr & IOMMU_PAGE_MASK_4K,
-					direction, attrs);
+					      vaddr & IOMMU_PAGE_MASK(tbl),
+					      direction, attrs);
 		if(unlikely(build_fail))
 			goto failure;
 
@@ -559,9 +558,9 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		if (s->dma_length != 0) {
 			unsigned long vaddr, npages;
 
-			vaddr = s->dma_address & IOMMU_PAGE_MASK_4K;
+			vaddr = s->dma_address & IOMMU_PAGE_MASK(tbl);
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
-						 IOMMU_PAGE_SIZE_4K);
+						 IOMMU_PAGE_SIZE(tbl));
 			__iommu_free(tbl, vaddr, npages);
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
@@ -592,7 +591,7 @@ void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 		if (sg->dma_length == 0)
 			break;
 		npages = iommu_num_pages(dma_handle, sg->dma_length,
-					 IOMMU_PAGE_SIZE_4K);
+					 IOMMU_PAGE_SIZE(tbl));
 		__iommu_free(tbl, dma_handle, npages);
 		sg = sg_next(sg);
 	}
@@ -676,7 +675,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 		set_bit(0, tbl->it_map);
 
 	/* We only split the IOMMU table if we have 1GB or more of space */
-	if ((tbl->it_size << IOMMU_PAGE_SHIFT_4K) >= (1UL * 1024 * 1024 * 1024))
+	if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
 		tbl->nr_pools = IOMMU_NR_POOLS;
 	else
 		tbl->nr_pools = 1;
@@ -768,16 +767,16 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl,
 
 	vaddr = page_address(page) + offset;
 	uaddr = (unsigned long)vaddr;
-	npages = iommu_num_pages(uaddr, size, IOMMU_PAGE_SIZE_4K);
+	npages = iommu_num_pages(uaddr, size, IOMMU_PAGE_SIZE(tbl));
 
 	if (tbl) {
 		align = 0;
-		if (IOMMU_PAGE_SHIFT_4K < PAGE_SHIFT && size >= PAGE_SIZE &&
+		if (tbl->it_page_shift < PAGE_SHIFT && size >= PAGE_SIZE &&
 		    ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT_4K;
+			align = PAGE_SHIFT - tbl->it_page_shift;
 
 		dma_handle = iommu_alloc(dev, tbl, vaddr, npages, direction,
-					 mask >> IOMMU_PAGE_SHIFT_4K, align,
+					 mask >> tbl->it_page_shift, align,
 					 attrs);
 		if (dma_handle == DMA_ERROR_CODE) {
 			if (printk_ratelimit())  {
@@ -786,7 +785,7 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl,
 					 npages);
 			}
 		} else
-			dma_handle |= (uaddr & ~IOMMU_PAGE_MASK_4K);
+			dma_handle |= (uaddr & ~IOMMU_PAGE_MASK(tbl));
 	}
 
 	return dma_handle;
@@ -801,7 +800,8 @@ void iommu_unmap_page(struct iommu_table *tbl, dma_addr_t dma_handle,
 	BUG_ON(direction == DMA_NONE);
 
 	if (tbl) {
-		npages = iommu_num_pages(dma_handle, size, IOMMU_PAGE_SIZE_4K);
+		npages = iommu_num_pages(dma_handle, size,
+					 IOMMU_PAGE_SIZE(tbl));
 		iommu_free(tbl, dma_handle, npages);
 	}
 }
@@ -845,10 +845,10 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	memset(ret, 0, size);
 
 	/* Set up tces to cover the allocated range */
-	nio_pages = size >> IOMMU_PAGE_SHIFT_4K;
-	io_order = get_iommu_order(size);
+	nio_pages = size >> tbl->it_page_shift;
+	io_order = get_iommu_order(size, tbl);
 	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
-			      mask >> IOMMU_PAGE_SHIFT_4K, io_order, NULL);
+			      mask >> tbl->it_page_shift, io_order, NULL);
 	if (mapping == DMA_ERROR_CODE) {
 		free_pages((unsigned long)ret, order);
 		return NULL;
@@ -864,7 +864,7 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 		unsigned int nio_pages;
 
 		size = PAGE_ALIGN(size);
-		nio_pages = size >> IOMMU_PAGE_SHIFT_4K;
+		nio_pages = size >> tbl->it_page_shift;
 		iommu_free(tbl, dma_handle, nio_pages);
 		size = PAGE_ALIGN(size);
 		free_pages((unsigned long)vaddr, get_order(size));
@@ -935,10 +935,10 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
 	if (tce_value)
 		return -EINVAL;
 
-	if (ioba & ~IOMMU_PAGE_MASK_4K)
+	if (ioba & ~IOMMU_PAGE_MASK(tbl))
 		return -EINVAL;
 
-	ioba >>= IOMMU_PAGE_SHIFT_4K;
+	ioba >>= tbl->it_page_shift;
 	if (ioba < tbl->it_offset)
 		return -EINVAL;
 
@@ -955,13 +955,13 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 	if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
 		return -EINVAL;
 
-	if (tce & ~(IOMMU_PAGE_MASK_4K | TCE_PCI_WRITE | TCE_PCI_READ))
+	if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
 		return -EINVAL;
 
-	if (ioba & ~IOMMU_PAGE_MASK_4K)
+	if (ioba & ~IOMMU_PAGE_MASK(tbl))
 		return -EINVAL;
 
-	ioba >>= IOMMU_PAGE_SHIFT_4K;
+	ioba >>= tbl->it_page_shift;
 	if (ioba < tbl->it_offset)
 		return -EINVAL;
 
@@ -1037,7 +1037,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
-				__func__, hwaddr, entry << IOMMU_PAGE_SHIFT_4K,
+			__func__, hwaddr, entry << tbl->it_page_shift,
 				hwaddr, ret); */
 
 	return ret;
@@ -1049,14 +1049,14 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 {
 	int ret;
 	struct page *page = NULL;
-	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK_4K & ~PAGE_MASK;
+	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
 			direction != DMA_TO_DEVICE, &page);
 	if (unlikely(ret != 1)) {
 		/* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
-				tce, entry << IOMMU_PAGE_SHIFT_4K, ret); */
+				tce, entry << tbl->it_page_shift, ret); */
 		return -EFAULT;
 	}
 	hwaddr = (unsigned long) page_address(page) + offset;
@@ -1067,7 +1067,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 
 	if (ret < 0)
 		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
-			__func__, entry << IOMMU_PAGE_SHIFT_4K, tce, ret);
+			__func__, entry << tbl->it_page_shift, tce, ret);
 
 	return ret;
 }
@@ -1127,6 +1127,12 @@ int iommu_add_device(struct device *dev)
 	pr_debug("iommu_tce: adding %s to iommu group %d\n",
 			dev_name(dev), iommu_group_id(tbl->it_group));
 
+	if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
+		pr_err("iommu_tce: unsupported iommu page size.");
+		pr_err("%s has not been added\n", dev_name(dev));
+		return -EINVAL;
+	}
+
 	ret = iommu_group_add_device(tbl->it_group, dev);
 	if (ret < 0)
 		pr_err("iommu_tce: %s has not been added, ret=%d\n",
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 170ac24..826d8bd 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -518,16 +518,18 @@ static dma_addr_t vio_dma_iommu_map_page(struct device *dev, struct page *page,
                                          struct dma_attrs *attrs)
 {
 	struct vio_dev *viodev = to_vio_dev(dev);
+	struct iommu_table *tbl;
 	dma_addr_t ret = DMA_ERROR_CODE;
 
-	if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K))) {
+	tbl = get_iommu_table_base(dev);
+	if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE(tbl)))) {
 		atomic_inc(&viodev->cmo.allocs_failed);
 		return ret;
 	}
 
 	ret = dma_iommu_ops.map_page(dev, page, offset, size, direction, attrs);
 	if (unlikely(dma_mapping_error(dev, ret))) {
-		vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K));
+		vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE(tbl)));
 		atomic_inc(&viodev->cmo.allocs_failed);
 	}
 
@@ -540,10 +542,12 @@ static void vio_dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle,
 				     struct dma_attrs *attrs)
 {
 	struct vio_dev *viodev = to_vio_dev(dev);
+	struct iommu_table *tbl;
 
+	tbl = get_iommu_table_base(dev);
 	dma_iommu_ops.unmap_page(dev, dma_handle, size, direction, attrs);
 
-	vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K));
+	vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE(tbl)));
 }
 
 static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -551,12 +555,14 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
                                 struct dma_attrs *attrs)
 {
 	struct vio_dev *viodev = to_vio_dev(dev);
+	struct iommu_table *tbl;
 	struct scatterlist *sgl;
 	int ret, count = 0;
 	size_t alloc_size = 0;
 
+	tbl = get_iommu_table_base(dev);
 	for (sgl = sglist; count < nelems; count++, sgl++)
-		alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE_4K);
+		alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE(tbl));
 
 	if (vio_cmo_alloc(viodev, alloc_size)) {
 		atomic_inc(&viodev->cmo.allocs_failed);
@@ -572,7 +578,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
 	}
 
 	for (sgl = sglist, count = 0; count < ret; count++, sgl++)
-		alloc_size -= roundup(sgl->dma_length, IOMMU_PAGE_SIZE_4K);
+		alloc_size -= roundup(sgl->dma_length, IOMMU_PAGE_SIZE(tbl));
 	if (alloc_size)
 		vio_cmo_dealloc(viodev, alloc_size);
 
@@ -585,12 +591,14 @@ static void vio_dma_iommu_unmap_sg(struct device *dev,
 		struct dma_attrs *attrs)
 {
 	struct vio_dev *viodev = to_vio_dev(dev);
+	struct iommu_table *tbl;
 	struct scatterlist *sgl;
 	size_t alloc_size = 0;
 	int count = 0;
 
+	tbl = get_iommu_table_base(dev);
 	for (sgl = sglist; count < nelems; count++, sgl++)
-		alloc_size += roundup(sgl->dma_length, IOMMU_PAGE_SIZE_4K);
+		alloc_size += roundup(sgl->dma_length, IOMMU_PAGE_SIZE(tbl));
 
 	dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction, attrs);
 
@@ -706,11 +714,14 @@ static int vio_cmo_bus_probe(struct vio_dev *viodev)
 {
 	struct vio_cmo_dev_entry *dev_ent;
 	struct device *dev = &viodev->dev;
+	struct iommu_table *tbl;
 	struct vio_driver *viodrv = to_vio_driver(dev->driver);
 	unsigned long flags;
 	size_t size;
 	bool dma_capable = false;
 
+	tbl = get_iommu_table_base(dev);
+
 	/* A device requires entitlement if it has a DMA window property */
 	switch (viodev->family) {
 	case VDEVICE:
@@ -737,7 +748,7 @@ static int vio_cmo_bus_probe(struct vio_dev *viodev)
 		}
 
 		viodev->cmo.desired =
-			IOMMU_PAGE_ALIGN_4K(viodrv->get_desired_dma(viodev));
+			IOMMU_PAGE_ALIGN(viodrv->get_desired_dma(viodev), tbl);
 		if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
 			viodev->cmo.desired = VIO_CMO_MIN_ENT;
 		size = VIO_CMO_MIN_ENT;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 569b464..b555ebc 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -762,8 +762,6 @@ static struct notifier_block tce_iommu_bus_nb = {
 
 static int __init tce_iommu_bus_notifier_init(void)
 {
-	BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE_4K);
-
 	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
 	return 0;
 }
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index f7d7538..d04dbab 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1276,31 +1276,34 @@ static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev)
 {
 	struct net_device *netdev = dev_get_drvdata(&vdev->dev);
 	struct ibmveth_adapter *adapter;
+	struct iommu_table *tbl;
 	unsigned long ret;
 	int i;
 	int rxqentries = 1;
 
+	tbl = get_iommu_table_base(&vdev->dev);
+
 	/* netdev inits at probe time along with the structures we need below*/
 	if (netdev == NULL)
-		return IOMMU_PAGE_ALIGN_4K(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+		return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT, tbl);
 
 	adapter = netdev_priv(netdev);
 
 	ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
-	ret += IOMMU_PAGE_ALIGN_4K(netdev->mtu);
+	ret += IOMMU_PAGE_ALIGN(netdev->mtu, tbl);
 
 	for (i = 0; i < IBMVETH_NUM_BUFF_POOLS; i++) {
 		/* add the size of the active receive buffers */
 		if (adapter->rx_buff_pool[i].active)
 			ret +=
 			    adapter->rx_buff_pool[i].size *
-			    IOMMU_PAGE_ALIGN_4K(adapter->rx_buff_pool[i].
-			            buff_size);
+			    IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+					     buff_size, tbl);
 		rxqentries += adapter->rx_buff_pool[i].size;
 	}
 	/* add the size of the receive queue entries */
-	ret += IOMMU_PAGE_ALIGN_4K(
-		rxqentries * sizeof(struct ibmveth_rx_q_entry));
+	ret += IOMMU_PAGE_ALIGN(
+		rxqentries * sizeof(struct ibmveth_rx_q_entry), tbl);
 
 	return ret;
 }
-- 
1.7.10.4
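
Not part of the patch: a userspace sketch of how a table entry and a DMA
address relate once the shift comes from the table. struct tce_model and the
tce_model_* functions are hypothetical names. tce_model_num_pages() is
modelled on what iommu_num_pages() is understood to compute, namely the
in-page offset plus the length, rounded up to whole IOMMU pages.

```c
struct tce_model {
	unsigned long it_offset;	/* first entry of the window */
	unsigned long it_page_shift;	/* log2 of the IOMMU page size */
};

/* Allocator-relative entry plus CPU address -> DMA address. */
static unsigned long tce_model_entry_to_dma(const struct tce_model *tbl,
					    unsigned long entry,
					    unsigned long vaddr)
{
	unsigned long page_mask = ~((1UL << tbl->it_page_shift) - 1);
	unsigned long dma = (entry + tbl->it_offset) << tbl->it_page_shift;

	/* Keep the offset inside the IOMMU page. */
	return dma | (vaddr & ~page_mask);
}

/* DMA address -> allocator-relative entry, as the free path computes it. */
static unsigned long tce_model_dma_to_entry(const struct tce_model *tbl,
					    unsigned long dma_addr)
{
	return (dma_addr >> tbl->it_page_shift) - tbl->it_offset;
}

/* Number of IOMMU pages touched by [addr, addr + len). */
static unsigned long tce_model_num_pages(unsigned long addr,
					 unsigned long len,
					 unsigned long io_page_size)
{
	unsigned long size = (addr & (io_page_size - 1)) + len;

	return (size + io_page_size - 1) / io_page_size;
}
```

A 4 KiB buffer starting at offset 0x234 into a page needs two entries with 4K
IOMMU pages but only one with 64K IOMMU pages, which is why the page size has
to be taken from the table on both the map and the unmap path.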

^ permalink raw reply related

* [PATCH V2 2/3] powerpc iommu: Add it_page_shift field to determine iommu page size
From: Alistair Popple @ 2013-12-09  7:17 UTC (permalink / raw)
  To: benh, linuxppc-dev; +Cc: Alistair Popple
In-Reply-To: <1386573423-7989-1-git-send-email-alistair@popple.id.au>

This patch adds an it_page_shift field to struct iommu_table and
initialises it to the 4K page shift for all platforms.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
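
Not part of the patch: a userspace sketch of the unit conversions involved.
struct window_model and its helpers are hypothetical names. it_offset and
it_size are counted in IOMMU pages, so both are obtained by shifting the byte
values right by it_page_shift, and stepping from one entry to the next
advances the address by the page size (1 << it_page_shift), not by the shift
value itself.

```c
struct window_model {
	unsigned long it_page_shift;	/* log2 of the IOMMU page size */
	unsigned long it_offset;	/* first entry, in IOMMU pages */
	unsigned long it_size;		/* number of entries */
};

/* Fill in the table geometry from a DMA window given in bytes. */
static void window_model_init(struct window_model *w, unsigned long offset,
			      unsigned long size, unsigned long page_shift)
{
	w->it_page_shift = page_shift;
	w->it_offset = offset >> w->it_page_shift;
	w->it_size = size >> w->it_page_shift;
}

/* Address covered by the i-th entry of a mapping that starts at uaddr. */
static unsigned long window_model_nth_page(const struct window_model *w,
					   unsigned long uaddr,
					   unsigned long i)
{
	return uaddr + (i << w->it_page_shift);
}
```

For a 256 MiB window at 0x20000000 with 4K IOMMU pages this gives
it_offset = 0x20000 and it_size = 0x10000.
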
 arch/powerpc/include/asm/iommu.h       |    1 +
 arch/powerpc/kernel/vio.c              |    5 +++--
 arch/powerpc/platforms/cell/iommu.c    |    8 +++++---
 arch/powerpc/platforms/pasemi/iommu.c  |    5 ++++-
 arch/powerpc/platforms/powernv/pci.c   |    3 ++-
 arch/powerpc/platforms/pseries/iommu.c |   10 ++++++----
 arch/powerpc/platforms/wsp/wsp_pci.c   |    5 +++--
 7 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 0869c7e..7c92834 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,7 @@ struct iommu_table {
 	struct iommu_pool large_pool;
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
+	unsigned long  it_page_shift;/* table iommu page size */
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 2e89fa3..170ac24 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1177,9 +1177,10 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 			    &tbl->it_index, &offset, &size);
 
 	/* TCE table size - measured in tce entries */
-	tbl->it_size = size >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	tbl->it_size = size >> tbl->it_page_shift;
 	/* offset for VIO should always be 0 */
-	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_offset = offset >> tbl->it_page_shift;
 	tbl->it_busno = 0;
 	tbl->it_type = TCE_VB;
 	tbl->it_blocksize = 16;
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index fc61b90..2b90ff8 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -197,7 +197,7 @@ static int tce_build_cell(struct iommu_table *tbl, long index, long npages,
 
 	io_pte = (unsigned long *)tbl->it_base + (index - tbl->it_offset);
 
-	for (i = 0; i < npages; i++, uaddr += IOMMU_PAGE_SIZE_4K)
+	for (i = 0; i < npages; i++, uaddr += 1UL << tbl->it_page_shift)
 		io_pte[i] = base_pte | (__pa(uaddr) & CBE_IOPTE_RPN_Mask);
 
 	mb();
@@ -487,8 +487,10 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 	window->table.it_blocksize = 16;
 	window->table.it_base = (unsigned long)iommu->ptab;
 	window->table.it_index = iommu->nid;
-	window->table.it_offset = (offset >> IOMMU_PAGE_SHIFT_4K) + pte_offset;
-	window->table.it_size = size >> IOMMU_PAGE_SHIFT_4K;
+	window->table.it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	window->table.it_offset =
+		(offset >> window->table.it_page_shift) + pte_offset;
+	window->table.it_size = size >> window->table.it_page_shift;
 
 	iommu_init_table(&window->table, iommu->nid);
 
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 7d2d036..2e576f2 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -138,8 +138,11 @@ static void iommu_table_iobmap_setup(void)
 	pr_debug(" -> %s\n", __func__);
 	iommu_table_iobmap.it_busno = 0;
 	iommu_table_iobmap.it_offset = 0;
+	iommu_table_iobmap.it_page_shift = IOBMAP_PAGE_SHIFT;
+
 	/* it_size is in number of entries */
-	iommu_table_iobmap.it_size = 0x80000000 >> IOBMAP_PAGE_SHIFT;
+	iommu_table_iobmap.it_size =
+		0x80000000 >> iommu_table_iobmap.it_page_shift;
 
 	/* Initialize the common IOMMU code */
 	iommu_table_iobmap.it_base = (unsigned long)iob_l2_base;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 7f4d857..569b464 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -564,7 +564,8 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 {
 	tbl->it_blocksize = 16;
 	tbl->it_base = (unsigned long)tce_mem;
-	tbl->it_offset = dma_offset >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	tbl->it_offset = dma_offset >> tbl->it_page_shift;
 	tbl->it_index = 0;
 	tbl->it_size = tce_size >> 3;
 	tbl->it_busno = 0;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 1b7531c..e029918 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -486,9 +486,10 @@ static void iommu_table_setparms(struct pci_controller *phb,
 		memset((void *)tbl->it_base, 0, *sizep);
 
 	tbl->it_busno = phb->bus->number;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
 
 	/* Units of tce entries */
-	tbl->it_offset = phb->dma_window_base_cur >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
 
 	/* Test if we are going over 2GB of DMA space */
 	if (phb->dma_window_base_cur + phb->dma_window_size > 0x80000000ul) {
@@ -499,7 +500,7 @@ static void iommu_table_setparms(struct pci_controller *phb,
 	phb->dma_window_base_cur += phb->dma_window_size;
 
 	/* Set the tce table size - measured in entries */
-	tbl->it_size = phb->dma_window_size >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_size = phb->dma_window_size >> tbl->it_page_shift;
 
 	tbl->it_index = 0;
 	tbl->it_blocksize = 16;
@@ -537,11 +538,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
 	of_parse_dma_window(dn, dma_window, &tbl->it_index, &offset, &size);
 
 	tbl->it_busno = phb->bus->number;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
 	tbl->it_base   = 0;
 	tbl->it_blocksize  = 16;
 	tbl->it_type = TCE_PCI;
-	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT_4K;
-	tbl->it_size = size >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_offset = offset >> tbl->it_page_shift;
+	tbl->it_size = size >> tbl->it_page_shift;
 }
 
 static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
diff --git a/arch/powerpc/platforms/wsp/wsp_pci.c b/arch/powerpc/platforms/wsp/wsp_pci.c
index 8a58961..9a15e5b 100644
--- a/arch/powerpc/platforms/wsp/wsp_pci.c
+++ b/arch/powerpc/platforms/wsp/wsp_pci.c
@@ -381,8 +381,9 @@ static struct wsp_dma_table *wsp_pci_create_dma32_table(struct wsp_phb *phb,
 
 	/* Init bits and pieces */
 	tbl->table.it_blocksize = 16;
-	tbl->table.it_offset = addr >> IOMMU_PAGE_SHIFT_4K;
-	tbl->table.it_size = size >> IOMMU_PAGE_SHIFT_4K;
+	tbl->table.it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	tbl->table.it_offset = addr >> tbl->table.it_page_shift;
+	tbl->table.it_size = size >> tbl->table.it_page_shift;
 
 	/*
 	 * It's already blank but we clear it anyway.
-- 
1.7.10.4

* [PATCH V2 1/3] powerpc iommu: Update constant names to reflect their hardcoded page size
From: Alistair Popple @ 2013-12-09  7:17 UTC (permalink / raw)
  To: benh, linuxppc-dev; +Cc: Alexey Kardashevskiy, Alistair Popple
In-Reply-To: <1386573423-7989-1-git-send-email-alistair@popple.id.au>

The powerpc iommu uses a hardcoded page size of 4K. This patch changes
the name of the IOMMU_PAGE_* macros to reflect the hardcoded values. A
future patch will use the existing names to support dynamic page
sizes.
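
For illustration only (this is not part of the patch): a minimal
user-space sketch of what the renamed macros compute. Plain C stands in
for the kernel's ASM_CONST() and _ALIGN_UP(), and iommu_pages_4k() is a
hypothetical helper that does not exist in the tree.

```c
/* Stand-ins for the renamed kernel macros; plain C replaces ASM_CONST()
 * and _ALIGN_UP(), which are not available outside the kernel. */
#define IOMMU_PAGE_SHIFT_4K       12
#define IOMMU_PAGE_SIZE_4K        (1UL << IOMMU_PAGE_SHIFT_4K)
#define IOMMU_PAGE_MASK_4K        (~(IOMMU_PAGE_SIZE_4K - 1))
#define IOMMU_PAGE_ALIGN_4K(addr) \
	(((addr) + IOMMU_PAGE_SIZE_4K - 1) & IOMMU_PAGE_MASK_4K)

/* Hypothetical helper: number of 4K IOMMU pages needed to cover
 * `size` bytes (round up to a page boundary, then divide by 4K). */
static unsigned long iommu_pages_4k(unsigned long size)
{
	return IOMMU_PAGE_ALIGN_4K(size) >> IOMMU_PAGE_SHIFT_4K;
}
```

The _4K suffix makes it visible at each call site that the value is a
fixed 4K quantity rather than a property of a particular table.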

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h       |   10 ++--
 arch/powerpc/kernel/dma-iommu.c        |    4 +-
 arch/powerpc/kernel/iommu.c            |   78 ++++++++++++++++----------------
 arch/powerpc/kernel/vio.c              |   19 ++++----
 arch/powerpc/platforms/cell/iommu.c    |   12 ++---
 arch/powerpc/platforms/powernv/pci.c   |    4 +-
 arch/powerpc/platforms/pseries/iommu.c |    8 ++--
 arch/powerpc/platforms/pseries/setup.c |    4 +-
 arch/powerpc/platforms/wsp/wsp_pci.c   |   10 ++--
 drivers/net/ethernet/ibm/ibmveth.c     |    9 ++--
 drivers/vfio/vfio_iommu_spapr_tce.c    |   28 ++++++------
 11 files changed, 94 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 774fa27..0869c7e 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -30,10 +30,10 @@
 #include <asm/machdep.h>
 #include <asm/types.h>
 
-#define IOMMU_PAGE_SHIFT      12
-#define IOMMU_PAGE_SIZE       (ASM_CONST(1) << IOMMU_PAGE_SHIFT)
-#define IOMMU_PAGE_MASK       (~((1 << IOMMU_PAGE_SHIFT) - 1))
-#define IOMMU_PAGE_ALIGN(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE)
+#define IOMMU_PAGE_SHIFT_4K      12
+#define IOMMU_PAGE_SIZE_4K       (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K)
+#define IOMMU_PAGE_MASK_4K       (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
+#define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
 
 /* Boot time flags */
 extern int iommu_is_off;
@@ -42,7 +42,7 @@ extern int iommu_force_on;
 /* Pure 2^n version of get_order */
 static __inline__ __attribute_const__ int get_iommu_order(unsigned long size)
 {
-	return __ilog2((size - 1) >> IOMMU_PAGE_SHIFT) + 1;
+	return __ilog2((size - 1) >> IOMMU_PAGE_SHIFT_4K) + 1;
 }
 
 
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index e489752..5cfe3db 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -83,10 +83,10 @@ static int dma_iommu_dma_supported(struct device *dev, u64 mask)
 		return 0;
 	}
 
-	if (tbl->it_offset > (mask >> IOMMU_PAGE_SHIFT)) {
+	if (tbl->it_offset > (mask >> IOMMU_PAGE_SHIFT_4K)) {
 		dev_info(dev, "Warning: IOMMU offset too big for device mask\n");
 		dev_info(dev, "mask: 0x%08llx, table offset: 0x%08lx\n",
-				mask, tbl->it_offset << IOMMU_PAGE_SHIFT);
+				mask, tbl->it_offset << IOMMU_PAGE_SHIFT_4K);
 		return 0;
 	} else
 		return 1;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d22abe0..df4a7f1 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -251,14 +251,14 @@ again:
 
 	if (dev)
 		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
-				      1 << IOMMU_PAGE_SHIFT);
+				      1 << IOMMU_PAGE_SHIFT_4K);
 	else
-		boundary_size = ALIGN(1UL << 32, 1 << IOMMU_PAGE_SHIFT);
+		boundary_size = ALIGN(1UL << 32, 1 << IOMMU_PAGE_SHIFT_4K);
 	/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
 
 	n = iommu_area_alloc(tbl->it_map, limit, start, npages,
-			     tbl->it_offset, boundary_size >> IOMMU_PAGE_SHIFT,
-			     align_mask);
+			tbl->it_offset, boundary_size >> IOMMU_PAGE_SHIFT_4K,
+			align_mask);
 	if (n == -1) {
 		if (likely(pass == 0)) {
 			/* First try the pool from the start */
@@ -320,12 +320,12 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 		return DMA_ERROR_CODE;
 
 	entry += tbl->it_offset;	/* Offset into real TCE table */
-	ret = entry << IOMMU_PAGE_SHIFT;	/* Set the return dma address */
+	ret = entry << IOMMU_PAGE_SHIFT_4K;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
 	build_fail = ppc_md.tce_build(tbl, entry, npages,
-	                              (unsigned long)page & IOMMU_PAGE_MASK,
-	                              direction, attrs);
+				(unsigned long)page & IOMMU_PAGE_MASK_4K,
+				direction, attrs);
 
 	/* ppc_md.tce_build() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
@@ -352,7 +352,7 @@ static bool iommu_free_check(struct iommu_table *tbl, dma_addr_t dma_addr,
 {
 	unsigned long entry, free_entry;
 
-	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	entry = dma_addr >> IOMMU_PAGE_SHIFT_4K;
 	free_entry = entry - tbl->it_offset;
 
 	if (((free_entry + npages) > tbl->it_size) ||
@@ -401,7 +401,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	unsigned long flags;
 	struct iommu_pool *pool;
 
-	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	entry = dma_addr >> IOMMU_PAGE_SHIFT_4K;
 	free_entry = entry - tbl->it_offset;
 
 	pool = get_pool(tbl, free_entry);
@@ -468,13 +468,13 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		}
 		/* Allocate iommu entries for that segment */
 		vaddr = (unsigned long) sg_virt(s);
-		npages = iommu_num_pages(vaddr, slen, IOMMU_PAGE_SIZE);
+		npages = iommu_num_pages(vaddr, slen, IOMMU_PAGE_SIZE_4K);
 		align = 0;
-		if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && slen >= PAGE_SIZE &&
+		if (IOMMU_PAGE_SHIFT_4K < PAGE_SHIFT && slen >= PAGE_SIZE &&
 		    (vaddr & ~PAGE_MASK) == 0)
-			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT_4K;
 		entry = iommu_range_alloc(dev, tbl, npages, &handle,
-					  mask >> IOMMU_PAGE_SHIFT, align);
+					  mask >> IOMMU_PAGE_SHIFT_4K, align);
 
 		DBG("  - vaddr: %lx, size: %lx\n", vaddr, slen);
 
@@ -489,16 +489,16 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 
 		/* Convert entry to a dma_addr_t */
 		entry += tbl->it_offset;
-		dma_addr = entry << IOMMU_PAGE_SHIFT;
-		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK);
+		dma_addr = entry << IOMMU_PAGE_SHIFT_4K;
+		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK_4K);
 
 		DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
 		build_fail = ppc_md.tce_build(tbl, entry, npages,
-		                              vaddr & IOMMU_PAGE_MASK,
-		                              direction, attrs);
+					vaddr & IOMMU_PAGE_MASK_4K,
+					direction, attrs);
 		if(unlikely(build_fail))
 			goto failure;
 
@@ -559,9 +559,9 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		if (s->dma_length != 0) {
 			unsigned long vaddr, npages;
 
-			vaddr = s->dma_address & IOMMU_PAGE_MASK;
+			vaddr = s->dma_address & IOMMU_PAGE_MASK_4K;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
-						 IOMMU_PAGE_SIZE);
+						 IOMMU_PAGE_SIZE_4K);
 			__iommu_free(tbl, vaddr, npages);
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
@@ -592,7 +592,7 @@ void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 		if (sg->dma_length == 0)
 			break;
 		npages = iommu_num_pages(dma_handle, sg->dma_length,
-					 IOMMU_PAGE_SIZE);
+					 IOMMU_PAGE_SIZE_4K);
 		__iommu_free(tbl, dma_handle, npages);
 		sg = sg_next(sg);
 	}
@@ -676,7 +676,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 		set_bit(0, tbl->it_map);
 
 	/* We only split the IOMMU table if we have 1GB or more of space */
-	if ((tbl->it_size << IOMMU_PAGE_SHIFT) >= (1UL * 1024 * 1024 * 1024))
+	if ((tbl->it_size << IOMMU_PAGE_SHIFT_4K) >= (1UL * 1024 * 1024 * 1024))
 		tbl->nr_pools = IOMMU_NR_POOLS;
 	else
 		tbl->nr_pools = 1;
@@ -768,16 +768,16 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl,
 
 	vaddr = page_address(page) + offset;
 	uaddr = (unsigned long)vaddr;
-	npages = iommu_num_pages(uaddr, size, IOMMU_PAGE_SIZE);
+	npages = iommu_num_pages(uaddr, size, IOMMU_PAGE_SIZE_4K);
 
 	if (tbl) {
 		align = 0;
-		if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && size >= PAGE_SIZE &&
+		if (IOMMU_PAGE_SHIFT_4K < PAGE_SHIFT && size >= PAGE_SIZE &&
 		    ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT_4K;
 
 		dma_handle = iommu_alloc(dev, tbl, vaddr, npages, direction,
-					 mask >> IOMMU_PAGE_SHIFT, align,
+					 mask >> IOMMU_PAGE_SHIFT_4K, align,
 					 attrs);
 		if (dma_handle == DMA_ERROR_CODE) {
 			if (printk_ratelimit())  {
@@ -786,7 +786,7 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl,
 					 npages);
 			}
 		} else
-			dma_handle |= (uaddr & ~IOMMU_PAGE_MASK);
+			dma_handle |= (uaddr & ~IOMMU_PAGE_MASK_4K);
 	}
 
 	return dma_handle;
@@ -801,7 +801,7 @@ void iommu_unmap_page(struct iommu_table *tbl, dma_addr_t dma_handle,
 	BUG_ON(direction == DMA_NONE);
 
 	if (tbl) {
-		npages = iommu_num_pages(dma_handle, size, IOMMU_PAGE_SIZE);
+		npages = iommu_num_pages(dma_handle, size, IOMMU_PAGE_SIZE_4K);
 		iommu_free(tbl, dma_handle, npages);
 	}
 }
@@ -845,10 +845,10 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	memset(ret, 0, size);
 
 	/* Set up tces to cover the allocated range */
-	nio_pages = size >> IOMMU_PAGE_SHIFT;
+	nio_pages = size >> IOMMU_PAGE_SHIFT_4K;
 	io_order = get_iommu_order(size);
 	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
-			      mask >> IOMMU_PAGE_SHIFT, io_order, NULL);
+			      mask >> IOMMU_PAGE_SHIFT_4K, io_order, NULL);
 	if (mapping == DMA_ERROR_CODE) {
 		free_pages((unsigned long)ret, order);
 		return NULL;
@@ -864,7 +864,7 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 		unsigned int nio_pages;
 
 		size = PAGE_ALIGN(size);
-		nio_pages = size >> IOMMU_PAGE_SHIFT;
+		nio_pages = size >> IOMMU_PAGE_SHIFT_4K;
 		iommu_free(tbl, dma_handle, nio_pages);
 		size = PAGE_ALIGN(size);
 		free_pages((unsigned long)vaddr, get_order(size));
@@ -935,10 +935,10 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
 	if (tce_value)
 		return -EINVAL;
 
-	if (ioba & ~IOMMU_PAGE_MASK)
+	if (ioba & ~IOMMU_PAGE_MASK_4K)
 		return -EINVAL;
 
-	ioba >>= IOMMU_PAGE_SHIFT;
+	ioba >>= IOMMU_PAGE_SHIFT_4K;
 	if (ioba < tbl->it_offset)
 		return -EINVAL;
 
@@ -955,13 +955,13 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 	if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
 		return -EINVAL;
 
-	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+	if (tce & ~(IOMMU_PAGE_MASK_4K | TCE_PCI_WRITE | TCE_PCI_READ))
 		return -EINVAL;
 
-	if (ioba & ~IOMMU_PAGE_MASK)
+	if (ioba & ~IOMMU_PAGE_MASK_4K)
 		return -EINVAL;
 
-	ioba >>= IOMMU_PAGE_SHIFT;
+	ioba >>= IOMMU_PAGE_SHIFT_4K;
 	if (ioba < tbl->it_offset)
 		return -EINVAL;
 
@@ -1037,7 +1037,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
-				__func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
+				__func__, hwaddr, entry << IOMMU_PAGE_SHIFT_4K,
 				hwaddr, ret); */
 
 	return ret;
@@ -1049,14 +1049,14 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 {
 	int ret;
 	struct page *page = NULL;
-	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
+	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK_4K & ~PAGE_MASK;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
 			direction != DMA_TO_DEVICE, &page);
 	if (unlikely(ret != 1)) {
 		/* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
-				tce, entry << IOMMU_PAGE_SHIFT, ret); */
+				tce, entry << IOMMU_PAGE_SHIFT_4K, ret); */
 		return -EFAULT;
 	}
 	hwaddr = (unsigned long) page_address(page) + offset;
@@ -1067,7 +1067,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 
 	if (ret < 0)
 		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
-				__func__, entry << IOMMU_PAGE_SHIFT, tce, ret);
+			__func__, entry << IOMMU_PAGE_SHIFT_4K, tce, ret);
 
 	return ret;
 }
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 76a6482..2e89fa3 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -520,14 +520,14 @@ static dma_addr_t vio_dma_iommu_map_page(struct device *dev, struct page *page,
 	struct vio_dev *viodev = to_vio_dev(dev);
 	dma_addr_t ret = DMA_ERROR_CODE;
 
-	if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE))) {
+	if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K))) {
 		atomic_inc(&viodev->cmo.allocs_failed);
 		return ret;
 	}
 
 	ret = dma_iommu_ops.map_page(dev, page, offset, size, direction, attrs);
 	if (unlikely(dma_mapping_error(dev, ret))) {
-		vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE));
+		vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K));
 		atomic_inc(&viodev->cmo.allocs_failed);
 	}
 
@@ -543,7 +543,7 @@ static void vio_dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle,
 
 	dma_iommu_ops.unmap_page(dev, dma_handle, size, direction, attrs);
 
-	vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE));
+	vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K));
 }
 
 static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -556,7 +556,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
 	size_t alloc_size = 0;
 
 	for (sgl = sglist; count < nelems; count++, sgl++)
-		alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE);
+		alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE_4K);
 
 	if (vio_cmo_alloc(viodev, alloc_size)) {
 		atomic_inc(&viodev->cmo.allocs_failed);
@@ -572,7 +572,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
 	}
 
 	for (sgl = sglist, count = 0; count < ret; count++, sgl++)
-		alloc_size -= roundup(sgl->dma_length, IOMMU_PAGE_SIZE);
+		alloc_size -= roundup(sgl->dma_length, IOMMU_PAGE_SIZE_4K);
 	if (alloc_size)
 		vio_cmo_dealloc(viodev, alloc_size);
 
@@ -590,7 +590,7 @@ static void vio_dma_iommu_unmap_sg(struct device *dev,
 	int count = 0;
 
 	for (sgl = sglist; count < nelems; count++, sgl++)
-		alloc_size += roundup(sgl->dma_length, IOMMU_PAGE_SIZE);
+		alloc_size += roundup(sgl->dma_length, IOMMU_PAGE_SIZE_4K);
 
 	dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction, attrs);
 
@@ -736,7 +736,8 @@ static int vio_cmo_bus_probe(struct vio_dev *viodev)
 			return -EINVAL;
 		}
 
-		viodev->cmo.desired = IOMMU_PAGE_ALIGN(viodrv->get_desired_dma(viodev));
+		viodev->cmo.desired =
+			IOMMU_PAGE_ALIGN_4K(viodrv->get_desired_dma(viodev));
 		if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
 			viodev->cmo.desired = VIO_CMO_MIN_ENT;
 		size = VIO_CMO_MIN_ENT;
@@ -1176,9 +1177,9 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 			    &tbl->it_index, &offset, &size);
 
 	/* TCE table size - measured in tce entries */
-	tbl->it_size = size >> IOMMU_PAGE_SHIFT;
+	tbl->it_size = size >> IOMMU_PAGE_SHIFT_4K;
 	/* offset for VIO should always be 0 */
-	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT_4K;
 	tbl->it_busno = 0;
 	tbl->it_type = TCE_VB;
 	tbl->it_blocksize = 16;
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index b535606..fc61b90 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -197,7 +197,7 @@ static int tce_build_cell(struct iommu_table *tbl, long index, long npages,
 
 	io_pte = (unsigned long *)tbl->it_base + (index - tbl->it_offset);
 
-	for (i = 0; i < npages; i++, uaddr += IOMMU_PAGE_SIZE)
+	for (i = 0; i < npages; i++, uaddr += IOMMU_PAGE_SIZE_4K)
 		io_pte[i] = base_pte | (__pa(uaddr) & CBE_IOPTE_RPN_Mask);
 
 	mb();
@@ -430,7 +430,7 @@ static void cell_iommu_setup_hardware(struct cbe_iommu *iommu,
 {
 	cell_iommu_setup_stab(iommu, base, size, 0, 0);
 	iommu->ptab = cell_iommu_alloc_ptab(iommu, base, size, 0, 0,
-					    IOMMU_PAGE_SHIFT);
+					    IOMMU_PAGE_SHIFT_4K);
 	cell_iommu_enable_hardware(iommu);
 }
 
@@ -487,8 +487,8 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 	window->table.it_blocksize = 16;
 	window->table.it_base = (unsigned long)iommu->ptab;
 	window->table.it_index = iommu->nid;
-	window->table.it_offset = (offset >> IOMMU_PAGE_SHIFT) + pte_offset;
-	window->table.it_size = size >> IOMMU_PAGE_SHIFT;
+	window->table.it_offset = (offset >> IOMMU_PAGE_SHIFT_4K) + pte_offset;
+	window->table.it_size = size >> IOMMU_PAGE_SHIFT_4K;
 
 	iommu_init_table(&window->table, iommu->nid);
 
@@ -773,7 +773,7 @@ static void __init cell_iommu_init_one(struct device_node *np,
 
 	/* Setup the iommu_table */
 	cell_iommu_setup_window(iommu, np, base, size,
-				offset >> IOMMU_PAGE_SHIFT);
+				offset >> IOMMU_PAGE_SHIFT_4K);
 }
 
 static void __init cell_disable_iommus(void)
@@ -1122,7 +1122,7 @@ static int __init cell_iommu_fixed_mapping_init(void)
 
 		cell_iommu_setup_stab(iommu, dbase, dsize, fbase, fsize);
 		iommu->ptab = cell_iommu_alloc_ptab(iommu, dbase, dsize, 0, 0,
-						    IOMMU_PAGE_SHIFT);
+						    IOMMU_PAGE_SHIFT_4K);
 		cell_iommu_setup_fixed_ptab(iommu, np, dbase, dsize,
 					     fbase, fsize);
 		cell_iommu_enable_hardware(iommu);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bac289a..7f4d857 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -564,7 +564,7 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 {
 	tbl->it_blocksize = 16;
 	tbl->it_base = (unsigned long)tce_mem;
-	tbl->it_offset = dma_offset >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = dma_offset >> IOMMU_PAGE_SHIFT_4K;
 	tbl->it_index = 0;
 	tbl->it_size = tce_size >> 3;
 	tbl->it_busno = 0;
@@ -761,7 +761,7 @@ static struct notifier_block tce_iommu_bus_nb = {
 
 static int __init tce_iommu_bus_notifier_init(void)
 {
-	BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
+	BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE_4K);
 
 	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
 	return 0;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a80af6c..1b7531c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -488,7 +488,7 @@ static void iommu_table_setparms(struct pci_controller *phb,
 	tbl->it_busno = phb->bus->number;
 
 	/* Units of tce entries */
-	tbl->it_offset = phb->dma_window_base_cur >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = phb->dma_window_base_cur >> IOMMU_PAGE_SHIFT_4K;
 
 	/* Test if we are going over 2GB of DMA space */
 	if (phb->dma_window_base_cur + phb->dma_window_size > 0x80000000ul) {
@@ -499,7 +499,7 @@ static void iommu_table_setparms(struct pci_controller *phb,
 	phb->dma_window_base_cur += phb->dma_window_size;
 
 	/* Set the tce table size - measured in entries */
-	tbl->it_size = phb->dma_window_size >> IOMMU_PAGE_SHIFT;
+	tbl->it_size = phb->dma_window_size >> IOMMU_PAGE_SHIFT_4K;
 
 	tbl->it_index = 0;
 	tbl->it_blocksize = 16;
@@ -540,8 +540,8 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
 	tbl->it_base   = 0;
 	tbl->it_blocksize  = 16;
 	tbl->it_type = TCE_PCI;
-	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT;
-	tbl->it_size = size >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT_4K;
+	tbl->it_size = size >> IOMMU_PAGE_SHIFT_4K;
 }
 
 static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index c1f1908..49cd16e 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -72,7 +72,7 @@
 
 int CMO_PrPSP = -1;
 int CMO_SecPSP = -1;
-unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT);
+unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K);
 EXPORT_SYMBOL(CMO_PageSize);
 
 int fwnmi_active;  /* TRUE if an FWNMI handler is present */
@@ -569,7 +569,7 @@ void pSeries_cmo_feature_init(void)
 {
 	char *ptr, *key, *value, *end;
 	int call_status;
-	int page_order = IOMMU_PAGE_SHIFT;
+	int page_order = IOMMU_PAGE_SHIFT_4K;
 
 	pr_debug(" -> fw_cmo_feature_init()\n");
 	spin_lock(&rtas_data_buf_lock);
diff --git a/arch/powerpc/platforms/wsp/wsp_pci.c b/arch/powerpc/platforms/wsp/wsp_pci.c
index 62cb527..8a58961 100644
--- a/arch/powerpc/platforms/wsp/wsp_pci.c
+++ b/arch/powerpc/platforms/wsp/wsp_pci.c
@@ -260,7 +260,7 @@ static int tce_build_wsp(struct iommu_table *tbl, long index, long npages,
 		*tcep = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
 
 		dma_debug("[DMA] TCE %p set to 0x%016llx (dma addr: 0x%lx)\n",
-			  tcep, *tcep, (tbl->it_offset + index) << IOMMU_PAGE_SHIFT);
+			  tcep, *tcep, (tbl->it_offset + index) << IOMMU_PAGE_SHIFT_4K);
 
 		uaddr += TCE_PAGE_SIZE;
 		index++;
@@ -381,8 +381,8 @@ static struct wsp_dma_table *wsp_pci_create_dma32_table(struct wsp_phb *phb,
 
 	/* Init bits and pieces */
 	tbl->table.it_blocksize = 16;
-	tbl->table.it_offset = addr >> IOMMU_PAGE_SHIFT;
-	tbl->table.it_size = size >> IOMMU_PAGE_SHIFT;
+	tbl->table.it_offset = addr >> IOMMU_PAGE_SHIFT_4K;
+	tbl->table.it_size = size >> IOMMU_PAGE_SHIFT_4K;
 
 	/*
 	 * It's already blank but we clear it anyway.
@@ -449,8 +449,8 @@ static void wsp_pci_dma_dev_setup(struct pci_dev *pdev)
 	if (table) {
 		pr_info("%s: Setup iommu: 32-bit DMA region 0x%08lx..0x%08lx\n",
 			pci_name(pdev),
-			table->table.it_offset << IOMMU_PAGE_SHIFT,
-			(table->table.it_offset << IOMMU_PAGE_SHIFT)
+			table->table.it_offset << IOMMU_PAGE_SHIFT_4K,
+			(table->table.it_offset << IOMMU_PAGE_SHIFT_4K)
 			+ phb->dma32_region_size - 1);
 		archdata->dma_data.iommu_table_base = &table->table;
 		return;
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 952d795..f7d7538 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1282,24 +1282,25 @@ static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev)
 
 	/* netdev inits at probe time along with the structures we need below*/
 	if (netdev == NULL)
-		return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+		return IOMMU_PAGE_ALIGN_4K(IBMVETH_IO_ENTITLEMENT_DEFAULT);
 
 	adapter = netdev_priv(netdev);
 
 	ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
-	ret += IOMMU_PAGE_ALIGN(netdev->mtu);
+	ret += IOMMU_PAGE_ALIGN_4K(netdev->mtu);
 
 	for (i = 0; i < IBMVETH_NUM_BUFF_POOLS; i++) {
 		/* add the size of the active receive buffers */
 		if (adapter->rx_buff_pool[i].active)
 			ret +=
 			    adapter->rx_buff_pool[i].size *
-			    IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+			    IOMMU_PAGE_ALIGN_4K(adapter->rx_buff_pool[i].
 			            buff_size);
 		rxqentries += adapter->rx_buff_pool[i].size;
 	}
 	/* add the size of the receive queue entries */
-	ret += IOMMU_PAGE_ALIGN(rxqentries * sizeof(struct ibmveth_rx_q_entry));
+	ret += IOMMU_PAGE_ALIGN_4K(
+		rxqentries * sizeof(struct ibmveth_rx_q_entry));
 
 	return ret;
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index bdae7a0..a84788b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -81,7 +81,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * enforcing the limit based on the max that the guest can map.
 	 */
 	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
+	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -110,7 +110,7 @@ static void tce_iommu_disable(struct tce_container *container)
 
 	down_write(&current->mm->mmap_sem);
 	current->mm->locked_vm -= (container->tbl->it_size <<
-			IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
+			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }
 
@@ -174,8 +174,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;
 
-		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT;
-		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
+		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
 		info.flags = 0;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
@@ -205,8 +205,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 				VFIO_DMA_MAP_FLAG_WRITE))
 			return -EINVAL;
 
-		if ((param.size & ~IOMMU_PAGE_MASK) ||
-				(param.vaddr & ~IOMMU_PAGE_MASK))
+		if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
+				(param.vaddr & ~IOMMU_PAGE_MASK_4K))
 			return -EINVAL;
 
 		/* iova is checked by the IOMMU API */
@@ -220,17 +220,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
-		for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT); ++i) {
+		for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
 			ret = iommu_put_tce_user_mode(tbl,
-					(param.iova >> IOMMU_PAGE_SHIFT) + i,
+					(param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
 					tce);
 			if (ret)
 				break;
-			tce += IOMMU_PAGE_SIZE;
+			tce += IOMMU_PAGE_SIZE_4K;
 		}
 		if (ret)
 			iommu_clear_tces_and_put_pages(tbl,
-					param.iova >> IOMMU_PAGE_SHIFT,	i);
+					param.iova >> IOMMU_PAGE_SHIFT_4K, i);
 
 		iommu_flush_tce(tbl);
 
@@ -256,17 +256,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags)
 			return -EINVAL;
 
-		if (param.size & ~IOMMU_PAGE_MASK)
+		if (param.size & ~IOMMU_PAGE_MASK_4K)
 			return -EINVAL;
 
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-				param.size >> IOMMU_PAGE_SHIFT);
+				param.size >> IOMMU_PAGE_SHIFT_4K);
 		if (ret)
 			return ret;
 
 		ret = iommu_clear_tces_and_put_pages(tbl,
-				param.iova >> IOMMU_PAGE_SHIFT,
-				param.size >> IOMMU_PAGE_SHIFT);
+				param.iova >> IOMMU_PAGE_SHIFT_4K,
+				param.size >> IOMMU_PAGE_SHIFT_4K);
 		iommu_flush_tce(tbl);
 
 		return ret;
-- 
1.7.10.4

* [PATCH V2 0/3] powerpc iommu: Remove hardcoded page sizes
From: Alistair Popple @ 2013-12-09  7:17 UTC (permalink / raw)
  To: benh, linuxppc-dev

The series doesn't actually change the iommu page size as each platform continues to
initialise the iommu page size to a hardcoded value of 4K.

At this stage testing has only been carried out on a pSeries machine; other platforms,
including Cell, have yet to be tested.

Changes from V1:
* Rebased on Ben's next tree
* Updated constants in 1/3 that were not present in V1 (thanks Alexey!)
* Added initialisation for pasemi platform that was missed in V1
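
As a rough illustration of the per-table pattern (a user-space sketch, not code
from the series): each platform records the page shift in the table and derives
the window offset and size from it. The struct below is a cut-down stand-in for
struct iommu_table, and both helper names are made up for this example.

```c
#define IOMMU_PAGE_SHIFT_4K 12

/* Hypothetical, cut-down stand-in for the kernel's struct iommu_table. */
struct iommu_table_sketch {
	unsigned long it_offset;     /* start of window, in IOMMU pages */
	unsigned long it_size;       /* size of window, in IOMMU pages */
	unsigned long it_page_shift; /* log2 of the IOMMU page size */
};

/* Hypothetical helper mirroring the per-platform setup: the shift is
 * still hardcoded to 4K, but every later use reads it from the table. */
static void table_set_window(struct iommu_table_sketch *tbl,
			     unsigned long offset, unsigned long size)
{
	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
	tbl->it_offset = offset >> tbl->it_page_shift;
	tbl->it_size = size >> tbl->it_page_shift;
}

/* Window size in bytes, recovered from the entry count and the shift. */
static unsigned long table_window_bytes(const struct iommu_table_sketch *tbl)
{
	return tbl->it_size << tbl->it_page_shift;
}
```

With this shape, supporting another page size later would only mean changing
the value stored in it_page_shift at setup time.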

* linux-next: build failure after merge of the final tree (powerpc tree related)
From: Stephen Rothwell @ 2013-12-09  6:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
  Cc: Mahesh Salgaonkar, linux-next, Paul Mackerras, linux-kernel

Hi all,

After merging the final tree, today's linux-next build (powerpc
allyesconfig) failed like this:

arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:958: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:959: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:983: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:984: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1003: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1013: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1014: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1015: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1016: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1017: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1018: Error: attempt to move .org backwards

Caused by commit 1e9b4507ed98 ("powerpc/book3s: handle machine check in
Linux host").

I have reverted these commits (possibly some of these reverts are
unnecessary):

b63a0ffe35de "powerpc/powernv: Machine check exception handling"
28446de2ce99 "powerpc/powernv: Remove machine check handling in OPAL"
b5ff4211a829 "powerpc/book3s: Queue up and process delayed MCE events"
36df96f8acaf "powerpc/book3s: Decode and save machine check event"
ae744f3432d3 "powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power8"
e22a22740c1a "powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power7"
0440705049b0 "powerpc/book3s: Add flush_tlb operation in cpu_spec"
4c703416efc0 "powerpc/book3s: Introduce a early machine check hook in cpu_spec"
1c51089f777b "powerpc/book3s: Return from interrupt if coming from evil context"
1e9b4507ed98 "powerpc/book3s: handle machine check in Linux host"

-- 
Cheers,
Stephen Rothwell <sfr@canb.auug.org.au>

* Re: [PATCH V4 10/10] powerpc, perf: Cleanup SW branch filter list look up
From: Michael Ellerman @ 2013-12-09  6:21 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, khandual
  Cc: mikey, ak, eranian, acme, sukadev, mingo
In-Reply-To: <1386153162-11225-11-git-send-email-khandual@linux.vnet.ibm.com>

On Wed, 2013-04-12 at 10:32:42 UTC, Anshuman Khandual wrote:
> This patch adds enumeration for all available SW branch filters
> in powerpc book3s code and also streamlines the lookup of the
> SW branch filter entries when working out which branch filters
> can be supported in SW.

This appears to patch code that was only added in 8/10?

Was there any reason not to do it the right way from the beginning?

cheers

^ permalink raw reply

* Re: [PATCH V4 09/10] power8, perf: Change BHRB branch filter configuration
From: Michael Ellerman @ 2013-12-09  6:21 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, khandual
  Cc: mikey, ak, eranian, acme, sukadev, mingo
In-Reply-To: <1386153162-11225-10-git-send-email-khandual@linux.vnet.ibm.com>

On Wed, 2013-12-04 at 10:32:41 UTC, Anshuman Khandual wrote:
> Powerpc kernel now supports SW based branch filters for book3s systems with some
> specific requirements while dealing with HW supported branch filters in order to
> achieve overall OR semantics prevailing in perf branch stack sampling framework.
> This patch adapts the BHRB branch filter configuration to meet those protocols.
> POWER8 PMU does support 3 branch filters (out of which two are getting used in
> perf branch stack) which are mutually exclusive and cannot be ORed with each
> other. This implies that PMU can only handle one HW based branch filter request
> at any point of time. For all other combinations PMU will pass it on to the SW.
> 
> Also the combination of PERF_SAMPLE_BRANCH_ANY_CALL and PERF_SAMPLE_BRANCH_COND
> can now be handled in SW, hence we don't error them out anymore.
> 
> diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
> index 03c5b8d..6021349 100644
> --- a/arch/powerpc/perf/power8-pmu.c
> +++ b/arch/powerpc/perf/power8-pmu.c
> @@ -561,7 +561,56 @@ static int power8_generic_events[] = {
>  
>  static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *filter_mask)
>  {
> -	u64 pmu_bhrb_filter = 0;
> +	u64 x, tmp, pmu_bhrb_filter = 0;
> +	*filter_mask = 0;
> +
> +	/* No branch filter requested */
> +	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) {
> +		*filter_mask = PERF_SAMPLE_BRANCH_ANY;
> +		return pmu_bhrb_filter;
> +	}
> +
> +	/*
> +	 * P8 does not support oring of PMU HW branch filters. Hence
> +	 * if multiple branch filters are requested which includes filters
> +	 * supported in PMU, still go ahead and clear the PMU based HW branch
> +	 * filter component as in this case all the filters will be processed
> + 	 * in SW.

Leading space there.

> +	 */
> +	tmp = branch_sample_type;
> +
> +	/* Remove privilege filters before comparison */
> +	tmp &= ~PERF_SAMPLE_BRANCH_USER;
> +	tmp &= ~PERF_SAMPLE_BRANCH_KERNEL;
> +	tmp &= ~PERF_SAMPLE_BRANCH_HV;
> +
> +	for_each_branch_sample_type(x) {
> +		/* Ignore privilege requests */
> +		if ((x == PERF_SAMPLE_BRANCH_USER) || (x == PERF_SAMPLE_BRANCH_KERNEL) || (x == PERF_SAMPLE_BRANCH_HV))
> +			continue;
> +
> +		if (!(tmp & x))
> +			continue;
> +
> +               /* Supported HW PMU filters */
> +		if (tmp & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +			tmp &= ~PERF_SAMPLE_BRANCH_ANY_CALL;
> +			if (tmp) {
> +				pmu_bhrb_filter = 0;
> +				*filter_mask = 0;
> +				return pmu_bhrb_filter;
> +			}
> +		}
> +
> +		if (tmp & PERF_SAMPLE_BRANCH_COND) {
> +			tmp &= ~PERF_SAMPLE_BRANCH_COND;
> +			if (tmp) {
> +				pmu_bhrb_filter = 0;
> +				*filter_mask = 0;
> +				return pmu_bhrb_filter;
> +			}
> +		}
> +	}

>  
>  	/* BHRB and regular PMU events share the same privilege state
>  	 * filter configuration. BHRB is always recorded along with a
> @@ -570,34 +619,20 @@ static u64 power8_bhrb_filter_map(u64 branch_sample_type, u64 *filter_mask)
>  	 * PMU event, we ignore any separate BHRB specific request.
>  	 */
>  
> -	/* No branch filter requested */
> -	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
> -		return pmu_bhrb_filter;
> -
> -	/* Invalid branch filter options - HW does not support */
> -	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
> -		return -1;
> -
> -	if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
> -		return -1;
> -
> +	/* Supported individual branch filters */
>  	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
>  		pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
> +		*filter_mask    |= PERF_SAMPLE_BRANCH_ANY_CALL;
>  		return pmu_bhrb_filter;
>  	}
>  
>  	if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) {
>  		pmu_bhrb_filter |= POWER8_MMCRA_IFM3;
> +		*filter_mask    |= PERF_SAMPLE_BRANCH_COND;
>  		return pmu_bhrb_filter;
>  	}
>  
> -	/* PMU does not support ANY combination of HW BHRB filters */
> -	if ((branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) &&
> -			(branch_sample_type & PERF_SAMPLE_BRANCH_COND))
> -		return -1;
> -
> -	/* Every thing else is unsupported */
> -	return -1;
> +	return pmu_bhrb_filter;
>  }


As I said in my comments on version 3 which you ignored:

    I think it would be clearer if we actually checked for the possibilities we
    allow and let everything else fall through, eg:

        /* Ignore user/kernel/hv bits */
        branch_sample_type &= ~PERF_SAMPLE_BRANCH_PLM_ALL;

        if (branch_sample_type == PERF_SAMPLE_BRANCH_ANY)
                return 0;

        if (branch_sample_type == PERF_SAMPLE_BRANCH_ANY_CALL)
                return POWER8_MMCRA_IFM1;
 
        if (branch_sample_type == PERF_SAMPLE_BRANCH_COND)
                return POWER8_MMCRA_IFM3;
        
        return -1;
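
As a standalone illustration of the exact-match approach above, here is a
minimal sketch that compiles outside the kernel. The EX_* constants and
ex_bhrb_filter_map() are hypothetical stand-ins, not the kernel's
definitions; in particular the two EX_MMCRA_* values are placeholders and
EX_FILTER_UNSUPPORTED plays the role of the "-1" return.

```c
#include <stdint.h>

/* Illustrative stand-ins only; the real values live in kernel headers. */
#define EX_BRANCH_USER		(1ULL << 0)
#define EX_BRANCH_KERNEL	(1ULL << 1)
#define EX_BRANCH_HV		(1ULL << 2)
#define EX_BRANCH_PLM_ALL	(EX_BRANCH_USER | EX_BRANCH_KERNEL | EX_BRANCH_HV)
#define EX_BRANCH_ANY		(1ULL << 3)
#define EX_BRANCH_ANY_CALL	(1ULL << 4)
#define EX_BRANCH_COND		(1ULL << 10)

#define EX_MMCRA_IFM1		0x1ULL		/* placeholder for POWER8_MMCRA_IFM1 */
#define EX_MMCRA_IFM3		0x3ULL		/* placeholder for POWER8_MMCRA_IFM3 */
#define EX_FILTER_UNSUPPORTED	UINT64_MAX	/* the "-1" case */

/* Map a requested branch_sample_type to a HW filter value. */
static uint64_t ex_bhrb_filter_map(uint64_t branch_sample_type)
{
	/* Ignore user/kernel/hv bits */
	branch_sample_type &= ~EX_BRANCH_PLM_ALL;

	/* Only exact, single-filter requests are accepted. */
	if (branch_sample_type == EX_BRANCH_ANY)
		return 0;

	if (branch_sample_type == EX_BRANCH_ANY_CALL)
		return EX_MMCRA_IFM1;

	if (branch_sample_type == EX_BRANCH_COND)
		return EX_MMCRA_IFM3;

	/* Everything else, including any combination, falls through. */
	return EX_FILTER_UNSUPPORTED;
}
```

With this shape a request for ANY_CALL together with COND reaches the final
return rather than silently selecting one of the two HW filters.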


cheers

^ permalink raw reply

* Re: [PATCH V4 08/10] powerpc, perf: Enable SW filtering in branch stack sampling framework
From: Michael Ellerman @ 2013-12-09  6:21 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, khandual
  Cc: mikey, ak, eranian, acme, sukadev, mingo
In-Reply-To: <1386153162-11225-9-git-send-email-khandual@linux.vnet.ibm.com>

On Wed, 2013-12-04 at 10:32:40 UTC, Anshuman Khandual wrote:
> This patch enables SW based post processing of BHRB captured branches
> to be able to meet more user defined branch filtration criteria in perf
> branch stack sampling framework. These changes increase the number of
> branch filters and their valid combinations on any powerpc64 server
> platform with BHRB support. Find the summary of code changes here.
> 
> (1) struct cpu_hw_events
> 
> 	Introduced two new variables to track various filter values and mask
> 
> 	(a) bhrb_sw_filter	Tracks SW implemented branch filter flags
> 	(b) filter_mask		Tracks both (SW and HW) branch filter flags

The name 'filter_mask' doesn't mean much to me. I'd rather it was 'bhrb_filter'.


> (2) Event creation
> 
> 	Kernel will figure out supported BHRB branch filters through a PMU call
> 	back 'bhrb_filter_map'. This function will find out how many of the
> 	requested branch filters can be supported in the PMU HW. It will not
> 	try to invalidate any branch filter combinations. Event creation will not
> 	error out because of lack of HW based branch filters. Meanwhile it will
> 	track the overall supported branch filters in the "filter_mask" variable.
> 
> 	Once the PMU call back returns kernel will process the user branch filter
> 	request against available SW filters while looking at the "filter_mask".
> 	During this phase all the branch filters which are still pending from the
> 	user requested list will have to be supported in SW failing which the
> 	event creation will error out.
> 
> (3) SW branch filter
> 
> 	During the BHRB data capture inside the PMU interrupt context, each
> 	of the captured 'perf_branch_entry.from' will be checked for compliance
> 	with applicable SW branch filters. If the entry does not conform to the
> 	filter requirements, it will be discarded from the final perf branch
> 	stack buffer.
> 
> (4) Supported SW based branch filters
> 
> 	(a) PERF_SAMPLE_BRANCH_ANY_RETURN
> 	(b) PERF_SAMPLE_BRANCH_IND_CALL
> 	(c) PERF_SAMPLE_BRANCH_ANY_CALL
> 	(d) PERF_SAMPLE_BRANCH_COND
> 
> 	Please refer to the patch to understand the classification of instructions into
> 	these branch filter categories.
> 
> (5) Multiple branch filter semantics
> 
> 	Book3S server implementation follows the same OR semantics (as implemented in
> 	x86) while dealing with multiple branch filters at any point of time. SW
> 	branch filter analysis is carried on the data set captured in the PMU HW.
> 	So the resulting set of data (after applying the SW filters) will inherently
> 	be an AND with the HW captured set. Hence any combination of HW and SW branch
> 	filters will be invalid. HW based branch filters are more efficient and faster
> 	compared to SW implemented branch filters. So at first the PMU should decide
> 	whether it can support all the requested branch filters itself or not. In case
> 	it can support all the branch filters in an OR manner, we don't apply any SW
> 	branch filter on top of the HW captured set (which is the final set). This
> 	preserves the OR semantic of multiple branch filters as required. But in case
> 	where the PMU cannot support all the requested branch filters in an OR manner,
> 	it should not apply any of its filters and leave it up to the SW to handle them
> 	all. It's the PMU code's responsibility to uphold this protocol to be able to
> 	conform to the overall OR semantic of perf branch stack sampling framework.


I'd prefer this level of commentary was in a block comment in the code. It's
much more likely to be seen by a future hacker than here in the commit log.
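
To make the AND-versus-OR point in the quoted section (5) concrete, the
following userspace sketch counts how many entries of a small trace survive
each scheme. Everything prefixed EX_/ex_ is made up for illustration, and
the trace is hypothetical; a real BHRB entry is classified by decoding the
instruction, not by a pre-tagged class.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_ANY_CALL	(1u << 0)
#define EX_COND		(1u << 1)

/* Hypothetical capture: each entry is tagged with the class it belongs to. */
static const uint32_t ex_trace[] = {
	EX_ANY_CALL, EX_COND, EX_ANY_CALL, 0, EX_COND,
};
#define EX_TRACE_LEN (sizeof(ex_trace) / sizeof(ex_trace[0]))

/* OR semantics: keep an entry if it matches any requested filter. */
static size_t ex_count_or(const uint32_t *classes, size_t n, uint32_t filter)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++)
		if (classes[i] & filter)
			kept++;
	return kept;
}

/*
 * HW filter first, SW filter on whatever the HW let through. An entry
 * must pass both stages, so the result is an AND of the two filters.
 */
static size_t ex_count_hw_then_sw(const uint32_t *classes, size_t n,
				  uint32_t hw, uint32_t sw)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++)
		if ((classes[i] & hw) && (classes[i] & sw))
			kept++;
	return kept;
}
```

On this trace, ex_count_or() with both filters keeps four of the five
entries, while ex_count_hw_then_sw() with ANY_CALL in HW and COND in SW
keeps none, which is why the HW filter has to be dropped once more than
one filter is requested.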


> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 2de7d48..54d39a5 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -48,6 +48,8 @@ struct cpu_hw_events {
>  
>  	/* BHRB bits */
>  	u64				bhrb_hw_filter;	/* BHRB HW branch filter */
> +	u64				bhrb_sw_filter;	/* BHRB SW branch filter */
> +	u64				filter_mask;	/* Branch filter mask */
>  	int				bhrb_users;
>  	void				*bhrb_context;
>  	struct	perf_branch_stack	bhrb_stack;
> @@ -400,6 +402,228 @@ static __u64 power_pmu_bhrb_to(u64 addr)
>  	return target - (unsigned long)&instr + addr;
>  }
>  
> +/*
> + * Instruction opcode analysis
> + *
> + * Analyse instruction opcodes and classify them
> + * into various branch filter options available.
> + * This follows the standard semantics of OR which
> + * means that instructions which conforms to `any`
> + * of the requested branch filters get picked up.
> + */
> +static bool validate_instruction(unsigned int *addr, u64 bhrb_sw_filter)
> +{

"validate" is not a good name here. That implies that this routine identifies
"valid" and "invalid" instructions - but that's not really correct.

Also it's preferable to not use the same variable name for the local as for the
cpuhw->bhrb_sw_filter global. Although technically it doesn't shadow the global
it can still be confusing to a human, ie. me. A good name for the local would
just be "sw_filter" because we know in this code that we're dealing with the
BHRB.


> +	bool result = false;
> +
> +	if (bhrb_sw_filter & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> +
> +		/* XL-form instruction */
> +		if (instr_is_branch_xlform(*addr)) {
> +
> +			/* LR should not be set */
> +			if (!is_branch_link_set(*addr)) {
> +				/*
> +			 	 * Conditional and unconditional
> +			 	 * branch to LR register.
> +			 	 */
> +				if (is_xlform_lr(*addr))
> +					result = true;
> +			}
> +		}
> +	}

is_xlform_lr() implies instr_is_branch_xlform(), and once you get a hit you can
short-circuit and exit the function, so this should boil down to just:

	if (bhrb_sw_filter & PERF_SAMPLE_BRANCH_ANY_RETURN)
		if (is_xlform_lr(*addr) && !is_branch_link_set(*addr))
			return true;


Having said that I think it should move into a routine in code-patching as I
said in the comments to the previous patch.
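
For reference, a self-contained sketch of what such a code-patching routine
could look like. The name ex_instr_is_return_branch() and the EX_* macros
are hypothetical; the bclr opcode value matches XL_FORM_LR from patch 7/10,
and the mask keeps only the primary and extended opcode fields.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_XL_MASK	0xFC0007FEu	/* primary opcode + extended opcode */
#define EX_XL_BCLR	0x4C000020u	/* bclr, same value as XL_FORM_LR */
#define EX_BRANCH_LK	0x00000001u	/* link bit, same as BRANCH_SET_LINK */

/*
 * A "return" is taken here to be any branch to LR (conditional or not)
 * that does not itself set LR.
 */
static bool ex_instr_is_return_branch(uint32_t instr)
{
	return (instr & EX_XL_MASK) == EX_XL_BCLR &&
	       !(instr & EX_BRANCH_LK);
}
```

A plain blr (0x4E800020) and a conditional beqlr (0x4D820020) are accepted,
while blrl (0x4E800021) and bctr (0x4E800420) are not.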


> +
> +	if (bhrb_sw_filter & PERF_SAMPLE_BRANCH_IND_CALL) {
> +		/* XL-form instruction */
> +		if (instr_is_branch_xlform(*addr)) {
> +
> +			/* LR should be set */
> +			if (is_branch_link_set(*addr)) {
> +				/*
> +			 	 * Conditional and unconditional
> +			 	 * branch to CTR.
> +			 	 */
> +				if (is_xlform_ctr(*addr))
> +					result = true;
> +
> +				/*
> +			 	 * Conditional and unconditional
> +			 	 * branch to LR.
> +			 	 */
> +				if (is_xlform_lr(*addr))
> +					result = true;
> +
> +				/*
> +			 	 * Conditional and unconditional
> +			 	 * branch to TAR.
> +			 	 */
> +				if (is_xlform_tar(*addr))
> +					result = true;

What other kind of XL-Form branch is there?

> +			}
> +		}
> +	}

The comments above all have a bogus leading space.

> +
> +	/* Any-form branch */
> +	if (bhrb_sw_filter & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +		/* LR should be set */
> +		if (is_branch_link_set(*addr))
> +			result = true;

Short circuit.

> +	}
> +
> +	if (bhrb_sw_filter & PERF_SAMPLE_BRANCH_COND) {
> +
> +		/* I-form instruction - excluded */
> +		if (instr_is_branch_iform(*addr))
> +			goto out;
> +
> +		/* B-form or XL-form instruction */
> +		if (instr_is_branch_bform(*addr) || instr_is_branch_xlform(*addr))  {
> +
> +			/* Not branch always  */
> +			if (!is_bo_always(*addr)) {
> +
> +				/* Conditional branch to CTR register */
> +				if (is_bo_ctr(*addr))
> +					goto out;

We might have discussed this but why not?

> +
> +				/* CR[BI] conditional branch with static hint */

A conditional branch with a static hint is still a conditional branch?

> +				if (is_bo_crbi_off(*addr) || is_bo_crbi_on(*addr)) {
> +					if (is_bo_crbi_hint(*addr))
> +						goto out;
> +				}
> +
> +				result = true;
> +			}
> +		}
> +	}
> +out:
> +	return result;
> +}
> +
> +static bool check_instruction(u64 addr, u64 bhrb_sw_filter)
> +{


"check" is not a very descriptive name here, especially when "check" calls
"validate".

"filter" is also not good because a filter keeps some things and rejects others,
and the directionality is not clear.

I'd suggest "filter_selects_branch()" or just "keep_branch()".


> +	unsigned int instr;
> +	bool ret;
> +
> +	if (bhrb_sw_filter == 0)
> +		return true;
> +
> +	if (is_kernel_addr(addr)) {
> +		ret = validate_instruction((unsigned int *) addr, bhrb_sw_filter);

No reason not to return directly here.

That would then remove the need for an else block.

> +	} else {
> +		/*
> +		 * Userspace address needs to be
> +		 * copied first before analysis.
> +		 */
> +		pagefault_disable();
> +		ret =  __get_user_inatomic(instr, (unsigned int __user *)addr);

I suspect you borrowed this incantation from the callchain code. Unlike that
code you don't fall back to reading the page tables directly.

I'd rather see the accessor in the callchain code made generic and have you
call it here.

> +
> +		/*
> +		 * If the instruction could not be accessible
> +		 * from user space, we still 'okay' the entry.
> +		 */
> +		if (ret) {
> +			pagefault_enable();
> +			return true;
> +		}
> +		pagefault_enable();
> +		ret = validate_instruction(&instr, bhrb_sw_filter);

No reason not to return directly here.

> +	}
> +	return ret;
> +}
> +
> +/*
> + * Validate whether all requested branch filters
> + * are getting processed either in the PMU or in SW.
> + */
> +static int match_filters(u64 branch_sample_type, u64 filter_mask)

I don't really understand why we have this routine?

We should implement the filter in HW if we can, or in SW. Which filters can't we
implement in SW?

> +{
> +	u64 x;
> +
> +	if (filter_mask == PERF_SAMPLE_BRANCH_ANY)
> +		return true;
> +
> +	for_each_branch_sample_type(x) {
> +		if (!(branch_sample_type & x))
> +			continue;
> +		/*
> +		 * Privilege filter requests have been already
> +		 * taken care during the base PMU configuration.
> +		 */
> +		if (x == PERF_SAMPLE_BRANCH_USER)
> +			continue;
> +		if (x == PERF_SAMPLE_BRANCH_KERNEL)
> +			continue;
> +		if (x == PERF_SAMPLE_BRANCH_HV)
> +			continue;
> +
> +		/*
> +		 * Requested filter not available either
> +		 * in PMU or in SW.
> +		 */
> +		if (!(filter_mask & x))
> +			return false;
> +	}
> +	return true;
> +}
> +
> +/*
> + * Required SW based branch filters
> + *
> + * This is called after figuring out what all branch filters the
> + * PMU HW supports for the requested branch filter set. Here we
> + * will go through all the SW implemented branch filters one by
> + * one and pick them up if its not already supported in the PMU.
> + */
> +static u64 branch_filter_map(u64 branch_sample_type, u64 pmu_bhrb_filter,
> +			     					u64 *filter_mask)

Whitespace is foobar here ^

This function deals exclusively with the software filter IIUC, but the name
doesn't indicate that in any way.

As far as the logic goes, you return the software filter value, as well as
mutating the *filter_mask. And in all cases you make the same modification to
both. That seems very dubious.

Shouldn't this routine just set up the software filter, and leave the upper
level code to deal with the HW & SW filter values?
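
One possible shape for that split, as a userspace sketch: the routine
computes only the SW filter and does not touch any shared mask. The
function name, the EX_* flags and the choice to pass the HW-handled filters
as perf-style flags (rather than as an MMCRA value) are all assumptions
made for the example.

```c
#include <stdint.h>

/* Illustrative stand-ins for the perf branch sample flags. */
#define EX_BRANCH_USER		(1ULL << 0)
#define EX_BRANCH_KERNEL	(1ULL << 1)
#define EX_BRANCH_HV		(1ULL << 2)
#define EX_BRANCH_PLM_ALL	(EX_BRANCH_USER | EX_BRANCH_KERNEL | EX_BRANCH_HV)
#define EX_BRANCH_ANY		(1ULL << 3)
#define EX_BRANCH_ANY_CALL	(1ULL << 4)
#define EX_BRANCH_ANY_RETURN	(1ULL << 5)
#define EX_BRANCH_IND_CALL	(1ULL << 6)
#define EX_BRANCH_COND		(1ULL << 10)

/* Filters that have a SW implementation in this series. */
#define EX_SW_SUPPORTED	(EX_BRANCH_ANY_CALL | EX_BRANCH_ANY_RETURN | \
			 EX_BRANCH_IND_CALL | EX_BRANCH_COND)

/*
 * Return the filters that SW must apply: everything requested that the
 * HW is not already handling. Privilege bits are ignored and "any"
 * means no filtering at all.
 */
static uint64_t ex_sw_filter_for(uint64_t requested, uint64_t hw_handled)
{
	requested &= ~EX_BRANCH_PLM_ALL;

	if (requested & EX_BRANCH_ANY)
		return 0;

	return requested & ~hw_handled & EX_SW_SUPPORTED;
}
```

The caller can then compare the union of the HW-handled flags and the
returned SW flags against the request to decide whether event creation
should fail.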

> +{
> +	u64 branch_sw_filter = 0;
> +
> +	/* No branch filter requested */
> +	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY) {
> +		WARN_ON(pmu_bhrb_filter != 0);
> +		WARN_ON(*filter_mask != PERF_SAMPLE_BRANCH_ANY);
> +		return branch_sw_filter;
> +	}
> +
> +	/*
> +	 * PMU supported branch filters must also be implemented in SW
> +	 * in the event when the PMU is unable to process them for some
> +	 * reason. This all those branch filters can be satisfied with
> +	 * SW implemented filters. But right now, there is now way to
> +	 * initimate the user about this decision.

Please proofread these comments, I don't entirely follow this one.

You say "must also be implemented in SW" - but I think it's actually "must be
implemented in SW", ie. the HW is not "also" implementing the filter.

You say "in the event when" but I think you just mean "when" - the word "event"
has a particular meaning in this code so you should only use it for that if at
all possible.

I don't follow "This all those".

You should just drop the last sentence, there is never going to be any way to
notify the user that their filter is implemented in HW vs SW, that's an
implementation detail.

> +	 */
> +	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +		if (!(pmu_bhrb_filter & PERF_SAMPLE_BRANCH_ANY_CALL)) {
> +			branch_sw_filter |= PERF_SAMPLE_BRANCH_ANY_CALL;
> +			*filter_mask |= PERF_SAMPLE_BRANCH_ANY_CALL;
> +		}
> +	}
> +
> +	if (branch_sample_type & PERF_SAMPLE_BRANCH_COND) {
> +		if (!(pmu_bhrb_filter & PERF_SAMPLE_BRANCH_COND)) {
> +			branch_sw_filter |= PERF_SAMPLE_BRANCH_COND;
> +			*filter_mask |= PERF_SAMPLE_BRANCH_COND;
> +		}
> +	}
> +
> +	if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> +		if (!(pmu_bhrb_filter & PERF_SAMPLE_BRANCH_ANY_RETURN)) {
> +			branch_sw_filter |= PERF_SAMPLE_BRANCH_ANY_RETURN;
> +			*filter_mask |= PERF_SAMPLE_BRANCH_ANY_RETURN;
> +		}
> +	}
> +
> +	if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL) {
> +		if (!(pmu_bhrb_filter & PERF_SAMPLE_BRANCH_IND_CALL)) {
> +			branch_sw_filter |= PERF_SAMPLE_BRANCH_IND_CALL;
> +			*filter_mask |= PERF_SAMPLE_BRANCH_IND_CALL;
> +		}
> +	}
> +
> +	return branch_sw_filter;
> +}
> +
>  /* Processing BHRB entries */
>  void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
>  {
> @@ -459,17 +683,29 @@ void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
>  					addr = 0;
>  				}
>  				cpuhw->bhrb_entries[u_index].from = addr;
> +
> +				if (!check_instruction(cpuhw->
> +						bhrb_entries[u_index].from,
> +							cpuhw->bhrb_sw_filter))
> +					u_index--;
>  			} else {
>  				/* Branches to immediate field 
>  				   (ie I or B form) */
>  				cpuhw->bhrb_entries[u_index].from = addr;
> -				cpuhw->bhrb_entries[u_index].to =
> -					power_pmu_bhrb_to(addr);
> -				cpuhw->bhrb_entries[u_index].mispred = pred;
> -				cpuhw->bhrb_entries[u_index].predicted = ~pred;
> +				if (check_instruction(cpuhw->
> +						bhrb_entries[u_index].from,
> +						cpuhw->bhrb_sw_filter)) {
> +					cpuhw->bhrb_entries[u_index].
> +						to = power_pmu_bhrb_to(addr);
> +					cpuhw->bhrb_entries[u_index].
> +						mispred = pred;
> +					cpuhw->bhrb_entries[u_index].
> +						predicted = ~pred;
> +				} else {
> +					u_index--;
> +				}
>  			}
>  			u_index++;


This code was already in need of some unindentation, and now it's just
ridiculous.

To start with at the beginning of this routine we have:

while (..) {
	if (!val)
		break;
	else {
		// Bulk of the logic
		...
	}
}

That should almost always become:

while (..) {
	if (!val)
		break;

	// Bulk of the logic
	...
}


But in this case that's not enough. Please send a precursor patch which moves
this logic out into a helper function.
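
Roughly the structure I have in mind, as a simplified sketch that builds in
userspace. struct ex_entry, ex_process_raw() and ex_read_entries() are
hypothetical, and the per-entry decoding is reduced to masking off the two
low bits, which is far less than the real BHRB handling does.

```c
#include <stddef.h>
#include <stdint.h>

struct ex_entry {
	uint64_t from;
	uint64_t to;
};

/*
 * Hypothetical helper: decode one raw value into *out.
 * Returns 1 if an entry was produced, 0 if the value should be skipped.
 */
static int ex_process_raw(uint64_t val, struct ex_entry *out)
{
	uint64_t addr = val & ~(uint64_t)3;	/* drop the two flag bits */

	if (!addr)
		return 0;

	out->from = addr;
	out->to = 0;
	return 1;
}

/* The loop itself stays flat: break early, otherwise call the helper. */
static size_t ex_read_entries(const uint64_t *raw, size_t n,
			      struct ex_entry *out)
{
	size_t r_index, u_index = 0;

	for (r_index = 0; r_index < n; r_index++) {
		if (!raw[r_index])
			break;		/* first empty slot terminates */

		if (ex_process_raw(raw[r_index], &out[u_index]))
			u_index++;
	}
	return u_index;
}
```

Because the index only advances when the helper accepts an entry, there is
no need for the decrement-then-increment dance on u_index.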


> -
>  		}
>  	}
>  	cpuhw->bhrb_stack.nr = u_index;
> @@ -1255,7 +1491,11 @@ nocheck:
>  	if (has_branch_stack(event)) {
>  		power_pmu_bhrb_enable(event);
>  		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map(
> -					event->attr.branch_sample_type);
> +					event->attr.branch_sample_type,
> +					&cpuhw->filter_mask);
> +		cpuhw->bhrb_sw_filter = branch_filter_map
> +					(event->attr.branch_sample_type,
> +					cpuhw->bhrb_hw_filter, &cpuhw->filter_mask);
>  	}
>  
>  	perf_pmu_enable(event->pmu);
> @@ -1637,10 +1877,16 @@ static int power_pmu_event_init(struct perf_event *event)
>  	err = power_check_constraints(cpuhw, events, cflags, n + 1);
>  
>  	if (has_branch_stack(event)) {
> -		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map(
> -					event->attr.branch_sample_type);
> -
> -		if(cpuhw->bhrb_hw_filter == -1)
> +		cpuhw->bhrb_hw_filter = ppmu->bhrb_filter_map
> +				(event->attr.branch_sample_type,
> +				&cpuhw->filter_mask);
> +		cpuhw->bhrb_sw_filter = branch_filter_map
> +				(event->attr.branch_sample_type,
> +				cpuhw->bhrb_hw_filter,
> +				&cpuhw->filter_mask);
> +
> +		if(!match_filters(event->attr.branch_sample_type,
> +						cpuhw->filter_mask))
>  			return -EOPNOTSUPP;

The above two hunks look too similar for my liking.


cheers

^ permalink raw reply

* Re: [PATCH V4 07/10] powerpc, lib: Add new branch instruction analysis support functions
From: Michael Ellerman @ 2013-12-09  6:21 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, khandual
  Cc: mikey, ak, eranian, acme, sukadev, mingo
In-Reply-To: <1386153162-11225-8-git-send-email-khandual@linux.vnet.ibm.com>

On Wed, 2013-12-04 at 10:32:39 UTC, Anshuman Khandual wrote:
> Generic powerpc branch instruction analysis support added in the code
> patching library which will help the subsequent patch on SW based
> filtering of branch records in perf. This patch also converts and
> exports some of the existing local static functions through the header
> file to be used else where.
> 
> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index a6f8c7a..8bab417 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -22,6 +22,36 @@
>  #define BRANCH_SET_LINK	0x1
>  #define BRANCH_ABSOLUTE	0x2
>  
> +#define XL_FORM_LR  0x4C000020
> +#define XL_FORM_CTR 0x4C000420
> +#define XL_FORM_TAR 0x4C000460
> +
> +#define BO_ALWAYS    0x02800000
> +#define BO_CTR       0x02000000
> +#define BO_CRBI_OFF  0x00800000
> +#define BO_CRBI_ON   0x01800000
> +#define BO_CRBI_HINT 0x00400000
> +
> +/* Forms of branch instruction */
> +int instr_is_branch_iform(unsigned int instr);
> +int instr_is_branch_bform(unsigned int instr);
> +int instr_is_branch_xlform(unsigned int instr);
> +
> +/* Classification of XL-form instruction */
> +int is_xlform_lr(unsigned int instr);
> +int is_xlform_ctr(unsigned int instr);
> +int is_xlform_tar(unsigned int instr);
> +
> +/* Branch instruction is a call */
> +int is_branch_link_set(unsigned int instr);
> +
> +/* BO field analysis (B-form or XL-form) */
> +int is_bo_always(unsigned int instr);
> +int is_bo_ctr(unsigned int instr);
> +int is_bo_crbi_off(unsigned int instr);
> +int is_bo_crbi_on(unsigned int instr);
> +int is_bo_crbi_hint(unsigned int instr);


I think this is the wrong API.

We end up with all these micro checks, which don't actually encapsulate much,
and don't implement the logic perf needs. If we had another user for this level
of detail then it might make sense, but for a single user I think we're better
off just implementing the semantics it wants.

So that would be something more like:

bool instr_is_return_branch(unsigned int instr);
bool instr_is_conditional_branch(unsigned int instr);
bool instr_is_func_call(unsigned int instr);
bool instr_is_indirect_func_call(unsigned int instr);


These would then encapsulate something like the logic in your 8/10 patch. You
can hopefully also optimise the checking logic in each routine because you know
the exact semantics you're implementing.
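
A minimal sketch of three of those routines, written so it compiles on its
own. All ex_* names and EX_* macros are hypothetical; the opcode values
match the XL_FORM_* and BO_ALWAYS definitions in the quoted hunk. Treating
CTR-decrementing and hinted branches as conditional is an assumption of the
sketch, not something the series settles.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_OPCODE_MASK	0xFC000000u
#define EX_OP_BFORM	0x40000000u	/* bc, primary opcode 16 */
#define EX_OP_IFORM	0x48000000u	/* b, primary opcode 18 */
#define EX_XL_MASK	0xFC0007FEu	/* primary + extended opcode */
#define EX_XL_BCLR	0x4C000020u
#define EX_XL_BCCTR	0x4C000420u
#define EX_XL_BCTAR	0x4C000460u
#define EX_BO_ALWAYS	0x02800000u	/* BO = 1z1zz, branch always */
#define EX_LK		0x00000001u	/* link bit */

/* Branch to LR, CTR or TAR. */
static bool ex_is_xl_branch(uint32_t instr)
{
	uint32_t xl = instr & EX_XL_MASK;

	return xl == EX_XL_BCLR || xl == EX_XL_BCCTR || xl == EX_XL_BCTAR;
}

static bool ex_is_branch(uint32_t instr)
{
	uint32_t op = instr & EX_OPCODE_MASK;

	return op == EX_OP_BFORM || op == EX_OP_IFORM ||
	       ex_is_xl_branch(instr);
}

/* Any branch that sets LR is treated as a function call. */
static bool ex_instr_is_func_call(uint32_t instr)
{
	return ex_is_branch(instr) && (instr & EX_LK);
}

/* A call through a register: XL-form with the link bit set. */
static bool ex_instr_is_indirect_func_call(uint32_t instr)
{
	return ex_is_xl_branch(instr) && (instr & EX_LK);
}

/* B-form or XL-form branch whose BO field is not "branch always". */
static bool ex_instr_is_conditional_branch(uint32_t instr)
{
	bool has_bo = (instr & EX_OPCODE_MASK) == EX_OP_BFORM ||
		      ex_is_xl_branch(instr);

	return has_bo && (instr & EX_BO_ALWAYS) != EX_BO_ALWAYS;
}
```

For example bl (0x48000001) is a call but not an indirect one, bctrl
(0x4E800421) is both, and beq (0x41820008) is the only one of the three
that counts as conditional.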

cheers

^ permalink raw reply

* Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
From: Benjamin Herrenschmidt @ 2013-12-09  6:19 UTC (permalink / raw)
  To: Liu ping fan; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <CAFgQCTtJh1-2wuLueREbA6ru5cje4EcxN5yT+cU-YNSrZi9O=Q@mail.gmail.com>

On Mon, 2013-12-09 at 14:17 +0800, Liu ping fan wrote:
> On Mon, Dec 9, 2013 at 8:31 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Thu, 2013-12-05 at 16:23 +0530, Aneesh Kumar K.V wrote:
> >> Liu Ping Fan <kernelfans@gmail.com> writes:
> >>
> >> > To enable the do_numa_page(), we should not fix _PAGE_NUMA in
> >> > hash_page(), so bail out for the case of pte_numa().
> >
> > For some reason I don't have 2/3 and 3/3 in my mbox (though I do have
> > them on patchwork) so I'll reply to this one.
> >
> > Overall, your statement that this is a faster path needs to be backed up
> > with numbers.
> >
> > The code is complicated enough as it is, such additional mess in the low
> > level hashing code requires a good justification, and also a
> > demonstration that it doesn't add overhead to the normal hash path.
> >
> For the test, is it OK to have a user application copy pages where
> all pages are PG_mlocked?

If that specific scenario is relevant in practice, then yes, though also
demonstrate the lack of regression with some more normal path such as a
kernel compile.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH][RESEND] powerpc: remove unused REDBOOT Kconfig parameter
From: Benjamin Herrenschmidt @ 2013-12-09  6:17 UTC (permalink / raw)
  To: Michael Opdenacker; +Cc: marcelo, linux-kernel, paulus, linuxppc-dev
In-Reply-To: <1386566860-19785-1-git-send-email-michael.opdenacker@free-electrons.com>

On Mon, 2013-12-09 at 06:27 +0100, Michael Opdenacker wrote:
> This removes the REDBOOT Kconfig parameter,
> which was no longer used anywhere in the source code
> and Makefiles.

It hasn't been lost :-) It's still in patchwork and it's even in my
queue.

Cheers,
Ben.

> Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
> ---
>  arch/powerpc/Kconfig                | 3 ---
>  arch/powerpc/platforms/83xx/Kconfig | 1 -
>  arch/powerpc/platforms/8xx/Kconfig  | 1 -
>  3 files changed, 5 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index b44b52c0a8f0..70dc283050b5 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -209,9 +209,6 @@ config DEFAULT_UIMAGE
>  	  Used to allow a board to specify it wants a uImage built by default
>  	default n
>  
> -config REDBOOT
> -	bool
> -
>  config ARCH_HIBERNATION_POSSIBLE
>  	bool
>  	default y
> diff --git a/arch/powerpc/platforms/83xx/Kconfig b/arch/powerpc/platforms/83xx/Kconfig
> index 670a033264c0..2bdc8c862c46 100644
> --- a/arch/powerpc/platforms/83xx/Kconfig
> +++ b/arch/powerpc/platforms/83xx/Kconfig
> @@ -99,7 +99,6 @@ config SBC834x
>  config ASP834x
>  	bool "Analogue & Micro ASP 834x"
>  	select PPC_MPC834x
> -	select REDBOOT
>  	help
>  	  This enables support for the Analogue & Micro ASP 83xx
>  	  board.
> diff --git a/arch/powerpc/platforms/8xx/Kconfig b/arch/powerpc/platforms/8xx/Kconfig
> index 8dec3c0911ad..bd6f1a1cf922 100644
> --- a/arch/powerpc/platforms/8xx/Kconfig
> +++ b/arch/powerpc/platforms/8xx/Kconfig
> @@ -45,7 +45,6 @@ config PPC_EP88XC
>  config PPC_ADDER875
>  	bool "Analogue & Micro Adder 875"
>  	select CPM1
> -	select REDBOOT
>  	help
>  	  This enables support for the Analogue & Micro Adder 875
>  	  board.

^ permalink raw reply

* Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
From: Liu ping fan @ 2013-12-09  6:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1386549112.5159.4.camel@pasglop>

On Mon, Dec 9, 2013 at 8:31 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Thu, 2013-12-05 at 16:23 +0530, Aneesh Kumar K.V wrote:
>> Liu Ping Fan <kernelfans@gmail.com> writes:
>>
>> > To enable the do_numa_page(), we should not fix _PAGE_NUMA in
>> > hash_page(), so bail out for the case of pte_numa().
>
> For some reason I don't have 2/3 and 3/3 in my mbox (though I do have
> them on patchwork) so I'll reply to this one.
>
> Overall, your statement that this is a faster path needs to be backed up
> with numbers.
>
> The code is complicated enough as it is, such additional mess in the low
> level hashing code requires a good justification, and also a
> demonstration that it doesn't add overhead to the normal hash path.
>
For the test, is it OK to have a user application copy pages where
all pages are PG_mlocked?

Thanks and regards,
Pingfan

^ permalink raw reply

