LinuxPPC-Dev Archive on lore.kernel.org
* Re: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Adam Zilkie @ 2009-09-03 16:04 UTC
  To: benh; +Cc: Tom Burns, Chris Pringle, Andrea Zypchen, linuxppc-dev
In-Reply-To: <1251971849.15089.28.camel@pasglop>

Ben,

Thanks for your info.

Are you sure there is L2 cache on the 440?

I am seeing this problem with our custom IDE driver which is based on
pretty old code. Our driver uses pci_alloc_consistent() to allocate the
physical DMA memory and alloc_pages() to allocate a virtual page. It
then uses pci_map_sg() to map to a scatter/gather buffer. Perhaps I
should convert these to the DMA API calls as you suggest.

Regards,
Adam

On Thu, 2009-09-03 at 19:57 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2009-09-03 at 09:05 +0100, Chris Pringle wrote:
> > Hi Adam,
> > 
> > If you have a look in include/asm-ppc/pgtable.h for the following section:
> > #ifdef CONFIG_44x
> > #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_GUARDED)
> > #else
> > #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED)
> > #endif
> > 
> > Try adding _PAGE_COHERENT to the appropriate line above and see if that 
> > fixes your issue - this causes the 'M' bit to be set on the page which 
> > should enforce cache coherency. If it doesn't, you'll need to check the 
> > 'M' bit isn't being masked out in head_44x.S (it was originally masked 
> > out on arch/powerpc, but was fixed in later kernels when the cache 
> > coherency issues with non-SMP systems were resolved).
> 
> I have some doubts about the usefulness of doing that for 4xx. AFAIK,
> the 440 core just ignores M.
> 
> The problem lies probably elsewhere. Maybe the L2 cache coherency isn't
> enabled or not working ?
> 
> The L1 cache on 440 is simply not coherent, so drivers have to make sure
> they use the appropriate DMA APIs which will do cache flushing when
> needed.
> 
> Adam, what driver is causing you that sort of problems ?
> 
> Cheers,
> Ben.
> 
> 
-- 
Adam Zilkie
Software Designer,
International Datacasting Corp.


* Re: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Josh Boyer @ 2009-09-03 16:21 UTC
  To: Adam Zilkie; +Cc: Chris Pringle, Andrea Zypchen, linuxppc-dev, Tom Burns
In-Reply-To: <1251993890.2548.14.camel@Adam>

On Thu, Sep 03, 2009 at 12:04:50PM -0400, Adam Zilkie wrote:
>Ben,
>
>Thanks for your info.
>
>Are you sure there is L2 cache on the 440?

It depends on which 440 SoC you have.  It also depends on that being 
configured in the kernel even if it does exist.

>I am seeing this problem with our custom IDE driver which is based on
>pretty old code. Our driver uses pci_alloc_consistent() to allocate the
>physical DMA memory and alloc_pages() to allocate a virtual page. It
>then uses pci_map_sg() to map to a scatter/gather buffer. Perhaps I
>should convert these to the DMA API calls as you suggest.

I would suggest updating the code.  I have no idea if that is the problem,
but it should probably be done anyway.

josh


* Re: [RFC] net/fs_enet: send a reset request to the PHY on init
From: Grant Likely @ 2009-09-03 16:48 UTC
  To: Sebastian Andrzej Siewior; +Cc: linuxppc-dev, netdev, Vitaly Bordug
In-Reply-To: <20090902110410.GC15401@www.tglx.de>

On Wed, Sep 2, 2009 at 5:04 AM, Sebastian Andrzej
Siewior<bigeasy@linutronix.de> wrote:
> Usually u-boot sends a phy request in its network init routine. An uboot
> without network support doesn't do it and I end up without working
> network. I still can switch between 10/100Mbit (according to the LED on
> the hub and phy registers) but I can't send or receive any data.
>
> At this point I'm not sure if the PowerON Reset takes the PHY a few
> nsecs too early out of reset or if this reset is required and everyone
> relies on U-boot performing this reset.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> This is done on a custom mpc512x board. Unfortunately I don't have other
> boards to check. The PHY is a AMD Am79C874, phylib uses the generic one.
>
>  drivers/net/fs_enet/fs_enet-main.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
> index ee15402..a3c962b 100644
> --- a/drivers/net/fs_enet/fs_enet-main.c
> +++ b/drivers/net/fs_enet/fs_enet-main.c
> @@ -823,7 +823,8 @@ static int fs_init_phy(struct net_device *dev)
>        }
>
>        fep->phydev = phydev;
> -
> +       phy_write(phydev, MII_BMCR, BMCR_RESET);
> +       udelay(1);

What version of the kernel are you using?  The line numbers don't
match up with kernel mainline, so I wonder if this is before or after
the OF MDIO rework changes.

Regardless, this doesn't look right.  It certainly isn't right for the
driver to do an unconditional PHY reset when it doesn't actually know
what phy is attached.  For most boards I'm sure this is not desirable
because it will cause a delay while the PHY auto negotiates.
Depending on when the first network traffic begins, this can cause
several seconds of boot delay.

Best would be to do this in U-Boot.  Otherwise, I think I would rather
see it at phy_device probe time.  At least then it would be on a
per-phy basis, or could be controlled by a property in the device tree
so that all boards don't get the same impact.
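
As a rough illustration of the per-PHY, opt-in reset described above, here is
a user-space sketch. MII_BMCR and BMCR_RESET use the standard MII register
values; everything else (struct fake_phy, its needs_reset flag standing in for
a device tree property, and phy_reset_if_needed()) is hypothetical and is not
phylib API. It only models the idea: skip the reset unless the board asks for
it, and poll the self-clearing reset bit instead of relying on a fixed delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MII_BMCR   0x00     /* Basic mode control register */
#define BMCR_RESET 0x8000   /* Self-clearing software reset bit */

/* Hypothetical stand-in for a PHY; not a phylib structure. */
struct fake_phy {
	uint16_t bmcr;            /* current BMCR contents */
	int polls_until_ready;    /* reads that still see BMCR_RESET set */
	bool needs_reset;         /* hypothetical per-board opt-in flag */
};

static uint16_t fake_phy_read(struct fake_phy *phy, int reg)
{
	assert(reg == MII_BMCR);
	if (phy->bmcr & BMCR_RESET) {
		if (phy->polls_until_ready > 0)
			phy->polls_until_ready--;
		else
			phy->bmcr &= (uint16_t)~BMCR_RESET; /* reset finished */
	}
	return phy->bmcr;
}

static void fake_phy_write(struct fake_phy *phy, int reg, uint16_t val)
{
	assert(reg == MII_BMCR);
	phy->bmcr = val;
}

/* Returns 0 if the reset was skipped or completed, -1 on timeout. */
static int phy_reset_if_needed(struct fake_phy *phy, int max_polls)
{
	if (!phy->needs_reset)
		return 0;

	fake_phy_write(phy, MII_BMCR, BMCR_RESET);
	for (int i = 0; i < max_polls; i++) {
		if (!(fake_phy_read(phy, MII_BMCR) & BMCR_RESET))
			return 0;
	}
	return -1;
}
```

Boards that leave the flag clear pay no reset or renegotiation cost, which is
the point of making the behaviour per-PHY rather than unconditional.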

g.

-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.


* [Patch 0/6] PPC64-HWBKPT: Hardware Breakpoint interfaces - ver IX
From: K.Prasad @ 2009-09-03 18:39 UTC
  To: David Gibson, linuxppc-dev
  Cc: paulus, Michael Neuling, Benjamin Herrenschmidt, Alan Stern,
	Roland McGrath

Hi All,
	Please find a new set of patches with the changes as listed below.

These patches have to be applied over the set of patches sent to LKML here:
http://lkml.org/lkml/2009/8/28/272 that enable per-cpu breakpoint support and
a few new APIs.

Changelog - ver IX
-------------------
- Invocation of user-defined callback will be 'trigger-after-execute' (except
  for ptrace).
- Creation of a new global per-CPU breakpoint structure to help invocation of
  user-defined callback from single-step handler.
- Validation before registration will fail only if the address does not match
  the kernel symbol's (if specified) resolved address
  (through kallsyms_lookup_name()).
- 'symbolsize' value is expected to be within the range contained by the symbol's
  starting address and the end of a double-word boundary (8 Bytes).
- PPC64's arch-dependent code is now aware of 'cpumask' in 'struct hw_breakpoint'
  and can accommodate requests for a subset of CPUs in the system.
- Introduced arch_disable_hw_breakpoint() required for
  <enable><disable>_hw_breakpoint() APIs.
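
The 'symbolsize' constraint in the changelog above can be sketched in
user-space C as follows. The two constants mirror the values in the posted
hw_breakpoint.h; the helper name symbolsize_fits() is made up for this
illustration and does not appear in the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as defined in the posted asm/hw_breakpoint.h */
#define HW_BREAKPOINT_ALIGN 0x7UL
#define HW_BREAKPOINT_LEN   0x8UL

/*
 * Hypothetical helper: true when a symbol of 'symbolsize' bytes starting
 * at 'address' ends at or before the next double-word boundary, i.e. the
 * whole symbol lies inside the 8-byte window a single DABR can watch.
 */
static bool symbolsize_fits(unsigned long address, unsigned long symbolsize)
{
	unsigned long offset = address & HW_BREAKPOINT_ALIGN;

	assert(offset < HW_BREAKPOINT_LEN);
	return symbolsize <= HW_BREAKPOINT_LEN - offset;
}
```

For example, a 4-byte symbol at an address ending in 0x4 fits, while a 5-byte
symbol at the same address would cross the boundary and be rejected.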

Kindly let me know your comments on the same.

Thanks,
K.Prasad

Changelog - ver VIII
-------------------
- Reverting changes to allow one-shot breakpoints only for ptrace requests.
- Minor changes in sanity checking in arch_validate_hwbkpt_settings().
- put_cpu_no_resched() is no longer available. Converted to put_cpu().

Changelog - ver VII
-------------------
- Allow the one-shot behaviour for exception handlers to be defined by the user.
  A new 'is_one_shot' flag is added to 'struct arch_hw_breakpoint'.

Changelog - ver VI
------------------
The task of identifying 'genuine' breakpoint exceptions from those caused by
'out-of-range' accesses turned out to be more tricky than originally thought.
Some changes to this effect were made in version IV of this patchset, but they
were not sufficient for user-space. Basically the breakpoint address received
through ptrace is always aligned to 8-bytes since ptrace receives an encoded
'data' (consisting of address | translation_enable | bkpt_type), and the size of
the symbol is not known. However for kernel-space addresses, the symbol-size can
be determined using kallsyms_lookup_size_offset() and this is used to check if
DAR (in the exception context) is
'bkpt_address <= DAR <= (bkpt_address + symbol_size)', failing which we treat
it as a stray exception.

The following changes are made to enable check:
- Addition of a 'symbolsize' field in 'struct arch_hw_breakpoint'.
- Store the size of the 'watched' kernel symbol into 'symbolsize' field in
  arch_store_info() routine.
- Verify if the above described condition is true when is_one_shot is FALSE in
  hw_breakpoint_handler().
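
A minimal user-space sketch of the stray-exception test described above,
assuming the inclusive bounds exactly as written in this changelog. The
function name dar_hits_symbol() is hypothetical; in the patch the comparison
is open-coded in hw_breakpoint_handler().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when the faulting data address (DAR) lies in
 * the range occupied by the watched symbol. Both bounds are inclusive,
 * matching 'bkpt_address <= DAR <= (bkpt_address + symbol_size)'.
 */
static bool dar_hits_symbol(unsigned long bkpt_address,
			    unsigned long symbol_size,
			    unsigned long dar)
{
	assert(bkpt_address + symbol_size >= bkpt_address); /* no wrap-around */
	return bkpt_address <= dar && dar <= bkpt_address + symbol_size;
}
```

An access elsewhere in the same double-word (for instance at offset 0 when the
symbol starts at offset 4) fails this test and would be consumed as noise.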

Changelog - ver V
------------------
- Breakpoint requests from ptrace (for user-space) are designed to be one-shot
in PPC64. The patch contains changes to retain this behaviour by returning early
in hw_breakpoint_handler() [without re-initialising DABR] and unregistering the
user-space request in ptrace_triggered(). It is safe to make an
unregister_user_hw_breakpoint() call from the breakpoint exception context
[through ptrace_triggered()] without giving rise to circular locking-dependency.
This is because there can be no kernel code running on the CPU (which received
the exception) with the same spinlock held.

- Minor change in 'type' member of 'struct arch_hw_breakpoint' from u8 to 'int'.

Changelog - ver IV
------------------
- While DABR register requires double-word (8 bytes) aligned addresses, i.e.
the breakpoint is active over a range of 8 bytes, PPC64 allows byte-level
addressability. This may lead to stray exceptions which have to be ignored in
hw_breakpoint_handler(), when DAR != (Breakpoint request address). However DABR
will be populated with the requested breakpoint address aligned to the previous
double-word address. The code is now modified to store user-requested address
in 'bp->info.address' but update the DABR with a double-word aligned address.

- Please note that the Data Breakpoint facility in Xmon is broken as of 2.6.29
and the same has not been integrated into this facility as described in Ver I.
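
The split between the user-requested address and the value written to DABR
can be illustrated with a small user-space sketch. The bit definitions mirror
those in asm/reg.h and the posted hw_breakpoint.h; encode_dabr() is a
hypothetical name for the expression the patch open-codes.

```c
#include <assert.h>

/* Bit definitions as found in asm/reg.h and the posted hw_breakpoint.h */
#define DABR_TRANSLATION    (1UL << 2)
#define DABR_DATA_WRITE     (1UL << 1)
#define DABR_DATA_READ      (1UL << 0)
#define HW_BREAKPOINT_ALIGN 0x7UL

/*
 * Hypothetical helper: build the DABR value for a requested address.
 * The low three bits of the address are dropped (rounding down to the
 * previous double-word) and reused for the type and translation flags.
 */
static unsigned long encode_dabr(unsigned long address, unsigned long type)
{
	assert((type & ~(DABR_DATA_READ | DABR_DATA_WRITE)) == 0);
	return (address & ~HW_BREAKPOINT_ALIGN) | type | DABR_TRANSLATION;
}
```

A write breakpoint requested at 0x1005 therefore programs the register with
0x1006: base 0x1000, plus the write and translation bits. The unaligned
0x1005 is what stays in 'bp->info.address' for the later DAR comparison.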

Changelog - ver III
------------------
- Patches are based on commit 08f16e060bf54bdc34f800ed8b5362cdeda75d8b of -tip
  tree.
- The declarations in arch/powerpc/include/asm/hw_breakpoint.h are done only if
  CONFIG_PPC64 is defined. This eliminates the need to conditionally include this
  header file.
- load_debug_registers() is done in start_secondary() i.e. during CPU initialisation.
- arch_check_va_<> routines in hw_breakpoint.c are now replaced with a much
  simpler is_kernel_addr() check in arch_validate_hwbkpt_settings()
- Return code of hw_breakpoint_handler() when triggered due to Lazy debug
  register switching is now changed to NOTIFY_STOP.
- The ptrace code no longer sets the TIF_DEBUG task flag as it is proposed to
  be done in register_user_hw_breakpoint() routine.
- hw_breakpoint_handler() is now modified to use hbp_kernel_pos value to
  determine if the trigger was a user/kernel space address. The DAR register
  value is checked with the address stored in 'struct hw_breakpoint' to avoid
  handling of exceptions that belong to kprobe/Xmon.

Changelog - ver II
------------------
- Split the monolithic patch into six logical patches
- Changed the signature of arch_check_va_in_<user><kernel>space functions. They
  are now marked static.
- HB_NUM is now called HBP_NUM (to preserve a consistent short-name
  convention)
- Introduced hw_breakpoint_disable() and changes to kexec code to disable
  breakpoints before a reboot.
- Minor changes in ptrace code to use macro-defined constants instead of
  numbers.
- Introduced a new constant definition INSTRUCTION_LEN in reg.h


* [Patch 1/6] PPC64-HWBKPT: Prepare the PowerPC platform for HW Breakpoint infrastructure
From: K.Prasad @ 2009-09-03 18:40 UTC
  To: David Gibson, linuxppc-dev
  Cc: Michael Neuling, Benjamin Herrenschmidt, paulus, Alan Stern,
	K.Prasad, Roland McGrath
In-Reply-To: <20090903183306.875398457@xyz>

Prepare the PowerPC code for HW Breakpoint infrastructure patches by including
relevant constant definitions and function declarations.

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/hw_breakpoint.h |   61 +++++++++++++++++++++++++++++++
 arch/powerpc/include/asm/processor.h     |    1 
 arch/powerpc/include/asm/reg.h           |    3 +
 arch/powerpc/include/asm/thread_info.h   |    2 +
 4 files changed, 67 insertions(+)

Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/hw_breakpoint.h
===================================================================
--- /dev/null
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/hw_breakpoint.h
@@ -0,0 +1,61 @@
+#ifndef	_PPC64_HW_BREAKPOINT_H
+#define	_PPC64_HW_BREAKPOINT_H
+
+#ifdef	__KERNEL__
+#define	__ARCH_HW_BREAKPOINT_H
+#ifdef CONFIG_PPC64
+
+struct arch_hw_breakpoint {
+	int		type;
+	char		*name; /* Contains name of the symbol to set bkpt */
+	unsigned long	address;
+	unsigned long	symbolsize;
+};
+
+#include <linux/kdebug.h>
+#include <asm/reg.h>
+#include <asm-generic/hw_breakpoint.h>
+
+#define HW_BREAKPOINT_READ DABR_DATA_READ
+#define HW_BREAKPOINT_WRITE DABR_DATA_WRITE
+#define HW_BREAKPOINT_RW (DABR_DATA_READ | DABR_DATA_WRITE)
+
+#define HW_BREAKPOINT_ALIGN 0x7
+/* Maximum permissible length of any HW Breakpoint */
+#define HW_BREAKPOINT_LEN 0x8
+
+extern struct hw_breakpoint *hbp_kernel[HBP_NUM];
+DECLARE_PER_CPU(struct hw_breakpoint*, this_hbp_kernel[HBP_NUM]);
+extern unsigned int hbp_user_refcount[HBP_NUM];
+
+extern void arch_install_thread_hw_breakpoint(struct task_struct *tsk);
+extern void arch_uninstall_thread_hw_breakpoint(void);
+extern int arch_validate_hwbkpt_settings(struct hw_breakpoint *bp,
+						struct task_struct *tsk);
+extern void arch_update_user_hw_breakpoint(int pos, struct task_struct *tsk);
+extern void arch_flush_thread_hw_breakpoint(struct task_struct *tsk);
+extern void arch_update_kernel_hw_breakpoint(void *);
+extern void arch_disable_hw_breakpoint(void);
+extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
+				     unsigned long val, void *data);
+
+extern void flush_thread_hw_breakpoint(struct task_struct *tsk);
+extern int copy_thread_hw_breakpoint(struct task_struct *tsk,
+		struct task_struct *child, unsigned long clone_flags);
+extern void load_debug_registers(void);
+extern void ptrace_triggered(struct hw_breakpoint *bp, struct pt_regs *regs);
+
+static inline void hw_breakpoint_disable(void)
+{
+	set_dabr(0);
+}
+
+#else
+static inline void hw_breakpoint_disable(void)
+{
+	/* Function is defined only on PPC64 for now */
+}
+#endif	/* CONFIG_PPC64 */
+#endif	/* __KERNEL__ */
+#endif	/* _PPC64_HW_BREAKPOINT_H */
+
Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/processor.h
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/include/asm/processor.h
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/processor.h
@@ -177,6 +177,7 @@ struct thread_struct {
 #ifdef CONFIG_PPC64
 	unsigned long	start_tb;	/* Start purr when proc switched in */
 	unsigned long	accum_tb;	/* Total accumilated purr for process */
+	struct hw_breakpoint *hbp[HBP_NUM];
 #endif
 	unsigned long	dabr;		/* Data address breakpoint register */
 #ifdef CONFIG_ALTIVEC
Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/reg.h
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/include/asm/reg.h
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/reg.h
@@ -26,6 +26,8 @@
 #include <asm/reg_8xx.h>
 #endif /* CONFIG_8xx */
 
+#define INSTRUCTION_LEN	4		/* Length of any instruction */
+
 #define MSR_SF_LG	63              /* Enable 64 bit mode */
 #define MSR_ISF_LG	61              /* Interrupt 64b mode valid on 630 */
 #define MSR_HV_LG 	60              /* Hypervisor state */
@@ -184,6 +186,7 @@
 #define   CTRL_TE	0x00c00000	/* thread enable */
 #define   CTRL_RUNLATCH	0x1
 #define SPRN_DABR	0x3F5	/* Data Address Breakpoint Register */
+#define   HBP_NUM	1	/* Number of physical HW breakpoint registers */
 #define   DABR_TRANSLATION	(1UL << 2)
 #define   DABR_DATA_WRITE	(1UL << 1)
 #define   DABR_DATA_READ	(1UL << 0)
Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/thread_info.h
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/include/asm/thread_info.h
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/include/asm/thread_info.h
@@ -112,6 +112,7 @@ static inline struct thread_info *curren
 #define TIF_FREEZE		14	/* Freezing for suspend */
 #define TIF_RUNLATCH		15	/* Is the runlatch enabled? */
 #define TIF_ABI_PENDING		16	/* 32/64 bit switch needed */
+#define TIF_DEBUG		17	/* uses debug registers */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -130,6 +131,7 @@ static inline struct thread_info *curren
 #define _TIF_FREEZE		(1<<TIF_FREEZE)
 #define _TIF_RUNLATCH		(1<<TIF_RUNLATCH)
 #define _TIF_ABI_PENDING	(1<<TIF_ABI_PENDING)
+#define _TIF_DEBUG		(1<<TIF_DEBUG)
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP)
 
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \


* [Patch 2/6] PPC64-HWBKPT: Introduce PPC64 specific Hardware Breakpoint interfaces
From: K.Prasad @ 2009-09-03 18:40 UTC
  To: David Gibson, linuxppc-dev
  Cc: Michael Neuling, Benjamin Herrenschmidt, paulus, Alan Stern,
	K.Prasad, Roland McGrath
In-Reply-To: <20090903183306.875398457@xyz>

Introduce PPC64 implementation for the generic hardware breakpoint interfaces
defined in kernel/hw_breakpoint.c. Enable the HAVE_HW_BREAKPOINT flag and the
Makefile.

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig                |    1 
 arch/powerpc/kernel/Makefile        |    2 
 arch/powerpc/kernel/hw_breakpoint.c |  342 ++++++++++++++++++++++++++++++++++++
 arch/powerpc/kernel/ptrace.c        |    4 
 4 files changed, 348 insertions(+), 1 deletion(-)

Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/Kconfig
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/Kconfig
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/Kconfig
@@ -126,6 +126,7 @@ config PPC
 	select HAVE_SYSCALL_WRAPPERS if PPC64
 	select GENERIC_ATOMIC64 if PPC32
 	select HAVE_PERF_COUNTERS
+	select HAVE_HW_BREAKPOINT if PPC64
 
 config EARLY_PRINTK
 	bool
Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/Makefile
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/kernel/Makefile
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/Makefile
@@ -35,7 +35,7 @@ obj-$(CONFIG_PPC64)		+= setup_64.o sys_p
 				   signal_64.o ptrace32.o \
 				   paca.o cpu_setup_ppc970.o \
 				   cpu_setup_pa6t.o \
-				   firmware.o nvram_64.o
+				   firmware.o nvram_64.o hw_breakpoint.o
 obj64-$(CONFIG_RELOCATABLE)	+= reloc_64.o
 obj-$(CONFIG_PPC64)		+= vdso64/
 obj-$(CONFIG_ALTIVEC)		+= vecemu.o
Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/hw_breakpoint.c
===================================================================
--- /dev/null
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/hw_breakpoint.c
@@ -0,0 +1,342 @@
+/*
+ * HW_breakpoint: a unified kernel/user-space hardware breakpoint facility,
+ * using the CPU's debug registers.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright 2009 IBM Corporation
+ */
+
+#include <linux/notifier.h>
+#include <linux/kallsyms.h>
+#include <linux/kprobes.h>
+#include <linux/percpu.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/smp.h>
+
+#include <asm/hw_breakpoint.h>
+#include <asm/processor.h>
+#include <asm/sstep.h>
+
+/* Store the kernel-space breakpoint address value */
+static unsigned long kdabr;
+
+/*
+ * Temporarily stores address for DABR before it is written by the
+ * single-step handler routine
+ */
+static DEFINE_PER_CPU(unsigned long, dabr_data);
+static DEFINE_PER_CPU(struct hw_breakpoint*, last_hit_bp);
+
+/* Disable breakpoints on the physical debug register */
+void arch_disable_hw_breakpoint(void)
+{
+	set_dabr(0);
+}
+
+void arch_update_kernel_hw_breakpoint(void *unused)
+{
+	struct hw_breakpoint *bp;
+
+	/* Check if there is nothing to update */
+	if (hbp_kernel_pos == HBP_NUM)
+		return;
+
+	bp = per_cpu(this_hbp_kernel[hbp_kernel_pos], get_cpu());
+	if (bp == NULL)
+		kdabr = 0;
+	else
+		kdabr = (bp->info.address & ~HW_BREAKPOINT_ALIGN) |
+			bp->info.type | DABR_TRANSLATION;
+	set_dabr(kdabr);
+	put_cpu();
+}
+
+/*
+ * Install the thread breakpoints in their debug registers.
+ */
+void arch_install_thread_hw_breakpoint(struct task_struct *tsk)
+{
+	set_dabr(tsk->thread.dabr);
+}
+
+/*
+ * Clear the DABR which contains the thread-specific breakpoint address
+ */
+void arch_uninstall_thread_hw_breakpoint(void)
+{
+	set_dabr(0);
+}
+
+/*
+ * Store a breakpoint's encoded address, length, and type.
+ */
+int arch_store_info(struct hw_breakpoint *bp, struct task_struct *tsk)
+{
+	unsigned long sym_addr;
+
+	/* Symbol names from user-space are rejected */
+	if (tsk) {
+		if (bp->info.name)
+			return -EINVAL;
+		return 0;
+	}
+	/*
+	 * User-space requests will always have the address field populated
+	 * For kernel-addresses, either the address or symbol name can be
+	 * specified.
+	 */
+	if (bp->info.name) {
+		sym_addr = (unsigned long)kallsyms_lookup_name(bp->info.name);
+		if (bp->info.address) {
+			if (bp->info.address != sym_addr)
+				return -EINVAL;
+		} else
+			bp->info.address = sym_addr;
+	}
+	if (!bp->info.address)
+		return -EINVAL;
+	/*
+	 * Determine the symbolsize if not already specified.
+	 * Reject the breakpoint request if symbolsize is found
+	 * to be greater than HW_BREAKPOINT_LEN
+	 */
+	if (!bp->info.symbolsize) {
+		if (!kallsyms_lookup_size_offset(bp->info.address,
+					&(bp->info.symbolsize), NULL))
+			return -EINVAL;
+	}
+	if (bp->info.symbolsize <= HW_BREAKPOINT_LEN)
+		return 0;
+	return -EINVAL;
+}
+
+/*
+ * Validate the arch-specific HW Breakpoint register settings
+ */
+int arch_validate_hwbkpt_settings(struct hw_breakpoint *bp,
+						struct task_struct *tsk)
+{
+	int is_kernel, ret = -EINVAL;
+
+	if (!bp)
+		return ret;
+
+	/* User-space breakpoints cannot be restricted to a subset of CPUs */
+	if (tsk && bp->cpumask)
+		return ret;
+
+	switch (bp->info.type) {
+	case HW_BREAKPOINT_READ:
+	case HW_BREAKPOINT_WRITE:
+	case HW_BREAKPOINT_RW:
+		break;
+	default:
+		return ret;
+	}
+
+	if (!bp->triggered)
+		return -EINVAL;
+
+	ret = arch_store_info(bp, tsk);
+	is_kernel = is_kernel_addr(bp->info.address);
+	if ((tsk && is_kernel) || (!tsk && !is_kernel))
+		return -EINVAL;
+
+	/*
+	 * Since breakpoint length can be a maximum of HW_BREAKPOINT_LEN(8)
+	 * and breakpoint addresses are aligned to nearest double-word
+	 * HW_BREAKPOINT_ALIGN by rounding off to the lower address, the
+	 * 'symbolsize' should satisfy the check below.
+	 */
+	if (bp->info.symbolsize >
+	    (HW_BREAKPOINT_LEN - (bp->info.address & HW_BREAKPOINT_ALIGN)))
+		return -EINVAL;
+
+	return ret;
+}
+
+void arch_update_user_hw_breakpoint(int pos, struct task_struct *tsk)
+{
+	struct thread_struct *thread = &(tsk->thread);
+	struct hw_breakpoint *bp = thread->hbp[0];
+
+	if (bp)
+		thread->dabr = (bp->info.address & ~HW_BREAKPOINT_ALIGN) |
+				bp->info.type | DABR_TRANSLATION;
+	else
+		thread->dabr = 0;
+}
+
+void arch_flush_thread_hw_breakpoint(struct task_struct *tsk)
+{
+	struct thread_struct *thread = &(tsk->thread);
+
+	thread->dabr = 0;
+}
+
+/*
+ * Handle debug exception notifications.
+ */
+int __kprobes hw_breakpoint_handler(struct die_args *args)
+{
+	int rc = NOTIFY_STOP;
+	struct hw_breakpoint *bp;
+	struct pt_regs *regs = args->regs;
+	unsigned long dar = regs->dar;
+	int cpu, is_kernel, stepped = 1;
+
+	is_kernel = (hbp_kernel_pos == HBP_NUM) ? 0 : 1;
+
+	/* Disable breakpoints during exception handling */
+	set_dabr(0);
+
+	cpu = get_cpu();
+	/* Determine whether kernel- or user-space address is the trigger */
+	bp = is_kernel ?
+		per_cpu(this_hbp_kernel[0], cpu) : current->thread.hbp[0];
+	/*
+	 * bp can be NULL due to lazy debug register switching
+	 * or due to the delay between updates of hbp_kernel_pos
+	 * and this_hbp_kernel.
+	 */
+	if (!bp)
+		goto out;
+
+	per_cpu(dabr_data, cpu) = is_kernel ? kdabr : current->thread.dabr;
+
+	/* Verify if dar lies within the address range occupied by the symbol
+	 * being watched. Since we cannot get the symbol size for
+	 * user-space requests we skip this check in that case
+	 */
+	if (is_kernel &&
+	    !((bp->info.address <= dar) &&
+	     (dar <= (bp->info.address + bp->info.symbolsize))))
+		/*
+		 * This exception is triggered not because of a memory access on
+		 * the monitored variable but in the double-word address range
+		 * in which it is contained. We will consume this exception,
+		 * considering it as 'noise'.
+		 */
+		goto out;
+
+	/*
+	 * Return early after invoking user-callback function without restoring
+	 * DABR if the breakpoint is from ptrace which always operates in
+	 * one-shot mode
+	 */
+	if (bp->triggered == ptrace_triggered) {
+		(bp->triggered)(bp, regs);
+		rc = NOTIFY_DONE;
+		goto out;
+	}
+
+	stepped = emulate_step(regs, regs->nip);
+	/*
+	 * Single-step the causative instruction manually if
+	 * emulate_step() could not execute it
+	 */
+	if (stepped == 0) {
+		regs->msr |= MSR_SE;
+		per_cpu(last_hit_bp, cpu) = bp;
+		goto out;
+	}
+	(bp->triggered)(bp, regs);
+	set_dabr(per_cpu(dabr_data, cpu));
+
+out:
+	/* Enable pre-emption only if single-stepping is finished */
+	if (stepped) {
+		per_cpu(dabr_data, cpu) = 0;
+		put_cpu();
+	}
+	return rc;
+}
+
+/*
+ * Handle single-step exceptions following a DABR hit.
+ */
+int __kprobes single_step_dabr_instruction(struct die_args *args)
+{
+	struct pt_regs *regs = args->regs;
+	int cpu = get_cpu();
+	int ret = NOTIFY_DONE;
+	siginfo_t info;
+	unsigned long this_dabr_data = per_cpu(dabr_data, cpu);
+	struct hw_breakpoint *bp = per_cpu(last_hit_bp, cpu);
+
+	/*
+	 * Check if we are single-stepping as a result of a
+	 * previous HW Breakpoint exception
+	 */
+	if (this_dabr_data == 0)
+		goto out;
+
+	regs->msr &= ~MSR_SE;
+	/*
+	 * We shall invoke the user-defined callback function in the single
+	 * stepping handler to conform to 'trigger-after-execute' semantics
+	 */
+	(bp->triggered)(bp, regs);
+
+	/* Deliver signal to user-space */
+	if (this_dabr_data < TASK_SIZE) {
+		info.si_signo = SIGTRAP;
+		info.si_errno = 0;
+		info.si_code = TRAP_HWBKPT;
+		info.si_addr = (void __user *)(per_cpu(dabr_data, cpu));
+		force_sig_info(SIGTRAP, &info, current);
+	}
+
+	set_dabr(this_dabr_data);
+	per_cpu(dabr_data, cpu) = 0;
+	ret = NOTIFY_STOP;
+	/*
+	 * If single-stepped after hw_breakpoint_handler(), pre-emption is
+	 * already disabled.
+	 */
+	put_cpu();
+
+out:
+	/*
+	 * A put_cpu() call is required to complement the get_cpu()
+	 * call used initially
+	 */
+	put_cpu();
+	return ret;
+}
+
+/*
+ * Handle debug exception notifications.
+ */
+int __kprobes hw_breakpoint_exceptions_notify(
+		struct notifier_block *unused, unsigned long val, void *data)
+{
+	int ret = NOTIFY_DONE;
+
+	switch (val) {
+	case DIE_DABR_MATCH:
+		ret = hw_breakpoint_handler(data);
+		break;
+	case DIE_SSTEP:
+		ret = single_step_dabr_instruction(data);
+		break;
+	}
+
+	return ret;
+}
Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/ptrace.c
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/kernel/ptrace.c
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/ptrace.c
@@ -755,6 +755,10 @@ void user_disable_single_step(struct tas
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
+void ptrace_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
+{
+}
+
 int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 			       unsigned long data)
 {


* [Patch 3/6] PPC64-HWBKPT: Modify ptrace code to use Hardware Breakpoint interfaces
From: K.Prasad @ 2009-09-03 18:40 UTC
  To: David Gibson, linuxppc-dev
  Cc: Michael Neuling, Benjamin Herrenschmidt, paulus, Alan Stern,
	K.Prasad, Roland McGrath
In-Reply-To: <20090903183306.875398457@xyz>

Modify the ptrace code to use the hardware breakpoint interfaces for user-space.

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/ptrace.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

Index: linux-2.6-tip.hbkpt/arch/powerpc/kernel/ptrace.c
===================================================================
--- linux-2.6-tip.hbkpt.orig/arch/powerpc/kernel/ptrace.c
+++ linux-2.6-tip.hbkpt/arch/powerpc/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/system.h>
+#include <asm/hw_breakpoint.h>
 
 /*
  * does not yet catch signals sent when the child dies.
@@ -757,11 +758,24 @@ void user_disable_single_step(struct tas
 
 void ptrace_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
 {
+	/*
+	 * Unregister the breakpoint request here since ptrace has defined a
+	 * one-shot behaviour for breakpoint exceptions in PPC64.
+	 * The SIGTRAP signal is generated automatically for us in do_dabr().
+	 * We don't have to do anything here
+	 */
+	unregister_user_hw_breakpoint(current, bp);
+	kfree(bp);
 }
 
 int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 			       unsigned long data)
 {
+#ifdef CONFIG_PPC64
+	struct thread_struct *thread = &(task->thread);
+	struct hw_breakpoint *bp;
+	int ret;
+#endif
 	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
 	 *  For embedded processors we support one DAC and no IAC's at the
 	 *  moment.
@@ -791,6 +805,35 @@ int ptrace_set_debugreg(struct task_stru
 	if (data && !(data & DABR_TRANSLATION))
 		return -EIO;
 
+#ifdef CONFIG_PPC64
+	bp = thread->hbp[0];
+	if (data == 0) {
+		if (bp) {
+			unregister_user_hw_breakpoint(task, bp);
+			kfree(bp);
+		}
+		return 0;
+	}
+
+	if (bp) {
+		bp->info.type = data & HW_BREAKPOINT_RW;
+		task->thread.dabr = bp->info.address = data;
+		return modify_user_hw_breakpoint(task, bp);
+	}
+	bp = kzalloc(sizeof(struct hw_breakpoint), GFP_KERNEL);
+	if (!bp)
+		return -ENOMEM;
+
+	/* Store the type of breakpoint */
+	bp->info.type = data & HW_BREAKPOINT_RW;
+	bp->triggered = ptrace_triggered;
+	task->thread.dabr = bp->info.address = data;
+
+	ret = register_user_hw_breakpoint(task, bp);
+	if (ret)
+		return ret;
+#endif /* CONFIG_PPC64 */
+
 	/* Move contents to the DABR register */
 	task->thread.dabr = data;
 


* [Patch 4/6] PPC64-HWBKPT: Modify process/processor code to recognise hardware debug registers
From: K.Prasad @ 2009-09-03 18:40 UTC
  To: David Gibson, linuxppc-dev
  Cc: Michael Neuling, Benjamin Herrenschmidt, paulus, Alan Stern,
	K.Prasad, Roland McGrath
In-Reply-To: <20090903183306.875398457@xyz>

Modify process handling code to recognise hardware debug registers during copy
and flush operations. Introduce a new TIF_DEBUG task flag to indicate a
process's use of debug register. Load the debug register values into a
new CPU during initialisation.
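
As a rough illustration of the flag check described above (a user-space model, not the kernel code), the sketch below only touches the modelled debug register when the incoming task has a debug flag set. All `demo_` names and the `DEMO_TIF_DEBUG` bit are hypothetical stand-ins; what happens when switching to a task without the flag is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task flag, standing in for a "uses debug registers" bit. */
#define DEMO_TIF_DEBUG (1u << 0)

struct demo_task {
	unsigned flags;   /* task flags, only DEMO_TIF_DEBUG is modelled */
	uint64_t dabr;    /* breakpoint value this task wants installed */
};

/* Models the per-CPU debug register and counts how often it is written. */
static uint64_t demo_cpu_dabr;
static unsigned demo_install_count;

/* Install the incoming task's breakpoint only if it asked for one. */
static void demo_switch_to(const struct demo_task *next)
{
	assert(next != 0);
	if (next->flags & DEMO_TIF_DEBUG) {
		demo_cpu_dabr = next->dabr;
		demo_install_count++;
	}
}
```

Tasks that never set the flag leave `demo_install_count` untouched, which is the point of gating the work on a single flag test in the switch path.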

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/process.c |   15 +++++++++++++++
 arch/powerpc/kernel/smp.c     |    2 ++
 2 files changed, 17 insertions(+)

Index: linux-2.6-tip.hbkpt/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6-tip.hbkpt.orig/arch/powerpc/kernel/process.c
+++ linux-2.6-tip.hbkpt/arch/powerpc/kernel/process.c
@@ -50,6 +50,7 @@
 #include <asm/syscalls.h>
 #ifdef CONFIG_PPC64
 #include <asm/firmware.h>
+#include <asm/hw_breakpoint.h>
 #endif
 #include <linux/kprobes.h>
 #include <linux/kdebug.h>
@@ -254,8 +255,10 @@ void do_dabr(struct pt_regs *regs, unsig
 			11, SIGSEGV) == NOTIFY_STOP)
 		return;
 
+#ifndef CONFIG_PPC64
 	if (debugger_dabr_match(regs))
 		return;
+#endif
 
 	/* Clear the DAC and struct entries.  One shot trigger */
 #if defined(CONFIG_BOOKE)
@@ -372,8 +375,13 @@ struct task_struct *__switch_to(struct t
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_PPC64
+	if (unlikely(test_tsk_thread_flag(new, TIF_DEBUG)))
+		arch_install_thread_hw_breakpoint(new);
+#else
 	if (unlikely(__get_cpu_var(current_dabr) != new->thread.dabr))
 		set_dabr(new->thread.dabr);
+#endif /* CONFIG_PPC64 */
 
 #if defined(CONFIG_BOOKE)
 	/* If new thread DAC (HW breakpoint) is the same then leave it */
@@ -550,6 +558,10 @@ void show_regs(struct pt_regs * regs)
 void exit_thread(void)
 {
 	discard_lazy_cpu_state();
+#ifdef CONFIG_PPC64
+	if (unlikely(test_tsk_thread_flag(current, TIF_DEBUG)))
+		flush_thread_hw_breakpoint(current);
+#endif /* CONFIG_PPC64 */
 }
 
 void flush_thread(void)
@@ -672,6 +684,9 @@ int copy_thread(unsigned long clone_flag
 	 * function.
  	 */
 	kregs->nip = *((unsigned long *)ret_from_fork);
+
+	if (unlikely(test_tsk_thread_flag(current, TIF_DEBUG)))
+		copy_thread_hw_breakpoint(current, p, clone_flags);
 #else
 	kregs->nip = (unsigned long)ret_from_fork;
 #endif
Index: linux-2.6-tip.hbkpt/arch/powerpc/kernel/smp.c
===================================================================
--- linux-2.6-tip.hbkpt.orig/arch/powerpc/kernel/smp.c
+++ linux-2.6-tip.hbkpt/arch/powerpc/kernel/smp.c
@@ -48,6 +48,7 @@
 #include <asm/vdso_datapage.h>
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
+#include <asm/hw_breakpoint.h>
 #endif
 
 #ifdef DEBUG
@@ -537,6 +538,7 @@ int __devinit start_secondary(void *unus
 
 	local_irq_enable();
 
+	load_debug_registers();
 	cpu_idle();
 	return 0;
 }

^ permalink raw reply

* [Patch 5/6] PPC64-HWBKPT: Modify Data storage exception code to recognise DABR match first
From: K.Prasad @ 2009-09-03 18:40 UTC (permalink / raw)
  To: David Gibson, linuxppc-dev
  Cc: Michael Neuling, Benjamin Herrenschmidt, paulus, Alan Stern,
	K.Prasad, Roland McGrath
In-Reply-To: <20090903183306.875398457@xyz>

Modify Data storage exception code to first look for a DABR match before
recognising a kprobe or xmon exception.
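
A minimal user-space sketch of why the ordering matters, assuming a simplified fault path with only two checks. `DEMO_DABRMATCH` is a placeholder bit rather than the kernel's `DSISR_DABRMATCH` value, and the `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder "breakpoint matched" bit in the fault status word. */
#define DEMO_DABRMATCH 0x1u

enum demo_fault_result {
	DEMO_HANDLED_DABR,       /* reported as a data breakpoint hit */
	DEMO_BAD_KERNEL_ACCESS,  /* would be turned into SIGSEGV */
	DEMO_NORMAL_FAULT        /* continues down the ordinary fault path */
};

/*
 * Classify a fault. The breakpoint test comes first, so a kernel-mode
 * access to a kernel address that also matched the breakpoint is
 * reported as a breakpoint hit instead of a bad access.
 */
static enum demo_fault_result demo_classify(uint32_t error_code,
					    int user_mode,
					    int addr_in_kernel)
{
	assert(user_mode == 0 || user_mode == 1);
	if (error_code & DEMO_DABRMATCH)
		return DEMO_HANDLED_DABR;
	if (!user_mode && addr_in_kernel)
		return DEMO_BAD_KERNEL_ACCESS;
	return DEMO_NORMAL_FAULT;
}
```

With the two tests swapped, the first case in the comment would be classified as a bad kernel access and the breakpoint handler would never run.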

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Index: linux-2.6-tip.hbkpt/arch/powerpc/mm/fault.c
===================================================================
--- linux-2.6-tip.hbkpt.orig/arch/powerpc/mm/fault.c
+++ linux-2.6-tip.hbkpt/arch/powerpc/mm/fault.c
@@ -137,6 +137,12 @@ int __kprobes do_page_fault(struct pt_re
 		error_code &= 0x48200000;
 	else
 		is_write = error_code & DSISR_ISSTORE;
+
+	if (error_code & DSISR_DABRMATCH) {
+		/* DABR match */
+		do_dabr(regs, address, error_code);
+		return 0;
+	}
 #else
 	is_write = error_code & ESR_DST;
 #endif /* CONFIG_4xx || CONFIG_BOOKE */
@@ -151,14 +157,6 @@ int __kprobes do_page_fault(struct pt_re
 	if (!user_mode(regs) && (address >= TASK_SIZE))
 		return SIGSEGV;
 
-#if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
-  	if (error_code & DSISR_DABRMATCH) {
-		/* DABR match */
-		do_dabr(regs, address, error_code);
-		return 0;
-	}
-#endif /* !(CONFIG_4xx || CONFIG_BOOKE)*/
-
 	if (in_atomic() || mm == NULL) {
 		if (!user_mode(regs))
 			return SIGSEGV;

^ permalink raw reply

* [Patch 6/6] PPC64-HWBKPT: Adapt kexec and samples code to recognise PPC64 hw-breakpoint
From: K.Prasad @ 2009-09-03 18:41 UTC (permalink / raw)
  To: David Gibson, linuxppc-dev
  Cc: Michael Neuling, Benjamin Herrenschmidt, paulus, Alan Stern,
	K.Prasad, Roland McGrath
In-Reply-To: <20090903183306.875398457@xyz>

Modify kexec code to disable DABR registers before a reboot. Adapt the samples
code to populate PPC64-arch specific fields.

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/machine_kexec_64.c  |    3 +++
 samples/hw_breakpoint/data_breakpoint.c |    4 ++++
 2 files changed, 7 insertions(+)

Index: linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/machine_kexec_64.c
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/arch/powerpc/kernel/machine_kexec_64.c
+++ linux-2.6-tip.ppc64_hbkpt/arch/powerpc/kernel/machine_kexec_64.c
@@ -24,6 +24,7 @@
 #include <asm/sections.h>	/* _end */
 #include <asm/prom.h>
 #include <asm/smp.h>
+#include <asm/hw_breakpoint.h>
 
 int default_machine_kexec_prepare(struct kimage *image)
 {
@@ -214,6 +215,7 @@ static void kexec_prepare_cpus(void)
 	put_cpu();
 
 	local_irq_disable();
+	hw_breakpoint_disable();
 }
 
 #else /* ! SMP */
@@ -233,6 +235,7 @@ static void kexec_prepare_cpus(void)
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 0);
 	local_irq_disable();
+	hw_breakpoint_disable();
 }
 
 #endif /* SMP */
Index: linux-2.6-tip.ppc64_hbkpt/samples/hw_breakpoint/data_breakpoint.c
===================================================================
--- linux-2.6-tip.ppc64_hbkpt.orig/samples/hw_breakpoint/data_breakpoint.c
+++ linux-2.6-tip.ppc64_hbkpt/samples/hw_breakpoint/data_breakpoint.c
@@ -54,6 +54,10 @@ static int __init hw_break_module_init(v
 	sample_hbp.info.type = HW_BREAKPOINT_WRITE;
 	sample_hbp.info.len = HW_BREAKPOINT_LEN_4;
 #endif /* CONFIG_X86 */
+#ifdef CONFIG_PPC64
+	sample_hbp.info.name = ksym_name;
+	sample_hbp.info.type = HW_BREAKPOINT_WRITE;
+#endif /* CONFIG_PPC64 */
 
 	sample_hbp.triggered = (void *)sample_hbp_handler;
 

^ permalink raw reply

* RE: AW: PowerPC PCI DMA issues (prefetch/coherency?)
From: Prodyut Hazarika @ 2009-09-03 20:27 UTC (permalink / raw)
  To: azilkie, benh; +Cc: Tom Burns, Chris Pringle, Andrea Zypchen, linuxppc-dev
In-Reply-To: <1251993890.2548.14.camel@Adam>

Hi Adam,

> Are you sure there is L2 cache on the 440?

It depends on the SoC you are using. SoCs like the 460EX (Canyonlands board)
have an L2 cache.
It seems you are using a Sequoia board, which has a 440EPx SoC. The 440EPx
has a 440 CPU core, but no L2 cache.
Could you please tell me which SoC you are using?
You can also check the appropriate dts file to see whether there is an L2C
node. For example, canyonlands.dts (460EX-based board) has the L2C
entry:
        L2C0: l2c {
              ...
        }

> I am seeing this problem with our custom IDE driver which is based on
> pretty old code. Our driver uses pci_alloc_consistent() to allocate the
> physical DMA memory and alloc_pages() to allocate a virtual page. It
> then uses pci_map_sg() to map to a scatter/gather buffer. Perhaps I
> should convert these to the DMA API calls as you suggest.

Could you give more details on the consistency problem? It is a good
idea to change to the new DMA APIs, but pci_alloc_consistent() should
work too.

Thanks
Prodyut

On Thu, 2009-09-03 at 19:57 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2009-09-03 at 09:05 +0100, Chris Pringle wrote:
> > Hi Adam,
> >
> > If you have a look in include/asm-ppc/pgtable.h for the following section:
> > #ifdef CONFIG_44x
> > #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_GUARDED)
> > #else
> > #define _PAGE_BASE    (_PAGE_PRESENT | _PAGE_ACCESSED)
> > #endif
> >
> > Try adding _PAGE_COHERENT to the appropriate line above and see if that
> > fixes your issue - this causes the 'M' bit to be set on the page which
> > sure enforce cache coherency. If it doesn't, you'll need to check the
> > 'M' bit isn't being masked out in head_44x.S (it was originally masked
> > out on arch/powerpc, but was fixed in later kernels when the cache
> > coherency issues with non-SMP systems were resolved).
>
> I have some doubts about the usefulness of doing that for 4xx. AFAIK,
> the 440 core just ignores M.
>
> The problem lies probably elsewhere. Maybe the L2 cache coherency isn't
> enabled or not working ?
>
> The L1 cache on 440 is simply not coherent, so drivers have to make sure
> they use the appropriate DMA APIs which will do cache flushing when
> needed.
>
> Adam, what driver is causing you that sort of problems ?
>
> Cheers,
> Ben.
>
>
-- 
Adam Zilkie
Software Designer,
International Datacasting Corp.

^ permalink raw reply

* Re: [PATCH] powerpc: Fix i8259 interrupt driver kernel crash on ML510
From: Benjamin Herrenschmidt @ 2009-09-03 22:44 UTC (permalink / raw)
  To: Grant Likely; +Cc: torvalds, Roderick Colenbrander, linuxppc-dev, linux-kernel
In-Reply-To: <20090903155208.18009.73841.stgit@localhost.localdomain>

On Thu, 2009-09-03 at 09:57 -0600, Grant Likely wrote:
> From: Roderick Colenbrander <thunderbird2k@gmail.com>
> 
> This patch fixes a null pointer exception caused by removal of
> 'ack()' for level interrupts in the Xilinx interrupt driver.  A recent
> change to the xilinx interrupt controller removed the ack hook for
> level irqs.
> 
> Signed-off-by: Roderick Colenbrander <thunderbird2k@gmail.com>
> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
> ---

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

My git trees aren't at hand, so Linus, feel free to merge that directly.

Cheers,
Ben.

> 
> Hi Ben & Linus,
> 
> This is a last-minute bug fix that must go into 2.6.31.  This patch
> is needed to prevent a kernel panic on Xilinx ml510 boards.
> 
> I've also pushed the patch out to my git tree if you'd prefer to pull:
> 
> The following changes since commit 326ba5010a5429a5a528b268b36a5900d4ab0eba:
>   Linus Torvalds (1):
>         Linux 2.6.31-rc8
> 
> are available in the git repository at:
> 
>   git://git.secretlab.ca/git/linux-2.6 merge
> 
> Roderick Colenbrander (1):
>       powerpc: Fix i8259 interrupt driver kernel crash on ML510
> 
>  arch/powerpc/sysdev/xilinx_intc.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> 
> 
> diff --git a/arch/powerpc/sysdev/xilinx_intc.c b/arch/powerpc/sysdev/xilinx_intc.c
> index 3ee1fd3..40edad5 100644
> --- a/arch/powerpc/sysdev/xilinx_intc.c
> +++ b/arch/powerpc/sysdev/xilinx_intc.c
> @@ -234,7 +234,6 @@ static void xilinx_i8259_cascade(unsigned int irq, struct irq_desc *desc)
>  		generic_handle_irq(cascade_irq);
>  
>  	/* Let xilinx_intc end the interrupt */
> -	desc->chip->ack(irq);
>  	desc->chip->unmask(irq);
>  }
>  

^ permalink raw reply

* [0/3] Sanitize pagetable handling for hugepages
From: David Gibson @ 2009-09-04  7:14 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt

Currently, ordinary pages use one pagetable layout, and each different
hugepage size uses a slightly different variant layout.  A number of
places which need to walk the pagetable must first check the slice map
to see what the pagetable layout is, then handle the various different
forms.  New hardware, like Book3E, is liable to introduce more possible
variants.

This patch series, therefore, is designed to simplify the matter by
limiting knowledge of the pagetable layout to only the allocation
path.  With this patch, ordinary pages are handled as ever, with a
fixed 4 (or 3) level tree.  All other variants branch off from some
layer of that with a specially marked PGD/PUD/PMD pointer which also
contains enough information to interpret the directories below that
point.  This means that things walking the pagetables (without
allocating) don't need to look up the slice map, they can just step
down the tree in the usual way, branching off to the "non-standard
layout" path for hugepages, which uses the embedded information to
interpret the tree from that point on.
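
To make the "specially marked pointer" idea concrete, here is a small user-space sketch of a self-describing directory entry: a flag bit says "non-standard layout below", and a few more low bits carry a shift that tells the walker how to interpret the table it points to. The bit positions and every `demo_` name are assumptions made for illustration; the series' actual encoding may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: bit 0 marks the entry, bits 1..5 hold a shift. */
#define DEMO_HUGEPD_FLAG  UINT64_C(0x01)
#define DEMO_SHIFT_MASK   UINT64_C(0x3e)
#define DEMO_LOW_BITS     (DEMO_HUGEPD_FLAG | DEMO_SHIFT_MASK)

/* Build a marked entry from a 64-byte-aligned table address and a shift. */
static uint64_t demo_make_hugepd(uint64_t table_addr, unsigned shift)
{
	assert((table_addr & DEMO_LOW_BITS) == 0);  /* low bits must be free */
	assert(shift < 32);                         /* must fit in 5 bits */
	return table_addr | ((uint64_t)shift << 1) | DEMO_HUGEPD_FLAG;
}

/* A walker tests the flag to decide whether to branch off. */
static int demo_is_hugepd(uint64_t entry)
{
	return (entry & DEMO_HUGEPD_FLAG) != 0;
}

/* The embedded shift replaces a lookup in a separate map. */
static unsigned demo_hugepd_shift(uint64_t entry)
{
	return (unsigned)((entry & DEMO_SHIFT_MASK) >> 1);
}

/* Strip the metadata to recover the table address. */
static uint64_t demo_hugepd_table(uint64_t entry)
{
	return entry & ~DEMO_LOW_BITS;
}
```

A walker written this way needs only the entry itself: an unmarked entry is followed as usual, and a marked one yields both the table address and the information needed to index it.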

This reduces the source size in a number of places, and means that
newer variants on the pagetable layout to handle new hardware and new
features will need to alter the existing code in fewer places.


I've used the libhugetlbfs testsuite to test these patches on a
Power5+ machine, but they could do with some banging.  In particular I
don't have any suitable hardware to test 16G pages.  So, think of this
as the first draft of the series.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* [1/3] Make hpte_need_flush() correctly mask for multiple page sizes
From: David Gibson @ 2009-09-04  7:15 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt
In-Reply-To: <20090904071445.GD20631@yookeroo.seuss>

Currently, hpte_need_flush() only correctly flushes the given address
for normal pages.  Callers for hugepages are required to mask the
address themselves.

But hpte_need_flush() already looks up the page sizes for its own
reasons, so this is a rather silly imposition on the callers.  This
patch alters it to mask based on the pagesize it has looked up itself,
and removes the awkward masking code in the hugepage caller.
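
The masking itself is a round-down to the page boundary for whichever page size was looked up. A stand-alone sketch, with a hypothetical three-entry size table in place of the kernel's page size definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical page-size indices and shifts, for illustration only. */
enum demo_psize {
	DEMO_PAGE_4K,
	DEMO_PAGE_64K,
	DEMO_PAGE_16M,
	DEMO_PAGE_COUNT
};

static const unsigned demo_psize_shift[DEMO_PAGE_COUNT] = { 12, 16, 24 };

/* Round addr down to the start of the page of the given size. */
static uint64_t demo_mask_addr(uint64_t addr, enum demo_psize psize)
{
	assert(psize < DEMO_PAGE_COUNT);
	return addr & ~((UINT64_C(1) << demo_psize_shift[psize]) - 1);
}
```

For the address 0x12345678 this yields 0x12345000, 0x12340000 and 0x12000000 for the 4K, 64K and 16M entries respectively, so a caller can pass any address inside the page.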

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/mm/hugetlbpage.c |    6 +-----
 arch/powerpc/mm/tlb_hash64.c  |    8 +++-----
 2 files changed, 4 insertions(+), 10 deletions(-)

Index: working-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:36:12.000000000 +1000
@@ -53,11 +53,6 @@ void hpte_need_flush(struct mm_struct *m
 
 	i = batch->index;
 
-	/* We mask the address for the base page size. Huge pages will
-	 * have applied their own masking already
-	 */
-	addr &= PAGE_MASK;
-
 	/* Get page size (maybe move back to caller).
 	 *
 	 * NOTE: when using special 64K mappings in 4K environment like
@@ -75,6 +70,9 @@ void hpte_need_flush(struct mm_struct *m
 	} else
 		psize = pte_pagesize_index(mm, addr, pte);
 
+	/* Mask the address for the correct page size */
+	addr &= ~((1UL << mmu_psize_defs[psize].shift) - 1);
+
 	/* Build full vaddr */
 	if (!is_kernel_addr(addr)) {
 		ssize = user_segment_size(addr);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:36:12.000000000 +1000
@@ -445,11 +445,7 @@ void set_huge_pte_at(struct mm_struct *m
 		 * necessary anymore if we make hpte_need_flush() get the
 		 * page size from the slices
 		 */
-		unsigned int psize = get_slice_psize(mm, addr);
-		unsigned int shift = mmu_psize_to_shift(psize);
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-		pte_update(mm, addr & hstate->mask, ptep, ~0UL, 1);
+		pte_update(mm, addr, ptep, ~0UL, 1);
 	}
 	*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
 }

^ permalink raw reply

* [2/3] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-09-04  7:15 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt
In-Reply-To: <20090904071445.GD20631@yookeroo.seuss>

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.
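
A user-space sketch of the size-keyed scheme, assuming a plain array of descriptors in place of real kmem_caches. The `demo_` names and `DEMO_MAX_SHIFT` are hypothetical; the point is only that two table types with the same index size end up sharing one slot.

```c
#include <assert.h>
#include <stddef.h>

/* Largest supported index size in this model (hypothetical limit). */
#define DEMO_MAX_SHIFT 15

/* Stand-in for a cache: remembers its object size and how many users asked. */
struct demo_cache {
	size_t object_size;
	unsigned users;
};

/* One slot per index size; slot (shift - 1) serves tables of 2^shift pointers. */
static struct demo_cache demo_caches[DEMO_MAX_SHIFT];
#define DEMO_CACHE(shift) (demo_caches[(shift) - 1])

/* Register a user for tables of 2^shift pointers, creating the slot once. */
static struct demo_cache *demo_cache_add(unsigned shift)
{
	struct demo_cache *c;

	assert(shift >= 1 && shift <= DEMO_MAX_SHIFT);
	c = &DEMO_CACHE(shift);
	if (c->object_size == 0)
		c->object_size = sizeof(void *) << shift;  /* 2^shift pointers */
	c->users++;
	return c;
}
```

Calling `demo_cache_add(9)` once for a directory level and once for a hugepage table returns the same descriptor, which is the sharing the description refers to.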

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/pgalloc-64.h    |   43 ++++++++++++-----------------
 arch/powerpc/include/asm/pgalloc.h       |   25 +++--------------
 arch/powerpc/include/asm/pgtable-ppc64.h |    1 
 arch/powerpc/mm/hugetlbpage.c            |   45 ++++++++-----------------------
 arch/powerpc/mm/init_64.c                |   42 ++++++++++++++--------------
 arch/powerpc/mm/pgtable.c                |   25 +++++++++++------
 6 files changed, 73 insertions(+), 108 deletions(-)

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-09-04 14:38:20.000000000 +1000
@@ -148,30 +148,30 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[PGF_SHIFT_MASK];
+
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	struct kmem_cache *new;
+
+	BUG_ON((shift < 1) || (shift > PGF_SHIFT_MASK));
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, table_size, 0, ctor);
+	PGT_CACHE(shift) = new;
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+	BUG_ON(!PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-09-04 14:38:20.000000000 +1000
@@ -16,22 +16,17 @@ static inline void subpage_prot_free(pgd
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +35,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +73,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +102,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > PGF_SHIFT_MASK);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-09-04 14:38:20.000000000 +1000
@@ -24,24 +24,12 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
 /* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
  * and small enough to fit in the low bits of any naturally aligned page
  * table cache entry. Arbitrarily set to 0x1f, that should give us some
  * room to grow
  */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
+#define PGF_SHIFT_MASK		0xf
 
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
@@ -50,12 +38,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +51,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-09-04 14:38:20.000000000 +1000
@@ -47,12 +47,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -62,13 +62,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -77,8 +77,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~PGF_SHIFT_MASK);
+		unsigned shift = batch->tables[i] & PGF_SHIFT_MASK;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -89,25 +93,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > PGF_SHIFT_MASK);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:36:12.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:38:20.000000000 +1000
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-04 14:38:20.000000000 +1000
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*

^ permalink raw reply

* [3/3] Allow more flexible layouts for hugepage pagetables
From: David Gibson @ 2009-09-04  7:15 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt
In-Reply-To: <20090904071445.GD20631@yookeroo.seuss>

Currently each available hugepage size uses a slightly different
pagetable layout: that is, the bottom level table of pointers to
hugepages is a different size, and may branch off from the normal page
tables at a different level.  Every hugepage aware path that needs to
walk the pagetables must therefore look up the hugepage size from the
slice info first, and work out the correct way to walk the pagetables
accordingly.  Future hardware is likely to add more possible hugepage
sizes, more layout options and more mess.

This patch therefore reworks the handling of hugepage pagetables to
reduce this complexity.  In the new scheme, instead of having to
consult the slice mask, pagetable walking code can check a flag in the
PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
and the entry also contains the information (essentially hugepage
shift) necessary to then interpret that table without recourse to the
slice mask.  This scheme can be extended neatly to handle multiple
levels of self-describing "special" hugepage pagetables, although for
now we assume only one level exists.

This approach means that only the pagetable allocation path needs to
know how the pagetables should be set out.  All other (hugepage)
pagetable walking paths can just interpret the structure as they go.
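
As a rough illustration of the point above (not part of the patch):
the allocation side is the only place that has to pick a level.  The
sketch below mirrors the shape of the level choice in this patch's
set_huge_psize() as a standalone userspace function.  The three
*_SHIFT values are assumptions chosen to resemble a 4K base page
64-bit configuration, and pick_pdshift() and hugepte_entries() are
hypothetical names used only here.

```c
#include <assert.h>

/* Assumed directory shifts, for illustration only; the real values
 * come from the kernel's pagetable headers and depend on the config. */
#define PMD_SHIFT   21
#define PUD_SHIFT   28
#define PGDIR_SHIFT 35

/* Hypothetical helper: shift of the directory level whose entry
 * points at the hugepte table for a page of size 2^pshift. */
static inline unsigned pick_pdshift(unsigned pshift)
{
	if (pshift < PMD_SHIFT)
		return PMD_SHIFT;
	if (pshift < PUD_SHIFT)
		return PUD_SHIFT;
	return PGDIR_SHIFT;
}

/* Hypothetical helper: number of entries in that hugepte table,
 * i.e. 2^(pdshift - pshift). */
static inline unsigned long hugepte_entries(unsigned pshift)
{
	unsigned pdshift = pick_pdshift(pshift);

	assert(pdshift > pshift);
	return 1UL << (pdshift - pshift);
}
```

With these assumed shifts, a 64K page (shift 16) hangs off a PMD entry
with a 32-entry table, a 16M page (shift 24) off a PUD entry with 16
entries, and a 16G page (shift 34) off a PGD entry with 2 entries.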

There already was a flag bit in PGD/PUD/PMD entries for hugepage
directory pointers, but it was only used for debug.  We alter that
flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
pointer (normally it would be 1 since the pointer lies in the linear
mapping).  This means that asm pagetable walking can test for (and
punt on) hugepage pointers with the same test that checks for
unpopulated page directory entries (beq becomes bge), since hugepage
pointers will always be positive, and normal pointers always negative.
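
To make the encoding concrete, here is a minimal userspace sketch of
the scheme described above.  It follows the hugepd_ok(),
hugepd_shift() and hugepd_page() helpers in the patch below, but uses
fixed-width integers instead of kernel types; hugepd_encode(),
hugepd_table() and hugepte_index() are hypothetical names, and the
linear mapping is assumed to start at 0xc000000000000000 as in the
patch.

```c
#include <assert.h>
#include <stdint.h>

#define HUGEPD_SHIFT_MASK 0x3fULL               /* low 6 bits hold the shift */
#define LINEAR_MSB        0x8000000000000000ULL /* cleared to mark a hugepd */
#define LINEAR_TOP        0xc000000000000000ULL /* assumed linear map base */

typedef struct { int64_t pd; } hugepd_t;

/* Pack a table address (in the linear mapping) and a hugepage shift.
 * The table must be aligned so its low 6 bits are free. */
static inline hugepd_t hugepd_encode(uint64_t table_addr, unsigned shift)
{
	hugepd_t hpd;

	assert(shift <= HUGEPD_SHIFT_MASK);
	assert((table_addr & HUGEPD_SHIFT_MASK) == 0);
	hpd.pd = (int64_t)((table_addr & ~LINEAR_MSB) | shift);
	return hpd;
}

/* Positive means hugepd; zero is empty; negative is a normal pointer. */
static inline int hugepd_ok(hugepd_t hpd)
{
	return hpd.pd > 0;
}

static inline unsigned hugepd_shift(hugepd_t hpd)
{
	return (unsigned)((uint64_t)hpd.pd & HUGEPD_SHIFT_MASK);
}

/* Recover the table address by dropping the shift and restoring the
 * top bits of the linear mapping. */
static inline uint64_t hugepd_table(hugepd_t hpd)
{
	return ((uint64_t)hpd.pd & ~HUGEPD_SHIFT_MASK) | LINEAR_TOP;
}

/* Index of addr within the table hanging off a directory entry that
 * covers 2^pdshift bytes. */
static inline uint64_t hugepte_index(hugepd_t hpd, uint64_t addr,
				     unsigned pdshift)
{
	return (addr & ((1ULL << pdshift) - 1)) >> hugepd_shift(hpd);
}
```

The sign test is what lets a walker distinguish the three cases with a
single comparison: an encoded entry is strictly positive, an empty
entry is zero, and an ordinary linear-mapping pointer has the MSB set.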

While we're at it, we get rid of the confusing (and grep defeating)
#defining of hugepte_shift to be the same thing as mmu_huge_psizes.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h |    9 
 arch/powerpc/mm/hugetlbpage.c      |  396 +++++++++++++++----------------------
 arch/powerpc/mm/init_64.c          |   11 -
 3 files changed, 181 insertions(+), 235 deletions(-)

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:38:20.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 16:21:49.000000000 +1000
@@ -42,23 +42,9 @@ static unsigned nr_gpages;
  */
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
-#define hugepte_shift			mmu_huge_psizes
-#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
-#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
-
-#define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-					 + HUGEPTE_INDEX_SIZE(psize))
-#define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
-#define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
-#define HUGEPD_OK	0x1
-
-typedef struct { unsigned long pd; } hugepd_t;
-
-#define hugepd_none(hpd)	((hpd).pd == 0)
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
@@ -82,71 +68,127 @@ static inline unsigned int mmu_psize_to_
 	BUG();
 }
 
+#define hugepd_none(hpd)	((hpd).pd == 0)
+
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-	BUG_ON(!(hpd.pd & HUGEPD_OK));
-	return (pte_t *)(hpd.pd & ~HUGEPD_OK);
+	BUG_ON(!hugepd_ok(hpd));
+	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | 0xc000000000000000);
+}
+
+static inline unsigned int hugepd_shift(hugepd_t hpd)
+{
+	return hpd.pd & HUGEPD_SHIFT_MASK;
 }
 
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
-				    struct hstate *hstate)
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr, unsigned pdshift)
 {
-	unsigned int shift = huge_page_shift(hstate);
-	int psize = shift_to_mmu_psize(shift);
-	unsigned long idx = ((addr >> shift) & (PTRS_PER_HUGEPTE(psize)-1));
+	unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(*hpdp);
 	pte_t *dir = hugepd_page(*hpdp);
 
 	return dir + idx;
 }
 
+static pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
+					unsigned long addr, unsigned *shift)
+{
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pdshift = PGDIR_SHIFT;
+
+	if (shift)
+		*shift = 0;
+
+	pg = pgd_offset(mm, addr);
+	if (is_hugepd(pg)) {
+		hpdp = (hugepd_t *)pg;
+	} else if (!pgd_none(*pg)) {
+		pdshift = PUD_SHIFT;
+		pu = pud_offset(pg, addr);
+		if (is_hugepd(pu))
+			hpdp = (hugepd_t *)pu;
+		else if (!pud_none(*pu)) {
+			pdshift = PMD_SHIFT;
+			pm = pmd_offset(pu, addr);
+			if (is_hugepd(pm))
+				hpdp = (hugepd_t *)pm;
+			else if (!pmd_none(*pm)) {
+				return pte_offset_map(pm, addr);
+			}
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	if (shift)
+		*shift = hugepd_shift(*hpdp);
+	return hugepte_offset(hpdp, addr, pdshift);
+}
+
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	return huge_pte_offset_and_shift(mm, addr, NULL);
+}
+
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address, unsigned int psize)
+			   unsigned long address, unsigned pdshift, unsigned pshift)
 {
-	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(pdshift - pshift),
 				       GFP_KERNEL|__GFP_REPEAT);
 
+	BUG_ON(pshift > HUGEPD_SHIFT_MASK);
+	BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK);
+
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
+		kmem_cache_free(PGT_CACHE(pdshift - pshift), new);
 	else
-		hpdp->pd = (unsigned long)new | HUGEPD_OK;
+		hpdp->pd = ((unsigned long)new & ~0x8000000000000000) | pshift;
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
 
-
-static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_offset(pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
-			 struct hstate *hstate)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_alloc(mm, pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_offset(pud, addr);
-	else
-		return (pmd_t *) pud;
-}
-static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_alloc(mm, pud, addr);
-	else
-		return (pmd_t *) pud;
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pshift = __ffs(sz);
+	unsigned pdshift = PGDIR_SHIFT;
+
+	addr &= ~(sz-1);
+
+	pg = pgd_offset(mm, addr);
+	if (pshift >= PUD_SHIFT) {
+		hpdp = (hugepd_t *)pg;
+	} else {
+		pdshift = PUD_SHIFT;
+		pu = pud_alloc(mm, pg, addr);
+		if (pshift >= PMD_SHIFT) {
+			hpdp = (hugepd_t *)pu;
+		} else {
+			pdshift = PMD_SHIFT;
+			pm = pmd_alloc(mm, pu, addr);
+			hpdp = (hugepd_t *)pm;
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
+
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
+		return NULL;
+
+	return hugepte_offset(hpdp, addr, pdshift);
 }
 
 /* Build list of addresses of gigantic pages.  This function is used in early
@@ -180,92 +222,38 @@ int alloc_bootmem_huge_page(struct hstat
 	return 1;
 }
 
-
-/* Modelled after find_linux_pte() */
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-
-	unsigned int psize;
-	unsigned int shift;
-	unsigned long sz;
-	struct hstate *hstate;
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_to_shift(psize);
-	sz = ((1UL) << shift);
-	hstate = size_to_hstate(sz);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	if (!pgd_none(*pg)) {
-		pu = hpud_offset(pg, addr, hstate);
-		if (!pud_none(*pu)) {
-			pm = hpmd_offset(pu, addr, hstate);
-			if (!pmd_none(*pm))
-				return hugepte_offset((hugepd_t *)pm, addr,
-						      hstate);
-		}
-	}
-
-	return NULL;
-}
-
-pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	struct hstate *hstate;
-	unsigned int psize;
-	hstate = size_to_hstate(sz);
-
-	psize = get_slice_psize(mm, addr);
-	BUG_ON(!mmu_huge_psizes[psize]);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	pu = hpud_alloc(mm, pg, addr, hstate);
-
-	if (pu) {
-		pm = hpmd_alloc(mm, pu, addr, hstate);
-		if (pm)
-			hpdp = (hugepd_t *)pm;
-	}
-
-	if (! hpdp)
-		return NULL;
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
-		return NULL;
-
-	return hugepte_offset(hpdp, addr, hstate);
-}
-
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
 
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
-			       unsigned int psize)
+static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
+			      unsigned long start, unsigned long end,
+			      unsigned long floor, unsigned long ceiling)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
+	unsigned shift = hugepd_shift(*hpdp);
+	unsigned long pdmask = ~((1UL << pdshift) - 1);
+
+	start &= pdmask;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= pdmask;
+		if (! ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
+	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling,
-				   unsigned int psize)
+				   unsigned long floor, unsigned long ceiling)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -277,7 +265,8 @@ static void hugetlb_free_pmd_range(struc
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd))
 			continue;
-		free_hugepte_range(tlb, (hugepd_t *)pmd, psize);
+		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
+				  addr, next, floor, ceiling);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
@@ -303,23 +292,19 @@ static void hugetlb_free_pud_range(struc
 	pud_t *pud;
 	unsigned long next;
 	unsigned long start;
-	unsigned int shift;
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
-	shift = mmu_psize_to_shift(psize);
 
 	start = addr;
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (shift < PMD_SHIFT) {
+		if (!is_hugepd(pud)) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
-					       ceiling, psize);
+					       ceiling);
 		} else {
-			if (pud_none(*pud))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pud, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pud++, addr = next, addr != end);
 
@@ -350,74 +335,34 @@ void hugetlb_free_pgd_range(struct mmu_g
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long start;
 
 	/*
-	 * Comments below take from the normal free_pgd_range().  They
-	 * apply here too.  The tests against HUGEPD_MASK below are
-	 * essential, because we *don't* test for this at the bottom
-	 * level.  Without them we'll attempt to free a hugepte table
-	 * when we unmap just part of it, even if there are other
-	 * active mappings using it.
-	 *
-	 * The next few lines have given us lots of grief...
+	 * Because there are a number of different possible pagetable
+	 * layouts for hugepage ranges, we limit knowledge of how
+	 * things should be laid out to the allocation path
+	 * (huge_pte_alloc(), above).  Everything else works out the
+	 * structure as it goes from information in the hugepd
+	 * pointers.  That means that we can't here use the
+	 * optimization used in the normal page free_pgd_range(), of
+	 * checking whether we're actually covering a large enough
+	 * range to have to do anything at the top level of the walk
+	 * instead of at the bottom.
 	 *
-	 * Why are we testing HUGEPD* at this top level?  Because
-	 * often there will be no work to do at all, and we'd prefer
-	 * not to go all the way down to the bottom just to discover
-	 * that.
-	 *
-	 * Why all these "- 1"s?  Because 0 represents both the bottom
-	 * of the address space and the top of it (using -1 for the
-	 * top wouldn't help much: the masks would do the wrong thing).
-	 * The rule is that addr 0 and floor 0 refer to the bottom of
-	 * the address space, but end 0 and ceiling 0 refer to the top
-	 * Comparisons need to use "end - 1" and "ceiling - 1" (though
-	 * that end 0 case should be mythical).
-	 *
-	 * Wherever addr is brought up or ceiling brought down, we
-	 * must be careful to reject "the opposite 0" before it
-	 * confuses the subsequent tests.  But what about where end is
-	 * brought down by HUGEPD_SIZE below? no, end can't go down to
-	 * 0 there.
-	 *
-	 * Whereas we round start (addr) and ceiling down, by different
-	 * masks at different levels, in order to test whether a table
-	 * now has no other vmas using it, so can be freed, we don't
-	 * bother to round floor or end up - the tests don't need that.
+	 * To make sense of this, you should probably go read the big
+	 * block comment at the top of the normal free_pgd_range(),
+	 * too.
 	 */
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
-
-	addr &= HUGEPD_MASK(psize);
-	if (addr < floor) {
-		addr += HUGEPD_SIZE(psize);
-		if (!addr)
-			return;
-	}
-	if (ceiling) {
-		ceiling &= HUGEPD_MASK(psize);
-		if (!ceiling)
-			return;
-	}
-	if (end - 1 > ceiling - 1)
-		end -= HUGEPD_SIZE(psize);
-	if (addr > end - 1)
-		return;
 
-	start = addr;
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		psize = get_slice_psize(tlb->mm, addr);
-		BUG_ON(!mmu_huge_psizes[psize]);
 		next = pgd_addr_end(addr, end);
-		if (mmu_psize_to_shift(psize) < PUD_SHIFT) {
+		if (!is_hugepd(pgd)) {
 			if (pgd_none_or_clear_bad(pgd))
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
-			if (pgd_none(*pgd))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pgd, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pgd++, addr = next, addr != end);
 }
@@ -448,19 +393,19 @@ follow_huge_addr(struct mm_struct *mm, u
 {
 	pte_t *ptep;
 	struct page *page;
-	unsigned int mmu_psize = get_slice_psize(mm, address);
+	unsigned shift;
+	unsigned long mask;
+
+	ptep = huge_pte_offset_and_shift(mm, address, &shift);
 
 	/* Verify it is a huge page else bail. */
-	if (!mmu_huge_psizes[mmu_psize])
+	if (!ptep || !shift)
 		return ERR_PTR(-EINVAL);
 
-	ptep = huge_pte_offset(mm, address);
+	mask = (1UL << shift) - 1;
 	page = pte_page(*ptep);
-	if (page) {
-		unsigned int shift = mmu_psize_to_shift(mmu_psize);
-		unsigned long sz = ((1UL) << shift);
-		page += (address % sz) / PAGE_SIZE;
-	}
+	if (page)
+		page += (address & mask) / PAGE_SIZE;
 
 	return page;
 }
@@ -541,21 +486,18 @@ int hash_huge_page(struct mm_struct *mm,
 	int err = 1;
 	int ssize = user_segment_size(ea);
 	unsigned int mmu_psize;
-	int shift;
+	unsigned shift;
 	mmu_psize = get_slice_psize(mm, ea);
 
-	if (!mmu_huge_psizes[mmu_psize])
-		goto out;
-	ptep = huge_pte_offset(mm, ea);
-
+	ptep = huge_pte_offset_and_shift(mm, ea, &shift);
 	/* Search the Linux page table for a match with va */
 	va = hpt_va(ea, vsid, ssize);
 
 	/*
-	 * If no pte found or not present, send the problem up to
-	 * do_page_fault
+	 * If no pte found or not present, or it's not a hugepage pte,
+	 * send the problem up to do_page_fault
 	 */
-	if (unlikely(!ptep || pte_none(*ptep)))
+	if (unlikely(!ptep || pte_none(*ptep) || !shift))
 		goto out;
 
 	/* 
@@ -588,7 +530,6 @@ int hash_huge_page(struct mm_struct *mm,
 	rflags = 0x2 | (!(new_pte & _PAGE_RW));
  	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
 	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	shift = mmu_psize_to_shift(mmu_psize);
 	sz = ((1UL) << shift);
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
 		/* No CPU has hugepages but lacks no execute, so we
@@ -672,6 +613,8 @@ repeat:
 
 static void __init set_huge_psize(int psize)
 {
+	unsigned pdshift;
+
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable limits. */
 	if (mmu_psize_defs[psize].shift &&
@@ -686,29 +629,14 @@ static void __init set_huge_psize(int ps
 			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
-		switch (mmu_psize_defs[psize].shift) {
-		case PAGE_SHIFT_64K:
-		    /* We only allow 64k hpages with 4k base page,
-		     * which was checked above, and always put them
-		     * at the PMD */
-		    hugepte_shift[psize] = PMD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16M:
-		    /* 16M pages can be at two different levels
-		     * of pagestables based on base page size */
-		    if (PAGE_SHIFT == PAGE_SHIFT_64K)
-			    hugepte_shift[psize] = PMD_SHIFT;
-		    else /* 4k base page */
-			    hugepte_shift[psize] = PUD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16G:
-		    /* 16G pages are always at PGD level */
-		    hugepte_shift[psize] = PGDIR_SHIFT;
-		    break;
-		}
-		hugepte_shift[psize] -= mmu_psize_defs[psize].shift;
-	} else
-		hugepte_shift[psize] = 0;
+		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
+	}
 }
 
 static int __init hugepage_setup_sz(char *str)
@@ -732,7 +660,7 @@ __setup("hugepagesz=", hugepage_setup_sz
 
 static int __init hugetlbpage_init(void)
 {
-	unsigned int psize;
+	int psize;
 
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
@@ -753,8 +681,8 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(hugepte_shift[psize], NULL);
-			if (!PGT_CACHE(hugepte_shift[psize]))
+			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
+			if (!PGT_CACHE(mmu_huge_psizes[psize]))
 				panic("hugetlbpage_init(): could not create "
 				      "pgtable cache for %d bit pagesize\n",
 				      mmu_psize_to_shift(psize));
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-09-04 16:09:06.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-09-04 16:20:55.000000000 +1000
@@ -3,6 +3,15 @@
 
 #include <asm/page.h>
 
+typedef struct { signed long pd; } hugepd_t;
+
+static inline int hugepd_ok(hugepd_t hpd)
+{
+	return (hpd.pd > 0);
+}
+
+#define is_hugepd(pdep)               (hugepd_ok(*((hugepd_t *)(pdep))))
+#define HUGEPD_SHIFT_MASK     0x3f
 
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-04 16:12:43.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-09-04 16:23:14.000000000 +1000
@@ -41,6 +41,7 @@
 #include <linux/module.h>
 #include <linux/poison.h>
 #include <linux/lmb.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/page.h>
@@ -154,13 +155,21 @@ void pgtable_cache_add(unsigned shift, v
 {
 	char *name;
 	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
 	struct kmem_cache *new;
 
 	BUG_ON((shift < 1) || (shift > PGF_SHIFT_MASK));
+#ifdef CONFIG_HUGETLB_PAGE
+	/* We use low bits in hugepage dir pointers to store index
+	 * size information.  Table alignment must be big enough to
+	 * fit it. */
+	align = max_t(unsigned long, align, HUGEPD_SHIFT_MASK + 1);
+#endif
+
 	if (PGT_CACHE(shift))
 		return; /* Already have a cache of this size */
 	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
-	new = kmem_cache_create(name, table_size, table_size, 0, ctor);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
 	PGT_CACHE(shift) = new;
 }
 

^ permalink raw reply

* Re: [PATCH] powerpc/mpc52xx/mtd: fix mtd-ram access for 16-bit Local Plus Bus
From: David Woodhouse @ 2009-09-04  8:51 UTC (permalink / raw)
  To: Albrecht Dreß; +Cc: Linux PPC Development
In-Reply-To: <1244911551.3423.0@antares>

On Sat, 2009-06-13 at 18:45 +0200, Albrecht Dreß wrote:
> On 11.06.09 19:28, Grant Likely wrote:
> > So; the solution to me seems to be on an MPC5200 platform replace the  
> > offending hooks with MPC5200 specific variants at runtime.
> 
> Will re-work the patch that way!  BTW, a dumb question: what is the  
> proper way to determine which cpu the system is running on?  Check the  
> CPU node of the of tree?

Surely the solution is for it to be a 'complex' mapping, where it can
provide its own I/O functions instead of polluting the inline versions
designed for 'simple' maps with special cases?

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply

* [PATCH] * mpc8313erdb.dts: Fixed eTSEC interrupt assignment.
From: Roland Lezuo @ 2009-09-04 10:31 UTC (permalink / raw)
  To: linuxppc-dev

The following patch is needed to correctly assign the IRQs for the gianfar driver on the MPC8313ERDB-revc boards. ERR and TX are swapped as well as the interrupt lines for the two devices.

Signed-off-by: Roland Lezuo <roland.lezuo@chello.at>

---
 arch/powerpc/boot/dts/mpc8313erdb.dts |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
index 761faa7..907a445 100644
--- a/arch/powerpc/boot/dts/mpc8313erdb.dts
+++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
@@ -188,7 +188,7 @@
 			compatible = "gianfar";
 			reg = <0x24000 0x1000>;
 			local-mac-address = [ 00 00 00 00 00 00 ];
-			interrupts = <37 0x8 36 0x8 35 0x8>;
+			interrupts = <32 0x8 33 0x8 34 0x8>;
 			interrupt-parent = <&ipic>;
 			tbi-handle = < &tbi0 >;
 			/* Vitesse 7385 isn't on the MDIO bus */
@@ -223,7 +223,7 @@
 			reg = <0x25000 0x1000>;
 			ranges = <0x0 0x25000 0x1000>;
 			local-mac-address = [ 00 00 00 00 00 00 ];
-			interrupts = <34 0x8 33 0x8 32 0x8>;
+			interrupts = <35 0x8 36 0x8 37 0x8>;
 			interrupt-parent = <&ipic>;
 			tbi-handle = < &tbi1 >;
 			phy-handle = < &phy4 >;
-- 
1.6.0.4

Regards
Roland Lezuo
please CC personally as I'm not subscribed.

^ permalink raw reply related

* Fwd: MPC85xx External/Internal Interrupts
From: Alemao @ 2009-09-04 13:13 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <d970ff420909040604h1bb3874eg5c79f79f85e713a3@mail.gmail.com>

Hi all,

In all MPC85xx DTS files I have seen this interrupt configuration for
TSEC1:

interrupts = <29 2 30 2 34 2>;

29 - TSEC1 interrupt transmit
30 - TSEC1 interrupt receive
34 - TSEC1 interrupt error


But in MPC8555RM, chapter 10.1.5.2 the numbers are these:

13 - TSEC1 interrupt transmit
14 - TSEC1 interrupt receive
18 - TSEC1 interrupt error


I'm a little confused about these internal interrupts: how are these
numbers derived? TSEC1 is working normally.


My real problem is that my driver is trying to request the external
interrupt IRQ0, and I don't know what number to use for INTR_NUM:

request_irq(INTR_NUM, , , , )


I'm using linux-2.6.26

Cheers,

--
Alemao

^ permalink raw reply

* Re: [RFC] net/fs_enet: send a reset request to the PHY on init
From: Sebastian Andrzej Siewior @ 2009-09-04 15:38 UTC (permalink / raw)
  To: Grant Likely; +Cc: linuxppc-dev, netdev, Vitaly Bordug
In-Reply-To: <fa686aa40909030948h4acf6d3x1c318baa2fdefe1f@mail.gmail.com>

Grant Likely wrote:
> What version of the kernel are you using?  The line numbers don't
> match up with kernel mainline, so I wonder if this is before or after
> the OF MDIO rework changes.
It is the kernel which was shipped in ads5121's bsp which is 2.6.24.

> Regardless, this doesn't look right.  It certainly isn't right for the
> driver to do an unconditional PHY reset when it doesn't actually know
> what phy is attached.  For most boards I'm sure this is not desirable
> because it will cause a delay while the PHY auto negotiates.
> Depending on when the first network traffic begins, can cause several
> seconds of boot delay.
> 
> Best would be to do this in U-Boot.  Otherwise, I think I would rather
> see it at phy_device probe time.  At least then it would be on a
> per-phy basis, or could be controlled by a property in the device tree
> so that all boards don't get the same impact.
I have no network support in boot loader so I can't do it there. Doing it 
at phy-probe time sounds reasonable.
So all other boards are doing this kind of reset in u-boot?

> g.
> 


Sebastian

^ permalink raw reply

* Re: [RFC] net/fs_enet: send a reset request to the PHY on init
From: Grant Likely @ 2009-09-04 15:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linuxppc-dev, netdev, Wolfgang Denk, Vitaly Bordug
In-Reply-To: <4AA1347A.5060702@linutronix.de>

On Fri, Sep 4, 2009 at 9:38 AM, Sebastian Andrzej
Siewior<bigeasy@linutronix.de> wrote:
> Grant Likely wrote:
>> What version of the kernel are you using?  The line numbers don't
>> match up with kernel mainline, so I wonder if this is before or after
>> the OF MDIO rework changes.
>
> It is the kernel which was shipped in ads5121's bsp which is 2.6.24.

Okay, I can safely ignore this then.  Wolfgang may be interested
though.  He's been doing some work to get 5121 support mainlined.

> I have no network support in boot loader so I can't do it there. Doing it at
> phy-probe time sounds reasonable.
> So all other boards are doing this kind of reset in u-boot?

In general I take the approach that as much as possible firmware
should have devices in a sane state before booting the kernel just to
avoid doing board specific fixup stuff in the kernel tree.  But this
isn't law, just more of a rule of thumb that I go by.  2nd resort is
to create a board specific platform code file and put it there
(arch/powerpc/platforms/*).

g.

-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.

^ permalink raw reply

* Re: [PATCH v2 0/8] spi_mpc8xxx: Add support for DMA transfers
From: Anton Vorontsov @ 2009-09-04 16:37 UTC (permalink / raw)
  To: Kumar Gala
  Cc: David Brownell, Greg Kroah-Hartman, linux-kernel, David Brownell,
	linuxppc-dev, spi-devel-general, Andrew Morton
In-Reply-To: <200908272141.59796.david-b@pacbell.net>

On Thu, Aug 27, 2009 at 09:41:59PM -0700, David Brownell wrote:
> On Tuesday 18 August 2009, Anton Vorontsov wrote:
> 
> > - Fix build issues in fsl_qe_udc;
> > - Some minor cosmetic changes in "Add support for QE DMA mode and
> >   CPM1/CPM2 chips" patch.
> 
> Hmm ... the first four of these are pure PPC stuff and thus
> not appropriate to send as SPI patches; but the second four
> depend on them.
> 
> So I'll just say
> 
>   Acked-by: David Brownell <dbrownell@users.sourceforge.net>
> 
> and ask you to merge via the PPC tree.  (And hope that you
> verified these are bisectable...)

Thanks David.

Kumar, can you please merge the SPI part of this patch set?

Thanks,

-- 
Anton Vorontsov
email: cbouatmailru@gmail.com
irc://irc.freenode.net/bd2

^ permalink raw reply

* Re: [PATCH RFC 1/2] Makefile: Never use -fno-omit-frame-pointer
From: Anton Vorontsov @ 2009-09-04 16:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linuxppc-dev, Steven Rostedt, Sam Ravnborg, linux-kernel
In-Reply-To: <20090718120145.GB31007@elte.hu>

On Sat, Jul 18, 2009 at 02:01:45PM +0200, Ingo Molnar wrote:
> 
> * Anton Vorontsov <avorontsov@ru.mvista.com> wrote:
> 
> > On Wed, Jun 17, 2009 at 12:16:30AM +0400, Anton Vorontsov wrote:
> > > According to Segher Boessenkool and GCC manual, -fomit-frame-pointer
> > > is only the default when optimising on archs/ABIs where it doesn't
> > > hinder debugging and -pg. So, we do not get it by default on x86,
> > > not at any optimisation level.
> > > 
> > > On the other hand, *using* -fno-omit-frame-pointer causes gcc to
> > > produce buggy code on PowerPC targets.
> > > 
> > > If Segher and GCC manual are right, this patch should be a no-op
> > > for all arches except PowerPC, where the patch fixes gcc issues.
> > > 
> > > Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
> > > ---
> > > 
> > > See this thread for more discussion:
> > > http://osdir.com/ml/linux-kernel/2009-05/msg01754.html
> > > 
> > > p.s.
> > > Obviously, I didn't test this patch on anything else but PPC32. ;-)
> > > 
> > > Segher, do you know if all GCC versions that we support for
> > > building Linux are behaving the way that GCC manual describe?
> > 
> > No news is good news... Ingo, can we merge this into -tip for 
> > testing?
> 
> Changes to the top level Makefile should really go via Sam's kbuild 
> tree.

Sam, any thoughts about these patches?

Thanks!

-- 
Anton Vorontsov
email: cbouatmailru@gmail.com
irc://irc.freenode.net/bd2

^ permalink raw reply

* MPC85xx External/Internal Interrupts
From: Alemao @ 2009-09-04 19:01 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <d970ff420909040604h1bb3874eg5c79f79f85e713a3@mail.gmail.com>

Hi all,

In all MPC85xx DTS files I have seen this interrupt configuration for
TSEC1:

interrupts = <29 2 30 2 34 2>;

29 - TSEC1 interrupt transmit
30 - TSEC1 interrupt receive
34 - TSEC1 interrupt error


But in MPC8555RM, chapter 10.1.5.2 the numbers are these:

13 - TSEC1 interrupt transmit
14 - TSEC1 interrupt receive
18 - TSEC1 interrupt error


I'm a little confused about these internal interrupts: how are these
numbers derived? TSEC1 is working normally.


My real problem is that my driver is trying to request the external
interrupt IRQ0, and I don't know what number to use for INTR_NUM:

request_irq(INTR_NUM, , , , )


I'm using linux-2.6.26

Cheers,

--
Alemao

^ permalink raw reply

* Re: MPC85xx External/Internal Interrupts
From: Alemao @ 2009-09-04 19:14 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <d970ff420909041201m343dafdfta0e55df8b5c01f60@mail.gmail.com>

I've read some posts in the list, and about:

irq_of_parse_and_map()
irq_create_map()

But I'm still trying to understand the MPC85xx TSEC1 dts.

The gianfar driver is using request_irq(), and request_irq() uses a
virtual irq, right?

Is that why all the irqs for TSEC1 in the dts are offset by 16?

Manual  |  DTS
---------------
13         29
14         30
18         34
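
The table above amounts to a fixed offset of 16 between the manual's
internal interrupt numbers and the values in the dts.  A minimal
sketch of that arithmetic, using only the three TSEC1 pairs quoted in
this mail; whether the same offset holds for every internal source on
this part is an assumption here, and internal_to_dts() is a
hypothetical name, not a kernel function:

```c
#include <assert.h>

/* Offset inferred from the table in this mail, not taken from the manual. */
#define DTS_INTERNAL_OFFSET 16

/* Hypothetical helper: manual's internal interrupt number -> dts value. */
static inline unsigned internal_to_dts(unsigned manual_num)
{
	return manual_num + DTS_INTERNAL_OFFSET;
}

/* Check the three TSEC1 pairs listed above. */
static inline void check_tsec1_table(void)
{
	assert(internal_to_dts(13) == 29); /* transmit */
	assert(internal_to_dts(14) == 30); /* receive */
	assert(internal_to_dts(18) == 34); /* error */
}
```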

As for external IRQs, the MPC8555RM doesn't list IDs for them, so what
should I use to request IRQ0 using irq_create_map()?

In MPC83xx all interrupts have IDs, including IRQ0, IRQ1...
That make things much more clear.

Thanks in advance,

--
Alemao

^ permalink raw reply

