LinuxPPC-Dev Archive on lore.kernel.org
* [2/6] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-10-27  5:24 UTC
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.
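
As a rough userspace illustration of that encoding (not part of the
patch): the helper names below are hypothetical, and only the 0xf
limit and the "slot n - 1" convention are taken from
MAX_PGTABLE_INDEX_SIZE and PGT_CACHE() in the diff.  It shows how an
index size n gives the table size, selects a cache slot, and can ride
in the low bits of a sufficiently aligned table address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors MAX_PGTABLE_INDEX_SIZE in the patch; also used as a low-bit mask. */
#define MAX_INDEX_SIZE 0xfu

/* Bytes needed for a table holding 2^index_size pointers. */
static size_t table_bytes(unsigned index_size)
{
	assert(index_size >= 1 && index_size <= MAX_INDEX_SIZE);
	return sizeof(void *) << index_size;
}

/* Slot in a per-size cache array: size n lives in slot n - 1. */
static unsigned cache_slot(unsigned index_size)
{
	assert(index_size >= 1 && index_size <= MAX_INDEX_SIZE);
	return index_size - 1;
}

/* Pack an index size into the low bits of an aligned table address.
 * Index size 0 is allowed here and stands for "plain page, no cache". */
static uintptr_t pack_table(uintptr_t table, unsigned index_size)
{
	assert((table & MAX_INDEX_SIZE) == 0);	/* needs 16-byte alignment */
	assert(index_size <= MAX_INDEX_SIZE);
	return table | index_size;
}

/* Recover the table address from a packed value. */
static uintptr_t unpack_table(uintptr_t packed)
{
	return packed & ~(uintptr_t)MAX_INDEX_SIZE;
}

/* Recover the index size from a packed value. */
static unsigned unpack_index_size(uintptr_t packed)
{
	return (unsigned)(packed & MAX_INDEX_SIZE);
}
```

Two table types with the same index size map to the same slot, which
is how the sharing between levels falls out without any special cases.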

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/pgalloc-64.h    |   60 +++++++++++++++-----------
 arch/powerpc/include/asm/pgalloc.h       |   30 +------------
 arch/powerpc/include/asm/pgtable-ppc64.h |    1 
 arch/powerpc/mm/hugetlbpage.c            |   45 +++++--------------
 arch/powerpc/mm/init_64.c                |   70 +++++++++++++++++++++----------
 arch/powerpc/mm/pgtable.c                |   25 +++++++----
 6 files changed, 117 insertions(+), 114 deletions(-)

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-10-27 15:37:04.000000000 +1100
@@ -119,30 +119,58 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
+
+/*
+ * Create a kmem_cache() for pagetables.  This is not used for PTE
+ * pages - they're linked to struct page, come from the normal free
+ * pages pool and have a different entry size (see real_pte_t) to
+ * everything else.  Caches created by this function are used for all
+ * the higher level pagetables, and for hugepage pagetables.
+ */
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
+
+	/* When batching pgtable pointers for RCU freeing, we store
+	 * the index size in the low bits.  Table alignment must be
+	 * big enough to fit it */
+	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	struct kmem_cache *new;
+
+	/* It would be nice if this was a BUILD_BUG_ON(), but at the
+	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
+	 * constant expression, so so much for that. */
+	BUG_ON(!is_power_of_2(minalign));
+	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+
+	align = max_t(unsigned long, align, minalign);
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
+	PGT_CACHE(shift) = new;
+
+	pr_debug("Allocated pgtable cache for order %d\n", shift);
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+
+	/* In all current configs, when the PUD index exists it's the
+	 * same size as either the pgd or pmd index.  Verify that the
+	 * initialization above has also created a PUD cache.  This
+	 * will need re-examination if we add new possibilities for
+	 * the pagetable layout. */
+	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-10-27 15:30:16.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-10-27 15:30:18.000000000 +1100
@@ -11,27 +11,39 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  For PTE pages (which are linked to a struct
+ * page for now, and drawn from the main get_free_pages() pool), the
+ * allocation size will be (2^index_size * sizeof(pointer)) and
+ * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
+
 #ifndef CONFIG_PPC_SUBPAGE_PROT
 static inline void subpage_prot_free(pgd_t *pgd) {}
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +52,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +90,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +119,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-10-27 15:30:18.000000000 +1100
@@ -24,25 +24,6 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
-/* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
- * and small enough to fit in the low bits of any naturally aligned page
- * table cache entry. Arbitrarily set to 0x1f, that should give us some
- * room to grow
- */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
-
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
@@ -50,12 +31,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +44,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-10-27 15:30:18.000000000 +1100
@@ -49,12 +49,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -64,13 +64,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -79,8 +79,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
+		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -91,25 +95,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:35:27.000000000 +1100
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-27 15:30:17.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-27 15:35:27.000000000 +1100
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*

* [1/6] Make hpte_need_flush() correctly mask for multiple page sizes
From: David Gibson @ 2009-10-27  5:24 UTC
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

Currently, hpte_need_flush() only correctly flushes the given address
for normal pages.  Callers for hugepages are required to mask the
address themselves.

But hpte_need_flush() already looks up the page sizes for its own
reasons, so this is a rather silly imposition on the callers.  This
patch alters it to mask based on the pagesize it has looked up itself,
and removes the awkward masking code in the hugepage caller.
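
For illustration only, the masking amounts to rounding the address
down to the start of its page, given log2 of the page size.  The
helper name here is hypothetical; in the patch the shift comes from
mmu_psize_defs[psize].shift.

```c
#include <assert.h>

/* Round addr down to the start of the page that contains it.
 * page_shift is log2 of the page size, e.g. 12 for 4K or 24 for 16M. */
static unsigned long mask_to_page(unsigned long addr, unsigned page_shift)
{
	assert(page_shift < sizeof(unsigned long) * 8);
	return addr & ~((1UL << page_shift) - 1);
}
```

With a 4K shift only the low 12 bits are cleared; with a 16M shift the
low 24 bits are cleared, which is the part callers previously had to
do by hand for hugepages.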

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/mm/hugetlbpage.c |    6 +-----
 arch/powerpc/mm/tlb_hash64.c  |    8 +++-----
 2 files changed, 4 insertions(+), 10 deletions(-)

Index: working-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:36:12.000000000 +1000
@@ -53,11 +53,6 @@ void hpte_need_flush(struct mm_struct *m
 
 	i = batch->index;
 
-	/* We mask the address for the base page size. Huge pages will
-	 * have applied their own masking already
-	 */
-	addr &= PAGE_MASK;
-
 	/* Get page size (maybe move back to caller).
 	 *
 	 * NOTE: when using special 64K mappings in 4K environment like
@@ -75,6 +70,9 @@ void hpte_need_flush(struct mm_struct *m
 	} else
 		psize = pte_pagesize_index(mm, addr, pte);
 
+	/* Mask the address for the correct page size */
+	addr &= ~((1UL << mmu_psize_defs[psize].shift) - 1);
+
 	/* Build full vaddr */
 	if (!is_kernel_addr(addr)) {
 		ssize = user_segment_size(addr);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:36:12.000000000 +1000
@@ -445,11 +445,7 @@ void set_huge_pte_at(struct mm_struct *m
 		 * necessary anymore if we make hpte_need_flush() get the
 		 * page size from the slices
 		 */
-		unsigned int psize = get_slice_psize(mm, addr);
-		unsigned int shift = mmu_psize_to_shift(psize);
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-		pte_update(mm, addr & hstate->mask, ptep, ~0UL, 1);
+		pte_update(mm, addr, ptep, ~0UL, 1);
 	}
 	*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
 }

* [3/6] Allow more flexible layouts for hugepage pagetables
From: David Gibson @ 2009-10-27  5:24 UTC
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

Currently each available hugepage size uses a slightly different
pagetable layout: that is, the bottom level table of pointers to
hugepages is a different size, and may branch off from the normal page
tables at a different level.  Every hugepage aware path that needs to
walk the pagetables must therefore look up the hugepage size from the
slice info first, and work out the correct way to walk the pagetables
accordingly.  Future hardware is likely to add more possible hugepage
sizes, more layout options and more mess.

This patch therefore reworks the handling of hugepage pagetables to
reduce this complexity.  In the new scheme, instead of having to
consult the slice mask, pagetable walking code can check a flag in the
PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
and the entry also contains the information (essentially hugepage
shift) necessary to then interpret that table without recourse to the
slice mask.  This scheme can be extended neatly to handle multiple
levels of self-describing "special" hugepage pagetables, although for
now we assume only one level exists.
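
A minimal userspace sketch of how a walker can interpret such an
entry, modelled on hugepte_offset() in the diff below.  The names and
the 0x3f width of the shift field are assumptions made for this
example; the excerpt shown here does not include the definition of
HUGEPD_SHIFT_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed width of the hugepage-shift field in the low bits of an entry. */
#define SHIFT_MASK 0x3fULL

/* Hugepage shift (log2 of the hugepage size) stored in the entry. */
static unsigned entry_shift(uint64_t entry)
{
	return (unsigned)(entry & SHIFT_MASK);
}

/* Index of the hugepage PTE for addr within the table the entry points to.
 * pdshift is log2 of the address range covered by one directory entry
 * at the level where the walk branched off. */
static uint64_t hugepte_index(uint64_t entry, uint64_t addr, unsigned pdshift)
{
	unsigned shift = entry_shift(entry);

	assert(pdshift < 64 && shift <= pdshift);
	return (addr & ((1ULL << pdshift) - 1)) >> shift;
}
```

For example, an entry describing 16M pages (shift 24) under a
directory slot covering 256M (pdshift 28) points to a 16-entry table,
and the index is just bits 24..27 of the address.  No slice lookup is
involved.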

This approach means that only the pagetable allocation path needs to
know how the pagetables should be set out.  All other (hugepage)
pagetable walking paths can just interpret the structure as they go.

There already was a flag bit in PGD/PUD/PMD entries for hugepage
directory pointers, but it was only used for debug.  We alter that
flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
pointer (normally it would be 1 since the pointer lies in the linear
mapping).  This means that asm pagetable walking can test for (and
punt on) hugepage pointers with the same test that checks for
unpopulated page directory entries (beq becomes bge), since hugepage
pointers will always be positive, and normal pointers always negative.
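
The sign trick can be sketched in plain C as follows.  This is an
illustration rather than the kernel code: the function names are
hypothetical, the 0x3f shift mask is an assumption, and the
0xc000000000000000 linear-mapping base is taken from hugepd_page() in
the diff.  The signed cast assumes a two's complement target, which
holds on ppc64.

```c
#include <assert.h>
#include <stdint.h>

#define TOP_BIT     0x8000000000000000ULL
#define LINEAR_BASE 0xc000000000000000ULL	/* as used by hugepd_page() */
#define SHIFT_MASK  0x3fULL			/* assumed field width */

enum entry_kind { ENTRY_NONE, ENTRY_NORMAL, ENTRY_HUGEPD };

/* Build a hugepd entry: clear the MSB, keep the hugepage shift in the low bits. */
static uint64_t make_hugepd(uint64_t table_va, unsigned pshift)
{
	assert((table_va & SHIFT_MASK) == 0);
	assert(pshift != 0 && pshift <= SHIFT_MASK);
	return (table_va & ~TOP_BIT) | pshift;
}

/* Recover the table's linear-mapping address from a hugepd entry. */
static uint64_t hugepd_table(uint64_t hpd)
{
	return (hpd & ~SHIFT_MASK) | LINEAR_BASE;
}

/* Full classification, as C code walking the tables might do it. */
static enum entry_kind classify(uint64_t entry)
{
	if (entry == 0)
		return ENTRY_NONE;
	return (entry & TOP_BIT) ? ENTRY_NORMAL : ENTRY_HUGEPD;
}

/* The single test a normal-pages-only walker needs: empty and hugepd
 * entries are both non-negative when viewed as signed values. */
static int must_punt(uint64_t entry)
{
	return (int64_t)entry >= 0;
}
```

The point of must_punt() is that one comparison covers both the empty
entry and the hugepd pointer, which is what lets the asm replace beq
with bge at no extra cost.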

While we're at it, we get rid of the confusing (and grep defeating)
#defining of hugepte_shift to be the same thing as mmu_huge_psizes.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h       |    1 
 arch/powerpc/include/asm/mmu-hash64.h    |   14 
 arch/powerpc/include/asm/page.h          |   14 
 arch/powerpc/include/asm/pgtable-ppc64.h |   13 
 arch/powerpc/include/asm/pgtable.h       |    3 
 arch/powerpc/kernel/perf_callchain.c     |   20 -
 arch/powerpc/mm/gup.c                    |  149 +--------
 arch/powerpc/mm/hash_utils_64.c          |   26 -
 arch/powerpc/mm/hugetlbpage.c            |  473 ++++++++++++++-----------------
 arch/powerpc/mm/init_64.c                |   10 
 10 files changed, 313 insertions(+), 410 deletions(-)

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:37:08.000000000 +1100
@@ -40,25 +40,11 @@ static unsigned nr_gpages;
 /* Array of valid huge page sizes - non-zero value(hugepte_shift) is
  * stored for the huge page sizes that are valid.
  */
-unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
-#define hugepte_shift			mmu_huge_psizes
-#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
-#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
-
-#define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-					 + HUGEPTE_INDEX_SIZE(psize))
-#define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
-#define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
+static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
-#define HUGEPD_OK	0x1
-
-typedef struct { unsigned long pd; } hugepd_t;
-
-#define hugepd_none(hpd)	((hpd).pd == 0)
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
@@ -82,71 +68,126 @@ static inline unsigned int mmu_psize_to_
 	BUG();
 }
 
+#define hugepd_none(hpd)	((hpd).pd == 0)
+
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-	BUG_ON(!(hpd.pd & HUGEPD_OK));
-	return (pte_t *)(hpd.pd & ~HUGEPD_OK);
+	BUG_ON(!hugepd_ok(hpd));
+	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | 0xc000000000000000);
 }
 
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
-				    struct hstate *hstate)
+static inline unsigned int hugepd_shift(hugepd_t hpd)
 {
-	unsigned int shift = huge_page_shift(hstate);
-	int psize = shift_to_mmu_psize(shift);
-	unsigned long idx = ((addr >> shift) & (PTRS_PER_HUGEPTE(psize)-1));
+	return hpd.pd & HUGEPD_SHIFT_MASK;
+}
+
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr, unsigned pdshift)
+{
+	unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(*hpdp);
 	pte_t *dir = hugepd_page(*hpdp);
 
 	return dir + idx;
 }
 
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
+{
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pdshift = PGDIR_SHIFT;
+
+	if (shift)
+		*shift = 0;
+
+	pg = pgdir + pgd_index(ea);
+	if (is_hugepd(pg)) {
+		hpdp = (hugepd_t *)pg;
+	} else if (!pgd_none(*pg)) {
+		pdshift = PUD_SHIFT;
+		pu = pud_offset(pg, ea);
+		if (is_hugepd(pu))
+			hpdp = (hugepd_t *)pu;
+		else if (!pud_none(*pu)) {
+			pdshift = PMD_SHIFT;
+			pm = pmd_offset(pu, ea);
+			if (is_hugepd(pm))
+				hpdp = (hugepd_t *)pm;
+			else if (!pmd_none(*pm)) {
+				return pte_offset_map(pm, ea);
+			}
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	if (shift)
+		*shift = hugepd_shift(*hpdp);
+	return hugepte_offset(hpdp, ea, pdshift);
+}
+
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
+}
+
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address, unsigned int psize)
+			   unsigned long address, unsigned pdshift, unsigned pshift)
 {
-	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(pdshift - pshift),
 				       GFP_KERNEL|__GFP_REPEAT);
 
+	BUG_ON(pshift > HUGEPD_SHIFT_MASK);
+	BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK);
+
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
+		kmem_cache_free(PGT_CACHE(pdshift - pshift), new);
 	else
-		hpdp->pd = (unsigned long)new | HUGEPD_OK;
+		hpdp->pd = ((unsigned long)new & ~0x8000000000000000) | pshift;
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
 
-
-static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_offset(pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_alloc(mm, pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_offset(pud, addr);
-	else
-		return (pmd_t *) pud;
-}
-static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_alloc(mm, pud, addr);
-	else
-		return (pmd_t *) pud;
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pshift = __ffs(sz);
+	unsigned pdshift = PGDIR_SHIFT;
+
+	addr &= ~(sz-1);
+
+	pg = pgd_offset(mm, addr);
+	if (pshift >= PUD_SHIFT) {
+		hpdp = (hugepd_t *)pg;
+	} else {
+		pdshift = PUD_SHIFT;
+		pu = pud_alloc(mm, pg, addr);
+		if (pshift >= PMD_SHIFT) {
+			hpdp = (hugepd_t *)pu;
+		} else {
+			pdshift = PMD_SHIFT;
+			pm = pmd_alloc(mm, pu, addr);
+			hpdp = (hugepd_t *)pm;
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
+
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
+		return NULL;
+
+	return hugepte_offset(hpdp, addr, pdshift);
 }
 
 /* Build list of addresses of gigantic pages.  This function is used in early
@@ -180,92 +221,38 @@ int alloc_bootmem_huge_page(struct hstat
 	return 1;
 }
 
-
-/* Modelled after find_linux_pte() */
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-
-	unsigned int psize;
-	unsigned int shift;
-	unsigned long sz;
-	struct hstate *hstate;
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_to_shift(psize);
-	sz = ((1UL) << shift);
-	hstate = size_to_hstate(sz);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	if (!pgd_none(*pg)) {
-		pu = hpud_offset(pg, addr, hstate);
-		if (!pud_none(*pu)) {
-			pm = hpmd_offset(pu, addr, hstate);
-			if (!pmd_none(*pm))
-				return hugepte_offset((hugepd_t *)pm, addr,
-						      hstate);
-		}
-	}
-
-	return NULL;
-}
-
-pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	struct hstate *hstate;
-	unsigned int psize;
-	hstate = size_to_hstate(sz);
-
-	psize = get_slice_psize(mm, addr);
-	BUG_ON(!mmu_huge_psizes[psize]);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	pu = hpud_alloc(mm, pg, addr, hstate);
-
-	if (pu) {
-		pm = hpmd_alloc(mm, pu, addr, hstate);
-		if (pm)
-			hpdp = (hugepd_t *)pm;
-	}
-
-	if (! hpdp)
-		return NULL;
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
-		return NULL;
-
-	return hugepte_offset(hpdp, addr, hstate);
-}
-
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
 
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
-			       unsigned int psize)
+static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
+			      unsigned long start, unsigned long end,
+			      unsigned long floor, unsigned long ceiling)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
+	unsigned shift = hugepd_shift(*hpdp);
+	unsigned long pdmask = ~((1UL << pdshift) - 1);
+
+	start &= pdmask;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= pdmask;
+		if (! ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
+	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling,
-				   unsigned int psize)
+				   unsigned long floor, unsigned long ceiling)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -277,7 +264,8 @@ static void hugetlb_free_pmd_range(struc
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd))
 			continue;
-		free_hugepte_range(tlb, (hugepd_t *)pmd, psize);
+		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
+				  addr, next, floor, ceiling);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
@@ -303,23 +291,19 @@ static void hugetlb_free_pud_range(struc
 	pud_t *pud;
 	unsigned long next;
 	unsigned long start;
-	unsigned int shift;
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
-	shift = mmu_psize_to_shift(psize);
 
 	start = addr;
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (shift < PMD_SHIFT) {
+		if (!is_hugepd(pud)) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
-					       ceiling, psize);
+					       ceiling);
 		} else {
-			if (pud_none(*pud))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pud, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pud++, addr = next, addr != end);
 
@@ -350,74 +334,34 @@ void hugetlb_free_pgd_range(struct mmu_g
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long start;
 
 	/*
-	 * Comments below take from the normal free_pgd_range().  They
-	 * apply here too.  The tests against HUGEPD_MASK below are
-	 * essential, because we *don't* test for this at the bottom
-	 * level.  Without them we'll attempt to free a hugepte table
-	 * when we unmap just part of it, even if there are other
-	 * active mappings using it.
+	 * Because there are a number of different possible pagetable
+	 * layouts for hugepage ranges, we limit knowledge of how
+	 * things should be laid out to the allocation path
+	 * (huge_pte_alloc(), above).  Everything else works out the
+	 * structure as it goes from information in the hugepd
+	 * pointers.  That means that we can't here use the
+	 * optimization used in the normal page free_pgd_range(), of
+	 * checking whether we're actually covering a large enough
+	 * range to have to do anything at the top level of the walk
+	 * instead of at the bottom.
 	 *
-	 * The next few lines have given us lots of grief...
-	 *
-	 * Why are we testing HUGEPD* at this top level?  Because
-	 * often there will be no work to do at all, and we'd prefer
-	 * not to go all the way down to the bottom just to discover
-	 * that.
-	 *
-	 * Why all these "- 1"s?  Because 0 represents both the bottom
-	 * of the address space and the top of it (using -1 for the
-	 * top wouldn't help much: the masks would do the wrong thing).
-	 * The rule is that addr 0 and floor 0 refer to the bottom of
-	 * the address space, but end 0 and ceiling 0 refer to the top
-	 * Comparisons need to use "end - 1" and "ceiling - 1" (though
-	 * that end 0 case should be mythical).
-	 *
-	 * Wherever addr is brought up or ceiling brought down, we
-	 * must be careful to reject "the opposite 0" before it
-	 * confuses the subsequent tests.  But what about where end is
-	 * brought down by HUGEPD_SIZE below? no, end can't go down to
-	 * 0 there.
-	 *
-	 * Whereas we round start (addr) and ceiling down, by different
-	 * masks at different levels, in order to test whether a table
-	 * now has no other vmas using it, so can be freed, we don't
-	 * bother to round floor or end up - the tests don't need that.
+	 * To make sense of this, you should probably go read the big
+	 * block comment at the top of the normal free_pgd_range(),
+	 * too.
 	 */
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
 
-	addr &= HUGEPD_MASK(psize);
-	if (addr < floor) {
-		addr += HUGEPD_SIZE(psize);
-		if (!addr)
-			return;
-	}
-	if (ceiling) {
-		ceiling &= HUGEPD_MASK(psize);
-		if (!ceiling)
-			return;
-	}
-	if (end - 1 > ceiling - 1)
-		end -= HUGEPD_SIZE(psize);
-	if (addr > end - 1)
-		return;
-
-	start = addr;
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		psize = get_slice_psize(tlb->mm, addr);
-		BUG_ON(!mmu_huge_psizes[psize]);
 		next = pgd_addr_end(addr, end);
-		if (mmu_psize_to_shift(psize) < PUD_SHIFT) {
+		if (!is_hugepd(pgd)) {
 			if (pgd_none_or_clear_bad(pgd))
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
-			if (pgd_none(*pgd))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pgd, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pgd++, addr = next, addr != end);
 }
@@ -448,19 +392,19 @@ follow_huge_addr(struct mm_struct *mm, u
 {
 	pte_t *ptep;
 	struct page *page;
-	unsigned int mmu_psize = get_slice_psize(mm, address);
+	unsigned shift;
+	unsigned long mask;
+
+	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
 
 	/* Verify it is a huge page else bail. */
-	if (!mmu_huge_psizes[mmu_psize])
+	if (!ptep || !shift)
 		return ERR_PTR(-EINVAL);
 
-	ptep = huge_pte_offset(mm, address);
+	mask = (1UL << shift) - 1;
 	page = pte_page(*ptep);
-	if (page) {
-		unsigned int shift = mmu_psize_to_shift(mmu_psize);
-		unsigned long sz = ((1UL) << shift);
-		page += (address % sz) / PAGE_SIZE;
-	}
+	if (page)
+		page += (address & mask) / PAGE_SIZE;
 
 	return page;
 }
@@ -483,6 +427,73 @@ follow_huge_pmd(struct mm_struct *mm, un
 	return NULL;
 }
 
+static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+		       unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	unsigned long pte_end;
+	struct page *head, *page;
+	pte_t pte;
+	int refs;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	pte = *ptep;
+	mask = _PAGE_PRESENT | _PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		/* Could be optimized better */
+		while (*nr) {
+			put_page(page);
+			(*nr)--;
+		}
+	}
+
+	return 1;
+}
+
+int gup_hugepd(hugepd_t *hugepd, unsigned pdshift,
+	       unsigned long addr, unsigned long end,
+	       int write, struct page **pages, int *nr)
+{
+	pte_t *ptep;
+	unsigned long sz = 1UL << hugepd_shift(*hugepd);
+
+	ptep = hugepte_offset(hugepd, addr, pdshift);
+	do {
+		if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr))
+			return 0;
+	} while (ptep++, addr += sz, addr != end);
+
+	return 1;
+}
 
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 					unsigned long len, unsigned long pgoff,
@@ -530,34 +541,20 @@ static unsigned int hash_huge_page_do_la
 	return rflags;
 }
 
-int hash_huge_page(struct mm_struct *mm, unsigned long access,
-		   unsigned long ea, unsigned long vsid, int local,
-		   unsigned long trap)
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize)
 {
-	pte_t *ptep;
 	unsigned long old_pte, new_pte;
 	unsigned long va, rflags, pa, sz;
 	long slot;
 	int err = 1;
-	int ssize = user_segment_size(ea);
-	unsigned int mmu_psize;
-	int shift;
-	mmu_psize = get_slice_psize(mm, ea);
 
-	if (!mmu_huge_psizes[mmu_psize])
-		goto out;
-	ptep = huge_pte_offset(mm, ea);
+	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
 
 	/* Search the Linux page table for a match with va */
 	va = hpt_va(ea, vsid, ssize);
 
-	/*
-	 * If no pte found or not present, send the problem up to
-	 * do_page_fault
-	 */
-	if (unlikely(!ptep || pte_none(*ptep)))
-		goto out;
-
 	/* 
 	 * Check the user's access rights to the page.  If access should be
 	 * prevented then send the problem up to do_page_fault.
@@ -588,7 +585,6 @@ int hash_huge_page(struct mm_struct *mm,
 	rflags = 0x2 | (!(new_pte & _PAGE_RW));
  	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
 	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	shift = mmu_psize_to_shift(mmu_psize);
 	sz = ((1UL) << shift);
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
 		/* No CPU has hugepages but lacks no execute, so we
@@ -672,6 +668,8 @@ repeat:
 
 static void __init set_huge_psize(int psize)
 {
+	unsigned pdshift;
+
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable limits. */
 	if (mmu_psize_defs[psize].shift &&
@@ -686,29 +684,14 @@ static void __init set_huge_psize(int ps
 			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
-		switch (mmu_psize_defs[psize].shift) {
-		case PAGE_SHIFT_64K:
-		    /* We only allow 64k hpages with 4k base page,
-		     * which was checked above, and always put them
-		     * at the PMD */
-		    hugepte_shift[psize] = PMD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16M:
-		    /* 16M pages can be at two different levels
-		     * of pagestables based on base page size */
-		    if (PAGE_SHIFT == PAGE_SHIFT_64K)
-			    hugepte_shift[psize] = PMD_SHIFT;
-		    else /* 4k base page */
-			    hugepte_shift[psize] = PUD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16G:
-		    /* 16G pages are always at PGD level */
-		    hugepte_shift[psize] = PGDIR_SHIFT;
-		    break;
-		}
-		hugepte_shift[psize] -= mmu_psize_defs[psize].shift;
-	} else
-		hugepte_shift[psize] = 0;
+		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
+	}
 }
 
 static int __init hugepage_setup_sz(char *str)
@@ -732,7 +715,7 @@ __setup("hugepagesz=", hugepage_setup_sz
 
 static int __init hugetlbpage_init(void)
 {
-	unsigned int psize;
+	int psize;
 
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
@@ -753,8 +736,8 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(hugepte_shift[psize], NULL);
-			if (!PGT_CACHE(hugepte_shift[psize]))
+			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
+			if (!PGT_CACHE(mmu_huge_psizes[psize]))
 				panic("hugetlbpage_init(): could not create "
 				      "pgtable cache for %d bit pagesize\n",
 				      mmu_psize_to_shift(psize));
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-10-27 15:37:08.000000000 +1100
@@ -3,7 +3,6 @@
 
 #include <asm/page.h>
 
-
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
 
Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-10-27 15:37:04.000000000 +1100
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-10-27 15:37:08.000000000 +1100
@@ -41,6 +41,7 @@
 #include <linux/module.h>
 #include <linux/poison.h>
 #include <linux/lmb.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/page.h>
@@ -136,8 +137,13 @@ void pgtable_cache_add(unsigned shift, v
 
 	/* When batching pgtable pointers for RCU freeing, we store
 	 * the index size in the low bits.  Table alignment must be
-	 * big enough to fit it */
-	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	 * big enough to fit it.
+	 *
+	 * Likewise, hugepage pagetable pointers contain a (different)
+	 * shift value in the low bits.  All tables must be aligned so
+	 * as to leave enough 0 bits in the address to contain it. */
+	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
+				     HUGEPD_SHIFT_MASK + 1);
 	struct kmem_cache *new;
 
 	/* It would be nice if this was a BUILD_BUG_ON(), but at the
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-10-27 15:37:08.000000000 +1100
@@ -379,7 +379,18 @@ void pgtable_cache_init(void);
 	return pt;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long address);
+#ifdef CONFIG_HUGETLB_PAGE
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+				 unsigned *shift);
+#else
+static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+					       unsigned *shift)
+{
+	if (shift)
+		*shift = 0;
+	return find_linux_pte(pgdir, ea);
+}
+#endif /* !CONFIG_HUGETLB_PAGE */
 
 #endif /* __ASSEMBLY__ */
 
Index: working-2.6/arch/powerpc/mm/gup.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/gup.c	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/mm/gup.c	2009-10-27 15:37:08.000000000 +1100
@@ -55,57 +55,6 @@ static noinline int gup_pte_range(pmd_t 
 	return 1;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static noinline int gup_huge_pte(pte_t *ptep, struct hstate *hstate,
-				 unsigned long *addr, unsigned long end,
-				 int write, struct page **pages, int *nr)
-{
-	unsigned long mask;
-	unsigned long pte_end;
-	struct page *head, *page;
-	pte_t pte;
-	int refs;
-
-	pte_end = (*addr + huge_page_size(hstate)) & huge_page_mask(hstate);
-	if (pte_end < end)
-		end = pte_end;
-
-	pte = *ptep;
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_val(pte) & mask) != mask)
-		return 0;
-	/* hugepages are never "special" */
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-	refs = 0;
-	head = pte_page(pte);
-	page = head + ((*addr & ~huge_page_mask(hstate)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (*addr += PAGE_SIZE, *addr != end);
-
-	if (!page_cache_add_speculative(head, refs)) {
-		*nr -= refs;
-		return 0;
-	}
-	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		while (*nr) {
-			put_page(page);
-			(*nr)--;
-		}
-	}
-
-	return 1;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		int write, struct page **pages, int *nr)
 {
@@ -119,7 +68,11 @@ static int gup_pmd_range(pud_t pud, unsi
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(pmd))
 			return 0;
-		if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+		if (is_hugepd(pmdp)) {
+			if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
+					addr, next, write, pages, nr))
+				return 0;
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
 
@@ -139,7 +92,11 @@ static int gup_pud_range(pgd_t pgd, unsi
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
 			return 0;
-		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+		if (is_hugepd(pudp)) {
+			if (!gup_hugepd((hugepd_t *)pudp, PUD_SHIFT,
+					addr, next, write, pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
@@ -154,10 +111,6 @@ int get_user_pages_fast(unsigned long st
 	unsigned long next;
 	pgd_t *pgdp;
 	int nr = 0;
-#ifdef CONFIG_PPC64
-	unsigned int shift;
-	int psize;
-#endif
 
 	pr_devel("%s(%lx,%x,%s)\n", __func__, start, nr_pages, write ? "write" : "read");
 
@@ -172,25 +125,6 @@ int get_user_pages_fast(unsigned long st
 
 	pr_devel("  aligned: %lx .. %lx\n", start, end);
 
-#ifdef CONFIG_HUGETLB_PAGE
-	/* We bail out on slice boundary crossing when hugetlb is
-	 * enabled in order to not have to deal with two different
-	 * page table formats
-	 */
-	if (addr < SLICE_LOW_TOP) {
-		if (end > SLICE_LOW_TOP)
-			goto slow_irqon;
-
-		if (unlikely(GET_LOW_SLICE_INDEX(addr) !=
-			     GET_LOW_SLICE_INDEX(end - 1)))
-			goto slow_irqon;
-	} else {
-		if (unlikely(GET_HIGH_SLICE_INDEX(addr) !=
-			     GET_HIGH_SLICE_INDEX(end - 1)))
-			goto slow_irqon;
-	}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 	/*
 	 * XXX: batch / limit 'nr', to avoid large irq off latency
 	 * needs some instrumenting to determine the common sizes used by
@@ -210,54 +144,23 @@ int get_user_pages_fast(unsigned long st
 	 */
 	local_irq_disable();
 
-#ifdef CONFIG_PPC64
-	/* Those bits are related to hugetlbfs implementation and only exist
-	 * on 64-bit for now
-	 */
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_defs[psize].shift;
-#endif /* CONFIG_PPC64 */
-
-#ifdef CONFIG_HUGETLB_PAGE
-	if (unlikely(mmu_huge_psizes[psize])) {
-		pte_t *ptep;
-		unsigned long a = addr;
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-
-		BUG_ON(!hstate);
-		/*
-		 * XXX: could be optimized to avoid hstate
-		 * lookup entirely (just use shift)
-		 */
-
-		do {
-			VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, a)].shift);
-			ptep = huge_pte_offset(mm, a);
-			pr_devel(" %016lx: huge ptep %p\n", a, ptep);
-			if (!ptep || !gup_huge_pte(ptep, hstate, &a, end, write, pages,
-						   &nr))
-				goto slow;
-		} while (a != end);
-	} else
-#endif /* CONFIG_HUGETLB_PAGE */
-	{
-		pgdp = pgd_offset(mm, addr);
-		do {
-			pgd_t pgd = *pgdp;
-
-#ifdef CONFIG_PPC64
-			VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, addr)].shift);
-#endif
-			pr_devel("  %016lx: normal pgd %p\n", addr,
-				 (void *)pgd_val(pgd));
-			next = pgd_addr_end(addr, end);
-			if (pgd_none(pgd))
-				goto slow;
-			if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		pr_devel("  %016lx: normal pgd %p\n", addr,
+			 (void *)pgd_val(pgd));
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (is_hugepd(pgdp)) {
+			if (!gup_hugepd((hugepd_t *)pgdp, PGDIR_SHIFT,
+					addr, next, write, pages, &nr))
 				goto slow;
-		} while (pgdp++, addr = next, addr != end);
-	}
+		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+
 	local_irq_enable();
 
 	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
Index: working-2.6/arch/powerpc/kernel/perf_callchain.c
===================================================================
--- working-2.6.orig/arch/powerpc/kernel/perf_callchain.c	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/kernel/perf_callchain.c	2009-10-27 15:37:08.000000000 +1100
@@ -119,13 +119,6 @@ static void perf_callchain_kernel(struct
 }
 
 #ifdef CONFIG_PPC64
-
-#ifdef CONFIG_HUGETLB_PAGE
-#define is_huge_psize(pagesize)	(HPAGE_SHIFT && mmu_huge_psizes[pagesize])
-#else
-#define is_huge_psize(pagesize)	0
-#endif
-
 /*
  * On 64-bit we don't want to invoke hash_page on user addresses from
  * interrupt context, so if the access faults, we read the page tables
@@ -135,7 +128,7 @@ static int read_user_stack_slow(void __u
 {
 	pgd_t *pgdir;
 	pte_t *ptep, pte;
-	int pagesize;
+	unsigned shift;
 	unsigned long addr = (unsigned long) ptr;
 	unsigned long offset;
 	unsigned long pfn;
@@ -145,17 +138,14 @@ static int read_user_stack_slow(void __u
 	if (!pgdir)
 		return -EFAULT;
 
-	pagesize = get_slice_psize(current->mm, addr);
+	ptep = find_linux_pte_or_hugepte(pgdir, addr, &shift);
+	if (!shift)
+		shift = PAGE_SHIFT;
 
 	/* align address to page boundary */
-	offset = addr & ((1ul << mmu_psize_defs[pagesize].shift) - 1);
+	offset = addr & ((1UL << shift) - 1);
 	addr -= offset;
 
-	if (is_huge_psize(pagesize))
-		ptep = huge_pte_offset(current->mm, addr);
-	else
-		ptep = find_linux_pte(pgdir, addr);
-
 	if (ptep == NULL)
 		return -EFAULT;
 	pte = *ptep;
Index: working-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hash_utils_64.c	2009-10-27 15:57:41.000000000 +1100
@@ -891,6 +891,7 @@ int hash_page(unsigned long ea, unsigned
 	unsigned long vsid;
 	struct mm_struct *mm;
 	pte_t *ptep;
+	unsigned hugeshift;
 	const struct cpumask *tmp;
 	int rc, user_region = 0, local = 0;
 	int psize, ssize;
@@ -943,30 +944,31 @@ int hash_page(unsigned long ea, unsigned
 	if (user_region && cpumask_equal(mm_cpumask(mm), tmp))
 		local = 1;
 
-#ifdef CONFIG_HUGETLB_PAGE
-	/* Handle hugepage regions */
-	if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
-		DBG_LOW(" -> huge page !\n");
-		return hash_huge_page(mm, access, ea, vsid, local, trap);
-	}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 #ifndef CONFIG_PPC_64K_PAGES
-	/* If we use 4K pages and our psize is not 4K, then we are hitting
-	 * a special driver mapping, we need to align the address before
-	 * we fetch the PTE
+	/* If we use 4K pages and our psize is not 4K, then we might
+	 * be hitting a special driver mapping, and need to align the
+	 * address before we fetch the PTE.
+	 *
+	 * It could also be a hugepage mapping, in which case this is
+	 * not necessary, but it's not harmful, either.
 	 */
 	if (psize != MMU_PAGE_4K)
 		ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
 #endif /* CONFIG_PPC_64K_PAGES */
 
 	/* Get PTE and page size from page tables */
-	ptep = find_linux_pte(pgdir, ea);
+	ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
 	if (ptep == NULL || !pte_present(*ptep)) {
 		DBG_LOW(" no PTE !\n");
 		return 1;
 	}
 
+#ifdef CONFIG_HUGETLB_PAGE
+	if (hugeshift)
+		return __hash_page_huge(ea, access, vsid, ptep, trap, local,
+					ssize, hugeshift, psize);
+#endif /* CONFIG_HUGETLB_PAGE */
+
 #ifndef CONFIG_PPC_64K_PAGES
 	DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
 #else
Index: working-2.6/arch/powerpc/include/asm/mmu-hash64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/mmu-hash64.h	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/mmu-hash64.h	2009-10-27 15:37:08.000000000 +1100
@@ -173,14 +173,6 @@ extern unsigned long tce_alloc_start, tc
  */
 extern int mmu_ci_restrictions;
 
-#ifdef CONFIG_HUGETLB_PAGE
-/*
- * The page size indexes of the huge pages for use by hugetlbfs
- */
-extern unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
-
-#endif /* CONFIG_HUGETLB_PAGE */
-
 /*
  * This function sets the AVPN and L fields of the HPTE  appropriately
  * for the page size
@@ -254,9 +246,9 @@ extern int __hash_page_64K(unsigned long
 			   unsigned int local, int ssize);
 struct mm_struct;
 extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap);
-extern int hash_huge_page(struct mm_struct *mm, unsigned long access,
-			  unsigned long ea, unsigned long vsid, int local,
-			  unsigned long trap);
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize);
 
 extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
 			     unsigned long pstart, unsigned long prot,
Index: working-2.6/arch/powerpc/include/asm/page.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/page.h	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/page.h	2009-10-27 15:37:08.000000000 +1100
@@ -229,6 +229,20 @@ typedef unsigned long pgprot_t;
 
 #endif
 
+typedef struct { signed long pd; } hugepd_t;
+#define HUGEPD_SHIFT_MASK     0x3f
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline int hugepd_ok(hugepd_t hpd)
+{
+	return (hpd.pd > 0);
+}
+
+#define is_hugepd(pdep)               (hugepd_ok(*((hugepd_t *)(pdep))))
+#else /* CONFIG_HUGETLB_PAGE */
+#define is_hugepd(pdep)			0
+#endif /* CONFIG_HUGETLB_PAGE */
+
 struct page;
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: working-2.6/arch/powerpc/include/asm/pgtable.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable.h	2009-10-27 15:35:27.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/pgtable.h	2009-10-27 15:37:08.000000000 +1100
@@ -211,6 +211,9 @@ extern void paging_init(void);
  */
 extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t);
 
+extern int gup_hugepd(hugepd_t *hugepd, unsigned pdshift, unsigned long addr,
+		      unsigned long end, int write, struct page **pages, int *nr);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
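
A note on the hugepd encoding used above: the comment added to
pgtable_cache_add() says a hugepd pointer carries a shift value in its
low bits, and hugepd_ok() simply tests for a positive signed value.
Below is a minimal userspace sketch of one encoding with those two
properties.  The toy_* names and the rule of clearing the top address
bit to mark a hugepd are assumptions made for illustration only; the
real hugepd_page()/hugepd_shift() helpers are not shown in this part of
the patch.

```c
#include <stdint.h>

#define HUGEPD_SHIFT_MASK 0x3fULL        /* from the patch: low 6 bits hold a shift */
#define TOY_TOP_BIT       (1ULL << 63)   /* assumption: kernel addresses have this bit set */

typedef struct { int64_t pd; } toy_hugepd_t;   /* mirrors "signed long pd" */

/* Pack a table address (aligned to at least 64 bytes) together with a shift.
 * Clearing the top bit makes the stored value positive as a signed long. */
static toy_hugepd_t toy_hugepd_make(uint64_t table, unsigned shift)
{
	toy_hugepd_t hpd;

	hpd.pd = (int64_t)((table & ~TOY_TOP_BIT & ~HUGEPD_SHIFT_MASK) |
			   (shift & HUGEPD_SHIFT_MASK));
	return hpd;
}

/* Same test as hugepd_ok() in the patch: positive means "hugepd". */
static int toy_hugepd_ok(toy_hugepd_t hpd)
{
	return hpd.pd > 0;
}

/* Recover the shift stored in the low bits. */
static unsigned toy_hugepd_shift(toy_hugepd_t hpd)
{
	return (unsigned)((uint64_t)hpd.pd & HUGEPD_SHIFT_MASK);
}

/* Recover the table address by masking the shift and restoring the top bit. */
static uint64_t toy_hugepd_table(toy_hugepd_t hpd)
{
	return ((uint64_t)hpd.pd & ~HUGEPD_SHIFT_MASK) | TOY_TOP_BIT;
}
```

In this sketch an empty entry (pd == 0) and an ordinary kernel pointer
(negative when read as a signed long) both fail the "ok" test, which is
why a single comparison is enough to tell the cases apart.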

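The floor/ceiling tests in free_hugepd_range() rely on unsigned
wraparound: a ceiling of 0 means "top of the address space", so
"ceiling - 1" wraps to the largest unsigned long and the comparison
against "end - 1" can never reject the range.  The sketch below lifts
those tests out of the patch into a standalone predicate so the
behaviour can be checked in userspace.  The function name
toy_can_free_pd() is hypothetical; the body follows the checks in the
patch line by line.

```c
/* Returns 1 if the directory covering 'start' may be freed, 0 otherwise.
 * floor == 0 means the bottom of the address space,
 * ceiling == 0 means the top of it. */
static int toy_can_free_pd(unsigned pdshift,
			   unsigned long start, unsigned long end,
			   unsigned long floor, unsigned long ceiling)
{
	unsigned long pdmask = ~((1UL << pdshift) - 1);

	/* Round start down to the beginning of the directory's range. */
	start &= pdmask;
	if (start < floor)
		return 0;	/* something below us still uses this directory */

	if (ceiling) {
		ceiling &= pdmask;
		if (!ceiling)
			return 0;	/* rounding turned a real limit into "top" */
	}

	/* With ceiling == 0, ceiling - 1 is ULONG_MAX, so this never fires. */
	if (end - 1 > ceiling - 1)
		return 0;

	return 1;
}
```

For example, with pdshift 21 a range 0x200000..0x400000 is freeable
when floor and ceiling are both 0, but not when ceiling is 0x300000,
because the rounded-down ceiling (0x200000) lies below the end of the
range.
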

* [4/6] Cleanup initialization of hugepages on powerpc
From: David Gibson @ 2009-10-27  5:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

This patch simplifies the logic used to initialize hugepages on
powerpc.  The somewhat oddly named set_huge_psize() is renamed to
add_huge_page_size() and now does all necessary verification of
whether it has been given a valid hugepage size (instead of just some
of it) and instantiates the generic hstate structure (but no more).
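
As a rough illustration of the size checks add_huge_page_size()
performs, here is a userspace sketch of the power-of-two and range
tests only.  The TOY_* constants are assumed values (4K base pages,
shift limit of 40) and toy_check_huge_size() is a hypothetical name;
the real function additionally looks the shift up in mmu_psize_defs[]
and handles CONFIG_SPU_FS_64K_LS, which this sketch leaves out.

```c
#include <errno.h>

#define TOY_PAGE_SHIFT       12	/* assumption: 4K base page size */
#define TOY_SLICE_HIGH_SHIFT 40	/* assumption: stands in for SLICE_HIGH_SHIFT */

/* Returns 0 if 'size' passes the basic checks, -EINVAL otherwise. */
static int toy_check_huge_size(unsigned long long size)
{
	int shift;

	/* Equivalent of !is_power_of_2(size); done first so that the
	 * bit scan below is never applied to zero. */
	if (size == 0 || (size & (size - 1)) != 0)
		return -EINVAL;

	/* GCC/Clang builtin standing in for the kernel's __ffs(). */
	shift = __builtin_ctzll(size);

	/* Must be strictly larger than a base page and fit in a slice. */
	if (shift > TOY_SLICE_HIGH_SHIFT || shift <= TOY_PAGE_SHIFT)
		return -EINVAL;

	return 0;
}
```

Under these assumed constants 16M and 16G are accepted, while 4K (not
larger than the base page), 3M (not a power of two) and 2T (too large)
are rejected.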

hugetlbpage_init() now steps through the available page sizes, checks
whether each is valid for hugepages by calling add_huge_page_size(),
and initializes the kmem_caches for the hugepage pagetables.  This
means we can now eliminate the mmu_huge_psizes array, since we no
longer need to pass the sizing information for the pagetable caches
from set_huge_psize() into hugetlbpage_init().
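
To make the cache sizing concrete, the following sketch reproduces the
pdshift selection from hugetlbpage_init() in userspace.  The TOY_*
shift values are assumptions (they are meant to resemble a 4K base
page configuration) and the function names are hypothetical; only the
if/else ladder is taken from the patch.

```c
#define TOY_PMD_SHIFT   21	/* assumption: address bits covered by one PMD entry */
#define TOY_PUD_SHIFT   28	/* assumption: address bits covered by one PUD entry */
#define TOY_PGDIR_SHIFT 35	/* assumption: address bits covered by one PGD entry */

/* Pick the lowest pagetable level whose entries cover more than one
 * huge page of the given shift; the hugepd pointer lives at that level. */
static unsigned toy_pdshift(unsigned shift)
{
	if (shift < TOY_PMD_SHIFT)
		return TOY_PMD_SHIFT;
	else if (shift < TOY_PUD_SHIFT)
		return TOY_PUD_SHIFT;
	else
		return TOY_PGDIR_SHIFT;
}

/* Index size n of the hugepte table: it holds 2^n entries.  This is
 * the value handed to pgtable_cache_add() in the patch. */
static unsigned toy_hugepte_index_size(unsigned shift)
{
	return toy_pdshift(shift) - shift;
}
```

With these assumed shifts, 64K pages (shift 16) give an index size of
5, 16M pages (shift 24) give 4, and 16G pages (shift 34) give 1.  Two
page sizes that happen to produce the same index size would end up
using the same cache.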

Determination of the default huge page size is also moved from the
hash code into the general hugepage code.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/page_64.h |    2 
 arch/powerpc/mm/hash_utils_64.c    |   10 --
 arch/powerpc/mm/hugetlbpage.c      |  130 +++++++++++++++++--------------------
 3 files changed, 64 insertions(+), 78 deletions(-)

Index: linux-a2/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-a2.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-15 16:40:49.000000000 +1100
+++ linux-a2/arch/powerpc/mm/hugetlbpage.c	2009-10-15 16:41:33.000000000 +1100
@@ -37,27 +37,17 @@
 static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
 static unsigned nr_gpages;
 
-/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
- * stored for the huge page sizes that are valid.
- */
-static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
-	switch (shift) {
-#ifndef CONFIG_PPC_64K_PAGES
-	case PAGE_SHIFT_64K:
-	    return MMU_PAGE_64K;
-#endif
-	case PAGE_SHIFT_16M:
-	    return MMU_PAGE_16M;
-	case PAGE_SHIFT_16G:
-	    return MMU_PAGE_16G;
-	}
+	int psize;
+
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize)
+		if (mmu_psize_defs[psize].shift == shift)
+			return psize;
 	return -1;
 }
 
@@ -502,8 +492,6 @@ unsigned long hugetlb_get_unmapped_area(
 	struct hstate *hstate = hstate_file(file);
 	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
-	if (!mmu_huge_psizes[mmu_psize])
-		return -EINVAL;
 	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
 }
 
@@ -666,47 +654,46 @@ repeat:
 	return err;
 }
 
-static void __init set_huge_psize(int psize)
+static int __init add_huge_page_size(unsigned long long size)
 {
-	unsigned pdshift;
+	int shift = __ffs(size);
+	int mmu_psize;
 
 	/* Check that it is a page size supported by the hardware and
-	 * that it fits within pagetable limits. */
-	if (mmu_psize_defs[psize].shift &&
-		mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
-		(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
-		/* Return if huge page size has already been setup or is the
-		 * same as the base page size. */
-		if (mmu_huge_psizes[psize] ||
-		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
-			return;
-		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
+	 * that it fits within pagetable and slice limits. */
+	if (!is_power_of_2(size)
+	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
+		return -EINVAL;
 
-		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
-			pdshift = PMD_SHIFT;
-		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
-			pdshift = PUD_SHIFT;
-		else
-			pdshift = PGDIR_SHIFT;
-		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
-	}
+	if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
+		return -EINVAL;
+
+#ifdef CONFIG_SPU_FS_64K_LS
+	/* Disable support for 64K huge pages when 64K SPU local store
+	 * support is enabled as the current implementation conflicts.
+	 */
+	if (shift == PAGE_SHIFT_64K)
+		return -EINVAL;
+#endif /* CONFIG_SPU_FS_64K_LS */
+
+	BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
+
+	/* Return if huge page size has already been setup */
+	if (size_to_hstate(size))
+		return 0;
+
+	hugetlb_add_hstate(shift - PAGE_SHIFT);
+
+	return 0;
 }
 
 static int __init hugepage_setup_sz(char *str)
 {
 	unsigned long long size;
-	int mmu_psize;
-	int shift;
 
 	size = memparse(str, &str);
 
-	shift = __ffs(size);
-	mmu_psize = shift_to_mmu_psize(shift);
-	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)
-		set_huge_psize(mmu_psize);
-	else
+	if (add_huge_page_size(size) != 0)
 		printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
 
 	return 1;
@@ -720,31 +707,40 @@ static int __init hugetlbpage_init(void)
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
 
-	/* Add supported huge page sizes.  Need to change HUGE_MAX_HSTATE
-	 * and adjust PTE_NONCACHE_NUM if the number of supported huge page
-	 * sizes changes.
-	 */
-	set_huge_psize(MMU_PAGE_16M);
-	set_huge_psize(MMU_PAGE_16G);
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+		unsigned shift;
+		unsigned pdshift;
 
-	/* Temporarily disable support for 64K huge pages when 64K SPU local
-	 * store support is enabled as the current implementation conflicts.
-	 */
-#ifndef CONFIG_SPU_FS_64K_LS
-	set_huge_psize(MMU_PAGE_64K);
-#endif
+		if (!mmu_psize_defs[psize].shift)
+			continue;
 
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
-			if (!PGT_CACHE(mmu_huge_psizes[psize]))
-				panic("hugetlbpage_init(): could not create "
-				      "pgtable cache for %d bit pagesize\n",
-				      mmu_psize_to_shift(psize));
-		}
+		shift = mmu_psize_to_shift(psize);
+
+		if (add_huge_page_size(1ULL << shift) < 0)
+			continue;
+
+		if (shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+
+		pgtable_cache_add(pdshift - shift, NULL);
+		if (!PGT_CACHE(pdshift - shift))
+			panic("hugetlbpage_init(): could not create "
+			      "pgtable cache for %d bit pagesize\n", shift);
 	}
 
+
+	/* Set default large page size. Currently, we pick 16M or 1M
+	 * depending on what is available
+	 */
+	if (mmu_psize_defs[MMU_PAGE_16M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
+	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
+
 	return 0;
 }
-
 module_init(hugetlbpage_init);
Index: linux-a2/arch/powerpc/include/asm/page_64.h
===================================================================
--- linux-a2.orig/arch/powerpc/include/asm/page_64.h	2009-10-15 16:39:59.000000000 +1100
+++ linux-a2/arch/powerpc/include/asm/page_64.h	2009-10-15 16:41:34.000000000 +1100
@@ -90,7 +90,7 @@ extern unsigned int HPAGE_SHIFT;
 #define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
-#define HUGE_MAX_HSTATE		3
+#define HUGE_MAX_HSTATE		(MMU_PAGE_COUNT-1)
 
 #endif /* __ASSEMBLY__ */
 
Index: linux-a2/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- linux-a2.orig/arch/powerpc/mm/hash_utils_64.c	2009-10-15 16:40:47.000000000 +1100
+++ linux-a2/arch/powerpc/mm/hash_utils_64.c	2009-10-15 16:41:33.000000000 +1100
@@ -481,16 +481,6 @@ static void __init htab_init_page_sizes(
 #ifdef CONFIG_HUGETLB_PAGE
 	/* Reserve 16G huge page memory sections for huge pages */
 	of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
-
-/* Set default large page size. Currently, we pick 16M or 1M depending
-	 * on what is available
-	 */
-	if (mmu_psize_defs[MMU_PAGE_16M].shift)
-		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
-	/* With 4k/4level pagetables, we can't (for now) cope with a
-	 * huge page size < PMD_SIZE */
-	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 


* [5/6] Split hash MMU specific hugepage code into a new file
From: David Gibson @ 2009-10-27  5:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

This patch separates the parts of hugetlbpage.c which are inherently
specific to the hash MMU into a new hugetlbpage-hash64.c file.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h   |    3 
 arch/powerpc/mm/Makefile             |    5 -
 arch/powerpc/mm/hugetlbpage-hash64.c |  167 ++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/hugetlbpage.c        |  168 -----------------------------------
 4 files changed, 176 insertions(+), 167 deletions(-)

Index: working-2.6/arch/powerpc/mm/Makefile
===================================================================
--- working-2.6.orig/arch/powerpc/mm/Makefile	2009-10-27 15:07:38.000000000 +1100
+++ working-2.6/arch/powerpc/mm/Makefile	2009-10-27 15:08:09.000000000 +1100
@@ -28,7 +28,10 @@ obj-$(CONFIG_44x)		+= 44x_mmu.o
 obj-$(CONFIG_FSL_BOOKE)		+= fsl_booke_mmu.o
 obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
-obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
+ifeq ($(CONFIG_HUGETLB_PAGE),y)
+obj-y				+= hugetlbpage.o
+obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
+endif
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
Index: working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c	2009-10-27 15:08:09.000000000 +1100
@@ -0,0 +1,167 @@
+/*
+ * PPC64 Huge TLB Page Support for hash based MMUs (POWER4 and later)
+ *
+ * Copyright (C) 2003 David Gibson, IBM Corporation.
+ *
+ * Based on the IA-32 version:
+ * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/cacheflush.h>
+#include <asm/machdep.h>
+
+/*
+ * Called by asm hashtable.S for doing lazy icache flush
+ */
+static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
+					pte_t pte, int trap, unsigned long sz)
+{
+	struct page *page;
+	int i;
+
+	if (!pfn_valid(pte_pfn(pte)))
+		return rflags;
+
+	page = pte_page(pte);
+
+	/* page is dirty */
+	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
+		if (trap == 0x400) {
+			for (i = 0; i < (sz / PAGE_SIZE); i++)
+				__flush_dcache_icache(page_address(page+i));
+			set_bit(PG_arch_1, &page->flags);
+		} else {
+			rflags |= HPTE_R_N;
+		}
+	}
+	return rflags;
+}
+
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize)
+{
+	unsigned long old_pte, new_pte;
+	unsigned long va, rflags, pa, sz;
+	long slot;
+	int err = 1;
+
+	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
+
+	/* Search the Linux page table for a match with va */
+	va = hpt_va(ea, vsid, ssize);
+
+	/*
+	 * Check the user's access rights to the page.  If access should be
+	 * prevented then send the problem up to do_page_fault.
+	 */
+	if (unlikely(access & ~pte_val(*ptep)))
+		goto out;
+	/*
+	 * At this point, we have a pte (old_pte) which can be used to build
+	 * or update an HPTE. There are 2 cases:
+	 *
+	 * 1. There is a valid (present) pte with no associated HPTE (this is
+	 *	the most common case)
+	 * 2. There is a valid (present) pte with an associated HPTE. The
+	 *	current values of the pp bits in the HPTE prevent access
+	 *	because we are doing software DIRTY bit management and the
+	 *	page is currently not DIRTY.
+	 */
+
+
+	do {
+		old_pte = pte_val(*ptep);
+		if (old_pte & _PAGE_BUSY)
+			goto out;
+		new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
+	} while(old_pte != __cmpxchg_u64((unsigned long *)ptep,
+					 old_pte, new_pte));
+
+	rflags = 0x2 | (!(new_pte & _PAGE_RW));
+ 	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
+	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
+	sz = ((1UL) << shift);
+	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
+		/* No CPU has hugepages but lacks no execute, so we
+		 * don't need to worry about that case */
+		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
+						       trap, sz);
+
+	/* Check if pte already has an hpte (case 2) */
+	if (unlikely(old_pte & _PAGE_HASHPTE)) {
+		/* There MIGHT be an HPTE for this pte */
+		unsigned long hash, slot;
+
+		hash = hpt_hash(va, shift, ssize);
+		if (old_pte & _PAGE_F_SECOND)
+			hash = ~hash;
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += (old_pte & _PAGE_F_GIX) >> 12;
+
+		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
+					 ssize, local) == -1)
+			old_pte &= ~_PAGE_HPTEFLAGS;
+	}
+
+	if (likely(!(old_pte & _PAGE_HASHPTE))) {
+		unsigned long hash = hpt_hash(va, shift, ssize);
+		unsigned long hpte_group;
+
+		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
+
+repeat:
+		hpte_group = ((hash & htab_hash_mask) *
+			      HPTES_PER_GROUP) & ~0x7UL;
+
+		/* clear HPTE slot informations in new PTE */
+#ifdef CONFIG_PPC_64K_PAGES
+		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HPTE_SUB0;
+#else
+		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
+#endif
+		/* Add in WIMG bits */
+		rflags |= (new_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+				      _PAGE_COHERENT | _PAGE_GUARDED));
+
+		/* Insert into the hash table, primary slot */
+		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
+					  mmu_psize, ssize);
+
+		/* Primary is full, try the secondary */
+		if (unlikely(slot == -1)) {
+			hpte_group = ((~hash & htab_hash_mask) *
+				      HPTES_PER_GROUP) & ~0x7UL;
+			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
+						  HPTE_V_SECONDARY,
+						  mmu_psize, ssize);
+			if (slot == -1) {
+				if (mftb() & 0x1)
+					hpte_group = ((hash & htab_hash_mask) *
+						      HPTES_PER_GROUP)&~0x7UL;
+
+				ppc_md.hpte_remove(hpte_group);
+				goto repeat;
+                        }
+		}
+
+		if (unlikely(slot == -2))
+			panic("hash_huge_page: pte_insert failed\n");
+
+		new_pte |= (slot << 12) & (_PAGE_F_SECOND | _PAGE_F_GIX);
+	}
+
+	/*
+	 * No need to use ldarx/stdcx here
+	 */
+	*ptep = __pte(new_pte & ~_PAGE_BUSY);
+
+	err = 0;
+
+ out:
+	return err;
+}
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:08:07.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-27 15:08:09.000000000 +1100
@@ -7,29 +7,17 @@
  * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
  */
 
-#include <linux/init.h>
-#include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/io.h>
 #include <linux/hugetlb.h>
-#include <linux/pagemap.h>
-#include <linux/slab.h>
-#include <linux/err.h>
-#include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
-#include <asm/tlbflush.h>
-#include <asm/mmu_context.h>
-#include <asm/machdep.h>
-#include <asm/cputable.h>
-#include <asm/spu.h>
 
 #define PAGE_SHIFT_64K	16
 #define PAGE_SHIFT_16M	24
 #define PAGE_SHIFT_16G	34
 
-#define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
-#define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
 #define MAX_NUMBER_GPAGES	1024
 
 /* Tracks the 16G pages after the device tree is scanned and before the
@@ -502,158 +490,6 @@ unsigned long vma_mmu_pagesize(struct vm
 	return 1UL << mmu_psize_to_shift(psize);
 }
 
-/*
- * Called by asm hashtable.S for doing lazy icache flush
- */
-static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
-					pte_t pte, int trap, unsigned long sz)
-{
-	struct page *page;
-	int i;
-
-	if (!pfn_valid(pte_pfn(pte)))
-		return rflags;
-
-	page = pte_page(pte);
-
-	/* page is dirty */
-	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
-		if (trap == 0x400) {
-			for (i = 0; i < (sz / PAGE_SIZE); i++)
-				__flush_dcache_icache(page_address(page+i));
-			set_bit(PG_arch_1, &page->flags);
-		} else {
-			rflags |= HPTE_R_N;
-		}
-	}
-	return rflags;
-}
-
-int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
-		     pte_t *ptep, unsigned long trap, int local, int ssize,
-		     unsigned int shift, unsigned int mmu_psize)
-{
-	unsigned long old_pte, new_pte;
-	unsigned long va, rflags, pa, sz;
-	long slot;
-	int err = 1;
-
-	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
-
-	/* Search the Linux page table for a match with va */
-	va = hpt_va(ea, vsid, ssize);
-
-	/* 
-	 * Check the user's access rights to the page.  If access should be
-	 * prevented then send the problem up to do_page_fault.
-	 */
-	if (unlikely(access & ~pte_val(*ptep)))
-		goto out;
-	/*
-	 * At this point, we have a pte (old_pte) which can be used to build
-	 * or update an HPTE. There are 2 cases:
-	 *
-	 * 1. There is a valid (present) pte with no associated HPTE (this is 
-	 *	the most common case)
-	 * 2. There is a valid (present) pte with an associated HPTE. The
-	 *	current values of the pp bits in the HPTE prevent access
-	 *	because we are doing software DIRTY bit management and the
-	 *	page is currently not DIRTY. 
-	 */
-
-
-	do {
-		old_pte = pte_val(*ptep);
-		if (old_pte & _PAGE_BUSY)
-			goto out;
-		new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
-	} while(old_pte != __cmpxchg_u64((unsigned long *)ptep,
-					 old_pte, new_pte));
-
-	rflags = 0x2 | (!(new_pte & _PAGE_RW));
- 	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
-	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	sz = ((1UL) << shift);
-	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
-		/* No CPU has hugepages but lacks no execute, so we
-		 * don't need to worry about that case */
-		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
-						       trap, sz);
-
-	/* Check if pte already has an hpte (case 2) */
-	if (unlikely(old_pte & _PAGE_HASHPTE)) {
-		/* There MIGHT be an HPTE for this pte */
-		unsigned long hash, slot;
-
-		hash = hpt_hash(va, shift, ssize);
-		if (old_pte & _PAGE_F_SECOND)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (old_pte & _PAGE_F_GIX) >> 12;
-
-		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
-					 ssize, local) == -1)
-			old_pte &= ~_PAGE_HPTEFLAGS;
-	}
-
-	if (likely(!(old_pte & _PAGE_HASHPTE))) {
-		unsigned long hash = hpt_hash(va, shift, ssize);
-		unsigned long hpte_group;
-
-		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
-
-repeat:
-		hpte_group = ((hash & htab_hash_mask) *
-			      HPTES_PER_GROUP) & ~0x7UL;
-
-		/* clear HPTE slot informations in new PTE */
-#ifdef CONFIG_PPC_64K_PAGES
-		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HPTE_SUB0;
-#else
-		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
-#endif
-		/* Add in WIMG bits */
-		rflags |= (new_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
-				      _PAGE_COHERENT | _PAGE_GUARDED));
-
-		/* Insert into the hash table, primary slot */
-		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
-					  mmu_psize, ssize);
-
-		/* Primary is full, try the secondary */
-		if (unlikely(slot == -1)) {
-			hpte_group = ((~hash & htab_hash_mask) *
-				      HPTES_PER_GROUP) & ~0x7UL; 
-			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
-						  HPTE_V_SECONDARY,
-						  mmu_psize, ssize);
-			if (slot == -1) {
-				if (mftb() & 0x1)
-					hpte_group = ((hash & htab_hash_mask) *
-						      HPTES_PER_GROUP)&~0x7UL;
-
-				ppc_md.hpte_remove(hpte_group);
-				goto repeat;
-                        }
-		}
-
-		if (unlikely(slot == -2))
-			panic("hash_huge_page: pte_insert failed\n");
-
-		new_pte |= (slot << 12) & (_PAGE_F_SECOND | _PAGE_F_GIX);
-	}
-
-	/*
-	 * No need to use ldarx/stdcx here
-	 */
-	*ptep = __pte(new_pte & ~_PAGE_BUSY);
-
-	err = 0;
-
- out:
-	return err;
-}
-
 static int __init add_huge_page_size(unsigned long long size)
 {
 	int shift = __ffs(size);
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-10-27 15:08:04.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-10-27 15:08:09.000000000 +1100
@@ -3,6 +3,9 @@
 
 #include <asm/page.h>
 
+pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
+				 unsigned long addr, unsigned *shift);
+
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
 


* [6/6] Bring hugepage PTE accessor functions back into sync with normal accessors
From: David Gibson @ 2009-10-27  5:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20091027052258.GD20694@yookeroo.seuss>

The hugepage arch code provides a number of hook functions/macros
which mirror the functionality of various normal page pte access
functions.  Various changes in the normal page accessors (in
particular BenH's recent changes to the handling of lazy icache
flushing and PAGE_EXEC) have caused the hugepage versions to get out
of sync with the originals.  In some cases, this is a bug, at least on
some MMU types.

One of the reasons that some hooks were not identical to the normal
page versions is that the fact that we're dealing with a hugepage
needed to be passed down to use the correct dcache-icache flush function.
This patch makes the main flush_dcache_icache_page() function hugepage
aware (by checking for the PageCompound flag).  That in turn means we
can make set_huge_pte_at() just a call to set_pte_at() bringing it
back into sync.  As a bonus, this lets us remove the
hash_huge_page_do_lazy_icache() function, replacing it with a call to
the hash_page_do_lazy_icache() function it was based on.

Some other hugepage pte access hooks - huge_ptep_get_and_clear() and
huge_ptep_clear_flush() - are not so easily unified, but this patch at
least brings them back into sync with the current versions of the
corresponding normal page functions.
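
As a rough illustration of the dispatch described above, here is a
minimal userspace sketch. It is not the kernel code: struct model_page
and every model_* name are hypothetical stand-ins for struct page,
PageCompound(), compound_order() and __flush_dcache_icache(), and the
"flush" only counts calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only what the sketch needs. */
struct model_page {
	int compound;		/* models PageCompound(page) */
	unsigned int order;	/* models compound_order(page) */
};

/* Counts modelled per-subpage flushes (no real cache work is done). */
static unsigned long flushed_subpages;

static void model_flush_one(struct model_page *page, size_t idx)
{
	(void)page;
	(void)idx;
	flushed_subpages++;
}

/* Models flush_dcache_icache_hugepage(): flush every base page. */
static void model_flush_hugepage(struct model_page *page)
{
	size_t i;

	assert(page->compound);
	for (i = 0; i < ((size_t)1 << page->order); i++)
		model_flush_one(page, i);
}

/* Models the hugepage-aware flush_dcache_icache_page() entry point. */
static void model_flush_dcache_icache_page(struct model_page *page)
{
	if (page->compound) {
		model_flush_hugepage(page);
		return;
	}
	model_flush_one(page, 0);
}
```

With 4K base pages, a 16M hugepage has order 12, so the sketch would
flush 4096 subpages for it and exactly one for a normal page.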

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h    |   25 +++++++++++++++++++------
 arch/powerpc/include/asm/mmu-hash64.h |    1 +
 arch/powerpc/mm/hash_utils_64.c       |    2 +-
 arch/powerpc/mm/hugetlbpage-hash64.c  |   30 +-----------------------------
 arch/powerpc/mm/hugetlbpage.c         |   31 ++++++++++---------------------
 arch/powerpc/mm/mem.c                 |   17 +++++++++++++----
 6 files changed, 45 insertions(+), 61 deletions(-)

Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-10-27 14:50:34.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-10-27 14:56:31.000000000 +1100
@@ -6,6 +6,8 @@
 pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
 				 unsigned long addr, unsigned *shift);
 
+void flush_dcache_icache_hugepage(struct page *page);
+
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
 
@@ -13,12 +15,6 @@ void hugetlb_free_pgd_range(struct mmu_g
 			    unsigned long end, unsigned long floor,
 			    unsigned long ceiling);
 
-void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-		     pte_t *ptep, pte_t pte);
-
-pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
-			      pte_t *ptep);
-
 /*
  * The version of vma_mmu_pagesize() in arch/powerpc/mm/hugetlbpage.c needs
  * to override the version in mm/hugetlb.c
@@ -44,9 +40,26 @@ static inline void hugetlb_prefault_arch
 {
 }
 
+
+static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
+				   pte_t *ptep, pte_t pte)
+{
+	set_pte_at(mm, addr, ptep, pte);
+}
+
+static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
+					    unsigned long addr, pte_t *ptep)
+{
+	unsigned long old = pte_update(mm, addr, ptep, ~0UL, 1);
+	return __pte(old);
+}
+
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep)
 {
+	pte_t pte;
+	pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+	flush_tlb_page(vma, addr);
 }
 
 static inline int huge_pte_none(pte_t pte)
Index: working-2.6/arch/powerpc/include/asm/mmu-hash64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/mmu-hash64.h	2009-10-27 14:36:36.000000000 +1100
+++ working-2.6/arch/powerpc/include/asm/mmu-hash64.h	2009-10-27 14:55:22.000000000 +1100
@@ -245,6 +245,7 @@ extern int __hash_page_64K(unsigned long
 			   unsigned long vsid, pte_t *ptep, unsigned long trap,
 			   unsigned int local, int ssize);
 struct mm_struct;
+unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
 extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap);
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 		     pte_t *ptep, unsigned long trap, int local, int ssize,
Index: working-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2009-10-27 14:42:47.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hash_utils_64.c	2009-10-27 14:55:22.000000000 +1100
@@ -775,7 +775,7 @@ unsigned int hash_page_do_lazy_icache(un
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-			__flush_dcache_icache(page_address(page));
+			flush_dcache_icache_page(page);
 			set_bit(PG_arch_1, &page->flags);
 		} else
 			pp |= HPTE_R_N;
Index: working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage-hash64.c	2009-10-27 14:50:34.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c	2009-10-27 14:55:22.000000000 +1100
@@ -14,33 +14,6 @@
 #include <asm/cacheflush.h>
 #include <asm/machdep.h>
 
-/*
- * Called by asm hashtable.S for doing lazy icache flush
- */
-static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
-					pte_t pte, int trap, unsigned long sz)
-{
-	struct page *page;
-	int i;
-
-	if (!pfn_valid(pte_pfn(pte)))
-		return rflags;
-
-	page = pte_page(pte);
-
-	/* page is dirty */
-	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
-		if (trap == 0x400) {
-			for (i = 0; i < (sz / PAGE_SIZE); i++)
-				__flush_dcache_icache(page_address(page+i));
-			set_bit(PG_arch_1, &page->flags);
-		} else {
-			rflags |= HPTE_R_N;
-		}
-	}
-	return rflags;
-}
-
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 		     pte_t *ptep, unsigned long trap, int local, int ssize,
 		     unsigned int shift, unsigned int mmu_psize)
@@ -89,8 +62,7 @@ int __hash_page_huge(unsigned long ea, u
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
 		/* No CPU has hugepages but lacks no execute, so we
 		 * don't need to worry about that case */
-		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
-						       trap, sz);
+		rflags = hash_page_do_lazy_icache(rflags, __pte(old_pte), trap);
 
 	/* Check if pte already has an hpte (case 2) */
 	if (unlikely(old_pte & _PAGE_HASHPTE)) {
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-10-27 14:50:34.000000000 +1100
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-10-27 14:55:22.000000000 +1100
@@ -344,27 +344,6 @@ void hugetlb_free_pgd_range(struct mmu_g
 	} while (pgd++, addr = next, addr != end);
 }
 
-void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-		     pte_t *ptep, pte_t pte)
-{
-	if (pte_present(*ptep)) {
-		/* We open-code pte_clear because we need to pass the right
-		 * argument to hpte_need_flush (huge / !huge). Might not be
-		 * necessary anymore if we make hpte_need_flush() get the
-		 * page size from the slices
-		 */
-		pte_update(mm, addr, ptep, ~0UL, 1);
-	}
-	*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
-}
-
-pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
-			      pte_t *ptep)
-{
-	unsigned long old = pte_update(mm, addr, ptep, ~0UL, 1);
-	return __pte(old);
-}
-
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
@@ -580,3 +559,13 @@ static int __init hugetlbpage_init(void)
 	return 0;
 }
 module_init(hugetlbpage_init);
+
+void flush_dcache_icache_hugepage(struct page *page)
+{
+	int i;
+
+	BUG_ON(!PageCompound(page));
+
+	for (i = 0; i < (1UL << compound_order(page)); i++)
+		__flush_dcache_icache(page_address(page+i));
+}
Index: working-2.6/arch/powerpc/mm/mem.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/mem.c	2009-10-27 12:52:00.000000000 +1100
+++ working-2.6/arch/powerpc/mm/mem.c	2009-10-27 14:55:22.000000000 +1100
@@ -32,6 +32,7 @@
 #include <linux/pagemap.h>
 #include <linux/suspend.h>
 #include <linux/lmb.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/prom.h>
@@ -417,18 +418,26 @@ EXPORT_SYMBOL(flush_dcache_page);
 
 void flush_dcache_icache_page(struct page *page)
 {
+#ifdef CONFIG_HUGETLB_PAGE
+	if (PageCompound(page)) {
+		flush_dcache_icache_hugepage(page);
+		return;
+	}
+#endif
 #ifdef CONFIG_BOOKE
-	void *start = kmap_atomic(page, KM_PPC_SYNC_ICACHE);
-	__flush_dcache_icache(start);
-	kunmap_atomic(start, KM_PPC_SYNC_ICACHE);
+	{
+		void *start = kmap_atomic(page, KM_PPC_SYNC_ICACHE);
+		__flush_dcache_icache(start);
+		kunmap_atomic(start, KM_PPC_SYNC_ICACHE);
+	}
 #elif defined(CONFIG_8xx) || defined(CONFIG_PPC64)
 	/* On 8xx there is no need to kmap since highmem is not supported */
 	__flush_dcache_icache(page_address(page)); 
 #else
 	__flush_dcache_icache_phys(page_to_pfn(page) << PAGE_SHIFT);
 #endif
-
 }
+
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
 	clear_page(page);


* [PATCH v3] powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule
From: Benjamin Herrenschmidt @ 2009-10-27  5:41 UTC (permalink / raw)
  To: Valentine Barshak; +Cc: olof, linuxppc-dev, paulus
In-Reply-To: <1256601324.2076.49.camel@pasglop>


> So I _think_ that the irqs on/off accounting for lockdep isn't quite
> right. What do you think of this slightly modified version ? I've only
> done a quick boot test on a G5 with lockdep enabled and a played a bit,
> nothing shows up so far but it's definitely not conclusive.
> 
> The main difference is that I call trace_hardirqs_off to "advertise"
> the fact that we are soft-disabling (it could be a dup, but at this
> stage this is no big deal, but it's not always, like in syscall return
> the kernel thinks we have interrupts enabled and could thus get out
> of sync without that).
> 
> I also mark the PACA hard disable to reflect the MSR:EE state before
> calling into preempt_schedule_irq().

All right, second thought :-)

It's probably simpler to just keep hardirqs off. Code is smaller and
simpler and the scheduler will re-enable them soon enough anyways.

This version of the patch also spaces the code a bit and adds comments,
which make both the code and the patch more readable.

Cheers,
Ben.
 
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

[PATCH v3] powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule

Based on an original patch by Valentine Barshak <vbarshak@ru.mvista.com>

Use preempt_schedule_irq to prevent infinite irq-entry and
eventual stack overflow problems with fast-paced IRQ sources.

This kind of problem has been observed on the PASemi Electra IDE
controller. We have to make sure we are soft-disabled before calling
preempt_schedule_irq and hard disable interrupts after that
to avoid unrecoverable exceptions.

This patch also moves the "clrrdi r9,r1,THREAD_SHIFT" out of
the #ifdef CONFIG_PPC_BOOK3E scope, since r9 is clobbered
and has to be restored in both cases.
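
For readers who prefer C to assembly, the control flow can be sketched
roughly as below. This is only a userspace model under stated
assumptions: struct model_cpu and the model_* functions are
hypothetical, the two flags stand in for PACASOFTIRQEN/PACAHARDIRQEN,
and the model assumes the scheduler may return with MSR:EE set.

```c
#include <assert.h>

/* Hypothetical model of the per-CPU state touched by the patch. */
struct model_cpu {
	int soft_enabled;	/* models PACASOFTIRQEN */
	int hard_enabled;	/* models PACAHARDIRQEN / MSR:EE */
	int need_resched;	/* pending reschedule requests */
	int schedule_calls;	/* how often the scheduler was entered */
};

/* Models preempt_schedule_irq(): must be entered soft-disabled. */
static void model_preempt_schedule_irq(struct model_cpu *cpu)
{
	assert(!cpu->soft_enabled);
	cpu->hard_enabled = 1;	/* assumption: EE may be set on return */
	if (cpu->need_resched > 0)
		cpu->need_resched--;
	cpu->schedule_calls++;
}

/* Models the preempt path of do_work after this patch. */
static void model_do_work_preempt(struct model_cpu *cpu)
{
	/* Soft-disable and record that we are hard-disabled. */
	cpu->soft_enabled = 0;
	cpu->hard_enabled = 0;

	do {
		model_preempt_schedule_irq(cpu);
		/* Hard-disable again and update the recorded state. */
		cpu->hard_enabled = 0;
		/* Re-test the flag and loop while it is still set. */
	} while (cpu->need_resched);
}
```

The point of the sketch is that interrupts are never soft-enabled
around the scheduler call, so a fast-paced IRQ source cannot re-enter
this path recursively.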
---
 arch/powerpc/kernel/entry_64.S |   41 ++++++++++++++++++++-------------------
 1 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index f9fd54b..9763267 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -658,42 +658,43 @@ do_work:
 	cmpdi	r0,0
 	crandc	eq,cr1*4+eq,eq
 	bne	restore
-	/* here we are preempting the current task */
-1:
-#ifdef CONFIG_TRACE_IRQFLAGS
-	bl	.trace_hardirqs_on
-	/* Note: we just clobbered r10 which used to contain the previous
-	 * MSR before the hard-disabling done by the caller of do_work.
-	 * We don't have that value anymore, but it doesn't matter as
-	 * we will hard-enable unconditionally, we can just reload the
-	 * current MSR into r10
+
+	/* Here we are preempting the current task.
+	 *
+	 * Ensure interrupts are soft-disabled. We also properly mark
+	 * the PACA to reflect the fact that they are hard-disabled
+	 * and trace the change
 	 */
-	mfmsr	r10
-#endif /* CONFIG_TRACE_IRQFLAGS */
-	li	r0,1
+	li	r0,0
 	stb	r0,PACASOFTIRQEN(r13)
 	stb	r0,PACAHARDIRQEN(r13)
+	TRACE_DISABLE_INTS
+
+	/* Call the scheduler with soft IRQs off */
+1:	bl	.preempt_schedule_irq
+
+	/* Hard-disable interrupts again (and update PACA) */
 #ifdef CONFIG_PPC_BOOK3E
-	wrteei	1
-	bl	.preempt_schedule
 	wrteei	0
 #else
-	ori	r10,r10,MSR_EE
-	mtmsrd	r10,1		/* reenable interrupts */
-	bl	.preempt_schedule
 	mfmsr	r10
-	clrrdi	r9,r1,THREAD_SHIFT
-	rldicl	r10,r10,48,1	/* disable interrupts again */
+	rldicl	r10,r10,48,1
 	rotldi	r10,r10,16
 	mtmsrd	r10,1
 #endif /* CONFIG_PPC_BOOK3E */
+	li	r0,0
+	stb	r0,PACAHARDIRQEN(r13)
+
+	/* Re-test flags and eventually loop */
+	clrrdi	r9,r1,THREAD_SHIFT
 	ld	r4,TI_FLAGS(r9)
 	andi.	r0,r4,_TIF_NEED_RESCHED
 	bne	1b
 	b	restore
 
 user_work:
-#endif
+#endif /* CONFIG_PREEMPT */
+
 	/* Enable interrupts */
 #ifdef CONFIG_PPC_BOOK3E
 	wrteei	1
-- 
1.6.1.2.14.gf26b5


* [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2009-10-27  7:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list

Hi Linus !

Some of these might have been better in -rc4 or earlier, all my fault
for having some backlog that I'm still going through. So we have some
bug fixes (not necessarily regressions but also generally simple
enough that I decided to go for 2.6.32 anyways) and a few very trivial
Kconfig cleanups (outside of arch/powerpc but related to our symbols)
from Kumar that could go anytime.

The following changes since commit 964fe080d94db82a3268443e9b9ece4c60246414:
  Linus Torvalds (1):
        Merge git://git.kernel.org/.../rusty/linux-2.6-for-linus

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge

Andreas Schwab (2):
      powerpc: Fix segment mapping in vdso32
      powerpc: Align vDSO base address

Benjamin Herrenschmidt (1):
      powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule

Josh Boyer (1):
      powerpc/booke: Fix xmon single step on PowerPC Book-E

Kumar Gala (7):
      powerpc: Add a Book-3E 64-bit defconfig
      powerpc: Fix compile errors found by new ppc64e_defconfig
      powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
      powerpc: Limit memory hotplug support to PPC64 Book-3S machines
      powerpc: Minor cleanup to init/Kconfig
      powerpc: Minor cleanup to sound/ppc/Kconfig
      powerpc: Minor cleanup to lib/Kconfig.debug

Michael Neuling (1):
      powerpc/perf_events: Fix priority of MSR HV vs PR bits

Stephen Rothwell (1):
      powerpc/iseries: Remove compiler version dependent hack

 arch/powerpc/configs/ppc64e_defconfig   | 2199 +++++++++++++++++++++++++++++++
 arch/powerpc/kernel/entry_64.S          |   41 +-
 arch/powerpc/kernel/pci_64.c            |    2 +
 arch/powerpc/kernel/perf_event.c        |   17 +-
 arch/powerpc/kernel/process.c           |    2 +-
 arch/powerpc/kernel/setup_64.c          |    1 -
 arch/powerpc/kernel/vdso.c              |   11 +-
 arch/powerpc/kernel/vdso32/vdso32.lds.S |    4 +-
 arch/powerpc/platforms/iseries/Makefile |   11 +-
 arch/powerpc/platforms/iseries/dt.c     |   56 +-
 arch/powerpc/xmon/xmon.c                |   20 +-
 fs/Kconfig                              |    2 +-
 init/Kconfig                            |    2 +-
 lib/Kconfig.debug                       |    2 +-
 mm/Kconfig                              |    2 +-
 sound/ppc/Kconfig                       |    2 +-
 16 files changed, 2294 insertions(+), 80 deletions(-)
 create mode 100644 arch/powerpc/configs/ppc64e_defconfig


* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Joakim Tjernlund @ 2009-10-27  9:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1256601653.2076.51.camel@pasglop>

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 27/10/2009 01:00:53:
>
> On Mon, 2009-10-26 at 16:26 -0700, Dan Malek wrote:
> > Just be careful the get_user() doesn't regenerate the same
> > translation error you are trying to fix by being here......

Yes, I had some problems with this initially but managed to work around them.
I noticed another problem though: I got multiple TLB errors for the same
address when I did it in C. I noticed this by printk'ing every hit
for a dcbX insn in do_page_fault. I can't explain it, but it seems that
when moving to C you have to execute an rfi insn, and that might somehow
restart the dcbX insn before moving on to the page fault routine (or it
may be something totally different).

>
> It shouldn't since it will always come up with a proper DAR but
> you may want to double check before hand that your instruction
> address you are loading from is -not- your marker value for bad DAR.

Hmm, I check that the insn really is a dcbX insn, but not that the address is
!= 0x00f0. I don't see how it could be, since if something is wrong with
the insn address you get an ITLB error instead of a DTLB error.

Anyhow, things seem stalled, as I haven't heard from Scott or Rex for a while.
If this isn't working now, I really don't know what is wrong and will need
some debugging help.

>
> > It is nice doing things in C code, but you have to be aware
> > of the environment and the side effects when in this kind
>
> Yup.
>
> Cheers,
> Ben.
>
>
>


* Re: Acceleration for map_copy_from on powerpc 512x
From: Fortini Matteo @ 2009-10-27 10:02 UTC (permalink / raw)
  To: Kenneth Johansson; +Cc: linux-ppc list
In-Reply-To: <1256135453.22238.27.camel@kenjo-laptop>

The simple_map_init() works at a higher level, what I'm redefining is a 
function called by mtd->read()

As for the block size: e.g. a dd if=/dev/mtd0 of=/dev/null with the
default block size (I believe it's 512 bytes) fetches 4096 bytes at a
time from /dev/mtd0.
I'd prefer the kernel to be scheduling other tasks meanwhile, instead of 
busy-waiting on completion.

Regards

Kenneth Johansson wrote:
> On Mon, 2009-10-19 at 09:52 +0200, Fortini Matteo wrote:
>
>   
>> I didn't find a cleaner way than just #ifdef'ing the map_copy_from call 
>> and substitute with my call on relevant cases. I wonder if there is a 
>> cleaner way.
>>     
>
> Remove the call to simple_map_init() and do it manually in your driver
> with your own functions.
>
>   
>> And yes, as soon as I've cleaned up the code a little bit, I will 
>> definitely post a patch about it.
>>
>> Moreover: a huge benefit would come from exploiting DMA on these 
>> transfers, 
>>     
>
> probably depends on the block size if it's a gain or not. What is the
> size you normally see. 
>
>
>   

^ permalink raw reply

* Re: Acceleration for map_copy_from on powerpc 512x
From: Kenneth Johansson @ 2009-10-27 11:41 UTC (permalink / raw)
  To: Fortini Matteo; +Cc: linux-ppc list
In-Reply-To: <4AE6C51F.9050203@mta.it>

On Tue, 2009-10-27 at 11:02 +0100, Fortini Matteo wrote:
> The simple_map_init() works at a higher level, what I'm redefining is a 
> function called by mtd->read()

not sure I follow. What you want to do is change the access to the
flash. You do this by turning on MTD_COMPLEX_MAPPINGS and then setting
up the function pointers like is done in simple_map_init() but point to
your own functions. Now every access to the NOR flash will be done using
your functions and you can do whatever optimization you like.
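
The approach described above (fill in the map's function pointers yourself
instead of letting simple_map_init() do it) can be pictured with the
following user-space mock. Everything in it is hypothetical: struct
sketch_map and the sketch_* functions only imitate the shape of the map
callbacks and are not the MTD API, and memcpy() stands in for whatever
accelerated or DMA transfer a real driver would do.

```c
#include <stddef.h>
#include <string.h>

struct sketch_map;

/* Callback type, loosely modelled on a "copy from flash" hook. */
typedef void (*sketch_copy_from_fn)(struct sketch_map *map, void *to,
                                    unsigned long from, size_t len);

struct sketch_map {
    const unsigned char *virt;      /* base of the mapped flash window */
    unsigned long size;             /* size of the window in bytes */
    sketch_copy_from_fn copy_from;  /* access hook chosen at init time */
    unsigned long accel_calls;      /* how often the custom hook ran */
};

/* What a generic init would install: a plain copy. */
static void sketch_default_copy_from(struct sketch_map *map, void *to,
                                     unsigned long from, size_t len)
{
    memcpy(to, map->virt + from, len);
}

/* Driver-specific replacement; memcpy() is a stand-in for the fast path. */
static void sketch_accel_copy_from(struct sketch_map *map, void *to,
                                   unsigned long from, size_t len)
{
    map->accel_calls++;
    memcpy(to, map->virt + from, len);
}

/* Analogue of calling the generic init helper. */
void sketch_simple_init(struct sketch_map *map)
{
    map->copy_from = sketch_default_copy_from;
}

/* Analogue of doing the setup by hand with your own functions. */
void sketch_custom_init(struct sketch_map *map)
{
    map->copy_from = sketch_accel_copy_from;
}
```

Callers only ever go through map->copy_from, so swapping the init routine
is enough to route every read through the custom function without any
#ifdef at the call sites.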


> As for the block size: e.g. a dd if=/dev/mtd0 of=/dev/null with the
> default block size (I believe it's 512 bytes) fetches 4096 bytes at a
> time from /dev/mtd0.
> I'd prefer the kernel to be scheduling other tasks meanwhile, instead of 
> busy-waiting on completion.
> 
> Regards
> 
> Kenneth Johansson wrote:
> > On Mon, 2009-10-19 at 09:52 +0200, Fortini Matteo wrote:
> >
> >   
> >> I didn't find a cleaner way than just #ifdef'ing the map_copy_from call 
> >> and substitute with my call on relevant cases. I wonder if there is a 
> >> cleaner way.
> >>     
> >
> > Remove the call to simple_map_init() and do it manually in your driver
> > with your own functions.
> >
> >   
> >> And yes, as soon as I've cleaned up the code a little bit, I will 
> >> definitely post a patch about it.
> >>
> >> Moreover: a huge benefit would come from exploiting DMA on these 
> >> transfers, 
> >>     
> >
> > probably depends on the block size if it's a gain or not. What is the
> > size you normally see. 
> >
> >
> >   
> 
> 

^ permalink raw reply

* Micrel PHY KSZ8001 on MPC5200B FEC
From: Roman Fietze @ 2009-10-27 12:17 UTC (permalink / raw)
  To: linuxppc-dev

Hello,

We would need some help on how to make a Micrel KSZ8001 work on the
MPC5200B FEC using the kernel DENX-v2.6.3[01].

We can already boot the kernel and device tree using TFTP and this PHY
using a recent U-Boot version, so we would need some pointers on how to
accomplish that.

Add a proper PHY driver in the drivers/net/phy/ directory?

Modify the DTS? If yes, how? A link to some documentation that's not
already in the kernel sources would already help.

Is it correct, when looking at the sources, that the MPC's FEC driver
switched to the generic PHY driver interface?


Roman

-- 
Roman Fietze                Telemotive AG Büro Mühlhausen
Breitwiesen                              73347 Mühlhausen
Tel.: +49(0)7335/18493-45        http://www.telemotive.de

^ permalink raw reply

* RE: Network Stack SKB Reallocation
From: john.p.price @ 2009-10-27 13:43 UTC (permalink / raw)
  To: Jonathan Haws, linuxppc-dev
In-Reply-To: <BB99A6BA28709744BF22A68E6D7EB51F0330BEB38D@midas.usurf.usu.edu>

Hi Jonathan, I've read your post with great interest.

I have a custom board with custom fpga's connected to the PPC405EX EBC
bus on banks 2 and 3.  Running linux 2.6.29.1.  The board collects data
and dma's it to a scatter-gather dma buffer and then uses TCP/writev +
Ethernet 9KB Jumbo packets to transmit data off of the board.

Our systems have 7 of these data collection boards, we are seeing the
following stack trace, the boards do not crash apparently they just
continue to run.

~ # BUG: Bad page state in process dcb  pfn:080db
page:c03d2b60 flags:00044000 count:0 mapcount:0 mapping:(null)
index:3718
Call Trace:
[ce871980] [c0006bc0] show_stack+0x44/0x16c (unreliable)
[ce8719c0] [c005374c] bad_page+0x94/0x12c
[ce8719e0] [c0053c30] __free_pages_ok+0x364/0x3ec
[ce871a20] [c0057c00] put_compound_page+0x48/0x60
[ce871a30] [c0075520] kfree+0xd4/0xd8
[ce871a40] [c0175140] skb_release_data+0x80/0xc8
[ce871a50] [c0174f30] __kfree_skb+0x18/0xe8
[ce871a60] [c01ab9e4] tcp_ack+0x48c/0x1a84
[ce871af0] [c01add8c] tcp_rcv_state_process+0x70/0x9ac
[ce871b10] [c01b47fc] tcp_v4_do_rcv+0x9c/0x1a8
[ce871b40] [c01b6328] tcp_v4_rcv+0x4d4/0x5b8
[ce871b70] [c0198b90] ip_local_deliver+0x90/0x140
[ce871b90] [c0198f24] ip_rcv+0x2e4/0x4bc


The above occurs on at least one of the seven boards over the course of
a multi-day run.

Another trace from an actual crash, occurs not so often;

DCB: tcp connection request accepted - line length: 18168
Unable to handle kernel paging request for data at address 0x0004009c
Faulting instruction address: 0xc017500c
Oops: Kernel access of bad area, sig: 11 [#1]
DCB
Modules linked in: ds3b3 dma ds3b2
NIP: c017500c LR: c01351f8 CTR: c013513c
REGS: cd779aa0 TRAP: 0300   Not tainted  (2.6.29.1)
MSR: 00029030 <EE,ME,CE,IR,DR>  CR: 42424024  XER: 2000005f
DEAR: 0004009c, ESR: 00000000
TASK = ce8883f0[770] 'dcb' THREAD: cd778000
GPR00: 00000060 cd779b50 ce8883f0 00040000 00000020 c001220c 00000001 00000014
GPR08: 00000002 0004009c 00000003 000000c0 22424022 10183238 000022f4 00000001
GPR16: 00000020 000022f4 000237c0 00000000 cd6590e4 13511000 00000008 bfe9d520
GPR24: ce8e2c34 ce8e2c2c ce811ce0 00000001 00000018 ce811360 00000300 ce8113c0
NIP [c017500c] kfree_skb+0xc/0x38
LR [c01351f8] emac_poll_tx+0xbc/0x310
Call Trace:
[cd779b50] [c001220c] __mtdcr_table+0x0/0x3ff8 (unreliable)
[cd779b70] [c0132248] mal_poll+0x44/0x1c8
[cd779ba0] [c017fb10] net_rx_action+0x94/0x188
[cd779bd0] [c0024740] __do_softirq+0x84/0x124
[cd779c00] [c0004f10] do_softirq+0x58/0x5c
[cd779c10] [c00245b0] irq_exit+0x48/0x58
[cd779c20] [c0004fb4] do_IRQ+0xa0/0xc4
[cd779c40] [c000eba0] ret_from_except+0x0/0x18
[cd779d00] [c01a4ec0] tcp_sendmsg+0x220/0xbf0
[cd779d80] [c016dd18] sock_aio_write+0xf0/0x104
[cd779de0] [c007a5b0] do_sync_readv_writev+0xbc/0x130
[cd779e90] [c007ae54] do_readv_writev+0xb4/0x1c4
[cd779f10] [c007b010] sys_writev+0x4c/0x90
[cd779f40] [c000e558] ret_from_syscall+0x0/0x3c
Instruction dump:
3d20c02b 80695ac4 7fe4fb78 4bf00fb9 80010014 83e1000c 7c0803a6 38210010
4e800020 2c030000 4d820020 3923009c <8003009c> 2f800001 409e0008 4bffff00
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..


So the questions I have for you are as follows;

	1. Does either of these traces appear related to the issue your
driver patch will fix?

	2. If I set path MTU to 1500, will that avoid the issue?

	3. Would you have any further suggestions?

thanks



-----Original Message-----
From: linuxppc-dev-bounces+john.p.price=l-3com.com@lists.ozlabs.org
[mailto:linuxppc-dev-bounces+john.p.price=l-3com.com@lists.ozlabs.org]
On Behalf Of Jonathan Haws
Sent: Monday, October 26, 2009 2:43 PM
To: linuxppc-dev@lists.ozlabs.org
Subject: Network Stack SKB Reallocation

Quick question about the network stack in general:

Does the stack itself release an SKB allocated by the device driver back
to the heap upstream, or does it require that the device driver handle
that?

Thanks!

Jonathan


_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* RE: Network Stack SKB Reallocation
From: Jonathan Haws @ 2009-10-27 14:28 UTC (permalink / raw)
  To: john.p.price@l-3com.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <B9639434CFA424438117913649CDB78710729A20@MA_EXCHANGE.corp.sds.l-3com.com>

Hi John,

> I have a custom board with custom fpga's connected to the PPC405EX EBC
> bus on banks 2 and 3.  Running linux 2.6.29.1.  The board collects data
> and dma's it to a scatter-gather dma buffer and then uses TCP/writev +
> Ethernet 9KB Jumbo packets to transmit data off of the board.

We are also doing something similar, however we do not transmit the data
off the board - we are storing it to disk.  What we are seeing is that
memory gets so fragmented during normal operation that the EMAC driver
cannot find a contiguous block of memory large enough for the MTU (a
9000 byte MTU requires 4 pages of memory, or 16384 bytes).
>
> Our systems have 7 of these data collection boards, we are seeing the
> following stack trace, the boards do not crash apparently they just
> continue to run.
>
> ~ # BUG: Bad page state in process dcb  pfn:080db
> page:c03d2b60 flags:00044000 count:0 mapcount:0 mapping:(null)
> index:3718
> Call Trace:
> [ce871980] [c0006bc0] show_stack+0x44/0x16c (unreliable)
> [ce8719c0] [c005374c] bad_page+0x94/0x12c
> [ce8719e0] [c0053c30] __free_pages_ok+0x364/0x3ec
> [ce871a20] [c0057c00] put_compound_page+0x48/0x60
> [ce871a30] [c0075520] kfree+0xd4/0xd8
> [ce871a40] [c0175140] skb_release_data+0x80/0xc8
> [ce871a50] [c0174f30] __kfree_skb+0x18/0xe8
> [ce871a60] [c01ab9e4] tcp_ack+0x48c/0x1a84
> [ce871af0] [c01add8c] tcp_rcv_state_process+0x70/0x9ac
> [ce871b10] [c01b47fc] tcp_v4_do_rcv+0x9c/0x1a8
> [ce871b40] [c01b6328] tcp_v4_rcv+0x4d4/0x5b8
> [ce871b70] [c0198b90] ip_local_deliver+0x90/0x140
> [ce871b90] [c0198f24] ip_rcv+0x2e4/0x4bc
>
>
> The above occurs on at least one of the seven boards over the course of
> a multi-day run.

This is very similar output that I would get when memory got fragmented,
however my BUG showed its face when I tried to allocate, not to free, so
the issue might be somewhere else.

> Another trace from an actual crash, occurs not so often;
>
> DCB: tcp connection request accepted - line length: 18168
> Unable to handle kernel paging request for data at address 0x0004009c
> Faulting instruction address: 0xc017500c
> Oops: Kernel access of bad area, sig: 11 [#1]
> DCB
> Modules linked in: ds3b3 dma ds3b2
> NIP: c017500c LR: c01351f8 CTR: c013513c
> REGS: cd779aa0 TRAP: 0300   Not tainted  (2.6.29.1)
> MSR: 00029030 <EE,ME,CE,IR,DR>  CR: 42424024  XER: 2000005f
> DEAR: 0004009c, ESR: 00000000
> TASK = ce8883f0[770] 'dcb' THREAD: cd778000
> GPR00: 00000060 cd779b50 ce8883f0 00040000 00000020 c001220c 00000001 00000014
> GPR08: 00000002 0004009c 00000003 000000c0 22424022 10183238 000022f4 00000001
> GPR16: 00000020 000022f4 000237c0 00000000 cd6590e4 13511000 00000008 bfe9d520
> GPR24: ce8e2c34 ce8e2c2c ce811ce0 00000001 00000018 ce811360 00000300 ce8113c0
> NIP [c017500c] kfree_skb+0xc/0x38
> LR [c01351f8] emac_poll_tx+0xbc/0x310
> Call Trace:
> [cd779b50] [c001220c] __mtdcr_table+0x0/0x3ff8 (unreliable)
> [cd779b70] [c0132248] mal_poll+0x44/0x1c8
> [cd779ba0] [c017fb10] net_rx_action+0x94/0x188
> [cd779bd0] [c0024740] __do_softirq+0x84/0x124
> [cd779c00] [c0004f10] do_softirq+0x58/0x5c
> [cd779c10] [c00245b0] irq_exit+0x48/0x58
> [cd779c20] [c0004fb4] do_IRQ+0xa0/0xc4
> [cd779c40] [c000eba0] ret_from_except+0x0/0x18
> [cd779d00] [c01a4ec0] tcp_sendmsg+0x220/0xbf0
> [cd779d80] [c016dd18] sock_aio_write+0xf0/0x104
> [cd779de0] [c007a5b0] do_sync_readv_writev+0xbc/0x130
> [cd779e90] [c007ae54] do_readv_writev+0xb4/0x1c4
> [cd779f10] [c007b010] sys_writev+0x4c/0x90
> [cd779f40] [c000e558] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 3d20c02b 80695ac4 7fe4fb78 4bf00fb9 80010014 83e1000c 7c0803a6 38210010
> 4e800020 2c030000 4d820020 3923009c <8003009c> 2f800001 409e0008 4bffff00
> Kernel panic - not syncing: Fatal exception in interrupt
> Rebooting in 1 seconds..
>
>
> So the questions I have for you are as follows;
>
> 	1. Does either of these traces appear related to the issue your
> driver patch will fix?

I don't believe so - especially since I do not have a working patch.  I
have come to the conclusion that the driver works as is and we are just
going to have to deal with the memory fragmentation.

> 	2. If I set path MTU to 1500, will that avoid the issue?

I believe it would, see answer to question 3.

> 	3. Would you have any further suggestions?

The road I believe that we are going to take is move to a 4000 byte MTU.
The 405EX MAL has a 4080 byte limit anyway, so keeping the MTU to 4000
bytes guarantees that a whole packet will fit into a single page in
memory, so if you are still getting memory errors or problems allocating
a new SKB, then you have much bigger issues because either your memory
is having problems or you are just plain out of memory completely.
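
To make the page arithmetic above concrete, here is a small sketch of how a
buffer size maps to a power-of-two run of pages. It assumes 4096-byte pages
and ignores any per-buffer overhead (headers, alignment, bookkeeping), so it
is a simplification; the sketch_* names are invented for the example.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, matching the "4 pages = 16384 bytes" figure. */
#define SKETCH_PAGE_SIZE 4096u

/* Smallest order n such that (SKETCH_PAGE_SIZE << n) >= bytes. */
unsigned int sketch_alloc_order(size_t bytes)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Size of the contiguous block that has to be found for that order. */
size_t sketch_alloc_bytes(size_t bytes)
{
    return (size_t)SKETCH_PAGE_SIZE << sketch_alloc_order(bytes);
}
```

With these definitions a 9000 byte buffer lands on order 2 (four
contiguous pages, 16384 bytes), while a 4000 byte buffer stays at order 0
(one page), which is why the smaller MTU is so much less sensitive to
fragmentation.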

The reason we are going that route is because the Linux network stack
recycles and frees an SKB that is passed up to it from the driver.  So,
when I allocated 256 4-page buffers and used those to replace the rx_skb
that contained the data, the stack would free that buffer for me (it is
so helpful :\) and when I would try to reuse it later, the kernel would
panic because that was not a valid SKB.
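
The ownership rule behind that panic can be pictured with a tiny user-space
mock of reference counting. This is only an illustration: struct mock_buf
is not an sk_buff, and the mock_buf_* helpers are invented here. In the
kernel, skb_get() and kfree_skb() play the roles of the get/put pair.
Whether holding an extra reference would be enough for the buffer-recycling
scheme described above is not something this sketch can answer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Mock buffer with a user count, standing in for a refcounted packet. */
struct mock_buf {
    int users;            /* number of holders of this buffer */
    size_t len;           /* payload size in bytes */
    unsigned char *data;  /* payload storage */
};

/* Allocate a buffer owned by exactly one holder (the "driver"). */
struct mock_buf *mock_buf_alloc(size_t len)
{
    struct mock_buf *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->data = malloc(len);
    if (!b->data) {
        free(b);
        return NULL;
    }
    b->users = 1;
    b->len = len;
    return b;
}

/* Take an additional reference before handing the buffer to someone else. */
struct mock_buf *mock_buf_get(struct mock_buf *b)
{
    b->users++;
    return b;
}

/* Drop one reference; returns 1 if this call released the memory. */
int mock_buf_put(struct mock_buf *b)
{
    if (--b->users > 0)
        return 0;
    free(b->data);
    free(b);
    return 1;
}
```

If the "driver" hands the buffer up without taking its own reference first,
the consumer's put releases the memory and any later reuse touches freed
storage, which matches the behaviour described above.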

So, moral of the story is keep your MTU at 4000 or lower.  This hammers
your throughput, but it seems to be the best we can do given the way the
stack works.

If anyone has any other solutions, that would be GREAT!  I would love to
be able to use a 9000 byte MTU without getting out of memory errors
simply due to fragmentation.

HTH,

Jonathan


>
> -----Original Message-----
> From: linuxppc-dev-bounces+john.p.price=l-3com.com@lists.ozlabs.org
> [mailto:linuxppc-dev-bounces+john.p.price=l-3com.com@lists.ozlabs.org]
> On Behalf Of Jonathan Haws
> Sent: Monday, October 26, 2009 2:43 PM
> To: linuxppc-dev@lists.ozlabs.org
> Subject: Network Stack SKB Reallocation
>
> Quick question about the network stack in general:
>
> Does the stack itself release an SKB allocated by the device driver back
> to the heap upstream, or does it require that the device driver handle
> that?
>
> Thanks!
>
> Jonathan
>
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* [PATCH] Xilinx LL-TEMAC: Add Netpoll controller support
From: Santosh Shukla @ 2009-10-27 14:30 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: john.linn

From: Santosh Shukla <sshukla@in.mvista.com>
Date: Tue, 13 Oct 2009 18:55:57 +0530
Subject: [PATCH] Xilinx LL-TEMAC: Add Netpoll controller support

Add Netpoll controller support to the Xilinx LL-TEMAC ethernet driver.
Replace the Rx and Tx tasklet schedule calls with direct calls to their
handlers, and free skbs with dev_kfree_skb_any(), which can be called
with interrupts on or off.
---
 drivers/net/xilinx_lltemac/xlltemac_main.c |   41 +++++++++++++++++++++++++++-
 1 files changed, 40 insertions(+), 1 deletions(-)

diff --git a/drivers/net/xilinx_lltemac/xlltemac_main.c b/drivers/net/xilinx_lltemac/xlltemac_main.c
index b245422..697f474 100644
--- a/drivers/net/xilinx_lltemac/xlltemac_main.c
+++ b/drivers/net/xilinx_lltemac/xlltemac_main.c
@@ -87,11 +87,18 @@
 #define BUFFER_ALIGNSEND_PERF(adr) ((ALIGNMENT_SEND_PERF - ((u32) adr)) % 32)
 #define BUFFER_ALIGNRECV(adr) ((ALIGNMENT_RECV - ((u32) adr)) % 32)

+#ifndef CONFIG_NET_POLL_CONTROLLER
 /* Default TX/RX Threshold and waitbound values for SGDMA mode */
 #define DFT_TX_THRESHOLD  24
 #define DFT_TX_WAITBOUND  254
 #define DFT_RX_THRESHOLD  4
 #define DFT_RX_WAITBOUND  254
+#else
+#define DFT_TX_THRESHOLD  1
+#define DFT_TX_WAITBOUND  0
+#define DFT_RX_THRESHOLD  1
+#define DFT_RX_WAITBOUND  0
+#endif

 #define XTE_AUTOSTRIPPING 1

@@ -1097,7 +1104,11 @@ static irqreturn_t xenet_dma_rx_interrupt(int irq, void *dev_id)
 			list_add_tail(&lp->rcv, &receivedQueue);
 			XLlDma_mBdRingIntDisable(&lp->Dma.RxBdRing,
 						 XLLDMA_CR_IRQ_ALL_EN_MASK);
+#ifndef CONFIG_NET_POLL_CONTROLLER
 			tasklet_schedule(&DmaRecvBH);
+#else
+			DmaRecvHandlerBH(0);
+#endif
 		}
 		spin_unlock_irqrestore(&receivedQueueSpin, flags);
 	}
@@ -1134,7 +1145,11 @@ static irqreturn_t xenet_dma_tx_interrupt(int irq, void *dev_id)
 			list_add_tail(&lp->xmit, &sentQueue);
 			XLlDma_mBdRingIntDisable(&lp->Dma.TxBdRing,
 						 XLLDMA_CR_IRQ_ALL_EN_MASK);
+#ifndef CONFIG_NET_POLL_CONTROLLER
 			tasklet_schedule(&DmaSendBH);
+#else
+			DmaSendHandlerBH(0);
+#endif
 		}
 		spin_unlock_irqrestore(&sentQueueSpin, flags);
 	}
@@ -1711,11 +1726,15 @@ static int xenet_DmaSend(struct sk_buff *skb, struct net_device *dev)
 	 * SgAlloc, SgCommit sequence, which also exists in DmaSendHandlerBH Bottom
 	 * Half, or triggered by other processor in SMP case.
 	 */
+#ifndef CONFIG_NET_POLL_CONTROLLER
 	spin_lock_bh(&XTE_tx_spinlock);
+#endif

 	xenet_DmaSend_internal(skb, dev);

+#ifndef CONFIG_NET_POLL_CONTROLLER
 	spin_unlock_bh(&XTE_tx_spinlock);
+#endif

 	return 0;
 }
@@ -1764,7 +1783,7 @@ static void DmaSendHandlerBH(unsigned long p)
 				skb = (struct sk_buff *)
 					XLlDma_mBdGetId(BdCurPtr);
 				if (skb)
-					dev_kfree_skb(skb);
+					dev_kfree_skb_any(skb);

 				/* reset BD id */
 				XLlDma_mBdSetId(BdCurPtr, NULL);
@@ -3220,6 +3239,23 @@ static int detect_phy(struct net_local *lp, char *dev_name)
 	return 0;		/* default to zero */
 }

+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void
+lltemac_poll_controller(struct net_device *ndev)
+{
+       struct net_local *lp = netdev_priv(ndev);
+
+       disable_irq(lp->dma_irq_s);
+       disable_irq(lp->dma_irq_r);
+
+       xenet_dma_tx_interrupt(lp->dma_irq_s, ndev);
+       xenet_dma_rx_interrupt(lp->dma_irq_r, ndev);
+
+       enable_irq(lp->dma_irq_s);
+       enable_irq(lp->dma_irq_r);
+}
+#endif
+
 static struct net_device_ops xilinx_netdev_ops;

 /* From include/linux/ethtool.h */
@@ -3491,6 +3527,9 @@ static struct net_device_ops xilinx_netdev_ops = {
 	.ndo_change_mtu	= xenet_change_mtu,
 	.ndo_tx_timeout	= xenet_tx_timeout,
 	.ndo_get_stats	= xenet_get_stats,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller = lltemac_poll_controller,
+#endif
 };

 static struct of_device_id xtenet_fifo_of_match[] = {
-- 
1.6.3.3.220.g609a0

^ permalink raw reply related

* Low BogoMips
From: Luigi 'Comio' Mantellini @ 2009-10-27 15:27 UTC (permalink / raw)
  To: linuxppc-dev

Hi All,

my name is Luigi. I'm working on a stripped-down mpc8541 board (that has just
a serial port and two enets).
I have an issue with delay calibration. In fact, at boot time I see the
following on the serial console:

....
Calibrating delay loop... 83.20 BogoMIPS (lpj=166400)
....

This value seems to be too low and I suspect an error propagated from u-boot.

Can anyone help me to understand a good way to investigate this problem? which 
registers or configuration values I need to check?

I will post the necessary snips of code or configuration if required.

Thanks a lot for your help,

ciao

luigi

-- 
Luigi 'Comio' Mantellini

R&D - Software
Industrie Dial Face S.p.A.
Via Canzo, 4
20068 Peschiera Borromeo (MI), Italy
Tel.:  +39 02 5167 2813
Fax:   +39 02 5167 2459
Email: luigi.mantellini@idf-hit.com

^ permalink raw reply

* RE: Network Stack SKB Reallocation
From: john.p.price @ 2009-10-27 15:33 UTC (permalink / raw)
  To: Jonathan Haws, linuxppc-dev
In-Reply-To: <BB99A6BA28709744BF22A68E6D7EB51F0330D36118@midas.usurf.usu.edu>

Hmmm, so if the issue I see is related to what you see, then setting the
MTU to 4KB may clear it; otherwise I have either a potential race
condition freeing skb's or ultimately the protocol stack is not freeing
the correct buffer...

Thanks Jonathan.

-----Original Message-----
From: Jonathan Haws [mailto:Jonathan.Haws@sdl.usu.edu]
Sent: Tuesday, October 27, 2009 10:28 AM
To: Price, John @ SDS; linuxppc-dev@lists.ozlabs.org
Subject: RE: Network Stack SKB Reallocation

Hi John,

> I have a custom board with custom fpga's connected to the PPC405EX EBC
> bus on banks 2 and 3.  Running linux 2.6.29.1.  The board collects data
> and dma's it to a scatter-gather dma buffer and then uses TCP/writev +
> Ethernet 9KB Jumbo packets to transmit data off of the board.

We are also doing something similar, however we do not transmit the data
off the board - we are storing it to disk.  What we are seeing is that
memory gets so fragmented during normal operation that the EMAC driver
cannot find a contiguous block of memory large enough for the MTU (a
9000 byte MTU requires 4 pages of memory, or 16384 bytes).
>
> Our systems have 7 of these data collection boards, we are seeing the
> following stack trace, the boards do not crash apparently they just
> continue to run.
>
> ~ # BUG: Bad page state in process dcb  pfn:080db
> page:c03d2b60 flags:00044000 count:0 mapcount:0 mapping:(null)
> index:3718
> Call Trace:
> [ce871980] [c0006bc0] show_stack+0x44/0x16c (unreliable)
> [ce8719c0] [c005374c] bad_page+0x94/0x12c
> [ce8719e0] [c0053c30] __free_pages_ok+0x364/0x3ec
> [ce871a20] [c0057c00] put_compound_page+0x48/0x60
> [ce871a30] [c0075520] kfree+0xd4/0xd8
> [ce871a40] [c0175140] skb_release_data+0x80/0xc8
> [ce871a50] [c0174f30] __kfree_skb+0x18/0xe8
> [ce871a60] [c01ab9e4] tcp_ack+0x48c/0x1a84
> [ce871af0] [c01add8c] tcp_rcv_state_process+0x70/0x9ac
> [ce871b10] [c01b47fc] tcp_v4_do_rcv+0x9c/0x1a8
> [ce871b40] [c01b6328] tcp_v4_rcv+0x4d4/0x5b8
> [ce871b70] [c0198b90] ip_local_deliver+0x90/0x140
> [ce871b90] [c0198f24] ip_rcv+0x2e4/0x4bc
>
>
> The above occurs on at least one of the seven boards over the course of
> a multi-day run.

This is very similar output that I would get when memory got fragmented,
however my BUG showed its face when I tried to allocate, not to free, so
the issue might be somewhere else.

> Another trace from an actual crash, occurs not so often;
>
> DCB: tcp connection request accepted - line length: 18168
> Unable to handle kernel paging request for data at address 0x0004009c
> Faulting instruction address: 0xc017500c
> Oops: Kernel access of bad area, sig: 11 [#1]
> DCB
> Modules linked in: ds3b3 dma ds3b2
> NIP: c017500c LR: c01351f8 CTR: c013513c
> REGS: cd779aa0 TRAP: 0300   Not tainted  (2.6.29.1)
> MSR: 00029030 <EE,ME,CE,IR,DR>  CR: 42424024  XER: 2000005f
> DEAR: 0004009c, ESR: 00000000
> TASK = ce8883f0[770] 'dcb' THREAD: cd778000
> GPR00: 00000060 cd779b50 ce8883f0 00040000 00000020 c001220c 00000001 00000014
> GPR08: 00000002 0004009c 00000003 000000c0 22424022 10183238 000022f4 00000001
> GPR16: 00000020 000022f4 000237c0 00000000 cd6590e4 13511000 00000008 bfe9d520
> GPR24: ce8e2c34 ce8e2c2c ce811ce0 00000001 00000018 ce811360 00000300 ce8113c0
> NIP [c017500c] kfree_skb+0xc/0x38
> LR [c01351f8] emac_poll_tx+0xbc/0x310
> Call Trace:
> [cd779b50] [c001220c] __mtdcr_table+0x0/0x3ff8 (unreliable)
> [cd779b70] [c0132248] mal_poll+0x44/0x1c8
> [cd779ba0] [c017fb10] net_rx_action+0x94/0x188
> [cd779bd0] [c0024740] __do_softirq+0x84/0x124
> [cd779c00] [c0004f10] do_softirq+0x58/0x5c
> [cd779c10] [c00245b0] irq_exit+0x48/0x58
> [cd779c20] [c0004fb4] do_IRQ+0xa0/0xc4
> [cd779c40] [c000eba0] ret_from_except+0x0/0x18
> [cd779d00] [c01a4ec0] tcp_sendmsg+0x220/0xbf0
> [cd779d80] [c016dd18] sock_aio_write+0xf0/0x104
> [cd779de0] [c007a5b0] do_sync_readv_writev+0xbc/0x130
> [cd779e90] [c007ae54] do_readv_writev+0xb4/0x1c4
> [cd779f10] [c007b010] sys_writev+0x4c/0x90
> [cd779f40] [c000e558] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 3d20c02b 80695ac4 7fe4fb78 4bf00fb9 80010014 83e1000c 7c0803a6 38210010
> 4e800020 2c030000 4d820020 3923009c <8003009c> 2f800001 409e0008 4bffff00
> Kernel panic - not syncing: Fatal exception in interrupt
> Rebooting in 1 seconds..
>
>
> So the questions I have for you are as follows;
>
> 	1. Does either of these traces appear related to the issue your
> driver patch will fix?

I don't believe so - especially since I do not have a working patch.  I
have come to the conclusion that the driver works as is and we are just
going to have to deal with the memory fragmentation.

> 	2. If I set path MTU to 1500, will that avoid the issue?

I believe it would, see answer to question 3.

> 	3. Would you have any further suggestions?

The road I believe that we are going to take is move to a 4000 byte MTU.
The 405EX MAL has a 4080 byte limit anyway, so keeping the MTU to 4000
bytes guarantees that a whole packet will fit into a single page in
memory, so if you are still getting memory errors or problems allocating
a new SKB, then you have much bigger issues because either your memory
is having problems or you are just plain out of memory completely.

The reason we are going that route is because the Linux network stack
recycles and frees an SKB that is passed up to it from the driver.  So,
when I allocated 256 4-page buffers and used those to replace the rx_skb
that contained the data, the stack would free that buffer for me (it is
so helpful :\) and when I would try to reuse it later, the kernel would
panic because that was not a valid SKB.

So, moral of the story is keep your MTU at 4000 or lower.  This hammers
your throughput, but it seems to be the best we can do given the way the
stack works.

If anyone has any other solutions, that would be GREAT!  I would love to
be able to use a 9000 byte MTU without getting out of memory errors
simply due to fragmentation.

HTH,

Jonathan


>
> -----Original Message-----
> From: linuxppc-dev-bounces+john.p.price=l-3com.com@lists.ozlabs.org
> [mailto:linuxppc-dev-bounces+john.p.price=l-3com.com@lists.ozlabs.org]
> On Behalf Of Jonathan Haws
> Sent: Monday, October 26, 2009 2:43 PM
> To: linuxppc-dev@lists.ozlabs.org
> Subject: Network Stack SKB Reallocation
>
> Quick question about the network stack in general:
>
> Does the stack itself release an SKB allocated by the device driver back
> to the heap upstream, or does it require that the device driver handle
> that?
>
> Thanks!
>
> Jonathan
>
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re:Micrel PHY KSZ8001 on MPC5200B FEC
From: suvidh kankariya @ 2009-10-27 16:47 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <mailman.1186.1256651657.12812.linuxppc-dev@lists.ozlabs.org>


Roman,
A driver for Micrel PHYs exists in drivers/net/phy/micrel.c in 2.6.30.
You just need to add the structure with your PHY information.

If you are using an older kernel you may want to copy it.



Regards

Suvidh







>------------------------------
>
>Message: 6
>Date: Tue, 27 Oct 2009 13:17:27 +0100
>From: Roman Fietze <roman.fietze@telemotive.de>
>To: linuxppc-dev@lists.ozlabs.org
>Subject: Micrel PHY KSZ8001 on MPC5200B FEC
>Message-ID: <200910271317.28007.roman.fietze@telemotive.de>
>Content-Type: text/plain;  charset="iso-8859-1"
>
>Hello,
>
>We would need some help on how to make a Micrel KSZ8001 work on the
>MPC5200B FEC using the kernel DENX-v2.6.3[01].
>
>We can already boot the kernel and device tree using TFTP and this PHY
>using a recent U-Boot version, so we would need some pointers on how to
>accomplish that.
>
>Add a proper PHY driver in the drivers/net/phy/ directory?
>
>Modify the DTS? If yes, how? A link to some documentation that's not
>already in the kernel sources would already help.
>
>Is it correct, when looking at the sources, that the MPC's FEC driver
>switched to the generic PHY driver interface?
>
>
>Roman
>
>--
>Roman Fietze                Telemotive AG Büro Mühlhausen
>Breitwiesen                              73347 Mühlhausen
>Tel.: +49(0)7335/18493-45        http://www.telemotive.de
>
>
>------------------------------

^ permalink raw reply

* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Scott Wood @ 2009-10-27 15:58 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Rex Feany, linuxppc-dev@ozlabs.org
In-Reply-To: <OF2108C9F5.3115433C-ONC125765C.003129B4-C125765C.0032EE38@transmode.se>

On Tue, Oct 27, 2009 at 10:16:17AM +0100, Joakim Tjernlund wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 27/10/2009 01:00:53:
> >
> > On Mon, 2009-10-26 at 16:26 -0700, Dan Malek wrote:
> > > Just be careful the get_user() doesn't regenerate the same
> > > translation error you are trying to fix by being here......
> 
> yes, I had some problems with this initially but managed to work around
> that. I noticed another problem though, I got multiple TLB errors for the
> same address when I did it in C. Noticed by just printk:ing every hit for
> a dcbX insn in do_page_fault. I can't explain it, but it seems like when
> moving to C you have to execute a rfi insn and that might somehow restart
> the dcbX insn before moving on to the page fault routine(or something
> totally different)

The rfi should be to other kernel code -- there is no way that it should be
restarting the dcbX (other than when trying to turn a TLB miss into a TLB
error).  Can you post the C version, maybe we can see what's going wrong? 
Is the empty TLB entry from the miss getting invalidated in the dcbX fixup
case?

> > It shouldn't since it will always come up with a proper DAR but
> > you may want to double check before hand that your instruction
> > address you are loading from is -not- your marker value for bad DAR.
> 
> hmm, I check that the insn really is a dcbX insn, but not that the address is
> != 0x00f0. Don't see how it could be as if something is wrong with
> the insn address you get ITLB error instead of a DTLB error.

I'm guessing he meant the data address you're loading.

> Anyhow, things seem stalled as I haven't heard from Scott or Rex for a
> while. If this isn't working now, I really don't know what is wrong and
> need some debugging help.

I'll test the latest version, but I have some scheduling latency. :-)

-Scott

^ permalink raw reply

* Re: [PATCH 0/8]  Fix 8xx MMU/TLB
From: Joakim Tjernlund @ 2009-10-27 16:38 UTC (permalink / raw)
  To: Scott Wood; +Cc: Rex Feany, linuxppc-dev@ozlabs.org
In-Reply-To: <20091027155841.GA25916@loki.buserror.net>

Scott Wood <scottwood@freescale.com> wrote on 27/10/2009 16:58:41:
>
> On Tue, Oct 27, 2009 at 10:16:17AM +0100, Joakim Tjernlund wrote:
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 27/10/2009 01:00:53:
> > >
> > > On Mon, 2009-10-26 at 16:26 -0700, Dan Malek wrote:
> > > > Just be careful the get_user() doesn't regenerate the same
> > > > translation error you are trying to fix by being here......
> >
> > yes, I had some problems with this initially but managed to work around
> > that. I noticed another problem though, I got multiple TLB errors for the
> > same address when I did it in C. Noticed by just printk:ing every hit for
> > a dcbX insn in do_page_fault. I can't explain it, but it seems like when
> > moving to C you have to execute a rfi insn and that might somehow restart
> > the dcbX insn before moving on to the page fault routine(or something
> > totally different)
>
> The rfi should be to other kernel code -- there is no way that it should be
> restarting the dcbX (other than when trying to turn a TLB miss into a TLB
> error).  Can you post the C version, maybe we can see what's going wrong?

I don't have it for 2.6 and I never did clean up my 2.4 version.
Your best bet is to look at one of the earlier patches such
as:
  Add some debug code to do_page_fault
and fix the remaining bits.

> Is the empty TLB entry from the miss getting invalidated in the dcbX fixup
> case?

Yes, in all cases it was invalidated.

>
> > > It shouldn't since it will always come up with a proper DAR but
> > > you may want to double check before hand that your instruction
> > > address you are loading from is -not- your marker value for bad DAR.
> >
> > hmm, I check that the insn really is a dcbX insn, but not that the address is
> > != 0x00f0. Don't see how it could be as if something is wrong with
> > the insn address you get ITLB error instead of a DTLB error.
>
> I'm guessing he meant the data address you're loading.

Hopefully. I am already checking the opcode to make sure it is
a dcbX insn.
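
The opcode check mentioned above can be sketched in plain C. This is only
an illustration, not the code from the patch series: the helper name
is_dcbx_insn() is hypothetical, and which cache instructions the real
fixup tests for is not shown in this thread, so the list of extended
opcodes below is an assumption. The encodings themselves (primary
opcode 31 plus an X-form extended opcode) are the standard PowerPC ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Primary opcode: the top 6 bits of the instruction word. */
#define PPC_PRIMARY_OP(insn)	((insn) >> 26)
/* X-form extended opcode: 10 bits just above the Rc bit. */
#define PPC_EXTENDED_OP(insn)	(((insn) >> 1) & 0x3ffu)

/*
 * Hypothetical helper: return true if insn is one of the cache-block
 * instructions listed below. The selection is an assumption; adjust it
 * to whatever the fixup actually has to handle.
 */
static bool is_dcbx_insn(uint32_t insn)
{
	if (PPC_PRIMARY_OP(insn) != 31)
		return false;

	switch (PPC_EXTENDED_OP(insn)) {
	case 54:	/* dcbst */
	case 86:	/* dcbf */
	case 246:	/* dcbtst */
	case 278:	/* dcbt */
	case 470:	/* dcbi */
	case 982:	/* icbi */
	case 1014:	/* dcbz */
		return true;
	default:
		return false;
	}
}
```

For example, 0x7C001FEC (dcbz 0,r3) matches, while 0x7C642A14
(add r3,r4,r5) shares primary opcode 31 but is rejected by the extended
opcode test.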

>
> > Anyhow, things seems stalled as I haven't heard from Scott or Rex for a
> > while. If this isn't working now, I really don't know what is wrong and
> > need some debugging help.
>
> I'll test the latest version, but I have some scheduling latency. :-)

Get yourself a new scheduler :)

   Jocke

^ permalink raw reply

* Re: Low BogoMips
From: Jeff Angielski @ 2009-10-27 17:03 UTC (permalink / raw)
  To: luigi.mantellini.ml; +Cc: linuxppc-dev
In-Reply-To: <200910271627.06549.luigi.mantellini.ml@gmail.com>

Luigi 'Comio' Mantellini wrote:
> Hi All,
> 
> my name is Luigi. I'm working on a stripped-down mpc8541 board (that has just 
> a serial and two enets).
> I have an issue with delay calibration. At boot time I see the 
> following on the serial console:
> 
> ....
> Calibrating delay loop... 83.20 BogoMIPS (lpj=166400)
> ....
> 
> This value seems to be too low and I suspect an error propagated from u-boot.
> 
> Can anyone help me to understand a good way to investigate this problem? which 
> registers or configuration values I need to check?
> 
> I will post the necessary snips of code or configuration if required.
> 
> Thanks a lot for your help,

On the PowerPC, the BogoMIPS is not an estimate of the performance of 
the CPU, but rather related to the internal timebase register frequency.

"With recent kernels, when build with ARCH=powerpc, we now use the
hardware timebase instead of bogus processor loops for short timings.
Thus our bogomips value is no longer the speed at which the processor
runs empty loops, but the actual processor timebase value as obtained
after calibration at boot. " - Benjamin Herrenschmidt

Google(powerpc, bogomips, timebase)=happiness
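
To make the arithmetic concrete, here is a small sketch that converts
the lpj value from the boot log back into the printed BogoMIPS figure
and into an approximate timebase frequency. The function names are
hypothetical. The sketch assumes HZ=250, which is not stated in the
original report but is the value that reproduces "83.20" from
lpj=166400, and it assumes that lpj is roughly the number of timebase
ticks per jiffy, as the quote above implies.

```c
#include <stdint.h>

/* The two numbers printed as "<whole>.<frac> BogoMIPS". */
struct bogomips {
	unsigned long whole;
	unsigned long frac;
};

/* BogoMIPS = lpj * HZ / 500000, printed with two decimal places. */
static struct bogomips lpj_to_bogomips(unsigned long lpj, unsigned long hz)
{
	struct bogomips b;

	b.whole = lpj / (500000 / hz);
	b.frac = (lpj / (5000 / hz)) % 100;
	return b;
}

/* Assumption: lpj approximates timebase ticks per jiffy. */
static uint64_t lpj_to_timebase_hz(unsigned long lpj, unsigned long hz)
{
	return (uint64_t)lpj * hz;
}
```

With lpj=166400 and HZ=250 this gives 83.20 BogoMIPS and a timebase of
about 41.6 MHz. If the timebase on this part is clocked at one eighth
of the CCB clock, as is usual on MPC85xx, that corresponds to a CCB of
roughly 333 MHz, so the reported value may well be correct rather than
too low.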


-- 
Jeff Angielski
The PTR Group
www.theptrgroup.com

^ permalink raw reply

* help with unhandled IRQ error with mpt2sas driver and powerpc 460EX
From: Ayman El-Khashab @ 2009-10-27 17:27 UTC (permalink / raw)
  To: linuxppc-dev

Hello, I am using the mpt2sas driver on the canyonlands / 460EX board.  
I've already found and fixed one problem, but can't get past the 
unhandled IRQ.

The first problem I noticed is that the physical address is read into a 
32-bit variable, but the 460EX has a 36-bit bus, so the ioremap would
always fail.  I've changed the definition of chip_phys in mpt2sas_base.h
to u64, and that cleared up that issue.   As soon as the unmask_interrupts
method is called (or not long after), I get an interrupt -- presumably 
from the sas controller.  If I comment out the unmask, the interrupt 
never occurs.  If I unmask them, I get the interrupt.  I've traced the 
code through the interrupt handler all the way to ~ line 757.

 rpf = &ioc->reply_post_free[ioc->reply_post_host_index];

I've verified that at the end of this, IRQ_NONE is returned.  At this 
point the kernel prints the following -- the last statements lead me to 
think that the sas controller expected something but never got it.  I am 
unsure how to proceed at this point.  I am using a DENX kernel head
pulled from git today since there were some changes to this driver for
endian issues.
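
The 32-bit truncation described in the first problem can be shown with a
few lines of user-space C. This is only an illustration: the helper name
is hypothetical, and the address 0xD80000000 used in the note below is
made up to stand for any physical address above 4 GB. It is not taken
from the board in question.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a physical address does not survive
 * being stored in a 32-bit variable.
 */
static bool phys_truncated_by_u32(uint64_t phys)
{
	uint32_t narrow = (uint32_t)phys;	/* what a 32-bit chip_phys would hold */

	return (uint64_t)narrow != phys;
}
```

A 36-bit address such as 0xD80000000 becomes 0x80000000 when narrowed,
so an ioremap() of the narrowed value would target the wrong region.
Inside the kernel, resource_size_t is the type that follows the
platform's physical address width, which may be a less intrusive choice
than a fixed u64.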

irq 18: nobody cared (try booting with the "irqpoll" option)
Call Trace:
[c0367df0] [c0005eac] show_stack+0x44/0x16c (unreliable)
[c0367e30] [c004eedc] __report_bad_irq+0x34/0xb8
[c0367e50] [c004f118] note_interrupt+0x1b8/0x224
[c0367e80] [c004ff50] handle_level_irq+0xa0/0x11c
[c0367e90] [c0018ba4] uic_irq_cascade+0xf8/0x12c
[c0367eb0] [c00041d0] do_IRQ+0x98/0xb4
[c0367ed0] [c000df40] ret_from_except+0x0/0x18
[c0367f90] [c0006ed8] cpu_idle+0x50/0xd8
[c0367fb0] [c000197c] rest_init+0x5c/0x70
[c0367fc0] [c0320848] start_kernel+0x224/0x2a0
[c0367ff0] [c0000200] skpinv+0x190/0x1cc
handlers:
[<c01aba98>] (_base_interrupt+0x0/0x8f8)
Disabling IRQ #18
mpt2sas0: _base_event_notification: timeout
mf:
        07000000 00000000 00000000 00000000 00000000 0f2f3fff fffffffc ffffffff
        ffffffff 00000000 00000000
mpt2sas0: sending diag reset !!
mpt2sas0: diag reset: SUCCESS
mpt2sas0: failure at drivers/scsi/mpt2sas/mpt2sas_scsih.c:5989/_scsih_probe()!

Thanks
Ayman

^ permalink raw reply

* Re: Please pull mpc5200 and OF changes
From: Grant Likely @ 2009-10-27 17:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <fa686aa40910150946s70ad4013y15cbf9e29c5e50f2@mail.gmail.com>

Hey Ben,

I don't see these pulled into your tree yet.  Can you please pull this
tree into your merge branch?

Thanks,
g.

On Thu, Oct 15, 2009 at 10:46 AM, Grant Likely
<grant.likely@secretlab.ca> wrote:
> Hi Ben.
>
> Here are some OF and MPC5200 changes needed for 2.6.32.  Mostly
> defconfig updates and a couple of new board dts files.
>
> Cheers,
> g.
>
> The following changes since commit 161291396e76e0832c08f617eb9bd364d1648148:
>  Linus Torvalds (1):
>        Linux 2.6.32-rc4
>
> are available in the git repository at:
>
>  git://git.secretlab.ca/git/linux-2.6 merge
>
> Grant Likely (1):
>      powerpc/5200: Update defconfigs
>
> Heiko Schocher (2):
>      mpc5200: support for the MAN mpc5200 based board uc101
>      mpc5200: support for the MAN mpc5200 based board mucmc52
>
> Julia Lawall (1):
>      drivers/serial/mpc52xx_uart.c: Use UPIO_MEM rather than SERIAL_IO_MEM
>
> Jérôme Pouiller (1):
>      of: Remove nested function
>
> Wolfram Sang (1):
>      powerpc/boot/dts: drop obsolete 'fsl5200-clocking'
>
>  arch/powerpc/boot/dts/cm5200.dts              |    1 -
>  arch/powerpc/boot/dts/digsy_mtc.dts           |    1 -
>  arch/powerpc/boot/dts/lite5200.dts            |    2 -
>  arch/powerpc/boot/dts/lite5200b.dts           |    2 -
>  arch/powerpc/boot/dts/media5200.dts           |    2 -
>  arch/powerpc/boot/dts/motionpro.dts           |    1 -
>  arch/powerpc/boot/dts/mpc5121ads.dts          |    3 -
>  arch/powerpc/boot/dts/mucmc52.dts             |  332 +++++++++++++++++++++++++
>  arch/powerpc/boot/dts/pcm030.dts              |    2 -
>  arch/powerpc/boot/dts/pcm032.dts              |    2 -
>  arch/powerpc/boot/dts/tqm5200.dts             |    1 -
>  arch/powerpc/boot/dts/uc101.dts               |  284 +++++++++++++++++++++
>  arch/powerpc/configs/52xx/cm5200_defconfig    |  136 ++++++----
>  arch/powerpc/configs/52xx/lite5200b_defconfig |  153 ++++++++----
>  arch/powerpc/configs/52xx/motionpro_defconfig |  146 +++++++----
>  arch/powerpc/configs/52xx/pcm030_defconfig    |  142 ++++++-----
>  arch/powerpc/configs/52xx/tqm5200_defconfig   |  148 +++++++----
>  arch/powerpc/configs/mpc5200_defconfig        |  192 ++++++++++-----
>  arch/powerpc/platforms/52xx/mpc5200_simple.c  |    2 +
>  drivers/of/of_mdio.c                          |   13 +-
>  drivers/serial/mpc52xx_uart.c                 |    2 +-
>  21 files changed, 1196 insertions(+), 371 deletions(-)
>  create mode 100644 arch/powerpc/boot/dts/mucmc52.dts
>  create mode 100644 arch/powerpc/boot/dts/uc101.dts
>
>
> --
> Grant Likely, B.Sc., P.Eng.
> Secret Lab Technologies Ltd.
>



-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.

^ permalink raw reply

* Re: Micrel PHY KSZ8001 on MPC5200B FEC
From: Roman Fietze @ 2009-10-27 19:54 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <200910271549.n9RFn3ep028623@nti1.com>

Hello Suvidh,

On Tuesday 27 October 2009 17:47:51 suvidh kankariya wrote:

> A driver for micrel phy exists in /drivers/net/phy/micrel.c. in
> 2.6.30.

Am I somewhat blind, or do you have access to other 2.6.30's than I
have?

I searched git.kernel.org, git.denx.de and git.secretlab.ca, but could
not find that file, neither in the current master, nor in older tags
somewhat related to "2.6.30", nor in any local clone in any version of
the 2.6 since "He" created the repos.


> If you are using older kernel you may want to copy it.

2.6.30 and 2.6.31 from DENX or kernel.org.


Roman

-- 
Roman Fietze                Telemotive AG Büro Mühlhausen
Breitwiesen                              73347 Mühlhausen
Tel.: +49(0)7335/18493-45        http://www.telemotive.de

^ permalink raw reply

* Accessing flash directly from User Space
From: Jonathan Haws @ 2009-10-27 19:59 UTC (permalink / raw)
  To: linuxppc-dev@lists.ozlabs.org

I know this is probably a really dumb question, but a wise man once said
that the only stupid question is the one that is not asked.

So, I have written a flash driver in VxWorks that simply addresses the
flash directly and handles all the hardware accesses just fine.  I am
porting that to Linux and need it to run in user space (mainly to
simplify the interface with the user - I want to keep it the same as in
VxWorks).  Here is a snippet of what my question is:

static uint8_t bflashEraseSector(int sa, int verbose)
{
	uint16_t * flash = (uint16_t *) NOR_FLASH_BASE_ADRS;
	uint32_t offset;

	...

	/* We divide by 2 here to adjust for the 16-bit offset into the address */
	offset = sa * NOR_FLASH_SECTOR_SIZE / 2;
	flash[BFLASH_SECTOR_ERASE_ADDR1] = BFLASH_SECTOR_ERASE_BYTE1;

	...

}

I am trying to get a pointer to NOR_FLASH_BASE_ADRS which is defined to be
0xFC000000.  I then dereference that directly to write to the flash.

How can I get that pointer?  Unfortunately I cannot simply use the address
of the flash.  Is there some magical function call that gives me access to
that portion of the memory space?

Thanks for the help!

Jonathan

PS - I know that I could simply use the MTD driver provided by the kernel,
but I need to be able to keep the interface the same so we can use
previously written code.
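
One common way to get such a pointer from user space is to mmap() the
physical range through /dev/mem. The following is a minimal sketch under
stated assumptions, not a tested flash driver: map_phys() is a
hypothetical helper name, the process needs sufficient privileges,
whether the kernel allows mapping this range depends on its
configuration, and the use of O_SYNC to ask for an uncached mapping
should be verified on the target.

```c
#define _FILE_OFFSET_BITS 64	/* 0xFC000000 does not fit a 32-bit off_t */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes starting at byte offset phys of
 * dev_path ("/dev/mem" for physical memory). Returns a pointer to the
 * requested offset, or NULL on failure. mmap() needs a page-aligned
 * offset, so the mapping starts at the enclosing page boundary.
 */
static volatile void *map_phys(const char *dev_path, off_t phys, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	off_t base = phys & ~((off_t)page - 1);
	size_t delta = (size_t)(phys - base);
	int fd = open(dev_path, O_RDWR | O_SYNC);
	void *map;

	if (fd < 0)
		return NULL;

	map = mmap(NULL, len + delta, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, base);
	close(fd);	/* the mapping stays valid after the close */
	if (map == MAP_FAILED)
		return NULL;

	return (volatile uint8_t *)map + delta;
}
```

The flash pointer from the snippet could then be obtained with something
like

  volatile uint16_t *flash = map_phys("/dev/mem", NOR_FLASH_BASE_ADRS, NOR_FLASH_SIZE);

where NOR_FLASH_SIZE is an assumed constant for the size of the device.
The pointer should be volatile so the compiler does not reorder or drop
the command writes. A fuller version would also keep the page-aligned
base pointer so the range can be released with munmap().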

^ permalink raw reply

