LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages
From: David Gibson @ 2013-05-03  4:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1367178711-8232-7-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

On Mon, Apr 29, 2013 at 01:21:47AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
From: David Gibson @ 2013-05-03  4:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1367178711-8232-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 25821 bytes --]

On Mon, Apr 29, 2013 at 01:21:43AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> We now have pmd entries covering 16MB range and the PMD table double its original size.
> We use the second half of the PMD table to deposit the pgtable (PTE page).
> The depoisted PTE page is further used to track the HPTE information. The information
> include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
> With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
> 4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk
> the PTE page and invalidate all valid HPTEs.
> 
> This patch implements necessary arch specific functions for THP support and also
> hugepage invalidate logic. These PMD related functions are intentionally kept
> similar to their PTE counter-part.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/page.h              |  11 +-
>  arch/powerpc/include/asm/pgtable-ppc64-64k.h |   3 +-
>  arch/powerpc/include/asm/pgtable-ppc64.h     | 259 +++++++++++++++++++++-
>  arch/powerpc/include/asm/pgtable.h           |   5 +
>  arch/powerpc/include/asm/pte-hash64-64k.h    |  17 ++
>  arch/powerpc/mm/pgtable_64.c                 | 318 +++++++++++++++++++++++++++
>  arch/powerpc/platforms/Kconfig.cputype       |   1 +
>  7 files changed, 611 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 988c812..cbf4be7 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -37,8 +37,17 @@
>  #define PAGE_SIZE		(ASM_CONST(1) << PAGE_SHIFT)
>  
>  #ifndef __ASSEMBLY__
> -#ifdef CONFIG_HUGETLB_PAGE
> +/*
> + * With hugetlbfs enabled we allow the HPAGE_SHIFT to run time
> + * configurable. But we enable THP only with 16MB hugepage.
> + * With only THP configured, we force hugepage size to 16MB.
> + * This should ensure that all subarchs that doesn't support
> + * THP continue to work fine with HPAGE_SHIFT usage.
> + */
> +#if defined(CONFIG_HUGETLB_PAGE)
>  extern unsigned int HPAGE_SHIFT;
> +#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +#define HPAGE_SHIFT PMD_SHIFT

As I said in comments on the first patch series, this messing around
with HPAGE_SHIFT for THP is missing the point.  On ppc HPAGE_SHIFT is
nothing more than the _default_ hugepage size for explicit hugepages.
THP should not be dependent on it in any way.

>  #else
>  #define HPAGE_SHIFT PAGE_SHIFT
>  #endif
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> index 45142d6..a56b82f 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> @@ -33,7 +33,8 @@
>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>  
>  /* Bits to mask out from a PMD to get to the PTE page */
> -#define PMD_MASKED_BITS		0x1ff
> +/* PMDs point to PTE table fragments which are 4K aligned.  */
> +#define PMD_MASKED_BITS		0xfff

Hrm.  AFAICT this is related to the change in size of PTE tables, and
hence the page sharing stuff, so this belongs in the patch which
implements that, rather than the THP support itself.

>  /* Bits to mask out from a PGD/PUD to get to the PMD page */
>  #define PUD_MASKED_BITS		0x1ff
>  
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index ab84332..20133c1 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -154,7 +154,7 @@
>  #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
>  #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
>  #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
> -#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
> +extern struct page *pmd_page(pmd_t pmd);
>  
>  #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
>  #define pud_none(pud)		(!pud_val(pud))
> @@ -382,4 +382,261 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>  
>  #endif /* __ASSEMBLY__ */
>  
> +#ifndef _PAGE_SPLITTING
> +/*
> + * THP pages can't be special. So use the _PAGE_SPECIAL
> + */
> +#define _PAGE_SPLITTING _PAGE_SPECIAL
> +#endif
> +
> +#ifndef _PAGE_THP_HUGE
> +/*
> + * We need to differentiate between explicit huge page and THP huge
> + * page, since THP huge page also need to track real subpage details
> + * We use the _PAGE_COMBO bits here as dummy for platform that doesn't
> + * support THP.
> + */
> +#define _PAGE_THP_HUGE  0x10000000

So if it's _PAGE_COMBO, use _PAGE_COMBO, instead of the actual number.

> +#endif
> +
> +/*
> + * PTE flags to conserve for HPTE identification for THP page.
> + */
> +#ifndef _PAGE_THP_HPTEFLAGS
> +#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)

You have this definition both here and in pte-hash64-64k.h.  More
importantly including _PAGE_BUSY seems like an extremely bad idea -
did you mean _PAGE_THP_HUGE == _PAGE_COMBO?

> +#endif
> +
> +#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
> +#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))

These constants should be named so its clear they're THP specific.
They should also be defined in terms of PMD_SHIFT, instead of
directly.

> +/*
> + * set of bits not changed in pmd_modify.
> + */
> +#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_THP_HPTEFLAGS | \
> +			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
> +
> +#ifndef __ASSEMBLY__
> +extern void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
> +				     pmd_t *pmdp);

This should maybe be called "hpge_do_hugepage_flush()".  The current
name suggests it returns a boolean, rather than performing the actual
flush.

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
> +extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
> +extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
> +extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +		       pmd_t *pmdp, pmd_t pmd);
> +extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> +				 pmd_t *pmd);
> +
> +static inline int pmd_trans_huge(pmd_t pmd)
> +{
> +	/*
> +	 * leaf pte for huge page, bottom two bits != 00
> +	 */
> +	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
> +}
> +
> +static inline int pmd_large(pmd_t pmd)
> +{
> +	/*
> +	 * leaf pte for huge page, bottom two bits != 00
> +	 */
> +	if (pmd_trans_huge(pmd))
> +		return pmd_val(pmd) & _PAGE_PRESENT;
> +	return 0;
> +}
> +
> +static inline int pmd_trans_splitting(pmd_t pmd)
> +{
> +	if (pmd_trans_huge(pmd))
> +		return pmd_val(pmd) & _PAGE_SPLITTING;
> +	return 0;
> +}
> +
> +
> +static inline unsigned long pmd_pfn(pmd_t pmd)
> +{
> +	/*
> +	 * Only called for hugepage pmd
> +	 */
> +	return pmd_val(pmd) >> PTE_RPN_SHIFT;
> +}
> +
> +/* We will enable it in the last patch */
> +#define has_transparent_hugepage() 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +static inline int pmd_young(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & _PAGE_ACCESSED;
> +}

It would be clearer to define this function as well as various others
that operate on PMDs as PTEs to just cast the parameter and call the
corresponding pte_XXX(),

> +
> +static inline pmd_t pmd_mkhuge(pmd_t pmd)
> +{
> +	/* Do nothing, mk_pmd() does this part.  */
> +	return pmd;
> +}
> +
> +#define __HAVE_ARCH_PMD_WRITE
> +static inline int pmd_write(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & _PAGE_RW;
> +}
> +
> +static inline pmd_t pmd_mkold(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~_PAGE_ACCESSED;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_wrprotect(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~_PAGE_RW;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkdirty(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_DIRTY;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkyoung(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_ACCESSED;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_RW;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~_PAGE_PRESENT;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mksplitting(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_SPLITTING;
> +	return pmd;
> +}
> +
> +/*
> + * Set the dirty and/or accessed bits atomically in a linux hugepage PMD, this
> + * function doesn't need to flush the hash entry
> + */
> +static inline void __pmdp_set_access_flags(pmd_t *pmdp, pmd_t entry)
> +{
> +	unsigned long bits = pmd_val(entry) & (_PAGE_DIRTY |
> +					       _PAGE_ACCESSED |
> +					       _PAGE_RW | _PAGE_EXEC);
> +#ifdef PTE_ATOMIC_UPDATES
> +	unsigned long old, tmp;
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%4\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		or	%0,%3,%0\n\
> +		stdcx.	%0,0,%4\n\
> +		bne-	1b"
> +	:"=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	:"r" (bits), "r" (pmdp), "m" (*pmdp), "i" (_PAGE_BUSY)
> +	:"cc");
> +#else
> +	unsigned long old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old | bits);
> +#endif

Using parameter casts on the corresponding pte_update() function would
be even more valuable for these more complex functions with asm.

> +}
> +
> +#define __HAVE_ARCH_PMD_SAME
> +static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> +{
> +	return (((pmd_val(pmd_a) ^ pmd_val(pmd_b)) & ~_PAGE_THP_HPTEFLAGS) == 0);

Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
is likely to be bad.  If the page is busy, it's in the middle of
update so can't stably be considered the same as anything.

> +}
> +
> +#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
> +extern int pmdp_set_access_flags(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp,
> +				 pmd_t entry, int dirty);
> +
> +static inline unsigned long pmd_hugepage_update(struct mm_struct *mm,
> +						unsigned long addr,
> +						pmd_t *pmdp, unsigned long clr)
> +{
> +#ifdef PTE_ATOMIC_UPDATES
> +	unsigned long old, tmp;
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%3\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		andc	%1,%0,%4 \n\
> +		stdcx.	%1,0,%3 \n\
> +		bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (_PAGE_BUSY)
> +	: "cc" );
> +#else
> +	unsigned long old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old & ~clr);
> +#endif
> +
> +#ifdef CONFIG_PPC_STD_MMU_64

THP only works with the standard hash MMU, so this #if seems a bit
pointless.

> +	if (old & _PAGE_HASHPTE)
> +		hpte_need_hugepage_flush(mm, addr, pmdp);
> +#endif
> +	return old;
> +}
> +
> +static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
> +					      unsigned long addr, pmd_t *pmdp)
> +{
> +	unsigned long old;
> +
> +	if ((pmd_val(*pmdp) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
> +		return 0;
> +	old = pmd_hugepage_update(mm, addr, pmdp, _PAGE_ACCESSED);
> +	return ((old & _PAGE_ACCESSED) != 0);
> +}
> +
> +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
> +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +				     unsigned long address, pmd_t *pmdp);
> +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
> +extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
> +extern pmd_t pmdp_get_and_clear(struct mm_struct *mm,
> +				unsigned long addr, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_SET_WRPROTECT

Now that the PTE format is the same at bottom or PMD level, do you
still need this?

> +static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
> +				      pmd_t *pmdp)
> +{
> +
> +	if ((pmd_val(*pmdp) & _PAGE_RW) == 0)
> +		return;
> +
> +	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW);
> +}
> +
> +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> +extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PGTABLE_DEPOSIT
> +extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> +				       pgtable_t pgtable);
> +#define __HAVE_ARCH_PGTABLE_WITHDRAW
> +extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE
> +extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +			    pmd_t *pmdp);
> +#endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 7aeb955..283198e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -222,5 +222,10 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  		       unsigned long end, int write, struct page **pages, int *nr);
>  #endif /* __ASSEMBLY__ */
>  
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +#define pmd_large(pmd)		0
> +#define has_transparent_hugepage() 0
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_PGTABLE_H */
> diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
> index 3e13e23..6be70be 100644
> --- a/arch/powerpc/include/asm/pte-hash64-64k.h
> +++ b/arch/powerpc/include/asm/pte-hash64-64k.h
> @@ -38,6 +38,23 @@
>   */
>  #define PTE_RPN_SHIFT	(30)
>  
> +/*
> + * THP pages can't be special. So use the _PAGE_SPECIAL
> + */
> +#define _PAGE_SPLITTING _PAGE_SPECIAL
> +
> +/*
> + * PTE flags to conserve for HPTE identification for THP page.
> + * We drop _PAGE_COMBO here, because we overload that with _PAGE_TH_HUGE.
> + */
> +#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)
> +
> +/*
> + * We need to differentiate between explicit huge page and THP huge
> + * page, since THP huge page also need to track real subpage details
> + */
> +#define _PAGE_THP_HUGE  _PAGE_COMBO

All 3 of these definitions also appeared elsewhere.

> +
>  #ifndef __ASSEMBLY__
>  
>  /*
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index a854096..54216c1 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -338,6 +338,19 @@ EXPORT_SYMBOL(iounmap);
>  EXPORT_SYMBOL(__iounmap);
>  EXPORT_SYMBOL(__iounmap_at);
>  
> +/*
> + * For hugepage we have pfn in the pmd, we use PTE_RPN_SHIFT bits for flags
> + * For PTE page, we have a PTE_FRAG_SIZE (4K) aligned virtual address.
> + */
> +struct page *pmd_page(pmd_t pmd)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (pmd_trans_huge(pmd))
> +		return pfn_to_page(pmd_pfn(pmd));

In this case you should be able to define this in terms of pte_pfn().

> +#endif
> +	return virt_to_page(pmd_page_vaddr(pmd));
> +}
> +
>  #ifdef CONFIG_PPC_64K_PAGES
>  static pte_t *get_from_cache(struct mm_struct *mm)
>  {
> @@ -455,3 +468,308 @@ void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
>  }
>  #endif
>  #endif /* CONFIG_PPC_64K_PAGES */
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static pmd_t set_hugepage_access_flags_filter(pmd_t pmd,
> +					      struct vm_area_struct *vma,
> +					      int dirty)
> +{
> +	return pmd;
> +}

This identity function is only used immediately before.  Why does it
exist?

> +/*
> + * This is called when relaxing access to a hugepage. It's also called in the page
> + * fault path when we don't hit any of the major fault cases, ie, a minor
> + * update of _PAGE_ACCESSED, _PAGE_DIRTY, etc... The generic code will have
> + * handled those two for us, we additionally deal with missing execute
> + * permission here on some processors
> + */
> +int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
> +			  pmd_t *pmdp, pmd_t entry, int dirty)
> +{
> +	int changed;
> +	entry = set_hugepage_access_flags_filter(entry, vma, dirty);
> +	changed = !pmd_same(*(pmdp), entry);
> +	if (changed) {
> +		__pmdp_set_access_flags(pmdp, entry);
> +		/*
> +		 * Since we are not supporting SW TLB systems, we don't
> +		 * have any thing similar to flush_tlb_page_nohash()
> +		 */
> +	}
> +	return changed;
> +}
> +
> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +			      unsigned long address, pmd_t *pmdp)
> +{
> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
> +}
> +
> +/*
> + * We currently remove entries from the hashtable regardless of whether
> + * the entry was young or dirty. The generic routines only flush if the
> + * entry was young or dirty which is not good enough.
> + *
> + * We should be more intelligent about this but for the moment we override
> + * these functions and force a tlb flush unconditionally
> + */
> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp)
> +{
> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
> +}
> +
> +/*
> + * We mark the pmd splitting and invalidate all the hpte
> + * entries for this hugepage.
> + */
> +void pmdp_splitting_flush(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp)
> +{
> +	unsigned long old, tmp;
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +#ifdef PTE_ATOMIC_UPDATES
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%3\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		ori	%1,%0,%4 \n\
> +		stdcx.	%1,0,%3 \n\
> +		bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	: "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
> +	: "cc" );
> +#else
> +	old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old | _PAGE_SPLITTING);
> +#endif
> +	/*
> +	 * If we didn't had the splitting flag set, go and flush the
> +	 * HPTE entries and serialize against gup fast.
> +	 */
> +	if (!(old & _PAGE_SPLITTING)) {
> +#ifdef CONFIG_PPC_STD_MMU_64
> +		/* We need to flush the hpte */
> +		if (old & _PAGE_HASHPTE)
> +			hpte_need_hugepage_flush(vma->vm_mm, address, pmdp);
> +#endif
> +		/* need tlb flush only to serialize against gup-fast */
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	}
> +}
> +
> +/*
> + * We want to put the pgtable in pmd and use pgtable for tracking
> + * the base page size hptes
> + */
> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> +				pgtable_t pgtable)
> +{
> +	unsigned long *pgtable_slot;
> +	assert_spin_locked(&mm->page_table_lock);
> +	/*
> +	 * we store the pgtable in the second half of PMD
> +	 */
> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> +	*pgtable_slot = (unsigned long)pgtable;

Why not just make pgtable_slot have type (pgtable_t *) and avoid the
case.

> +}
> +
> +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> +{
> +	pgtable_t pgtable;
> +	unsigned long *pgtable_slot;
> +
> +	assert_spin_locked(&mm->page_table_lock);
> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> +	pgtable = (pgtable_t) *pgtable_slot;
> +	/*
> +	 * We store HPTE information in the deposited PTE fragment.
> +	 * zero out the content on withdraw.
> +	 */
> +	memset(pgtable, 0, PTE_FRAG_SIZE);
> +	return pgtable;
> +}
> +
> +/*
> + * Since we are looking at latest ppc64, we don't need to worry about
> + * i/d cache coherency on exec fault
> + */
> +static pmd_t set_pmd_filter(pmd_t pmd, unsigned long addr)
> +{
> +	pmd = __pmd(pmd_val(pmd) & ~_PAGE_THP_HPTEFLAGS);
> +	return pmd;
> +}
> +
> +/*
> + * We can make it less convoluted than __set_pte_at, because
> + * we can ignore lot of hardware here, because this is only for
> + * MPSS
> + */
> +static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +				pmd_t *pmdp, pmd_t pmd, int percpu)
> +{
> +	/*
> +	 * There is nothing in hash page table now, so nothing to
> +	 * invalidate, set_pte_at is used for adding new entry.
> +	 * For updating we should use update_hugepage_pmd()
> +	 */
> +	*pmdp = pmd;
> +}

Again you should be able to define this in terms of the set_pte_at()
functions.

> +/*
> + * set a new huge pmd. We should not be called for updating
> + * an existing pmd entry. That should go via pmd_hugepage_update.
> + */
> +void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +		pmd_t *pmdp, pmd_t pmd)
> +{
> +	/*
> +	 * Note: mm->context.id might not yet have been assigned as
> +	 * this context might not have been activated yet when this
> +	 * is called.

And the relevance of this comment here is...?

> +	 */
> +	pmd = set_pmd_filter(pmd, addr);
> +
> +	__set_pmd_at(mm, addr, pmdp, pmd, 0);
> +
> +}
> +
> +void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +		     pmd_t *pmdp)
> +{
> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT);
> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +}
> +
> +/*
> + * A linux hugepage PMD was changed and the corresponding hash table entries
> + * neesd to be flushed.
> + *
> + * The linux hugepage PMD now include the pmd entries followed by the address
> + * to the stashed pgtable_t. The stashed pgtable_t contains the hpte bits.
> + * [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
> + * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
> + * 4096 entries. Both will fit in a 4K pgtable_t.
> + */
> +void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
> +			      pmd_t *pmdp)
> +{
> +	int ssize, i;
> +	unsigned long s_addr;
> +	unsigned int psize, valid;
> +	unsigned char *hpte_slot_array;
> +	unsigned long hidx, vpn, vsid, hash, shift, slot;
> +
> +	/*
> +	 * Flush all the hptes mapping this hugepage
> +	 */
> +	s_addr = addr & HUGE_PAGE_MASK;
> +	/*
> +	 * The hpte hindex are stored in the pgtable whose address is in the
> +	 * second half of the PMD
> +	 */
> +	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
> +
> +	/* get the base page size */
> +	psize = get_slice_psize(mm, s_addr);
> +	shift = mmu_psize_defs[psize].shift;
> +
> +	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, 0);
> +	}
> +}
> +
> +static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
> +{
> +	pmd_val(pmd) |= pgprot_val(pgprot);
> +	return pmd;
> +}
> +
> +pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
> +{
> +	pmd_t pmd;
> +	/*
> +	 * For a valid pte, we would have _PAGE_PRESENT or _PAGE_FILE always
> +	 * set. We use this to check THP page at pmd level.
> +	 * leaf pte for huge page, bottom two bits != 00
> +	 */
> +	pmd_val(pmd) = pfn << PTE_RPN_SHIFT;
> +	pmd_val(pmd) |= _PAGE_THP_HUGE;
> +	pmd = pmd_set_protbits(pmd, pgprot);
> +	return pmd;
> +}
> +
> +pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
> +{
> +	return pfn_pmd(page_to_pfn(page), pgprot);
> +}
> +
> +pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> +{
> +
> +	pmd_val(pmd) &= _HPAGE_CHG_MASK;
> +	pmd = pmd_set_protbits(pmd, newprot);
> +	return pmd;
> +}
> +
> +/*
> + * This is called at the end of handling a user page fault, when the
> + * fault has been handled by updating a HUGE PMD entry in the linux page tables.
> + * We use it to preload an HPTE into the hash table corresponding to
> + * the updated linux HUGE PMD entry.
> + */
> +void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> +			  pmd_t *pmd)
> +{
> +	return;
> +}
> +
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +pmd_t pmdp_get_and_clear(struct mm_struct *mm,
> +			 unsigned long addr, pmd_t *pmdp)
> +{
> +	pmd_t old_pmd;
> +	unsigned long old;
> +	/*
> +	 * khugepaged calls this for normal pmd also
> +	 */
> +	if (pmd_trans_huge(*pmdp)) {
> +		old = pmd_hugepage_update(mm, addr, pmdp, ~0UL);
> +		old_pmd = __pmd(old);
> +	} else {
> +		old_pmd = *pmdp;
> +		pmd_clear(pmdp);
> +	}
> +	return old_pmd;
> +}
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 18e3b76..a526144 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>  	select PPC_FPU
>  	select PPC_HAVE_PMU_SUPPORT
>  	select SYS_SUPPORTS_HUGETLBFS
> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>  
>  config PPC_BOOK3E_64
>  	bool "Embedded processors"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate
From: David Gibson @ 2013-05-03  5:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1367178711-8232-10-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 11696 bytes --]

On Mon, Apr 29, 2013 at 01:21:50AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Hugepage invalidate involves invalidating multiple hpte entries.
> Optimize the operation using H_BULK_REMOVE on lpar platforms.
> On native, reduce the number of tlb flush.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Since this is purely an optimization, have you tried reproducing the
bugs you're chasing with this patch not included?

> ---
>  arch/powerpc/include/asm/machdep.h    |   3 +
>  arch/powerpc/mm/hash_native_64.c      |  78 +++++++++++++++++++++
>  arch/powerpc/mm/pgtable_64.c          |  13 +++-
>  arch/powerpc/platforms/pseries/lpar.c | 126 ++++++++++++++++++++++++++++++++--
>  4 files changed, 210 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index 3f3f691..5d1e7d2 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -56,6 +56,9 @@ struct machdep_calls {
>  	void            (*hpte_removebolted)(unsigned long ea,
>  					     int psize, int ssize);
>  	void		(*flush_hash_range)(unsigned long number, int local);
> +	void		(*hugepage_invalidate)(struct mm_struct *mm,
> +					       unsigned char *hpte_slot_array,
> +					       unsigned long addr, int psize);
>  
>  	/* special for kexec, to be called in real mode, linear mapping is
>  	 * destroyed as well */
> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
> index 6a2aead..8ca178d 100644
> --- a/arch/powerpc/mm/hash_native_64.c
> +++ b/arch/powerpc/mm/hash_native_64.c
> @@ -455,6 +455,83 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
>  	local_irq_restore(flags);
>  }
>  
> +static void native_hugepage_invalidate(struct mm_struct *mm,
> +				       unsigned char *hpte_slot_array,
> +				       unsigned long addr, int psize)
> +{
> +	int ssize = 0, i;
> +	int lock_tlbie;
> +	struct hash_pte *hptep;
> +	int actual_psize = MMU_PAGE_16M;
> +	unsigned int max_hpte_count, valid;
> +	unsigned long flags, s_addr = addr;
> +	unsigned long hpte_v, want_v, shift;
> +	unsigned long hidx, vpn = 0, vsid, hash, slot;
> +
> +	shift = mmu_psize_defs[psize].shift;
> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
> +
> +	local_irq_save(flags);
> +	for (i = 0; i < max_hpte_count; i++) {
> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +
> +		hptep = htab_address + slot;
> +		want_v = hpte_encode_avpn(vpn, psize, ssize);
> +		native_lock_hpte(hptep);
> +		hpte_v = hptep->v;
> +
> +		/* Even if we miss, we need to invalidate the TLB */
> +		if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
> +			native_unlock_hpte(hptep);
> +		else
> +			/* Invalidate the hpte. NOTE: this also unlocks it */
> +			hptep->v = 0;
> +	}
> +	/*
> +	 * Since this is a hugepage, we just need a single tlbie.
> +	 * use the last vpn.
> +	 */
> +	lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
> +	if (lock_tlbie)
> +		raw_spin_lock(&native_tlbie_lock);
> +
> +	asm volatile("ptesync":::"memory");
> +	__tlbie(vpn, psize, actual_psize, ssize);
> +	asm volatile("eieio; tlbsync; ptesync":::"memory");
> +
> +	if (lock_tlbie)
> +		raw_spin_unlock(&native_tlbie_lock);
> +
> +	local_irq_restore(flags);
> +}
> +
> +
>  static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
>  			int *psize, int *apsize, int *ssize, unsigned long *vpn)
>  {
> @@ -658,4 +735,5 @@ void __init hpte_init_native(void)
>  	ppc_md.hpte_remove	= native_hpte_remove;
>  	ppc_md.hpte_clear_all	= native_hpte_clear;
>  	ppc_md.flush_hash_range = native_flush_hash_range;
> +	ppc_md.hugepage_invalidate   = native_hugepage_invalidate;
>  }
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index b742d6f..504952f 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -659,6 +659,7 @@ void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>  {
>  	int ssize, i;
>  	unsigned long s_addr;
> +	int max_hpte_count;
>  	unsigned int psize, valid;
>  	unsigned char *hpte_slot_array;
>  	unsigned long hidx, vpn, vsid, hash, shift, slot;
> @@ -672,12 +673,18 @@ void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>  	 * second half of the PMD
>  	 */
>  	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
> -
>  	/* get the base page size */
>  	psize = get_slice_psize(mm, s_addr);
> -	shift = mmu_psize_defs[psize].shift;
>  
> -	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
> +	if (ppc_md.hugepage_invalidate)
> +		return ppc_md.hugepage_invalidate(mm, hpte_slot_array,
> +						  s_addr, psize);
> +	/*
> +	 * No bluk hpte removal support, invalidate each entry
> +	 */
> +	shift = mmu_psize_defs[psize].shift;
> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
> +	for (i = 0; i < max_hpte_count; i++) {
>  		/*
>  		 * 8 bits per each hpte entries
>  		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
> index 6d62072..58a31db 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -45,6 +45,13 @@
>  #include "plpar_wrappers.h"
>  #include "pseries.h"
>  
> +/* Flag bits for H_BULK_REMOVE */
> +#define HBR_REQUEST	0x4000000000000000UL
> +#define HBR_RESPONSE	0x8000000000000000UL
> +#define HBR_END		0xc000000000000000UL
> +#define HBR_AVPN	0x0200000000000000UL
> +#define HBR_ANDCOND	0x0100000000000000UL
> +
>  
>  /* in hvCall.S */
>  EXPORT_SYMBOL(plpar_hcall);
> @@ -345,6 +352,117 @@ static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long vpn,
>  	BUG_ON(lpar_rc != H_SUCCESS);
>  }
>  
> +/*
> + * Limit iterations holding pSeries_lpar_tlbie_lock to 3. We also need
> + * to make sure that we avoid bouncing the hypervisor tlbie lock.
> + */
> +#define PPC64_HUGE_HPTE_BATCH 12
> +
> +static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
> +					     unsigned long *vpn, int count,
> +					     int psize, int ssize)
> +{
> +	unsigned long param[9];

[9]?  I only see 8 elements being used.

> +	int i = 0, pix = 0, rc;
> +	unsigned long flags = 0;
> +	int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
> +
> +	if (lock_tlbie)
> +		spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);

Why are these hash operations being called with the tlbie lock held?

> +
> +	for (i = 0; i < count; i++) {
> +
> +		if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
> +			pSeries_lpar_hpte_invalidate(slot[i], vpn[i], psize,
> +						     ssize, 0);

Couldn't you set the ppc_md hook based on the firmware request to
avoid this test in the inner loop?  I don't see any tlbie operations
at all.

> +		} else {
> +			param[pix] = HBR_REQUEST | HBR_AVPN | slot[i];
> +			param[pix+1] = hpte_encode_avpn(vpn[i], psize, ssize);
> +			pix += 2;
> +			if (pix == 8) {
> +				rc = plpar_hcall9(H_BULK_REMOVE, param,
> +						  param[0], param[1], param[2],
> +						  param[3], param[4], param[5],
> +						  param[6], param[7]);
> +				BUG_ON(rc != H_SUCCESS);
> +				pix = 0;
> +			}
> +		}
> +	}
> +	if (pix) {
> +		param[pix] = HBR_END;
> +		rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
> +				  param[2], param[3], param[4], param[5],
> +				  param[6], param[7]);
> +		BUG_ON(rc != H_SUCCESS);
> +	}
> +
> +	if (lock_tlbie)
> +		spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
> +}
> +
> +static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
> +				       unsigned char *hpte_slot_array,
> +				       unsigned long addr, int psize)
> +{
> +	int ssize = 0, i, index = 0;
> +	unsigned long s_addr = addr;
> +	unsigned int max_hpte_count, valid;
> +	unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
> +	unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
> +	unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
> +
> +	shift = mmu_psize_defs[psize].shift;
> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
> +
> +	for (i = 0; i < max_hpte_count; i++) {
> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +
> +		slot_array[index] = slot;
> +		vpn_array[index] = vpn;
> +		if (index == PPC64_HUGE_HPTE_BATCH - 1) {
> +			/*
> +			 * Now do a bluk invalidate
> +			 */
> +			__pSeries_lpar_hugepage_invalidate(slot_array,
> +							   vpn_array,
> +							   PPC64_HUGE_HPTE_BATCH,
> +							   psize, ssize);

I don't really understand why you have one loop in this function, then
another in the __ function.

> +			index = 0;
> +		} else
> +			index++;
> +	}
> +	if (index)
> +		__pSeries_lpar_hugepage_invalidate(slot_array, vpn_array,
> +						   index, psize, ssize);
> +}
> +
>  static void pSeries_lpar_hpte_removebolted(unsigned long ea,
>  					   int psize, int ssize)
>  {
> @@ -360,13 +478,6 @@ static void pSeries_lpar_hpte_removebolted(unsigned long ea,
>  	pSeries_lpar_hpte_invalidate(slot, vpn, psize, ssize, 0);
>  }
>  
> -/* Flag bits for H_BULK_REMOVE */
> -#define HBR_REQUEST	0x4000000000000000UL
> -#define HBR_RESPONSE	0x8000000000000000UL
> -#define HBR_END		0xc000000000000000UL
> -#define HBR_AVPN	0x0200000000000000UL
> -#define HBR_ANDCOND	0x0100000000000000UL
> -
>  /*
>   * Take a spinlock around flushes to avoid bouncing the hypervisor tlbie
>   * lock.
> @@ -452,6 +563,7 @@ void __init hpte_init_lpar(void)
>  	ppc_md.hpte_removebolted = pSeries_lpar_hpte_removebolted;
>  	ppc_md.flush_hash_range	= pSeries_lpar_flush_hash_range;
>  	ppc_md.hpte_clear_all   = pSeries_lpar_hptab_clear;
> +	ppc_md.hugepage_invalidate = pSeries_lpar_hugepage_invalidate;
>  }
>  
>  #ifdef CONFIG_PPC_SMLPAR

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH -V7 10/10] powerpc: disable assert_pte_locked
From: David Gibson @ 2013-05-03  5:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <1367178711-8232-11-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

On Mon, Apr 29, 2013 at 01:21:51AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> With THP we set pmd to none, before we do pte_clear. Hence we can't
> walk page table to get the pte lock ptr and verify whether it is locked.
> THP do take pte lock before calling pte_clear. So we don't change the locking
> rules here. It is that we can't use page table walking to check whether
> pte locks are help with THP.
> 
> NOTE: This needs to be re-written. Not to be merged upstream.

So, rewrite it..

> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/pgtable.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 214130a..d77f94f 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -224,6 +224,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
>  #ifdef CONFIG_DEBUG_VM
>  void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  {
> +#if 0
>  	pgd_t *pgd;
>  	pud_t *pud;
>  	pmd_t *pmd;
> @@ -237,6 +238,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  	pmd = pmd_offset(pud, addr);
>  	BUG_ON(!pmd_present(*pmd));
>  	assert_spin_locked(pte_lockptr(mm, pmd));
> +#endif
>  }
>  #endif /* CONFIG_DEBUG_VM */
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] powerpc: mpc512x_dma: add support for data transfers between memory and i/o memory
From: Alexander Popov @ 2013-05-03  6:28 UTC (permalink / raw)
  To: Vinod Koul; +Cc: linuxppc-dev, linux-kernel, Dan Williams
In-Reply-To: <20130502171452.GP1960@intel.com>

Hello Vinod,

Thanks for the review.
I will return with improved and tested version 2.

Best regards,
Alexander

^ permalink raw reply

* Re: [PATCHv4 1/2] ppc64: perform proper max_bus_speed detection
From: Tony Breeds @ 2013-05-03  6:40 UTC (permalink / raw)
  To: Kleber Sacilotto de Souza
  Cc: David Airlie, Brian King, dri-devel, Alex Deucher, Jerome Glisse,
	Thadeu Lima de Souza Cascardo, Bjorn Helgaas, linuxppc-dev
In-Reply-To: <51828481.2090406@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 875 bytes --]

On Thu, May 02, 2013 at 12:21:37PM -0300, Kleber Sacilotto de Souza wrote:

> Hi Tony,
> 
> It seems Lucas' change is a bit incomplete and is not handling the reference counter to
> the device_node correctly. Is the following change what you had in mind?

Ahh Sorry I expected there would be a for_each_parent_of_node macro.
I did a quick grep and it seems that's not very common, so open coding
it should be fine.
 
> 
> 	dn = pcibios_get_phb_of_node(bus);
> 	if (!dn)
> 		return 0;
> 
> 	for (pdn = dn; pdn != NULL; pdn = of_get_next_parent(pdn)) {
> 		pcie_link_speed_stats = (const uint32_t *) of_get_property(pdn,
> 			"ibm,pcie-link-speed-stats", NULL);
> 		if (pcie_link_speed_stats)
> 			break;
> 	}
> 
> 	of_node_put(pdn);

I think you need the of_node_put() in the body of the loop, otherwise
aren't you leaking refcounts?
 
Yours Tony

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] powerpc: mpc512x_dma: add support for data transfers between memory and i/o memory
From: Anatolij Gustschin @ 2013-05-03  6:59 UTC (permalink / raw)
  To: Alexander Popov; +Cc: Vinod Koul, linuxppc-dev, linux-kernel, Dan Williams
In-Reply-To: <CAF0T0X7XBuL7a=8zwqMnLDB04-Tf=qhCLPZ4T29yNdN6mP5S2A@mail.gmail.com>

On Fri, 3 May 2013 10:28:02 +0400
Alexander Popov <a13xp0p0v88@gmail.com> wrote:

> Hello Vinod,
> 
> Thanks for the review.
> I will return with improved and tested version 2.

Note that there is a patch for .device_prep_slave_sg() operation
for this driver as part of this series:
https://patchwork.kernel.org/patch/2368581/
https://patchwork.kernel.org/patch/2368591/

maybe you can reuse and improve it.

Thanks,

Anatolij

^ permalink raw reply

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
From: Benjamin Herrenschmidt @ 2013-05-03  8:19 UTC (permalink / raw)
  To: David Gibson; +Cc: linux-mm, linuxppc-dev, paulus, Aneesh Kumar K.V
In-Reply-To: <20130503045201.GO13041@truffula.fritz.box>

On Fri, 2013-05-03 at 14:52 +1000, David Gibson wrote:
> Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
> is likely to be bad.  If the page is busy, it's in the middle of
> update so can't stably be considered the same as anything.

_PAGE_BUSY is more like a read lock. It means it's being hashed, so what
is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
rest is stable and usually is what pmd_same looks at (though I have a
small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
they are updated by HW).

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH net-next] af_unix: fix a fatal race with bit fields
From: David Laight @ 2013-05-03  8:20 UTC (permalink / raw)
  To: Alan Modra, Eric Dumazet
  Cc: netdev, Ambrose Feinstein, Paul Mackerras, Anton Blanchard,
	linuxppc-dev, David Miller
In-Reply-To: <20130503013136.GN5221@bubble.grove.modra.org>

> Did someone fix btrfs, but not check other kernel locks?  Having now
> hit the same problem again, have you checked that other kernel locks
> don't have adjacent bit fields in the same 64-bit word?  And comment
> the struct to ensure someone doesn't optimize those unsigned chars
> back to bit fields.

Seems a good reason to have a general policy of not using
bit fields!

Separate char fields normally generate faster code - possibly
at the expense of an increase in the allocated size of a
structure.

	David

^ permalink raw reply

* Re: [PATCH 1/1] powerpc: mpc512x_dma: add support for data transfers between memory and i/o memory
From: Alexander Popov @ 2013-05-03 10:43 UTC (permalink / raw)
  To: Anatolij Gustschin; +Cc: Vinod Koul, linuxppc-dev, linux-kernel, Dan Williams
In-Reply-To: <20130503085944.69ef751e@crub>

Hello Anatolij,

> Note that there is a patch for .device_prep_slave_sg() operation
> for this driver as part of this series:
> https://patchwork.kernel.org/patch/2368581/
> https://patchwork.kernel.org/patch/2368591/
Thanks, I haven't seen that patch.
It's certainly what my SCLPC device driver needs
(http://patchwork.ozlabs.org/patch/241010/).
I will send the second version of it which uses .device_prep_slave_sg().

> maybe you can reuse and improve it.
> Anatolij
Should I propose my additions at https://patchwork.kernel.org/patch/2368591/ ?

Best regards,
Alexander

^ permalink raw reply

* Re: [PATCHv4 1/2] ppc64: perform proper max_bus_speed detection
From: Kleber Sacilotto de Souza @ 2013-05-03 11:55 UTC (permalink / raw)
  To: tony
  Cc: David Airlie, Brian King, dri-devel, Alex Deucher, Jerome Glisse,
	Thadeu Lima de Souza Cascardo, Bjorn Helgaas, linuxppc-dev
In-Reply-To: <20130503064008.GB4260@thor.bakeyournoodle.com>

On 05/03/2013 03:40 AM, Tony Breeds wrote:
> On Thu, May 02, 2013 at 12:21:37PM -0300, Kleber Sacilotto de Souza wrote:
> 
>> Hi Tony,
>>
>> It seems Lucas' change is a bit incomplete and is not handling the reference counter to
>> the device_node correctly. Is the following change what you had in mind?
> 
> Ahh Sorry I expected there would be a for_each_parent_of_node macro.
> I did a quick grep and it seems that's not very common, so open coding
> it should be fine.
>  
>>
>> 	dn = pcibios_get_phb_of_node(bus);
>> 	if (!dn)
>> 		return 0;
>>
>> 	for (pdn = dn; pdn != NULL; pdn = of_get_next_parent(pdn)) {
>> 		pcie_link_speed_stats = (const uint32_t *) of_get_property(pdn,
>> 			"ibm,pcie-link-speed-stats", NULL);
>> 		if (pcie_link_speed_stats)
>> 			break;
>> 	}
>>
>> 	of_node_put(pdn);
> 
> I think you need the of_node_put() in the body of the loop, otherwise
> aren't you leaking refcounts?

of_get_next_parent() takes care of that. It does of_node_put() on the
current node after doing of_node_get() on the parent.


Thanks,
-- 
Kleber Sacilotto de Souza
IBM Linux Technology Center

^ permalink raw reply

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
From: David Gibson @ 2013-05-03 11:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, paulus
In-Reply-To: <1367569143.4389.56.camel@pasglop>

[-- Attachment #1: Type: text/plain, Size: 990 bytes --]

On Fri, May 03, 2013 at 06:19:03PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-05-03 at 14:52 +1000, David Gibson wrote:
> > Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
> > is likely to be bad.  If the page is busy, it's in the middle of
> > update so can't stably be considered the same as anything.
> 
> _PAGE_BUSY is more like a read lock. It means it's being hashed, so what
> is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
> rest is stable and usually is what pmd_same looks at (though I have a
> small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
> they are updated by HW).

Ok.  It still seems very odd to me that _PAGE_BUSY would be in the THP
version of _PAGE_HASHPTE, but not the normal one.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] af_unix: fix a fatal race with bit fields
From: Benjamin Herrenschmidt @ 2013-05-03 12:57 UTC (permalink / raw)
  To: Alan Modra
  Cc: Eric Dumazet, netdev, Ambrose Feinstein, Paul Mackerras,
	Anton Blanchard, linuxppc-dev, David Miller
In-Reply-To: <20130503013136.GN5221@bubble.grove.modra.org>

On Fri, 2013-05-03 at 11:01 +0930, Alan Modra wrote:
> On Tue, Apr 30, 2013 at 10:04:32PM -0700, Eric Dumazet wrote:
> > These kind of errors are pretty hard to find, its a pity to spend time
> > on them.
> 
> Well, yes.  From the first comment in gcc PR52080.  "For the following
> testcase we generate a 8 byte RMW cycle on IA64 which causes locking
> problems in the linux kernel btrfs filesystem."
> 
> Did someone fix btrfs, but not check other kernel locks?  Having now
> hit the same problem again, have you checked that other kernel locks
> don't have adjacent bit fields in the same 64-bit word?  And comment
> the struct to ensure someone doesn't optimize those unsigned chars
> back to bit fields.

Unfortunately, fixing "other" kernel locks is near impossible.

One could try to grep for all spinlock_t and maybe even all atomic_t,
may even write a script to spot automatically if a bitfield appears
to be around (though it could be hidden behind a structure etc...) but
what about an int accessed with cmxchg (a kernel macro doing a
lwarx/stwcx. loop on a value) for example ? There's plenty of these...

I don't think we can realistically "fix" all potential occurrences of
that bug in the kernel short of geting rid of all bitfields, which isn't
going to happen any time soon.

I'm afraid this *must* be fixed at the compiler level, with as backports
much as can realistically be done back to distros.

Ben.
 

^ permalink raw reply

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
From: Benjamin Herrenschmidt @ 2013-05-03 13:00 UTC (permalink / raw)
  To: David Gibson; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, paulus
In-Reply-To: <20130503115428.GW13041@truffula.fritz.box>

On Fri, 2013-05-03 at 21:54 +1000, David Gibson wrote:
> > _PAGE_BUSY is more like a read lock. It means it's being hashed, so what
> > is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
> > rest is stable and usually is what pmd_same looks at (though I have a
> > small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
> > they are updated by HW).
> 
> Ok.  It still seems very odd to me that _PAGE_BUSY would be in the THP
> version of _PAGE_HASHPTE, but not the normal one.

Oh I agree, we should be consistent and it shouldn't be there, I was just
correcting some other aspect of your statement :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 1/1] powerpc: mpc512x_dma: add support for data transfers between memory and i/o memory
From: Anatolij Gustschin @ 2013-05-03 13:52 UTC (permalink / raw)
  To: Alexander Popov; +Cc: Vinod Koul, linuxppc-dev, linux-kernel, Dan Williams
In-Reply-To: <CAF0T0X5cRMvoZKQQUkQHTLjbzz1xi_1pvUiywsKb5BGvyMGvpg@mail.gmail.com>

Hello Alexander,

On Fri, 3 May 2013 14:43:23 +0400
Alexander Popov <a13xp0p0v88@gmail.com> wrote:

> Hello Anatolij,
> 
> > Note that there is a patch for .device_prep_slave_sg() operation
> > for this driver as part of this series:
> > https://patchwork.kernel.org/patch/2368581/
> > https://patchwork.kernel.org/patch/2368591/
> Thanks, I haven't seen that patch.
> It's certainly what my SCLPC device driver needs
> (http://patchwork.ozlabs.org/patch/241010/).
> I will send the second version of it which uses .device_prep_slave_sg().
> 
> > maybe you can reuse and improve it.
> > Anatolij
> Should I propose my additions at https://patchwork.kernel.org/patch/2368591/ ?

Yes, I think so. I only used drivers new .device_prep_slave_sg()
for SDHC DMA channel transfers up to now. Adding support for other
peripherals would be good. With generic DMA DT bindings patch for
this driver you can use

    dmas = <&dma0 26>;
    dma-names = "rx-tx";

in your sclpc@10100 DT node and then
dma_request_slave_channel(&pdev->dev, "rx-tx") in the lpbfifo
driver.

Thanks,

Anatolij

^ permalink raw reply

* Re: [PATCH net-next] af_unix: fix a fatal race with bit fields
From: Eric Dumazet @ 2013-05-03 14:14 UTC (permalink / raw)
  To: Alan Modra
  Cc: netdev, Ambrose Feinstein, Paul Mackerras, Anton Blanchard,
	linuxppc-dev, David Miller
In-Reply-To: <20130503013136.GN5221@bubble.grove.modra.org>

On Fri, 2013-05-03 at 11:01 +0930, Alan Modra wrote:
> On Tue, Apr 30, 2013 at 10:04:32PM -0700, Eric Dumazet wrote:
> > These kind of errors are pretty hard to find, its a pity to spend time
> > on them.
> 
> Well, yes.  From the first comment in gcc PR52080.  "For the following
> testcase we generate a 8 byte RMW cycle on IA64 which causes locking
> problems in the linux kernel btrfs filesystem."
> 
> Did someone fix btrfs, but not check other kernel locks?  Having now
> hit the same problem again, have you checked that other kernel locks
> don't have adjacent bit fields in the same 64-bit word?  And comment
> the struct to ensure someone doesn't optimize those unsigned chars
> back to bit fields.

Not only spinlock, but atomic_t followed by bit fields.

BTW, if a spinlock is followed by bit fields, but bit fields
only changed when this spinlock is held, there is no problem, unless
spinlock is a ticket spinlock.

In af_unix, bug happens because the bit fields were changed without
spinlock being held (another global spinlock is used instead)

(ppc64 doesnt use ticket spinlocks yet)

^ permalink raw reply

* RE: [PATCH net-next] af_unix: fix a fatal race with bit fields
From: David Laight @ 2013-05-03 14:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Eric Dumazet
  Cc: netdev, linuxppc-dev, Paul Mackerras, David Miller,
	Ambrose Feinstein
In-Reply-To: <1367372393.22115.6.camel@pasglop>

PiA+IENvdWxkIHBwYzY0IGV4cGVydHMgY29uZmlybSB1c2luZyBieXRlIGlzIHNhZmUsIG9yIHNo
b3VsZCB3ZSByZWFsbHkgYWRkDQo+ID4gYSAzMmJpdCBob2xlIGFmdGVyIHRoZSBzcGlubG9jayA/
IElmIHNvLCBJIHdvbmRlciBob3cgbWFueSBvdGhlciBwbGFjZXMNCj4gPiBuZWVkIGEgY2hhbmdl
Li4uDQouLi4NCj4gQWxzbyBJJ2QgYmUgc3VycHJpc2VkIGlmIHBwYzY0IGlzIHRoZSBvbmx5IG9u
ZSB3aXRoIHRoYXQgcHJvYmxlbS4uLiB3aGF0DQo+IGFib3V0IHNwYXJjNjQgYW5kIGFybTY0ID8N
Cg0KRXZlbiB4ODYgY291bGQgYmUgYWZmZWN0ZWQuDQpUaGUgd2lkdGggb2YgdGhlIG1lbW9yeSBj
eWNsZXMgdXNlZCBieSB0aGUgJ2JpdCBzZXQgYW5kIGJpdCBjbGVhcicNCmluc3RydWN0aW9ucyBp
c24ndCBkb2N1bWVudGVkLiBUaGV5IGFyZSBjZXJ0YWlubHkgYWxsb3dlZCB0byBkbw0KUk1XIG9u
IGFkamFjZW50IGJ5dGVzLg0KSSBkb24ndCByZW1lbWJlciB3aGV0aGVyIHRoZXkgYXJlIGNvbnN0
cmFpbmVkIHRvIG9ubHkgZG8NCjMyYml0IGFjY2Vzc2VzLCBidXQgbm90aGluZyB1c2VkIHRvIHNh
eSB0aGF0IHRoZXkgd291bGRuJ3QNCmRvIDMyYml0IG1pc2FsaWduZWQgb25lcyEgKGFsdGhvdWdo
IEkgc3VzcGVjdCB0aGV5IG5ldmVyIGhhdmUpLg0KDQoJRGF2aWQNCg0K

^ permalink raw reply

* RE: [PATCH net-next] af_unix: fix a fatal race with bit fields
From: Eric Dumazet @ 2013-05-03 15:02 UTC (permalink / raw)
  To: David Laight
  Cc: Ambrose Feinstein, Paul Mackerras, netdev, linuxppc-dev,
	David Miller
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B7226@saturn3.aculab.com>

On Fri, 2013-05-03 at 15:29 +0100, David Laight wrote:
> > > Could ppc64 experts confirm using byte is safe, or should we really add
> > > a 32bit hole after the spinlock ? If so, I wonder how many other places
> > > need a change...
> ...
> > Also I'd be surprised if ppc64 is the only one with that problem... what
> > about sparc64 and arm64 ?
> 
> Even x86 could be affected.
> The width of the memory cycles used by the 'bit set and bit clear'
> instructions isn't documented. They are certainly allowed to do
> RMW on adjacent bytes.
> I don't remember whether they are constrained to only do
> 32bit accesses, but nothing used to say that they wouldn't
> do 32bit misaligned ones! (although I suspect they never have).

x86 is not affected (or else we would have found the bug much earlier)

Setting 1-bit field to one/zero uses OR/AND instructions.

orb  $4,724(%reg)

doesn't load/store 64bits but 8bits.

^ permalink raw reply

* RE: [PATCH net-next] af_unix: fix a fatal race with bit fields
From: David Laight @ 2013-05-03 15:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ambrose Feinstein, Paul Mackerras, netdev, linuxppc-dev,
	David Miller
In-Reply-To: <1367593356.29805.37.camel@edumazet-glaptop>

PiA+ID4gQWxzbyBJJ2QgYmUgc3VycHJpc2VkIGlmIHBwYzY0IGlzIHRoZSBvbmx5IG9uZSB3aXRo
IHRoYXQgcHJvYmxlbS4uLiB3aGF0DQo+ID4gPiBhYm91dCBzcGFyYzY0IGFuZCBhcm02NCA/DQo+
ID4NCj4gPiBFdmVuIHg4NiBjb3VsZCBiZSBhZmZlY3RlZC4NCj4gPiBUaGUgd2lkdGggb2YgdGhl
IG1lbW9yeSBjeWNsZXMgdXNlZCBieSB0aGUgJ2JpdCBzZXQgYW5kIGJpdCBjbGVhcicNCj4gPiBp
bnN0cnVjdGlvbnMgaXNuJ3QgZG9jdW1lbnRlZC4gVGhleSBhcmUgY2VydGFpbmx5IGFsbG93ZWQg
dG8gZG8NCj4gPiBSTVcgb24gYWRqYWNlbnQgYnl0ZXMuDQo+ID4gSSBkb24ndCByZW1lbWJlciB3
aGV0aGVyIHRoZXkgYXJlIGNvbnN0cmFpbmVkIHRvIG9ubHkgZG8NCj4gPiAzMmJpdCBhY2Nlc3Nl
cywgYnV0IG5vdGhpbmcgdXNlZCB0byBzYXkgdGhhdCB0aGV5IHdvdWxkbid0DQo+ID4gZG8gMzJi
aXQgbWlzYWxpZ25lZCBvbmVzISAoYWx0aG91Z2ggSSBzdXNwZWN0IHRoZXkgbmV2ZXIgaGF2ZSku
DQo+IA0KPiB4ODYgaXMgbm90IGFmZmVjdGVkIChvciBlbHNlIHdlIHdvdWxkIGhhdmUgZm91bmQg
dGhlIGJ1ZyBtdWNoIGVhcmxpZXIpDQo+IA0KPiBTZXR0aW5nIDEtYml0IGZpZWxkIHRvIG9uZS96
ZXJvIHVzZXMgT1IvQU5EIGluc3RydWN0aW9ucy4NCj4gDQo+IG9yYiAgJDQsNzI0KCVyZWcpDQo+
IA0KPiBkb2Vzbid0IGxvYWQvc3RvcmUgNjRiaXRzIGJ1dCA4Yml0cy4NCg0KSSB3YXMgdGhpbmtp
bmcgb2YgY29kZSB0aGF0IG1pZ2h0IGJlIHVzaW5nIEJULCBCVEMsIEJUUiBvciBCVFMuDQpUaGVz
ZSBhcmUgcHJvYmFibHkgdXNlZCB3aXRoIHRoZSAnbG9jaycgcHJlZml4IC0gd2hpY2ggd291bGQN
CihJIHRoaW5rKSBtYWtlIHRoZW0gc2FmZS4NCg0KVGhlIGRvY3VtZW50ZWQgY29uc3RyYWludCBp
cyBtb3JlIHNwZWNpZmljIHRoYW4gaXQgdXNlZCB0byBiZQ0KdGhlIEludGVsIHZlcnNpb24gcmVh
ZHM6DQoNCiAgICBXaGVuIGFjY2Vzc2luZyBhIGJpdCBpbiBtZW1vcnksIHRoZSBwcm9jZXNzb3Ig
bWF5IGFjY2VzcyA0IGJ5dGVzDQogICAgc3RhcnRpbmcgZnJvbSB0aGUgbWVtb3J5IGFkZHJlc3Mg
Zm9yIGEgMzItYml0IG9wZXJhbmQgc2l6ZSwgdXNpbmcNCiAgICBieSB0aGUgZm9sbG93aW5nIHJl
bGF0aW9uc2hpcDoNCiAgICAgICAgRWZmZWN0aXZlIEFkZHJlc3MgKyAoNCDiiJcgKEJpdE9mZnNl
dCBESVYgMzIpKQ0KICAgIE9yLCBpdCBtYXkgYWNjZXNzIDIgYnl0ZXMgc3RhcnRpbmcgZnJvbSB0
aGUgbWVtb3J5IGFkZHJlc3MgZm9yIGENCiAgICAxNi1iaXQgb3BlcmFuZCwgdXNpbmcgdGhpcyBy
ZWxhdGlvbnNoaXA6DQogICAgICAgIEVmZmVjdGl2ZSBBZGRyZXNzICsgKDIg4oiXIChCaXRPZmZz
ZXQgRElWIDE2KSkNCiAgICBJdCBtYXkgZG8gc28gZXZlbiB3aGVuIG9ubHkgYSBzaW5nbGUgYnl0
ZSBuZWVkcyB0byBiZSBhY2Nlc3NlZCB0bw0KICAgIHJlYWNoIHRoZSBnaXZlbiBiaXQuDQogICAg
V2hlbiB1c2luZyB0aGlzIGJpdCBhZGRyZXNzaW5nIG1lY2hhbmlzbSwgc29mdHdhcmUgc2hvdWxk
IGF2b2lkDQogICAgcmVmZXJlbmNpbmcgYXJlYXMgb2YgbWVtb3J5IGNsb3NlIHRvIGFkZHJlc3Mg
c3BhY2UgaG9sZXMuDQogICAgSW4gcGFydGljdWxhciwgaXQgc2hvdWxkIGF2b2lkIHJlZmVyZW5j
ZXMgdG8gbWVtb3J5LW1hcHBlZCBJL08gcmVnaXN0ZXJzLg0KICAgIEluc3RlYWQsIHNvZnR3YXJl
IHNob3VsZCB1c2UgdGhlIE1PViBpbnN0cnVjdGlvbnMgdG8gbG9hZCBmcm9tIG9yIHN0b3JlDQog
ICAgdG8gdGhlc2UgYWRkcmVzc2VzLCBhbmQgdXNlIHRoZSByZWdpc3RlciBmb3JtIG9mIHRoZXNl
IGluc3RydWN0aW9ucyB0bw0KICAgIG1hbmlwdWxhdGUgdGhlIGRhdGEuDQogICAgSW4gNjQtYml0
IG1vZGUsIHRoZSBpbnN0cnVjdGlvbuKAmXMgZGVmYXVsdCBvcGVyYXRpb24gc2l6ZSBpcyAzMiBi
aXRzLg0KICAgIFVzaW5nIGEgUkVYIHByZWZpeCBpbiB0aGUgZm9ybSBvZiBSRVguUiBwZXJtaXRz
IGFjY2VzcyB0byBhZGRpdGlvbmFsDQogICAgcmVnaXN0ZXJzIChSOC1SMTUpLiBVc2luZyBhIFJF
WCBwcmVmaXggaW4gdGhlIGZvcm0gb2YgUkVYLlcgcHJvbW90ZXMNCiAgICBvcGVyYXRpb24gdG8g
NjQgYml0IG9wZXJhbmRzLg0KDQoJRGF2aWQNCg0KDQo=

^ permalink raw reply

* [PATCH] KVM: PPC: Book3E 64: Fix IRQs warnings and hangs
From: Mihai Caraman @ 2013-05-03 16:11 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm

A change in the generic code highlighted that we were running with IRQs
(soft) enabled on Book3E 64-bit when trying to restart interrupts from
handle_exit(). This is a lesson to configure lockdep often :)

There is no reason to exit guest with soft_enabled == 1, a local_irq_enable()
call will do this for us so get rid of kvmppc_layz_ee() calls. With this fix
we eliminate irqs_disabled() warnings and some guest and host hangs revealed
under stress tests, but guests still exhibit some unresponsiveness.

The unresponsiveness has to do with the fact that arch_local_irq_restore()
does not guarantees to hard enable interrupts. To do so replace exception
function calls like timer_interrupt() with irq_happened flags. The
local_irq_enable() call takes care of replaying them and lets the interrupts
hard enabled.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
 arch/powerpc/kvm/booke.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 1020119..82f155e 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -673,7 +673,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 		ret = s;
 		goto out;
 	}
-	kvmppc_lazy_ee_enable();
 
 	kvm_guest_enter();
 
@@ -789,16 +788,16 @@ static void kvmppc_restart_interrupt(struct kvm_vcpu *vcpu,
 	switch (exit_nr) {
 	case BOOKE_INTERRUPT_EXTERNAL:
 		kvmppc_fill_pt_regs(&regs);
-		do_IRQ(&regs);
+		local_paca->irq_happened |= PACA_IRQ_EE;
 		break;
 	case BOOKE_INTERRUPT_DECREMENTER:
 		kvmppc_fill_pt_regs(&regs);
-		timer_interrupt(&regs);
+		local_paca->irq_happened |= PACA_IRQ_DEC;
 		break;
 #if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3E_64)
 	case BOOKE_INTERRUPT_DOORBELL:
 		kvmppc_fill_pt_regs(&regs);
-		doorbell_exception(&regs);
+		local_paca->irq_happened |= PACA_IRQ_DBELL;
 		break;
 #endif
 	case BOOKE_INTERRUPT_MACHINE_CHECK:
@@ -1148,8 +1147,6 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		if (s <= 0) {
 			local_irq_enable();
 			r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
-		} else {
-			kvmppc_lazy_ee_enable();
 		}
 	}
 
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH] KVM: PPC: Book3E 64: Fix IRQs warnings and hangs
From: Alexander Graf @ 2013-05-03 16:24 UTC (permalink / raw)
  To: Mihai Caraman; +Cc: linuxppc-dev, kvm@vger.kernel.org VIRTUAL MA..., kvm-ppc
In-Reply-To: <1367597470-22214-1-git-send-email-mihai.caraman@freescale.com>


On 03.05.2013, at 18:11, Mihai Caraman wrote:

> A change in the generic code highlighted that we were running with =
IRQs
> (soft) enabled on Book3E 64-bit when trying to restart interrupts from
> handle_exit(). This is a lesson to configure lockdep often :)
>=20
> There is no reason to exit guest with soft_enabled =3D=3D 1, a =
local_irq_enable()
> call will do this for us so get rid of kvmppc_layz_ee() calls. With =
this fix
> we eliminate irqs_disabled() warnings and some guest and host hangs =
revealed
> under stress tests, but guests still exhibit some unresponsiveness.
>=20
> The unresponsiveness has to do with the fact that =
arch_local_irq_restore()
> does not guarantees to hard enable interrupts. To do so replace =
exception
> function calls like timer_interrupt() with irq_happened flags. The
> local_irq_enable() call takes care of replaying them and lets the =
interrupts
> hard enabled.
>=20
> Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>

Ben, could you please review?


Alex

> ---
> arch/powerpc/kvm/booke.c |    9 +++------
> 1 files changed, 3 insertions(+), 6 deletions(-)
>=20
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 1020119..82f155e 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -673,7 +673,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, =
struct kvm_vcpu *vcpu)
> 		ret =3D s;
> 		goto out;
> 	}
> -	kvmppc_lazy_ee_enable();
>=20
> 	kvm_guest_enter();
>=20
> @@ -789,16 +788,16 @@ static void kvmppc_restart_interrupt(struct =
kvm_vcpu *vcpu,
> 	switch (exit_nr) {
> 	case BOOKE_INTERRUPT_EXTERNAL:
> 		kvmppc_fill_pt_regs(&regs);
> -		do_IRQ(&regs);
> +		local_paca->irq_happened |=3D PACA_IRQ_EE;
> 		break;
> 	case BOOKE_INTERRUPT_DECREMENTER:
> 		kvmppc_fill_pt_regs(&regs);
> -		timer_interrupt(&regs);
> +		local_paca->irq_happened |=3D PACA_IRQ_DEC;
> 		break;
> #if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3E_64)
> 	case BOOKE_INTERRUPT_DOORBELL:
> 		kvmppc_fill_pt_regs(&regs);
> -		doorbell_exception(&regs);
> +		local_paca->irq_happened |=3D PACA_IRQ_DBELL;
> 		break;
> #endif
> 	case BOOKE_INTERRUPT_MACHINE_CHECK:
> @@ -1148,8 +1147,6 @@ int kvmppc_handle_exit(struct kvm_run *run, =
struct kvm_vcpu *vcpu,
> 		if (s <=3D 0) {
> 			local_irq_enable();
> 			r =3D (s << 2) | RESUME_HOST | (r & =
RESUME_FLAG_NV);
> -		} else {
> -			kvmppc_lazy_ee_enable();
> 		}
> 	}
>=20
> --=20
> 1.7.4.1
>=20
>=20
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: net/eth/ibmveth: Fixup retrieval of MAC address
From: Ben Hutchings @ 2013-05-03 16:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: netdev, linuxppc-dev, David Miller, David Gibson
In-Reply-To: <1367458520.4389.6.camel@pasglop>

On Thu, 2013-05-02 at 11:35 +1000, Benjamin Herrenschmidt wrote:
> Some ancient pHyp versions used to create a 8 bytes local-mac-address
> property in the device-tree instead of a 6 bytes one for veth.
> 
> The Linux driver code to deal with that is an insane hack which also
> happens to break with some choices of MAC addresses in qemu by testing
> for a bit in the address rather than just looking at the size of the
> property.
> 
> Sanitize this by doing the latter instead.
[...]
> @@ -1334,11 +1334,19 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>  		dev->unit_address);
>  
>  	mac_addr_p = (unsigned char *)vio_get_attribute(dev, VETH_MAC_ADDR,
> -							NULL);
> +							&mac_len);
>  	if (!mac_addr_p) {
>  		dev_err(&dev->dev, "Can't find VETH_MAC_ADDR attribute\n");
>  		return -EINVAL;
>  	}
> +	/* Workaround for old/broken pHyp */
> +	if (mac_len == 8)
> +		mac_addr_p += 2;
> +	if (mac_len != 6) {

Missing 'else' before the second if?

> +		dev_err(&dev->dev, "VETH_MAC_ADDR attribute wrong len %d\n",
> +			mac_len);
> +		return -EINVAL;
> +	}
[...]

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3E 64: Fix IRQs warnings and hangs
From: Scott Wood @ 2013-05-03 18:04 UTC (permalink / raw)
  To: Mihai Caraman; +Cc: Mihai Caraman, linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <1367597470-22214-1-git-send-email-mihai.caraman@freescale.com>

On 05/03/2013 11:11:10 AM, Mihai Caraman wrote:
> A change in the generic code highlighted that we were running with =20
> IRQs
> (soft) enabled on Book3E 64-bit when trying to restart interrupts from
> handle_exit(). This is a lesson to configure lockdep often :)
>=20
> There is no reason to exit guest with soft_enabled =3D=3D 1, a =20
> local_irq_enable()
> call will do this for us so get rid of kvmppc_layz_ee() calls. With =20
> this fix
> we eliminate irqs_disabled() warnings and some guest and host hangs =20
> revealed
> under stress tests, but guests still exhibit some unresponsiveness.
>=20
> The unresponsiveness has to do with the fact that =20
> arch_local_irq_restore()
> does not guarantees to hard enable interrupts.

Could you elaborate?  If the saved IRQ state was "enabled", why =20
wouldn't arch_local_irq_restore() hard-enable IRQs?  The last thing it =20
does is __hard_irq_enable().

Where is the arch_local_irq_restore() instance you're talking about?

> To do so replace exception
> function calls like timer_interrupt() with irq_happened flags. The
> local_irq_enable() call takes care of replaying them and lets the =20
> interrupts
> hard enabled.

Not sure what you mean by "lets the interrupts hard enabled"... Do you =20
mean the EE bit in regs->msr, as opposed to the EE bit in the current =20
MSR?

> Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
> ---
>  arch/powerpc/kvm/booke.c |    9 +++------
>  1 files changed, 3 insertions(+), 6 deletions(-)
>=20
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 1020119..82f155e 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -673,7 +673,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, =20
> struct kvm_vcpu *vcpu)
>  		ret =3D s;
>  		goto out;
>  	}
> -	kvmppc_lazy_ee_enable();
>=20
>  	kvm_guest_enter();
>=20
> @@ -789,16 +788,16 @@ static void kvmppc_restart_interrupt(struct =20
> kvm_vcpu *vcpu,
>  	switch (exit_nr) {
>  	case BOOKE_INTERRUPT_EXTERNAL:
>  		kvmppc_fill_pt_regs(&regs);
> -		do_IRQ(&regs);
> +		local_paca->irq_happened |=3D PACA_IRQ_EE;
>  		break;
>  	case BOOKE_INTERRUPT_DECREMENTER:
>  		kvmppc_fill_pt_regs(&regs);
> -		timer_interrupt(&regs);
> +		local_paca->irq_happened |=3D PACA_IRQ_DEC;
>  		break;
>  #if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3E_64)
>  	case BOOKE_INTERRUPT_DOORBELL:
>  		kvmppc_fill_pt_regs(&regs);
> -		doorbell_exception(&regs);
> +		local_paca->irq_happened |=3D PACA_IRQ_DBELL;
>  		break;
>  #endif

Aren't you breaking 32-bit here?

-Scott=

^ permalink raw reply

* Re: [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64
From: Aneesh Kumar K.V @ 2013-05-03 18:49 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm
In-Reply-To: <20130503051528.GT13041@truffula.fritz.box>

David Gibson <dwg@au1.ibm.com> writes:

> On Mon, Apr 29, 2013 at 01:21:49AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We enable only if the we support 16MB page size.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/pgtable-ppc64.h |  3 +--
>>  arch/powerpc/mm/pgtable_64.c             | 28 ++++++++++++++++++++++++++++
>>  2 files changed, 29 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
>> index 97fc839..d65534b 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
>> @@ -426,8 +426,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
>>  	return pmd_val(pmd) >> PTE_RPN_SHIFT;
>>  }
>>  
>> -/* We will enable it in the last patch */
>> -#define has_transparent_hugepage() 0
>> +extern int has_transparent_hugepage(void);
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>>  static inline int pmd_young(pmd_t pmd)
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 54216c1..b742d6f 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -754,6 +754,34 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  	return;
>>  }
>>  
>> +int has_transparent_hugepage(void)
>> +{
>> +	if (!mmu_has_feature(MMU_FTR_16M_PAGE))
>> +		return 0;
>> +	/*
>> +	 * We support THP only if HPAGE_SHIFT is 16MB.
>> +	 */
>> +	if (!HPAGE_SHIFT || (HPAGE_SHIFT != mmu_psize_defs[MMU_PAGE_16M].shift))
>> +		return 0;
>
> Again, THP should not be dependent on the value of HPAGE_SHIFT.  Just
> checking that mmu_psize_defsz[MMU_PAGE_16M].shift == 24 should be
> sufficient (i.e. that 16M hugepages are supported).

done

+	/*
+	 * We support THP only if PMD_SIZE is 16MB.
+	 */
+	if (mmu_psize_defs[MMU_PAGE_16M].shift != PMD_SHIFT)
+		return 0;
+	/*


>
>> +	/*
>> +	 * We need to make sure that we support 16MB hugepage in a segement
>> +	 * with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
>> +	 * of 64K.
>> +	 */
>> +	/*
>> +	 * If we have 64K HPTE, we will be using that by default
>> +	 */
>> +	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
>> +	    (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
>> +		return 0;
>> +	/*
>> +	 * Ok we only have 4K HPTE
>> +	 */
>> +	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
>> +		return 0;
>
> Except you don't actually support THP on 4K base page size yet.


That is 64K linux page size and 4K HPTE . We do support that. The Linux
page size part is taken care by Kconfig. 

>
>> +
>> +	return 1;
>> +}
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>>  pmd_t pmdp_get_and_clear(struct mm_struct *mm,
>

-aneesh

^ permalink raw reply

* Re: [PATCH -V7 01/18] mm/THP: HPAGE_SHIFT is not a #define on some arch
From: Aneesh Kumar K.V @ 2013-05-03 18:51 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, paulus, linux-mm
In-Reply-To: <20130430050101.GY20202@truffula.fritz.box>

David Gibson <dwg@au1.ibm.com> writes:

> On Tue, Apr 30, 2013 at 09:12:09AM +0530, Aneesh Kumar K.V wrote:
>> David Gibson <dwg@au1.ibm.com> writes:
>> 
>> > On Mon, Apr 29, 2013 at 01:07:22AM +0530, Aneesh Kumar K.V wrote:
>> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> >> 
>> >> On archs like powerpc that support different hugepage sizes, HPAGE_SHIFT
>> >> and other derived values like HPAGE_PMD_ORDER are not constants. So move
>> >> that to hugepage_init
>> >
>> > These seems to miss the point.  Those variables may be defined in
>> > terms of HPAGE_SHIFT right now, but that is of itself kind of broken.
>> > The transparent hugepage mechanism only works if the hugepage size is
>> > equal to the PMD size - and PMD_SHIFT remains a compile time constant.
>> >
>> > There's no reason having transparent hugepage should force the PMD
>> > size of hugepage to be the default for other purposes - it should be
>> > possible to do THP as long as PMD-sized is a possible hugepage size.
>> >
>> 
>> THP code does
>> 
>> #define HPAGE_PMD_SHIFT HPAGE_SHIFT
>> #define HPAGE_PMD_MASK HPAGE_MASK
>> #define HPAGE_PMD_SIZE HPAGE_SIZE
>> 
>> I had two options, one to move all those in terms of PMD_SHIFT
>
> This is a much better option that you've taken now, and really
> shouldn't be that hard.  The THP code is much more strongly tied to
> the fact that it is a PMD than the fact that it's the same size as
> explicit huge pages.
>

Ok I tried the above and that turned out to be much simpler. I will have
to make sure i didn't break other archs that support THP.

-aneesh

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox