LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure
From: zhong jiang @ 2018-07-24 14:26 UTC
  To: Laurent Dufour
  Cc: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi, linux-kernel, linux-mm, haren, npiggin, bsingharora,
	paulmck, Tim Chen, linuxppc-dev, x86
In-Reply-To: <1526555193-7242-20-git-send-email-ldufour@linux.vnet.ibm.com>

On 2018/5/17 19:06, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Provide infrastructure to do a speculative fault (not holding
> mmap_sem).
>
> The not holding of mmap_sem means we can race against VMA
> change/removal and page-table destruction. We use the SRCU VMA freeing
> to keep the VMA around. We use the VMA seqcount to detect change
> (including unmapping / page-table deletion) and we use gup_fast() style
> page-table walking to deal with page-table races.
>
> Once we've obtained the page and are ready to update the PTE, we
> validate if the state we started the fault with is still valid, if
> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> PTE and we're done.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> [Manage the newly introduced pte_spinlock() for speculative page
>  fault to fail if the VMA is touched in our back]
> [Rename vma_is_dead() to vma_has_changed() and declare it here]
> [Fetch p4d and pud]
> [Set vmd.sequence in __handle_mm_fault()]
> [Abort speculative path when handle_userfault() has to be called]
> [Add additional VMA's flags checks in handle_speculative_fault()]
> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
> [Remove warning comment about waiting for !seq&1 since we don't want
>  to wait]
> [Remove warning about no huge page support, mention it explicitly]
> [Don't call do_fault() in the speculative path as __do_fault() calls
>  vma->vm_ops->fault() which may want to release mmap_sem]
> [Only vm_fault pointer argument for vma_has_changed()]
> [Fix check against huge page, calling pmd_trans_huge()]
> [Use READ_ONCE() when reading VMA's fields in the speculative path]
> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>  processing done in vm_normal_page()]
> [Check that vma->anon_vma is already set when starting the speculative
>  path]
> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>  the processing done in mpol_misplaced()]
> [Don't support VMA growing up or down]
> [Move check on vm_sequence just before calling handle_pte_fault()]
> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
> [Add mem cgroup oom check]
> [Use READ_ONCE to access p*d entries]
> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
> [Don't fetch pte again in handle_pte_fault() when running the speculative
>  path]
> [Check PMD against concurrent collapsing operation]
> [Try spin lock the pte during the speculative path to avoid deadlock with
>  other CPU's invalidating the TLB and requiring this CPU to catch the
>  inter processor's interrupt]
> [Move define of FAULT_FLAG_SPECULATIVE here]
> [Introduce __handle_speculative_fault() and add a check against
>  mm->mm_users in handle_speculative_fault() defined in mm.h]
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  include/linux/hugetlb_inline.h |   2 +-
>  include/linux/mm.h             |  30 ++++
>  include/linux/pagemap.h        |   4 +-
>  mm/internal.h                  |  16 +-
>  mm/memory.c                    | 340 ++++++++++++++++++++++++++++++++++++++++-
>  5 files changed, 385 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
> index 0660a03d37d9..9e25283d6fc9 100644
> --- a/include/linux/hugetlb_inline.h
> +++ b/include/linux/hugetlb_inline.h
> @@ -8,7 +8,7 @@
>  
>  static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>  {
> -	return !!(vma->vm_flags & VM_HUGETLB);
> +	return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>  }
>  
>  #else
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 05cbba70104b..31acf98a7d92 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -315,6 +315,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
>  
>  #define FAULT_FLAG_TRACE \
>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
> @@ -343,6 +344,10 @@ struct vm_fault {
>  	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>  	unsigned long address;		/* Faulting virtual address */
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	unsigned int sequence;
> +	pmd_t orig_pmd;			/* value of PMD at the time of fault */
> +#endif
>  	pmd_t *pmd;			/* Pointer to pmd entry matching
>  					 * the 'address' */
>  	pud_t *pud;			/* Pointer to pud entry matching
> @@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page);
>  #ifdef CONFIG_MMU
>  extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		unsigned int flags);
> +
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +extern int __handle_speculative_fault(struct mm_struct *mm,
> +				      unsigned long address,
> +				      unsigned int flags);
> +static inline int handle_speculative_fault(struct mm_struct *mm,
> +					   unsigned long address,
> +					   unsigned int flags)
> +{
> +	/*
> +	 * Try speculative page fault for multithreaded user space task only.
> +	 */
> +	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
> +		return VM_FAULT_RETRY;
> +	return __handle_speculative_fault(mm, address, flags);
> +}
> +#else
> +static inline int handle_speculative_fault(struct mm_struct *mm,
> +					   unsigned long address,
> +					   unsigned int flags)
> +{
> +	return VM_FAULT_RETRY;
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>  			    unsigned long address, unsigned int fault_flags,
>  			    bool *unlocked);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index b1bd2186e6d2..6e2aa4e79af7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>  	pgoff_t pgoff;
>  	if (unlikely(is_vm_hugetlb_page(vma)))
>  		return linear_hugepage_index(vma, address);
> -	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
> -	pgoff += vma->vm_pgoff;
> +	pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
> +	pgoff += READ_ONCE(vma->vm_pgoff);
>  	return pgoff;
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index fb2667b20f0a..10b188c87fa4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -44,7 +44,21 @@ int do_swap_page(struct vm_fault *vmf);
>  extern struct vm_area_struct *get_vma(struct mm_struct *mm,
>  				      unsigned long addr);
>  extern void put_vma(struct vm_area_struct *vma);
> -#endif
> +
> +static inline bool vma_has_changed(struct vm_fault *vmf)
> +{
> +	int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
> +	unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
> +
> +	/*
> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
> +	 * the wmb in vma_rb_erase().
> +	 */
> +	smp_rmb();
> +
> +	return ret || seq != vmf->sequence;
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>  
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
> diff --git a/mm/memory.c b/mm/memory.c
> index ab32b0b4bd69..7bbbb8c7b9cd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -769,7 +769,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  	if (page)
>  		dump_page(page, "bad pte");
>  	pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
> -		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> +		 (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
> +		 mapping, index);
>  	pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
>  		 vma->vm_file,
>  		 vma->vm_ops ? vma->vm_ops->fault : NULL,
> @@ -2306,6 +2307,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static bool pte_spinlock(struct vm_fault *vmf)
> +{
> +	bool ret = false;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	pmd_t pmdval;
> +#endif
> +
> +	/* Check if vma is still valid */
> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> +		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +		spin_lock(vmf->ptl);
> +		return true;
> +	}
> +
> +again:
> +	local_irq_disable();
> +	if (vma_has_changed(vmf))
> +		goto out;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 * We check if the pmd value is still the same to ensure that there
> +	 * is not a huge collapse operation in progress in our back.
> +	 */
> +	pmdval = READ_ONCE(*vmf->pmd);
> +	if (!pmd_same(pmdval, vmf->orig_pmd))
> +		goto out;
> +#endif
> +
> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +	if (unlikely(!spin_trylock(vmf->ptl))) {
> +		local_irq_enable();
> +		goto again;
> +	}
> +
> +	if (vma_has_changed(vmf)) {
> +		spin_unlock(vmf->ptl);
> +		goto out;
> +	}
> +
> +	ret = true;
> +out:
> +	local_irq_enable();
> +	return ret;
> +}
> +
> +static bool pte_map_lock(struct vm_fault *vmf)
> +{
> +	bool ret = false;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	pmd_t pmdval;
> +#endif
> +
> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> +		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +					       vmf->address, &vmf->ptl);
> +		return true;
> +	}
> +
> +	/*
> +	 * The first vma_has_changed() guarantees the page-tables are still
> +	 * valid, having IRQs disabled ensures they stay around, hence the
> +	 * second vma_has_changed() to make sure they are still valid once
> +	 * we've got the lock. After that a concurrent zap_pte_range() will
> +	 * block on the PTL and thus we're safe.
> +	 */
> +again:
> +	local_irq_disable();
> +	if (vma_has_changed(vmf))
> +		goto out;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 * We check if the pmd value is still the same to ensure that there
> +	 * is not a huge collapse operation in progress in our back.
> +	 */
> +	pmdval = READ_ONCE(*vmf->pmd);
> +	if (!pmd_same(pmdval, vmf->orig_pmd))
> +		goto out;
> +#endif
> +
> +	/*
> +	 * Same as pte_offset_map_lock() except that we call
> +	 * spin_trylock() in place of spin_lock() to avoid race with
> +	 * unmap path which may have the lock and wait for this CPU
> +	 * to invalidate TLB but this CPU has irq disabled.
> +	 * Since we are in a speculative patch, accept it could fail
> +	 */
> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +	pte = pte_offset_map(vmf->pmd, vmf->address);
> +	if (unlikely(!spin_trylock(ptl))) {
> +		pte_unmap(pte);
> +		local_irq_enable();
> +		goto again;
> +	}
> +
> +	if (vma_has_changed(vmf)) {
> +		pte_unmap_unlock(pte, ptl);
> +		goto out;
> +	}
> +
> +	vmf->pte = pte;
> +	vmf->ptl = ptl;
> +	ret = true;
> +out:
> +	local_irq_enable();
> +	return ret;
> +}
> +#else
>  static inline bool pte_spinlock(struct vm_fault *vmf)
>  {
>  	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> @@ -2319,6 +2432,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>  				       vmf->address, &vmf->ptl);
>  	return true;
>  }
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>  
>  /*
>   * handle_pte_fault chooses page fault handler according to an entry which was
> @@ -3208,6 +3322,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  		ret = check_stable_address_space(vma->vm_mm);
>  		if (ret)
>  			goto unlock;
> +		/*
> +		 * Don't call the userfaultfd during the speculative path.
> +		 * We already checked for the VMA to not be managed through
> +		 * userfaultfd, but it may be set in our back once we have lock
> +		 * the pte. In such a case we can ignore it this time.
> +		 */
> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> +			goto setpte;
>  		/* Deliver the page fault to userland, check inside PT lock */
>  		if (userfaultfd_missing(vma)) {
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3249,7 +3371,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  		goto unlock_and_release;
>  
>  	/* Deliver the page fault to userland, check inside PT lock */
> -	if (userfaultfd_missing(vma)) {
> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) && userfaultfd_missing(vma)) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  		mem_cgroup_cancel_charge(page, memcg, false);
>  		put_page(page);
> @@ -3994,13 +4116,22 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  
>  	if (unlikely(pmd_none(*vmf->pmd))) {
>  		/*
> +		 * In the case of the speculative page fault handler we abort
> +		 * the speculative path immediately as the pmd is probably
> +		 * in the way to be converted in a huge one. We will try
> +		 * again holding the mmap_sem (which implies that the collapse
> +		 * operation is done).
> +		 */
> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> +			return VM_FAULT_RETRY;
> +		/*
>  		 * Leave __pte_alloc() until later: because vm_ops->fault may
>  		 * want to allocate huge page, and if we expose page table
>  		 * for an instant, it will be difficult to retract from
>  		 * concurrent faults and from rmap lookups.
>  		 */
>  		vmf->pte = NULL;
> -	} else {
> +	} else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>  		/* See comment in pte_alloc_one_map() */
>  		if (pmd_devmap_trans_unstable(vmf->pmd))
>  			return 0;
> @@ -4009,6 +4140,9 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  		 * pmd from under us anymore at this point because we hold the
>  		 * mmap_sem read mode and khugepaged takes it in write mode.
>  		 * So now it's safe to run pte_offset_map().
> +		 * This is not applicable to the speculative page fault handler
> +		 * but in that case, the pte is fetched earlier in
> +		 * handle_speculative_fault().
>  		 */
>  		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>  		vmf->orig_pte = *vmf->pte;
> @@ -4031,6 +4165,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  	if (!vmf->pte) {
>  		if (vma_is_anonymous(vmf->vma))
>  			return do_anonymous_page(vmf);
> +		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> +			return VM_FAULT_RETRY;
>  		else
>  			return do_fault(vmf);
>  	}
> @@ -4128,6 +4264,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>  	if (!vmf.pmd)
>  		return VM_FAULT_OOM;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
> +#endif
>  	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
>  		ret = create_huge_pmd(&vmf);
>  		if (!(ret & VM_FAULT_FALLBACK))
> @@ -4161,6 +4300,201 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	return handle_pte_fault(&vmf);
>  }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +/*
> + * Tries to handle the page fault in a speculative way, without grabbing the
> + * mmap_sem.
> + */
> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +			       unsigned int flags)
> +{
> +	struct vm_fault vmf = {
> +		.address = address,
> +	};
> +	pgd_t *pgd, pgdval;
> +	p4d_t *p4d, p4dval;
> +	pud_t pudval;
> +	int seq, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +#ifdef CONFIG_NUMA
> +	struct mempolicy *pol;
> +#endif
> +
> +	/* Clear flags that may lead to release the mmap_sem to retry */
> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +	flags |= FAULT_FLAG_SPECULATIVE;
> +
> +	vma = get_vma(mm, address);
> +	if (!vma)
> +		return ret;
> +
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if (seq & 1)
> +		goto out_put;
> +
> +	/*
> +	 * Can't call vm_ops service has we don't know what they would do
> +	 * with the VMA.
> +	 * This include huge page from hugetlbfs.
> +	 */
> +	if (vma->vm_ops)
> +		goto out_put;
> +
Hi Laurent,

I think that most page faults will leave the speculative path here. Is there any case that really needs to be skipped?
I have tested the following patch, and it works well.

diff --git a/mm/memory.c b/mm/memory.c
index 936128b..9bc1545 100644
@@ -3893,8 +3898,6 @@ static int handle_pte_fault(struct fault_env *fe)
        if (!fe->pte) {
                if (vma_is_anonymous(fe->vma))
                        return do_anonymous_page(fe);
-               else if (fe->flags & FAULT_FLAG_SPECULATIVE)
-                       return VM_FAULT_RETRY;
                else
                        return do_fault(fe);
        }
@@ -4026,20 +4029,11 @@ int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
                goto out_put;
        }
        /*
-        * Can't call vm_ops service has we don't know what they would do
-        * with the VMA.
-        * This include huge page from hugetlbfs.
-        */
-       if (vma->vm_ops) {
-               trace_spf_vma_notsup(_RET_IP_, vma, address);
-               goto out_put;
-       }


Thanks
zhong jiang
> +	/*
> +	 * __anon_vma_prepare() requires the mmap_sem to be held
> +	 * because vm_next and vm_prev must be safe. This can't be guaranteed
> +	 * in the speculative path.
> +	 */
> +	if (unlikely(!vma->anon_vma))
> +		goto out_put;
> +
> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +	/* Can't call userland page fault handler in the speculative path */
> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +		goto out_put;
> +
> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +		/*
> +		 * This could be detected by the check address against VMA's
> +		 * boundaries but we want to trace it as not supported instead
> +		 * of changed.
> +		 */
> +		goto out_put;
> +
> +	if (address < READ_ONCE(vma->vm_start)
> +	    || READ_ONCE(vma->vm_end) <= address)
> +		goto out_put;
> +
> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +				       flags & FAULT_FLAG_INSTRUCTION,
> +				       flags & FAULT_FLAG_REMOTE)) {
> +		ret = VM_FAULT_SIGSEGV;
> +		goto out_put;
> +	}
> +
> +	/* This is one is required to check that the VMA has write access set */
> +	if (flags & FAULT_FLAG_WRITE) {
> +		if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
> +			ret = VM_FAULT_SIGSEGV;
> +			goto out_put;
> +		}
> +	} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
> +		ret = VM_FAULT_SIGSEGV;
> +		goto out_put;
> +	}
> +
> +#ifdef CONFIG_NUMA
> +	/*
> +	 * MPOL_INTERLEAVE implies additional checks in
> +	 * mpol_misplaced() which are not compatible with the
> +	 *speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto out_put;
> +#endif
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	pgdval = READ_ONCE(*pgd);
> +	if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
> +		goto out_walk;
> +
> +	p4d = p4d_offset(pgd, address);
> +	p4dval = READ_ONCE(*p4d);
> +	if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
> +		goto out_walk;
> +
> +	vmf.pud = pud_offset(p4d, address);
> +	pudval = READ_ONCE(*vmf.pud);
> +	if (pud_none(pudval) || unlikely(pud_bad(pudval)))
> +		goto out_walk;
> +
> +	/* Huge pages at PUD level are not supported. */
> +	if (unlikely(pud_trans_huge(pudval)))
> +		goto out_walk;
> +
> +	vmf.pmd = pmd_offset(vmf.pud, address);
> +	vmf.orig_pmd = READ_ONCE(*vmf.pmd);
> +	/*
> +	 * pmd_none could mean that a hugepage collapse is in progress
> +	 * in our back as collapse_huge_page() mark it before
> +	 * invalidating the pte (which is done once the IPI is catched
> +	 * by all CPU and we have interrupt disabled).
> +	 * For this reason we cannot handle THP in a speculative way since we
> +	 * can't safely indentify an in progress collapse operation done in our
> +	 * back on that PMD.
> +	 * Regarding the order of the following checks, see comment in
> +	 * pmd_devmap_trans_unstable()
> +	 */
> +	if (unlikely(pmd_devmap(vmf.orig_pmd) ||
> +		     pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
> +		     is_swap_pmd(vmf.orig_pmd)))
> +		goto out_walk;
> +
> +	/*
> +	 * The above does not allocate/instantiate page-tables because doing so
> +	 * would lead to the possibility of instantiating page-tables after
> +	 * free_pgtables() -- and consequently leaking them.
> +	 *
> +	 * The result is that we take at least one !speculative fault per PMD
> +	 * in order to instantiate it.
> +	 */
> +
> +	vmf.pte = pte_offset_map(vmf.pmd, address);
> +	vmf.orig_pte = READ_ONCE(*vmf.pte);
> +	barrier(); /* See comment in handle_pte_fault() */
> +	if (pte_none(vmf.orig_pte)) {
> +		pte_unmap(vmf.pte);
> +		vmf.pte = NULL;
> +	}
> +
> +	vmf.vma = vma;
> +	vmf.pgoff = linear_page_index(vma, address);
> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
> +	vmf.sequence = seq;
> +	vmf.flags = flags;
> +
> +	local_irq_enable();
> +
> +	/*
> +	 * We need to re-validate the VMA after checking the bounds, otherwise
> +	 * we might have a false positive on the bounds.
> +	 */
> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
> +		goto out_put;
> +
> +	mem_cgroup_oom_enable();
> +	ret = handle_pte_fault(&vmf);
> +	mem_cgroup_oom_disable();
> +
> +	put_vma(vma);
> +
> +	/*
> +	 * The task may have entered a memcg OOM situation but
> +	 * if the allocation error was handled gracefully (no
> +	 * VM_FAULT_OOM), there is no need to kill anything.
> +	 * Just clean up the OOM state peacefully.
> +	 */
> +	if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> +		mem_cgroup_oom_synchronize(false);
> +	return ret;
> +
> +out_walk:
> +	local_irq_enable();
> +out_put:
> +	put_vma(vma);
> +	return ret;
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
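
For readers following the thread, the validation scheme the patch relies on (sample the VMA sequence count, give up if it is odd, read the fields, then check that the count has not moved) can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the kernel code: struct toy_vma and every toy_* function are hypothetical names, and sequentially consistent C11 atomics stand in for the kernel's seqcount primitives and barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA guarded by a sequence count. */
struct toy_vma {
	atomic_uint sequence;	/* odd while a writer is changing the fields */
	atomic_ulong vm_start;
	atomic_ulong vm_end;
};

/* Writer side: the count is odd between begin and end. */
void toy_write_begin(struct toy_vma *vma)
{
	atomic_fetch_add(&vma->sequence, 1);
}

void toy_write_end(struct toy_vma *vma)
{
	atomic_fetch_add(&vma->sequence, 1);
}

/* Sample the count; returns false if a writer is active (odd count). */
bool toy_read_begin(struct toy_vma *vma, unsigned int *seq)
{
	*seq = atomic_load(&vma->sequence);
	return !(*seq & 1);
}

/* True if the count moved since it was sampled. */
bool toy_has_changed(struct toy_vma *vma, unsigned int seq)
{
	return atomic_load(&vma->sequence) != seq;
}

/*
 * Speculative bounds check without any lock.
 * Returns 0 and sets *inside on success, -1 if the caller must retry
 * on the slow (locked) path.
 */
int toy_speculative_contains(struct toy_vma *vma, unsigned long addr,
			     bool *inside)
{
	unsigned int seq;
	unsigned long start, end;

	if (!toy_read_begin(vma, &seq))
		return -1;

	start = atomic_load(&vma->vm_start);
	end = atomic_load(&vma->vm_end);

	/* Re-validate: the values are only trusted if nothing changed. */
	if (toy_has_changed(vma, seq))
		return -1;

	*inside = addr >= start && addr < end;
	return 0;
}
```

A caller that gets -1 would fall back to the regular path, which corresponds to the VM_FAULT_RETRY result in the patch.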

* RE: [PATCH] net: ethernet: fs-enet: Use generic CRC32 implementation
From: David Laight @ 2018-07-24 11:22 UTC
  To: 'Krzysztof Kozlowski'
  Cc: Pantelis Antoniou, David S. Miller, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Eric Biggers
In-Reply-To: <CAJKOXPdveQeg_M42v69m9vBEz2SaJE0_BZNRUqGLt4Ve+ZLYSg@mail.gmail.com>

From: Krzysztof Kozlowski
> Sent: 24 July 2018 12:12
...
> >> Not tested on hardware.
> >
> > Have you verified that the old and new functions give the
> > same result for a few mac addresses?
> > It is very easy to use the wrong bits in crc calculations
> > or generate the output in the wrong bit order.
> 
> I copied the original code and new one onto a different driver and run
> this in a loop for thousands of data input (although not all possible
> MAC combinations). The output was the same. I agree however that real
> testing would be important.

Since CRC are linear you only need to check that each input
bit generates the correct output.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
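
The point that a CRC is linear, so checking each input bit is sufficient, can be made concrete with a short C sketch. Strictly, a CRC with a non-zero initial value and a final XOR is affine rather than linear, so the sketch XORs out the CRC of the all-zero message. The CRC variant (reflected CRC-32, polynomial 0xEDB88320), the 6-byte message length and both function names are assumptions chosen for illustration; the driver's hash may use a different variant.

```c
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6

/* Bitwise reflected CRC-32, initial value and final XOR 0xFFFFFFFF. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/*
 * Rebuild the CRC of a 6-byte address from the CRC of the all-zero
 * message plus the contribution of each set input bit.
 */
uint32_t crc32_from_single_bits(const uint8_t mac[MAC_LEN])
{
	const uint8_t zero[MAC_LEN] = { 0 };
	const uint32_t base = crc32_bitwise(zero, MAC_LEN);
	uint32_t crc = base;

	for (size_t byte = 0; byte < MAC_LEN; byte++) {
		for (int bit = 0; bit < 8; bit++) {
			uint8_t probe[MAC_LEN] = { 0 };

			if (!(mac[byte] & (1u << bit)))
				continue;

			/* Message with exactly this one bit set. */
			probe[byte] = (uint8_t)(1u << bit);
			crc ^= crc32_bitwise(probe, MAC_LEN) ^ base;
		}
	}
	return crc;
}
```

If two implementations are both affine in this sense, agreeing on the all-zero input and on the 48 single-bit inputs is enough for them to agree on every 6-byte address.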

* [PATCH] powerpc: Add a checkpatch wrapper with our preferred settings
From: Michael Ellerman @ 2018-07-24 14:03 UTC
  To: linuxppc-dev

This makes it easy to run checkpatch with settings that we have agreed
on (bwhahahahah).

Usage is eg:

  $ ./arch/powerpc/tools/checkpatch.sh -g origin/master..

To check all commits since origin/master.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/tools/checkpatch.sh | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100755 arch/powerpc/tools/checkpatch.sh

diff --git a/arch/powerpc/tools/checkpatch.sh b/arch/powerpc/tools/checkpatch.sh
new file mode 100755
index 000000000000..4c2ac4655e26
--- /dev/null
+++ b/arch/powerpc/tools/checkpatch.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Copyright 2018, Michael Ellerman, IBM Corporation.
+#
+# Wrapper around checkpatch that uses our preferred settings
+
+script_base=$(realpath $(dirname $0))
+
+exec $script_base/../../../scripts/checkpatch.pl \
+	--subjective \
+	--max-line-length=90 \
+	--show-types \
+	--ignore ARCH_INCLUDE_LINUX \
+	--ignore BIT_MACRO \
+	--ignore COMPARISON_TO_NULL \
+	--ignore EMAIL_SUBJECT \
+	--ignore FILE_PATH_CHANGES \
+	--ignore GLOBAL_INITIALISERS \
+	--ignore LINE_SPACING \
+	--ignore MULTIPLE_ASSIGNMENTS \
+	$@
-- 
2.14.1

* Re: [v3, 1/9] powerpc/pkeys: Give all threads control of their key permissions
From: Michael Ellerman @ 2018-07-24 13:59 UTC
  To: Ram Pai
  Cc: fweimer, Ulrich.Weigand, linuxram, mhocko, bauerman, msuchanek,
	linuxppc-dev
In-Reply-To: <1531835470-32691-2-git-send-email-linuxram@us.ibm.com>

On Tue, 2018-07-17 at 13:51:02 UTC, Ram Pai wrote:
> Currently in a multithreaded application, a key allocated by one
> thread is not usable by other threads. By "not usable" we mean that
> other threads are unable to change the access permissions for that
> key for themselves.
> 
> When a new key is allocated in one thread, the corresponding UAMOR
> bits for that thread get enabled, however the UAMOR bits for that key
> for all other threads remain disabled.
> 
> Other threads have no way to set permissions on the key, and the
> current default permissions are that read/write is enabled for all
> keys, which means the key has no effect for other threads. Although
> that may be the desired behaviour in some circumstances, having all
> threads able to control their permissions for the key is more
> flexible.
> 
> The current behaviour also differs from the x86 behaviour, which is
> problematic for users.
> 
> To fix this, enable the UAMOR bits for all keys, at process
> creation (in start_thread(), ie exec time). Since the contents of
> UAMOR are inherited at fork, all threads are capable of modifying the
> permissions on any key.
> 
> This is technically an ABI break on powerpc, but pkey support is fairly
> new on powerpc and not widely used, and this brings us into
> line with x86.
> 
> Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
> Cc: stable@vger.kernel.org # v4.16+
> Tested-by: Florian Weimer <fweimer@redhat.com>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> [mpe: Reword some of the changelog]
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a57a04c76e06822e4377831611364c

cheers

* Re: [1/2] powerpc: Add ppc32_allmodconfig defconfig target
From: Michael Ellerman @ 2018-07-24 13:59 UTC
  To: Michael Ellerman, linuxppc-dev; +Cc: malat, rdunlap, npiggin
In-Reply-To: <20180709142426.26999-1-mpe@ellerman.id.au>

On Mon, 2018-07-09 at 14:24:25 UTC, Michael Ellerman wrote:
> Because the allmodconfig logic just sets every symbol to M or Y, it
> has the effect of always generating a 64-bit config, because
> CONFIG_PPC64 becomes Y.
> 
> So to make it easier for folks to test 32-bit code, provide a phony
> defconfig target that generates a 32-bit allmodconfig.
> 
> The 32-bit port has several mutually exclusive CPU types, we choose
> the Book3S variants as that's what the help text in Kconfig says is
> most common.
> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/8db0c9d416f26018cb7cabfb0b144f

cheers

* Re: powerpc64s: Show ori31 availability in spectre_v1 sysfs file not v2
From: Michael Ellerman @ 2018-07-24 13:59 UTC
  To: Michael Ellerman, linuxppc-dev; +Cc: msuchanek, anton, npiggin
In-Reply-To: <20180709062521.693-1-mpe@ellerman.id.au>

On Mon, 2018-07-09 at 06:25:21 UTC, Michael Ellerman wrote:
> When I added the spectre_v2 information in sysfs, I included the
> availability of the ori31 speculation barrier.
> 
> Although the ori31 barrier can be used to mitigate v2, it's primarily
> intended as a spectre v1 mitigation. Spectre v2 is mitigated by
> hardware changes.
> 
> So rework the sysfs files to show the ori31 information in the
> spectre_v1 file, rather than v2.
> 
> Currently we display eg:
> 
>   $ grep . spectre_v*
>   spectre_v1:Mitigation: __user pointer sanitization
>   spectre_v2:Mitigation: Indirect branch cache disabled, ori31 speculation barrier enabled
> 
> After:
> 
>   $ grep . spectre_v*
>   spectre_v1:Mitigation: __user pointer sanitization, ori31 speculation barrier enabled
>   spectre_v2:Mitigation: Indirect branch cache disabled
> 
> Fixes: d6fbe1c55c55 ("powerpc/64s: Wire up cpu_show_spectre_v2()")
> Cc: stable@vger.kernel.org # v4.17+
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Applied to powerpc next.

https://git.kernel.org/powerpc/c/6d44acae1937b81cf8115ada8958e0

cheers

* Re: powerpc/mm/hash: Improve error reporting on HCALL failures
From: Michael Ellerman @ 2018-07-24 13:59 UTC
  To: Aneesh Kumar K.V, npiggin, benh, paulus; +Cc: Aneesh Kumar K.V, linuxppc-dev
In-Reply-To: <20180629083904.29250-1-aneesh.kumar@linux.ibm.com>

On Fri, 2018-06-29 at 08:39:04 UTC, "Aneesh Kumar K.V" wrote:
> This patch adds error reporting to the H_ENTER and H_READ hcalls. A failure of
> either of these hcalls is mostly fatal, and it would be good to log the failure
> reason.
> 
> We also switch printk to pr_*
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ca42d8d2d6c55822fa8f1d230ffa3b

cheers

* Re: [1/3] powerpc/mm/hash: Remove the superfluous bitwise operation when find hpte group
From: Michael Ellerman @ 2018-07-24 13:59 UTC
  To: Aneesh Kumar K.V, npiggin, benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <20180629083631.28017-1-aneesh.kumar@linux.ibm.com>

On Fri, 2018-06-29 at 08:36:29 UTC, "Aneesh Kumar K.V" wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> When computing the starting slot number for a hash page table group we used
> to do this
> hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
> 
> Multiplying by 8 (HPTES_PER_GROUP) implies the last three bits are 0. Hence we
> really don't need to clear them separately.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/1531cff44b5bb30c899404c044805e

cheers
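
The reasoning in the changelog (a multiple of 8 already has its low three bits clear, so the extra mask is a no-op) can be checked with a small C sketch. The two function names are hypothetical and uint64_t is used in place of unsigned long so the sketch behaves the same on 32-bit and 64-bit hosts; only HPTES_PER_GROUP and the expression itself come from the quoted text.

```c
#include <stdint.h>

#define HPTES_PER_GROUP 8

/* Old form: multiply, then explicitly clear the low three bits. */
uint64_t hpte_group_masked(uint64_t hash, uint64_t htab_hash_mask)
{
	return ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~UINT64_C(0x7);
}

/* New form: the multiplication by 8 already leaves the low bits zero. */
uint64_t hpte_group_plain(uint64_t hash, uint64_t htab_hash_mask)
{
	return (hash & htab_hash_mask) * HPTES_PER_GROUP;
}
```

Multiplying by 8 is a left shift by three, which holds even when the product wraps around, so the two forms agree for every input.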

* Re: [1/2] powerpc/mm: Check memblock_add against MAX_PHYSMEM_BITS range
From: Michael Ellerman @ 2018-07-24 13:59 UTC
  To: Aneesh Kumar K.V, npiggin, benh, paulus; +Cc: Aneesh Kumar K.V, linuxppc-dev
In-Reply-To: <20180621083158.30849-1-aneesh.kumar@linux.ibm.com>

On Thu, 2018-06-21 at 08:31:57 UTC, "Aneesh Kumar K.V" wrote:
> With SPARSEMEM config enabled, we make sure that we don't add sections beyond
> MAX_PHYSMEM_BITS range. This results in not building vmemmap mapping for
> range beyond max range. But our memblock layer looks at the device tree and
> creates mapping for the full memory range. Prevent this by checking against
> MAX_PHYSMEM_BITS when doing memblock_add.
> 
> We don't do a similar check for memblock_reserve_range. If the reserve range is
> beyond MAX_PHYSMEM_BITS we expect that to be configured with 'nomap'. Any other
> reserved range should come from existing memblock ranges which we already
> filtered while adding.
> 
> This avoids a crash like the one below when running on a system with system RAM
> configured above MAX_PHYSMEM_BITS:
> 
>  Unable to handle kernel paging request for data at address 0xc00a001000000440
>  Faulting instruction address: 0xc000000001034118
>  cpu 0x0: Vector: 300 (Data Access) at [c00000000124fb30]
>      pc: c000000001034118: __free_pages_bootmem+0xc0/0x1c0
>      lr: c00000000103b258: free_all_bootmem+0x19c/0x22c
>      sp: c00000000124fdb0
>     msr: 9000000002001033
>     dar: c00a001000000440
>   dsisr: 40000000
>    current = 0xc00000000120dd00
>    paca    = 0xc000000001f60000   irqmask: 0x03   irq_happened: 0x01
>      pid   = 0, comm = swapper
>  [c00000000124fe20] c00000000103b258 free_all_bootmem+0x19c/0x22c
>  [c00000000124fee0] c000000001010a68 mem_init+0x3c/0x5c
>  [c00000000124ff00] c00000000100401c start_kernel+0x298/0x5e4
>  [c00000000124ff90] c00000000000b57c start_here_common+0x1c/0x520
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
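
The kind of range check described above could look roughly like the sketch
below. This is an assumption-laden illustration, not the patch itself: the
helper name and the example limit are hypothetical, and the commit message
does not say whether an out-of-range region is truncated or rejected. This
sketch truncates.

```c
#include <stdint.h>

/* Hypothetical value for illustration only; the real limit is per-config. */
#define EXAMPLE_MAX_PHYSMEM_BITS 46

/*
 * Returns how many bytes of [base, base + size) lie below 'limit'.
 * A region entirely above the limit yields 0, a straddling region is
 * truncated, and a region entirely below the limit is returned unchanged.
 */
uint64_t usable_size_below_limit(uint64_t base, uint64_t size, uint64_t limit)
{
    if (base >= limit)
        return 0;
    if (size > limit - base)
        return limit - base;
    return size;
}
```

A caller would only add the region when the returned size is non-zero, using
((uint64_t)1 << EXAMPLE_MAX_PHYSMEM_BITS) as the limit.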

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6aba0c84ec474534bbae3675e95464

cheers

^ permalink raw reply

* Re: [v8, 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N.  Rao, Simon Guo, Cyril Bur
In-Reply-To: <1528336675-10879-2-git-send-email-wei.guo.simon@gmail.com>

On Thu, 2018-06-07 at 01:57:51 UTC, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
> 
> Currently memcmp() 64bytes version in powerpc will fall back to .Lshort
> (compare per byte mode) if either src or dst address is not 8 bytes aligned.
> It can be optimized in 2 situations:
> 
> 1) If both addresses have the same offset from an 8-byte boundary:
> memcmp() can first compare the unaligned bytes up to the 8-byte boundary
> and then compare the remaining 8-byte-aligned content in .Llong mode.
> 
> 2) If the src/dst addresses do not have the same offset from an 8-byte boundary:
> memcmp() can align the src address to 8 bytes, increment the dst address
> accordingly, then load src in aligned mode and dst in unaligned mode.
> 
> This patch optimizes memcmp() behavior in the above 2 situations.
> 
> Tested with both little/big endian. Performance result below is based on
> little endian.
> 
> Following is the test result for the case where src/dst have the same offset
> (a similar result was observed when src/dst have different offsets):
> (1) 256 bytes
> Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
> - without patch
> 	29.773018302 seconds time elapsed                                          ( +- 0.09% )
> - with patch
> 	16.485568173 seconds time elapsed                                          ( +-  0.02% )
> 		-> There is ~+80% improvement
> 
> (2) 32 bytes
> To observe performance impact on < 32 bytes, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
>  #include <string.h>
>  #include "utils.h"
> 
> -#define SIZE 256
> +#define SIZE 32
>  #define ITERATIONS 10000
> 
>  int test_memcmp(const void *s1, const void *s2, size_t n);
> --------
> 
> - Without patch
> 	0.244746482 seconds time elapsed                                          ( +-  0.36%)
> - with patch
> 	0.215069477 seconds time elapsed                                          ( +-  0.51%)
> 		-> There is ~+13% improvement
> 
> (3) 0~8 bytes
> To observe <8 bytes performance impact, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
>  #include <string.h>
>  #include "utils.h"
> 
> -#define SIZE 256
> -#define ITERATIONS 10000
> +#define SIZE 8
> +#define ITERATIONS 1000000
> 
>  int test_memcmp(const void *s1, const void *s2, size_t n);
> -------
> - Without patch
>        1.845642503 seconds time elapsed                                          ( +- 0.12% )
> - With patch
>        1.849767135 seconds time elapsed                                          ( +- 0.26% )
> 		-> They are nearly the same. (-0.2%)
> 
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
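
Situation 1 above (same offset on both sides) can be sketched in portable C
as follows. This is only an illustration of the idea, not the powerpc
assembly from the patch: the function names are hypothetical, the byte loop
stands in for .Lshort, and situation 2 simply falls back to the byte loop
here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time comparison, standing in for the ".Lshort" path. */
int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}

/* Returns <0, 0 or >0 like memcmp(). */
int memcmp_sketch(const void *s1, const void *s2, size_t n)
{
    const unsigned char *a = s1, *b = s2;

    /* Different offsets: this sketch just uses the byte loop. */
    if (((uintptr_t)a & 7) != ((uintptr_t)b & 7))
        return cmp_bytes(a, b, n);

    /* Head: compare bytes until both pointers reach an 8-byte boundary. */
    size_t head = (8 - ((uintptr_t)a & 7)) & 7;
    if (head > n)
        head = n;
    int r = cmp_bytes(a, b, head);
    if (r)
        return r;
    a += head;
    b += head;
    n -= head;

    /* Body: compare 8 bytes at a time, standing in for ".Llong". */
    while (n >= 8) {
        uint64_t wa, wb;
        memcpy(&wa, a, 8);
        memcpy(&wb, b, 8);
        if (wa != wb)
            return cmp_bytes(a, b, 8); /* locate the first differing byte */
        a += 8;
        b += 8;
        n -= 8;
    }

    /* Tail: fewer than 8 bytes left. */
    return cmp_bytes(a, b, n);
}
```

On a word mismatch the sketch re-scans those 8 bytes bytewise, which keeps
the result independent of endianness at the cost of a little extra work.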

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/2d9ee327adce5f6becea2dd51d282a

cheers

^ permalink raw reply

* Re: [v2] powerpc/64s: make PACA_IRQ_HARD_DIS track MSR[EE] closely
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin
In-Reply-To: <20180603122432.617-1-npiggin@gmail.com>

On Sun, 2018-06-03 at 12:24:32 UTC, Nicholas Piggin wrote:
> When the masked interrupt handler clears MSR[EE] for an interrupt in
> the PACA_IRQ_MUST_HARD_MASK set, it does not set PACA_IRQ_HARD_DIS.
> This makes them get out of synch.
> 
> With that taken into account, it's only low level irq manipulation
> (and interrupt entry before reconcile) where they can be out of synch.
> This makes the code less surprising.
> 
> It also allows the IRQ replay code to rely on the IRQ_HARD_DIS value
> and not have to mtmsrd again in this case (e.g., for an external
> interrupt that has been masked). The bigger benefit might just be
> that there is not such an element of surprise in these two bits of
> state.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/9b81c0211c249c1bc8caec2ddbc86e

cheers

^ permalink raw reply

* Re: [01/15] powerpc/powernv: opal_put_chars partial write fix
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
  Cc: Jiri Slaby, linux-kernel, Nicholas Piggin, Greg Kroah-Hartman
In-Reply-To: <20180430145558.4308-2-npiggin@gmail.com>

On Mon, 2018-04-30 at 14:55:44 UTC, Nicholas Piggin wrote:
> The intention here is to consume and discard the remaining buffer
> upon error. This works if there has not been a previous partial write.
> If there has been, then total_len is no longer total number of bytes
> to copy. total_len is always "bytes left to copy", so it should be
> added to written bytes.
> 
> This code may not be exercised any more if partial writes will not be
> hit, but this is a small bugfix before a larger change.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
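
The accounting issue described above can be shown with a small userspace
sketch. Everything here is hypothetical (the function names, the callback
type and the demo device); it only mirrors the commit message's point that
total_len means "bytes left to copy", so on error it has to be added to the
bytes already written.

```c
#include <stddef.h>

/* Hypothetical callback: returns bytes written (> 0), or <= 0 on failure. */
typedef long (*write_fn)(const char *buf, size_t len);

/* Returns the number of bytes consumed from 'data'. */
size_t put_chars_sketch(const char *data, size_t total_len, write_fn write_some)
{
    size_t written = 0;

    while (total_len > 0) {
        long n = write_some(data + written, total_len);
        if (n <= 0) {
            /*
             * Discard the rest of the buffer. total_len is what is left,
             * so add it; "written = total_len" would under-report after
             * a partial write.
             */
            written += total_len;
            break;
        }
        written += (size_t)n;
        total_len -= (size_t)n;
    }
    return written;
}

/* Demo device: accepts at most 4 bytes on the first call, then fails. */
int demo_calls;

long demo_write(const char *buf, size_t len)
{
    (void)buf;
    if (demo_calls++ > 0)
        return -1;
    return len < 4 ? (long)len : 4;
}
```

With demo_calls reset to 0, writing 10 bytes through demo_write() takes a
4-byte partial write followed by an error, and the whole request is reported
as consumed.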

Patches 1-9 applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/bd90284cc6c1c9e8e48c8eadd0c795

cheers

^ permalink raw reply

* Re: [v2] powerpc: NMI IPI make NMI IPIs fully sychronous
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin
In-Reply-To: <20180425051759.9474-1-npiggin@gmail.com>

On Wed, 2018-04-25 at 05:17:59 UTC, Nicholas Piggin wrote:
> There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
> for all CPUs to call in to the handler, but it does not wait for
> completion of the handler. This is a needless complication, so remove
> it and always wait synchronously.
> 
> The synchronous wait allows the caller to easily time out and clear
> the wait for completion (zero nmi_ipi_busy_count) in the case of badly
> behaved handlers. This would have prevented the recent smp_send_stop
> NMI IPI bug from causing the system to hang.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
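
As a rough illustration of "wait synchronously, but time out and clear the
wait", here is a single-threaded userspace sketch using C11 atomics. The
names busy_count and wait_for_handlers() are hypothetical stand-ins, the spin
budget replaces a real time-based timeout, and none of this is the kernel
implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a counter of handlers that have not yet completed. */
atomic_int busy_count;

/*
 * Spin until every handler has finished or the spin budget runs out.
 * On timeout the counter is zeroed so that later callers are not stuck
 * behind a badly behaved handler. Returns true if all handlers completed.
 */
bool wait_for_handlers(unsigned long max_spins)
{
    while (atomic_load(&busy_count) > 0) {
        if (max_spins-- == 0) {
            atomic_store(&busy_count, 0);
            return false;
        }
    }
    return true;
}
```

In a real system other CPUs would decrement the counter as their handlers
finish; here nothing does, so a non-zero counter always ends in the timeout
path.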

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/5b73151fff63fb019db8171cb81c6c

cheers

^ permalink raw reply

* Re: powerpc/tm: Remove struct thread_info param from tm_reclaim_thread()
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: Cyril Bur, linuxppc-dev; +Cc: gromero
In-Reply-To: <20180201010746.413-1-cyrilbur@gmail.com>

On Thu, 2018-02-01 at 01:07:46 UTC, Cyril Bur wrote:
> tm_reclaim_thread() doesn't use the parameter anymore, yet both callers have
> to bother getting it even though they have no need for a struct thread_info
> either.
> 
> Just remove it and adjust the callers.
> 
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/edd00b830731be468fd3caf7f9154d

cheers

^ permalink raw reply

* Re: powerpc/tm: Update function prototype comment
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: Cyril Bur, linuxppc-dev
In-Reply-To: <20180205051716.22529-1-cyrilbur@gmail.com>

On Mon, 2018-02-05 at 05:17:16 UTC, Cyril Bur wrote:
> In commit eb5c3f1c8647 ("powerpc: Always save/restore checkpointed regs
> during treclaim/trecheckpoint") __tm_recheckpoint was modified to no
> longer take the second parameter 'unsigned long orig_msr' as part of a
> TM rewrite to simplify the reclaiming/recheckpointing process.
> 
> There is a comment in the asm file where the function is declared which
> has an incorrect prototype with the 'orig_msr' parameter.
> 
> This patch corrects the comment.
> 
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a596a7e91710d26fd862e3b7031c40

cheers

^ permalink raw reply

* Re: [V2, 1/2] powerpc/powernv/opal-dump : Handles opal_dump_info properly
From: Michael Ellerman @ 2018-07-24 13:59 UTC (permalink / raw)
  To: Mukesh Ojha, stewart, jk, hegdevasant, linuxppc-dev; +Cc: Mukesh Ojha
In-Reply-To: <1487596931-26975-1-git-send-email-mukesh02@linux.vnet.ibm.com>

On Mon, 2017-02-20 at 13:22:10 UTC, Mukesh Ojha wrote:
> Move the return value check of 'opal_dump_info' to the proper place;
> previously all the dump info was filled in unnecessarily, even on failure.
> 
> Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
> Acked-by: Jeremy Kerr <jk@ozlabs.org>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a5bbe8fd29f7e42fe5d26371adbad9

cheers

^ permalink raw reply

* [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8
From: Arnd Bergmann @ 2018-07-24 11:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joe Perches, Arnd Bergmann, Samuel Ortiz, David S. Miller,
	Rob Herring, Michael Ellerman, Jonathan Cameron, linux-wireless,
	netdev, devicetree, linux-kernel, linux-arm-kernel, linux-crypto,
	linuxppc-dev, linux-iio, linux-pm, lvs-devel, netfilter-devel,
	coreteam

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=Binary, Size: 30563 bytes --]

Almost all files in the kernel are either plain text or UTF-8
encoded. A couple however are ISO_8859-1, usually just a few
characters in C comments, for historic reasons.

This converts them all to UTF-8 for consistency.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 .../devicetree/bindings/net/nfc/pn544.txt     |   2 +-
 arch/arm/boot/dts/sun4i-a10-inet97fv2.dts     |   2 +-
 arch/arm/crypto/sha256_glue.c                 |   2 +-
 arch/arm/crypto/sha256_neon_glue.c            |   4 +-
 drivers/crypto/vmx/ghashp8-ppc.pl             |  12 +-
 drivers/iio/dac/ltc2632.c                     |   2 +-
 drivers/power/reset/ltc2952-poweroff.c        |   4 +-
 kernel/events/callchain.c                     |   2 +-
 net/netfilter/ipvs/Kconfig                    |   8 +-
 net/netfilter/ipvs/ip_vs_mh.c                 |   4 +-
 tools/power/cpupower/po/de.po                 |  44 +++----
 tools/power/cpupower/po/fr.po                 | 120 +++++++++---------
 12 files changed, 103 insertions(+), 103 deletions(-)
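
For reference, the byte-level conversion involved is simple enough to sketch.
The helper below is hypothetical and is not how the patch was produced (the
message does not say which tool was used); it only shows how each ISO_8859-1
byte maps to UTF-8.

```c
#include <stddef.h>

/*
 * Converts 'len' ISO_8859-1 bytes from 'src' into UTF-8 in 'dst'.
 * 'dst' must have room for at least 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII is identical in both encodings. */
            dst[out++] = c;
        } else {
            /* U+0080..U+00FF becomes a two-byte sequence 110xxxxx 10xxxxxx. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}
```

For example, the single byte 0xE9 (e with acute accent) becomes the two bytes
0xC3 0xA9, which is why a viewer that assumes the wrong charset shows two
characters where one is expected.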

diff --git a/Documentation/devicetree/bindings/net/nfc/pn544.txt b/Documentation/devicetree/bindings/net/nfc/pn544.txt
index 538a86f7b2b0..72593f056b75 100644
--- a/Documentation/devicetree/bindings/net/nfc/pn544.txt
+++ b/Documentation/devicetree/bindings/net/nfc/pn544.txt
@@ -2,7 +2,7 @@
 
 Required properties:
 - compatible: Should be "nxp,pn544-i2c".
-- clock-frequency: I²C work frequency.
+- clock-frequency: IÂ²C work frequency.
 - reg: address on the bus
 - interrupt-parent: phandle for the interrupt gpio controller
 - interrupts: GPIO interrupt to which the chip is connected
diff --git a/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts b/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts
index 5d096528e75a..71c27ea0b53e 100644
--- a/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts
+++ b/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts
@@ -1,7 +1,7 @@
 /*
  * Copyright 2014 Open Source Support GmbH
  *
- * David Lanzendörfer <david.lanzendoerfer@o2s.ch>
+ * David LanzendÃ¶rfer <david.lanzendoerfer@o2s.ch>
  *
  * This file is dual-licensed: you can use it either under the terms
  * of the GPL or the X11 license, at your option. Note that this dual
diff --git a/arch/arm/crypto/sha256_glue.c b/arch/arm/crypto/sha256_glue.c
index bf8ccff2c9d0..0ae900e778f3 100644
--- a/arch/arm/crypto/sha256_glue.c
+++ b/arch/arm/crypto/sha256_glue.c
@@ -2,7 +2,7 @@
  * Glue code for the SHA256 Secure Hash Algorithm assembly implementation
  * using optimized ARM assembler and NEON instructions.
  *
- * Copyright © 2015 Google Inc.
+ * Copyright Â© 2015 Google Inc.
  *
  * This file is based on sha256_ssse3_glue.c:
  *   Copyright (C) 2013 Intel Corporation
diff --git a/arch/arm/crypto/sha256_neon_glue.c b/arch/arm/crypto/sha256_neon_glue.c
index 9bbee56fbdc8..1d82c6cd31a4 100644
--- a/arch/arm/crypto/sha256_neon_glue.c
+++ b/arch/arm/crypto/sha256_neon_glue.c
@@ -2,10 +2,10 @@
  * Glue code for the SHA256 Secure Hash Algorithm assembly implementation
  * using NEON instructions.
  *
- * Copyright © 2015 Google Inc.
+ * Copyright Â© 2015 Google Inc.
  *
  * This file is based on sha512_neon_glue.c:
- *   Copyright © 2014 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ *   Copyright Â© 2014 Jussi Kivilinna <jussi.kivilinna@iki.fi>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the Free
diff --git a/drivers/crypto/vmx/ghashp8-ppc.pl b/drivers/crypto/vmx/ghashp8-ppc.pl
index f746af271460..38b06503ede0 100644
--- a/drivers/crypto/vmx/ghashp8-ppc.pl
+++ b/drivers/crypto/vmx/ghashp8-ppc.pl
@@ -129,9 +129,9 @@ $code=<<___;
 	 le?vperm	$IN,$IN,$IN,$lemask
 	vxor		$zero,$zero,$zero
 
-	vpmsumd		$Xl,$IN,$Hl		# H.lo·Xi.lo
-	vpmsumd		$Xm,$IN,$H		# H.hi·Xi.lo+H.lo·Xi.hi
-	vpmsumd		$Xh,$IN,$Hh		# H.hi·Xi.hi
+	vpmsumd		$Xl,$IN,$Hl		# H.loÂ·Xi.lo
+	vpmsumd		$Xm,$IN,$H		# H.hiÂ·Xi.lo+H.loÂ·Xi.hi
+	vpmsumd		$Xh,$IN,$Hh		# H.hiÂ·Xi.hi
 
 	vpmsumd		$t2,$Xl,$xC2		# 1st phase
 
@@ -187,11 +187,11 @@ $code=<<___;
 .align	5
 Loop:
 	 subic		$len,$len,16
-	vpmsumd		$Xl,$IN,$Hl		# H.lo·Xi.lo
+	vpmsumd		$Xl,$IN,$Hl		# H.loÂ·Xi.lo
 	 subfe.		r0,r0,r0		# borrow?-1:0
-	vpmsumd		$Xm,$IN,$H		# H.hi·Xi.lo+H.lo·Xi.hi
+	vpmsumd		$Xm,$IN,$H		# H.hiÂ·Xi.lo+H.loÂ·Xi.hi
 	 and		r0,r0,$len
-	vpmsumd		$Xh,$IN,$Hh		# H.hi·Xi.hi
+	vpmsumd		$Xh,$IN,$Hh		# H.hiÂ·Xi.hi
 	 add		$inp,$inp,r0
 
 	vpmsumd		$t2,$Xl,$xC2		# 1st phase
diff --git a/drivers/iio/dac/ltc2632.c b/drivers/iio/dac/ltc2632.c
index cca278eaa138..885105135580 100644
--- a/drivers/iio/dac/ltc2632.c
+++ b/drivers/iio/dac/ltc2632.c
@@ -1,7 +1,7 @@
 /*
  * LTC2632 Digital to analog convertors spi driver
  *
- * Copyright 2017 Maxime Roussin-Bélanger
+ * Copyright 2017 Maxime Roussin-BÃ©langer
  * expanded by Silvan Murer <silvan.murer@gmail.com>
  *
  * Licensed under the GPL-2.
diff --git a/drivers/power/reset/ltc2952-poweroff.c b/drivers/power/reset/ltc2952-poweroff.c
index 6b911b6b10a6..c484584745bc 100644
--- a/drivers/power/reset/ltc2952-poweroff.c
+++ b/drivers/power/reset/ltc2952-poweroff.c
@@ -2,7 +2,7 @@
  * LTC2952 (PowerPath) driver
  *
  * Copyright (C) 2014, Xsens Technologies BV <info@xsens.com>
- * Maintainer: René Moll <linux@r-moll.nl>
+ * Maintainer: RenÃ© Moll <linux@r-moll.nl>
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -319,6 +319,6 @@ static struct platform_driver ltc2952_poweroff_driver = {
 
 module_platform_driver(ltc2952_poweroff_driver);
 
-MODULE_AUTHOR("René Moll <rene.moll@xsens.com>");
+MODULE_AUTHOR("RenÃ© Moll <rene.moll@xsens.com>");
 MODULE_DESCRIPTION("LTC PowerPath power-off driver");
 MODULE_LICENSE("GPL v2");
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index c187aa3df3c8..24a77c34e9ad 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -4,7 +4,7 @@
  *  Copyright (C) 2008 Thomas Gleixner <tglx@linutronix.de>
  *  Copyright (C) 2008-2011 Red Hat, Inc., Ingo Molnar
  *  Copyright (C) 2008-2011 Red Hat, Inc., Peter Zijlstra
- *  Copyright  ©  2009 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
+ *  Copyright  Â©  2009 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  *
  * For licensing details see kernel-base/COPYING
  */
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 05dc1b77e466..cad48d07c818 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -296,10 +296,10 @@ config IP_VS_MH_TAB_INDEX
 	  stored in a hash table. This table is assigned by a preference
 	  list of the positions to each destination until all slots in
 	  the table are filled. The index determines the prime for size of
-	  the table as 251, 509, 1021, 2039, 4093, 8191, 16381, 32749,
-	  65521 or 131071. When using weights to allow destinations to
-	  receive more connections, the table is assigned an amount
-	  proportional to the weights specified. The table needs to be large
+	  the table asÂ 251, 509, 1021, 2039, 4093, 8191, 16381, 32749,
+	  65521 or 131071.Â When using weights to allow destinations to
+	  receive more connections,Â the table is assigned an amount
+	  proportional to the weights specified.Â The table needs to be large
 	  enough to effectively fit all the destinations multiplied by their
 	  respective weights.
 
diff --git a/net/netfilter/ipvs/ip_vs_mh.c b/net/netfilter/ipvs/ip_vs_mh.c
index 0f795b186eb3..94d9d349ebb0 100644
--- a/net/netfilter/ipvs/ip_vs_mh.c
+++ b/net/netfilter/ipvs/ip_vs_mh.c
@@ -5,10 +5,10 @@
  *
  */
 
-/* The mh algorithm is to assign a preference list of all the lookup
+/* The mh algorithm is to assignÂ a preference list of all the lookup
  * table positions to each destination and populate the table with
  * the most-preferred position of destinations. Then it is to select
- * destination with the hash key of source IP address through looking
+ * destination with the hash key of source IP addressÂ through looking
  * up a the lookup table.
  *
  * The algorithm is detailed in:
diff --git a/tools/power/cpupower/po/de.po b/tools/power/cpupower/po/de.po
index 78c09e51663a..840c17cc450a 100644
--- a/tools/power/cpupower/po/de.po
+++ b/tools/power/cpupower/po/de.po
@@ -323,12 +323,12 @@ msgstr "  Hardwarebedingte Grenzen der Taktfrequenz: "
 #: utils/cpufreq-info.c:256
 #, c-format
 msgid "  available frequency steps: "
-msgstr "  mögliche Taktfrequenzen: "
+msgstr "  mÃ¶gliche Taktfrequenzen: "
 
 #: utils/cpufreq-info.c:269
 #, c-format
 msgid "  available cpufreq governors: "
-msgstr "  mögliche Regler: "
+msgstr "  mÃ¶gliche Regler: "
 
 #: utils/cpufreq-info.c:280
 #, c-format
@@ -381,7 +381,7 @@ msgstr "Optionen:\n"
 msgid "  -e, --debug          Prints out debug information [default]\n"
 msgstr ""
 "  -e, --debug          Erzeugt detaillierte Informationen, hilfreich\n"
-"                       zum Aufspüren von Fehlern\n"
+"                       zum AufspÃ¼ren von Fehlern\n"
 
 #: utils/cpufreq-info.c:475
 #, c-format
@@ -424,7 +424,7 @@ msgstr "  -p, --policy         Findet die momentane Taktik heraus *\n"
 #: utils/cpufreq-info.c:482
 #, c-format
 msgid "  -g, --governors      Determines available cpufreq governors *\n"
-msgstr "  -g, --governors      Erzeugt eine Liste mit verfügbaren Reglern *\n"
+msgstr "  -g, --governors      Erzeugt eine Liste mit verfÃ¼gbaren Reglern *\n"
 
 #: utils/cpufreq-info.c:483
 #, c-format
@@ -450,7 +450,7 @@ msgstr ""
 #, c-format
 msgid "  -s, --stats          Shows cpufreq statistics if available\n"
 msgstr ""
-"  -s, --stats          Zeigt, sofern möglich, Statistiken über cpufreq an.\n"
+"  -s, --stats          Zeigt, sofern mÃ¶glich, Statistiken Ã¼ber cpufreq an.\n"
 
 #: utils/cpufreq-info.c:487
 #, c-format
@@ -473,9 +473,9 @@ msgid ""
 "cpufreq\n"
 "                       interface in 2.4. and early 2.6. kernels\n"
 msgstr ""
-"  -o, --proc           Erzeugt Informationen in einem ähnlichem Format zu "
+"  -o, --proc           Erzeugt Informationen in einem Ã¤hnlichem Format zu "
 "dem\n"
-"                       der /proc/cpufreq-Datei in 2.4. und frühen 2.6.\n"
+"                       der /proc/cpufreq-Datei in 2.4. und frÃ¼hen 2.6.\n"
 "                       Kernel-Versionen\n"
 
 #: utils/cpufreq-info.c:491
@@ -491,7 +491,7 @@ msgstr ""
 #: utils/cpufreq-info.c:492 utils/cpuidle-info.c:152
 #, c-format
 msgid "  -h, --help           Prints out this screen\n"
-msgstr "  -h, --help           Gibt diese Kurzübersicht aus\n"
+msgstr "  -h, --help           Gibt diese KurzÃ¼bersicht aus\n"
 
 #: utils/cpufreq-info.c:495
 #, c-format
@@ -501,7 +501,7 @@ msgid ""
 msgstr ""
 "Sofern kein anderer Parameter als '-c, --cpu' angegeben wird, liefert "
 "dieses\n"
-"Programm Informationen, die z.B. zum Berichten von Fehlern nützlich sind.\n"
+"Programm Informationen, die z.B. zum Berichten von Fehlern nÃ¼tzlich sind.\n"
 
 #: utils/cpufreq-info.c:497
 #, c-format
@@ -557,7 +557,7 @@ msgid ""
 "select\n"
 msgstr ""
 "  -d FREQ, --min FREQ      neue minimale Taktfrequenz, die der Regler\n"
-"                           auswählen darf\n"
+"                           auswÃ¤hlen darf\n"
 
 #: utils/cpufreq-set.c:28
 #, c-format
@@ -566,7 +566,7 @@ msgid ""
 "select\n"
 msgstr ""
 "  -u FREQ, --max FREQ      neue maximale Taktfrequenz, die der Regler\n"
-"                           auswählen darf\n"
+"                           auswÃ¤hlen darf\n"
 
 #: utils/cpufreq-set.c:29
 #, c-format
@@ -579,20 +579,20 @@ msgid ""
 "  -f FREQ, --freq FREQ     specific frequency to be set. Requires userspace\n"
 "                           governor to be available and loaded\n"
 msgstr ""
-"  -f FREQ, --freq FREQ     setze exakte Taktfrequenz. Benötigt den Regler\n"
+"  -f FREQ, --freq FREQ     setze exakte Taktfrequenz. BenÃ¶tigt den Regler\n"
 "                           'userspace'.\n"
 
 #: utils/cpufreq-set.c:32
 #, c-format
 msgid "  -r, --related            Switches all hardware-related CPUs\n"
 msgstr ""
-"  -r, --related            Setze Werte für alle CPUs, deren Taktfrequenz\n"
+"  -r, --related            Setze Werte fÃ¼r alle CPUs, deren Taktfrequenz\n"
 "                           hardwarebedingt identisch ist.\n"
 
 #: utils/cpufreq-set.c:33 utils/cpupower-set.c:28 utils/cpupower-info.c:27
 #, c-format
 msgid "  -h, --help               Prints out this screen\n"
-msgstr "  -h, --help               Gibt diese Kurzübersicht aus\n"
+msgstr "  -h, --help               Gibt diese KurzÃ¼bersicht aus\n"
 
 #: utils/cpufreq-set.c:35
 #, fuzzy, c-format
@@ -618,8 +618,8 @@ msgstr ""
 "   angenommen\n"
 "2. Der Parameter -f bzw. --freq kann mit keinem anderen als dem Parameter\n"
 "   -c bzw. --cpu kombiniert werden\n"
-"3. FREQuenzen können in Hz, kHz (Standard), MHz, GHz oder THz eingegeben\n"
-"   werden, indem der Wert und unmittelbar anschließend (ohne Leerzeichen!)\n"
+"3. FREQuenzen kÃ¶nnen in Hz, kHz (Standard), MHz, GHz oder THz eingegeben\n"
+"   werden, indem der Wert und unmittelbar anschlieÃŸend (ohne Leerzeichen!)\n"
 "   die Einheit angegeben werden. (Bsp: 1GHz )\n"
 "   (FREQuenz in kHz =^ MHz * 1000 =^ GHz * 1000000).\n"
 
@@ -638,7 +638,7 @@ msgid ""
 msgstr ""
 "Beim Einstellen ist ein Fehler aufgetreten. Typische Fehlerquellen sind:\n"
 "- nicht ausreichende Rechte (Administrator)\n"
-"- der Regler ist nicht verfügbar bzw. nicht geladen\n"
+"- der Regler ist nicht verfÃ¼gbar bzw. nicht geladen\n"
 "- die angegebene Taktik ist inkorrekt\n"
 "- eine spezifische Frequenz wurde angegeben, aber der Regler 'userspace'\n"
 "  kann entweder hardwarebedingt nicht genutzt werden oder ist nicht geladen\n"
@@ -821,7 +821,7 @@ msgstr ""
 #: utils/cpuidle-info.c:48
 #, fuzzy, c-format
 msgid "Available idle states:"
-msgstr "  mögliche Taktfrequenzen: "
+msgstr "  mÃ¶gliche Taktfrequenzen: "
 
 #: utils/cpuidle-info.c:71
 #, c-format
@@ -924,7 +924,7 @@ msgstr "Aufruf: cpufreq-info [Optionen]\n"
 msgid "  -s, --silent         Only show general C-state information\n"
 msgstr ""
 "  -e, --debug          Erzeugt detaillierte Informationen, hilfreich\n"
-"                       zum Aufspüren von Fehlern\n"
+"                       zum AufspÃ¼ren von Fehlern\n"
 
 #: utils/cpuidle-info.c:150
 #, fuzzy, c-format
@@ -933,9 +933,9 @@ msgid ""
 "acpi/processor/*/power\n"
 "                       interface in older kernels\n"
 msgstr ""
-"  -o, --proc           Erzeugt Informationen in einem ähnlichem Format zu "
+"  -o, --proc           Erzeugt Informationen in einem Ã¤hnlichem Format zu "
 "dem\n"
-"                       der /proc/cpufreq-Datei in 2.4. und frühen 2.6.\n"
+"                       der /proc/cpufreq-Datei in 2.4. und frÃ¼hen 2.6.\n"
 "                       Kernel-Versionen\n"
 
 #: utils/cpuidle-info.c:209
@@ -949,7 +949,7 @@ msgstr ""
 #~ "  -c CPU, --cpu CPU    CPU number which information shall be determined "
 #~ "about\n"
 #~ msgstr ""
-#~ "  -c CPU, --cpu CPU    Nummer der CPU, über die Informationen "
+#~ "  -c CPU, --cpu CPU    Nummer der CPU, Ã¼ber die Informationen "
 #~ "herausgefunden werden sollen\n"
 
 #~ msgid ""
diff --git a/tools/power/cpupower/po/fr.po b/tools/power/cpupower/po/fr.po
index 245ad20a9bf9..b46ca2548f86 100644
--- a/tools/power/cpupower/po/fr.po
+++ b/tools/power/cpupower/po/fr.po
@@ -212,7 +212,7 @@ msgstr ""
 #: utils/cpupower.c:91
 #, c-format
 msgid "Report errors and bugs to %s, please.\n"
-msgstr "Veuillez rapportez les erreurs et les bogues à %s, s'il vous plait.\n"
+msgstr "Veuillez rapportez les erreurs et les bogues Ã  %s, s'il vous plait.\n"
 
 #: utils/cpupower.c:114
 #, c-format
@@ -227,14 +227,14 @@ msgstr ""
 #: utils/cpufreq-info.c:31
 #, c-format
 msgid "Couldn't count the number of CPUs (%s: %s), assuming 1\n"
-msgstr "Détermination du nombre de CPUs (%s : %s) impossible.  Assume 1\n"
+msgstr "DÃ©termination du nombre de CPUs (%s : %s) impossible.  Assume 1\n"
 
 #: utils/cpufreq-info.c:63
 #, c-format
 msgid ""
 "          minimum CPU frequency  -  maximum CPU frequency  -  governor\n"
 msgstr ""
-"         Fréquence CPU minimale - Fréquence CPU maximale  - régulateur\n"
+"         FrÃ©quence CPU minimale - FrÃ©quence CPU maximale  - rÃ©gulateur\n"
 
 #: utils/cpufreq-info.c:151
 #, c-format
@@ -302,12 +302,12 @@ msgstr "  pilote : %s\n"
 #: utils/cpufreq-info.c:219
 #, fuzzy, c-format
 msgid "  CPUs which run at the same hardware frequency: "
-msgstr "  CPUs qui doivent changer de fréquences en même temps : "
+msgstr "  CPUs qui doivent changer de frÃ©quences en mÃªme temps : "
 
 #: utils/cpufreq-info.c:230
 #, fuzzy, c-format
 msgid "  CPUs which need to have their frequency coordinated by software: "
-msgstr "  CPUs qui doivent changer de fréquences en même temps : "
+msgstr "  CPUs qui doivent changer de frÃ©quences en mÃªme temps : "
 
 #: utils/cpufreq-info.c:241
 #, c-format
@@ -317,22 +317,22 @@ msgstr ""
 #: utils/cpufreq-info.c:247
 #, c-format
 msgid "  hardware limits: "
-msgstr "  limitation matérielle : "
+msgstr "  limitation matÃ©rielle : "
 
 #: utils/cpufreq-info.c:256
 #, c-format
 msgid "  available frequency steps: "
-msgstr "  plage de fréquence : "
+msgstr "  plage de frÃ©quence : "
 
 #: utils/cpufreq-info.c:269
 #, c-format
 msgid "  available cpufreq governors: "
-msgstr "  régulateurs disponibles : "
+msgstr "  rÃ©gulateurs disponibles : "
 
 #: utils/cpufreq-info.c:280
 #, c-format
 msgid "  current policy: frequency should be within "
-msgstr "  tactique actuelle : la fréquence doit être comprise entre "
+msgstr "  tactique actuelle : la frÃ©quence doit Ãªtre comprise entre "
 
 #: utils/cpufreq-info.c:282
 #, c-format
@@ -345,18 +345,18 @@ msgid ""
 "The governor \"%s\" may decide which speed to use\n"
 "                  within this range.\n"
 msgstr ""
-"Le régulateur \"%s\" est libre de choisir la vitesse\n"
-"                  dans cette plage de fréquences.\n"
+"Le rÃ©gulateur \"%s\" est libre de choisir la vitesse\n"
+"                  dans cette plage de frÃ©quences.\n"
 
 #: utils/cpufreq-info.c:293
 #, c-format
 msgid "  current CPU frequency is "
-msgstr "  la fréquence actuelle de ce CPU est "
+msgstr "  la frÃ©quence actuelle de ce CPU est "
 
 #: utils/cpufreq-info.c:296
 #, c-format
 msgid " (asserted by call to hardware)"
-msgstr " (vérifié par un appel direct du matériel)"
+msgstr " (vÃ©rifiÃ© par un appel direct du matÃ©riel)"
 
 #: utils/cpufreq-info.c:304
 #, c-format
@@ -377,7 +377,7 @@ msgstr "Options :\n"
 #: utils/cpufreq-info.c:474
 #, fuzzy, c-format
 msgid "  -e, --debug          Prints out debug information [default]\n"
-msgstr "  -e, --debug          Afficher les informations de déboguage\n"
+msgstr "  -e, --debug          Afficher les informations de dÃ©boguage\n"
 
 #: utils/cpufreq-info.c:475
 #, c-format
@@ -385,8 +385,8 @@ msgid ""
 "  -f, --freq           Get frequency the CPU currently runs at, according\n"
 "                       to the cpufreq core *\n"
 msgstr ""
-"  -f, --freq           Obtenir la fréquence actuelle du CPU selon le point\n"
-"                       de vue du coeur du système de cpufreq *\n"
+"  -f, --freq           Obtenir la frÃ©quence actuelle du CPU selon le point\n"
+"                       de vue du coeur du systÃ¨me de cpufreq *\n"
 
 #: utils/cpufreq-info.c:477
 #, c-format
@@ -394,8 +394,8 @@ msgid ""
 "  -w, --hwfreq         Get frequency the CPU currently runs at, by reading\n"
 "                       it from hardware (only available to root) *\n"
 msgstr ""
-"  -w, --hwfreq         Obtenir la fréquence actuelle du CPU directement par\n"
-"                       le matériel (doit être root) *\n"
+"  -w, --hwfreq         Obtenir la frÃ©quence actuelle du CPU directement par\n"
+"                       le matÃ©riel (doit Ãªtre root) *\n"
 
 #: utils/cpufreq-info.c:479
 #, c-format
@@ -403,13 +403,13 @@ msgid ""
 "  -l, --hwlimits       Determine the minimum and maximum CPU frequency "
 "allowed *\n"
 msgstr ""
-"  -l, --hwlimits       Affiche les fréquences minimales et maximales du CPU "
+"  -l, --hwlimits       Affiche les frÃ©quences minimales et maximales du CPU "
 "*\n"
 
 #: utils/cpufreq-info.c:480
 #, c-format
 msgid "  -d, --driver         Determines the used cpufreq kernel driver *\n"
-msgstr "  -d, --driver         Affiche le pilote cpufreq utilisé *\n"
+msgstr "  -d, --driver         Affiche le pilote cpufreq utilisÃ© *\n"
 
 #: utils/cpufreq-info.c:481
 #, c-format
@@ -420,7 +420,7 @@ msgstr "  -p, --policy         Affiche la tactique actuelle de cpufreq *\n"
 #, c-format
 msgid "  -g, --governors      Determines available cpufreq governors *\n"
 msgstr ""
-"  -g, --governors      Affiche les régulateurs disponibles de cpufreq *\n"
+"  -g, --governors      Affiche les rÃ©gulateurs disponibles de cpufreq *\n"
 
 #: utils/cpufreq-info.c:483
 #, fuzzy, c-format
@@ -429,7 +429,7 @@ msgid ""
 "frequency *\n"
 msgstr ""
 "  -a, --affected-cpus   Affiche quels sont les CPUs qui doivent changer de\n"
-"                        fréquences en même temps *\n"
+"                        frÃ©quences en mÃªme temps *\n"
 
 #: utils/cpufreq-info.c:484
 #, fuzzy, c-format
@@ -438,7 +438,7 @@ msgid ""
 "                       coordinated by software *\n"
 msgstr ""
 "  -a, --affected-cpus   Affiche quels sont les CPUs qui doivent changer de\n"
-"                        fréquences en même temps *\n"
+"                        frÃ©quences en mÃªme temps *\n"
 
 #: utils/cpufreq-info.c:486
 #, c-format
@@ -453,7 +453,7 @@ msgid ""
 "  -y, --latency        Determines the maximum latency on CPU frequency "
 "changes *\n"
 msgstr ""
-"  -l, --hwlimits       Affiche les fréquences minimales et maximales du CPU "
+"  -l, --hwlimits       Affiche les frÃ©quences minimales et maximales du CPU "
 "*\n"
 
 #: utils/cpufreq-info.c:488
@@ -469,7 +469,7 @@ msgid ""
 "                       interface in 2.4. and early 2.6. kernels\n"
 msgstr ""
 "  -o, --proc           Affiche les informations en utilisant l'interface\n"
-"                       fournie par /proc/cpufreq, présente dans les "
+"                       fournie par /proc/cpufreq, prÃ©sente dans les "
 "versions\n"
 "                       2.4 et les anciennes versions 2.6 du noyau\n"
 
@@ -485,7 +485,7 @@ msgstr ""
 #: utils/cpufreq-info.c:492 utils/cpuidle-info.c:152
 #, c-format
 msgid "  -h, --help           Prints out this screen\n"
-msgstr "  -h, --help           affiche l'aide-mémoire\n"
+msgstr "  -h, --help           affiche l'aide-mÃ©moire\n"
 
 #: utils/cpufreq-info.c:495
 #, c-format
@@ -493,8 +493,8 @@ msgid ""
 "If no argument or only the -c, --cpu parameter is given, debug output about\n"
 "cpufreq is printed which is useful e.g. for reporting bugs.\n"
 msgstr ""
-"Par défaut, les informations de déboguage seront affichées si aucun\n"
-"argument, ou bien si seulement l'argument -c (--cpu) est donné, afin de\n"
+"Par dÃ©faut, les informations de dÃ©boguage seront affichÃ©es si aucun\n"
+"argument, ou bien si seulement l'argument -c (--cpu) est donnÃ©, afin de\n"
 "faciliter les rapports de bogues par exemple\n"
 
 #: utils/cpufreq-info.c:497
@@ -517,8 +517,8 @@ msgid ""
 "You can't specify more than one --cpu parameter and/or\n"
 "more than one output-specific argument\n"
 msgstr ""
-"On ne peut indiquer plus d'un paramètre --cpu, tout comme l'on ne peut\n"
-"spécifier plus d'un argument de formatage\n"
+"On ne peut indiquer plus d'un paramÃ¨tre --cpu, tout comme l'on ne peut\n"
+"spÃ©cifier plus d'un argument de formatage\n"
 
 #: utils/cpufreq-info.c:600 utils/cpufreq-set.c:82 utils/cpupower-set.c:42
 #: utils/cpupower-info.c:42 utils/cpuidle-info.c:213
@@ -529,7 +529,7 @@ msgstr "option invalide\n"
 #: utils/cpufreq-info.c:617
 #, c-format
 msgid "couldn't analyze CPU %d as it doesn't seem to be present\n"
-msgstr "analyse du CPU %d impossible puisqu'il ne semble pas être présent\n"
+msgstr "analyse du CPU %d impossible puisqu'il ne semble pas Ãªtre prÃ©sent\n"
 
 #: utils/cpufreq-info.c:620 utils/cpupower-info.c:142
 #, c-format
@@ -547,8 +547,8 @@ msgid ""
 "  -d FREQ, --min FREQ      new minimum CPU frequency the governor may "
 "select\n"
 msgstr ""
-"  -d FREQ, --min FREQ       nouvelle fréquence minimale du CPU à utiliser\n"
-"                            par le régulateur\n"
+"  -d FREQ, --min FREQ       nouvelle frÃ©quence minimale du CPU Ã  utiliser\n"
+"                            par le rÃ©gulateur\n"
 
 #: utils/cpufreq-set.c:28
 #, c-format
@@ -556,13 +556,13 @@ msgid ""
 "  -u FREQ, --max FREQ      new maximum CPU frequency the governor may "
 "select\n"
 msgstr ""
-"  -u FREQ, --max FREQ       nouvelle fréquence maximale du CPU à utiliser\n"
-"                            par le régulateur\n"
+"  -u FREQ, --max FREQ       nouvelle frÃ©quence maximale du CPU Ã  utiliser\n"
+"                            par le rÃ©gulateur\n"
 
 #: utils/cpufreq-set.c:29
 #, c-format
 msgid "  -g GOV, --governor GOV   new cpufreq governor\n"
-msgstr "  -g GOV, --governor GOV   active le régulateur GOV\n"
+msgstr "  -g GOV, --governor GOV   active le rÃ©gulateur GOV\n"
 
 #: utils/cpufreq-set.c:30
 #, c-format
@@ -570,9 +570,9 @@ msgid ""
 "  -f FREQ, --freq FREQ     specific frequency to be set. Requires userspace\n"
 "                           governor to be available and loaded\n"
 msgstr ""
-"  -f FREQ, --freq FREQ     fixe la fréquence du processeur à FREQ. Il faut\n"
-"                           que le régulateur « userspace » soit disponible \n"
-"                           et activé.\n"
+"  -f FREQ, --freq FREQ     fixe la frÃ©quence du processeur Ã  FREQ. Il faut\n"
+"                           que le rÃ©gulateur Â« userspace Â» soit disponible \n"
+"                           et activÃ©.\n"
 
 #: utils/cpufreq-set.c:32
 #, c-format
@@ -582,7 +582,7 @@ msgstr ""
 #: utils/cpufreq-set.c:33 utils/cpupower-set.c:28 utils/cpupower-info.c:27
 #, fuzzy, c-format
 msgid "  -h, --help               Prints out this screen\n"
-msgstr "  -h, --help           affiche l'aide-mémoire\n"
+msgstr "  -h, --help           affiche l'aide-mÃ©moire\n"
 
 #: utils/cpufreq-set.c:35
 #, fuzzy, c-format
@@ -602,11 +602,11 @@ msgid ""
 "   (FREQuency in kHz =^ Hz * 0.001 =^ MHz * 1000 =^ GHz * 1000000).\n"
 msgstr ""
 "Remarque :\n"
-"1. Le CPU numéro 0 sera utilisé par défaut si -c (ou --cpu) est omis ;\n"
-"2. l'argument -f FREQ (ou --freq FREQ) ne peut être utilisé qu'avec --cpu ;\n"
-"3. on pourra préciser l'unité des fréquences en postfixant sans aucune "
+"1. Le CPU numÃ©ro 0 sera utilisÃ© par dÃ©faut si -c (ou --cpu) est omis ;\n"
+"2. l'argument -f FREQ (ou --freq FREQ) ne peut Ãªtre utilisÃ© qu'avec --cpu ;\n"
+"3. on pourra prÃ©ciser l'unitÃ© des frÃ©quences en postfixant sans aucune "
 "espace\n"
-"   les valeurs par hz, kHz (par défaut), MHz, GHz ou THz\n"
+"   les valeurs par hz, kHz (par dÃ©faut), MHz, GHz ou THz\n"
 "   (kHz =^ Hz * 0.001 =^ MHz * 1000 =^ GHz * 1000000).\n"
 
 #: utils/cpufreq-set.c:57
@@ -622,21 +622,21 @@ msgid ""
 "frequency\n"
 "   or because the userspace governor isn't loaded?\n"
 msgstr ""
-"En ajustant les nouveaux paramètres, une erreur est apparue. Les sources\n"
+"En ajustant les nouveaux paramÃ¨tres, une erreur est apparue. Les sources\n"
 "d'erreur typique sont :\n"
-"- droit d'administration insuffisant (êtes-vous root ?) ;\n"
-"- le régulateur choisi n'est pas disponible, ou bien n'est pas disponible "
+"- droit d'administration insuffisant (Ãªtes-vous root ?) ;\n"
+"- le rÃ©gulateur choisi n'est pas disponible, ou bien n'est pas disponible "
 "en\n"
 "  tant que module noyau ;\n"
 "- la tactique n'est pas disponible ;\n"
-"- vous voulez utiliser l'option -f/--freq, mais le régulateur « userspace »\n"
-"  n'est pas disponible, par exemple parce que le matériel ne le supporte\n"
-"  pas, ou bien n'est tout simplement pas chargé.\n"
+"- vous voulez utiliser l'option -f/--freq, mais le rÃ©gulateur Â« userspace Â»\n"
+"  n'est pas disponible, par exemple parce que le matÃ©riel ne le supporte\n"
+"  pas, ou bien n'est tout simplement pas chargÃ©.\n"
 
 #: utils/cpufreq-set.c:170
 #, c-format
 msgid "wrong, unknown or unhandled CPU?\n"
-msgstr "CPU inconnu ou non supporté ?\n"
+msgstr "CPU inconnu ou non supportÃ© ?\n"
 
 #: utils/cpufreq-set.c:302
 #, c-format
@@ -653,7 +653,7 @@ msgid ""
 "At least one parameter out of -f/--freq, -d/--min, -u/--max, and\n"
 "-g/--governor must be passed\n"
 msgstr ""
-"L'un de ces paramètres est obligatoire : -f/--freq, -d/--min, -u/--max et\n"
+"L'un de ces paramÃ¨tres est obligatoire : -f/--freq, -d/--min, -u/--max et\n"
 "-g/--governor\n"
 
 #: utils/cpufreq-set.c:347
@@ -810,7 +810,7 @@ msgstr ""
 #: utils/cpuidle-info.c:48
 #, fuzzy, c-format
 msgid "Available idle states:"
-msgstr "  plage de fréquence : "
+msgstr "  plage de frÃ©quence : "
 
 #: utils/cpuidle-info.c:71
 #, c-format
@@ -911,7 +911,7 @@ msgstr "Usage : cpufreq-info [options]\n"
 #: utils/cpuidle-info.c:149
 #, fuzzy, c-format
 msgid "  -s, --silent         Only show general C-state information\n"
-msgstr "  -e, --debug          Afficher les informations de déboguage\n"
+msgstr "  -e, --debug          Afficher les informations de dÃ©boguage\n"
 
 #: utils/cpuidle-info.c:150
 #, fuzzy, c-format
@@ -921,7 +921,7 @@ msgid ""
 "                       interface in older kernels\n"
 msgstr ""
 "  -o, --proc           Affiche les informations en utilisant l'interface\n"
-"                       fournie par /proc/cpufreq, présente dans les "
+"                       fournie par /proc/cpufreq, prÃ©sente dans les "
 "versions\n"
 "                       2.4 et les anciennes versions 2.6 du noyau\n"
 
@@ -929,19 +929,19 @@ msgstr ""
 #, fuzzy, c-format
 msgid "You can't specify more than one output-specific argument\n"
 msgstr ""
-"On ne peut indiquer plus d'un paramètre --cpu, tout comme l'on ne peut\n"
-"spécifier plus d'un argument de formatage\n"
+"On ne peut indiquer plus d'un paramÃ¨tre --cpu, tout comme l'on ne peut\n"
+"spÃ©cifier plus d'un argument de formatage\n"
 
 #~ msgid ""
 #~ "  -c CPU, --cpu CPU    CPU number which information shall be determined "
 #~ "about\n"
 #~ msgstr ""
-#~ "  -c CPU, --cpu CPU    Numéro du CPU pour lequel l'information sera "
-#~ "affichée\n"
+#~ "  -c CPU, --cpu CPU    NumÃ©ro du CPU pour lequel l'information sera "
+#~ "affichÃ©e\n"
 
 #~ msgid ""
 #~ "  -c CPU, --cpu CPU        number of CPU where cpufreq settings shall be "
 #~ "modified\n"
 #~ msgstr ""
-#~ "  -c CPU, --cpu CPU        numéro du CPU à prendre en compte pour les\n"
+#~ "  -c CPU, --cpu CPU        numÃ©ro du CPU Ã  prendre en compte pour les\n"
 #~ "                           changements\n"
-- 
2.18.0

* Re: [PATCH] net: ethernet: fs-enet: Use generic CRC32 implementation
From: Krzysztof Kozlowski @ 2018-07-24 11:11 UTC (permalink / raw)
  To: David Laight
  Cc: Pantelis Antoniou, David S. Miller, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Eric Biggers
In-Reply-To: <5ef968e1cb2a473cad97f6bfead369ca@AcuMS.aculab.com>

On 24 July 2018 at 13:05, David Laight <David.Laight@aculab.com> wrote:
> From: Krzysztof Kozlowski
>> Sent: 23 July 2018 17:20
>> Use generic kernel CRC32 implementation because it:
>> 1. Should be faster (uses lookup tables),
>
> Are you sure?
> The lookup tables are unlikely to be in the data cache and
> the 6 cache misses kill performance.
> (Not that it particularly matters when setting up multicast hash tables).

Good point, so this statement should rather be "Could be faster"... I
did not run any performance tests, so this is not backed up by any
data.

I think the main benefit is rather easier code maintenance by removing
duplicated, custom code.

>> 2. Removes duplicated CRC generation code,
>> 3. Uses well-proven algorithm instead of coding it one more time.
> ...
>>
>> Not tested on hardware.
>
> Have you verified that the old and new functions give the
> same result for a few mac addresses?
> It is very easy to use the wrong bits in crc calculations
> or generate the output in the wrong bit order.

I copied the original code and the new one into a different driver and
ran them in a loop over thousands of inputs (although not all possible
MAC combinations). The output was the same. I agree, however, that real
testing would be important.
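
For illustration, a comparison of that kind can be sketched in user space
roughly as below. This is not the code that was actually run and not the
driver's routine: the function names, the reflected polynomial 0xEDB88320,
the LCG used to generate inputs and the fixed seed are all assumptions
chosen only to show the method (a bit-at-a-time reference checked against
a table-driven version over many 6-byte inputs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE 0xEDB88320u /* reflected IEEE 802.3 polynomial */

/* Reference: one bit at a time, no tables. */
static uint32_t crc32_bitwise(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_LE : 0);
	}
	return crc;
}

/* Candidate: one byte at a time through a 256-entry lookup table. */
static uint32_t crc32_table(uint32_t crc, const uint8_t *p, size_t len)
{
	static uint32_t table[256];
	static int ready;

	if (!ready) {
		for (uint32_t n = 0; n < 256; n++) {
			uint32_t c = n;

			for (int i = 0; i < 8; i++)
				c = (c >> 1) ^ ((c & 1) ? CRC32_POLY_LE : 0);
			table[n] = c;
		}
		ready = 1;
	}
	while (len--)
		crc = (crc >> 8) ^ table[(crc ^ *p++) & 0xff];
	return crc;
}

/* Sanity anchor: both must reproduce the standard CRC-32 check value. */
static void crc32_check_known_vector(void)
{
	static const uint8_t msg[] = "123456789";

	assert((crc32_bitwise(~0u, msg, 9) ^ ~0u) == 0xCBF43926u);
	assert((crc32_table(~0u, msg, 9) ^ ~0u) == 0xCBF43926u);
}

/* Feed `rounds` pseudo-random 6-byte addresses to both; return mismatches. */
static unsigned long crc32_count_mismatches(unsigned long rounds)
{
	uint32_t seed = 0x12345678u; /* fixed seed keeps the run reproducible */
	unsigned long bad = 0;

	while (rounds--) {
		uint8_t mac[6];

		for (size_t i = 0; i < sizeof(mac); i++) {
			seed = seed * 1664525u + 1013904223u; /* simple LCG */
			mac[i] = (uint8_t)(seed >> 24);
		}
		if (crc32_bitwise(~0u, mac, sizeof(mac)) !=
		    crc32_table(~0u, mac, sizeof(mac)))
			bad++;
	}
	return bad;
}
```

A zero return from crc32_count_mismatches() only shows that the two
routines agree on the sampled inputs; it says nothing about bit order on
the wire, which is why testing on hardware still matters.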

Best regards,
Krzysztof

* Re: [PATCH v3 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
From: Cédric Le Goater @ 2018-07-24  9:07 UTC (permalink / raw)
  To: Sam Bobroff, linuxppc-dev; +Cc: kvm, kvm-ppc, paulus, david
In-Reply-To: <1fb3aea5f44f1029866ee10db40abde7e18b24ad.1531967105.git.sbobroff@linux.ibm.com>

On 07/19/2018 04:25 AM, Sam Bobroff wrote:
> From: Sam Bobroff <sam.bobroff@au1.ibm.com>
> 
> It is not currently possible to create the full number of possible
> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
> threads per core than its core stride (or "VSMT mode"). This is
> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> 
> To address this, "pack" the VCORE ID and XIVE offsets by using
> knowledge of the way the VCPU IDs will be used when there are fewer
> guest threads per core than the core stride. The primary thread of
> each core will always be used first. Then, if the guest uses more than
> one thread per core, these secondary threads will sequentially follow
> the primary in each core.
> 
> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> VCPUs are being spaced apart, so at least half of each core is empty
> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> into the second half of each core (4..7, in an 8-thread core).
> 
> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> each core is being left empty, and we can map down into the second and
> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> 
> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> threads are being used and 7/8 of the core is empty, allowing use of
> the 1, 3, 5 and 7 thread slots.
> 
> (Strides less than 8 are handled similarly.)
> 
> This allows the VCORE ID or offset to be calculated quickly from the
> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> 
> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>

On the XIVE part, 

Reviewed-by: Cédric Le Goater <clg@kaod.org>

Thanks,

C.
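
For anyone who wants to check the mapping by hand, the packing can be
exercised in user space with a sketch like the one below. It mirrors the
kvmppc_pack_vcpu_id() helper from the quoted patch, but the name
pack_vcpu_id(), the explicit stride argument, and the value 1024 for
KVM_MAX_VCPUS are assumptions made for illustration (the real limit
depends on the kernel configuration), and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SMT_THREADS 8
#define KVM_MAX_VCPUS   1024u /* assumption for illustration only */

/*
 * Pack an ID from [0..KVM_MAX_VCPUS * 8) into [0..KVM_MAX_VCPUS).
 * `stride` plays the role of kvm->arch.emul_smt_mode (1, 2, 4 or 8).
 */
static uint32_t pack_vcpu_id(unsigned int stride, uint32_t id)
{
	static const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 3, 5, 7};
	unsigned int block;
	uint32_t packed_id;

	assert(stride >= 1 && stride <= MAX_SMT_THREADS);
	/* Scale the block number so smaller strides reuse the same table. */
	block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
	assert(block < MAX_SMT_THREADS);

	/* Mask off the block, then shift into a slot the guest leaves free. */
	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
	assert(packed_id < KVM_MAX_VCPUS);
	return packed_id;
}
```

With a stride of 8 and these assumed limits, IDs 0 and 8 stay where they
are, while 1024 and 1032 (block 1) land on 4 and 12, i.e. in the second
half of the first two cores.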

> ---
> Hello everyone,
> 
> I've completed a trial merge with the guest native-XIVE code and found no
> problems; it's no more difficult than the host side and only requires a few
> calls to xive_vp().
> 
> On that basis, here is v3 (unchanged from v2) as non-RFC and it seems to be
> ready to go.
> 
> Patch set v3:
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> 
> Patch set v2:
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> * Corrected places in kvm/book3s_xive.c where IDs weren't packed.
> * Because kvmppc_pack_vcpu_id() is only called on P9, there is no need to test "emul_smt_mode > 1", so remove it.
> * Re-ordered block_offsets[] to be more ascending.
> * Added more detailed description of the packing algorithm.
> 
> Patch set v1:
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> 
>  arch/powerpc/include/asm/kvm_book3s.h | 44 +++++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_hv.c          | 14 +++++++----
>  arch/powerpc/kvm/book3s_xive.c        | 19 +++++++++------
>  3 files changed, 66 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 1f345a0b6ba2..ba4b6e00fca7 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -390,4 +390,48 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>  #define SPLIT_HACK_MASK			0xff000000
>  #define SPLIT_HACK_OFFS			0xfb000000
>  
> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> + * (but not its actual threading mode, which is not available) to avoid
> + * collisions.
> + *
> + * The implementation leaves VCPU IDs from the range [0..KVM_MAX_VCPUS) (block
> + * 0) unchanged: if the guest is filling each VCORE completely then it will be
> + * using consecutive IDs and it will fill the space without any packing.
> + *
> + * For higher VCPU IDs, the packed ID is based on the VCPU ID modulo
> + * KVM_MAX_VCPUS (effectively masking off the top bits) and then an offset is
> + * added to avoid collisions.
> + *
> + * VCPU IDs in the range [KVM_MAX_VCPUS..(KVM_MAX_VCPUS*2)) (block 1) are only
> + * possible if the guest is leaving at least 1/2 of each VCORE empty, so IDs
> + * can be safely packed into the second half of each VCORE by adding an offset
> + * of (stride / 2).
> + *
> + * Similarly, if VCPU IDs in the range [(KVM_MAX_VCPUS*2)..(KVM_MAX_VCPUS*4))
> + * (blocks 2 and 3) are seen, the guest must be leaving at least 3/4 of each
> + * VCORE empty so packed IDs can be offset by (stride / 4) and (stride * 3 / 4).
> + *
> + * Finally, VCPU IDs from blocks 4..7 will only be seen if the guest is using a
> + * stride of 8 and 1 thread per core so the remaining offsets of 1, 3, 5 and 7
> + * must be free to use.
> + *
> + * (The offsets for each block are stored in block_offsets[], indexed by the
> + * block number if the stride is 8. For cases where the guest's stride is less
> + * than 8, we can re-use the block_offsets array by multiplying the block
> + * number by (MAX_SMT_THREADS / stride) to reach the correct entry.)
> + */
> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> +{
> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 3, 5, 7};
> +	int stride = kvm->arch.emul_smt_mode;
> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> +	u32 packed_id;
> +
> +	BUG_ON(block >= MAX_SMT_THREADS);
> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> +	return packed_id;
> +}
> +
>  #endif /* __ASM_KVM_BOOK3S_H__ */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index de686b340f4a..363c2fb0d89e 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1816,7 +1816,7 @@ static int threads_per_vcore(struct kvm *kvm)
>  	return threads_per_subcore;
>  }
>  
> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>  {
>  	struct kvmppc_vcore *vcore;
>  
> @@ -1830,7 +1830,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>  	init_swait_queue_head(&vcore->wq);
>  	vcore->preempt_tb = TB_NIL;
>  	vcore->lpcr = kvm->arch.lpcr;
> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> +	vcore->first_vcpuid = id;
>  	vcore->kvm = kvm;
>  	INIT_LIST_HEAD(&vcore->preempt_list);
>  
> @@ -2048,12 +2048,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>  	mutex_lock(&kvm->lock);
>  	vcore = NULL;
>  	err = -EINVAL;
> -	core = id / kvm->arch.smt_mode;
> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> +		BUG_ON(kvm->arch.smt_mode != 1);
> +		core = kvmppc_pack_vcpu_id(kvm, id);
> +	} else {
> +		core = id / kvm->arch.smt_mode;
> +	}
>  	if (core < KVM_MAX_VCORES) {
>  		vcore = kvm->arch.vcores[core];
> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>  		if (!vcore) {
>  			err = -ENOMEM;
> -			vcore = kvmppc_vcore_create(kvm, core);
> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>  			kvm->arch.vcores[core] = vcore;
>  			kvm->arch.online_vcores++;
>  		}
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index f9818d7d3381..dbd5887daf4a 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>  	return -EBUSY;
>  }
>  
> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> +{
> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> +}
> +
>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>  			     struct kvmppc_xive_src_block *sb,
>  			     struct kvmppc_xive_irq_state *state)
> @@ -362,7 +367,7 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>  	 */
>  	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
>  		xive_native_configure_irq(hw_num,
> -					  xive->vp_base + state->act_server,
> +					  xive_vp(xive, state->act_server),
>  					  MASKED, state->number);
>  		/* set old_p so we can track if an H_EOI was done */
>  		state->old_p = true;
> @@ -418,7 +423,7 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
>  	 */
>  	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
>  		xive_native_configure_irq(hw_num,
> -					  xive->vp_base + state->act_server,
> +					  xive_vp(xive, state->act_server),
>  					  state->act_priority, state->number);
>  		/* If an EOI is needed, do it here */
>  		if (!state->old_p)
> @@ -495,7 +500,7 @@ static int xive_target_interrupt(struct kvm *kvm,
>  	kvmppc_xive_select_irq(state, &hw_num, NULL);
>  
>  	return xive_native_configure_irq(hw_num,
> -					 xive->vp_base + server,
> +					 xive_vp(xive, server),
>  					 prio, state->number);
>  }
>  
> @@ -883,7 +888,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
>  	 * which is fine for a never started interrupt.
>  	 */
>  	xive_native_configure_irq(hw_irq,
> -				  xive->vp_base + state->act_server,
> +				  xive_vp(xive, state->act_server),
>  				  state->act_priority, state->number);
>  
>  	/*
> @@ -959,7 +964,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
>  
>  	/* Reconfigure the IPI */
>  	xive_native_configure_irq(state->ipi_number,
> -				  xive->vp_base + state->act_server,
> +				  xive_vp(xive, state->act_server),
>  				  state->act_priority, state->number);
>  
>  	/*
> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  		pr_devel("Duplicate !\n");
>  		return -EEXIST;
>  	}
> -	if (cpu >= KVM_MAX_VCPUS) {
> +	if (cpu >= KVM_MAX_VCPU_ID) {
>  		pr_devel("Out of bounds !\n");
>  		return -EINVAL;
>  	}
> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  	xc->xive = xive;
>  	xc->vcpu = vcpu;
>  	xc->server_num = cpu;
> -	xc->vp_id = xive->vp_base + cpu;
> +	xc->vp_id = xive_vp(xive, cpu);
>  	xc->mfrr = 0xff;
>  	xc->valid = true;
>  
> 

* [PATCH v8 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
From: Shilpasri G Bhat @ 2018-07-24  9:13 UTC (permalink / raw)
  To: mpe, linux
  Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, stewart,
	Shilpasri G Bhat
In-Reply-To: <1532423589-18730-1-git-send-email-shilpa.bhat@linux.vnet.ibm.com>

OPAL firmware provides the facility for some groups of sensors to be
enabled/disabled at runtime to give the user the option of using the
system resources for collecting these sensors or not.

For example, on POWER9 systems, the On Chip Controller (OCC) gathers
various system and chip level sensors and maintains their values in
main memory.

This patch provides support for enabling/disabling the sensor groups
like power, temperature, current and voltage.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: Commit message]
---
Changes from v7:
- Use of_for_each_phandle() and of_count_phandle_with_args() to parse
  through the phandle array
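
For reference, a user-space program could drive the new *_enable
attributes along the lines of the sketch below. The helper names are
hypothetical and not part of this patch, and the attribute path is
supplied by the caller: something like
/sys/class/hwmon/hwmonN/temp1_enable is assumed here, with N and the
attribute index depending on the system.

```c
#include <errno.h>
#include <stdio.h>

/* Write "1" or "0" to an enable attribute; returns 0 or a negative errno. */
static int sensor_group_set_enable(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;
	if (fputs(enable ? "1\n" : "0\n", f) == EOF)
		err = -EIO;
	/* The write reaches the kernel on flush, so check fclose() too. */
	if (fclose(f) == EOF && !err)
		err = -errno;
	return err;
}

/* Read the attribute back; returns 0 or 1, or a negative errno on failure. */
static int sensor_group_get_enable(const char *path)
{
	FILE *f = fopen(path, "r");
	unsigned int val;
	int ret;

	if (!f)
		return -errno;
	ret = (fscanf(f, "%u", &val) == 1) ? (int)(val != 0) : -EINVAL;
	fclose(f);
	return ret;
}
```

While a group is disabled, reads of its sensor attributes fail with
ENODATA (see show_sensor() in the diff below), so callers should expect
that error rather than a stale value.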

 Documentation/hwmon/ibmpowernv |  43 +++++++-
 drivers/hwmon/ibmpowernv.c     | 238 +++++++++++++++++++++++++++++++++++------
 2 files changed, 247 insertions(+), 34 deletions(-)

diff --git a/Documentation/hwmon/ibmpowernv b/Documentation/hwmon/ibmpowernv
index 8826ba2..5646825 100644
--- a/Documentation/hwmon/ibmpowernv
+++ b/Documentation/hwmon/ibmpowernv
@@ -33,9 +33,48 @@ fanX_input		Measured RPM value.
 fanX_min		Threshold RPM for alert generation.
 fanX_fault		0: No fail condition
 			1: Failing fan
+
 tempX_input		Measured ambient temperature.
 tempX_max		Threshold ambient temperature for alert generation.
-inX_input		Measured power supply voltage
+tempX_highest		Historical maximum temperature
+tempX_lowest		Historical minimum temperature
+tempX_enable		Enable/disable all temperature sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its temperature sensors.
+			1: Enable
+			0: Disable
+
+inX_input		Measured power supply voltage (millivolt)
 inX_fault		0: No fail condition.
 			1: Failing power supply.
-power1_input		System power consumption (microWatt)
+inX_highest		Historical maximum voltage
+inX_lowest		Historical minimum voltage
+inX_enable		Enable/disable all voltage sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its voltage sensors.
+			1: Enable
+			0: Disable
+
+powerX_input		Power consumption (microWatt)
+powerX_input_highest	Historical maximum power
+powerX_input_lowest	Historical minimum power
+powerX_enable		Enable/disable all power sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its power sensors.
+			1: Enable
+			0: Disable
+
+currX_input		Measured current (milliampere)
+currX_highest		Historical maximum current
+currX_lowest		Historical minimum current
+currX_enable		Enable/disable all current sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its current sensors.
+			1: Enable
+			0: Disable
+
+energyX_input		Cumulative energy (microJoule)
diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
index f829dad..8347280 100644
--- a/drivers/hwmon/ibmpowernv.c
+++ b/drivers/hwmon/ibmpowernv.c
@@ -90,11 +90,20 @@ struct sensor_data {
 	char label[MAX_LABEL_LEN];
 	char name[MAX_ATTR_LEN];
 	struct device_attribute dev_attr;
+	struct sensor_group_data *sgrp_data;
+};
+
+struct sensor_group_data {
+	struct mutex mutex;
+	u32 gid;
+	bool enable;
 };
 
 struct platform_data {
 	const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1];
+	struct sensor_group_data *sgrp_data;
 	u32 sensors_count; /* Total count of sensors from each group */
+	u32 nr_sensor_groups; /* Total number of sensor groups */
 };
 
 static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
@@ -105,6 +114,9 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
 	ssize_t ret;
 	u64 x;
 
+	if (sdata->sgrp_data && !sdata->sgrp_data->enable)
+		return -ENODATA;
+
 	ret =  opal_get_sensor_data_u64(sdata->id, &x);
 
 	if (ret)
@@ -120,6 +132,46 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
 	return sprintf(buf, "%llu\n", x);
 }
 
+static ssize_t show_enable(struct device *dev,
+			   struct device_attribute *devattr, char *buf)
+{
+	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
+						 dev_attr);
+
+	return sprintf(buf, "%u\n", sdata->sgrp_data->enable);
+}
+
+static ssize_t store_enable(struct device *dev,
+			    struct device_attribute *devattr,
+			    const char *buf, size_t count)
+{
+	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
+						 dev_attr);
+	struct sensor_group_data *sgrp_data = sdata->sgrp_data;
+	int ret;
+	bool data;
+
+	ret = kstrtobool(buf, &data);
+	if (ret)
+		return ret;
+
+	ret = mutex_lock_interruptible(&sgrp_data->mutex);
+	if (ret)
+		return ret;
+
+	if (data != sgrp_data->enable) {
+		ret =  sensor_group_enable(sgrp_data->gid, data);
+		if (!ret)
+			sgrp_data->enable = data;
+	}
+
+	if (!ret)
+		ret = count;
+
+	mutex_unlock(&sgrp_data->mutex);
+	return ret;
+}
+
 static ssize_t show_label(struct device *dev, struct device_attribute *devattr,
 			  char *buf)
 {
@@ -292,12 +344,115 @@ static u32 get_sensor_hwmon_index(struct sensor_data *sdata,
 	return ++sensor_groups[sdata->type].hwmon_index;
 }
 
+static int init_sensor_group_data(struct platform_device *pdev,
+				  struct platform_data *pdata)
+{
+	struct sensor_group_data *sgrp_data;
+	struct device_node *groups, *sgrp;
+	int count = 0, ret = 0;
+	enum sensors type;
+
+	groups = of_find_compatible_node(NULL, NULL, "ibm,opal-sensor-group");
+	if (!groups)
+		return ret;
+
+	for_each_child_of_node(groups, sgrp) {
+		type = get_sensor_type(sgrp);
+		if (type != MAX_SENSOR_TYPE)
+			pdata->nr_sensor_groups++;
+	}
+
+	if (!pdata->nr_sensor_groups)
+		goto out;
+
+	sgrp_data = devm_kcalloc(&pdev->dev, pdata->nr_sensor_groups,
+				 sizeof(*sgrp_data), GFP_KERNEL);
+	if (!sgrp_data) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for_each_child_of_node(groups, sgrp) {
+		u32 gid;
+
+		type = get_sensor_type(sgrp);
+		if (type == MAX_SENSOR_TYPE)
+			continue;
+
+		if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
+			continue;
+
+		if (of_count_phandle_with_args(sgrp, "sensors", NULL) <= 0)
+			continue;
+
+		sensor_groups[type].attr_count++;
+		sgrp_data[count].gid = gid;
+		mutex_init(&sgrp_data[count].mutex);
+		sgrp_data[count++].enable = false;
+	}
+
+	pdata->sgrp_data = sgrp_data;
+out:
+	of_node_put(groups);
+	return ret;
+}
+
+static struct sensor_group_data *get_sensor_group(struct platform_data *pdata,
+						  struct device_node *node,
+						  enum sensors gtype)
+{
+	struct sensor_group_data *sgrp_data = pdata->sgrp_data;
+	struct device_node *groups, *sgrp;
+
+	groups = of_find_compatible_node(NULL, NULL, "ibm,opal-sensor-group");
+	if (!groups)
+		return NULL;
+
+	for_each_child_of_node(groups, sgrp) {
+		struct of_phandle_iterator it;
+		u32 gid;
+		int rc, i;
+		enum sensors type;
+
+		type = get_sensor_type(sgrp);
+		if (type != gtype)
+			continue;
+
+		if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
+			continue;
+
+		of_for_each_phandle(&it, rc, sgrp, "sensors", NULL, 0)
+			if (it.phandle == node->phandle) {
+				of_node_put(it.node);
+				break;
+			}
+
+		if (rc)
+			continue;
+
+		for (i = 0; i < pdata->nr_sensor_groups; i++)
+			if (gid == sgrp_data[i].gid) {
+				of_node_put(sgrp);
+				of_node_put(groups);
+				return &sgrp_data[i];
+			}
+	}
+
+	of_node_put(groups);
+	return NULL;
+}
+
 static int populate_attr_groups(struct platform_device *pdev)
 {
 	struct platform_data *pdata = platform_get_drvdata(pdev);
 	const struct attribute_group **pgroups = pdata->attr_groups;
 	struct device_node *opal, *np;
 	enum sensors type;
+	int ret;
+
+	ret = init_sensor_group_data(pdev, pdata);
+	if (ret)
+		return ret;
 
 	opal = of_find_node_by_path("/ibm,opal/sensors");
 	for_each_child_of_node(opal, np) {
@@ -344,7 +499,10 @@ static int populate_attr_groups(struct platform_device *pdev)
 static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
 			      ssize_t (*show)(struct device *dev,
 					      struct device_attribute *attr,
-					      char *buf))
+					      char *buf),
+			    ssize_t (*store)(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf, size_t count))
 {
 	snprintf(sdata->name, MAX_ATTR_LEN, "%s%d_%s",
 		 sensor_groups[sdata->type].name, sdata->hwmon_index,
@@ -352,23 +510,33 @@ static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
 
 	sysfs_attr_init(&sdata->dev_attr.attr);
 	sdata->dev_attr.attr.name = sdata->name;
-	sdata->dev_attr.attr.mode = S_IRUGO;
 	sdata->dev_attr.show = show;
+	if (store) {
+		sdata->dev_attr.store = store;
+		sdata->dev_attr.attr.mode = 0664;
+	} else {
+		sdata->dev_attr.attr.mode = 0444;
+	}
 }
 
 static void populate_sensor(struct sensor_data *sdata, int od, int hd, int sid,
 			    const char *attr_name, enum sensors type,
 			    const struct attribute_group *pgroup,
+			    struct sensor_group_data *sgrp_data,
 			    ssize_t (*show)(struct device *dev,
 					    struct device_attribute *attr,
-					    char *buf))
+					    char *buf),
+			    ssize_t (*store)(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf, size_t count))
 {
 	sdata->id = sid;
 	sdata->type = type;
 	sdata->opal_index = od;
 	sdata->hwmon_index = hd;
-	create_hwmon_attr(sdata, attr_name, show);
+	create_hwmon_attr(sdata, attr_name, show, store);
 	pgroup->attrs[sensor_groups[type].attr_count++] = &sdata->dev_attr.attr;
+	sdata->sgrp_data = sgrp_data;
 }
 
 static char *get_max_attr(enum sensors type)
@@ -403,24 +571,23 @@ static int create_device_attrs(struct platform_device *pdev)
 	const struct attribute_group **pgroups = pdata->attr_groups;
 	struct device_node *opal, *np;
 	struct sensor_data *sdata;
-	u32 sensor_id;
-	enum sensors type;
 	u32 count = 0;
-	int err = 0;
+	u32 group_attr_id[MAX_SENSOR_TYPE] = {0};
 
-	opal = of_find_node_by_path("/ibm,opal/sensors");
 	sdata = devm_kcalloc(&pdev->dev,
 			     pdata->sensors_count, sizeof(*sdata),
 			     GFP_KERNEL);
-	if (!sdata) {
-		err = -ENOMEM;
-		goto exit_put_node;
-	}
+	if (!sdata)
+		return -ENOMEM;
 
+	opal = of_find_node_by_path("/ibm,opal/sensors");
 	for_each_child_of_node(opal, np) {
+		struct sensor_group_data *sgrp_data;
 		const char *attr_name;
-		u32 opal_index;
+		u32 opal_index, hw_id;
+		u32 sensor_id;
 		const char *label;
+		enum sensors type;
 
 		if (np->name == NULL)
 			continue;
@@ -456,14 +623,12 @@ static int create_device_attrs(struct platform_device *pdev)
 			opal_index = INVALID_INDEX;
 		}
 
-		sdata[count].opal_index = opal_index;
-		sdata[count].hwmon_index =
-			get_sensor_hwmon_index(&sdata[count], sdata, count);
-
-		create_hwmon_attr(&sdata[count], attr_name, show_sensor);
-
-		pgroups[type]->attrs[sensor_groups[type].attr_count++] =
-				&sdata[count++].dev_attr.attr;
+		hw_id = get_sensor_hwmon_index(&sdata[count], sdata, count);
+		sgrp_data = get_sensor_group(pdata, np, type);
+		populate_sensor(&sdata[count], opal_index, hw_id, sensor_id,
+				attr_name, type, pgroups[type], sgrp_data,
+				show_sensor, NULL);
+		count++;
 
 		if (!of_property_read_string(np, "label", &label)) {
 			/*
@@ -474,35 +639,43 @@ static int create_device_attrs(struct platform_device *pdev)
 			 */
 
 			make_sensor_label(np, &sdata[count], label);
-			populate_sensor(&sdata[count], opal_index,
-					sdata[count - 1].hwmon_index,
+			populate_sensor(&sdata[count], opal_index, hw_id,
 					sensor_id, "label", type, pgroups[type],
-					show_label);
+					NULL, show_label, NULL);
 			count++;
 		}
 
 		if (!of_property_read_u32(np, "sensor-data-max", &sensor_id)) {
 			attr_name = get_max_attr(type);
-			populate_sensor(&sdata[count], opal_index,
-					sdata[count - 1].hwmon_index,
+			populate_sensor(&sdata[count], opal_index, hw_id,
 					sensor_id, attr_name, type,
-					pgroups[type], show_sensor);
+					pgroups[type], sgrp_data, show_sensor,
+					NULL);
 			count++;
 		}
 
 		if (!of_property_read_u32(np, "sensor-data-min", &sensor_id)) {
 			attr_name = get_min_attr(type);
-			populate_sensor(&sdata[count], opal_index,
-					sdata[count - 1].hwmon_index,
+			populate_sensor(&sdata[count], opal_index, hw_id,
 					sensor_id, attr_name, type,
-					pgroups[type], show_sensor);
+					pgroups[type], sgrp_data, show_sensor,
+					NULL);
+			count++;
+		}
+
+		if (sgrp_data && !sgrp_data->enable) {
+			sgrp_data->enable = true;
+			hw_id = ++group_attr_id[type];
+			populate_sensor(&sdata[count], opal_index, hw_id,
+					sgrp_data->gid, "enable", type,
+					pgroups[type], sgrp_data, show_enable,
+					store_enable);
 			count++;
 		}
 	}
 
-exit_put_node:
 	of_node_put(opal);
-	return err;
+	return 0;
 }
 
 static int ibmpowernv_probe(struct platform_device *pdev)
@@ -517,6 +690,7 @@ static int ibmpowernv_probe(struct platform_device *pdev)
 
 	platform_set_drvdata(pdev, pdata);
 	pdata->sensors_count = 0;
+	pdata->nr_sensor_groups = 0;
 	err = populate_attr_groups(pdev);
 	if (err)
 		return err;
-- 
1.8.3.1


* [PATCH v8 1/2] powernv:opal-sensor-groups: Add support to enable sensor groups
From: Shilpasri G Bhat @ 2018-07-24  9:13 UTC (permalink / raw)
  To: mpe, linux
  Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, stewart,
	Shilpasri G Bhat
In-Reply-To: <1532423589-18730-1-git-send-email-shilpa.bhat@linux.vnet.ibm.com>

Adds support to enable/disable a sensor group at runtime. This
can be used to select the sensor groups that need to be copied to
main memory by the OCC. Sensor groups such as power, temperature, current,
voltage, frequency and utilization can be enabled/disabled at runtime.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/opal-api.h                |  1 +
 arch/powerpc/include/asm/opal.h                    |  2 ++
 .../powerpc/platforms/powernv/opal-sensor-groups.c | 28 ++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |  1 +
 4 files changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 3bab299..56a94a1 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -206,6 +206,7 @@
 #define OPAL_NPU_SPA_CLEAR_CACHE		160
 #define OPAL_NPU_TL_SET				161
 #define OPAL_SENSOR_READ_U64			162
+#define OPAL_SENSOR_GROUP_ENABLE		163
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
 #define OPAL_LAST				165
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index e1b2910..fc0550e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -292,6 +292,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t address,
 int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
 int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 int opal_sensor_group_clear(u32 group_hndl, int token);
+int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
 
 s64 opal_signal_system_reset(s32 cpu);
 s64 opal_quiesce(u64 shutdown_type, s32 cpu);
@@ -326,6 +327,7 @@ extern int opal_async_wait_response_interruptible(uint64_t token,
 		struct opal_msg *msg);
 extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data);
 extern int opal_get_sensor_data_u64(u32 sensor_hndl, u64 *sensor_data);
+extern int sensor_group_enable(u32 grp_hndl, bool enable);
 
 struct rtc_time;
 extern time64_t opal_get_boot_time(void);
diff --git a/arch/powerpc/platforms/powernv/opal-sensor-groups.c b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
index 541c9ea..f7d04b6 100644
--- a/arch/powerpc/platforms/powernv/opal-sensor-groups.c
+++ b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
@@ -32,6 +32,34 @@ struct sg_attr {
 	struct sg_attr *sgattrs;
 } *sgs;
 
+int sensor_group_enable(u32 handle, bool enable)
+{
+	struct opal_msg msg;
+	int token, ret;
+
+	token = opal_async_get_token_interruptible();
+	if (token < 0)
+		return token;
+
+	ret = opal_sensor_group_enable(handle, token, enable);
+	if (ret == OPAL_ASYNC_COMPLETION) {
+		ret = opal_async_wait_response(token, &msg);
+		if (ret) {
+			pr_devel("Failed to wait for the async response\n");
+			ret = -EIO;
+			goto out;
+		}
+		ret = opal_error_code(opal_get_async_rc(msg));
+	} else {
+		ret = opal_error_code(ret);
+	}
+
+out:
+	opal_async_release_token(token);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sensor_group_enable);
+
 static ssize_t sg_store(struct kobject *kobj, struct kobj_attribute *attr,
 			const char *buf, size_t count)
 {
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index a8d9b40..8268a1e 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -327,3 +327,4 @@ OPAL_CALL(opal_npu_tl_set,			OPAL_NPU_TL_SET);
 OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,		OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,			OPAL_SENSOR_READ_U64);
+OPAL_CALL(opal_sensor_group_enable,		OPAL_SENSOR_GROUP_ENABLE);
-- 
1.8.3.1


* [PATCH v8 0/2] hwmon/powernv: Add attributes to enable/disable sensors
From: Shilpasri G Bhat @ 2018-07-24  9:13 UTC (permalink / raw)
  To: mpe, linux
  Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, stewart,
	Shilpasri G Bhat

This patch series adds a new attribute to enable or disable a sensor
group at runtime.

Changes from v7:
- Use of_for_each_phandle() and of_count_phandle_with_args() to parse
  through the phandle array

v7 : https://lkml.org/lkml/2018/7/20/72
v6 : https://lkml.org/lkml/2018/7/18/806
v5 : https://lkml.org/lkml/2018/7/15/15
v4 : https://lkml.org/lkml/2018/7/6/379
v3 : https://lkml.org/lkml/2018/7/5/476
v2 : https://lkml.org/lkml/2018/7/4/263
v1 : https://lkml.org/lkml/2018/3/22/214

Shilpasri G Bhat (2):
  powernv:opal-sensor-groups: Add support to enable sensor groups
  hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

 Documentation/hwmon/ibmpowernv                     |  43 +++-
 arch/powerpc/include/asm/opal-api.h                |   1 +
 arch/powerpc/include/asm/opal.h                    |   2 +
 .../powerpc/platforms/powernv/opal-sensor-groups.c |  28 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
 drivers/hwmon/ibmpowernv.c                         | 238 ++++++++++++++++++---
 6 files changed, 279 insertions(+), 34 deletions(-)

-- 
1.8.3.1


* [PATCH v4 2/2] powerpc: Enable CPU_FTR_ASYM_SMT for interleaved big-cores
From: Gautham R. Shenoy @ 2018-07-24  6:14 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Michael Neuling,
	Vaidyanathan Srinivasan, Akshay Adiga, Shilpasri G Bhat,
	Oliver O'Halloran, Nicholas Piggin, Murilo Opsfelder Araujo
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy
In-Reply-To: <1532412848-9826-1-git-send-email-ego@linux.vnet.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

A pair of IBM POWER9 SMT4 cores can be fused together to form a big-core
with 8 SMT threads. This can be discovered via the "ibm,thread-groups"
CPU property in the device tree, which indicates which groups of
threads share the L1 cache, translation cache and instruction data
flow. If there are multiple such groups of threads, then the core is a
big-core.

Furthermore, if the thread-ids of the threads of the big-core can be
obtained by interleaving the thread-ids of the thread-groups
(component small cores), then such a big-core is called an interleaved
big-core.

Eg: Threads in the pair of component SMT4 cores of an interleaved
big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.
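
The interleaving condition can be illustrated outside the kernel. The
following is a minimal userspace sketch, not the kernel code: the name
is_interleaved() and the flat array layout (one group after another)
are assumptions made for illustration. It treats a core as interleaved
when consecutive thread-ids inside each group differ by the number of
groups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace mirror of the interleaving check.
 * thread_list holds nr_groups groups of threads_per_group ids each,
 * stored one group after another.
 */
bool is_interleaved(const unsigned int *thread_list,
		    size_t nr_groups, size_t threads_per_group)
{
	size_t i, j;

	assert(thread_list != NULL);

	/* A single group or a single thread per group cannot interleave. */
	if (nr_groups < 2 || threads_per_group < 2)
		return false;

	for (i = 0; i < nr_groups; i++) {
		const unsigned int *grp = thread_list + i * threads_per_group;

		/* Consecutive ids within a group must differ by nr_groups. */
		for (j = 0; j + 1 < threads_per_group; j++) {
			if (grp[j + 1] != grp[j] + nr_groups)
				return false;
		}
	}

	return true;
}
```

With the list {0,2,4,6,1,3,5,7}, two groups and four threads per group,
the sketch reports an interleaved layout; with {0,1,2,3,4,5,6,7} it
does not.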

The SMT4 cores forming a big-core are more or less independent
units. Thus when multiple tasks are scheduled to run on the fused
core, we get the best performance when the tasks are spread across the
pair of SMT4 cores.

This patch enables the CPU_FTR_ASYM_SMT bit in the cpu-features on
detecting the presence of interleaved big-cores at boot up. This will
bias the load-balancing of tasks towards smaller numbered threads,
which will automatically result in spreading the tasks uniformly
across the associated pair of SMT4 cores.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/setup-common.c | 75 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 989edc1..22bc486 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -581,6 +581,69 @@ int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
 	return -1;
 }
 
+/*
+ * check_interleaved_big_core - Checks if the thread group tg
+ * corresponds to a big-core whose threads are interleavings of the
+ * threads of the component small cores.
+ *
+ * @tg: A thread-group struct for the core.
+ *
+ * Returns true if the core is an interleaved big-core.
+ * Returns false otherwise.
+ */
+static inline bool check_interleaved_big_core(struct thread_groups *tg)
+{
+	int nr_groups;
+	int threads_per_group;
+	int cur_cpu, next_cpu, i, j;
+
+	nr_groups = tg->nr_groups;
+	threads_per_group = tg->threads_per_group;
+
+	if (tg->property != 1)
+		return false;
+
+	if (nr_groups < 2 || threads_per_group < 2)
+		return false;
+
+	/*
+	 * In case of an interleaved big-core, the thread-ids of the
+	 * big-core can be obtained by interleaving the thread-ids
+	 * of the component small cores.
+	 *
+	 * Eg: On a 8-thread big-core with two SMT4 small cores, the
+	 * threads of the two component small cores will be
+	 * {0, 2, 4, 6} and {1, 3, 5, 7}.
+	 */
+	for (i = 0; i < nr_groups; i++) {
+		int group_start = i * threads_per_group;
+
+		for (j = 0; j < threads_per_group - 1; j++) {
+			int cur_idx = group_start + j;
+
+			cur_cpu = tg->thread_list[cur_idx];
+			next_cpu = tg->thread_list[cur_idx + 1];
+			if (next_cpu != cur_cpu + nr_groups)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
+static inline void enable_asym_smt_feature(void)
+{
+	int key = __builtin_ctzl(CPU_FTR_ASYM_SMT);
+
+	cur_cpu_spec->cpu_features |= CPU_FTR_ASYM_SMT;
+	static_branch_enable(&cpu_feature_keys[key]);
+	pr_info("Enabling ASYM_SMT on interleaved big-cores\n");
+}
+#else
+#define enable_asym_smt_feature()
+#endif
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *                  cpu_possible_mask
@@ -604,6 +667,7 @@ void __init smp_setup_cpu_maps(void)
 	struct device_node *dn;
 	int cpu = 0;
 	int nthreads = 1;
+	bool has_interleaved_big_cores = true;
 
 	has_big_cores = true;
 	DBG("smp_setup_cpu_maps()\n");
@@ -657,6 +721,12 @@ void __init smp_setup_cpu_maps(void)
 
 		if (has_big_cores && !dt_has_big_core(dn, &tg)) {
 			has_big_cores = false;
+			has_interleaved_big_cores = false;
+		}
+
+		if (has_interleaved_big_cores) {
+			has_interleaved_big_cores =
+				check_interleaved_big_core(&tg);
 		}
 
 		if (cpu >= nr_cpu_ids) {
@@ -713,7 +783,10 @@ void __init smp_setup_cpu_maps(void)
 	vdso_data->processorCount = num_present_cpus();
 #endif /* CONFIG_PPC64 */
 
-        /* Initialize CPU <=> thread mapping/
+	if (has_interleaved_big_cores)
+		enable_asym_smt_feature();
+
+	/* Initialize CPU <=> thread mapping/
 	 *
 	 * WARNING: We assume that the number of threads is the same for
 	 * every CPU in the system. If that is not the case, then some code
-- 
1.9.4


* [PATCH v4 1/2] powerpc: Detect the presence of big-cores via "ibm,thread-groups"
From: Gautham R. Shenoy @ 2018-07-24  6:14 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Michael Neuling,
	Vaidyanathan Srinivasan, Akshay Adiga, Shilpasri G Bhat,
	Oliver O'Halloran, Nicholas Piggin, Murilo Opsfelder Araujo
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy
In-Reply-To: <1532412848-9826-1-git-send-email-ego@linux.vnet.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

On IBM POWER9, the device tree exposes a property array identified by
"ibm,thread-groups" which indicates which groups of threads share a
particular set of resources.

As of today we only have one form of grouping identifying the group of
threads in the core that share the L1 cache, translation cache and
instruction data flow.

This patch defines the helper function to parse the contents of
"ibm,thread-groups" and a new structure to contain the parsed output.

The patch also creates the sysfs file named "small_core_siblings" that
returns the physical ids of the threads in the core that share the L1
cache, translation cache and instruction data flow.
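
The layout of the property, as described in the comments of this patch,
is: the grouping property, the number of groups, the number of threads
per group, and then the thread list. The following is a minimal
userspace sketch of that decoding, not the kernel helper itself:
decode_thread_groups() is a hypothetical name, the cells are assumed to
be already converted to host byte order, and the explicit bound check
against MAX_THREAD_LIST_SIZE is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREAD_LIST_SIZE	8

struct thread_groups {
	unsigned int property;
	unsigned int nr_groups;
	unsigned int threads_per_group;
	unsigned int thread_list[MAX_THREAD_LIST_SIZE];
};

/*
 * Hypothetical userspace decoder for a flat array of property cells.
 * Returns 0 on success, -1 if the array is too short or the thread
 * list would not fit in tg->thread_list.
 */
int decode_thread_groups(const unsigned int *cells, size_t ncells,
			 struct thread_groups *tg)
{
	unsigned long long total;
	size_t i;

	assert(cells != NULL && tg != NULL);

	/* The three header cells must be present. */
	if (ncells < 3)
		return -1;

	/* Widen before multiplying so the product cannot wrap. */
	total = (unsigned long long)cells[1] * cells[2];
	if (total > MAX_THREAD_LIST_SIZE || ncells < 3 + total)
		return -1;

	tg->property = cells[0];
	tg->nr_groups = cells[1];
	tg->threads_per_group = cells[2];
	for (i = 0; i < total; i++)
		tg->thread_list[i] = cells[3 + i];

	return 0;
}
```

Feeding it the example array [1,2,4,5,6,7,8,9,10,11,12] yields two
groups of four threads, {5,6,7,8} and {9,10,11,12}.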

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   8 ++
 arch/powerpc/include/asm/cputhreads.h              |  22 +++
 arch/powerpc/kernel/setup-common.c                 | 154 +++++++++++++++++++++
 arch/powerpc/kernel/sysfs.c                        |  35 +++++
 4 files changed, 219 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 9c5e7732..41adf1d 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -487,3 +487,11 @@ Description:	Information about CPU vulnerabilities
 		"Not affected"	  CPU is not affected by the vulnerability
 		"Vulnerable"	  CPU is affected and no mitigation in effect
 		"Mitigation: $M"  CPU is affected and mitigation $M is in effect
+
+What: 		/sys/devices/system/cpu/cpu[0-9]+/small_core_siblings
+Date:		24-Jul-2018
+KernelVersion:	v4.18.0
+Contact:	Gautham R. Shenoy <ego@linux.vnet.ibm.com>
+Description:	List of Physical ids of CPUs which share the L1 cache,
+		translation cache and instruction data-flow with this CPU.
+Values:		Comma separated list of decimal integers.
diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
index d71a909..33226d7 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -23,11 +23,13 @@
 extern int threads_per_core;
 extern int threads_per_subcore;
 extern int threads_shift;
+extern bool has_big_cores;
 extern cpumask_t threads_core_mask;
 #else
 #define threads_per_core	1
 #define threads_per_subcore	1
 #define threads_shift		0
+#define has_big_cores		0
 #define threads_core_mask	(*get_cpu_mask(0))
 #endif
 
@@ -69,12 +71,32 @@ static inline cpumask_t cpu_online_cores_map(void)
 	return cpu_thread_mask_to_cores(cpu_online_mask);
 }
 
+#define MAX_THREAD_LIST_SIZE	8
+struct thread_groups {
+	unsigned int property;
+	unsigned int nr_groups;
+	unsigned int threads_per_group;
+	unsigned int thread_list[MAX_THREAD_LIST_SIZE];
+};
+
 #ifdef CONFIG_SMP
 int cpu_core_index_of_thread(int cpu);
 int cpu_first_thread_of_core(int core);
+int parse_thread_groups(struct device_node *dn, struct thread_groups *tg);
+int get_cpu_thread_group_start(int cpu, struct thread_groups *tg);
 #else
 static inline int cpu_core_index_of_thread(int cpu) { return cpu; }
 static inline int cpu_first_thread_of_core(int core) { return core; }
+static inline int parse_thread_groups(struct device_node *dn,
+				      struct thread_groups *tg)
+{
+	return -ENODATA;
+}
+
+static inline int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
+{
+	return -1;
+}
 #endif
 
 static inline int cpu_thread_in_core(int cpu)
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 40b44bb..989edc1 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -402,10 +402,12 @@ void __init check_for_initrd(void)
 #ifdef CONFIG_SMP
 
 int threads_per_core, threads_per_subcore, threads_shift;
+bool has_big_cores;
 cpumask_t threads_core_mask;
 EXPORT_SYMBOL_GPL(threads_per_core);
 EXPORT_SYMBOL_GPL(threads_per_subcore);
 EXPORT_SYMBOL_GPL(threads_shift);
+EXPORT_SYMBOL_GPL(has_big_cores);
 EXPORT_SYMBOL_GPL(threads_core_mask);
 
 static void __init cpu_init_thread_core_maps(int tpc)
@@ -433,6 +435,152 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+/*
+ * parse_thread_groups: Parses the "ibm,thread-groups" device tree
+ *                      property for the CPU device node @dn and stores
+ *                      the parsed output in the thread_groups
+ *                      structure @tg.
+ *
+ * @dn: The device node of the CPU device.
+ * @tg: Pointer to a thread group structure into which the parsed
+ *     output of "ibm,thread-groups" is stored.
+ *
+ * ibm,thread-groups[0..N-1] array defines which group of threads in
+ * the CPU-device node can be grouped together based on the property.
+ *
+ * ibm,thread-groups[0] tells us the property based on which the
+ * threads are being grouped together. If this value is 1, it implies
+ * that the threads in the same group share L1, translation cache.
+ *
+ * ibm,thread-groups[1] tells us how many such thread groups exist.
+ *
+ * ibm,thread-groups[2] tells us the number of threads in each such
+ * group.
+ *
+ * ibm,thread-groups[3..N-1] is the list of threads identified by
+ * "ibm,ppc-interrupt-server#s" arranged as per their membership in
+ * the grouping.
+ *
+ * Example: If ibm,thread-groups = [1,2,4,5,6,7,8,9,10,11,12] it
+ * implies that there are 2 groups of 4 threads each, where each group
+ * of threads share L1, translation cache.
+ *
+ * The "ibm,ppc-interrupt-server#s" of the first group is {5,6,7,8}
+ * and the "ibm,ppc-interrupt-server#s" of the second group is {9, 10,
+ * 11, 12}.
+ *
+ * Returns 0 on success, -EINVAL if the property does not exist,
+ * -ENODATA if property does not have a value, and -EOVERFLOW if the
+ * property data isn't large enough.
+ */
+int parse_thread_groups(struct device_node *dn,
+			struct thread_groups *tg)
+{
+	unsigned int nr_groups, threads_per_group, property;
+	int i;
+	u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
+	u32 *thread_list;
+	size_t total_threads;
+	int ret;
+
+	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
+					 thread_group_array, 3);
+
+	if (ret)
+		goto out_err;
+
+	property = thread_group_array[0];
+	nr_groups = thread_group_array[1];
+	threads_per_group = thread_group_array[2];
+	total_threads = nr_groups * threads_per_group;
+
+	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
+					 thread_group_array,
+					 3 + total_threads);
+	if (ret)
+		goto out_err;
+
+	thread_list = &thread_group_array[3];
+
+	for (i = 0 ; i < total_threads; i++)
+		tg->thread_list[i] = thread_list[i];
+
+	tg->property = property;
+	tg->nr_groups = nr_groups;
+	tg->threads_per_group = threads_per_group;
+
+	return 0;
+out_err:
+	tg->property = 0;
+	tg->nr_groups = 0;
+	tg->threads_per_group = 0;
+	return ret;
+}
+
+/*
+ * dt_has_big_core : Parses the device tree property
+ *		    "ibm,thread-groups" for device node pointed by @dn
+ *		    and stores the parsed output in the structure
+ *		    pointed to by @tg.  Then checks if the output in
+ *		    @tg corresponds to a big-core.
+ *
+ * @dn: Device node pointer of the CPU node being checked for a
+ *      big-core.
+ * @tg: Pointer to thread_groups struct in which parsed output of
+ *      "ibm,thread-groups" is recorded.
+ *
+ * Returns true if the @dn points to a big-core.
+ * Returns false if there is an error in parsing "ibm,thread-groups"
+ * or the parsed output doesn't correspond to a big-core.
+ */
+static inline bool dt_has_big_core(struct device_node *dn,
+				   struct thread_groups *tg)
+{
+	if (parse_thread_groups(dn, tg))
+		return false;
+
+	if (tg->property != 1)
+		return false;
+
+	if (tg->nr_groups < 1)
+		return false;
+
+	return true;
+}
+
+/*
+ * get_cpu_thread_group_start : Searches the thread group in tg->thread_list
+ *                              that @cpu belongs to.
+ *
+ * @cpu : The logical CPU whose thread group is being searched.
+ * @tg : The thread-group structure of the CPU node which @cpu belongs
+ *       to.
+ *
+ * Returns the index to tg->thread_list that points to the start
+ * of the thread_group that @cpu belongs to.
+ *
+ * Returns -1 if cpu doesn't belong to any of the groups pointed to by
+ * tg->thread_list.
+ */
+int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
+{
+	int hw_cpu_id = get_hard_smp_processor_id(cpu);
+	int i, j;
+
+	for (i = 0; i < tg->nr_groups; i++) {
+		int group_start = i * tg->threads_per_group;
+
+		for (j = 0; j < tg->threads_per_group; j++) {
+			int idx = group_start + j;
+
+			if (tg->thread_list[idx] == hw_cpu_id)
+				return group_start;
+		}
+	}
+
+	return -1;
+}
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *                  cpu_possible_mask
@@ -457,6 +605,7 @@ void __init smp_setup_cpu_maps(void)
 	int cpu = 0;
 	int nthreads = 1;
 
+	has_big_cores = true;
 	DBG("smp_setup_cpu_maps()\n");
 
 	cpu_to_phys_id = __va(memblock_alloc(nr_cpu_ids * sizeof(u32),
@@ -467,6 +616,7 @@ void __init smp_setup_cpu_maps(void)
 		const __be32 *intserv;
 		__be32 cpu_be;
 		int j, len;
+		struct thread_groups tg;
 
 		DBG("  * %pOF...\n", dn);
 
@@ -505,6 +655,10 @@ void __init smp_setup_cpu_maps(void)
 			cpu++;
 		}
 
+		if (has_big_cores && !dt_has_big_core(dn, &tg)) {
+			has_big_cores = false;
+		}
+
 		if (cpu >= nr_cpu_ids) {
 			of_node_put(dn);
 			break;
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 755dc98..f5717de 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -18,6 +18,7 @@
 #include <asm/smp.h>
 #include <asm/pmc.h>
 #include <asm/firmware.h>
+#include <asm/cputhreads.h>
 
 #include "cacheinfo.h"
 #include "setup.h"
@@ -1025,6 +1026,33 @@ static ssize_t show_physical_id(struct device *dev,
 }
 static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
 
+static ssize_t show_small_core_siblings(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, dev);
+	struct device_node *dn = of_get_cpu_node(cpu->dev.id, NULL);
+	struct thread_groups tg;
+	int i, j;
+	ssize_t ret = 0;
+
+	if (parse_thread_groups(dn, &tg))
+		return -ENODATA;
+
+	i = get_cpu_thread_group_start(cpu->dev.id, &tg);
+
+	if (i == -1)
+		return -ENODATA;
+
+	for (j = 0; j < tg.threads_per_group - 1; j++)
+		ret += sprintf(buf + ret, "%d,", tg.thread_list[i + j]);
+
+	ret += sprintf(buf + ret, "%d\n", tg.thread_list[i + j]);
+
+	return ret;
+}
+static DEVICE_ATTR(small_core_siblings, 0444, show_small_core_siblings, NULL);
+
 static int __init topology_init(void)
 {
 	int cpu, r;
@@ -1048,6 +1076,13 @@ static int __init topology_init(void)
 			register_cpu(c, cpu);
 
 			device_create_file(&c->dev, &dev_attr_physical_id);
+
+			if (has_big_cores) {
+				const struct device_attribute *attr =
+				       &dev_attr_small_core_siblings;
+
+			       device_create_file(&c->dev, attr);
+			}
 		}
 	}
 	r = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/topology:online",
-- 
1.9.4


* [PATCH v4 0/2] powerpc: Detection and scheduler optimization for POWER9 bigcore
From: Gautham R. Shenoy @ 2018-07-24  6:14 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Michael Neuling,
	Vaidyanathan Srinivasan, Akshay Adiga, Shilpasri G Bhat,
	Oliver O'Halloran, Nicholas Piggin, Murilo Opsfelder Araujo
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Hi,

This is the fourth iteration of the patchset to add support for
big-core on POWER9.

The previous versions can be found here:

v3: https://lkml.org/lkml/2018/7/6/255
v2: https://lkml.org/lkml/2018/7/3/401
v1: https://lkml.org/lkml/2018/5/11/245

Changes :
v3 --> v4:
   - Build fix for powerpc-g5 : Enable CPU_FTR_ASYM_SMT only on
     CONFIG_PPC_POWERNV and CONFIG_PPC_PSERIES.
   - Fixed a minor error in the ABI description.

v2 --> v3
    - Set sane values in the tg->property, tg->nr_groups inside
    parse_thread_groups before returning due to an error.
    - Define a helper function to determine whether a CPU device node
      is a big-core or not.
    - Updated the comments around the functions to describe the
      arguments passed to them.

v1 --> v2
    - Added comments explaining the "ibm,thread-groups" device tree property.
    - Uses cleaner device-tree parsing functions to parse the u32 arrays.
    - Adds a sysfs file listing the small-core siblings for every CPU.
    - Enables the scheduler optimization by setting the CPU_FTR_ASYM_SMT bit
      in the cur_cpu_spec->cpu_features on detecting the presence
      of interleaved big-core.
    - Handles the corner case where there is only a single thread-group
      or when there is a single thread in a thread-group.

Description:
~~~~~~~~~~~~~~~~~~~~
A pair of IBM POWER9 SMT4 cores can be fused together to form a
big-core with 8 SMT threads. This can be discovered via the
"ibm,thread-groups" CPU property in the device tree, which indicates
which groups of threads share the L1 cache, translation cache and
instruction data flow. If there are multiple such groups of threads,
then the core is a big-core. Furthermore, the thread-ids of such a
big-core are obtained by interleaving the thread-ids of the component
SMT4 cores.

Eg: Threads in the pair of component SMT4 cores of an interleaved
big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.

On such a big-core, when multiple tasks are scheduled to run on the
big-core, we get the best performance when the tasks are spread across
the pair of SMT4 cores.

The Linux scheduler supports a flag called "SD_ASYM_PACKING" which,
when set in the SMT sched-domain, biases the load-balancing of tasks
towards the smaller numbered threads in the core. On a big-core
whose threads are interleavings of the threads of the small cores,
enabling SD_ASYM_PACKING in the SMT sched-domain automatically results
in spreading the tasks uniformly across the associated pair of SMT4
cores, thereby yielding better performance.
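
Why packing onto low-numbered threads spreads the load can be seen
with a small model. The following is a rough userspace sketch under
idealized assumptions: tasks simply occupy threads 0..ntasks-1, which
is a simplification of what the load balancer does, and the names
small_core_of() and tasks_on_small_core() are hypothetical.

```c
#include <assert.h>

#define NR_SMALL_CORES	2
#define NR_THREADS	8

/* With interleaved numbering, thread id t sits on small core t % 2. */
int small_core_of(int thread_id)
{
	return thread_id % NR_SMALL_CORES;
}

/*
 * Idealized packing: ntasks tasks occupy threads 0..ntasks-1.
 * Returns how many of them land on the given small core.
 */
int tasks_on_small_core(int ntasks, int core)
{
	int t, n = 0;

	assert(ntasks >= 0 && ntasks <= NR_THREADS);

	for (t = 0; t < ntasks; t++) {
		if (small_core_of(t) == core)
			n++;
	}

	return n;
}
```

In this model two tasks end up on threads 0 and 1, that is, one task
per small core, and the per-core counts never differ by more than one.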

This patchset contains two patches which, on detecting the presence of
interleaved big-cores, enable the CPU_FTR_ASYM_SMT bit in
cur_cpu_spec->cpu_features.

Patch 1: adds support to detect the presence of
big-cores and reports the small-core siblings of each CPU X
via the sysfs file "/sys/devices/system/cpu/cpuX/small_core_siblings".

Patch 2: checks if the thread-ids of the component small-cores are
interleaved, in which case we enable the CPU_FTR_ASYM_SMT bit in
the cur_cpu_spec->cpu_features which results in the SD_ASYM_PACKING
flag being set at the SMT level sched-domain.

Results:
~~~~~~~~~~~~~~~~~
Experimental results for ebizzy with 2 threads, bound to a single big-core
show a marked improvement with this patchset over the 4.18-rc5 vanilla
kernel.

The result of 100 such runs for 4.18-rc5 kernel and the 4.18-rc5 +
big-core-patches are as follows

4.18-rc5 vanilla:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        records/s    :  # samples  : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[0 - 1000000]        :      0      : #
[1000000 - 2000000]  :      7      : ##
[2000000 - 3000000]  :      17     : ####
[3000000 - 4000000]  :      18     : ####
[4000000 - 5000000]  :      3      : #
[5000000 - 6000000]  :      55     : ############

4.18-rc5 + big-core-patches:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        records/s    :  # samples  : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[0 - 1000000]        :      0      : #
[1000000 - 2000000]  :      0      : #
[2000000 - 3000000]  :      8      : ##
[3000000 - 4000000]  :      0      : #
[4000000 - 5000000]  :      0      : #
[5000000 - 6000000]  :      92     : ###################

Gautham R. Shenoy (2):
  powerpc: Detect the presence of big-cores via "ibm,thread-groups"
  powerpc: Enable CPU_FTR_ASYM_SMT for interleaved big-cores

 Documentation/ABI/testing/sysfs-devices-system-cpu |   8 +
 arch/powerpc/include/asm/cputhreads.h              |  22 ++
 arch/powerpc/kernel/setup-common.c                 | 229 ++++++++++++++++++++-
 arch/powerpc/kernel/sysfs.c                        |  35 ++++
 4 files changed, 293 insertions(+), 1 deletion(-)

-- 
1.9.4
