LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH kernel v7 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: David Gibson @ 2018-07-18  2:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, kvm-ppc, Aneesh Kumar K.V, Alex Williamson,
	Michael Ellerman, Nicholas Piggin, Paul Mackerras
In-Reply-To: <20180717071913.2167-3-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 10772 bytes --]

On Tue, Jul 17, 2018 at 05:19:13PM +1000, Alexey Kardashevskiy wrote:
> A VM which has:
>  - a DMA capable device passed through to it (eg. network card);
>  - running a malicious kernel that ignores H_PUT_TCE failure;
>  - capability of using IOMMU pages bigger that physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
> 
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
> 
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
> 
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> did not hit this yet as the very first time when the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls VFIO IOMMU driver, this fails and
> the guest does not retry,
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
> 
> This calculates maximum page size as a minimum of the natural region
> alignment and compound page size. For the page shift this uses the shift
> returned by find_linux_pte() which indicates how the page is mapped to
> the current userspace - if the page is huge and this is not a zero, then
> it is a leaf pte and the page is mapped within the range.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> 
> v6 got a couple of rb's but since the patch has changed again, I am not
> putting them here yet.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> 
> ---
> Changes:
> v7:
> * do not fail if pte is not found, fall back to the default case instead
> 
> v6:
> * replaced hugetlbfs with pageshift from find_linux_pte()
> 
> v5:
> * only consider compound pages from hugetlbfs
> 
> v4:
> * reimplemented max pageshift calculation
> 
> v3:
> * fixed upper limit for the page size
> * added checks that we don't register parts of a huge page
> 
> v2:
> * explicitely check for compound pages before calling compound_order()
> 
> ---
> The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> for IOMMU pages without checking the mmu pagesize and this will fail
> at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> 
> With the change, mapping will fail in KVM and the guest will print:
> 
> mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 ++--
>  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
>  arch/powerpc/mm/mmu_context_iommu.c    | 37 ++++++++++++++++++++++++++++++++--
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
>  5 files changed, 43 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 896efa5..79d570c 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa);
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa);
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index d066e37..8c456fa 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
>  		return H_TOO_HARD;
>  
> -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
>  		return H_HARDWARE;
>  
>  	if (mm_iommu_mapped_inc(mem))
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 925fc31..5b298f5 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>  	if (!mem)
>  		return H_TOO_HARD;
>  
> -	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
> +			&hpa)))
>  		return H_HARDWARE;
>  
>  	pua = (void *) vmalloc_to_phys(pua);
> @@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  
>  		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
>  		if (mem)
> -			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
> +					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
>  	}
>  
>  	if (!prereg) {
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index abb4364..a4ca576 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -19,6 +19,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/swap.h>
>  #include <asm/mmu_context.h>
> +#include <asm/pte-walk.h>
>  
>  static DEFINE_MUTEX(mem_list_mutex);
>  
> @@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
>  	struct rcu_head rcu;
>  	unsigned long used;
>  	atomic64_t mapped;
> +	unsigned int pageshift;
>  	u64 ua;			/* userspace address */
>  	u64 entries;		/* number of entries in hpas[] */
>  	u64 *hpas;		/* vmalloc'ed */
> @@ -125,6 +127,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  	long i, j, ret = 0, locked_entries = 0;
> +	unsigned int pageshift;
> +	unsigned long flags;
>  	struct page *page = NULL;
>  
>  	mutex_lock(&mem_list_mutex);
> @@ -159,6 +163,12 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		goto unlock_exit;
>  	}
>  
> +	/*
> +	 * For a starting point for a maximum page size calculation
> +	 * we use @ua and @entries natural alignment to allow IOMMU pages
> +	 * smaller than huge pages but still bigger than PAGE_SIZE.
> +	 */
> +	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
>  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
>  	if (!mem->hpas) {
>  		kfree(mem);
> @@ -199,6 +209,23 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  			}
>  		}
>  populate:
> +		pageshift = PAGE_SHIFT;
> +		if (PageCompound(page)) {
> +			pte_t *pte;
> +			struct page *head = compound_head(page);
> +			unsigned int compshift = compound_order(head);
> +
> +			local_irq_save(flags); /* disables as well */
> +			pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
> +			local_irq_restore(flags);
> +
> +			/* Double check it is still the same pinned page */
> +			if (pte && pte_page(*pte) == head &&
> +					pageshift == compshift)
> +				pageshift = max_t(unsigned int, pageshift,
> +						PAGE_SHIFT);
> +		}
> +		mem->pageshift = min(mem->pageshift, pageshift);
>  		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>  	}
>  
> @@ -349,7 +376,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  EXPORT_SYMBOL_GPL(mm_iommu_find);
>  
>  long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa)
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
>  {
>  	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
>  	u64 *va = &mem->hpas[entry];
> @@ -357,6 +384,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  	if (entry >= mem->entries)
>  		return -EFAULT;
>  
> +	if (pageshift > mem->pageshift)
> +		return -EFAULT;
> +
>  	*hpa = *va | (ua & ~PAGE_MASK);
>  
>  	return 0;
> @@ -364,7 +394,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
>  
>  long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa)
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
>  {
>  	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
>  	void *va = &mem->hpas[entry];
> @@ -373,6 +403,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
>  	if (entry >= mem->entries)
>  		return -EFAULT;
>  
> +	if (pageshift > mem->pageshift)
> +		return -EFAULT;
> +
>  	pa = (void *) vmalloc_to_phys(va);
>  	if (!pa)
>  		return -EFAULT;
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 2da5f05..7cd63b0 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
>  	if (!mem)
>  		return -EINVAL;
>  
> -	ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
> +	ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa);
>  	if (ret)
>  		return -EINVAL;
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v7 0/4] resource: Use list_head to link sibling resource
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He

This patchset is doing:
1) Move reparent_resources() to kernel/resource.c to clean up duplicated
   code in arch/microblaze/pci/pci-common.c and
   arch/powerpc/kernel/pci-common.c .
2) Replace struct resource's sibling list from singly linked list to
   list_head. Clearing out those pointer operation within singly linked
   list for better code readability.
2) Based on list_head replacement, add a new function
   walk_system_ram_res_rev() which can does reversed iteration on
   iomem_resource's siblings.
3) Change kexec_file loading to search system RAM top down for kernel
   loadin, using walk_system_ram_res_rev().

Note:
This patchset only passed testing on  x86_64 arch with network
enabling. The thing we need pay attetion to is that a root resource's
child member need be initialized specifically with LIST_HEAD_INIT() if
statically defined or INIT_LIST_HEAD() for dynamically definition. Here
Just like we do for iomem_resource/ioport_resource, or the change in
get_pci_domain_busn_res().

v6:
http://lkml.kernel.org/r/20180704041038.8190-1-bhe@redhat.com

v5:
http://lkml.kernel.org/r/20180612032831.29747-1-bhe@redhat.com

v4:
http://lkml.kernel.org/r/20180507063224.24229-1-bhe@redhat.com

v3:
http://lkml.kernel.org/r/20180419001848.3041-1-bhe@redhat.com

v2:
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com

v1:
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Changelog:
v6->v7:
  Fix code bugs that test robot reported on mips and ia64.

  Add error code description in reparent_resources() according to
  Andy's comment, and fix minor log typo.
v5->v6:
  Fix code style problems in reparent_resources() and use existing
  error codes, according to Andy's suggestion.

  Fix bugs test robot reported.

v4->v5:
  Add new patch 0001 to move duplicated reparent_resources() to
  kernel/resource.c to make it be shared by different ARCH-es.

  Fix several code bugs reported by test robot on ARCH powerpc and
  microblaze.
v3->v4:
  Fix several bugs test robot reported. Rewrite cover letter and patch
  log according to reviewer's comment.

v2->v3:
  Rename resource functions first_child() and sibling() to
  resource_first_chils() and resource_sibling(). Dan suggested this.

  Move resource_first_chils() and resource_sibling() to linux/ioport.h
  and make them as inline function. Rob suggested this. Accordingly add
  linux/list.h including in linux/ioport.h, please help review if this
  bring efficiency degradation or code redundancy.

  The change on struct resource {} bring two pointers of size increase,
  mention this in git log to make it more specifically, Rob suggested
  this.

v1->v2:
  Use list_head instead to link resource siblings. This is suggested by
  Andrew.

  Rewrite walk_system_ram_res_rev() after list_head is taken to link
  resouce siblings.

Baoquan He (4):
  resource: Move reparent_resources() to kernel/resource.c and make it
    public
  resource: Use list_head to link sibling resource
  resource: add walk_system_ram_res_rev()
  kexec_file: Load kernel at top of system RAM if required

 arch/arm/plat-samsung/pm-check.c            |   6 +-
 arch/ia64/sn/kernel/io_init.c               |   2 +-
 arch/microblaze/pci/pci-common.c            |  41 +----
 arch/mips/pci/pci-rc32434.c                 |  12 +-
 arch/powerpc/kernel/pci-common.c            |  39 +---
 arch/sparc/kernel/ioport.c                  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h        |   4 +-
 drivers/eisa/eisa-bus.c                     |   2 +
 drivers/gpu/drm/drm_memory.c                |   3 +-
 drivers/gpu/drm/gma500/gtt.c                |   5 +-
 drivers/hv/vmbus_drv.c                      |  52 +++---
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c             |   6 +-
 drivers/nvdimm/nd.h                         |   5 +-
 drivers/of/address.c                        |   4 +-
 drivers/parisc/lba_pci.c                    |   4 +-
 drivers/pci/controller/vmd.c                |   8 +-
 drivers/pci/probe.c                         |   2 +
 drivers/pci/setup-bus.c                     |   2 +-
 include/linux/ioport.h                      |  21 ++-
 kernel/kexec_file.c                         |   2 +
 kernel/resource.c                           | 266 ++++++++++++++++++----------
 22 files changed, 260 insertions(+), 232 deletions(-)

-- 
2.13.6

^ permalink raw reply

* [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
In-Reply-To: <20180718024944.577-1-bhe@redhat.com>

reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
so that it's shared.

Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/microblaze/pci/pci-common.c | 37 -----------------------------------
 arch/powerpc/kernel/pci-common.c | 35 ---------------------------------
 include/linux/ioport.h           |  1 +
 kernel/resource.c                | 42 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index f34346d56095..7899bafab064 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
 EXPORT_SYMBOL(pcibios_add_device);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int __init reparent_resources(struct resource *parent,
-				     struct resource *res)
-{
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
-
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-		if (p->end < res->start)
-			continue;
-		if (res->end < p->start)
-			break;
-		if (p->start < res->start || p->end > res->end)
-			return -1;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
-	}
-	if (firstpp == NULL)
-		return -1;	/* didn't find any conflicting entries? */
-	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
-		p->parent = res;
-		pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
-			 p->name,
-			 (unsigned long long)p->start,
-			 (unsigned long long)p->end, res->name);
-	}
-	return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733ffffaa..926035bb378d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 EXPORT_SYMBOL(pcibios_align_resource);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int reparent_resources(struct resource *parent,
-				     struct resource *res)
-{
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
-
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-		if (p->end < res->start)
-			continue;
-		if (res->end < p->start)
-			break;
-		if (p->start < res->start || p->end > res->end)
-			return -1;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
-	}
-	if (firstpp == NULL)
-		return -1;	/* didn't find any conflicting entries? */
-	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
-		p->parent = res;
-		pr_debug("PCI: Reparented %s %pR under %s\n",
-			 p->name, p, res->name);
-	}
-	return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..dfdcd0bfe54e 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -192,6 +192,7 @@ extern int allocate_resource(struct resource *root, struct resource *new,
 struct resource *lookup_resource(struct resource *root, resource_size_t start);
 int adjust_resource(struct resource *res, resource_size_t start,
 		    resource_size_t size);
+int reparent_resources(struct resource *parent, struct resource *res);
 resource_size_t resource_alignment(struct resource *res);
 static inline resource_size_t resource_size(const struct resource *res)
 {
diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..81ccd19c1d9f 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -983,6 +983,48 @@ int adjust_resource(struct resource *res, resource_size_t start,
 }
 EXPORT_SYMBOL(adjust_resource);
 
+/**
+ * reparent_resources - reparent resource children of parent that res covers
+ * @parent: parent resource descriptor
+ * @res: resource descriptor desired by caller
+ *
+ * Returns 0 on success, -ENOTSUPP if child resource is not completely
+ * contained by 'res', -ECANCELED if no any conflicting entry found.
+ *
+ * Reparent resource children of 'parent' that conflict with 'res'
+ * under 'res', and make 'res' replace those children.
+ */
+int reparent_resources(struct resource *parent, struct resource *res)
+{
+	struct resource *p, **pp;
+	struct resource **firstpp = NULL;
+
+	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
+		if (p->end < res->start)
+			continue;
+		if (res->end < p->start)
+			break;
+		if (p->start < res->start || p->end > res->end)
+			return -ENOTSUPP;	/* not completely contained */
+		if (firstpp == NULL)
+			firstpp = pp;
+	}
+	if (firstpp == NULL)
+		return -ECANCELED; /* didn't find any conflicting entries? */
+	res->parent = parent;
+	res->child = *firstpp;
+	res->sibling = *pp;
+	*firstpp = res;
+	*pp = NULL;
+	for (p = res->child; p != NULL; p = p->sibling) {
+		p->parent = res;
+		pr_debug("PCI: Reparented %s %pR under %s\n",
+			 p->name, p, res->name);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(reparent_resources);
+
 static void __init __reserve_region_with_split(struct resource *root,
 		resource_size_t start, resource_size_t end,
 		const char *name)
-- 
2.13.6

^ permalink raw reply related

* [PATCH v7 2/4] resource: Use list_head to link sibling resource
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, linux-mips
In-Reply-To: <20180718024944.577-1-bhe@redhat.com>

The struct resource uses singly linked list to link siblings, implemented
by pointer operation. Replace it with list_head for better code readability.

Based on this list_head replacement, it will be very easy to do reverse
iteration on iomem_resource's sibling list in later patch.

Besides, type of member variables of struct resource, sibling and child, are
changed from 'struct resource *' to 'struct list_head'. This brings two
pointers of size increase.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: David Airlie <airlied@linux.ie>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jonathan Derrick <jonathan.derrick@intel.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: devel@linuxdriverproject.org
Cc: linux-input@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: devicetree@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>                                                                                             
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-mips@linux-mips.org
---
 arch/arm/plat-samsung/pm-check.c            |   6 +-
 arch/ia64/sn/kernel/io_init.c               |   2 +-
 arch/microblaze/pci/pci-common.c            |   4 +-
 arch/mips/pci/pci-rc32434.c                 |  12 +-
 arch/powerpc/kernel/pci-common.c            |   4 +-
 arch/sparc/kernel/ioport.c                  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h        |   4 +-
 drivers/eisa/eisa-bus.c                     |   2 +
 drivers/gpu/drm/drm_memory.c                |   3 +-
 drivers/gpu/drm/gma500/gtt.c                |   5 +-
 drivers/hv/vmbus_drv.c                      |  52 +++----
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c             |   6 +-
 drivers/nvdimm/nd.h                         |   5 +-
 drivers/of/address.c                        |   4 +-
 drivers/parisc/lba_pci.c                    |   4 +-
 drivers/pci/controller/vmd.c                |   8 +-
 drivers/pci/probe.c                         |   2 +
 drivers/pci/setup-bus.c                     |   2 +-
 include/linux/ioport.h                      |  17 ++-
 kernel/resource.c                           | 206 ++++++++++++++--------------
 21 files changed, 183 insertions(+), 171 deletions(-)

diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
 	while (ptr != NULL) {
-		if (ptr->child != NULL)
-			s3c_pm_run_res(ptr->child, fn, arg);
+		if (!list_empty(&ptr->child))
+			s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
 		if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
 				== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 			arg = (fn)(ptr, arg);
 		}
 
-		ptr = ptr->sibling;
+		ptr = resource_sibling(ptr);
 	}
 }
 
diff --git a/arch/ia64/sn/kernel/io_init.c b/arch/ia64/sn/kernel/io_init.c
index d63809a6adfa..338a7b7f194d 100644
--- a/arch/ia64/sn/kernel/io_init.c
+++ b/arch/ia64/sn/kernel/io_init.c
@@ -192,7 +192,7 @@ sn_io_slot_fixup(struct pci_dev *dev)
 		 * if it's already in the device structure, remove it before
 		 * inserting
 		 */
-		if (res->parent && res->parent->child)
+		if (res->parent && !list_empty(&res->parent->child))
 			release_resource(res);
 
 		if (res->flags & IORESOURCE_IO)
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 7899bafab064..2bf73e27e231 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -533,7 +533,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 			res->flags = range.flags;
 			res->start = range.cpu_addr;
 			res->end = range.cpu_addr + range.size - 1;
-			res->parent = res->child = res->sibling = NULL;
+			res->parent = NULL;
+			INIT_LIST_HEAD(&res->child);
+			INIT_LIST_HEAD(&res->sibling);
 		}
 	}
 
diff --git a/arch/mips/pci/pci-rc32434.c b/arch/mips/pci/pci-rc32434.c
index 7f6ce6d734c0..e80283df7925 100644
--- a/arch/mips/pci/pci-rc32434.c
+++ b/arch/mips/pci/pci-rc32434.c
@@ -53,8 +53,8 @@ static struct resource rc32434_res_pci_mem1 = {
 	.start = 0x50000000,
 	.end = 0x5FFFFFFF,
 	.flags = IORESOURCE_MEM,
-	.sibling = NULL,
-	.child = &rc32434_res_pci_mem2
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_mem1.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_mem1.child),
 };
 
 static struct resource rc32434_res_pci_mem2 = {
@@ -63,8 +63,8 @@ static struct resource rc32434_res_pci_mem2 = {
 	.end = 0x6FFFFFFF,
 	.flags = IORESOURCE_MEM,
 	.parent = &rc32434_res_pci_mem1,
-	.sibling = NULL,
-	.child = NULL
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_mem2.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_mem2.child),
 };
 
 static struct resource rc32434_res_pci_io1 = {
@@ -72,6 +72,8 @@ static struct resource rc32434_res_pci_io1 = {
 	.start = 0x18800000,
 	.end = 0x188FFFFF,
 	.flags = IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_io1.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_io1.child),
 };
 
 extern struct pci_ops rc32434_pci_ops;
@@ -208,6 +210,8 @@ static int __init rc32434_pci_init(void)
 
 	pr_info("PCI: Initializing PCI\n");
 
+	list_add(&rc32434_res_pci_mem2.sibling, &rc32434_res_pci_mem1.child);
+
 	ioport_resource.start = rc32434_res_pci_io1.start;
 	ioport_resource.end = rc32434_res_pci_io1.end;
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 926035bb378d..28fbe83c9daf 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -761,7 +761,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 			res->flags = range.flags;
 			res->start = range.cpu_addr;
 			res->end = range.cpu_addr + range.size - 1;
-			res->parent = res->child = res->sibling = NULL;
+			res->parent = NULL;
+			INIT_LIST_HEAD(&res->child);
+			INIT_LIST_HEAD(&res->sibling);
 		}
 	}
 }
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index cca9134cfa7d..99efe4e98b16 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -669,7 +669,7 @@ static int sparc_io_proc_show(struct seq_file *m, void *v)
 	struct resource *root = m->private, *r;
 	const char *nm;
 
-	for (r = root->child; r != NULL; r = r->sibling) {
+	list_for_each_entry(r, &root->child, sibling) {
 		if ((nm = r->name) == NULL) nm = "???";
 		seq_printf(m, "%016llx-%016llx: %s\n",
 				(unsigned long long)r->start,
diff --git a/arch/xtensa/include/asm/pci-bridge.h b/arch/xtensa/include/asm/pci-bridge.h
index 0b68c76ec1e6..f487b06817df 100644
--- a/arch/xtensa/include/asm/pci-bridge.h
+++ b/arch/xtensa/include/asm/pci-bridge.h
@@ -71,8 +71,8 @@ static inline void pcibios_init_resource(struct resource *res,
 	res->flags = flags;
 	res->name = name;
 	res->parent = NULL;
-	res->sibling = NULL;
-	res->child = NULL;
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 }
 
 
diff --git a/drivers/eisa/eisa-bus.c b/drivers/eisa/eisa-bus.c
index 1e8062f6dbfc..dba78f75fd06 100644
--- a/drivers/eisa/eisa-bus.c
+++ b/drivers/eisa/eisa-bus.c
@@ -408,6 +408,8 @@ static struct resource eisa_root_res = {
 	.start = 0,
 	.end   = 0xffffffff,
 	.flags = IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(eisa_root_res.sibling),
+	.child  = LIST_HEAD_INIT(eisa_root_res.child),
 };
 
 static int eisa_bus_count;
diff --git a/drivers/gpu/drm/drm_memory.c b/drivers/gpu/drm/drm_memory.c
index d69e4fc1ee77..33baa7fa5e41 100644
--- a/drivers/gpu/drm/drm_memory.c
+++ b/drivers/gpu/drm/drm_memory.c
@@ -155,9 +155,8 @@ u64 drm_get_max_iomem(void)
 	struct resource *tmp;
 	resource_size_t max_iomem = 0;
 
-	for (tmp = iomem_resource.child; tmp; tmp = tmp->sibling) {
+	list_for_each_entry(tmp, &iomem_resource.child, sibling)
 		max_iomem = max(max_iomem,  tmp->end);
-	}
 
 	return max_iomem;
 }
diff --git a/drivers/gpu/drm/gma500/gtt.c b/drivers/gpu/drm/gma500/gtt.c
index 3949b0990916..addd3bc009af 100644
--- a/drivers/gpu/drm/gma500/gtt.c
+++ b/drivers/gpu/drm/gma500/gtt.c
@@ -565,7 +565,7 @@ int psb_gtt_init(struct drm_device *dev, int resume)
 int psb_gtt_restore(struct drm_device *dev)
 {
 	struct drm_psb_private *dev_priv = dev->dev_private;
-	struct resource *r = dev_priv->gtt_mem->child;
+	struct resource *r;
 	struct gtt_range *range;
 	unsigned int restored = 0, total = 0, size = 0;
 
@@ -573,14 +573,13 @@ int psb_gtt_restore(struct drm_device *dev)
 	mutex_lock(&dev_priv->gtt_mutex);
 	psb_gtt_init(dev, 1);
 
-	while (r != NULL) {
+	list_for_each_entry(r, &dev_priv->gtt_mem->child, sibling) {
 		range = container_of(r, struct gtt_range, resource);
 		if (range->pages) {
 			psb_gtt_insert(dev, range, 1);
 			size += range->resource.end - range->resource.start;
 			restored++;
 		}
-		r = r->sibling;
 		total++;
 	}
 	mutex_unlock(&dev_priv->gtt_mutex);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b10fe26c4891..d87ec5a1bc4c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1412,9 +1412,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
 {
 	resource_size_t start = 0;
 	resource_size_t end = 0;
-	struct resource *new_res;
+	struct resource *new_res, *tmp;
 	struct resource **old_res = &hyperv_mmio;
-	struct resource **prev_res = NULL;
 
 	switch (res->type) {
 
@@ -1461,44 +1460,36 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
 	/*
 	 * If two ranges are adjacent, merge them.
 	 */
-	do {
-		if (!*old_res) {
-			*old_res = new_res;
-			break;
-		}
-
-		if (((*old_res)->end + 1) == new_res->start) {
-			(*old_res)->end = new_res->end;
+	if (!*old_res) {
+		*old_res = new_res;
+		return AE_OK;
+	}
+	tmp = *old_res;
+	list_for_each_entry_from(tmp, &tmp->parent->child, sibling) {
+		if ((tmp->end + 1) == new_res->start) {
+			tmp->end = new_res->end;
 			kfree(new_res);
 			break;
 		}
 
-		if ((*old_res)->start == new_res->end + 1) {
-			(*old_res)->start = new_res->start;
+		if (tmp->start == new_res->end + 1) {
+			tmp->start = new_res->start;
 			kfree(new_res);
 			break;
 		}
 
-		if ((*old_res)->start > new_res->end) {
-			new_res->sibling = *old_res;
-			if (prev_res)
-				(*prev_res)->sibling = new_res;
-			*old_res = new_res;
+		if (tmp->start > new_res->end) {
+			list_add(&new_res->sibling, tmp->sibling.prev);
 			break;
 		}
-
-		prev_res = old_res;
-		old_res = &(*old_res)->sibling;
-
-	} while (1);
+	}
 
 	return AE_OK;
 }
 
 static int vmbus_acpi_remove(struct acpi_device *device)
 {
-	struct resource *cur_res;
-	struct resource *next_res;
+	struct resource *res;
 
 	if (hyperv_mmio) {
 		if (fb_mmio) {
@@ -1507,10 +1498,9 @@ static int vmbus_acpi_remove(struct acpi_device *device)
 			fb_mmio = NULL;
 		}
 
-		for (cur_res = hyperv_mmio; cur_res; cur_res = next_res) {
-			next_res = cur_res->sibling;
-			kfree(cur_res);
-		}
+		res = hyperv_mmio;
+		list_for_each_entry_from(res, &res->parent->child, sibling)
+			kfree(res);
 	}
 
 	return 0;
@@ -1596,7 +1586,8 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
 		}
 	}
 
-	for (iter = hyperv_mmio; iter; iter = iter->sibling) {
+	iter = hyperv_mmio;
+	list_for_each_entry_from(iter, &iter->parent->child, sibling) {
 		if ((iter->start >= max) || (iter->end <= min))
 			continue;
 
@@ -1639,7 +1630,8 @@ void vmbus_free_mmio(resource_size_t start, resource_size_t size)
 	struct resource *iter;
 
 	down(&hyperv_mmio_lock);
-	for (iter = hyperv_mmio; iter; iter = iter->sibling) {
+	iter = hyperv_mmio;
+	list_for_each_entry_from(iter, &iter->parent->child, sibling) {
 		if ((iter->start >= start + size) || (iter->end <= start))
 			continue;
 
diff --git a/drivers/input/joystick/iforce/iforce-main.c b/drivers/input/joystick/iforce/iforce-main.c
index daeeb4c7e3b0..5c0be27b33ff 100644
--- a/drivers/input/joystick/iforce/iforce-main.c
+++ b/drivers/input/joystick/iforce/iforce-main.c
@@ -305,8 +305,8 @@ int iforce_init_device(struct iforce *iforce)
 	iforce->device_memory.end = 200;
 	iforce->device_memory.flags = IORESOURCE_MEM;
 	iforce->device_memory.parent = NULL;
-	iforce->device_memory.child = NULL;
-	iforce->device_memory.sibling = NULL;
+	INIT_LIST_HEAD(&iforce->device_memory.child);
+	INIT_LIST_HEAD(&iforce->device_memory.sibling);
 
 /*
  * Wait until device ready - until it sends its first response.
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 28afdd668905..f53d410d9981 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -637,7 +637,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
  retry:
 	first = 0;
 	for_each_dpa_resource(ndd, res) {
-		struct resource *next = res->sibling, *new_res = NULL;
+		struct resource *next = resource_sibling(res), *new_res = NULL;
 		resource_size_t allocate, available = 0;
 		enum alloc_loc loc = ALLOC_ERR;
 		const char *action;
@@ -763,7 +763,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 	 * an initial "pmem-reserve pass".  Only do an initial BLK allocation
 	 * when none of the DPA space is reserved.
 	 */
-	if ((is_pmem || !ndd->dpa.child) && n == to_allocate)
+	if ((is_pmem || list_empty(&ndd->dpa.child)) && n == to_allocate)
 		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
 	return n;
 }
@@ -779,7 +779,7 @@ static int merge_dpa(struct nd_region *nd_region,
  retry:
 	for_each_dpa_resource(ndd, res) {
 		int rc;
-		struct resource *next = res->sibling;
+		struct resource *next = resource_sibling(res);
 		resource_size_t end = res->start + resource_size(res);
 
 		if (!next || strcmp(res->name, label_id->id) != 0
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 32e0364b48b9..da7da15e03e7 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -102,11 +102,10 @@ unsigned sizeof_namespace_label(struct nvdimm_drvdata *ndd);
 		(unsigned long long) (res ? res->start : 0), ##arg)
 
 #define for_each_dpa_resource(ndd, res) \
-	for (res = (ndd)->dpa.child; res; res = res->sibling)
+	list_for_each_entry(res, &(ndd)->dpa.child, sibling)
 
 #define for_each_dpa_resource_safe(ndd, res, next) \
-	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
-			res; res = next, next = next ? next->sibling : NULL)
+	list_for_each_entry_safe(res, next, &(ndd)->dpa.child, sibling)
 
 struct nd_percpu_lane {
 	int count;
diff --git a/drivers/of/address.c b/drivers/of/address.c
index 53349912ac75..e2e25719ab52 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -330,7 +330,9 @@ int of_pci_range_to_resource(struct of_pci_range *range,
 {
 	int err;
 	res->flags = range->flags;
-	res->parent = res->child = res->sibling = NULL;
+	res->parent = NULL;
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 	res->name = np->full_name;
 
 	if (res->flags & IORESOURCE_IO) {
diff --git a/drivers/parisc/lba_pci.c b/drivers/parisc/lba_pci.c
index 69bd98421eb1..7482bdfd1959 100644
--- a/drivers/parisc/lba_pci.c
+++ b/drivers/parisc/lba_pci.c
@@ -170,8 +170,8 @@ lba_dump_res(struct resource *r, int d)
 	for (i = d; i ; --i) printk(" ");
 	printk(KERN_DEBUG "%p [%lx,%lx]/%lx\n", r,
 		(long)r->start, (long)r->end, r->flags);
-	lba_dump_res(r->child, d+2);
-	lba_dump_res(r->sibling, d);
+	lba_dump_res(resource_first_child(&r->child), d+2);
+	lba_dump_res(resource_sibling(r), d);
 }
 
 
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 942b64fc7f1f..e3ace20345c7 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -542,14 +542,14 @@ static struct pci_ops vmd_ops = {
 
 static void vmd_attach_resources(struct vmd_dev *vmd)
 {
-	vmd->dev->resource[VMD_MEMBAR1].child = &vmd->resources[1];
-	vmd->dev->resource[VMD_MEMBAR2].child = &vmd->resources[2];
+	list_add(&vmd->resources[1].sibling, &vmd->dev->resource[VMD_MEMBAR1].child);
+	list_add(&vmd->resources[2].sibling, &vmd->dev->resource[VMD_MEMBAR2].child);
 }
 
 static void vmd_detach_resources(struct vmd_dev *vmd)
 {
-	vmd->dev->resource[VMD_MEMBAR1].child = NULL;
-	vmd->dev->resource[VMD_MEMBAR2].child = NULL;
+	INIT_LIST_HEAD(&vmd->dev->resource[VMD_MEMBAR1].child);
+	INIT_LIST_HEAD(&vmd->dev->resource[VMD_MEMBAR2].child);
 }
 
 /*
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..9624dd1dfd49 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -59,6 +59,8 @@ static struct resource *get_pci_domain_busn_res(int domain_nr)
 	r->res.start = 0;
 	r->res.end = 0xff;
 	r->res.flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED;
+	INIT_LIST_HEAD(&r->res.child);
+	INIT_LIST_HEAD(&r->res.sibling);
 
 	list_add_tail(&r->list, &pci_domain_busn_res_list);
 
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 79b1824e83b4..8e685af8938d 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -2107,7 +2107,7 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type)
 				continue;
 
 			/* Ignore BARs which are still in use */
-			if (res->child)
+			if (!list_empty(&res->child))
 				continue;
 
 			ret = add_to_list(&saved, bridge, res, 0, 0);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index dfdcd0bfe54e..b7456ae889dd 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -12,6 +12,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/list.h>
 /*
  * Resources are tree-like, allowing
  * nesting etc..
@@ -22,7 +23,8 @@ struct resource {
 	const char *name;
 	unsigned long flags;
 	unsigned long desc;
-	struct resource *parent, *sibling, *child;
+	struct list_head child, sibling;
+	struct resource *parent;
 };
 
 /*
@@ -216,7 +218,6 @@ static inline bool resource_contains(struct resource *r1, struct resource *r2)
 	return r1->start <= r2->start && r1->end >= r2->end;
 }
 
-
 /* Convenience shorthand with allocation */
 #define request_region(start,n,name)		__request_region(&ioport_resource, (start), (n), (name), 0)
 #define request_muxed_region(start,n,name)	__request_region(&ioport_resource, (start), (n), (name), IORESOURCE_MUXED)
@@ -287,6 +288,18 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
        return (r1->start <= r2->end && r1->end >= r2->start);
 }
 
+static inline struct resource *resource_sibling(struct resource *res)
+{
+	if (res->parent && !list_is_last(&res->sibling, &res->parent->child))
+		return list_next_entry(res, sibling);
+	return NULL;
+}
+
+static inline struct resource *resource_first_child(struct list_head *head)
+{
+	return list_first_entry_or_null(head, struct resource, sibling);
+}
+
 
 #endif /* __ASSEMBLY__ */
 #endif	/* _LINUX_IOPORT_H */
diff --git a/kernel/resource.c b/kernel/resource.c
index 81ccd19c1d9f..c96e58d3d2f8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -31,6 +31,8 @@ struct resource ioport_resource = {
 	.start	= 0,
 	.end	= IO_SPACE_LIMIT,
 	.flags	= IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(ioport_resource.sibling),
+	.child  = LIST_HEAD_INIT(ioport_resource.child),
 };
 EXPORT_SYMBOL(ioport_resource);
 
@@ -39,6 +41,8 @@ struct resource iomem_resource = {
 	.start	= 0,
 	.end	= -1,
 	.flags	= IORESOURCE_MEM,
+	.sibling = LIST_HEAD_INIT(iomem_resource.sibling),
+	.child  = LIST_HEAD_INIT(iomem_resource.child),
 };
 EXPORT_SYMBOL(iomem_resource);
 
@@ -57,20 +61,20 @@ static DEFINE_RWLOCK(resource_lock);
  * by boot mem after the system is up. So for reusing the resource entry
  * we need to remember the resource.
  */
-static struct resource *bootmem_resource_free;
+static struct list_head bootmem_resource_free = LIST_HEAD_INIT(bootmem_resource_free);
 static DEFINE_SPINLOCK(bootmem_resource_lock);
 
 static struct resource *next_resource(struct resource *p, bool sibling_only)
 {
 	/* Caller wants to traverse through siblings only */
 	if (sibling_only)
-		return p->sibling;
+		return resource_sibling(p);
 
-	if (p->child)
-		return p->child;
-	while (!p->sibling && p->parent)
+	if (!list_empty(&p->child))
+		return resource_first_child(&p->child);
+	while (!resource_sibling(p) && p->parent)
 		p = p->parent;
-	return p->sibling;
+	return resource_sibling(p);
 }
 
 static void *r_next(struct seq_file *m, void *v, loff_t *pos)
@@ -90,7 +94,7 @@ static void *r_start(struct seq_file *m, loff_t *pos)
 	struct resource *p = PDE_DATA(file_inode(m->file));
 	loff_t l = 0;
 	read_lock(&resource_lock);
-	for (p = p->child; p && l < *pos; p = r_next(m, p, &l))
+	for (p = resource_first_child(&p->child); p && l < *pos; p = r_next(m, p, &l))
 		;
 	return p;
 }
@@ -153,8 +157,7 @@ static void free_resource(struct resource *res)
 
 	if (!PageSlab(virt_to_head_page(res))) {
 		spin_lock(&bootmem_resource_lock);
-		res->sibling = bootmem_resource_free;
-		bootmem_resource_free = res;
+		list_add(&res->sibling, &bootmem_resource_free);
 		spin_unlock(&bootmem_resource_lock);
 	} else {
 		kfree(res);
@@ -166,10 +169,9 @@ static struct resource *alloc_resource(gfp_t flags)
 	struct resource *res = NULL;
 
 	spin_lock(&bootmem_resource_lock);
-	if (bootmem_resource_free) {
-		res = bootmem_resource_free;
-		bootmem_resource_free = res->sibling;
-	}
+	res = resource_first_child(&bootmem_resource_free);
+	if (res)
+		list_del(&res->sibling);
 	spin_unlock(&bootmem_resource_lock);
 
 	if (res)
@@ -177,6 +179,8 @@ static struct resource *alloc_resource(gfp_t flags)
 	else
 		res = kzalloc(sizeof(struct resource), flags);
 
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 	return res;
 }
 
@@ -185,7 +189,7 @@ static struct resource * __request_resource(struct resource *root, struct resour
 {
 	resource_size_t start = new->start;
 	resource_size_t end = new->end;
-	struct resource *tmp, **p;
+	struct resource *tmp;
 
 	if (end < start)
 		return root;
@@ -193,64 +197,62 @@ static struct resource * __request_resource(struct resource *root, struct resour
 		return root;
 	if (end > root->end)
 		return root;
-	p = &root->child;
-	for (;;) {
-		tmp = *p;
-		if (!tmp || tmp->start > end) {
-			new->sibling = tmp;
-			*p = new;
+
+	if (list_empty(&root->child)) {
+		list_add(&new->sibling, &root->child);
+		new->parent = root;
+		INIT_LIST_HEAD(&new->child);
+		return NULL;
+	}
+
+	list_for_each_entry(tmp, &root->child, sibling) {
+		if (tmp->start > end) {
+			list_add(&new->sibling, tmp->sibling.prev);
 			new->parent = root;
+			INIT_LIST_HEAD(&new->child);
 			return NULL;
 		}
-		p = &tmp->sibling;
 		if (tmp->end < start)
 			continue;
 		return tmp;
 	}
+
+	list_add_tail(&new->sibling, &root->child);
+	new->parent = root;
+	INIT_LIST_HEAD(&new->child);
+	return NULL;
 }
 
 static int __release_resource(struct resource *old, bool release_child)
 {
-	struct resource *tmp, **p, *chd;
+	struct resource *tmp, *next, *chd;
 
-	p = &old->parent->child;
-	for (;;) {
-		tmp = *p;
-		if (!tmp)
-			break;
+	list_for_each_entry_safe(tmp, next, &old->parent->child, sibling) {
 		if (tmp == old) {
-			if (release_child || !(tmp->child)) {
-				*p = tmp->sibling;
+			if (release_child || list_empty(&tmp->child)) {
+				list_del(&tmp->sibling);
 			} else {
-				for (chd = tmp->child;; chd = chd->sibling) {
+				list_for_each_entry(chd, &tmp->child, sibling)
 					chd->parent = tmp->parent;
-					if (!(chd->sibling))
-						break;
-				}
-				*p = tmp->child;
-				chd->sibling = tmp->sibling;
+				list_splice(&tmp->child, tmp->sibling.prev);
+				list_del(&tmp->sibling);
 			}
+
 			old->parent = NULL;
 			return 0;
 		}
-		p = &tmp->sibling;
 	}
 	return -EINVAL;
 }
 
 static void __release_child_resources(struct resource *r)
 {
-	struct resource *tmp, *p;
+	struct resource *tmp, *next;
 	resource_size_t size;
 
-	p = r->child;
-	r->child = NULL;
-	while (p) {
-		tmp = p;
-		p = p->sibling;
-
+	list_for_each_entry_safe(tmp, next, &r->child, sibling) {
 		tmp->parent = NULL;
-		tmp->sibling = NULL;
+		list_del_init(&tmp->sibling);
 		__release_child_resources(tmp);
 
 		printk(KERN_DEBUG "release child resource %pR\n", tmp);
@@ -259,6 +261,8 @@ static void __release_child_resources(struct resource *r)
 		tmp->start = 0;
 		tmp->end = size - 1;
 	}
+
+	INIT_LIST_HEAD(&tmp->child);
 }
 
 void release_child_resources(struct resource *r)
@@ -343,7 +347,8 @@ static int find_next_iomem_res(struct resource *res, unsigned long desc,
 
 	read_lock(&resource_lock);
 
-	for (p = iomem_resource.child; p; p = next_resource(p, sibling_only)) {
+	for (p = resource_first_child(&iomem_resource.child); p;
+			p = next_resource(p, sibling_only)) {
 		if ((p->flags & res->flags) != res->flags)
 			continue;
 		if ((desc != IORES_DESC_NONE) && (desc != p->desc))
@@ -532,7 +537,7 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags,
 	struct resource *p;
 
 	read_lock(&resource_lock);
-	for (p = iomem_resource.child; p ; p = p->sibling) {
+	list_for_each_entry(p, &iomem_resource.child, sibling) {
 		bool is_type = (((p->flags & flags) == flags) &&
 				((desc == IORES_DESC_NONE) ||
 				 (desc == p->desc)));
@@ -586,7 +591,7 @@ static int __find_resource(struct resource *root, struct resource *old,
 			 resource_size_t  size,
 			 struct resource_constraint *constraint)
 {
-	struct resource *this = root->child;
+	struct resource *this = resource_first_child(&root->child);
 	struct resource tmp = *new, avail, alloc;
 
 	tmp.start = root->start;
@@ -596,7 +601,7 @@ static int __find_resource(struct resource *root, struct resource *old,
 	 */
 	if (this && this->start == root->start) {
 		tmp.start = (this == old) ? old->start : this->end + 1;
-		this = this->sibling;
+		this = resource_sibling(this);
 	}
 	for(;;) {
 		if (this)
@@ -632,7 +637,7 @@ next:		if (!this || this->end == root->end)
 
 		if (this != old)
 			tmp.start = this->end + 1;
-		this = this->sibling;
+		this = resource_sibling(this);
 	}
 	return -EBUSY;
 }
@@ -676,7 +681,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
 		goto out;
 	}
 
-	if (old->child) {
+	if (!list_empty(&old->child)) {
 		err = -EBUSY;
 		goto out;
 	}
@@ -757,7 +762,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)
 	struct resource *res;
 
 	read_lock(&resource_lock);
-	for (res = root->child; res; res = res->sibling) {
+	list_for_each_entry(res, &root->child, sibling) {
 		if (res->start == start)
 			break;
 	}
@@ -790,32 +795,27 @@ static struct resource * __insert_resource(struct resource *parent, struct resou
 			break;
 	}
 
-	for (next = first; ; next = next->sibling) {
+	for (next = first; ; next = resource_sibling(next)) {
 		/* Partial overlap? Bad, and unfixable */
 		if (next->start < new->start || next->end > new->end)
 			return next;
-		if (!next->sibling)
+		if (!resource_sibling(next))
 			break;
-		if (next->sibling->start > new->end)
+		if (resource_sibling(next)->start > new->end)
 			break;
 	}
-
 	new->parent = parent;
-	new->sibling = next->sibling;
-	new->child = first;
+	list_add(&new->sibling, &next->sibling);
+	INIT_LIST_HEAD(&new->child);
 
-	next->sibling = NULL;
-	for (next = first; next; next = next->sibling)
+	/*
+	 * From first to next, they all fall into new's region, so change them
+	 * as new's children.
+	 */
+	list_cut_position(&new->child, first->sibling.prev, &next->sibling);
+	list_for_each_entry(next, &new->child, sibling)
 		next->parent = new;
 
-	if (parent->child == first) {
-		parent->child = new;
-	} else {
-		next = parent->child;
-		while (next->sibling != first)
-			next = next->sibling;
-		next->sibling = new;
-	}
 	return NULL;
 }
 
@@ -937,19 +937,17 @@ static int __adjust_resource(struct resource *res, resource_size_t start,
 	if ((start < parent->start) || (end > parent->end))
 		goto out;
 
-	if (res->sibling && (res->sibling->start <= end))
+	if (resource_sibling(res) && (resource_sibling(res)->start <= end))
 		goto out;
 
-	tmp = parent->child;
-	if (tmp != res) {
-		while (tmp->sibling != res)
-			tmp = tmp->sibling;
+	if (res->sibling.prev != &parent->child) {
+		tmp = list_prev_entry(res, sibling);
 		if (start <= tmp->end)
 			goto out;
 	}
 
 skip:
-	for (tmp = res->child; tmp; tmp = tmp->sibling)
+	list_for_each_entry(tmp, &res->child, sibling)
 		if ((tmp->start < start) || (tmp->end > end))
 			goto out;
 
@@ -996,27 +994,30 @@ EXPORT_SYMBOL(adjust_resource);
  */
 int reparent_resources(struct resource *parent, struct resource *res)
 {
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
+	struct resource *p, *first = NULL;
 
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
+	list_for_each_entry(p, &parent->child, sibling) {
 		if (p->end < res->start)
 			continue;
 		if (res->end < p->start)
 			break;
 		if (p->start < res->start || p->end > res->end)
 			return -ENOTSUPP;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
+		if (first == NULL)
+			first = p;
 	}
-	if (firstpp == NULL)
+	if (first == NULL)
 		return -ECANCELED; /* didn't find any conflicting entries? */
 	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
+	list_add(&res->sibling, p->sibling.prev);
+	INIT_LIST_HEAD(&res->child);
+
+	/*
+	 * From first to p's previous sibling, they all fall into
+	 * res's region, change them as res's children.
+	 */
+	list_cut_position(&res->child, first->sibling.prev, res->sibling.prev);
+	list_for_each_entry(p, &res->child, sibling) {
 		p->parent = res;
 		pr_debug("PCI: Reparented %s %pR under %s\n",
 			 p->name, p, res->name);
@@ -1216,34 +1217,32 @@ EXPORT_SYMBOL(__request_region);
 void __release_region(struct resource *parent, resource_size_t start,
 			resource_size_t n)
 {
-	struct resource **p;
+	struct resource *res;
 	resource_size_t end;
 
-	p = &parent->child;
+	res = resource_first_child(&parent->child);
 	end = start + n - 1;
 
 	write_lock(&resource_lock);
 
 	for (;;) {
-		struct resource *res = *p;
-
 		if (!res)
 			break;
 		if (res->start <= start && res->end >= end) {
 			if (!(res->flags & IORESOURCE_BUSY)) {
-				p = &res->child;
+				res = resource_first_child(&res->child);
 				continue;
 			}
 			if (res->start != start || res->end != end)
 				break;
-			*p = res->sibling;
+			list_del(&res->sibling);
 			write_unlock(&resource_lock);
 			if (res->flags & IORESOURCE_MUXED)
 				wake_up(&muxed_resource_wait);
 			free_resource(res);
 			return;
 		}
-		p = &res->sibling;
+		res = resource_sibling(res);
 	}
 
 	write_unlock(&resource_lock);
@@ -1278,9 +1277,7 @@ EXPORT_SYMBOL(__release_region);
 int release_mem_region_adjustable(struct resource *parent,
 			resource_size_t start, resource_size_t size)
 {
-	struct resource **p;
-	struct resource *res;
-	struct resource *new_res;
+	struct resource *res, *new_res;
 	resource_size_t end;
 	int ret = -EINVAL;
 
@@ -1291,16 +1288,16 @@ int release_mem_region_adjustable(struct resource *parent,
 	/* The alloc_resource() result gets checked later */
 	new_res = alloc_resource(GFP_KERNEL);
 
-	p = &parent->child;
+	res = resource_first_child(&parent->child);
 	write_lock(&resource_lock);
 
-	while ((res = *p)) {
+	while ((res)) {
 		if (res->start >= end)
 			break;
 
 		/* look for the next resource if it does not fit into */
 		if (res->start > start || res->end < end) {
-			p = &res->sibling;
+			res = resource_sibling(res);
 			continue;
 		}
 
@@ -1308,14 +1305,14 @@ int release_mem_region_adjustable(struct resource *parent,
 			break;
 
 		if (!(res->flags & IORESOURCE_BUSY)) {
-			p = &res->child;
+			res = resource_first_child(&res->child);
 			continue;
 		}
 
 		/* found the target resource; let's adjust accordingly */
 		if (res->start == start && res->end == end) {
 			/* free the whole entry */
-			*p = res->sibling;
+			list_del(&res->sibling);
 			free_resource(res);
 			ret = 0;
 		} else if (res->start == start && res->end != end) {
@@ -1338,14 +1335,13 @@ int release_mem_region_adjustable(struct resource *parent,
 			new_res->flags = res->flags;
 			new_res->desc = res->desc;
 			new_res->parent = res->parent;
-			new_res->sibling = res->sibling;
-			new_res->child = NULL;
+			INIT_LIST_HEAD(&new_res->child);
 
 			ret = __adjust_resource(res, res->start,
 						start - res->start);
 			if (ret)
 				break;
-			res->sibling = new_res;
+			list_add(&new_res->sibling, &res->sibling);
 			new_res = NULL;
 		}
 
@@ -1526,7 +1522,7 @@ static int __init reserve_setup(char *str)
 			res->end = io_start + io_num - 1;
 			res->flags |= IORESOURCE_BUSY;
 			res->desc = IORES_DESC_NONE;
-			res->child = NULL;
+			INIT_LIST_HEAD(&res->child);
 			if (request_resource(parent, res) == 0)
 				reserved = x+1;
 		}
@@ -1546,7 +1542,7 @@ int iomem_map_sanity_check(resource_size_t addr, unsigned long size)
 	loff_t l;
 
 	read_lock(&resource_lock);
-	for (p = p->child; p ; p = r_next(NULL, p, &l)) {
+	for (p = resource_first_child(&p->child); p; p = r_next(NULL, p, &l)) {
 		/*
 		 * We can probably skip the resources without
 		 * IORESOURCE_IO attribute?
@@ -1602,7 +1598,7 @@ bool iomem_is_exclusive(u64 addr)
 	addr = addr & PAGE_MASK;
 
 	read_lock(&resource_lock);
-	for (p = p->child; p ; p = r_next(NULL, p, &l)) {
+	for (p = resource_first_child(&p->child); p; p = r_next(NULL, p, &l)) {
 		/*
 		 * We can probably skip the resources without
 		 * IORESOURCE_IO attribute?
-- 
2.13.6

^ permalink raw reply related

* [PATCH v7 3/4] resource: add walk_system_ram_res_rev()
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He
In-Reply-To: <20180718024944.577-1-bhe@redhat.com>

This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM
in reversed order, i.e., from higher to lower.

It will be used in kexec_file code.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c      | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b7456ae889dd..066cc263e2cc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -279,6 +279,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
 		    int (*func)(struct resource *, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+			int (*func)(struct resource *, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
 		    void *arg, int (*func)(struct resource *, void *));
 
diff --git a/kernel/resource.c b/kernel/resource.c
index c96e58d3d2f8..3e18f24b90c4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include <linux/pfn.h>
 #include <linux/mm.h>
 #include <linux/resource_ext.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
 #include <asm/io.h>
 
 
@@ -443,6 +445,44 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 }
 
 /*
+ * This function, being a variant of walk_system_ram_res(), calls the @func
+ * callback against all memory ranges of type System RAM which are marked as
+ * IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY in reversed order, i.e., from
+ * higher to lower.
+ */
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+				int (*func)(struct resource *, void *))
+{
+	unsigned long flags;
+	struct resource *res;
+	int ret = -1;
+
+	flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+	read_lock(&resource_lock);
+	list_for_each_entry_reverse(res, &iomem_resource.child, sibling) {
+		if (start >= end)
+			break;
+		if ((res->flags & flags) != flags)
+			continue;
+		if (res->desc != IORES_DESC_NONE)
+			continue;
+		if (res->end < start)
+			break;
+
+		if ((res->end >= start) && (res->start < end)) {
+			ret = (*func)(res, arg);
+			if (ret)
+				break;
+		}
+		end = res->start - 1;
+
+	}
+	read_unlock(&resource_lock);
+	return ret;
+}
+
+/*
  * This function calls the @func callback against all memory ranges, which
  * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  */
-- 
2.13.6

^ permalink raw reply related

* [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He, kexec
In-Reply-To: <20180718024944.577-1-bhe@redhat.com>

For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
is used to load kernel/initrd/purgatory is supposed to be allocated from
top to down. This is what we have been doing all along in the old kexec
loading interface and the kexec loading is still default setting in some
distributions. However, the current kexec_file loading interface doesn't
do like this. The function arch_kexec_walk_mem() it calls ignores checking
kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
all resources of System RAM from bottom to up, to try to find memory region
which can contain the specific kexec buffer, then call locate_mem_hole_callback()
to allocate memory in that found memory region from top to down. This brings
confusion especially when KASLR is widely supported , users have to make clear
why kexec/kdump kernel loading position is different between these two
interfaces in order to exclude unnecessary noises. Hence these two interfaces
need be unified on behaviour.

Here add checking if kexec_buf.top_down is 'true' in arch_kexec_walk_mem(),
if yes, call the newly added walk_system_ram_res_rev() to find memory region
from top to down to load kernel.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kexec@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c6a3b6851372..75226c1d08ce 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 					   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
 					   crashk_res.start, crashk_res.end,
 					   kbuf, func);
+	else if (kbuf->top_down)
+		return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
 	else
 		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
-- 
2.13.6

^ permalink raw reply related

* Re: [PATCH kernel v7 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Paul Mackerras @ 2018-07-18  6:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin
In-Reply-To: <20180717071913.2167-1-aik@ozlabs.ru>

On Tue, Jul 17, 2018 at 05:19:11PM +1000, Alexey Kardashevskiy wrote:
> This is to improve page boundaries checking and should probably
> be cc:stable. I came accross this while debugging nvlink2 passthrough
> but the lack of checking might be exploited by the existing userspace.
> 
> The get_user_pages() comment says it should be "phased out" but the only
> alternative seems to be get_user_pages_longterm(), should that be used
> instead (this is longterm reference elevation, however it is not DAX,
> whatever this implies)? get_user_pages_remote() seems unnecessarily
> complicated because of @locked.
> 
> 
> Changes:
> v7:
> * 2/2: do not fail if pte is not found, fall back to the default case instead
> 
> v6:
> * 2/2: read pageshift from pte
> 
> v5:
> * 2/2: changed compound pages handling
> 
> v4:
> * 2/2: implemented less strict but still safe max pageshift as David suggested
> 
> v3:
> * enforced huge pages not to cross preregistered chunk boundaries
> 
> v2:
> * 2/2: explicitly check for compound pages before calling compound_order()
> 
> 
> This is based on sha1
> 9d3cce1 Linus Torvalds "Linux 4.18-rc5".
> 
> Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (2):
>   vfio/spapr: Use IOMMU pageshift rather than pagesize
>   KVM: PPC: Check if IOMMU page is contained in the pinned physical page
> 
>  arch/powerpc/include/asm/mmu_context.h |  4 ++--
>  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
>  arch/powerpc/mm/mmu_context_iommu.c    | 37 ++++++++++++++++++++++++++++++++--
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 10 ++++-----
>  5 files changed, 47 insertions(+), 12 deletions(-)

Thanks, series applied to my kvm-ppc-next branch.

Paul.

^ permalink raw reply

* Re: [PATCH kernel] KVM: PPC: Fix hardware and emulated TCE tables matching
From: Paul Mackerras @ 2018-07-18  6:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, David Gibson, kvm-ppc
In-Reply-To: <20180620084258.1155-1-aik@ozlabs.ru>

On Wed, Jun 20, 2018 at 06:42:58PM +1000, Alexey Kardashevskiy wrote:
> When attaching a hardware table to LIOBN in KVM, we match table parameters
> such as page size, table offset and table size. However the tables are
> created via very different paths - VFIO and KVM - and the VFIO path goes
> through the platform code which has minimum TCE page size requirement
> (which is 4K but since we allocate memory by pages and cannot avoid
> alignment anyway, we align to 64k pages for powernv_defconfig).
> 
> So when we match the tables, one might be bigger that the other which
> means the hardware table cannot get attached to LIOBN and DMA mapping
> fails.
> 
> This removes the table size alignment from the guest visible table.
> This does not affect the memory allocation which is still aligned -
> kvmppc_tce_pages() takes care of this.
> 
> This relaxes the check we do when attaching tables to allow the hardware
> table be bigger than the guest visible table.
> 
> Ideally we want the KVM table to cover the same space as the hardware
> table does but since the hardware table may use multiple levels, and
> all levels must use the same table size (IODA2 design), the area it can
> actually cover might get very different from the window size which
> the guest requested, even though the guest won't map it all.
> 
> Fixes: ca1fc489cf "KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages"
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Thanks, patch applied to my kvm-ppc-next branch.

Paul.

^ permalink raw reply

* Re: [PATCH v2] KVM: PPC: remove mmio_vsx_tx_sx_enabled in KVM MMIO emulation
From: Paul Mackerras @ 2018-07-18  6:30 UTC (permalink / raw)
  To: wei.guo.simon; +Cc: kvm-ppc, kvm, linuxppc-dev
In-Reply-To: <1527472106-22682-1-git-send-email-wei.guo.simon@gmail.com>

On Mon, May 28, 2018 at 09:48:26AM +0800, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
> 
> Originally PPC KVM MMIO emulation uses only 0~31#(5 bits) for VSR
> reg number, and use mmio_vsx_tx_sx_enabled field together for
> 0~63# VSR regs.
> 
> Currently PPC KVM MMIO emulation is reimplemented with analyse_instr()
> assistence. analyse_instr() returns 0~63 for VSR register number, so
> it is not necessary to use additional mmio_vsx_tx_sx_enabled field
> any more.
> 
> This patch extends related reg bits(expand io_gpr to u16 from u8
> and use 6 bits for VSR reg#), so that mmio_vsx_tx_sx_enabled can
> be removed.
> 
> v1 -> v2 change:
> rework the commit message to remove "PR KVM" specific word.
> 
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>

Thanks, patch applied to my kvm-ppc-next branch.

Paul.

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: add of_node_put() in success path
From: Paul Mackerras @ 2018-07-18  6:29 UTC (permalink / raw)
  To: Nicholas Mc Guire
  Cc: Benjamin Herrenschmidt, Michael Ellerman, kvm-ppc, linuxppc-dev,
	linux-kernel
In-Reply-To: <1530946387-6607-1-git-send-email-hofrat@osadl.org>

On Sat, Jul 07, 2018 at 08:53:07AM +0200, Nicholas Mc Guire wrote:
>  The call to of_find_compatible_node() is returning a pointer with
> incremented refcount so it must be explicitly decremented after the
> last use. As here it is only being used for checking of node presence
> but the result is not actually used in the success path it can be
> dropped immediately.
> 
> Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
> Fixes: commit f725758b899f ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")

Thanks, patch applied to my kvm-ppc-next branch.

Paul.

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: fix constant size warning
From: Paul Mackerras @ 2018-07-18  6:29 UTC (permalink / raw)
  To: Nicholas Mc Guire
  Cc: Benjamin Herrenschmidt, Michael Ellerman, kvm-ppc, linuxppc-dev,
	linux-kernel
In-Reply-To: <1530954445-6780-1-git-send-email-hofrat@osadl.org>

On Sat, Jul 07, 2018 at 11:07:25AM +0200, Nicholas Mc Guire wrote:
>  The constants are 64bit but not explicitly declared UL resulting
> in sparse warnings. Fixed by declaring the constants UL.
> 
> Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>

Thanks, patch applied to my kvm-ppc-next branch.

Paul.

^ permalink raw reply

* Re: [PATCH 0/6] lib/crc32: treewide: Use existing define with polynomial
From: Krzysztof Kozlowski @ 2018-07-18  6:33 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, David S. Miller, Maxime Coquelin, Alexandre Torgue,
	Tom Lendacky, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, Fugang Duan, Pantelis Antoniou, Vitaly Bordug,
	Jose Abreu, Larry Finger, Florian Schilhabel, Greg Kroah-Hartman,
	Thomas Gleixner, linux-kernel, linux-crypto, linux-arm-kernel,
	netdev, linuxppc-dev, devel
In-Reply-To: <20180718001258.GA210746@gmail.com>

On 18 July 2018 at 02:12, Eric Biggers <ebiggers3@gmail.com> wrote:
> Hi Krzysztof,
>
> On Tue, Jul 17, 2018 at 06:05:35PM +0200, Krzysztof Kozlowski wrote:
>> Hi,
>>
>> Kernel defines same polynomial for CRC-32 in few places.
>> This is unnecessary duplication of the same value. Also this might
>> be error-prone for future code - every driver will define the
>> polynomial again.
>>
>> This is an attempt to unify definition of polynomial.  Few obvious
>> hard-coded locations are fixed with define.
>>
>> All series depend on each 1/6 and 2/6.
>>
>> This could be merged in two different merge windows (1st lib/crc and then
>> the rest) or taken through one tree.
>>
>> It would be nice to get some testing. Only generic lib/crc, bunzip, xz_crc32
>> and Freescale's Ethernet driver were tested on HW.  Rest got just different
>> builds.
>>
>> Best regards,
>> Krzysztof
>>
>> Krzysztof Kozlowski (6):
>>   lib/crc: Move polynomial definition to separate header
>>   lib/crc: Use consistent naming for CRC-32 polynomials
>>   crypto: stm32_crc32 - Use existing define with polynomial
>>   net: ethernet: Use existing define with polynomial
>>   staging: rtl: Use existing define with polynomial
>>   lib: Use existing define with polynomial
>>
>>  drivers/crypto/stm32/stm32_crc32.c               | 11 ++++-------
>>  drivers/net/ethernet/amd/xgbe/xgbe-dev.c         |  4 ++--
>>  drivers/net/ethernet/apple/bmac.c                |  8 ++------
>>  drivers/net/ethernet/broadcom/tg3.c              |  3 ++-
>>  drivers/net/ethernet/freescale/fec_main.c        |  4 ++--
>>  drivers/net/ethernet/freescale/fs_enet/fec.h     |  3 ---
>>  drivers/net/ethernet/freescale/fs_enet/mac-fec.c |  3 ++-
>>  drivers/net/ethernet/micrel/ks8851_mll.c         |  3 ++-
>>  drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c    |  4 ++--
>>  drivers/staging/rtl8712/rtl871x_security.c       |  5 ++---
>>  drivers/staging/rtl8723bs/core/rtw_security.c    |  5 ++---
>>  include/linux/crc32poly.h                        | 20 ++++++++++++++++++++
>>  lib/crc32.c                                      | 11 ++++++-----
>>  lib/crc32defs.h                                  | 14 --------------
>>  lib/decompress_bunzip2.c                         |  3 ++-
>>  lib/gen_crc32table.c                             |  5 +++--
>>  lib/xz/xz_crc32.c                                |  3 ++-
>>  17 files changed, 55 insertions(+), 54 deletions(-)
>>  create mode 100644 include/linux/crc32poly.h
>>
>
> Did you check whether any of these users can be converted to use the CRC
> implementations in lib/, so they wouldn't need the polynomial definition
> themselves?

I did not check but that's interesting point... The Ethernet drivers
(xgbe, tg3, fec, ks8851, dwc-xlgmac) look like could be converted to
generic implementation. The apple/bmac looks weird. The rtl WiFi
drivers in long term can be converted to use generic lib80211 for
encryption (see commit 0d4876f4e977 ("staging:r8188eu: Use lib80211 to
encrypt (TKIP) tx frames")) but that is much bigger task. The
remaining use the polynomials in different aspect:
1. XZ and BUNZIP use it to create CRC tables - probably generic
gen_crc32table.c could be used,
2. stm32_crc32.c uses it to initialize HW CRC accelerator.

I can work on Freescale FEC, xz and bunzip code because these I can
test but I would prefer to do it as follow up.

Best regards,
Krzysztof

^ permalink raw reply

* Re: [PATCH v5 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
From: Shilpasri G Bhat @ 2018-07-18  6:57 UTC (permalink / raw)
  To: Guenter Roeck, mpe; +Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego
In-Reply-To: <9e7f301e-c169-d034-eee7-a45f0afb03a1@roeck-us.net>

Hi,

On 07/17/2018 07:17 PM, Guenter Roeck wrote:
> On 07/14/2018 11:54 PM, Shilpasri G Bhat wrote:
>> On-Chip-Controller(OCC) is an embedded micro-processor in POWER9 chip
>> which measures various system and chip level sensors. These sensors
>> comprises of environmental sensors (like power, temperature, current
>> and voltage) and performance sensors (like utilization, frequency).
>> All these sensors are copied to main memory at a regular interval of
>> 100ms. OCC provides a way to select a group of sensors that is copied
>> to the main memory to increase the update frequency of selected sensor
>> groups. When a sensor-group is disabled, OCC will not copy it to main
>> memory and those sensors read 0 values.
>>
>> This patch provides support for enabling/disabling the sensor groups
>> like power, temperature, current and voltage. This patch adds new
>> per-senor sysfs attribute to disable and enable them.
>>
>> Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
>> ---
>> Changes from v4:
>> - As per Mpe's suggestion store device_node instead of phandles and
>>    clean it after init
>> - s/sg_data/sgrp_data
>>
>>   Documentation/hwmon/ibmpowernv |  43 ++++++-
>>   drivers/hwmon/ibmpowernv.c     | 256 +++++++++++++++++++++++++++++++++++------
>>   2 files changed, 265 insertions(+), 34 deletions(-)
>>
>> diff --git a/Documentation/hwmon/ibmpowernv b/Documentation/hwmon/ibmpowernv
>> index 8826ba2..5646825 100644
>> --- a/Documentation/hwmon/ibmpowernv
>> +++ b/Documentation/hwmon/ibmpowernv
>> @@ -33,9 +33,48 @@ fanX_input        Measured RPM value.
>>   fanX_min        Threshold RPM for alert generation.
>>   fanX_fault        0: No fail condition
>>               1: Failing fan
>> +
>>   tempX_input        Measured ambient temperature.
>>   tempX_max        Threshold ambient temperature for alert generation.
>> -inX_input        Measured power supply voltage
>> +tempX_highest        Historical maximum temperature
>> +tempX_lowest        Historical minimum temperature
>> +tempX_enable        Enable/disable all temperature sensors belonging to the
>> +            sub-group. In POWER9, this attribute corresponds to
>> +            each OCC. Using this attribute each OCC can be asked to
>> +            disable/enable all of its temperature sensors.
>> +            1: Enable
>> +            0: Disable
>> +
>> +inX_input        Measured power supply voltage (millivolt)
>>   inX_fault        0: No fail condition.
>>               1: Failing power supply.
>> -power1_input        System power consumption (microWatt)
>> +inX_highest        Historical maximum voltage
>> +inX_lowest        Historical minimum voltage
>> +inX_enable        Enable/disable all voltage sensors belonging to the
>> +            sub-group. In POWER9, this attribute corresponds to
>> +            each OCC. Using this attribute each OCC can be asked to
>> +            disable/enable all of its voltage sensors.
>> +            1: Enable
>> +            0: Disable
>> +
>> +powerX_input        Power consumption (microWatt)
>> +powerX_input_highest    Historical maximum power
>> +powerX_input_lowest    Historical minimum power
>> +powerX_enable        Enable/disable all power sensors belonging to the
>> +            sub-group. In POWER9, this attribute corresponds to
>> +            each OCC. Using this attribute each OCC can be asked to
>> +            disable/enable all of its power sensors.
>> +            1: Enable
>> +            0: Disable
>> +
>> +currX_input        Measured current (milliampere)
>> +currX_highest        Historical maximum current
>> +currX_lowest        Historical minimum current
>> +currX_enable        Enable/disable all current sensors belonging to the
>> +            sub-group. In POWER9, this attribute corresponds to
>> +            each OCC. Using this attribute each OCC can be asked to
>> +            disable/enable all of its current sensors.
>> +            1: Enable
>> +            0: Disable
>> +
>> +energyX_input        Cumulative energy (microJoule)
>> diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
>> index f829dad..a509b9b 100644
>> --- a/drivers/hwmon/ibmpowernv.c
>> +++ b/drivers/hwmon/ibmpowernv.c
>> @@ -90,11 +90,23 @@ struct sensor_data {
>>       char label[MAX_LABEL_LEN];
>>       char name[MAX_ATTR_LEN];
>>       struct device_attribute dev_attr;
>> +    struct sensor_group_data *sgrp_data;
>> +};
>> +
>> +struct sensor_group_data {
>> +    struct mutex mutex;
>> +    struct device_node **of_nodes;
>> +    u32 gid;
>> +    u32 nr_nodes;
>> +    enum sensors type;
>> +    bool enable;
>>   };
>>     struct platform_data {
>>       const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1];
>> +    struct sensor_group_data *sgrp_data;
>>       u32 sensors_count; /* Total count of sensors from each group */
>> +    u32 nr_sensor_groups; /* Total number of sensor groups */
>>   };
>>     static ssize_t show_sensor(struct device *dev, struct device_attribute
>> *devattr,
>> @@ -105,6 +117,9 @@ static ssize_t show_sensor(struct device *dev, struct
>> device_attribute *devattr,
>>       ssize_t ret;
>>       u64 x;
>>   +    if (sdata->sgrp_data && !sdata->sgrp_data->enable)
>> +        return -ENODATA;
>> +
>>       ret =  opal_get_sensor_data_u64(sdata->id, &x);
>>         if (ret)
>> @@ -120,6 +135,46 @@ static ssize_t show_sensor(struct device *dev, struct
>> device_attribute *devattr,
>>       return sprintf(buf, "%llu\n", x);
>>   }
>>   +static ssize_t show_enable(struct device *dev,
>> +               struct device_attribute *devattr, char *buf)
>> +{
>> +    struct sensor_data *sdata = container_of(devattr, struct sensor_data,
>> +                         dev_attr);
>> +
>> +    return sprintf(buf, "%u\n", sdata->sgrp_data->enable);
>> +}
>> +
>> +static ssize_t store_enable(struct device *dev,
>> +                struct device_attribute *devattr,
>> +                const char *buf, size_t count)
>> +{
>> +    struct sensor_data *sdata = container_of(devattr, struct sensor_data,
>> +                         dev_attr);
>> +    struct sensor_group_data *sgrp_data = sdata->sgrp_data;
>> +    bool data;
>> +    int ret;
>> +
>> +    ret = kstrtobool(buf, &data);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = mutex_lock_interruptible(&sgrp_data->mutex);
>> +    if (ret)
>> +        return ret;
>> +
>> +    if (data != sgrp_data->enable) {
>> +        ret =  sensor_group_enable(sgrp_data->gid, data);
>> +        if (!ret)
>> +            sgrp_data->enable = data;
>> +    }
>> +
>> +    if (!ret)
>> +        ret = count;
>> +
>> +    mutex_unlock(&sgrp_data->mutex);
>> +    return ret;
>> +}
>> +
>>   static ssize_t show_label(struct device *dev, struct device_attribute *devattr,
>>                 char *buf)
>>   {
>> @@ -292,12 +347,129 @@ static u32 get_sensor_hwmon_index(struct sensor_data
>> *sdata,
>>       return ++sensor_groups[sdata->type].hwmon_index;
>>   }
>>   +static int init_sensor_group_data(struct platform_device *pdev,
>> +                  struct platform_data *pdata)
>> +{
>> +    struct sensor_group_data *sgrp_data;
>> +    struct device_node *groups, *sgrp;
>> +    enum sensors type;
>> +    int count = 0, ret = 0;
>> +
>> +    groups = of_find_node_by_path("/ibm,opal/sensor-groups");
>> +    if (!groups)
>> +        return ret;
>> +
>> +    for_each_child_of_node(groups, sgrp) {
>> +        type = get_sensor_type(sgrp);
>> +        if (type != MAX_SENSOR_TYPE)
>> +            pdata->nr_sensor_groups++;
>> +    }
>> +
>> +    if (!pdata->nr_sensor_groups)
>> +        goto out;
>> +
>> +    sgrp_data = devm_kcalloc(&pdev->dev, pdata->nr_sensor_groups,
>> +                 sizeof(*sgrp_data), GFP_KERNEL);
>> +    if (!sgrp_data) {
>> +        ret = -ENOMEM;
>> +        goto out;
>> +    }
>> +
>> +    for_each_child_of_node(groups, sgrp) {
>> +        const __be32 *phandles;
>> +        int len, gid, i, k = 0;
>> +
>> +        type = get_sensor_type(sgrp);
>> +        if (type == MAX_SENSOR_TYPE)
>> +            continue;
>> +
>> +        if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
>> +            continue;
>> +
>> +        phandles = of_get_property(sgrp, "sensors", &len);
>> +        if (!phandles)
>> +            continue;
>> +
>> +        len /= sizeof(u32);
>> +        if (!len)
>> +            continue;
>> +
>> +        sgrp_data[count].of_nodes = devm_kcalloc(&pdev->dev,
>> +                        sizeof(struct device_node *),
>> +                        len, GFP_KERNEL);
>> +        if (!sgrp_data[count].of_nodes) {
>> +            ret = -ENOMEM;
>> +            of_node_put(sgrp);
>> +            goto out;
>> +        }
>> +
>> +        for (i = 0; i < len; i++) {
>> +            struct device_node *node;
>> +
>> +            node = of_parse_phandle(sgrp, "sensors", i);
>> +            if (!node)
>> +                continue;
>> +            sgrp_data[count].of_nodes[k++] = node;
> 
> I don't immediately see where "node" is used later. As mentioned below,
> I'll need to have a much closer look into the code to understand what
> is going on here.

sgrp_data[count].of_nodes is used to store the pointers to sensor nodes
belonging to sensor-group 'sgrp_data[count]' . This is being used later in
to find the sensor-group of each sensor in create_device_attrs()

> 
>> +        }
>> +
>> +        sensor_groups[type].attr_count++;
>> +        sgrp_data[count].gid = gid;
>> +        sgrp_data[count].type = type;
>> +        sgrp_data[count].nr_nodes = len;
>> +        mutex_init(&sgrp_data[count].mutex);
>> +        sgrp_data[count++].enable = false;
>> +    }
>> +    pdata->sgrp_data = sgrp_data;
>> +out:
>> +    of_node_put(groups);
>> +    return ret;
>> +}
>> +
>> +static struct sensor_group_data *get_sensor_group(struct platform_data *pdata,
>> +                          struct device_node *node,
>> +                          enum sensors type)
>> +{
>> +    struct sensor_group_data *sgrp_data = pdata->sgrp_data;
>> +    int i, j;
>> +
>> +    for (i = 0; i < pdata->nr_sensor_groups; i++) {
>> +        if (type != sgrp_data[i].type)
>> +            continue;
>> +
>> +        for (j = 0; j < sgrp_data[i].nr_nodes; j++)
>> +            if (sgrp_data[i].of_nodes[j] == node)
>> +                return &sgrp_data[i];
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +static void clean_sensor_group_of_node(struct platform_device *pdev)
>> +{
>> +    struct platform_data *pdata = platform_get_drvdata(pdev);
>> +    struct sensor_group_data *sgrp_data = pdata->sgrp_data;
>> +    int i, j;
>> +
>> +    for (i = 0; i < pdata->nr_sensor_groups; i++) {
>> +        for (j = 0; j < sgrp_data[i].nr_nodes; j++)
>> +            of_node_put(sgrp_data[i].of_nodes[j]);
>> +
>> +        devm_kfree(&pdev->dev, sgrp_data[i].of_nodes);
> 
> The whole point of calling devm_kzalloc() is that calling devm_kfree()
> is not necessary. Any call to devm_kfree() is a strong indication
> that something is wrong.
> 
>> +        sgrp_data[i].of_nodes = NULL;
>> +    }
>> +}
>> +
> 
> Ok, this will require a detailed review. I don't understand what this
> function is about. It seems that sgrp_data[].of_nodes is only allocated
> to be  freed at the end of create_dev_attrs(), suggesting that the cleared
> data isn't needed for runtime and thus should not be stored in an
> array in the first place. I'll have to understand this much better
> than I do right now before approving it.
> 

Yeah sgrp_data[].of_nodes is not required at runtime. This can be defined locally.


Thanks and Regards,
Shilpa

> Guenter
> 
>>   static int populate_attr_groups(struct platform_device *pdev)
>>   {
>>       struct platform_data *pdata = platform_get_drvdata(pdev);
>>       const struct attribute_group **pgroups = pdata->attr_groups;
>>       struct device_node *opal, *np;
>>       enum sensors type;
>> +    int ret;
>> +
>> +    ret = init_sensor_group_data(pdev, pdata);
>> +    if (ret)
>> +        return ret;
>>         opal = of_find_node_by_path("/ibm,opal/sensors");
>>       for_each_child_of_node(opal, np) {
>> @@ -344,7 +516,10 @@ static int populate_attr_groups(struct platform_device
>> *pdev)
>>   static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
>>                     ssize_t (*show)(struct device *dev,
>>                             struct device_attribute *attr,
>> -                          char *buf))
>> +                          char *buf),
>> +                ssize_t (*store)(struct device *dev,
>> +                         struct device_attribute *attr,
>> +                         const char *buf, size_t count))
>>   {
>>       snprintf(sdata->name, MAX_ATTR_LEN, "%s%d_%s",
>>            sensor_groups[sdata->type].name, sdata->hwmon_index,
>> @@ -352,23 +527,33 @@ static void create_hwmon_attr(struct sensor_data *sdata,
>> const char *attr_name,
>>         sysfs_attr_init(&sdata->dev_attr.attr);
>>       sdata->dev_attr.attr.name = sdata->name;
>> -    sdata->dev_attr.attr.mode = S_IRUGO;
>>       sdata->dev_attr.show = show;
>> +    if (store) {
>> +        sdata->dev_attr.store = store;
>> +        sdata->dev_attr.attr.mode = 0664;
>> +    } else {
>> +        sdata->dev_attr.attr.mode = 0444;
>> +    }
>>   }
>>     static void populate_sensor(struct sensor_data *sdata, int od, int hd, int
>> sid,
>>                   const char *attr_name, enum sensors type,
>>                   const struct attribute_group *pgroup,
>> +                struct sensor_group_data *sgrp_data,
>>                   ssize_t (*show)(struct device *dev,
>>                           struct device_attribute *attr,
>> -                        char *buf))
>> +                        char *buf),
>> +                ssize_t (*store)(struct device *dev,
>> +                         struct device_attribute *attr,
>> +                         const char *buf, size_t count))
>>   {
>>       sdata->id = sid;
>>       sdata->type = type;
>>       sdata->opal_index = od;
>>       sdata->hwmon_index = hd;
>> -    create_hwmon_attr(sdata, attr_name, show);
>> +    create_hwmon_attr(sdata, attr_name, show, store);
>>       pgroup->attrs[sensor_groups[type].attr_count++] = &sdata->dev_attr.attr;
>> +    sdata->sgrp_data = sgrp_data;
>>   }
>>     static char *get_max_attr(enum sensors type)
>> @@ -403,24 +588,23 @@ static int create_device_attrs(struct platform_device
>> *pdev)
>>       const struct attribute_group **pgroups = pdata->attr_groups;
>>       struct device_node *opal, *np;
>>       struct sensor_data *sdata;
>> -    u32 sensor_id;
>> -    enum sensors type;
>>       u32 count = 0;
>> -    int err = 0;
>> +    u32 group_attr_id[MAX_SENSOR_TYPE] = {0};
>>   -    opal = of_find_node_by_path("/ibm,opal/sensors");
>>       sdata = devm_kcalloc(&pdev->dev,
>>                    pdata->sensors_count, sizeof(*sdata),
>>                    GFP_KERNEL);
>> -    if (!sdata) {
>> -        err = -ENOMEM;
>> -        goto exit_put_node;
>> -    }
>> +    if (!sdata)
>> +        return -ENOMEM;
>>   +    opal = of_find_node_by_path("/ibm,opal/sensors");
>>       for_each_child_of_node(opal, np) {
>> +        struct sensor_group_data *sgrp_data;
>>           const char *attr_name;
>> -        u32 opal_index;
>> +        u32 opal_index, hw_id;
>> +        u32 sensor_id;
>>           const char *label;
>> +        enum sensors type;
>>             if (np->name == NULL)
>>               continue;
>> @@ -456,14 +640,12 @@ static int create_device_attrs(struct platform_device
>> *pdev)
>>               opal_index = INVALID_INDEX;
>>           }
>>   -        sdata[count].opal_index = opal_index;
>> -        sdata[count].hwmon_index =
>> -            get_sensor_hwmon_index(&sdata[count], sdata, count);
>> -
>> -        create_hwmon_attr(&sdata[count], attr_name, show_sensor);
>> -
>> -        pgroups[type]->attrs[sensor_groups[type].attr_count++] =
>> -                &sdata[count++].dev_attr.attr;
>> +        hw_id = get_sensor_hwmon_index(&sdata[count], sdata, count);
>> +        sgrp_data = get_sensor_group(pdata, np, type);
>> +        populate_sensor(&sdata[count], opal_index, hw_id, sensor_id,
>> +                attr_name, type, pgroups[type], sgrp_data,
>> +                show_sensor, NULL);
>> +        count++;
>>             if (!of_property_read_string(np, "label", &label)) {
>>               /*
>> @@ -474,35 +656,44 @@ static int create_device_attrs(struct platform_device
>> *pdev)
>>                */
>>                 make_sensor_label(np, &sdata[count], label);
>> -            populate_sensor(&sdata[count], opal_index,
>> -                    sdata[count - 1].hwmon_index,
>> +            populate_sensor(&sdata[count], opal_index, hw_id,
>>                       sensor_id, "label", type, pgroups[type],
>> -                    show_label);
>> +                    NULL, show_label, NULL);
>>               count++;
>>           }
>>             if (!of_property_read_u32(np, "sensor-data-max", &sensor_id)) {
>>               attr_name = get_max_attr(type);
>> -            populate_sensor(&sdata[count], opal_index,
>> -                    sdata[count - 1].hwmon_index,
>> +            populate_sensor(&sdata[count], opal_index, hw_id,
>>                       sensor_id, attr_name, type,
>> -                    pgroups[type], show_sensor);
>> +                    pgroups[type], sgrp_data, show_sensor,
>> +                    NULL);
>>               count++;
>>           }
>>             if (!of_property_read_u32(np, "sensor-data-min", &sensor_id)) {
>>               attr_name = get_min_attr(type);
>> -            populate_sensor(&sdata[count], opal_index,
>> -                    sdata[count - 1].hwmon_index,
>> +            populate_sensor(&sdata[count], opal_index, hw_id,
>>                       sensor_id, attr_name, type,
>> -                    pgroups[type], show_sensor);
>> +                    pgroups[type], sgrp_data, show_sensor,
>> +                    NULL);
>> +            count++;
>> +        }
>> +
>> +        if (sgrp_data && !sgrp_data->enable) {
>> +            sgrp_data->enable = true;
>> +            hw_id = ++group_attr_id[type];
>> +            populate_sensor(&sdata[count], opal_index, hw_id,
>> +                    sgrp_data->gid, "enable", type,
>> +                    pgroups[type], sgrp_data, show_enable,
>> +                    store_enable);
>>               count++;
>>           }
>>       }
>>   -exit_put_node:
>>       of_node_put(opal);
>> -    return err;
>> +    clean_sensor_group_of_node(pdev);
>> +    return 0;
>>   }
>>     static int ibmpowernv_probe(struct platform_device *pdev)
>> @@ -517,6 +708,7 @@ static int ibmpowernv_probe(struct platform_device *pdev)
>>         platform_set_drvdata(pdev, pdata);
>>       pdata->sensors_count = 0;
>> +    pdata->nr_sensor_groups = 0;
>>       err = populate_attr_groups(pdev);
>>       if (err)
>>           return err;
>>
> 

^ permalink raw reply

* [PATCH 1/2] powerpc/mm/book3s/radix: Drop the unnecessary retry when creating linear mapping
From: Aneesh Kumar K.V @ 2018-07-18  8:01 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe; +Cc: linuxppc-dev, Aneesh Kumar K.V

Checking for mapping_size != PAGE_SIZE should make sure we handle the
split_text_mapping correctly. Drop the retry loop.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/pgtable-radix.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index bba168d02235..d9819e573103 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -261,7 +261,6 @@ static int __meminit create_physical_mapping(unsigned long start,
 {
 	unsigned long vaddr, addr, mapping_size = 0;
 	pgprot_t prot;
-	unsigned long max_mapping_size;
 #ifdef CONFIG_STRICT_KERNEL_RWX
 	int split_text_mapping = 1;
 #else
@@ -275,12 +274,9 @@ static int __meminit create_physical_mapping(unsigned long start,
 
 		gap = end - addr;
 		previous_size = mapping_size;
-		max_mapping_size = PUD_SIZE;
 
-retry:
 		if (IS_ALIGNED(addr, PUD_SIZE) && gap >= PUD_SIZE &&
-		    mmu_psize_defs[MMU_PAGE_1G].shift &&
-		    PUD_SIZE <= max_mapping_size)
+		    mmu_psize_defs[MMU_PAGE_1G].shift)
 			mapping_size = PUD_SIZE;
 		else if (IS_ALIGNED(addr, PMD_SIZE) && gap >= PMD_SIZE &&
 			 mmu_psize_defs[MMU_PAGE_2M].shift)
@@ -288,14 +284,7 @@ static int __meminit create_physical_mapping(unsigned long start,
 		else
 			mapping_size = PAGE_SIZE;
 
-		if (split_text_mapping && (mapping_size == PUD_SIZE) &&
-			(addr <= __pa_symbol(__init_begin)) &&
-			(addr + mapping_size) >= __pa_symbol(_stext)) {
-			max_mapping_size = PMD_SIZE;
-			goto retry;
-		}
-
-		if (split_text_mapping && (mapping_size == PMD_SIZE) &&
+		if (split_text_mapping && (mapping_size != PAGE_SIZE) &&
 		    (addr <= __pa_symbol(__init_begin)) &&
 		    (addr + mapping_size) >= __pa_symbol(_stext))
 			mapping_size = PAGE_SIZE;
-- 
2.17.1

^ permalink raw reply related

* [PATCH 2/2] powerpc/mm/book3s/radix: Add mapping statistics
From: Aneesh Kumar K.V @ 2018-07-18  8:01 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <20180718080112.16633-1-aneesh.kumar@linux.ibm.com>

Add statistics that show how memory is mapped within the kernel linear mapping.
This is similar to commit 37cd944c8d8f ("s390/pgtable: add mapping statistics")

We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize)
to map the kernel linear mapping and we print the linear psize during boot as
below.

"Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24"

A sample output looks like:

DirectMap4k:           0 kB
DirectMap64k:       18432 kB
DirectMap2M:     1030144 kB
DirectMap1G:    11534336 kB

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 13 ++++++++++++
 arch/powerpc/include/asm/book3s/64/radix.h   |  3 +++
 arch/powerpc/mm/pgtable-book3s64.c           | 22 ++++++++++++++++++++
 arch/powerpc/mm/pgtable-radix.c              | 13 +++++++-----
 4 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 01ee40f11f3a..1d2d69ae1bd2 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -208,4 +208,17 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 
 #define check_pgt_cache()	do { } while (0)
 
+extern atomic_long_t direct_pages_count[MMU_PAGE_COUNT];
+static inline void update_page_count(unsigned long mapping_shift, long count)
+{
+	int level;
+
+	if (IS_ENABLED(CONFIG_PROC_FS)) {
+		level = shift_to_mmu_psize(mapping_shift);
+		if (level < 0)
+			return;
+		atomic_long_add(count, &direct_pages_count[level]);
+	}
+}
+
 #endif /* _ASM_POWERPC_BOOK3S_64_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 3ab3f7aef022..56a6e3f5f7c1 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -30,6 +30,9 @@
 #define RADIX_PUD_BAD_BITS		0x60000000000000e0UL
 #define RADIX_PGD_BAD_BITS		0x60000000000000e0UL
 
+#define RADIX_PMD_SHIFT		(PAGE_SHIFT + RADIX_PTE_INDEX_SIZE)
+#define RADIX_PUD_SHIFT		(RADIX_PMD_SHIFT + RADIX_PMD_INDEX_SIZE)
+#define RADIX_PGD_SHIFT		(RADIX_PUD_SHIFT + RADIX_PUD_INDEX_SIZE)
 /*
  * Size of EA range mapped by our pagetables.
  */
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 4afbfbb64bfd..5d2328ef7958 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -450,3 +450,25 @@ void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int index)
 	return pgtable_free(table, index);
 }
 #endif
+
+#ifdef CONFIG_PROC_FS
+atomic_long_t direct_pages_count[MMU_PAGE_COUNT];
+
+void arch_report_meminfo(struct seq_file *m)
+{
+	/*
+	 * Hash maps the memory with one size mmu_linear_psize.
+	 * So don't bother to print these on hash
+	 */
+	if (!radix_enabled())
+		return;
+	seq_printf(m, "DirectMap4k:    %8lu kB\n",
+		   atomic_long_read(&direct_pages_count[MMU_PAGE_4K]) << 2);
+	seq_printf(m, "DirectMap64k:    %8lu kB\n",
+		   atomic_long_read(&direct_pages_count[MMU_PAGE_64K]) << 6);
+	seq_printf(m, "DirectMap2M:    %8lu kB\n",
+		   atomic_long_read(&direct_pages_count[MMU_PAGE_2M]) << 11);
+	seq_printf(m, "DirectMap1G:    %8lu kB\n",
+		   atomic_long_read(&direct_pages_count[MMU_PAGE_1G]) << 20);
+}
+#endif /* CONFIG_PROC_FS */
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index d9819e573103..5c5ca58e7d1f 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -259,6 +259,7 @@ static int __meminit create_physical_mapping(unsigned long start,
 					     unsigned long end,
 					     int nid)
 {
+	int mapping_shift;
 	unsigned long vaddr, addr, mapping_size = 0;
 	pgprot_t prot;
 #ifdef CONFIG_STRICT_KERNEL_RWX
@@ -277,18 +278,19 @@ static int __meminit create_physical_mapping(unsigned long start,
 
 		if (IS_ALIGNED(addr, PUD_SIZE) && gap >= PUD_SIZE &&
 		    mmu_psize_defs[MMU_PAGE_1G].shift)
-			mapping_size = PUD_SIZE;
+			mapping_shift = RADIX_PUD_SHIFT;
 		else if (IS_ALIGNED(addr, PMD_SIZE) && gap >= PMD_SIZE &&
 			 mmu_psize_defs[MMU_PAGE_2M].shift)
-			mapping_size = PMD_SIZE;
+			mapping_shift = RADIX_PMD_SHIFT;
 		else
-			mapping_size = PAGE_SIZE;
+			mapping_shift = PAGE_SHIFT;
 
-		if (split_text_mapping && (mapping_size != PAGE_SIZE) &&
+		if (split_text_mapping && (mapping_shift != PAGE_SHIFT) &&
 		    (addr <= __pa_symbol(__init_begin)) &&
 		    (addr + mapping_size) >= __pa_symbol(_stext))
-			mapping_size = PAGE_SIZE;
+			mapping_shift = PAGE_SHIFT;
 
+		mapping_size = (1UL << mapping_shift);
 		if (mapping_size != previous_size) {
 			print_mapping(start, addr, previous_size);
 			start = addr;
@@ -305,6 +307,7 @@ static int __meminit create_physical_mapping(unsigned long start,
 		rc = __map_kernel_page(vaddr, addr, prot, mapping_size, nid, start, end);
 		if (rc)
 			return rc;
+		update_page_count(mapping_shift, 1);
 	}
 
 	print_mapping(start, addr, mapping_size);
-- 
2.17.1

^ permalink raw reply related

* Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
From: Gautham R Shenoy @ 2018-07-18  8:12 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Gautham R. Shenoy, Michael Ellerman, Benjamin Herrenschmidt,
	Vaidyanathan Srinivasan, linuxppc-dev, linux-kernel,
	Florian Weimer, Oleg Nesterov
In-Reply-To: <80bbdf47081e3e302ab5f28b5ddc9e2faabba842.camel@neuling.org>

Hello Mikey,

On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
> 
> >  	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > diff --git a/arch/powerpc/kernel/idle_book3s.S
> > b/arch/powerpc/kernel/idle_book3s.S
> > index d85d551..5069d42 100644
> > --- a/arch/powerpc/kernel/idle_book3s.S
> > +++ b/arch/powerpc/kernel/idle_book3s.S
> > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
> >  	mfspr	r4, SPRN_MMCR2
> >  	std	r3, STOP_MMCR1(r13)
> >  	std	r4, STOP_MMCR2(r13)
> > +
> > +	mfspr	r3, SPRN_SPRG3
> > +	std	r3, STOP_SPRG3(r13)
> 
> We don't need to save it.  Just restore it from paca->sprg_vdso which should
> never change.

Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.

> 
> How can we do better at catching these missing SPRGs?

We can go through the list of SPRs from the POWER9 User Manual and
document explicitly why we don't have to save/restore certain SPRs
during the execution of the stop instruction. Does this sound ok ?

(Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
accessible from
https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)

> 
> We missed this one and looking at c1b25a17d249 we missed the AMOR a couple of
> months back. I'd rather we had some systematic way of finding the ones we are
> missing, rather than playing wake-a-mole.

I agree, we need something more systematic than the try-catch method
thing we have now.

For deep stop states on POWER9, we looked at the list of SPRs that
were being lost and restored during winkle on POWER8. The additional
SPRs that we took care of were the ones related to the Radix and IMC.

Now since winkle was used only in the context of CPU-Hotplug, the
CPU-online code would reinit some of the SPRs such as SPRG3, which is
why we didn't see this problem on POWER8. So, that is one obvious
place to audit.

AMOR was a bad miss. It was being restored in POWER8 as part of the
subcore restore code, so like RPR, it should have been restored along
with the other per-core SPRs.

> 
> Mikey 
> 
> >  	blr
> >  
> >  power9_restore_additional_sprs:
> > @@ -144,7 +147,9 @@ power9_restore_additional_sprs:
> >  	mtspr	SPRN_MMCR1, r4
> >  
> >  	ld	r3, STOP_MMCR2(r13)
> > +	ld	r4, STOP_SPRG3(r13)
> >  	mtspr	SPRN_MMCR2, r3
> > +	mtspr	SPRN_SPRG3, r4
> >  	blr
> >  
> >  /*

^ permalink raw reply

* Re: [PATCH] ipmi/powernv: Fix spurious warnings at boot
From: Jeremy Kerr @ 2018-07-18  8:25 UTC (permalink / raw)
  To: William A. Kennington III, linuxppc-dev
In-Reply-To: <20180615183332.249253-1-wak@google.com>

Hi William,

> Sometimes we have stale messages showing up in the recv queue that are
> being processed by the pollers. We don't want to print out warnings for
> these messages not having an outstanding request as they can be expected
> when doing a kexec from the petitroot environment or from another
> running kernel.

OK, makes sense to do. The in_drain state partially shadows the value of
!smi->cur_msg; we may be able to use that instead.

This means that we'll no longer warn when getting an IPMI message after
the first successful send occurs, but do we really care about that
anyway? We could just drop those silently...

Cheers,


Jeremy

^ permalink raw reply

* [PATCH v2] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
From: Gautham R. Shenoy @ 2018-07-18  8:33 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Michael Neuling,
	Vaidyanathan Srinivasan
  Cc: linuxppc-dev, linux-kernel, Florian Weimer, Gautham R. Shenoy,
	stable, Oleg Nesterov

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

On 64-bit servers, SPRN_SPRG3 and its userspace read-only mirror
SPRN_USPRG3 are used as userspace VDSO write and read registers
respectively.

SPRN_SPRG3 is lost when we enter stop4 and above, and is currently not
restored.  As a result, any read from SPRN_USPRG3 returns zero on an
exit from stop4 and above.

Thus in this situation, on POWER9, any call from sched_getcpu() always
returns zero, as on powerpc, we call __kernel_getcpu() which relies
upon SPRN_USPRG3 to report the CPU and NUMA node information.

Fix this by restoring SPRN_SPRG3 on wake up from a deep stop state
with the sprg_vdso value that is cached in PACA.

Fixes: e1c1cfed5432 ("powerpc/powernv: Save/Restore additional SPRs
for stop4 cpuidle")

Reported-by: Florian Weimer <fweimer@redhat.com>
Cc: <stable@vger.kernel.org> # 4.14
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
Change from v1: 
  Restoring the SPRG3 from paca->sprg_vdso instead of saving
  it separately during stop-entry, as suggested by Mikey.
  
 arch/powerpc/kernel/idle_book3s.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index d85d551..672ead8 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -144,7 +144,9 @@ power9_restore_additional_sprs:
 	mtspr	SPRN_MMCR1, r4
 
 	ld	r3, STOP_MMCR2(r13)
+	ld	r4, PACA_SPRG_VDSO(r13)
 	mtspr	SPRN_MMCR2, r3
+	mtspr	SPRN_SPRG3, r4
 	blr
 
 /*
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] powerpc/prom_init: remove linux,stdout-package property
From: Michael Ellerman @ 2018-07-18  9:37 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo, linux-kernel
  Cc: Murilo Opsfelder Araujo, Benjamin Herrenschmidt, Paul Mackerras,
	Mathieu Malaterre, Michael Bringmann, Nathan Fontenot, Kees Cook,
	Bharata B Rao, Alexey Kardashevskiy, Cédric Le Goater,
	Aneesh Kumar K . V, Nicholas Piggin, linuxppc-dev
In-Reply-To: <20180712171904.18971-1-muriloo@linux.ibm.com>

Hi Murilo,

Thanks for the patch.

Murilo Opsfelder Araujo <muriloo@linux.ibm.com> writes:
> This property was added in 2004 by
>
>     https://github.com/mpe/linux-fullhistory/commit/689fe5072fe9a0dec914bfa4fa60aed1e54563e6
>
> and the only use of it, which was already inside `#if 0`, was removed a month
> later by
>
>     https://github.com/mpe/linux-fullhistory/commit/1fbe5a6d90f6cd4ea610737ef488719d1a875de7
>
> Fixes: https://github.com/linuxppc/linux/issues/125

That is going to confuse some scripts that are expecting that to be a
"Fixes: <some commit>" tag :)

The proper tag to use there would be "Link:".

But, I'd prefer we didn't add github issue links to the history, as I'm
not sure they won't bit-rot eventually. Not because I'm a anti-Microsoft
conspiracy person but just because it's a repo I set up and manage and
there's no long term plan for it necessarily.

> ---
>  arch/powerpc/kernel/prom_init.c | 2 --
>  1 file changed, 2 deletions(-)

Including the link here would be ideal, as it means it doesn't end up in
the commit history, but it does end up in the mail archive. So if we
ever really need to find it, it should be there.

cheers

> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 5425dd3d6a9f..c45fb463c9e5 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -2102,8 +2102,6 @@ static void __init prom_init_stdout(void)
>  	stdout_node = call_prom("instance-to-package", 1, 1, prom.stdout);
>  	if (stdout_node != PROM_ERROR) {
>  		val = cpu_to_be32(stdout_node);
> -		prom_setprop(prom.chosen, "/chosen", "linux,stdout-package",
> -			     &val, sizeof(val));
>  
>  		/* If it's a display, note it */
>  		memset(type, 0, sizeof(type));
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH] Mark ams driver as orphaned in MAINTAINERS
From: Michael Ellerman @ 2018-07-18 10:00 UTC (permalink / raw)
  To: Michael Hanselmann, linuxppc-dev, linux-kernel; +Cc: Michael Hanselmann
In-Reply-To: <8fe08e4d44c56ee0b70517fc8609ab5a116bc407.1517265227.git.linux-kernel@hansmi.ch>

Michael Hanselmann <linux-kernel@hansmi.ch> writes:

> I no longer have any hardware with the Apple motion sensor and thus
> relinquish maintainership of the driver.

Thanks. I think I'll just remove the whole entry, meaning it will be
caught under the "LINUX FOR POWER MACINTOSH" entry.

Unless you object. Full patch below.

cheers

commit 19fe15338d6935c1a61a24b83dd315970b1ba162
Author:     Michael Hanselmann <linux-kernel@hansmi.ch>
AuthorDate: Mon Jan 29 22:40:09 2018 +0000
Commit:     Michael Ellerman <mpe@ellerman.id.au>
CommitDate: Wed Jul 18 19:57:42 2018 +1000

    MAINTAINERS: Remove the entry for the orphaned ams driver
    
    I no longer have any hardware with the Apple motion sensor and thus
    relinquish maintainership of the driver.
    
    Remove the maintainers entry entirely, meaning the code will now fall
    under "LINUX FOR POWER MACINTOSH".
    
    Signed-off-by: Michael Hanselmann <linux-kernel@hansmi.ch>
    [mpe: Drop the entry entirely, munge change log]
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

diff --git a/MAINTAINERS b/MAINTAINERS
index 07d1576fc766..50e398a463f9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -791,11 +791,6 @@ S:	Supported
 F:	drivers/net/ethernet/amd/xgbe/
 F:	arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi
 
-AMS (Apple Motion Sensor) DRIVER
-M:	Michael Hanselmann <linux-kernel@hansmi.ch>
-S:	Supported
-F:	drivers/macintosh/ams/
-
 ANALOG DEVICES INC AD5686 DRIVER
 M:	Stefan Popa <stefan.popa@analog.com>
 L:	linux-pm@vger.kernel.org

^ permalink raw reply related

* Re: [PATCH 2/3] [v2] m68k: mac: use time64_t in RTC handling
From: Geert Uytterhoeven @ 2018-07-18 11:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Paul Mackerras, Michael Ellerman, Joshua Thompson,
	Mathieu Malaterre, Benjamin Herrenschmidt, Greg Ungerer,
	linux-m68k, linuxppc-dev, Linux Kernel Mailing List,
	y2038 Mailman List, Meelis Roos, Andreas Schwab
In-Reply-To: <20180619140229.3615110-2-arnd@arndb.de>

On Tue, Jun 19, 2018 at 4:04 PM Arnd Bergmann <arnd@arndb.de> wrote:
> The real-time clock on m68k (and powerpc) mac systems uses an unsigned
> 32-bit value starting in 1904, which overflows in 2040, about two years
> later than everyone else, but this gets wrapped around in the Linux
> code in 2038 already because of the deprecated usage of time_t and/or
> long in the conversion.
>
> Getting rid of the deprecated interfaces makes it work until 2040 as
> documented, and it could be easily extended by reinterpreting
> the resulting time64_t as a positive number. For the moment, I'm
> adding a WARN_ON() that triggers if we encounter a time before 1970
> or after 2040 (the two are indistinguishable).
>
> This brings it in line with the corresponding code that we have on
> powerpc macintosh.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Thanks for your patch!

Applied and queued for v4.19, with the WARN_ON() dropped.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 3/3] [v2] m68k: remove unused set_clock_mmss() helpers
From: Geert Uytterhoeven @ 2018-07-18 11:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Paul Mackerras, Michael Ellerman, Joshua Thompson,
	Mathieu Malaterre, Benjamin Herrenschmidt, Greg Ungerer,
	linux-m68k, linuxppc-dev, Linux Kernel Mailing List,
	y2038 Mailman List, Meelis Roos, Andreas Schwab
In-Reply-To: <20180619140229.3615110-3-arnd@arndb.de>

On Tue, Jun 19, 2018 at 4:04 PM Arnd Bergmann <arnd@arndb.de> wrote:
> Commit 397ac99c6cef ("m68k: remove dead timer code") removed set_rtc_mmss()
> because it was unused in 2012. However, this was itself the only user of the
> mach_set_clock_mmss() callback and the many implementations of that callback,
> which are equally unused.
>
> This removes all of those as well.
>
> Acked-by: Greg Ungerer <gerg@linux-m68k.org>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Thanks, applied and queued for v4.19.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* [PATCH 3/3] powerpc/dts/fsl: t2080rdb: use the Cortina PHY driver compatible
From: Camelia Groza @ 2018-07-18 11:46 UTC (permalink / raw)
  To: robh+dt, mark.rutland, benh
  Cc: paulus, mpe, devicetree, linuxppc-dev, linux-kernel,
	Camelia Groza
In-Reply-To: <cover.1531903211.git.camelia.groza@nxp.com>

The Cortina PHY requires the use of the dedicated Cortina PHY driver
instead of the generic one.

Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 arch/powerpc/boot/dts/fsl/t2080rdb.dts | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t2080rdb.dts b/arch/powerpc/boot/dts/fsl/t2080rdb.dts
index 836e4c9..55c0210 100644
--- a/arch/powerpc/boot/dts/fsl/t2080rdb.dts
+++ b/arch/powerpc/boot/dts/fsl/t2080rdb.dts
@@ -97,12 +97,12 @@
 
 		mdio@fd000 {
 			xg_cs4315_phy1: ethernet-phy@c {
-				compatible = "ethernet-phy-ieee802.3-c45";
+				compatible = "ethernet-phy-id13e5.1002";
 				reg = <0xc>;
 			};
 
 			xg_cs4315_phy2: ethernet-phy@d {
-				compatible = "ethernet-phy-ieee802.3-c45";
+				compatible = "ethernet-phy-id13e5.1002";
 				reg = <0xd>;
 			};
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH 2/3] powerpc/dts/fsl: t4240rdb: use the Cortina PHY driver compatible
From: Camelia Groza @ 2018-07-18 11:46 UTC (permalink / raw)
  To: robh+dt, mark.rutland, benh
  Cc: paulus, mpe, devicetree, linuxppc-dev, linux-kernel,
	Camelia Groza
In-Reply-To: <cover.1531903211.git.camelia.groza@nxp.com>

The Cortina PHY requires the use of the dedicated Cortina PHY driver
instead of the generic one.

Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 arch/powerpc/boot/dts/fsl/t4240rdb.dts | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t4240rdb.dts b/arch/powerpc/boot/dts/fsl/t4240rdb.dts
index 15eb0a3..a56a705 100644
--- a/arch/powerpc/boot/dts/fsl/t4240rdb.dts
+++ b/arch/powerpc/boot/dts/fsl/t4240rdb.dts
@@ -267,22 +267,22 @@
 
 			mdio@fd000 {
 				xfiphy1: ethernet-phy@10 {
-					compatible = "ethernet-phy-ieee802.3-c45";
+					compatible = "ethernet-phy-id13e5.1002";
 					reg = <0x10>;
 				};
 
 				xfiphy2: ethernet-phy@11 {
-					compatible = "ethernet-phy-ieee802.3-c45";
+					compatible = "ethernet-phy-id13e5.1002";
 					reg = <0x11>;
 				};
 
 				xfiphy3: ethernet-phy@13 {
-					compatible = "ethernet-phy-ieee802.3-c45";
+					compatible = "ethernet-phy-id13e5.1002";
 					reg = <0x13>;
 				};
 
 				xfiphy4: ethernet-phy@12 {
-					compatible = "ethernet-phy-ieee802.3-c45";
+					compatible = "ethernet-phy-id13e5.1002";
 					reg = <0x12>;
 				};
 			};
-- 
1.9.1

^ permalink raw reply related

* [PATCH 1/3] powerpc/configs/dpaa: enable the Cortina PHY driver
From: Camelia Groza @ 2018-07-18 11:46 UTC (permalink / raw)
  To: robh+dt, mark.rutland, benh
  Cc: paulus, mpe, devicetree, linuxppc-dev, linux-kernel,
	Camelia Groza
In-Reply-To: <cover.1531903211.git.camelia.groza@nxp.com>

Cortina PHYs are present on T4240RDB and T2080RDB. Enable the driver
by default.

Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 arch/powerpc/configs/dpaa.config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/dpaa.config b/arch/powerpc/configs/dpaa.config
index 2fe76f5..4ffacaf 100644
--- a/arch/powerpc/configs/dpaa.config
+++ b/arch/powerpc/configs/dpaa.config
@@ -2,3 +2,4 @@ CONFIG_FSL_DPAA=y
 CONFIG_FSL_PAMU=y
 CONFIG_FSL_FMAN=y
 CONFIG_FSL_DPAA_ETH=y
+CONFIG_CORTINA_PHY=y
-- 
1.9.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox