LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
From: Claire Chang @ 2021-06-24  6:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: heikki.krogerus, linux-devicetree, peterz, joonas.lahtinen,
	dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
	Jianxiong Gao, Stefano Stabellini, Saravana Kannan, Joerg Roedel,
	Rafael J . Wysocki, Bartosz Golaszewski, bskeggs, linux-pci,
	xen-devel, Marek Szyprowski, Dan Williams, matthew.auld,
	Nicolas Boichat, thomas.hellstrom, Jim Quinlan, Will Deacon,
	Konrad Rzeszutek Wilk, intel-gfx, maarten.lankhorst, Robin Murphy,
	jani.nikula, Rob Herring, rodrigo.vivi, Bjorn Helgaas,
	boris.ostrovsky, Andy Shevchenko, jgross, airlied, Thierry Reding,
	Greg KH, Randy Dunlap, Qian Cai, lkml, list@263.net:IOMMU DRIVERS,
	Daniel Vetter, xypron.glpk, Tom Lendacky, linuxppc-dev, bauerman
In-Reply-To: <20210624054315.GA25381@lst.de>

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
> >
> > is_swiotlb_force_bounce() was the new function introduced in this patch here.
> >
> > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > +{
> > +     return dev->dma_io_tlb_mem->force_bounce;
> > +}
>
> To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> turn this into :
>
>         return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
>
> for a quick debug check?

I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff80000000000e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).

^ permalink raw reply

* [PATCH] crypto: scatterwalk - Remove obsolete PageSlab check
From: Herbert Xu @ 2021-06-24  6:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jens Axboe, linux-block, Thomas Bogendoerfer, Mike Snitzer,
	David S. Miller, Geoff Levand, Christoph Lameter, ceph-devel,
	linux-mips, Dongsheng Yang, linux-kernel, James E.J. Bottomley,
	dm-devel, Linux Crypto Mailing List, linux-arch, Thomas Gleixner,
	Ilya Dryomov, linuxppc-dev, Christoph Hellwig
In-Reply-To: <20210618181258.GC1905674@iweiny-DESK2.sc.intel.com>

On Fri, Jun 18, 2021 at 11:12:58AM -0700, Ira Weiny wrote:
>
> Interesting!  Thanks!
> 
> Digging around a bit more I found:
> 
> https://lore.kernel.org/patchwork/patch/439637/

Nice find.  So we can at least get rid of the PageSlab call from
the Crypto API.

---8<---
As it is now legal to call flush_dcache_page on slab pages we
no longer need to do the check in the Crypto API.

Reported-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index c837d0775474..7af08174a721 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -81,12 +81,7 @@ static inline void scatterwalk_pagedone(struct scatter_walk *walk, int out,
 		struct page *page;
 
 		page = sg_page(walk->sg) + ((walk->offset - 1) >> PAGE_SHIFT);
-		/* Test ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE first as
-		 * PageSlab cannot be optimised away per se due to
-		 * use of volatile pointer.
-		 */
-		if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE && !PageSlab(page))
-			flush_dcache_page(page);
+		flush_dcache_page(page);
 	}
 
 	if (more && walk->offset >= walk->sg->offset + walk->sg->length)
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* [PATCH] powerpc: ps3: remove unneeded semicolon
From: 13145886936 @ 2021-06-24  6:03 UTC (permalink / raw)
  To: geoff; +Cc: linuxppc-dev, linux-kernel, gushengxian

From: gushengxian <gushengxian@yulong.com>

Remove unneeded semocolons.

Signed-off-by: gushengxian <gushengxian@yulong.com>
---
 arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index 1a5665875165..f57f37fe038c 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -604,7 +604,7 @@ static dma_addr_t ps3_ioc0_map_page(struct device *_dev, struct page *page,
 	default:
 		/* not happned */
 		BUG();
-	};
+	}
 	result = ps3_dma_map(dev->d_region, (unsigned long)ptr, size,
 			     &bus_addr, iopte_flag);
 
@@ -763,7 +763,7 @@ int ps3_system_bus_device_register(struct ps3_system_bus_device *dev)
 		break;
 	default:
 		BUG();
-	};
+	}
 
 	dev->core.of_node = NULL;
 	set_dev_node(&dev->core, 0);
-- 
2.25.1



^ permalink raw reply related

* [PATCH] powerpc/sysfs: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE
From: Jason Wang @ 2021-06-24  6:36 UTC (permalink / raw)
  To: mpe; +Cc: christophe.leroy, maddy, wangborong, paulus, clg, linuxppc-dev,
	joel

The ARRAY_SIZE macro is more compact and more formal in linux source.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
---
 arch/powerpc/kernel/sysfs.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 2e08640bb3b4..5ff0e55d0db1 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -843,14 +843,14 @@ static int register_cpu_online(unsigned int cpu)
 #ifdef HAS_PPC_PMC_IBM
 	case PPC_PMC_IBM:
 		attrs = ibm_common_attrs;
-		nattrs = sizeof(ibm_common_attrs) / sizeof(struct device_attribute);
+		nattrs = ARRAY_SIZE(ibm_common_attrs);
 		pmc_attrs = classic_pmc_attrs;
 		break;
 #endif /* HAS_PPC_PMC_IBM */
 #ifdef HAS_PPC_PMC_G4
 	case PPC_PMC_G4:
 		attrs = g4_common_attrs;
-		nattrs = sizeof(g4_common_attrs) / sizeof(struct device_attribute);
+		nattrs = ARRAY_SIZE(g4_common_attrs);
 		pmc_attrs = classic_pmc_attrs;
 		break;
 #endif /* HAS_PPC_PMC_G4 */
@@ -858,7 +858,7 @@ static int register_cpu_online(unsigned int cpu)
 	case PPC_PMC_PA6T:
 		/* PA Semi starts counting at PMC0 */
 		attrs = pa6t_attrs;
-		nattrs = sizeof(pa6t_attrs) / sizeof(struct device_attribute);
+		nattrs = ARRAY_SIZE(pa6t_attrs);
 		pmc_attrs = NULL;
 		break;
 #endif
@@ -940,14 +940,14 @@ static int unregister_cpu_online(unsigned int cpu)
 #ifdef HAS_PPC_PMC_IBM
 	case PPC_PMC_IBM:
 		attrs = ibm_common_attrs;
-		nattrs = sizeof(ibm_common_attrs) / sizeof(struct device_attribute);
+		nattrs = ARRAY_SIZE(ibm_common_attrs);
 		pmc_attrs = classic_pmc_attrs;
 		break;
 #endif /* HAS_PPC_PMC_IBM */
 #ifdef HAS_PPC_PMC_G4
 	case PPC_PMC_G4:
 		attrs = g4_common_attrs;
-		nattrs = sizeof(g4_common_attrs) / sizeof(struct device_attribute);
+		nattrs = ARRAY_SIZE(g4_common_attrs);
 		pmc_attrs = classic_pmc_attrs;
 		break;
 #endif /* HAS_PPC_PMC_G4 */
@@ -955,7 +955,7 @@ static int unregister_cpu_online(unsigned int cpu)
 	case PPC_PMC_PA6T:
 		/* PA Semi starts counting at PMC0 */
 		attrs = pa6t_attrs;
-		nattrs = sizeof(pa6t_attrs) / sizeof(struct device_attribute);
+		nattrs = ARRAY_SIZE(pa6t_attrs);
 		pmc_attrs = NULL;
 		break;
 #endif
-- 
2.31.1




^ permalink raw reply related

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Paolo Bonzini @ 2021-06-24  6:57 UTC (permalink / raw)
  To: David Stevens, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Zhenyu Wang, Zhi Wang
  Cc: David Stevens, intel-gvt-dev, Wanpeng Li, kvm, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Joerg Roedel, linuxppc-dev,
	linux-kernel, kvm-ppc, linux-mips, James Morse, dri-devel,
	intel-gfx, Vitaly Kuznetsov, Alexandru Elisei, kvmarm,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <20210624035749.4054934-1-stevensd@google.com>

On 24/06/21 05:57, David Stevens wrote:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
> assoicated struct pages, so they should not be passed to pfn_to_page.
> This series removes such calls from the x86 and arm64 secondary MMU. To
> do this, this series modifies gfn_to_pfn to return a struct page in
> addition to a pfn, if the hva was resolved by gup. This allows the
> caller to call put_page only when necessated by gup.
> 
> This series provides a helper function that unwraps the new return type
> of gfn_to_pfn to provide behavior identical to the old behavior. As I
> have no hardware to test powerpc/mips changes, the function is used
> there for minimally invasive changes. Additionally, as gfn_to_page and
> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
> easily changed over to only use pfns.
> 
> This addresses CVE-2021-22543 on x86 and arm64.

Thank you very much for this.  I agree that it makes sense to have a 
minimal change; I had similar changes almost ready, but was stuck with 
deadlocks in the gfn_to_pfn_cache case.  In retrospect I should have 
posted something similar to your patches.

I have started reviewing the patches, and they look good.  I will try to 
include them in 5.13.

Paolo


^ permalink raw reply

* Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
From: Paolo Bonzini @ 2021-06-24  7:31 UTC (permalink / raw)
  To: David Stevens, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Zhenyu Wang, Zhi Wang
  Cc: intel-gvt-dev, Wanpeng Li, kvm, Will Deacon, Suzuki K Poulose,
	Sean Christopherson, Joerg Roedel, linuxppc-dev, linux-kernel,
	kvm-ppc, linux-mips, James Morse, dri-devel, intel-gfx,
	Vitaly Kuznetsov, Alexandru Elisei, kvmarm, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <20210624035749.4054934-4-stevensd@google.com>

On 24/06/21 05:57, David Stevens wrote:
>   static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> -			int map_writable, int max_level, kvm_pfn_t pfn,
> +			int map_writable, int max_level,
> +			const struct kvm_pfn_page *pfnpg,
>   			bool prefault, bool is_tdp)
>   {
>   	bool nx_huge_page_workaround_enabled = is_nx_huge_pa

So the main difference with my tentative patches is that here I was 
passing the struct by value.  I'll try to clean this up for 5.15, but 
for now it's all good like this.  Thanks again!

Paolo


^ permalink raw reply

* Re: [PATCH] crypto: nx: Fix memcpy() over-reading in nonce
From: Herbert Xu @ 2021-06-24  7:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: Nayna Jain, linux-kernel, stable, Paulo Flabiano Smorigo,
	linux-crypto, Breno Leitão, Paul Mackerras, linuxppc-dev,
	David S. Miller, linux-hardening
In-Reply-To: <20210616203459.1248036-1-keescook@chromium.org>

On Wed, Jun 16, 2021 at 01:34:59PM -0700, Kees Cook wrote:
> Fix typo in memcpy() where size should be CTR_RFC3686_NONCE_SIZE.
> 
> Fixes: 030f4e968741 ("crypto: nx - Fix reentrancy bugs")
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  drivers/crypto/nx/nx-aes-ctr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: linux-next: Signed-off-by missing for commit in the powerpc tree
From: Michael Ellerman @ 2021-06-24  7:42 UTC (permalink / raw)
  To: Stephen Rothwell, PowerPC
  Cc: Fabiano Rosas, Linux Kernel Mailing List, Nicholas Piggin,
	Linux Next Mailing List, Suraj Jitindar Singh
In-Reply-To: <20210622074622.4e483a33@canb.auug.org.au>

Stephen Rothwell <sfr@canb.auug.org.au> writes:
> Hi all,
>
> Commit
>
>   77bbbc0cf848 ("KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and POWER10 processors")
>
> is missing a Signed-off-by from its author.

That was actually deliberate.

Suraj wrote the patch when he was at IBM, but never sent it.

Paul & Nick then resurrected it from some internal tree, and are now
submitting it on behalf of IBM.

So they are asserting option b) of the DCO AIUI:

  (b) The contribution is based upon previous work that, to the best
      of my knowledge, is covered under an appropriate open source
      license and I have the right under that license to submit that
      work with modifications, whether created in whole or in part
      by me, under the same open source license (unless I am
      permitted to submit under a different license), as indicated
      in the file


cheers

^ permalink raw reply

* [RESEND-PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count
From: Vaibhav Jain @ 2021-06-24  8:06 UTC (permalink / raw)
  To: linux-nvdimm, nvdimm, linuxppc-dev
  Cc: Santosh Sivaraj, Aneesh Kumar K . V, Vaibhav Jain, Dan Williams,
	Ira Weiny

Persistent memory devices like NVDIMMs can loose cached writes in case
something prevents flush on power-fail. Such situations are termed as
dirty shutdown and are exposed to applications as
last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
described at [1]. The latter being useful in conditions where multiple
applications want to detect a dirty shutdown event without racing with
one another.

PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
dirty-shutdown-state. This patch further adds support for DSC via the
"ibm,persistence-failed-count" device tree property of an NVDIMM. This
property is a monotonic increasing 64-bit counter thats an indication
of number of times an NVDIMM has encountered a dirty-shutdown event
causing persistence loss.

Since this value is not expected to change after system-boot hence
papr_scm reads & caches its value during NVDIMM probe and exposes it
as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
similarly named NFIT sysfs attribute. Also this value is available to
libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
has been extended to add a new member called 'dimm_dsc' presence of
which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.

References:
[1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
Changelog:

Resend:
* Added ack by Aneesh.
* Minor fix to changelog of v2 patch

v2:
* Rebased the patch on latest ppc-next tree
* s/psdm/pdsm/g   [ Santosh ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +++++
 arch/powerpc/platforms/pseries/papr_scm.c | 30 +++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 50ef95e2f5b1..82488b1e7276 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -77,6 +77,9 @@
 /* Indicate that the 'dimm_fuel_gauge' field is valid */
 #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
 
+/* Indicate that the 'dimm_dsc' field is valid */
+#define PDSM_DIMM_DSC_VALID 2
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
 
 			/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
 			__u16 dimm_fuel_gauge;
+
+			/* Extension flag PDSM_DIMM_DSC_VALID */
+			__u64 dimm_dsc;
 		};
 		__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
 	};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 11e7b90a3360..eb8be0eb6119 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -114,6 +114,9 @@ struct papr_scm_priv {
 	/* Health information for the dimm */
 	u64 health_bitmap;
 
+	/* Holds the last known dirty shutdown counter value */
+	u64 dirty_shutdown_counter;
+
 	/* length of the stat buffer as expected by phyp */
 	size_t stat_buffer_len;
 };
@@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
 	return rc;
 }
 
+/* Add the dirty-shutdown-counter value to the pdsm */
+static int papr_pdsm_dsc(struct papr_scm_priv *p,
+			 union nd_pdsm_payload *payload)
+{
+	payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
+	payload->health.dimm_dsc = p->dirty_shutdown_counter;
+
+	return sizeof(struct nd_papr_pdsm_health);
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
 			    union nd_pdsm_payload *payload)
@@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
 	/* Populate the fuel gauge meter in the payload */
 	papr_pdsm_fuel_gauge(p, payload);
+	/* Populate the dirty-shutdown-counter field */
+	papr_pdsm_dsc(p, payload);
 
 	rc = sizeof(struct nd_papr_pdsm_health);
 
@@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static ssize_t dirty_shutdown_show(struct device *dev,
+			  struct device_attribute *attr, char *buf)
+{
+	struct nvdimm *dimm = to_nvdimm(dev);
+	struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+	return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
+}
+DEVICE_ATTR_RO(dirty_shutdown);
+
 static umode_t papr_nd_attribute_visible(struct kobject *kobj,
 					 struct attribute *attr, int n)
 {
@@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject *kobj,
 static struct attribute *papr_nd_attributes[] = {
 	&dev_attr_flags.attr,
 	&dev_attr_perf_stats.attr,
+	&dev_attr_dirty_shutdown.attr,
 	NULL,
 };
 
@@ -1149,6 +1175,10 @@ static int papr_scm_probe(struct platform_device *pdev)
 	p->is_volatile = !of_property_read_bool(dn, "ibm,cache-flush-required");
 	p->hcall_flush_required = of_property_read_bool(dn, "ibm,hcall-flush-required");
 
+	if (of_property_read_u64(dn, "ibm,persistence-failed-count",
+				 &p->dirty_shutdown_counter))
+		p->dirty_shutdown_counter = 0;
+
 	/* We just need to ensure that set cookies are unique across */
 	uuid_parse(uuid_str, (uuid_t *) uuid);
 	/*
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
From: Aneesh Kumar K.V @ 2021-06-24  8:20 UTC (permalink / raw)
  To: David Gibson
  Cc: Nathan Lynch, nvdimm, Daniel Henrique Barboza, dan.j.williams,
	linuxppc-dev
In-Reply-To: <YNP3P+30/tNzYMRP@yekko>

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>  arch/powerpc/include/asm/firmware.h       |   3 +-
>>  arch/powerpc/include/asm/prom.h           |   1 +
>>  arch/powerpc/kernel/prom_init.c           |   3 +-
>>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 286 insertions(+), 6 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more list of numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> +The list of domainID index represnets increasing hierachy of
>> resource grouping.
>
> Typo "represnets".  Also s/hierachy/hierarchy/
>
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
>
> The Form1 description is still kinda confusing, but I don't really
> care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
>
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With numa distance computation now detached from the index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>> +same domainID index representing resource groups of different performance/latency characteristics.
>
> So, see you've removed the special handling of secondary IDs for pmem
> - big improvement, thanks.  IIUC, in this revised version, for Form2
> there's really no reason for ibm,associativity-reference-points to
> have >1 entry.  Is that right?
>
> In Form2 everything revolves around the primary domain ID - so much
> that I suggest we come up with a short name for it.  How about just
> "node id" since that's how Linux uses it.

We don't really refer primary domain ID in rest of FORM2 details. We
do refer the entries as domainID. Is renaming domainID to NUMA
node id better?

>
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is considered
>> +the domainID index.
>
> The lookup-index-table is all about compactifying the representation
> of the distance matrix.  Because node ids are sparse, the matrix of
> distances is also effectively sparse.  You don't want to have a huge
> matrix in the DT with mostly meaningless entries, so you use the
> lookup table to compact the node ids.
>
> It's a bit confusing though, because you now have two representations
> of exactly the same information the "full" node id (== primary
> domainID) and the "short" node id (==domainID lookup table offset).
>
> I can see three ways to clarify this:
>
>   A) Give the short node id an explicit name rather than difficult to
>      parse "domainID index" or "domainID offset" which gets easily
>      muddled with other uses of index and offset.  Use that name
>      throughout.

I dopped the domainID index and started referring to that as domain
distance offset.

>
>   B) Eliminate the short ID entirely, and use an explicit sparse
>      matrix representation for the distance table e.g. a list of
>      (nodeA, nodeB, distance) tuples
>
>      That does have the disadvantage that it's not structurally
>      guaranteed that you have entries for every pair of "active" node
>      ids.

Other hypervisor would want to have a large possible domainID value and
mostly want to publish the full topology details during boot. Using the
above format makes a 16 node config have large "ibm,numa-distance-table"
property.

>
>   C) Eliminate the "long ID" entirely.  In the Form2 world, is there
>      actually any point allowing sparse node ids?  In this case you'd
>      have the node id read from associativity and use that directly to
>      index the distance table

Yes, other hypervisor would like to partition the domain ID space for
different resources.

>
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index
>> for domainID 8 is 1.
>
> As noted "domainID index" is confusing here, because it's different
> from the "domainID index" values in reference-points.
>
>> +
>> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>
> The N here always has to be the square of the N in the
> lookup-index-table, yes?  In which case do we actually need to include
> it?

most of PAPR device tree property follows the pattern  {N,....
Nelements}. This follows the same pattern.

>
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +
>> +"ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>> +
>> +Each resource (drcIndex) now also supports additional optional device tree properties.
>> +These properties are marked optional because the platform can choose not to export
>> +them and provide the system topology details using the earlier defined device tree
>> +properties alone. The optional device tree properties are used when adding new resources
>> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>> +contains the newly added resource during boot.
>
> In what circumstances would you envisage a hot-add creating a new NUMA
> node, as opposed to adding resources to an existing (possibly empty)
> node?  Do you need any provision for removing NUMA nodes again?

Both Qemu and PowerVM don't intend to use this at this point. Both will
provide the full possible NUMA node details during boot. But I was not
sure whether we really need to restrict the possibility of a new
resource being added. This can be true in case of new memory devices
that gets hotplugged in the latency of the device is such that we never
expressed that in the distance table during boot. 



>
>> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>> +when building the NUMA distance of the numa node to which this resource belongs. This can
>> +be looked at as the index at which this new domainID would have appeared in
>> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>
> Again, confusing use of "index" where it's used both for offsets in
> ibm,associativity properties and for offsets in ibm,numa-lookup-index-table

I guess we can drop the ibm,numa-lookuup-index and state that the
number of distance element in "ibm,numa-distance" suggest the index at
which this NUMA node should appear for compuring the distance details. 

>
>> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>> +The value is 1 based array index value.
>> +
>> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>> +
>> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>> +from this resource domain to other resources.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>
> Again, does N have to equal 2 * (existing number of nodes + 1)?  If so
> you should say so, if not you should specify how incomplete
> information is interpreted.
>
>> +For ex:
>> +ibm,associativity     = { 4, 5, 10, 50}
>> +ibm,numa-lookup-index = { 4 }
>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> +
>> +resulting in a new toplogy as below.
>> +  | 0    8   40   50
>> +--|------------------
>> +  |
>> +0 | 10   20  80   160
>> +  |
>> +8 | 20   10  160  255
>> +  |
>> +40| 80   160  10  80
>> +  |
>> +50| 160  255  80  10
>> +

Here is the relevant updated part of the doc.

Form 2 associativity format adds separate device tree properties representing NUMA node distance
thereby making the node distance computation flexible. Form 2 also allows flexible primary
domain numbering. With numa distance computation now detached from the index value in
"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
ids at the same domainID index representing resource groups of different performance/latency
characteristics.

Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
"ibm,architecture-vec-5" property.

"ibm,numa-lookup-index-table" property contains one or more list numbers representing
the domainIDs present in the system. The offset of the domainID in this property is
used as an index while computing numa distance information via "ibm,numa-distance-table".

prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
N domainID encoded as with encode-int

For ex:
"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. Offset of domainID 8 (2) is used when
computing the distance of domain 8 from other domains present in the system. For rest of
this document this offset will be referred to as domain distance offset.

"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
distance between resource groups/domains present in the system.

prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
N distance values encoded as with encode-bytes. The max distance value we could encode is 255.

For ex:
ibm,numa-lookup-index-table =  {3, 0, 8, 40}
ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}

  | 0    8   40
--|------------
  |
0 | 10   20  80
  |
8 | 20   10  160
  |
40| 80   160  10

A possible "ibm,associativity" property for resources in node 0, 8 and 40

{ 3, 6, 7, 0 }
{ 3, 6, 9, 8 }
{ 3, 6, 7, 40}

With "ibm,associativity-reference-points"  { 0x3 }

"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
Since domainID can be sparse, the matrix of distances can also be effectively sparse.
With "ibm,lookup-index-table" we are able to achieve a compact representation of
distance information.

Each resource (drcIndex) now also supports additional optional device tree properties.
These properties are marked optional because the platform can choose not to export
them and provide the system topology details using the earlier defined device tree
properties alone. The optional device tree properties are used when adding new resources
(DLPAR) and when the platform didn't provide the topology details of the domain which
contains the newly added resource during boot.

"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
from this resource domain to other resources.

prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
N distance values encoded as with encode-bytes. The max distance value we could encode is 255.

If the system currently has 3 domains and a DLPAR operation is adding one additional
domain, "ibm,numa-distance" property should have 2 * (3 + 1) = 8 elements as shown below.
In other words the domain distance offset value for the newly added domain is number of
distance value elements in the "ibm,numa-distance" property divided by 2.


>> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>> @@ -53,6 +53,7 @@
>>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
>> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>>  
>>  #ifndef __ASSEMBLY__
>>  
>> @@ -73,7 +74,7 @@ enum {
>>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
>> -		FW_FEATURE_RPT_INVALIDATE,
>> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>>  	FW_FEATURE_POWERNV_ALWAYS = 0,
>> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
>> index df9fec9d232c..5c80152e8f18 100644
>> --- a/arch/powerpc/include/asm/prom.h
>> +++ b/arch/powerpc/include/asm/prom.h
>> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
>> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
>> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> index 64b9593038a7..496fdac54c29 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>>  #else
>>  		0,
>>  #endif
>> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
>> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
>> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>>  		.micro_checkpoint = 0,
>>  		.reserved0 = 0,
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index d32729f235b8..5a7d94960fb7 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>>  
>>  #define FORM0_AFFINITY 0
>>  #define FORM1_AFFINITY 1
>> +#define FORM2_AFFINITY 2
>>  static int affinity_form;
>>  
>>  #define MAX_DISTANCE_REF_POINTS 4
>>  static int max_associativity_domain_index;
>>  static const __be32 *distance_ref_points;
>>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
>> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
>> +};
>> +static int numa_id_index_table[MAX_NUMNODES];
>>  
>>  /*
>>   * Allocate node_to_cpumask_map based on number of available nodes
>> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> +/*
>> + * With FORM2 if we are not using logical domain ids, we will find
>> + * both primary and seconday domains with same value. Hence always
>> + * start comparison from secondary domains
>
>
> IIUC, in this new draft, secondary domains are no longer a meaningful
> thing in Form2, so this comment and code seem outdated.

This needed fixup. With FORM2 our associativity array can be just one
element. I fixed up as below.

static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
{
	int node1, node2;
	int dist = 0;

	node1 = associativity_to_nid(cpu1_assoc);
	node2 = associativity_to_nid(cpu2_assoc);

	dist = numa_distance_table[node1][node2];
	if (dist == LOCAL_DISTANCE)
		return 0;
	else if (dist == REMOTE_DISTANCE)
		return 1;
	else
		return 2;
}

also renamed cpu_distance to cpu_relative_distance.


>> + */
>> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +	int dist = 0;
>> +
>> +	int i, index;
>> +
>> +	for (i = 1; i < max_associativity_domain_index; i++) {
>> +		index = be32_to_cpu(distance_ref_points[i]);
>> +		if (cpu1_assoc[index] == cpu2_assoc[index])
>> +			break;
>> +		dist++;
>> +	}
>> +
>> +	return dist;
>> +}
>> +
>>  static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	int dist = 0;
>> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  			break;
>>  		dist++;
>>  	}
>> -
>>  	return dist;
>>  }
>>  
>> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	/* We should not get called with FORM0 */
>>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> -
>> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
>> +	if (affinity_form == FORM1_AFFINITY)
>> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
>> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>>  }
>>  
>>  /* must hold reference to node during call */
>> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>>  	int i;
>>  	int distance = LOCAL_DISTANCE;
>>  
>> -	if (affinity_form == FORM0_AFFINITY)
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		return numa_distance_table[a][b];
>> +	else if (affinity_form == FORM0_AFFINITY)
>>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>>  
>>  	for (i = 0; i < max_associativity_domain_index; i++) {
>> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>>  
>>  /*
>>   * Used to update distance information w.r.t newly added node.
>> + * ibm,numa-lookup-index -> 4
>> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>>   */
>>  void update_numa_distance(struct device_node *node)
>>  {
>> +	int i, nid, other_nid, other_nid_index = 0;
>> +	const __be32 *numa_indexp;
>> +	const __u8  *numa_distancep;
>> +	int numa_index, max_numa_index, numa_distance;
>> +
>>  	if (affinity_form == FORM0_AFFINITY)
>>  		return;
>>  	else if (affinity_form == FORM1_AFFINITY) {
>>  		initialize_form1_numa_distance(node);
>>  		return;
>>  	}
>> +	/* FORM2 affinity  */
>> +
>> +	nid = of_node_to_nid_single(node);
>> +	if (nid == NUMA_NO_NODE)
>> +		return;
>> +
>> +	/* Already initialized */
>> +	if (numa_distance_table[nid][nid] != -1)
>
> IIUC, this should exactly correspond to the case where the new
> resource lies in an existing NUMA node, yes?

yes.

>
>> +		return;
>> +	/*
>> +	 * update node distance if not already populated.
>> +	 */
>> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
>> +	if (!numa_distancep)
>> +		return;
>> +
>> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
>> +	if (!numa_indexp)
>> +		return;
>> +
>> +	numa_index = of_read_number(numa_indexp, 1);
>> +	/*
>> +	 * update the numa_id_index_table. Device tree look at index table as
>> +	 * 1 based array indexing.
>> +	 */
>> +	numa_id_index_table[numa_index - 1] = nid;
>> +
>> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
>> +	VM_WARN_ON(max_numa_index != 2 * numa_index);
>
> You WARN_ON(), but you don't actually bail out to avoid indexing past
> the end of the property below.
>
> AFAICT none of this really works unless numa_index == (previous number
> of numa nodes + 1), which makes all the use of these different
> variables kind of confusing.  If you had a variable that you just set
> equal to the new number of numa nodes (previous number plus the one
> being added), then I think you can replace all numa_index and
> max_numa_index references with that and it will be clearer.

I rewrote that as below.

	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
	if (!numa_distancep)
		return;

	max_dist_element = of_read_number((const __be32 *)numa_distancep, 1);

	numa_id_index = max_dist_element >> 1;
	/*
	 * update the numa_id_index_table. Device tree look at index table as
	 * 1 based array indexing.
	 */
	if (numa_id_index_table[numa_id_index - 1] != -1) {
		WARN(1, "Wrong NUMA distance information\n");
		return;
	}

	numa_id_index_table[numa_id_index - 1] = nid;

	/* Skip the size which is encoded int */
	numa_distancep += sizeof(__be32);

	/*
	 * First fill the distance information from other node to this node.
	 */
	distance_index = 0;
	for (i = 0; i < numa_id_index; i++) {
		numa_distance = numa_distancep[distance_index++];
		other_nid = numa_id_index_table[i];
		if (other_nid == NUMA_NO_NODE)
			continue;
		numa_distance_table[other_nid][nid] = numa_distance;
	}

	for (i = 0; i < numa_id_index; i++) {
		numa_distance = numa_distancep[distance_index++];
		other_nid = numa_id_index_table[i];
		if (other_nid == NUMA_NO_NODE)
			continue;
		numa_distance_table[nid][other_nid] = numa_distance;
	}
}


>
>
>> +	/* Skip the size which is encoded int */
>> +	numa_distancep += sizeof(__be32);
>> +
>> +	/*
>> +	 * First fill the distance information from other node to this node.
>> +	 */
>> +	other_nid_index = 0;
>> +	for (i = 0; i < numa_index; i++) {
>> +		numa_distance = numa_distancep[i];
>> +		other_nid = numa_id_index_table[other_nid_index++];
>> +		numa_distance_table[other_nid][nid] = numa_distance;
>> +	}
>> +
>> +	other_nid_index = 0;
>> +	for (; i < max_numa_index; i++) {
>
> Again, splitting this loop and carrying i over seems a confusing way
> to code this.  It's basically two loops of N, one writing a row of the
> distance matrix, one writing a column (other_nid will even go through
> the same values in each loop).

Fixed

>
>> +		numa_distance = numa_distancep[i];
>> +		other_nid = numa_id_index_table[other_nid_index++];
>> +		numa_distance_table[nid][other_nid] = numa_distance;
>> +	}
>> +}
>> +
>> +/*
>> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>> + */
>> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
>> +{
>> +	const __u8 *numa_dist_table;
>> +	const __be32 *numa_lookup_index;
>> +	int numa_dist_table_length;
>> +	int max_numa_index, distance_index;
>> +	int i, curr_row = 0, curr_column = 0;
>> +
>> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>
> max_numa_index here has a different meaning to max_numa_index in the
> previous function, which is pointlessly confusing.
>
>> +	/* first element of the array is the size and is encode-int */
>> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
>> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
>> +	/* Skip the size which is encoded int */
>> +	numa_dist_table += sizeof(__be32);
>> +
>> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
>> +		 numa_dist_table_length, max_numa_index);
>> +
>> +	for (i = 0; i < max_numa_index; i++)
>> +		/* +1 skip the max_numa_index in the property */
>> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
>> +
>> +
>> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
>
> Again, you don't actually bail out in this case.  And if it has to
> have this value, what's the point of encoding it into the property.
>
>> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
>> +		int nodeA = numa_id_index_table[curr_row];
>> +		int nodeB = numa_id_index_table[curr_column++];
>
> You've already (sort of) verified that the distance table has size
> N^2, in which case you can just to a simple two dimensional loop
> rather than having to do ugly calculations of row and column.

Fixed all the above and updated as below.
	if (numa_dist_table_length != max_numa_index * max_numa_index) {
		WARN(1, "Wrong NUMA distnce information\n");
		/* consider everybody else just remote. */

		for (i = 0;  i < max_numa_index; i++) {
			for (j = 0; j < max_numa_index; j++) {
				int nodeA = numa_id_index_table[i];
				int nodeB = numa_id_index_table[j];

				if (nodeA == nodeB)
					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
				else
					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
			}
		}
	}

	distance_index = 0;
	for (i = 0;  i < max_numa_index; i++) {
		for (j = 0; j < max_numa_index; j++) {
			int nodeA = numa_id_index_table[i];
			int nodeB = numa_id_index_table[j];

			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];

			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
		}
	}
}


-aneesh

^ permalink raw reply

* Re: [PATCH 1/3] powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()
From: Laurent Dufour @ 2021-06-24  8:39 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev
In-Reply-To: <20210622133923.295373-2-danielhb413@gmail.com>

Le 22/06/2021 à 15:39, Daniel Henrique Barboza a écrit :
> The function is counting reserved LMBs as available to be added, but
> they aren't. This will cause the function to miscalculate the available
> LMBs and can trigger errors later on when executing dlpar_add_lmb().

Indeed I'm wondering if dlpar_add_lmb() would fail in that case, so that's even 
better to check for that flag earlier.

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>

> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>   arch/powerpc/platforms/pseries/hotplug-memory.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 36f66556a7c6..28a7fd90232f 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -683,6 +683,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
>   
>   	/* Validate that there are enough LMBs to satisfy the request */
>   	for_each_drmem_lmb(lmb) {
> +		if (lmb->flags & DRCONF_MEM_RESERVED)
> +			continue;
> +
>   		if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
>   			lmbs_available++;
>   
> 


^ permalink raw reply

* Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns
From: Nicholas Piggin @ 2021-06-24  8:43 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <20210624035749.4054934-2-stevensd@google.com>

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens <stevensd@chromium.org>

Changelog? This looks like a bug, should it have a Fixes: tag?

Thanks,
Nick

> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu_audit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
> index cedc17b2f60e..97ff184084b4 100644
> --- a/arch/x86/kvm/mmu/mmu_audit.c
> +++ b/arch/x86/kvm/mmu/mmu_audit.c
> @@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
>  		audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
>  			     "ent %llxn", vcpu->arch.mmu->root_level, pfn,
>  			     hpa, *sptep);
> +
> +	kvm_release_pfn_clean(pfn);
>  }
>  
>  static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
> -- 
> 2.32.0.93.g670b81a890-goog
> 
> 

^ permalink raw reply

* Re: [PATCH 2/3] powerpc/pseries: break early in dlpar_memory_add_by_count() loops
From: Laurent Dufour @ 2021-06-24  8:45 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev
In-Reply-To: <20210622133923.295373-3-danielhb413@gmail.com>

Le 22/06/2021 à 15:39, Daniel Henrique Barboza a écrit :
> After a successful dlpar_add_lmb() call the LMB is marked as reserved.
> Later on, depending whether we added enough LMBs or not, we rely on
> the marked LMBs to see which ones might need to be removed, and we
> remove the reservation of all of them.
> 
> These are done in for_each_drmem_lmb() loops without any break
> condition. This means that we're going to check all LMBs of the partition
> even after going through all the reserved ones.
> 
> This patch adds break conditions in both loops to avoid this. The
> 'lmbs_added' variable was renamed to 'lmbs_reserved', and it's now
> being decremented each time a lmb reservation is removed, indicating
> if there are still marked LMBs to be processed.

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>

> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>   arch/powerpc/platforms/pseries/hotplug-memory.c | 17 ++++++++++++-----
>   1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 28a7fd90232f..c0a03e1537cb 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -673,7 +673,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
>   {
>   	struct drmem_lmb *lmb;
>   	int lmbs_available = 0;
> -	int lmbs_added = 0;
> +	int lmbs_reserved = 0;
>   	int rc;
>   
>   	pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);
> @@ -714,13 +714,12 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
>   		 * requested LMBs cannot be added.
>   		 */
>   		drmem_mark_lmb_reserved(lmb);
> -
> -		lmbs_added++;
> -		if (lmbs_added == lmbs_to_add)
> +		lmbs_reserved++;
> +		if (lmbs_reserved == lmbs_to_add)
>   			break;
>   	}
>   
> -	if (lmbs_added != lmbs_to_add) {
> +	if (lmbs_reserved != lmbs_to_add) {
>   		pr_err("Memory hot-add failed, removing any added LMBs\n");
>   
>   		for_each_drmem_lmb(lmb) {
> @@ -735,6 +734,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
>   				dlpar_release_drc(lmb->drc_index);
>   
>   			drmem_remove_lmb_reservation(lmb);
> +			lmbs_reserved--;
> +
> +			if (lmbs_reserved == 0)
> +				break;
>   		}
>   		rc = -EINVAL;
>   	} else {
> @@ -745,6 +748,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
>   			pr_debug("Memory at %llx (drc index %x) was hot-added\n",
>   				 lmb->base_addr, lmb->drc_index);
>   			drmem_remove_lmb_reservation(lmb);
> +			lmbs_reserved--;
> +
> +			if (lmbs_reserved == 0)
> +				break;
>   		}
>   		rc = 0;
>   	}
> 


^ permalink raw reply

* Re: [PATCH 3/3] powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()
From: Laurent Dufour @ 2021-06-24  8:48 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev
In-Reply-To: <20210622133923.295373-4-danielhb413@gmail.com>

Le 22/06/2021 à 15:39, Daniel Henrique Barboza a écrit :
> The validation done at the start of dlpar_memory_add_by_ic() is an all
> of nothing scenario - if any LMBs in the range is marked as RESERVED we
> can fail right away.
> 
> We then can remove the 'lmbs_available' var and its check with
> 'lmbs_to_add' since the whole LMB range was already validated in the
> previous step.

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>

> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>   arch/powerpc/platforms/pseries/hotplug-memory.c | 14 ++++++--------
>   1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index c0a03e1537cb..377d852f5a9a 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -796,7 +796,6 @@ static int dlpar_memory_add_by_index(u32 drc_index)
>   static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
>   {
>   	struct drmem_lmb *lmb, *start_lmb, *end_lmb;
> -	int lmbs_available = 0;
>   	int rc;
>   
>   	pr_info("Attempting to hot-add %u LMB(s) at index %x\n",
> @@ -811,15 +810,14 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
>   
>   	/* Validate that the LMBs in this range are not reserved */
>   	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
> -		if (lmb->flags & DRCONF_MEM_RESERVED)
> -			break;
> -
> -		lmbs_available++;
> +		/* Fail immediately if the whole range can't be hot-added */
> +		if (lmb->flags & DRCONF_MEM_RESERVED) {
> +			pr_err("Memory at %llx (drc index %x) is reserved\n",
> +					lmb->base_addr, lmb->drc_index);
> +			return -EINVAL;
> +		}
>   	}
>   
> -	if (lmbs_available < lmbs_to_add)
> -		return -EINVAL;
> -
>   	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
>   		if (lmb->flags & DRCONF_MEM_ASSIGNED)
>   			continue;
> 


^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Nicholas Piggin @ 2021-06-24  8:52 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <20210624035749.4054934-3-stevensd@google.com>

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens <stevensd@chromium.org>
> 
> Return a struct kvm_pfn_page containing both a pfn and an optional
> struct page from the gfn_to_pfn family of functions. This differentiates
> the gup and follow_fault_pfn cases, which allows callers that only need
> a pfn to avoid touching the page struct in the latter case. For callers
> that need a struct page, introduce a helper function that unwraps a
> struct kvm_pfn_page into a struct page. This helper makes the call to
> kvm_get_pfn which had previously been in hva_to_pfn_remapped.
> 
> For now, wrap all calls to gfn_to_pfn functions in the new helper
> function. Callers which don't need the page struct will be updated in
> follow-up patches.

Hmm. You mean callers that do need the page will be updated? Normally 
if there will be leftover users that don't need the struct page then
you would go the other way and keep the old call the same, and add a new
one (gfn_to_pfn_page) just for those that need it.

Most kernel code I look at passes back multiple values by updating 
pointers to struct or variables rather than returning a struct, I 
suppose that's not really a big deal and a matter of taste.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
From: Nicholas Piggin @ 2021-06-24  8:58 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <20210624035749.4054934-4-stevensd@google.com>

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens <stevensd@chromium.org>
>  out_unlock:
>  	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  	else
>  		write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(pfn);
> +	if (pfnpg.page)
> +		put_page(pfnpg.page);
>  	return r;
>  }

How about

  kvm_release_pfn_page_clean(pfnpg);

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Marc Zyngier @ 2021-06-24  9:40 UTC (permalink / raw)
  To: David Stevens
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	intel-gvt-dev, Joerg Roedel, Huacai Chen, Aleksandar Markovic,
	Zhi Wang, Suzuki K Poulose, intel-gfx, kvm-ppc, Zhenyu Wang,
	Alexandru Elisei, linux-arm-kernel, Jim Mattson,
	Sean Christopherson, linux-kernel, James Morse, Paolo Bonzini,
	Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <20210624035749.4054934-3-stevensd@google.com>

Hi David,

On Thu, 24 Jun 2021 04:57:45 +0100,
David Stevens <stevensd@chromium.org> wrote:
> 
> From: David Stevens <stevensd@chromium.org>
> 
> Return a struct kvm_pfn_page containing both a pfn and an optional
> struct page from the gfn_to_pfn family of functions. This differentiates
> the gup and follow_fault_pfn cases, which allows callers that only need
> a pfn to avoid touching the page struct in the latter case. For callers
> that need a struct page, introduce a helper function that unwraps a
> struct kvm_pfn_page into a struct page. This helper makes the call to
> kvm_get_pfn which had previously been in hva_to_pfn_remapped.
> 
> For now, wrap all calls to gfn_to_pfn functions in the new helper
> function. Callers which don't need the page struct will be updated in
> follow-up patches.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/arm64/kvm/mmu.c                   |   5 +-
>  arch/mips/kvm/mmu.c                    |   3 +-
>  arch/powerpc/kvm/book3s.c              |   3 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |   5 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
>  arch/powerpc/kvm/book3s_hv_uvmem.c     |   4 +-
>  arch/powerpc/kvm/e500_mmu_host.c       |   2 +-
>  arch/x86/kvm/mmu/mmu.c                 |  11 ++-
>  arch/x86/kvm/mmu/mmu_audit.c           |   2 +-
>  arch/x86/kvm/x86.c                     |   2 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c       |   2 +-
>  include/linux/kvm_host.h               |  27 ++++--
>  include/linux/kvm_types.h              |   5 +
>  virt/kvm/kvm_main.c                    | 121 +++++++++++++------------
>  14 files changed, 109 insertions(+), 88 deletions(-)
> 

[...]

> +kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg)
> +{
> +	if (pfnpg.page)
> +		return pfnpg.pfn;
> +
> +	kvm_get_pfn(pfnpg.pfn);
> +	return pfnpg.pfn;
> +}
> +EXPORT_SYMBOL_GPL(kvm_pfn_page_unwrap);

I'd really like to see a tiny bit of documentation explaining that
calls to kvm_pfn_page_unwrap() are not idempotent.

Otherwise, looks good to me.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Paolo Bonzini @ 2021-06-24  9:42 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <1624524331.zsin3qejl9.astroid@bobo.none>

On 24/06/21 10:52, Nicholas Piggin wrote:
>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>> function. Callers which don't need the page struct will be updated in
>> follow-up patches.
> Hmm. You mean callers that do need the page will be updated? Normally
> if there will be leftover users that don't need the struct page then
> you would go the other way and keep the old call the same, and add a new
> one (gfn_to_pfn_page) just for those that need it.

Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
it's a good idea to move the short name to the common case and the ugly 
kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
sure there should be any kvm_pfn_page_unwrap in the end.

Paolo


^ permalink raw reply

* Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns
From: Paolo Bonzini @ 2021-06-24  9:43 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <1624524156.04etgk7zmz.astroid@bobo.none>

On 24/06/21 10:43, Nicholas Piggin wrote:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> From: David Stevens <stevensd@chromium.org>
> 
> Changelog? This looks like a bug, should it have a Fixes: tag?

Probably has been there forever... The best way to fix the bug would be 
to nuke mmu_audit.c, which I've threatened to do many times but never 
followed up on.

Paolo

> Thanks,
> Nick
> 
>>
>> Signed-off-by: David Stevens <stevensd@chromium.org>
>> ---
>>   arch/x86/kvm/mmu/mmu_audit.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
>> index cedc17b2f60e..97ff184084b4 100644
>> --- a/arch/x86/kvm/mmu/mmu_audit.c
>> +++ b/arch/x86/kvm/mmu/mmu_audit.c
>> @@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
>>   		audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
>>   			     "ent %llxn", vcpu->arch.mmu->root_level, pfn,
>>   			     hpa, *sptep);
>> +
>> +	kvm_release_pfn_clean(pfn);
>>   }
>>   
>>   static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
>> -- 
>> 2.32.0.93.g670b81a890-goog
>>
>>
> 


^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Nicholas Piggin @ 2021-06-24  9:57 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <201b68a7-10ea-d656-0c1e-5511b1f22674@redhat.com>

Excerpts from Paolo Bonzini's message of June 24, 2021 7:42 pm:
> On 24/06/21 10:52, Nicholas Piggin wrote:
>>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>>> function. Callers which don't need the page struct will be updated in
>>> follow-up patches.
>> Hmm. You mean callers that do need the page will be updated? Normally
>> if there will be leftover users that don't need the struct page then
>> you would go the other way and keep the old call the same, and add a new
>> one (gfn_to_pfn_page) just for those that need it.
> 
> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
> it's a good idea to move the short name to the common case and the ugly 
> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
> sure there should be any kvm_pfn_page_unwrap in the end.

If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.

But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If 
the plan is for the latter then I guess that's fine.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
From: Marc Zyngier @ 2021-06-24 10:06 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, Huacai Chen, Aleksandar Markovic,
	Zhi Wang, Suzuki K Poulose, intel-gfx, kvm-ppc, Zhenyu Wang,
	intel-gvt-dev, linux-arm-kernel, Jim Mattson, Sean Christopherson,
	linux-kernel, James Morse, David Stevens, Paolo Bonzini,
	Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <1624524744.2sr7o7ix86.astroid@bobo.none>

On Thu, 24 Jun 2021 09:58:00 +0100,
Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> > From: David Stevens <stevensd@chromium.org>
> >  out_unlock:
> >  	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> >  		read_unlock(&vcpu->kvm->mmu_lock);
> >  	else
> >  		write_unlock(&vcpu->kvm->mmu_lock);
> > -	kvm_release_pfn_clean(pfn);
> > +	if (pfnpg.page)
> > +		put_page(pfnpg.page);
> >  	return r;
> >  }
> 
> How about
> 
>   kvm_release_pfn_page_clean(pfnpg);

I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
doesn't mark the page 'clean'. I find put_page() more correct.

Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
bad at naming things that I could just as well call it 'bob()'.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Paolo Bonzini @ 2021-06-24 10:13 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <1624528342.s2ezcyp90x.astroid@bobo.none>

On 24/06/21 11:57, Nicholas Piggin wrote:
>> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
>> it's a good idea to move the short name to the common case and the ugly
>> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not
>> sure there should be any kvm_pfn_page_unwrap in the end.
>
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some gfn_to_pfn
> users left over.

In this patches there are, so yeah the plan is to always change the 
callers to the new way.

> But yes in the end you would either need to make gfn_to_pfn never return
> a page found via follow_pte, or change all callers to the new way. If
> the plan is for the latter then I guess that's fine.


^ permalink raw reply

* Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
From: Paolo Bonzini @ 2021-06-24 10:17 UTC (permalink / raw)
  To: Marc Zyngier, Nicholas Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, Huacai Chen, Aleksandar Markovic,
	Zhi Wang, Suzuki K Poulose, intel-gfx, kvm-ppc, Zhenyu Wang,
	intel-gvt-dev, linux-arm-kernel, Jim Mattson, Sean Christopherson,
	linux-kernel, James Morse, David Stevens, Vitaly Kuznetsov,
	linuxppc-dev
In-Reply-To: <87mtrfinks.wl-maz@kernel.org>

On 24/06/21 12:06, Marc Zyngier wrote:
> On Thu, 24 Jun 2021 09:58:00 +0100,
> Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> From: David Stevens <stevensd@chromium.org>
>>>   out_unlock:
>>>   	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>>>   		read_unlock(&vcpu->kvm->mmu_lock);
>>>   	else
>>>   		write_unlock(&vcpu->kvm->mmu_lock);
>>> -	kvm_release_pfn_clean(pfn);
>>> +	if (pfnpg.page)
>>> +		put_page(pfnpg.page);
>>>   	return r;
>>>   }
>>
>> How about
>>
>>    kvm_release_pfn_page_clean(pfnpg);
> 
> I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
> doesn't mark the page 'clean'. I find put_page() more correct.
> 
> Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
> bad at naming things that I could just as well call it 'bob()'.

The best way to go would be to get rid of kvm_release_pfn_clean() and 
always go through a pfn_page.  Then we could or could not introduce 
wrappers kvm_put_pfn_page{,_dirty}.

I think for now it's best to limit the churn since these patches will go 
in the stable releases too, and clean up the resulting API once we have 
a clear idea of how all architectures are using kvm_pfn_page.

Paolo


^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Nicholas Piggin @ 2021-06-24 10:17 UTC (permalink / raw)
  To: Aleksandar Markovic, Huacai Chen, Marc Zyngier, Paul Mackerras,
	Paolo Bonzini, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <1624528342.s2ezcyp90x.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of June 24, 2021 7:57 pm:
> Excerpts from Paolo Bonzini's message of June 24, 2021 7:42 pm:
>> On 24/06/21 10:52, Nicholas Piggin wrote:
>>>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>>>> function. Callers which don't need the page struct will be updated in
>>>> follow-up patches.
>>> Hmm. You mean callers that do need the page will be updated? Normally
>>> if there will be leftover users that don't need the struct page then
>>> you would go the other way and keep the old call the same, and add a new
>>> one (gfn_to_pfn_page) just for those that need it.
>> 
>> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
>> it's a good idea to move the short name to the common case and the ugly 
>> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
>> sure there should be any kvm_pfn_page_unwrap in the end.
> 
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some gfn_to_pfn
> users left over.
> 
> But yes in the end you would either need to make gfn_to_pfn never return
> a page found via follow_pte, or change all callers to the new way. If 
> the plan is for the latter then I guess that's fine.

Actually in that case anyway I don't see the need -- the existence of
gfn_to_pfn is enough to know it might be buggy. It can just as easily
be grepped for as kvm_pfn_page_unwrap. And are gfn_to_page cases also
vulernable to the same issue?

So I think it could be marked deprecated or something if not everything 
will be converted in the one series, and don't need to touch all that 
arch code with this patch.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
From: Paolo Bonzini @ 2021-06-24 10:21 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, David Stevens, Zhenyu Wang, Zhi Wang
  Cc: Wanpeng Li, kvm, Suzuki K Poulose, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	James Morse, kvm-ppc, Sean Christopherson, Vitaly Kuznetsov,
	linux-mips, intel-gvt-dev, Joerg Roedel, linux-arm-kernel,
	Jim Mattson
In-Reply-To: <1624529635.75a1ann91v.astroid@bobo.none>

On 24/06/21 12:17, Nicholas Piggin wrote:
>> If all callers were updated that is one thing, but from the changelog
>> it sounds like that would not happen and there would be some gfn_to_pfn
>> users left over.
>>
>> But yes in the end you would either need to make gfn_to_pfn never return
>> a page found via follow_pte, or change all callers to the new way. If
>> the plan is for the latter then I guess that's fine.
>
> Actually in that case anyway I don't see the need -- the existence of
> gfn_to_pfn is enough to know it might be buggy. It can just as easily
> be grepped for as kvm_pfn_page_unwrap.

Sure, but that would leave us with longer function names 
(gfn_to_pfn_page* instead of gfn_to_pfn*).  So the "safe" use is the one 
that looks worse and the unsafe use is the one that looks safe.

> And are gfn_to_page cases also
> vulernable to the same issue?

No, they're just broken for the VM_IO|VM_PFNMAP case.

Paolo

> So I think it could be marked deprecated or something if not everything
> will be converted in the one series, and don't need to touch all that
> arch code with this patch.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox