LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
From: Valentin Schneider @ 2020-07-31  1:05 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Morten Rasmussen, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200729061355.GA14603@linux.vnet.ibm.com>


(+Cc Morten)

On 29/07/20 07:13, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>
> Hi Valentin,
>
> Thanks for looking into the patches.
>
>> On 27/07/20 06:32, Srikar Dronamraju wrote:
>> > Add percpu coregroup maps and masks to create coregroup domain.
>> > If a coregroup doesn't exist, the coregroup domain will be degenerated
>> > in favour of SMT/CACHE domain.
>> >
>>
>> So there's at least one arm64 platform out there with the same "pairs of
>> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
>> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
>> domain, and the whole system is covered by DIE.
>>
>> Now arguably it's not a perfect representation; DIE doesn't have
>> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
>> will impact all callsites using cpus_share_cache(): in the eMAG case, only
>> pairs of cores will be seen as sharing cache, even though *all* cores share
>> the same L3.
>>
>
> Okay, Its good to know that we have a chip which is similar to P9 in
> topology.
>
>> I'm trying to paint a picture of what the P9 topology looks like (the one
>> you showcase in your cover letter) to see if there are any similarities;
>> from what I gather in [1], wikichips and your cover letter, with P9 you can
>> have something like this in a single DIE (somewhat unsure about L3 setup;
>> it looks to be distributed?)
>>
>>      +---------------------------------------------------------------------+
>>      |                                  L3                                 |
>>      +---------------+-+---------------+-+---------------+-+---------------+
>>      |       L2      | |       L2      | |       L2      | |       L2      |
>>      +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
>>      |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
>>      +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>>      |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
>>      +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>>
>> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>>
>> NUMA     [                                                   ...
>> DIE      [                                             ]
>> MC       [         ] [         ] [         ] [         ]
>> BIGCORE  [         ] [         ] [         ] [         ]
>> SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
>>          00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  <other node here>
>>
>
> What you have summed up is perfectly what a P9 topology looks like. I dont
> think I could have explained it better than this.
>

Yay!

>> This however has MC == BIGCORE; what makes it you can have different spans
>> for these two domains? If it's not too much to ask, I'd love to have a P9
>> topology diagram.
>>
>> [1]: 20200722081822.GG9290@linux.vnet.ibm.com
>
> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>

Right, that one's fair enough.

> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>

Did you test this? WRT load balance we do try to balance "load" over the
different domain spans, so if you represent quads as their own MC domain,
you would AFAICT end up spreading tasks over the quads (rather than packing
them) when balancing at e.g. DIE level. The desired behaviour might be
hackable with some more ASYM_PACKING, but I'm not sure I should be
suggesting that :-)

> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>

IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
binding thing.

That's also where things get interesting (for me) because I experienced
something similar on another arm64 platform (ThunderX1). This was more
about cache bandwidth than cache latency, but IMO it's in the same bag of
fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
gave up on it given the TX1 was the only (arm64) platform where I could get
both significant and reproducible results.

Now, if you folks are seeing this on completely different hardware and have
"real" workloads that truly benefit from this kind of domain partitioning,
this might be another incentive to try and sort of generalize this. That's
outside the scope of your series, but your findings give me some hope!

I think what I had in mind back then was that if enough folks cared about
it, we might get some bits added to the ACPI spec; something along the
lines of proximity domains for the caches described in PPTT, IOW a cache
distance matrix. I don't really know what it'll take to get there, but I
figured I'd dump this in case someone's listening :-)

> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

From what I understood your MC domain does seem to map to LLC; but in any
case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
your changes your sd_llc is gonna be SMT, and that's not going to be a very
big mask. IMO you do want to correctly reflect your LLC situation via this
flag to make cpus_share_cache() work properly.

[1]: https://linuxplumbersconf.org/event/4/contributions/484/

^ permalink raw reply

* Re: [PATCH 1/2 v2] powerpc/dma: Define map/unmap mmio resource callbacks
From: Oliver O'Halloran @ 2020-07-31  3:30 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: vladimirk, Carol L Soto, linux-pci, shlomin, israelr,
	Frederic Barrat, idanw, linuxppc-dev, Christoph Hellwig, aneela
In-Reply-To: <20200430131520.51211-1-maxg@mellanox.com>

On Thu, Apr 30, 2020 at 11:15 PM Max Gurtovoy <maxg@mellanox.com> wrote:
>
> Define the map_resource/unmap_resource callbacks for the dma_iommu_ops
> used by several powerpc platforms. The map_resource callback is called
> when trying to map a mmio resource through the dma_map_resource()
> driver API.
>
> For now, the callback returns an invalid address for devices using
> translations, but will "direct" map the resource when in bypass
> mode. Previous behavior for dma_map_resource() was to always return an
> invalid address.
>
> We also call an optional platform-specific controller op in
> case some setup is needed for the platform.

Hey Max,

Sorry for not getting to this sooner. Fred has been dutifully nagging
me to look at it, but people are constantly throwing stuff at me so
it's slipped through the cracks.

Anyway, the changes here are fine IMO. The only real suggestion I have
is that we might want to move the direct / bypass mode check out of
the arch/powerpc/kernel/dma-iommu.c and into the PHB specific function
in pci_controller_ops. I don't see any real reason p2p support should
be limited to devices using bypass mode since the data path is the
same for translated and untranslated DMAs. We do need to impose that
restriction for OPAL / PowerNV IODA PHBs due to the implementation of
the opal_pci_set_p2p() has the side effect of forcing the TVE into
no-translate mode. However, that's a platform issue so the restriction
should be imposed in platform code.

I'd like to fix that, but I'd prefer to do it as a follow up change
since I need to have a think about how to fix the firmware bits.

Reviewed-by: Oliver O'Halloran <oohall@gmail.com>

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()
From: Bharata B Rao @ 2020-07-31  4:29 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1596151526-4374-1-git-send-email-linuxram@us.ibm.com>

On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> Observed the following oops while stress-testing, using multiple
> secureVM on a distro kernel. However this issue theoritically exists in
> 5.5 kernel and later.
> 
> This issue occurs when the total number of requested device-PFNs exceed
> the total-number of available device-PFNs.  PFN migration fails to
> allocate a device-pfn, which causes migrate_vma_finalize() to trigger
> kvmppc_uvmem_page_free() on a page, that is not associated with any
> device-pfn.  kvmppc_uvmem_page_free() blindly tries to access the
> contents of the private data which can be null, leading to the following
> kernel fault.
> 
>  --------------------------------------------------------------------------
>  Unable to handle kernel paging request for data at address 0x00000011
>  Faulting instruction address: 0xc00800000e36e110
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE SMP NR_CPUS=2048 NUMA PowerNV
> ....
>  MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
> 		 CR: 24424822  XER: 00000000
>  CFAR: c000000000e3d764 DAR: 0000000000000011 DSISR: 40000000 IRQMASK: 0
>  GPR00: c00800000e36e0a4 c000001f1d59f610 c00800000e38a400 0000000000000000
>  GPR04: c000001fa5000000 fffffffffffffffe ffffffffffffffff c000201fffeaf300
>  GPR08: 00000000000001f0 0000000000000000 0000000000000f80 c00800000e373608
>  GPR12: c000000000e3d710 c000201fffeaf300 0000000000000001 00007fef87360000
>  GPR16: 00007fff97db4410 c000201c3b66a578 ffffffffffffffff 0000000000000000
>  GPR20: 0000000119db9ad0 000000000000000a fffffffffffffffc 0000000000000001
>  GPR24: c000201c3b660000 c000001f1d59f7a0 c0000000004cffb0 0000000000000001
>  GPR28: 0000000000000000 c00a001ff003e000 c00800000e386150 0000000000000f80
>  NIP [c00800000e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
>  LR [c00800000e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
>  Call Trace:
>  [c000000000512010] free_devmap_managed_page+0xd0/0x100
>  [c0000000003f71d0] put_devmap_managed_page+0xa0/0xc0
>  [c0000000004d24bc] migrate_vma_finalize+0x32c/0x410
>  [c00800000e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
>  [c00800000e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
>  [c00800000e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
>  [c00800000e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
>  [c00800000e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
>  [c00800000e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
>  [c00800000e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
>  [c00800000e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
>  [c0000000005451d0] do_vfs_ioctl+0xe0/0xaa0
>  [c000000000545d64] sys_ioctl+0xc4/0x160
>  [c00000000000b408] system_call+0x5c/0x70
>  Instruction dump:
>  a12d1174 2f890000 409e0158 a1271172 3929ffff b1271172 7c2004ac 39200000
>  913e0140 39200000 e87d0010 f93d0010 <89230011> e8c30000 e9030008 2f890000
>  --------------------------------------------------------------------------
> 
>  Fix the oops..
> 
> fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 2806983..f4002bf 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
>  {
>  	unsigned long pfn = page_to_pfn(page) -
>  			(kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
> -	struct kvmppc_uvmem_page_pvt *pvt;
> +	struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
> +
> +	if (!pvt)
> +		return;
>  
>  	spin_lock(&kvmppc_uvmem_bitmap_lock);
>  	bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
>  	spin_unlock(&kvmppc_uvmem_bitmap_lock);
>  
> -	pvt = page->zone_device_data;
>  	page->zone_device_data = NULL;
>  	if (pvt->remove_gfn)
>  		kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);

In our case, device pages that are in use are always associated with a valid
pvt member. See kvmppc_uvmem_get_page() which returns failure if it
runs out of device pfns and that will result in proper failure of
page-in calls.

For the case where we run out of device pfns, migrate_vma_finalize() will
restore the original PTE and will not replace the PTE with device private PTE.

Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
called for non-device-private pages.

This could be a use-after-free case possibly arising out of the new state
changes in HV. If so, this fix will only mask the bug and not address the
original problem.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall
From: Bharata B Rao @ 2020-07-31  4:33 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, linux-doc, corbet, kvm-ppc, Julia Lawall, sathnaga,
	sukadev, linuxppc-dev, david
In-Reply-To: <20200730232101.GB5882@oc0525413822.ibm.com>

On Thu, Jul 30, 2020 at 04:21:01PM -0700, Ram Pai wrote:
> H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
> way in which a page will be treated.  H_PAGE_IN_NONSHARED indicates
> that the page will be shared with the Secure VM, and H_PAGE_IN_SHARED
> indicates that the page will not be shared but its contents will
> be copied.

Looks like you got the definitions of shared and non-shared interchanged.

> 
> However H_PAGE_IN_NONSHARED is not defined in the header file, though
> it is defined and documented in the API captured in
> Documentation/powerpc/ultravisor.rst
> 
> Define H_PAGE_IN_NONSHARED in the header file.

What is the use of defining this? Is this used directly in any place?
Or, are youp planning to introduce such a usage?

Regards,
Bharata.

^ permalink raw reply

* [PATCH] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
From: Shengjiu Wang @ 2020-07-31  6:28 UTC (permalink / raw)
  To: timur, nicoleotsuka, Xiubo.Lee, festevam, lgirdwood, broonie,
	perex, tiwai, alsa-devel, linuxppc-dev, linux-kernel

The fifo_depth is 64 on i.MX8QM/i.MX8QXP, 128 on i.MX8MQ, 16 on
i.MX7ULP.

Original FSL_SAI_CR1_RFW_MASK value 0x1F is not suitable for
these platform, the FIFO watermark mask should be updated
according to the fifo_depth.

Fixes: a860fac42097 ("ASoC: fsl_sai: Add support for imx7ulp/imx8mq")
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
---
 sound/soc/fsl/fsl_sai.c | 5 +++--
 sound/soc/fsl/fsl_sai.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
index a22562f2df47..cdff739924e2 100644
--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -680,10 +680,11 @@ static int fsl_sai_dai_probe(struct snd_soc_dai *cpu_dai)
 	regmap_write(sai->regmap, FSL_SAI_RCSR(ofs), 0);
 
 	regmap_update_bits(sai->regmap, FSL_SAI_TCR1(ofs),
-			   FSL_SAI_CR1_RFW_MASK,
+			   FSL_SAI_CR1_RFW_MASK(sai->soc_data->fifo_depth),
 			   sai->soc_data->fifo_depth - FSL_SAI_MAXBURST_TX);
 	regmap_update_bits(sai->regmap, FSL_SAI_RCR1(ofs),
-			   FSL_SAI_CR1_RFW_MASK, FSL_SAI_MAXBURST_RX - 1);
+			   FSL_SAI_CR1_RFW_MASK(sai->soc_data->fifo_depth),
+			   FSL_SAI_MAXBURST_RX - 1);
 
 	snd_soc_dai_init_dma_data(cpu_dai, &sai->dma_params_tx,
 				&sai->dma_params_rx);
diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h
index 76b15deea80c..6aba7d28f5f3 100644
--- a/sound/soc/fsl/fsl_sai.h
+++ b/sound/soc/fsl/fsl_sai.h
@@ -94,7 +94,7 @@
 #define FSL_SAI_CSR_FRDE	BIT(0)
 
 /* SAI Transmit and Receive Configuration 1 Register */
-#define FSL_SAI_CR1_RFW_MASK	0x1f
+#define FSL_SAI_CR1_RFW_MASK(x)	((x) - 1)
 
 /* SAI Transmit and Receive Configuration 2 Register */
 #define FSL_SAI_CR2_SYNC	BIT(30)
-- 
2.27.0


^ permalink raw reply related

* [PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
From: Vaibhav Jain @ 2020-07-31  6:41 UTC (permalink / raw)
  To: linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
	Vaibhav Jain, Dan Williams, Ira Weiny

Changes since v3[1]:

* Fixed a rebase issue pointed out by Aneesh in first patch in the series.

[1] https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaibhav@linux.ibm.com
---

This small patchset implements kernel side support for reporting
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With corresponding NDCTL side changes output for
should be like:

$ sudo ndctl list -DH
[
  {
    "dev":"nmem0",
    "health":{
      "health_state":"ok",
      "life_used_percentage":0,
      "shutdown_state":"clean"
    }
  }
]

PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and  'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.

Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.

Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.

Vaibhav Jain (2):
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h     |   9 +
 arch/powerpc/platforms/pseries/papr_scm.c     | 199 ++++++++++++++++++
 3 files changed, 235 insertions(+)

-- 
2.26.2

^ permalink raw reply

* [PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
From: Vaibhav Jain @ 2020-07-31  6:41 UTC (permalink / raw)
  To: linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
	Vaibhav Jain, Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-1-vaibhav@linux.ibm.com>

Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provide a sysfs ABI documentation for the stats being reported and
their meanings.

During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
performance statistics is supported or not. If successful then a PHYP
returns a maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally statistics reported in the
buffer are formatted into the sysfs access function output buffer.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
Changelog:

v4:
* Fixed a build issue with this patch by moving a hunk from second
  patch in series to this patch. [ Aneesh ]

v3:
* Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
  args to the function. Instead 'buff_size' is calculated from
  'num_stats' and instead of populating 'R4' in arg 'out' the value is
  returned from the function in case 'R4' represents
  'max-buffer-size'.

Resend:
None

v2:
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from Big endian to cpu endian happens later rather than
just after its fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructed some code to handle error case first [ Ira ]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 ++++
 arch/powerpc/platforms/pseries/papr_scm.c     | 150 ++++++++++++++++++
 2 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
 				  NVDIMM have been scrubbed.
 		* "locked"	: Indicating that NVDIMM contents cant
 				  be modified until next power cycle.
+
+What:		/sys/bus/nd/devices/nmemX/papr/perf_stats
+Date:		May, 2020
+KernelVersion:	v5.9
+Contact:	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
+Description:
+		(RO) Report various performance stats related to papr-scm NVDIMM
+		device.  Each stat is reported on a new line with each line
+		composed of a stat-identifier followed by it value. Below are
+		currently known dimm performance stats which are reported:
+
+		* "CtlResCt" : Controller Reset Count
+		* "CtlResTm" : Controller Reset Elapsed Time
+		* "PonSecs " : Power-on Seconds
+		* "MemLife " : Life Remaining
+		* "CritRscU" : Critical Resource Utilization
+		* "HostLCnt" : Host Load Count
+		* "HostSCnt" : Host Store Count
+		* "HostSDur" : Host Store Duration
+		* "HostLDur" : Host Load Duration
+		* "MedRCnt " : Media Read Count
+		* "MedWCnt " : Media Write Count
+		* "MedRDur " : Media Read Duration
+		* "MedWDur " : Media Write Duration
+		* "CchRHCnt" : Cache Read Hit Count
+		* "CchWHCnt" : Cache Write Hit Count
+		* "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 3d1235a76ba9..f37f3f70007d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -64,6 +64,26 @@
 				    PAPR_PMEM_HEALTH_FATAL |	\
 				    PAPR_PMEM_HEALTH_UNHEALTHY)
 
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+	u8 stat_id[8];
+	__be64 stat_val;
+} __packed;
+
+/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
+struct papr_scm_perf_stats {
+	u8 eye_catcher[8];
+	/* Should be PAPR_SCM_PERF_STATS_VERSION */
+	__be32 stats_version;
+	/* Number of stats following */
+	__be32 num_statistics;
+	/* zero or more performance matrics */
+	struct papr_scm_perf_stat scm_statistic[];
+} __packed;
+
 /* private struct associated with each region */
 struct papr_scm_priv {
 	struct platform_device *pdev;
@@ -92,6 +112,9 @@ struct papr_scm_priv {
 
 	/* Health information for the dimm */
 	u64 health_bitmap;
+
+	/* length of the stat buffer as expected by phyp */
+	size_t stat_buffer_len;
 };
 
 static LIST_HEAD(papr_nd_regions);
@@ -200,6 +223,79 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
 	return drc_pmem_bind(p);
 }
 
+/*
+ * Query the Dimm performance stats from PHYP and copy them (if returned) to
+ * provided struct papr_scm_perf_stats instance 'stats' that can hold atleast
+ * (num_stats + header) bytes.
+ * - If buff_stats == NULL the return value is the size in byes of the buffer
+ * needed to hold all supported performance-statistics.
+ * - If buff_stats != NULL and num_stats == 0 then we copy all known
+ * performance-statistics to 'buff_stat' and expect to be large enough to
+ * hold them.
+ * - if buff_stats != NULL and num_stats > 0 then copy the requested
+ * performance-statistics to buff_stats.
+ */
+static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
+				    struct papr_scm_perf_stats *buff_stats,
+				    unsigned int num_stats)
+{
+	unsigned long ret[PLPAR_HCALL_BUFSIZE];
+	size_t size;
+	s64 rc;
+
+	/* Setup the out buffer */
+	if (buff_stats) {
+		memcpy(buff_stats->eye_catcher,
+		       PAPR_SCM_PERF_STATS_EYECATCHER, 8);
+		buff_stats->stats_version =
+			cpu_to_be32(PAPR_SCM_PERF_STATS_VERSION);
+		buff_stats->num_statistics =
+			cpu_to_be32(num_stats);
+
+		/*
+		 * Calculate the buffer size based on num-stats provided
+		 * or use the prefetched max buffer length
+		 */
+		if (num_stats)
+			/* Calculate size from the num_stats */
+			size = sizeof(struct papr_scm_perf_stats) +
+				num_stats * sizeof(struct papr_scm_perf_stat);
+		else
+			size = p->stat_buffer_len;
+	} else {
+		/* In case of no out buffer ignore the size */
+		size = 0;
+	}
+
+	/* Do the HCALL asking PHYP for info */
+	rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
+			 buff_stats ? virt_to_phys(buff_stats) : 0,
+			 size);
+
+	/* Check if the error was due to an unknown stat-id */
+	if (rc == H_PARTIAL) {
+		dev_err(&p->pdev->dev,
+			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
+		return -ENOENT;
+	} else if (rc != H_SUCCESS) {
+		dev_err(&p->pdev->dev,
+			"Failed to query performance stats, Err:%lld\n", rc);
+		return -EIO;
+
+	} else if (!size) {
+		/* Handle case where stat buffer size was requested */
+		dev_dbg(&p->pdev->dev,
+			"Performance stats size %ld\n", ret[0]);
+		return ret[0];
+	}
+
+	/* Successfully fetched the requested stats from phyp */
+	dev_dbg(&p->pdev->dev,
+		"Performance stats returned %d stats\n",
+		be32_to_cpu(buff_stats->num_statistics));
+	return 0;
+}
+
 /*
  * Issue hcall to retrieve dimm health info and populate papr_scm_priv with the
  * health information.
@@ -637,6 +733,48 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
 	return 0;
 }
 
+static ssize_t perf_stats_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	int index, rc;
+	struct seq_buf s;
+	struct papr_scm_perf_stat *stat;
+	struct papr_scm_perf_stats *stats;
+	struct nvdimm *dimm = to_nvdimm(dev);
+	struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+	if (!p->stat_buffer_len)
+		return -ENOENT;
+
+	/* Allocate the buffer for phyp where stats are written */
+	stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
+	if (!stats)
+		return -ENOMEM;
+
+	/* Ask phyp to return all dimm perf stats */
+	rc = drc_pmem_query_stats(p, stats, 0);
+	if (rc)
+		goto free_stats;
+	/*
+	 * Go through the returned output buffer and print stats and
+	 * values. Since stat_id is essentially a char string of
+	 * 8 bytes, simply use the string format specifier to print it.
+	 */
+	seq_buf_init(&s, buf, PAGE_SIZE);
+	for (index = 0, stat = stats->scm_statistic;
+	     index < be32_to_cpu(stats->num_statistics);
+	     ++index, ++stat) {
+		seq_buf_printf(&s, "%.8s = 0x%016llX\n",
+			       stat->stat_id,
+			       be64_to_cpu(stat->stat_val));
+	}
+
+free_stats:
+	kfree(stats);
+	return rc ? rc : seq_buf_used(&s);
+}
+DEVICE_ATTR_RO(perf_stats);
+
 static ssize_t flags_show(struct device *dev,
 			  struct device_attribute *attr, char *buf)
 {
@@ -682,6 +820,7 @@ DEVICE_ATTR_RO(flags);
 /* papr_scm specific dimm attributes */
 static struct attribute *papr_nd_attributes[] = {
 	&dev_attr_flags.attr,
+	&dev_attr_perf_stats.attr,
 	NULL,
 };
 
@@ -702,6 +841,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	struct nd_region_desc ndr_desc;
 	unsigned long dimm_flags;
 	int target_nid, online_nid;
+	ssize_t stat_size;
 
 	p->bus_desc.ndctl = papr_scm_ndctl;
 	p->bus_desc.module = THIS_MODULE;
@@ -769,6 +909,16 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	list_add_tail(&p->region_list, &papr_nd_regions);
 	mutex_unlock(&papr_ndr_lock);
 
+	/* Try retriving the stat buffer and see if its supported */
+	stat_size = drc_pmem_query_stats(p, NULL, 0);
+	if (stat_size > 0) {
+		p->stat_buffer_len = stat_size;
+		dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
+			p->stat_buffer_len);
+	} else {
+		dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
+	}
+
 	return 0;
 
 err:	nvdimm_bus_unregister(p->bus);
-- 
2.26.2


^ permalink raw reply related

* [PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
From: Vaibhav Jain @ 2020-07-31  6:41 UTC (permalink / raw)
  To: linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
	Vaibhav Jain, Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-1-vaibhav@linux.ibm.com>

We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by by
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
Changelog:

v4:
* Moved a hunk from this patch to previous patch in the series.
  [ Aneesh ]

v3:
* Updated papr_pdsm_fuel_guage() to use the updated
  drc_pmem_query_stats() function.

Resend:
None

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_psdm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +++++
 arch/powerpc/platforms/pseries/papr_scm.c | 49 +++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL      2
 #define PAPR_PDSM_DIMM_FATAL         3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked		: Contents of the dimm cant be modified until CEC reboot
  * dimm_encrypted	: Contents of dimm are encrypted.
  * dimm_health		: Dimm health indicator. One of PAPR_PDSM_DIMM_XXXX
+ * dimm_fuel_gauge	: Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
 	union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
 			__u8 dimm_locked;
 			__u8 dimm_encrypted;
 			__u16 dimm_health;
+
+			/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+			__u16 dimm_fuel_gauge;
 		};
 		__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
 	};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index f37f3f70007d..f439f0dfea7d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
 	return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+				union nd_pdsm_payload *payload)
+{
+	int rc, size;
+	u64 statval;
+	struct papr_scm_perf_stat *stat;
+	struct papr_scm_perf_stats *stats;
+
+	/* Silently fail if fetching performance metrics isn't  supported */
+	if (!p->stat_buffer_len)
+		return 0;
+
+	/* Allocate request buffer enough to hold single performance stat */
+	size = sizeof(struct papr_scm_perf_stats) +
+		sizeof(struct papr_scm_perf_stat);
+
+	stats = kzalloc(size, GFP_KERNEL);
+	if (!stats)
+		return -ENOMEM;
+
+	stat = &stats->scm_statistic[0];
+	memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
+	stat->stat_val = 0;
+
+	/* Fetch the fuel gauge and populate it in payload */
+	rc = drc_pmem_query_stats(p, stats, 1);
+	if (rc < 0) {
+		dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+		goto free_stats;
+	}
+
+	statval = be64_to_cpu(stat->stat_val);
+	dev_dbg(&p->pdev->dev,
+		"Fetched fuel-gauge %llu", statval);
+	payload->health.extension_flags |=
+		PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+	payload->health.dimm_fuel_gauge = statval;
+
+	rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+	kfree(stats);
+	return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
 			    union nd_pdsm_payload *payload)
@@ -558,6 +603,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
 	/* struct populated hence can release the mutex now */
 	mutex_unlock(&p->health_mutex);
+
+	/* Populate the fuel gauge meter in the payload */
+	papr_pdsm_fuel_gauge(p, payload);
+
 	rc = sizeof(struct nd_papr_pdsm_health);
 
 out:
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
From: Aneesh Kumar K.V @ 2020-07-31  7:12 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Vaibhav Jain,
	Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-2-vaibhav@linux.ibm.com>

Vaibhav Jain <vaibhav@linux.ibm.com> writes:

> Update papr_scm.c to query dimm performance statistics from PHYP via
> H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
> specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
> provide a sysfs ABI documentation for the stats being reported and
> their meanings.
>
> During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
> of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
> performance statistics is supported or not. If successful then a PHYP
> returns a maximum possible buffer length needed to read all
> performance stats. This returned value is stored in a per-nvdimm
> attribute 'stat_buffer_len'.
>
> The layout of request buffer for reading NVDIMM performance stats from
> PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
> papr_scm_perf_stat'. These structs are used in newly introduced
> drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.
>
> The sysfs access function perf_stats_show() uses value
> 'stat_buffer_len' to allocate a buffer large enough to hold all
> possible NVDIMM performance stats and passes it to
> drc_pmem_query_stats() to populate. Finally statistics reported in the
> buffer are formatted into the sysfs access function output buffer.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
> Changelog:
>
> v4:
> * Fixed a build issue with this patch by moving a hunk from second
>   patch in series to this patch. [ Aneesh ]
>
> v3:
> * Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
>   args to the function. Instead 'buff_size' is calculated from
>   'num_stats' and instead of populating 'R4' in arg 'out' the value is
>   returned from the function in case 'R4' represents
>   'max-buffer-size'.
>
> Resend:
> None
>
> v2:
> * Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
> to use big-endian types. [ Aneesh ]
> * s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
> * s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
> * Conversion from Big endian to cpu endian happens later rather than
> just after its fetched from PHYP.
> * Changed a log statement to unambiguously report dimm performance
> stats are not available for the given nvdimm [ Ira ]
> * Restructed some code to handle error case first [ Ira ]
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 ++++
>  arch/powerpc/platforms/pseries/papr_scm.c     | 150 ++++++++++++++++++
>  2 files changed, 177 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 5b10d036a8d4..c1a67275c43f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -25,3 +25,30 @@ Description:
>  				  NVDIMM have been scrubbed.
>  		* "locked"	: Indicating that NVDIMM contents cant
>  				  be modified until next power cycle.
> +
> +What:		/sys/bus/nd/devices/nmemX/papr/perf_stats
> +Date:		May, 2020
> +KernelVersion:	v5.9
> +Contact:	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
> +Description:
> +		(RO) Report various performance stats related to papr-scm NVDIMM
> +		device.  Each stat is reported on a new line with each line
> +		composed of a stat-identifier followed by it value. Below are
> +		currently known dimm performance stats which are reported:
> +
> +		* "CtlResCt" : Controller Reset Count
> +		* "CtlResTm" : Controller Reset Elapsed Time
> +		* "PonSecs " : Power-on Seconds
> +		* "MemLife " : Life Remaining
> +		* "CritRscU" : Critical Resource Utilization
> +		* "HostLCnt" : Host Load Count
> +		* "HostSCnt" : Host Store Count
> +		* "HostSDur" : Host Store Duration
> +		* "HostLDur" : Host Load Duration
> +		* "MedRCnt " : Media Read Count
> +		* "MedWCnt " : Media Write Count
> +		* "MedRDur " : Media Read Duration
> +		* "MedWDur " : Media Write Duration
> +		* "CchRHCnt" : Cache Read Hit Count
> +		* "CchWHCnt" : Cache Write Hit Count
> +		* "FastWCnt" : Fast Write Count
> \ No newline at end of file
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 3d1235a76ba9..f37f3f70007d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -64,6 +64,26 @@
>  				    PAPR_PMEM_HEALTH_FATAL |	\
>  				    PAPR_PMEM_HEALTH_UNHEALTHY)
>  
> +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
> +#define PAPR_SCM_PERF_STATS_VERSION 0x1
> +
> +/* Struct holding a single performance metric */
> +struct papr_scm_perf_stat {
> +	u8 stat_id[8];
> +	__be64 stat_val;
> +} __packed;
> +
> +/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
> +struct papr_scm_perf_stats {
> +	u8 eye_catcher[8];
> +	/* Should be PAPR_SCM_PERF_STATS_VERSION */
> +	__be32 stats_version;
> +	/* Number of stats following */
> +	__be32 num_statistics;
> +	/* zero or more performance matrics */
> +	struct papr_scm_perf_stat scm_statistic[];
> +} __packed;
> +
>  /* private struct associated with each region */
>  struct papr_scm_priv {
>  	struct platform_device *pdev;
> @@ -92,6 +112,9 @@ struct papr_scm_priv {
>  
>  	/* Health information for the dimm */
>  	u64 health_bitmap;
> +
> +	/* length of the stat buffer as expected by phyp */
> +	size_t stat_buffer_len;
>  };
>  
>  static LIST_HEAD(papr_nd_regions);
> @@ -200,6 +223,79 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
>  	return drc_pmem_bind(p);
>  }
>  
> +/*
> + * Query the Dimm performance stats from PHYP and copy them (if returned) to
> + * provided struct papr_scm_perf_stats instance 'stats' that can hold atleast
> + * (num_stats + header) bytes.
> + * - If buff_stats == NULL the return value is the size in byes of the buffer
> + * needed to hold all supported performance-statistics.
> + * - If buff_stats != NULL and num_stats == 0 then we copy all known
> + * performance-statistics to 'buff_stat' and expect to be large enough to
> + * hold them.
> + * - if buff_stats != NULL and num_stats > 0 then copy the requested
> + * performance-statistics to buff_stats.
> + */
> +static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
> +				    struct papr_scm_perf_stats *buff_stats,
> +				    unsigned int num_stats)
> +{
> +	unsigned long ret[PLPAR_HCALL_BUFSIZE];
> +	size_t size;
> +	s64 rc;
> +
> +	/* Setup the out buffer */
> +	if (buff_stats) {
> +		memcpy(buff_stats->eye_catcher,
> +		       PAPR_SCM_PERF_STATS_EYECATCHER, 8);
> +		buff_stats->stats_version =
> +			cpu_to_be32(PAPR_SCM_PERF_STATS_VERSION);
> +		buff_stats->num_statistics =
> +			cpu_to_be32(num_stats);
> +
> +		/*
> +		 * Calculate the buffer size based on num-stats provided
> +		 * or use the prefetched max buffer length
> +		 */
> +		if (num_stats)
> +			/* Calculate size from the num_stats */
> +			size = sizeof(struct papr_scm_perf_stats) +
> +				num_stats * sizeof(struct papr_scm_perf_stat);
> +		else
> +			size = p->stat_buffer_len;
> +	} else {
> +		/* In case of no out buffer ignore the size */
> +		size = 0;
> +	}
> +
> +	/* Do the HCALL asking PHYP for info */
> +	rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
> +			 buff_stats ? virt_to_phys(buff_stats) : 0,
> +			 size);
> +
> +	/* Check if the error was due to an unknown stat-id */
> +	if (rc == H_PARTIAL) {
> +		dev_err(&p->pdev->dev,
> +			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
> +		return -ENOENT;
> +	} else if (rc != H_SUCCESS) {
> +		dev_err(&p->pdev->dev,
> +			"Failed to query performance stats, Err:%lld\n", rc);
> +		return -EIO;
> +
> +	} else if (!size) {
> +		/* Handle case where stat buffer size was requested */
> +		dev_dbg(&p->pdev->dev,
> +			"Performance stats size %ld\n", ret[0]);
> +		return ret[0];
> +	}
> +
> +	/* Successfully fetched the requested stats from phyp */
> +	dev_dbg(&p->pdev->dev,
> +		"Performance stats returned %d stats\n",
> +		be32_to_cpu(buff_stats->num_statistics));
> +	return 0;
> +}
> +
>  /*
>   * Issue hcall to retrieve dimm health info and populate papr_scm_priv with the
>   * health information.
> @@ -637,6 +733,48 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
>  	return 0;
>  }
>  
> +static ssize_t perf_stats_show(struct device *dev,
> +			       struct device_attribute *attr, char *buf)
> +{
> +	int index, rc;
> +	struct seq_buf s;
> +	struct papr_scm_perf_stat *stat;
> +	struct papr_scm_perf_stats *stats;
> +	struct nvdimm *dimm = to_nvdimm(dev);
> +	struct papr_scm_priv *p = nvdimm_provider_data(dimm);
> +
> +	if (!p->stat_buffer_len)
> +		return -ENOENT;
> +
> +	/* Allocate the buffer for phyp where stats are written */
> +	stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
> +	if (!stats)
> +		return -ENOMEM;
> +
> +	/* Ask phyp to return all dimm perf stats */
> +	rc = drc_pmem_query_stats(p, stats, 0);
> +	if (rc)
> +		goto free_stats;

So we end up making a HCALL for each read of the sysfs file? You do
throttle that for PAPR_HEALTH hcall (flags sysfs file). Do we need to do
that here? If not should we make this CAP_SYS_ADMIN?  You can possibly
add is_visible callback to papr group and then restrict this all to
CAP_SYS_ADMIN?


> +	/*
> +	 * Go through the returned output buffer and print stats and
> +	 * values. Since stat_id is essentially a char string of
> +	 * 8 bytes, simply use the string format specifier to print it.
> +	 */
> +	seq_buf_init(&s, buf, PAGE_SIZE);
> +	for (index = 0, stat = stats->scm_statistic;
> +	     index < be32_to_cpu(stats->num_statistics);
> +	     ++index, ++stat) {
> +		seq_buf_printf(&s, "%.8s = 0x%016llX\n",
> +			       stat->stat_id,
> +			       be64_to_cpu(stat->stat_val));
> +	}
> +
> +free_stats:
> +	kfree(stats);
> +	return rc ? rc : seq_buf_used(&s);
> +}
> +DEVICE_ATTR_RO(perf_stats);
> +
>  static ssize_t flags_show(struct device *dev,
>  			  struct device_attribute *attr, char *buf)
>  {
> @@ -682,6 +820,7 @@ DEVICE_ATTR_RO(flags);
>  /* papr_scm specific dimm attributes */
>  static struct attribute *papr_nd_attributes[] = {
>  	&dev_attr_flags.attr,
> +	&dev_attr_perf_stats.attr,
>  	NULL,
>  };
>  
> @@ -702,6 +841,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  	struct nd_region_desc ndr_desc;
>  	unsigned long dimm_flags;
>  	int target_nid, online_nid;
> +	ssize_t stat_size;
>  
>  	p->bus_desc.ndctl = papr_scm_ndctl;
>  	p->bus_desc.module = THIS_MODULE;
> @@ -769,6 +909,16 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  	list_add_tail(&p->region_list, &papr_nd_regions);
>  	mutex_unlock(&papr_ndr_lock);
>  
> +	/* Try retriving the stat buffer and see if its supported */
> +	stat_size = drc_pmem_query_stats(p, NULL, 0);
> +	if (stat_size > 0) {
> +		p->stat_buffer_len = stat_size;
> +		dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
> +			p->stat_buffer_len);
> +	} else {
> +		dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
> +	}
> +
>  	return 0;
>  
>  err:	nvdimm_bus_unregister(p->bus);
> -- 
> 2.26.2

^ permalink raw reply

* Re: [PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
From: Aneesh Kumar K.V @ 2020-07-31  7:14 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Vaibhav Jain,
	Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-3-vaibhav@linux.ibm.com>

Vaibhav Jain <vaibhav@linux.ibm.com> writes:

> We add support for reporting 'fuel-gauge' NVDIMM metric via
> PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
> life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
> metric via the H_SCM_PERFORMANCE_STATS.
>
> The metric value is returned from the pdsm by extending the return
> payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
> field 'dimm_fuel_gauge' to hold the metric value is introduced at the
> end of the payload struct and its presence is indicated by by
> extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.
>
> The patch introduces a new function papr_pdsm_fuel_gauge() that is
> called from papr_pdsm_health(). If fetching NVDIMM performance stats
> is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer
> large enough to hold the performance stat and passes it to
> drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
> of the stat is then populated in the 'struct
> nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
> 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
> nd_papr_pdsm_health.extension_flags'
>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
> Changelog:
>
> v4:
> * Moved a hunk from this patch to previous patch in the series.
>   [ Aneesh ]
>
> v3:
> * Updated papr_pdsm_fuel_guage() to use the updated
>   drc_pmem_query_stats() function.
>
> Resend:
> None
>
> v2:
> * Restructure code in papr_pdsm_fuel_gauge() to handle error case
> first [ Ira ]
> * Ignore the return value of papr_pdsm_fuel_gauge() in
> papr_psdm_health() [ Ira ]
> ---
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +++++
>  arch/powerpc/platforms/pseries/papr_scm.c | 49 +++++++++++++++++++++++
>  2 files changed, 58 insertions(+)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> index 9ccecc1d6840..50ef95e2f5b1 100644
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> @@ -72,6 +72,11 @@
>  #define PAPR_PDSM_DIMM_CRITICAL      2
>  #define PAPR_PDSM_DIMM_FATAL         3
>  
> +/* struct nd_papr_pdsm_health.extension_flags field flags */
> +
> +/* Indicate that the 'dimm_fuel_gauge' field is valid */
> +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
> +
>  /*
>   * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
>   * Various flags indicate the health status of the dimm.
> @@ -84,6 +89,7 @@
>   * dimm_locked		: Contents of the dimm cant be modified until CEC reboot
>   * dimm_encrypted	: Contents of dimm are encrypted.
>   * dimm_health		: Dimm health indicator. One of PAPR_PDSM_DIMM_XXXX
> + * dimm_fuel_gauge	: Life remaining of DIMM as a percentage from 0-100
>   */
>  struct nd_papr_pdsm_health {
>  	union {
> @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
>  			__u8 dimm_locked;
>  			__u8 dimm_encrypted;
>  			__u16 dimm_health;
> +
> +			/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
> +			__u16 dimm_fuel_gauge;
>  		};
>  		__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
>  	};
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index f37f3f70007d..f439f0dfea7d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
>  	return 0;
>  }
>  
> +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
> +				union nd_pdsm_payload *payload)
> +{
> +	int rc, size;
> +	u64 statval;
> +	struct papr_scm_perf_stat *stat;
> +	struct papr_scm_perf_stats *stats;
> +
> +	/* Silently fail if fetching performance metrics isn't  supported */
> +	if (!p->stat_buffer_len)
> +		return 0;
> +
> +	/* Allocate request buffer enough to hold single performance stat */
> +	size = sizeof(struct papr_scm_perf_stats) +
> +		sizeof(struct papr_scm_perf_stat);
> +
> +	stats = kzalloc(size, GFP_KERNEL);
> +	if (!stats)
> +		return -ENOMEM;
> +
> +	stat = &stats->scm_statistic[0];
> +	memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
> +	stat->stat_val = 0;
> +
> +	/* Fetch the fuel gauge and populate it in payload */
> +	rc = drc_pmem_query_stats(p, stats, 1);
> +	if (rc < 0) {
> +		dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
> +		goto free_stats;
> +	}
> +
> +	statval = be64_to_cpu(stat->stat_val);
> +	dev_dbg(&p->pdev->dev,
> +		"Fetched fuel-gauge %llu", statval);
> +	payload->health.extension_flags |=
> +		PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
> +	payload->health.dimm_fuel_gauge = statval;
> +
> +	rc = sizeof(struct nd_papr_pdsm_health);
> +
> +free_stats:
> +	kfree(stats);
> +	return rc;
> +}
> +
>  /* Fetch the DIMM health info and populate it in provided package. */
>  static int papr_pdsm_health(struct papr_scm_priv *p,
>  			    union nd_pdsm_payload *payload)
> @@ -558,6 +603,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
>  
>  	/* struct populated hence can release the mutex now */
>  	mutex_unlock(&p->health_mutex);
> +
> +	/* Populate the fuel gauge meter in the payload */
> +	papr_pdsm_fuel_gauge(p, payload);
> +
>  	rc = sizeof(struct nd_papr_pdsm_health);
>  
>  out:
> -- 
> 2.26.2

^ permalink raw reply

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
From: Gautham R Shenoy @ 2020-07-31  7:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Ingo Molnar, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Valentin Schneider
In-Reply-To: <20200729061355.GA14603@linux.vnet.ibm.com>

Hi Srikar, Valentin,

On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>

[..snip..]

> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
> 
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
> 
> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
> 
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best
results for POWER9 in terms of cache-benefits during wakeup.  On a
POWER9 Boston machine, running a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c)

The test case creates two threads, one Producer and another
Consumer. Both work on a fairly large shared array of size 64M.  In an
interation the Producer performs stores to 1024 random locations and
wakes up the Consumer. In the Consumer's iteration, loads from those
exact 1024 locations.

We measure the number of Consumer iterations per second and the
average time for each Consumer iteration. The smaller the time, the
better it is.

The following results are when I pinned the Producer and Consumer to
different combinations of CPUs to cover Small core , Big-core,
Neighbouring Big-core, Far off core within the same chip, and across
chips. There is a also a case where they are not affined anywhere, and
we let the scheduler wake them up correctly.

We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the numbers that we
get when we let the Scheduler wake them up (where LLC is L2).

## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)

Consumer affined to  CPU 3
Producer affined to  CPU 1
    4698 iterations, avg time: 20034 ns
    4951 iterations, avg time: 20012 ns
    4957 iterations, avg time: 19971 ns
    4968 iterations, avg time: 19985 ns
    4970 iterations, avg time: 19977 ns


## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)

Consumer affined to  CPU 7
Producer affined to  CPU 1
    4580 iterations, avg time: 19403 ns
    4851 iterations, avg time: 19373 ns
    4849 iterations, avg time: 19394 ns
    4856 iterations, avg time: 19394 ns
    4867 iterations, avg time: 19353 ns

## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)

Producer affined to  CPU 1
Consumer affined to  CPU 11
    4270 iterations, avg time: 24158 ns
    4491 iterations, avg time: 24157 ns
    4500 iterations, avg time: 24148 ns
    4516 iterations, avg time: 24164 ns
    4518 iterations, avg time: 24165 ns


## Any other Big-core from Same Chip (Shares L3)

Producer affined to  CPU 1
Consumer affined to  CPU 87
    4176 iterations, avg time: 27953 ns
    4417 iterations, avg time: 27925 ns
    4415 iterations, avg time: 27934 ns
    4417 iterations, avg time: 27983 ns
    4430 iterations, avg time: 27958 ns


## Different Chips (No cache-sharing)

Consumer affined to  CPU 175
Producer affined to  CPU 1
    3277 iterations, avg time: 50786 ns
    3063 iterations, avg time: 50732 ns
    2831 iterations, avg time: 50737 ns
    2859 iterations, avg time: 50688 ns
    2849 iterations, avg time: 50722 ns

## Without affining them (Let Scheduler wake-them up appropriately)
Consumer affined to  CPU 0-175
Producer affined to  CPU 0-175
    4821 iterations, avg time: 19412 ns
    4863 iterations, avg time: 19435 ns
    4855 iterations, avg time: 19381 ns
    4811 iterations, avg time: 19458 ns
    4892 iterations, avg time: 19429 ns


--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
From: Michael Ellerman @ 2020-07-31  7:45 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200727053230.19753-7-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.
>
> While setting up the "CACHE" domain, check if shared_cache is already
> set.

PeterZ asked for some overview of what you're doing and why, you
responded to his mail, but I was expecting to see that text incorporated
here somewhere.

He also asked for some comments, which I would also like to see.


I'm also not clear why we want to rename it to "bigcore", that's not a
commonly understood term, I don't think it's clear to new readers what
it means.

Leaving it as the shared cache domain, and having a comment mentioning
that "bigcores" share a cache, would be clearer I think.

cheers


> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1 -> v2:
> 	Moved shared_cache topology fixup to fixup_topology (Gautham)
>
>  arch/powerpc/kernel/smp.c | 48 +++++++++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index d997c7411664..3c5ccf6d2b1c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
>  
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> +	smt_idx,
> +#endif
> +	bigcore_idx,
> +	die_idx,
> +};
> +
>  #define MAX_THREAD_LIST_SIZE	8
>  #define THREAD_GROUP_SHARE_L1   1
>  struct thread_groups {
> @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> -	if (shared_caches)
> -		return cpu_l2_cache_mask(cpu);
> -
> -	if (has_big_cores)
> -		return cpu_smallcore_mask(cpu);
> -
> -	return per_cpu(cpu_sibling_map, cpu);
> +	return per_cpu(cpu_l2_cache_map, cpu);
>  }
>  
>  #ifdef CONFIG_SCHED_SMT
> @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
>  
> +static const struct cpumask *cpu_bigcore_mask(int cpu)
> +{
> +	return per_cpu(cpu_sibling_map, cpu);
> +}
> +
>  static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> -	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> +	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
> @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>  	unsigned int cpu = smp_processor_id();
> -	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
>  
>  	mmgrab(&init_mm);
>  	current->active_mm = &init_mm;
> @@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
>  	/* Update topology CPU masks */
>  	add_cpu_to_masks(cpu);
>  
> -	if (has_big_cores)
> -		sibling_mask = cpu_smallcore_mask;
>  	/*
>  	 * Check for any shared caches. Note that this must be done on a
>  	 * per-core basis because one core in the pair might be disabled.
>  	 */
> -	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> -		shared_caches = true;
> +	if (!shared_caches) {
> +		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> +		struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> +		if (has_big_cores)
> +			sibling_mask = cpu_smallcore_mask;
> +
> +		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> +			shared_caches = true;
> +	}
>  
>  	set_numa_node(numa_cpu_lookup_table[cpu]);
>  	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1375,9 +1387,17 @@ static void fixup_topology(void)
>  #ifdef CONFIG_SCHED_SMT
>  	if (has_big_cores) {
>  		pr_info("Big cores detected but using small core scheduling\n");
> -		powerpc_topology[0].mask = smallcore_smt_mask;
> +		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>  	}
>  #endif
> +	if (shared_caches) {
> +		pr_info("Using shared cache scheduler topology\n");
> +		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> +		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> +#ifdef CONFIG_SCHED_DEBUG
> +		powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> +	}
>  }
>  
>  void __init smp_cpus_done(unsigned int max_cpus)
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
From: Michael Ellerman @ 2020-07-31  7:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200727053230.19753-8-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Add support for grouping cores based on the device-tree classification.
> - The last domain in the associativity domains always refers to the
> core.
> - If primary reference domain happens to be the penultimate domain in
> the associativity domains device-tree property, then there are no
> coregroups. However if its not a penultimate domain, then there are
> coregroups. There can be more than one coregroup. For now we would be
> interested in the last or the smallest coregroups.

This still doesn't tell me what a coregroup actually represents.

I get that it's a grouping of cores, and that the device tree specifies
it for us, but grouping based on what?

I think the answer is we aren't being told by firmware, it's just a
grouping based on some opaque performance characteristic and we just
have to take that as given.

But please explain that clearly in the change log and the code comments.

cheers


> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1 -> v2:
> 	Explained Coregroup in commit msg (Michael Ellerman)
>
>  arch/powerpc/include/asm/smp.h |  1 +
>  arch/powerpc/kernel/smp.c      |  1 +
>  arch/powerpc/mm/numa.c         | 34 +++++++++++++++++++++-------------
>  3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..5bdc17a7049f 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -28,6 +28,7 @@
>  extern int boot_cpuid;
>  extern int spinning_secondaries;
>  extern u32 *cpu_to_phys_id;
> +extern bool coregroup_enabled;
>  
>  extern void cpu_die(void);
>  extern int cpu_to_chip_id(int cpu);
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 3c5ccf6d2b1c..698000c7f76f 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
>  
>  struct task_struct *secondary_current;
>  bool has_big_cores;
> +bool coregroup_enabled;
>  
>  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 2298899a0f0a..51cb672f113b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  static void __init find_possible_nodes(void)
>  {
>  	struct device_node *rtas;
> -	u32 numnodes, i;
> +	const __be32 *domains;
> +	int prop_length, max_nodes;
> +	u32 i;
>  
>  	if (!numa_enabled)
>  		return;
> @@ -895,25 +897,31 @@ static void __init find_possible_nodes(void)
>  	if (!rtas)
>  		return;
>  
> -	if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
> -				min_common_depth, &numnodes)) {
> -		/*
> -		 * ibm,current-associativity-domains is a fairly recent
> -		 * property. If it doesn't exist, then fallback on
> -		 * ibm,max-associativity-domains. Current denotes what the
> -		 * platform can support compared to max which denotes what the
> -		 * Hypervisor can support.
> -		 */
> -		if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
> -				min_common_depth, &numnodes))
> +	/*
> +	 * ibm,current-associativity-domains is a fairly recent property. If
> +	 * it doesn't exist, then fallback on ibm,max-associativity-domains.
> +	 * Current denotes what the platform can support compared to max
> +	 * which denotes what the Hypervisor can support.
> +	 */
> +	domains = of_get_property(rtas, "ibm,current-associativity-domains",
> +					&prop_length);
> +	if (!domains) {
> +		domains = of_get_property(rtas, "ibm,max-associativity-domains",
> +					&prop_length);
> +		if (!domains)
>  			goto out;
>  	}
>  
> -	for (i = 0; i < numnodes; i++) {
> +	max_nodes = of_read_number(&domains[min_common_depth], 1);
> +	for (i = 0; i < max_nodes; i++) {
>  		if (!node_possible(i))
>  			node_set(i, node_possible_map);
>  	}
>  
> +	prop_length /= sizeof(int);
> +	if (prop_length > min_common_depth + 2)
> +		coregroup_enabled = 1;
> +
>  out:
>  	of_node_put(rtas);
>  }
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
From: Michael Ellerman @ 2020-07-31  7:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200727053230.19753-9-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> If allocated earlier and the search fails, then cpumask need to be
> freed. However cpu_l1_cache_map can be allocated after we search thread
> group.

It's not freed anywhere AFAICS?

And even after this change there's still an error path that doesn't free
it, isn't there?

cheers

> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/smp.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 698000c7f76f..dab96a1203ec 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
>  	if (err)
>  		goto out;
>  
> -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> -				GFP_KERNEL,
> -				cpu_to_node(cpu));
> -
>  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
>  
>  	if (unlikely(cpu_group_start == -1)) {
> @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
>  		goto out;
>  	}
>  
> +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> +				GFP_KERNEL, cpu_to_node(cpu));
> +
>  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
>  		int i_group_start = get_cpu_thread_group_start(i, &tg);
>  
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
From: Michael Ellerman @ 2020-07-31  8:02 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200727053230.19753-11-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Lookup the coregroup id from the associativity array.

It's slightly strange that this is called in patch 9, but only properly
implemented here in patch 10.

I'm not saying you have to squash them together, but it would be good if
the change log for patch 9 mentioned that a subsequent commit will
complete the implementation and how that affects the behaviour.

cheers

> If unable to detect the coregroup id, fallback on the core id.
> This way, ensure sched_domain degenerates and an extra sched domain is
> not created.
>
> Ideally this function should have been implemented in
> arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> don't need to find the primary domain again.
>
> If the device-tree mentions more than one coregroup, then kernel
> implements only the last or the smallest coregroup, which currently
> corresponds to the penultimate domain in the device-tree.
>
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by : Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1 -> v2:
> 	Move coregroup_enabled before getting associativity (Gautham)
>
>  arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0d57779e7942..8b3b3ec7fcc4 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
>  
>  int cpu_to_coregroup_id(int cpu)
>  {
> +	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> +	int index;
> +
> +	if (cpu < 0 || cpu > nr_cpu_ids)
> +		return -1;
> +
> +	if (!coregroup_enabled)
> +		goto out;
> +
> +	if (!firmware_has_feature(FW_FEATURE_VPHN))
> +		goto out;
> +
> +	if (vphn_get_associativity(cpu, associativity))
> +		goto out;
> +
> +	index = of_read_number(associativity, 1);
> +	if (index > min_common_depth + 1)
> +		return of_read_number(&associativity[index - 1], 1);
> +
> +out:
>  	return cpu_to_core_id(cpu);
>  }
>  
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall
From: Ram Pai @ 2020-07-31  8:10 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: ldufour, linux-doc, corbet, kvm-ppc, Julia Lawall, sathnaga,
	sukadev, linuxppc-dev, david
In-Reply-To: <20200731043334.GB20199@in.ibm.com>

On Fri, Jul 31, 2020 at 10:03:34AM +0530, Bharata B Rao wrote:
> On Thu, Jul 30, 2020 at 04:21:01PM -0700, Ram Pai wrote:
> > H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
> > way in which a page will be treated.  H_PAGE_IN_NONSHARED indicates
> > that the page will be shared with the Secure VM, and H_PAGE_IN_SHARED
> > indicates that the page will not be shared but its contents will
> > be copied.
> 
> Looks like you got the definitions of shared and non-shared interchanged.

Yes. Realized it after sending the patch. Will fix it.

> 
> > 
> > However H_PAGE_IN_NONSHARED is not defined in the header file, though
> > it is defined and documented in the API captured in
> > Documentation/powerpc/ultravisor.rst
> > 
> > Define H_PAGE_IN_NONSHARED in the header file.
> 
> What is the use of defining this? Is this used directly in any place?
> Or, are youp planning to introduce such a usage?

I know the Hypervisor code currently has no use for that define.
It assumes  H_PAGE_IN_NONSHARED to be !H_PAGE_IN_SHARED.

But H_PAGE_IN_NONSHARED is defined in the Ultravisor API, and hence has
to be captured in the header, to stay consistent.

RP

^ permalink raw reply

* [PATCH v3 0/4] VSX 32-byte vector paired load/store instructions
From: Balamuruhan S @ 2020-07-31  8:16 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, Balamuruhan S, paulus, sandipan, naveen.n.rao,
	linuxppc-dev

VSX vector paired instructions operates with octword (32-byte) operand
for loads and stores between storage and a pair of two sequential Vector-Scalar
Registers (VSRs). There are 4 word instructions and 2 prefixed instructions
that provides this 32-byte storage access operations - lxvp, lxvpx, stxvp,
stxvpx, plxvpx, pstxvpx.

Emulation infrastructure doesn't have support for these instructions, to
operate with 32-byte storage access and to operate with 2 VSX registers.
This patch series enables the instruction emulation support and adds test
cases for them respectively.

Changes in v3:
-------------
Worked on review comments and suggestions from Ravi and Naveen,

* Fix the do_vsx_load() to handle vsx instructions if MSR_FP/MSR_VEC
  cleared in exception conditions and it reaches to read/write to
  thread_struct member fp_state/vr_state respectively.
* Fix wrongly used `__vector128 v[2]` in struct vsx_reg as it should
  hold a single vsx register size.
* Remove unnecessary `VSX_CHECK_VEC` flag set and condition to check
  `VSX_LDLEFT` that is not applicable for these vsx instructions.
* Fix comments in emulate_vsx_load() that were misleading.
* Rebased on latest powerpc next branch.

Changes in v2:
-------------
* Fix suggestion from Sandipan, wrap ISA 3.1 instructions with
  cpu_has_feature(CPU_FTR_ARCH_31) check.
* Rebase on latest powerpc next branch.

Balamuruhan S (4):
  powerpc/sstep: support new VSX vector paired storage access
    instructions
  powerpc/sstep: support emulation for vsx vector paired storage access
    instructions
  powerpc ppc-opcode: add encoding macros for vsx vector paired
    instructions
  powerpc sstep: add testcases for vsx load/store instructions

 arch/powerpc/include/asm/ppc-opcode.h |  17 ++
 arch/powerpc/lib/sstep.c              | 122 +++++++++++--
 arch/powerpc/lib/test_emulate_step.c  | 252 ++++++++++++++++++++++++++
 3 files changed, 374 insertions(+), 17 deletions(-)

base-commit: 71d7bca373d5fa0ec977ca4814f49140621bd7ae
-- 
2.24.1

^ permalink raw reply

* [PATCH v3 1/4] powerpc/sstep: support new VSX vector paired storage access instructions
From: Balamuruhan S @ 2020-07-31  8:16 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, Balamuruhan S, paulus, sandipan, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <20200731081637.1837559-1-bala24@linux.ibm.com>

VSX Vector Paired instructions loads/stores an octword (32 bytes)
from/to storage into two sequential VSRs. Add `analyse_instr()` support
to these new instructions,
        * Load VSX Vector Paired (lxvp)
        * Load VSX Vector Paired Indexed (lxvpx)
        * Prefixed Load VSX Vector Paired (plxvp)
        * Store VSX Vector Paired (stxvp)
        * Store VSX Vector Paired Indexed (stxvpx)
        * Prefixed Store VSX Vector Paired (pstxvp)

Suggested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
---
 arch/powerpc/lib/sstep.c | 45 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index c58ea9e787cb..22147257d74d 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -32,6 +32,10 @@ extern char system_call_vectored_emulate[];
 #define XER_OV32	0x00080000U
 #define XER_CA32	0x00040000U
 
+#ifdef CONFIG_VSX
+#define VSX_REGISTER_XTP(rd)   ((((rd) & 1) << 5) | ((rd) & 0xfe))
+#endif
+
 #ifdef CONFIG_PPC_FPU
 /*
  * Functions in ldstfp.S
@@ -2386,6 +2390,14 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 			op->vsx_flags = VSX_SPLAT;
 			break;
 
+		case 333:       /* lxvpx */
+			if (!cpu_has_feature(CPU_FTR_ARCH_31))
+				return -1;
+			op->reg = VSX_REGISTER_XTP(rd);
+			op->type = MKOP(LOAD_VSX, 0, 32);
+			op->element_size = 32;
+			break;
+
 		case 364:	/* lxvwsx */
 			op->reg = rd | ((word & 1) << 5);
 			op->type = MKOP(LOAD_VSX, 0, 4);
@@ -2414,6 +2426,13 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 				VSX_CHECK_VEC;
 			break;
 		}
+		case 461:       /* stxvpx */
+			if (!cpu_has_feature(CPU_FTR_ARCH_31))
+				return -1;
+			op->reg = VSX_REGISTER_XTP(rd);
+			op->type = MKOP(STORE_VSX, 0, 32);
+			op->element_size = 32;
+			break;
 		case 524:	/* lxsspx */
 			op->reg = rd | ((word & 1) << 5);
 			op->type = MKOP(LOAD_VSX, 0, 4);
@@ -2655,6 +2674,22 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 #endif
 
 #ifdef CONFIG_VSX
+	case 6:
+		if (!cpu_has_feature(CPU_FTR_ARCH_31))
+			return -1;
+		op->ea = dqform_ea(word, regs);
+		op->reg = VSX_REGISTER_XTP(rd);
+		op->element_size = 32;
+		switch (word & 0xf) {
+		case 0:         /* lxvp */
+			op->type = MKOP(LOAD_VSX, 0, 32);
+			break;
+		case 1:         /* stxvp */
+			op->type = MKOP(STORE_VSX, 0, 32);
+			break;
+		}
+		break;
+
 	case 61:	/* stfdp, lxv, stxsd, stxssp, stxv */
 		switch (word & 7) {
 		case 0:		/* stfdp with LSB of DS field = 0 */
@@ -2783,12 +2818,22 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 			case 57:	/* pld */
 				op->type = MKOP(LOAD, PREFIXED, 8);
 				break;
+			case 58:        /* plxvp */
+				op->reg = VSX_REGISTER_XTP(rd);
+				op->type = MKOP(LOAD_VSX, PREFIXED, 32);
+				op->element_size = 32;
+				break;
 			case 60:        /* stq */
 				op->type = MKOP(STORE, PREFIXED, 16);
 				break;
 			case 61:	/* pstd */
 				op->type = MKOP(STORE, PREFIXED, 8);
 				break;
+			case 62:        /* pstxvp */
+				op->reg = VSX_REGISTER_XTP(rd);
+				op->type = MKOP(STORE_VSX, PREFIXED, 32);
+				op->element_size = 32;
+				break;
 			}
 			break;
 		case 1: /* Type 01 Eight-Byte Register-to-Register */
-- 
2.24.1


^ permalink raw reply related

* [PATCH v3 2/4] powerpc/sstep: support emulation for vsx vector paired storage access instructions
From: Balamuruhan S @ 2020-07-31  8:16 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, Balamuruhan S, paulus, sandipan, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <20200731081637.1837559-1-bala24@linux.ibm.com>

add emulate_step() changes to support vsx vector paired storage
access instructions that provides octword operands loads/stores
between storage and set of 64 Vector Scalar Registers (VSRs).

Suggested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Suggested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
---
 arch/powerpc/lib/sstep.c | 77 +++++++++++++++++++++++++++++++---------
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 22147257d74d..01e1a3adc406 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -280,6 +280,19 @@ static nokprobe_inline void do_byte_reverse(void *ptr, int nb)
 		up[1] = tmp;
 		break;
 	}
+	case 32: {
+		unsigned long *up = (unsigned long *)ptr;
+		unsigned long tmp;
+
+		tmp = byterev_8(up[0]);
+		up[0] = byterev_8(up[3]);
+		up[3] = tmp;
+		tmp = byterev_8(up[2]);
+		up[2] = byterev_8(up[1]);
+		up[1] = tmp;
+		break;
+	}
+
 #endif
 	default:
 		WARN_ON_ONCE(1);
@@ -710,6 +723,8 @@ void emulate_vsx_load(struct instruction_op *op, union vsx_reg *reg,
 	reg->d[0] = reg->d[1] = 0;
 
 	switch (op->element_size) {
+	case 32:
+		/* [p]lxvp[x] */
 	case 16:
 		/* whole vector; lxv[x] or lxvl[l] */
 		if (size == 0)
@@ -718,7 +733,7 @@ void emulate_vsx_load(struct instruction_op *op, union vsx_reg *reg,
 		if (IS_LE && (op->vsx_flags & VSX_LDLEFT))
 			rev = !rev;
 		if (rev)
-			do_byte_reverse(reg, 16);
+			do_byte_reverse(reg, size);
 		break;
 	case 8:
 		/* scalar loads, lxvd2x, lxvdsx */
@@ -794,6 +809,20 @@ void emulate_vsx_store(struct instruction_op *op, const union vsx_reg *reg,
 	size = GETSIZE(op->type);
 
 	switch (op->element_size) {
+	case 32:
+		/* [p]stxvp[x] */
+		if (size == 0)
+			break;
+		if (rev) {
+			/* reverse 32 bytes */
+			buf.d[0] = byterev_8(reg->d[3]);
+			buf.d[1] = byterev_8(reg->d[2]);
+			buf.d[2] = byterev_8(reg->d[1]);
+			buf.d[3] = byterev_8(reg->d[0]);
+			reg = &buf;
+		}
+		memcpy(mem, reg, size);
+		break;
 	case 16:
 		/* stxv, stxvx, stxvl, stxvll */
 		if (size == 0)
@@ -862,28 +891,35 @@ static nokprobe_inline int do_vsx_load(struct instruction_op *op,
 				       bool cross_endian)
 {
 	int reg = op->reg;
-	u8 mem[16];
-	union vsx_reg buf;
+	int i, nr_vsx_regs;
+	u8 mem[32];
+	union vsx_reg buf[2];
 	int size = GETSIZE(op->type);
 
 	if (!address_ok(regs, ea, size) || copy_mem_in(mem, ea, size, regs))
 		return -EFAULT;
 
-	emulate_vsx_load(op, &buf, mem, cross_endian);
+	nr_vsx_regs = size / sizeof(__vector128);
+	emulate_vsx_load(op, buf, mem, cross_endian);
 	preempt_disable();
 	if (reg < 32) {
 		/* FP regs + extensions */
 		if (regs->msr & MSR_FP) {
-			load_vsrn(reg, &buf);
+			for (i = 0; i < nr_vsx_regs; i++)
+				load_vsrn(reg + i, &buf[i].v);
 		} else {
-			current->thread.fp_state.fpr[reg][0] = buf.d[0];
-			current->thread.fp_state.fpr[reg][1] = buf.d[1];
+			for (i = 0; i < nr_vsx_regs; i++) {
+				current->thread.fp_state.fpr[reg + i][0] = buf[i].d[0];
+				current->thread.fp_state.fpr[reg + i][1] = buf[i].d[1];
+			}
 		}
 	} else {
 		if (regs->msr & MSR_VEC)
-			load_vsrn(reg, &buf);
+			for (i = 0; i < nr_vsx_regs; i++)
+				load_vsrn(reg + i, &buf[i].v);
 		else
-			current->thread.vr_state.vr[reg - 32] = buf.v;
+			for (i = 0; i < nr_vsx_regs; i++)
+				current->thread.vr_state.vr[reg - 32 + i] = buf[i].v;
 	}
 	preempt_enable();
 	return 0;
@@ -894,30 +930,37 @@ static nokprobe_inline int do_vsx_store(struct instruction_op *op,
 					bool cross_endian)
 {
 	int reg = op->reg;
-	u8 mem[16];
-	union vsx_reg buf;
+	int i, nr_vsx_regs;
+	u8 mem[32];
+	union vsx_reg buf[2];
 	int size = GETSIZE(op->type);
 
 	if (!address_ok(regs, ea, size))
 		return -EFAULT;
 
+	nr_vsx_regs = size / sizeof(__vector128);
 	preempt_disable();
 	if (reg < 32) {
 		/* FP regs + extensions */
 		if (regs->msr & MSR_FP) {
-			store_vsrn(reg, &buf);
+			for (i = 0; i < nr_vsx_regs; i++)
+				store_vsrn(reg + i, &buf[i].v);
 		} else {
-			buf.d[0] = current->thread.fp_state.fpr[reg][0];
-			buf.d[1] = current->thread.fp_state.fpr[reg][1];
+			for (i = 0; i < nr_vsx_regs; i++) {
+				buf[i].d[0] = current->thread.fp_state.fpr[reg + i][0];
+				buf[i].d[1] = current->thread.fp_state.fpr[reg + i][1];
+			}
 		}
 	} else {
 		if (regs->msr & MSR_VEC)
-			store_vsrn(reg, &buf);
+			for (i = 0; i < nr_vsx_regs; i++)
+				store_vsrn(reg + i, &buf[i].v);
 		else
-			buf.v = current->thread.vr_state.vr[reg - 32];
+			for (i = 0; i < nr_vsx_regs; i++)
+				buf[i].v = current->thread.vr_state.vr[reg - 32 + i];
 	}
 	preempt_enable();
-	emulate_vsx_store(op, &buf, mem, cross_endian);
+	emulate_vsx_store(op, buf, mem, cross_endian);
 	return  copy_mem_out(mem, ea, size, regs);
 }
 #endif /* CONFIG_VSX */
-- 
2.24.1


^ permalink raw reply related

* [PATCH v3 3/4] powerpc ppc-opcode: add encoding macros for vsx vector paired instructions
From: Balamuruhan S @ 2020-07-31  8:16 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, Balamuruhan S, paulus, sandipan, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <20200731081637.1837559-1-bala24@linux.ibm.com>

add instruction encoding, extended opcodes, regs and DQ immediate macro
for new vsx vector paired instructions,
        * Load VSX Vector Paired (lxvp)
        * Load VSX Vector Paired Indexed (lxvpx)
	* Prefixed Load VSX Vector Paired (plxvp)
        * Store VSX Vector Paired (stxvp)
        * Store VSX Vector Paired Indexed (stxvpx)
	* Prefixed Store VSX Vector Paired (pstxvp)

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
---
 arch/powerpc/include/asm/ppc-opcode.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 4c0bdafb6a7b..6ad23f47d06a 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -78,6 +78,7 @@
 
 #define IMM_L(i)               ((uintptr_t)(i) & 0xffff)
 #define IMM_DS(i)              ((uintptr_t)(i) & 0xfffc)
+#define IMM_DQ(i)		(((uintptr_t)(i) & 0xfff) << 4)
 
 /*
  * 16-bit immediate helper macros: HA() is for use with sign-extending instrs
@@ -272,6 +273,8 @@
 #define PPC_INST_STFD			0xd8000000
 #define PPC_PREFIX_MLS			0x06000000
 #define PPC_PREFIX_8LS			0x04000000
+#define PPC_PLXVP_EX_OP			0xe8000000
+#define PPC_PSTXVP_EX_OP		0xf8000000
 
 /* Prefixed instructions */
 #define PPC_INST_PLD			0xe4000000
@@ -296,6 +299,8 @@
 #define __PPC_XS(s)	((((s) & 0x1f) << 21) | (((s) & 0x20) >> 5))
 #define __PPC_XT(s)	__PPC_XS(s)
 #define __PPC_T_TLB(t)	(((t) & 0x3) << 21)
+#define __PPC_TP(tp)	(((tp) & 0xf) << 22)
+#define __PPC_TX(tx)	(((tx) & 0x1) << 21)
 #define __PPC_WC(w)	(((w) & 0x3) << 21)
 #define __PPC_WS(w)	(((w) & 0x1f) << 11)
 #define __PPC_SH(s)	__PPC_WS(s)
@@ -387,6 +392,18 @@
 #define PPC_RAW_STXVD2X(s, a, b)	(0x7c000798 | VSX_XX1((s), a, b))
 #define PPC_RAW_LXVD2X(s, a, b)		(0x7c000698 | VSX_XX1((s), a, b))
 #define PPC_RAW_MFVRD(a, t)		(0x7c000066 | VSX_XX1((t) + 32, a, R0))
+#define PPC_LXVP(tp, tx, a, i) \
+	(0x18000000 | __PPC_TP(tp) | __PPC_TX(tx) | ___PPC_RA(a) | IMM_DQ(i))
+#define PPC_STXVP(sp, sx, a, i) \
+	(0x18000001 | __PPC_TP(sp) | __PPC_TX(sx) | ___PPC_RA(a) | IMM_DQ(i) | 0x1)
+#define PPC_LXVPX(tp, tx, a, b) \
+	(0x7c00029a | __PPC_TP(tp) | __PPC_TX(tx) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_STXVPX(sp, sx, a, b) \
+	(0x7c00039a | __PPC_TP(sp) | __PPC_TX(sx) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_PLXVP(a, i, pr, tp, tx) \
+	((PPC_PREFIX_8LS | __PPC_PRFX_R(pr) | IMM_H(i)) << 32 | (PPC_PLXVP_EX_OP | __PPC_TP(tp) | __PPC_TX(tx) | ___PPC_RA(a) | IMM_L(i)))
+#define PPC_PSTXVP(a, i, pr, sp, sx) \
+	((PPC_PREFIX_8LS | __PPC_PRFX_R(pr) | IMM_H(i)) << 32 | (PPC_PSTXVP_EX_OP | __PPC_TP(sp) | __PPC_TX(sx) | ___PPC_RA(a) | IMM_L(i)))
 #define PPC_RAW_MTVRD(t, a)		(0x7c000166 | VSX_XX1((t) + 32, a, R0))
 #define PPC_RAW_VPMSUMW(t, a, b)	(0x10000488 | VSX_XX3((t), a, b))
 #define PPC_RAW_VPMSUMD(t, a, b)	(0x100004c8 | VSX_XX3((t), a, b))
-- 
2.24.1


^ permalink raw reply related

* [PATCH v3 4/4] powerpc sstep: add testcases for vsx load/store instructions
From: Balamuruhan S @ 2020-07-31  8:16 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, Balamuruhan S, paulus, sandipan, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <20200731081637.1837559-1-bala24@linux.ibm.com>

add testcases for vsx load/store vector paired instructions,
        * Load VSX Vector Paired (lxvp)
        * Load VSX Vector Paired Indexed (lxvpx)
        * Prefixed Load VSX Vector Paired (plxvp)
        * Store VSX Vector Paired (stxvp)
        * Store VSX Vector Paired Indexed (stxvpx)
        * Prefixed Store VSX Vector Paired (pstxvp)

Suggested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
---
 arch/powerpc/lib/test_emulate_step.c | 252 +++++++++++++++++++++++++++
 1 file changed, 252 insertions(+)

diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
index d242e9f72e0c..f16934b80511 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -612,6 +612,255 @@ static void __init test_lxvd2x_stxvd2x(void)
 }
 #endif /* CONFIG_VSX */
 
+#ifdef CONFIG_VSX
+static void __init test_lxvp_stxvp(void)
+{
+	struct pt_regs regs;
+	union {
+		vector128 a;
+		u32 b[4];
+	} c[2];
+	u32 cached_b[8];
+	int stepped = -1;
+
+	init_pt_regs(&regs);
+
+	/*** lxvp ***/
+
+	cached_b[0] = c[0].b[0] = 18233;
+	cached_b[1] = c[0].b[1] = 34863571;
+	cached_b[2] = c[0].b[2] = 834;
+	cached_b[3] = c[0].b[3] = 6138911;
+	cached_b[4] = c[1].b[0] = 1234;
+	cached_b[5] = c[1].b[1] = 5678;
+	cached_b[6] = c[1].b[2] = 91011;
+	cached_b[7] = c[1].b[3] = 121314;
+
+	regs.gpr[4] = (unsigned long)&c[0].a;
+
+	/*
+	 * lxvp XTp,DQ(RA)
+	 * XTp = 32×TX + 2×Tp
+	 * let TX=1 Tp=1 RA=4 DQ=0
+	 */
+	stepped = emulate_step(&regs, ppc_inst(PPC_LXVP(1, 1, 4, 0)));
+
+	if (stepped == 1 && cpu_has_feature(CPU_FTR_VSX)) {
+		show_result("lxvp", "PASS");
+	} else {
+		if (!cpu_has_feature(CPU_FTR_VSX))
+			show_result("lxvp", "PASS (!CPU_FTR_VSX)");
+		else
+			show_result("lxvp", "FAIL");
+	}
+
+	/*** stxvp ***/
+
+	c[0].b[0] = 21379463;
+	c[0].b[1] = 87;
+	c[0].b[2] = 374234;
+	c[0].b[3] = 4;
+	c[1].b[0] = 90;
+	c[1].b[1] = 122;
+	c[1].b[2] = 555;
+	c[1].b[3] = 32144;
+
+	/*
+	 * stxvp XSp,DQ(RA)
+	 * XSp = 32×SX + 2×Sp
+	 * let SX=1 Sp=1 RA=4 DQ=0
+	 */
+	stepped = emulate_step(&regs, ppc_inst(PPC_STXVP(1, 1, 4, 0)));
+
+	if (stepped == 1 && cached_b[0] == c[0].b[0] && cached_b[1] == c[0].b[1] &&
+	    cached_b[2] == c[0].b[2] && cached_b[3] == c[0].b[3] &&
+	    cached_b[4] == c[1].b[0] && cached_b[5] == c[1].b[1] &&
+	    cached_b[6] == c[1].b[2] && cached_b[7] == c[1].b[3] &&
+	    cpu_has_feature(CPU_FTR_VSX)) {
+		show_result("stxvp", "PASS");
+	} else {
+		if (!cpu_has_feature(CPU_FTR_VSX))
+			show_result("stxvp", "PASS (!CPU_FTR_VSX)");
+		else
+			show_result("stxvp", "FAIL");
+	}
+}
+#else
+static void __init test_lxvp_stxvp(void)
+{
+	show_result("lxvp", "SKIP (CONFIG_VSX is not set)");
+	show_result("stxvp", "SKIP (CONFIG_VSX is not set)");
+}
+#endif /* CONFIG_VSX */
+
+#ifdef CONFIG_VSX
+static void __init test_lxvpx_stxvpx(void)
+{
+	struct pt_regs regs;
+	union {
+		vector128 a;
+		u32 b[4];
+	} c[2];
+	u32 cached_b[8];
+	int stepped = -1;
+
+	init_pt_regs(&regs);
+
+	/*** lxvpx ***/
+
+	cached_b[0] = c[0].b[0] = 18233;
+	cached_b[1] = c[0].b[1] = 34863571;
+	cached_b[2] = c[0].b[2] = 834;
+	cached_b[3] = c[0].b[3] = 6138911;
+	cached_b[4] = c[1].b[0] = 1234;
+	cached_b[5] = c[1].b[1] = 5678;
+	cached_b[6] = c[1].b[2] = 91011;
+	cached_b[7] = c[1].b[3] = 121314;
+
+	regs.gpr[3] = (unsigned long)&c[0].a;
+	regs.gpr[4] = 0;
+
+	/*
+	 * lxvpx XTp,RA,RB
+	 * XTp = 32×TX + 2×Tp
+	 * let TX=1 Tp=1 RA=3 RB=4
+	 */
+	stepped = emulate_step(&regs, ppc_inst(PPC_LXVPX(1, 1, 3, 4)));
+
+	if (stepped == 1 && cpu_has_feature(CPU_FTR_VSX)) {
+		show_result("lxvpx", "PASS");
+	} else {
+		if (!cpu_has_feature(CPU_FTR_VSX))
+			show_result("lxvpx", "PASS (!CPU_FTR_VSX)");
+		else
+			show_result("lxvpx", "FAIL");
+	}
+
+	/*** stxvpx ***/
+
+	c[0].b[0] = 21379463;
+	c[0].b[1] = 87;
+	c[0].b[2] = 374234;
+	c[0].b[3] = 4;
+	c[1].b[0] = 90;
+	c[1].b[1] = 122;
+	c[1].b[2] = 555;
+	c[1].b[3] = 32144;
+
+	/*
+	 * stxvpx XSp,RA,RB
+	 * XSp = 32×SX + 2×Sp
+	 * let SX=1 Sp=1 RA=3 RB=4
+	 */
+	stepped = emulate_step(&regs, ppc_inst(PPC_STXVPX(1, 1, 3, 4)));
+
+	if (stepped == 1 && cached_b[0] == c[0].b[0] && cached_b[1] == c[0].b[1] &&
+	    cached_b[2] == c[0].b[2] && cached_b[3] == c[0].b[3] &&
+	    cached_b[4] == c[1].b[0] && cached_b[5] == c[1].b[1] &&
+	    cached_b[6] == c[1].b[2] && cached_b[7] == c[1].b[3] &&
+	    cpu_has_feature(CPU_FTR_VSX)) {
+		show_result("stxvpx", "PASS");
+	} else {
+		if (!cpu_has_feature(CPU_FTR_VSX))
+			show_result("stxvpx", "PASS (!CPU_FTR_VSX)");
+		else
+			show_result("stxvpx", "FAIL");
+	}
+}
+#else
+static void __init test_lxvpx_stxvpx(void)
+{
+	show_result("lxvpx", "SKIP (CONFIG_VSX is not set)");
+	show_result("stxvpx", "SKIP (CONFIG_VSX is not set)");
+}
+#endif /* CONFIG_VSX */
+
+#ifdef CONFIG_VSX
+static void __init test_plxvp_pstxvp(void)
+{
+	struct ppc_inst instr;
+	struct pt_regs regs;
+	union {
+		vector128 a;
+		u32 b[4];
+	} c[2];
+	u32 cached_b[8];
+	int stepped = -1;
+
+	/*
+	 * plxvp XTp,D(RA),R
+	 * XSp = 32×SX + 2×Sp
+	 * let RA=3 R=0 D=d0||d1=0 R=0 Sp=1 SX=1
+	 */
+	instr = ppc_inst_prefix(PPC_PLXVP(3, 0, 0, 1, 1) >> 32,
+			PPC_PLXVP(3, 0, 0, 1, 1) & 0xffffffff);
+
+	/*** plxvpx ***/
+
+	cached_b[0] = c[0].b[0] = 18233;
+	cached_b[1] = c[0].b[1] = 34863571;
+	cached_b[2] = c[0].b[2] = 834;
+	cached_b[3] = c[0].b[3] = 6138911;
+	cached_b[4] = c[1].b[0] = 1234;
+	cached_b[5] = c[1].b[1] = 5678;
+	cached_b[6] = c[1].b[2] = 91011;
+	cached_b[7] = c[1].b[3] = 121314;
+
+	init_pt_regs(&regs);
+	regs.gpr[3] = (unsigned long)&c[0].a;
+
+	stepped = emulate_step(&regs, instr);
+	if (stepped == 1 && cpu_has_feature(CPU_FTR_VSX)) {
+		show_result("plxvpx", "PASS");
+	} else {
+		if (!cpu_has_feature(CPU_FTR_VSX))
+			show_result("plxvpx", "PASS (!CPU_FTR_VSX)");
+		else
+			show_result("plxvpx", "FAIL");
+	}
+
+	/*** pstxvpx ***/
+
+	c[0].b[0] = 21379463;
+	c[0].b[1] = 87;
+	c[0].b[2] = 374234;
+	c[0].b[3] = 4;
+	c[1].b[0] = 90;
+	c[1].b[1] = 122;
+	c[1].b[2] = 555;
+	c[1].b[3] = 32144;
+
+	/*
+	 * pstxvpx XTp,D(RA),R
+	 * XSp = 32×SX + 2×Sp
+	 * let RA=3 D=d0||d1=0 R=0 Sp=1 SX=1
+	 */
+	instr = ppc_inst_prefix(PPC_PSTXVP(3, 0, 0, 1, 1) >> 32,
+			PPC_PSTXVP(3, 0, 0, 1, 1) & 0xffffffff);
+
+	stepped = emulate_step(&regs, instr);
+
+	if (stepped == 1 && cached_b[0] == c[0].b[0] && cached_b[1] == c[0].b[1] &&
+	    cached_b[2] == c[0].b[2] && cached_b[3] == c[0].b[3] &&
+	    cached_b[4] == c[1].b[0] && cached_b[5] == c[1].b[1] &&
+	    cached_b[6] == c[1].b[2] && cached_b[7] == c[1].b[3] &&
+	    cpu_has_feature(CPU_FTR_VSX)) {
+		show_result("pstxvpx", "PASS");
+	} else {
+		if (!cpu_has_feature(CPU_FTR_VSX))
+			show_result("pstxvpx", "PASS (!CPU_FTR_VSX)");
+		else
+			show_result("pstxvpx", "FAIL");
+	}
+}
+#else
+static void __init test_plxvp_pstxvp(void)
+{
+	show_result("plxvpx", "SKIP (CONFIG_VSX is not set)");
+	show_result("pstxvpx", "SKIP (CONFIG_VSX is not set)");
+}
+#endif /* CONFIG_VSX */
+
 static void __init run_tests_load_store(void)
 {
 	test_ld();
@@ -628,6 +877,9 @@ static void __init run_tests_load_store(void)
 	test_plfd_pstfd();
 	test_lvx_stvx();
 	test_lxvd2x_stxvd2x();
+	test_lxvp_stxvp();
+	test_lxvpx_stxvpx();
+	test_plxvp_pstxvp();
 }
 
 struct compute_test {
-- 
2.24.1


^ permalink raw reply related

* Re: [PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()
From: Ram Pai @ 2020-07-31  8:37 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <20200731042940.GA20199@in.ibm.com>

On Fri, Jul 31, 2020 at 09:59:40AM +0530, Bharata B Rao wrote:
> On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> > Observed the following oops while stress-testing, using multiple
> > secureVM on a distro kernel. However this issue theoritically exists in
> > 5.5 kernel and later.
> > 
> > This issue occurs when the total number of requested device-PFNs exceed
> > the total-number of available device-PFNs.  PFN migration fails to
> > allocate a device-pfn, which causes migrate_vma_finalize() to trigger
> > kvmppc_uvmem_page_free() on a page, that is not associated with any
> > device-pfn.  kvmppc_uvmem_page_free() blindly tries to access the
> > contents of the private data which can be null, leading to the following
> > kernel fault.
> > 
> >  --------------------------------------------------------------------------
> >  Unable to handle kernel paging request for data at address 0x00000011
> >  Faulting instruction address: 0xc00800000e36e110
> >  Oops: Kernel access of bad area, sig: 11 [#1]
> >  LE SMP NR_CPUS=2048 NUMA PowerNV
> > ....
> >  MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
> > 		 CR: 24424822  XER: 00000000
> >  CFAR: c000000000e3d764 DAR: 0000000000000011 DSISR: 40000000 IRQMASK: 0
> >  GPR00: c00800000e36e0a4 c000001f1d59f610 c00800000e38a400 0000000000000000
> >  GPR04: c000001fa5000000 fffffffffffffffe ffffffffffffffff c000201fffeaf300
> >  GPR08: 00000000000001f0 0000000000000000 0000000000000f80 c00800000e373608
> >  GPR12: c000000000e3d710 c000201fffeaf300 0000000000000001 00007fef87360000
> >  GPR16: 00007fff97db4410 c000201c3b66a578 ffffffffffffffff 0000000000000000
> >  GPR20: 0000000119db9ad0 000000000000000a fffffffffffffffc 0000000000000001
> >  GPR24: c000201c3b660000 c000001f1d59f7a0 c0000000004cffb0 0000000000000001
> >  GPR28: 0000000000000000 c00a001ff003e000 c00800000e386150 0000000000000f80
> >  NIP [c00800000e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
> >  LR [c00800000e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
> >  Call Trace:
> >  [c000000000512010] free_devmap_managed_page+0xd0/0x100
> >  [c0000000003f71d0] put_devmap_managed_page+0xa0/0xc0
> >  [c0000000004d24bc] migrate_vma_finalize+0x32c/0x410
> >  [c00800000e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
> >  [c00800000e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
> >  [c00800000e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
> >  [c00800000e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
> >  [c00800000e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
> >  [c00800000e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
> >  [c00800000e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
> >  [c00800000e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
> >  [c0000000005451d0] do_vfs_ioctl+0xe0/0xaa0
> >  [c000000000545d64] sys_ioctl+0xc4/0x160
> >  [c00000000000b408] system_call+0x5c/0x70
> >  Instruction dump:
> >  a12d1174 2f890000 409e0158 a1271172 3929ffff b1271172 7c2004ac 39200000
> >  913e0140 39200000 e87d0010 f93d0010 <89230011> e8c30000 e9030008 2f890000
> >  --------------------------------------------------------------------------
> > 
> >  Fix the oops..
> > 
> > fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
> > Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv_uvmem.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > index 2806983..f4002bf 100644
> > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > @@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
> >  {
> >  	unsigned long pfn = page_to_pfn(page) -
> >  			(kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
> > -	struct kvmppc_uvmem_page_pvt *pvt;
> > +	struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
> > +
> > +	if (!pvt)
> > +		return;
> >  
> >  	spin_lock(&kvmppc_uvmem_bitmap_lock);
> >  	bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
> >  	spin_unlock(&kvmppc_uvmem_bitmap_lock);
> >  
> > -	pvt = page->zone_device_data;
> >  	page->zone_device_data = NULL;
> >  	if (pvt->remove_gfn)
> >  		kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
> 
> In our case, device pages that are in use are always associated with a valid
> pvt member. See kvmppc_uvmem_get_page() which returns failure if it
> runs out of device pfns and that will result in proper failure of
> page-in calls.

looked at the code, and yes that code path looks correct. So my
reasoning behind the root cause of this bug is incorrect. However the
bug is surfacing and there must be a reason.

> 
> For the case where we run out of device pfns, migrate_vma_finalize() will
> restore the original PTE and will not replace the PTE with device private PTE.
> 
> Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
> called for non-device-private pages.

Yes. it should not be called. But as seen above in the stack trace, it is called. 

What would cause the HMM to call ->page_free() on a page that is not
associated with that device's pfn?

> 
> This could be a use-after-free case possibly arising out of the new state
> changes in HV. If so, this fix will only mask the bug and not address the
> original problem.

I can verify by rerunning the tests, without the new state changes. But
I do not see how those changes can cause this fault?

This could also be caused by a duplicate ->page_free() call due to some
bug in the migrate_page path? Could there be a race between
migrate_page() and a page_fault ?


Regardless, kvmppc_uvmem_page_free() needs to be fixed. It should not
access contents of pvt, without verifing pvt is valid.

> 
> Regards,
> Bharata.

-- 
Ram Pai

^ permalink raw reply

* Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
From: Srikar Dronamraju @ 2020-07-31  9:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <8736585djw.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:49:55]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Add support for grouping cores based on the device-tree classification.
> > - The last domain in the associativity domains always refers to the
> > core.
> > - If primary reference domain happens to be the penultimate domain in
> > the associativity domains device-tree property, then there are no
> > coregroups. However if its not a penultimate domain, then there are
> > coregroups. There can be more than one coregroup. For now we would be
> > interested in the last or the smallest coregroups.
> 
> This still doesn't tell me what a coregroup actually represents.
> 
> I get that it's a grouping of cores, and that the device tree specifies
> it for us, but grouping based on what?
> 

We have just abstracted the fact that we are creating a sub-group of cores
within a DIE. We are limiting to one sub-group per core. However this would
allow the firmware the flexibility to vary the grouping. Once the firmware
starts using this group, we could add more code to detect the type of
grouping and adjust the sd domain flags accordingly.

> I think the answer is we aren't being told by firmware, it's just a
> grouping based on some opaque performance characteristic and we just
> have to take that as given.
> 

This is partially true. At this time, we dont have firmwares that can
exploit this code. Once the firmwares start using this grouping, we could
add more code to align the grouping to the scheduler topology.

> But please explain that clearly in the change log and the code comments.
> 

Okay, I will do the needful.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
From: Srikar Dronamraju @ 2020-07-31  9:29 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <875za45dr2.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:45:37]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> >
> > While setting up the "CACHE" domain, check if shared_cache is already
> > set.
> 
> PeterZ asked for some overview of what you're doing and why, you
> responded to his mail, but I was expecting to see that text incorporated
> here somewhere.
> 

Okay, do you want that as part of the code or documentation dir or the
changelog?

> He also asked for some comments, which I would also like to see.
> 
> 
> I'm also not clear why we want to rename it to "bigcore", that's not a
> commonly understood term, I don't think it's clear to new readers what
> it means.
> 
> Leaving it as the shared cache domain, and having a comment mentioning
> that "bigcores" share a cache, would be clearer I think.
> 

Today, Shared cache is equal to Big Core. However in not too distant future,
Shared cache domain and Big Core may not be the same. For example lets
assume that L2 cache were to Shrink per small core with the firmware
exposing the core as a bigcore. Then with the current design, we have a SMT
== SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
thread level. Keeping the Bigcore as a domain and mapping the shared
cache, (I am resetting the domain name as CACHE if BIGCORE==SHARED_CACHE),
helps us in this scenario.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
From: Srikar Dronamraju @ 2020-07-31  9:49 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <87zh7g3yvk.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:52:15]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > If allocated earlier and the search fails, then cpumask need to be
> > freed. However cpu_l1_cache_map can be allocated after we search thread
> > group.
> 
> It's not freed anywhere AFAICS?
> 

Yes, its never freed. Infact we are never checking if
zalloc_cpumask_var_node fails. Its not just this cpumask, but historically
all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
freed/checked. I did dig into this a bit and it appears that ..
(Please do correct me if I am wrong!! )

Powerpc using cpumask_var_t for all of the percpu variables. And it dont seem
to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

So from include/linux/cpumask.h

typedef struct cpumask cpumask_var_t[1];
and
zalloc_cpumask_var_node ends up being cpumask_clear

So I think we are historically we seem to assume we are always
!CPUMASK_OFFSTACK and hence we dont need to check for return as well as
free..

I would look forward to your comments on how we should handle this going
forward. But I would keep this the same for this patchset.

One of the questions that I have is if we most likely are to be in
!CONFIG_CPUMASK_OFFSTACK, then should be migrate to cpumask_t for percpu
variables. 
 
The reason being we end up using NR_CPU cpumask for each percpu cpumask
variable instead of using NR_CPU cpumask_t pointer.

> And even after this change there's still an error path that doesn't free
> it, isn't there?
> 
> cheers
> 
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Anton Blanchard <anton@ozlabs.org>
> > Cc: Oliver O'Halloran <oohall@gmail.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@neuling.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Jordan Niethe <jniethe5@gmail.com>
> > Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 698000c7f76f..dab96a1203ec 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
> >  	if (err)
> >  		goto out;
> >  
> > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -				GFP_KERNEL,
> > -				cpu_to_node(cpu));
> > -
> >  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> >  
> >  	if (unlikely(cpu_group_start == -1)) {
> > @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		goto out;
> >  	}
> >  
> > +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > +				GFP_KERNEL, cpu_to_node(cpu));
> > +
> >  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> >  		int i_group_start = get_cpu_thread_group_start(i, &tg);
> >  
> > -- 
> > 2.17.1

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox