* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
From: Gautham R Shenoy @ 2020-07-31 7:36 UTC
To: Srikar Dronamraju
Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
LKML, Nicholas Piggin, Ingo Molnar, Oliver O'Halloran,
Jordan Niethe, linuxppc-dev, Valentin Schneider
In-Reply-To: <20200729061355.GA14603@linux.vnet.ibm.com>
Hi Srikar, Valentin,
On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>
[..snip..]
> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>
> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.
I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best
results for POWER9 in terms of cache benefits during wakeup. I measured
this on a POWER9 Boston machine by running a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c).
The test case creates two threads, a Producer and a Consumer. Both
work on a fairly large shared array of size 64M. In each iteration the
Producer performs stores to 1024 random locations and wakes up the
Consumer. In its own iteration, the Consumer loads from those exact
1024 locations.
We measure the number of Consumer iterations per second and the
average time for each Consumer iteration. The smaller the time, the
better it is.
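The per-iteration work described above can be sketched roughly as follows. This is not the actual producer_consumer.c: the helper names, the LCG constants and the uint32_t element type are assumptions made for illustration, and the threading and the wakeup between the two sides are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCATIONS_PER_ITERATION 1024

/* Hypothetical helper: a small LCG so the slot sequence is reproducible. */
static uint64_t next_random(uint64_t *state)
{
	*state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
	return *state >> 33;
}

/* Producer: store `value` to 1024 random slots and remember which ones. */
static void produce_iteration(uint32_t *array, size_t array_len,
			      size_t *slots, uint64_t *rng, uint32_t value)
{
	assert(array_len > 0);
	for (size_t i = 0; i < LOCATIONS_PER_ITERATION; i++) {
		slots[i] = (size_t)(next_random(rng) % array_len);
		array[slots[i]] = value;
	}
}

/* Consumer: load from those exact slots; returns how many hold `value`. */
static size_t consume_iteration(const uint32_t *array, const size_t *slots,
				uint32_t value)
{
	size_t hits = 0;

	for (size_t i = 0; i < LOCATIONS_PER_ITERATION; i++)
		if (array[slots[i]] == value)
			hits++;
	return hits;
}
```

In the real test the Consumer's loads pull in cache lines that the Producer has just dirtied, which is why the placement of the two threads relative to the shared caches dominates the iteration time.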
The following results are when I pinned the Producer and Consumer to
different combinations of CPUs to cover Small core, Big-core,
Neighbouring Big-core, Far off core within the same chip, and across
chips. There is also a case where they are not affined anywhere, and
we let the scheduler wake them up correctly.
We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the numbers that we
get when we let the Scheduler wake them up (where LLC is L2).
## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)
Consumer affined to CPU 3
Producer affined to CPU 1
4698 iterations, avg time: 20034 ns
4951 iterations, avg time: 20012 ns
4957 iterations, avg time: 19971 ns
4968 iterations, avg time: 19985 ns
4970 iterations, avg time: 19977 ns
## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)
Consumer affined to CPU 7
Producer affined to CPU 1
4580 iterations, avg time: 19403 ns
4851 iterations, avg time: 19373 ns
4849 iterations, avg time: 19394 ns
4856 iterations, avg time: 19394 ns
4867 iterations, avg time: 19353 ns
## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)
Producer affined to CPU 1
Consumer affined to CPU 11
4270 iterations, avg time: 24158 ns
4491 iterations, avg time: 24157 ns
4500 iterations, avg time: 24148 ns
4516 iterations, avg time: 24164 ns
4518 iterations, avg time: 24165 ns
## Any other Big-core from Same Chip (Shares L3)
Producer affined to CPU 1
Consumer affined to CPU 87
4176 iterations, avg time: 27953 ns
4417 iterations, avg time: 27925 ns
4415 iterations, avg time: 27934 ns
4417 iterations, avg time: 27983 ns
4430 iterations, avg time: 27958 ns
## Different Chips (No cache-sharing)
Consumer affined to CPU 175
Producer affined to CPU 1
3277 iterations, avg time: 50786 ns
3063 iterations, avg time: 50732 ns
2831 iterations, avg time: 50737 ns
2859 iterations, avg time: 50688 ns
2849 iterations, avg time: 50722 ns
## Without affining them (Let Scheduler wake-them up appropriately)
Consumer affined to CPU 0-175
Producer affined to CPU 0-175
4821 iterations, avg time: 19412 ns
4863 iterations, avg time: 19435 ns
4855 iterations, avg time: 19381 ns
4811 iterations, avg time: 19458 ns
4892 iterations, avg time: 19429 ns
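To put the numbers above side by side, here is a small sketch that averages the reported per-iteration times and expresses one placement as a multiple of another. The sample values are copied from the runs above; the function and array names are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of the reported per-iteration averages, in nanoseconds. */
static double mean_ns(const unsigned int *samples, size_t count)
{
	double sum = 0.0;

	assert(count > 0);
	for (size_t i = 0; i < count; i++)
		sum += samples[i];
	return sum / (double)count;
}

/* Average times copied from the runs above. */
static const unsigned int same_big_core_ns[] = { 19403, 19373, 19394, 19394, 19353 };
static const unsigned int neighbour_big_core_ns[] = { 24158, 24157, 24148, 24164, 24165 };
static const unsigned int different_chip_ns[] = { 50786, 50732, 50737, 50688, 50722 };

/* How many times slower one placement is than another. */
static double slowdown(const unsigned int *slow, const unsigned int *fast,
		       size_t count)
{
	return mean_ns(slow, count) / mean_ns(fast, count);
}
```

With these inputs the cross-chip placement comes out at roughly 2.6 times the same-big-core time, and the neighbouring big-core placement at roughly 1.25 times.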
--
Thanks and Regards
gautham.
* Re: [PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
From: Aneesh Kumar K.V @ 2020-07-31 7:14 UTC
To: Vaibhav Jain, linuxppc-dev, linux-nvdimm
Cc: Santosh Sivaraj, Oliver O'Halloran, Vaibhav Jain,
Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-3-vaibhav@linux.ibm.com>
Vaibhav Jain <vaibhav@linux.ibm.com> writes:
> We add support for reporting 'fuel-gauge' NVDIMM metric via
> PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
> life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
> metric via the H_SCM_PERFORMANCE_STATS hcall.
>
> The metric value is returned from the pdsm by extending the return
> payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
> field 'dimm_fuel_gauge' to hold the metric value is introduced at the
> end of the payload struct and its presence is indicated by the
> extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.
>
> The patch introduces a new function papr_pdsm_fuel_gauge() that is
> called from papr_pdsm_health(). If fetching NVDIMM performance stats
> is supported then 'papr_pdsm_fuel_gauge()' allocates an output buffer
> large enough to hold the performance stat and passes it to
> drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
> of the stat is then populated in the 'struct
> nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
> 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
> nd_papr_pdsm_health.extension_flags'.
>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
> Changelog:
>
> v4:
> * Moved a hunk from this patch to previous patch in the series.
> [ Aneesh ]
>
> v3:
> * Updated papr_pdsm_fuel_gauge() to use the updated
> drc_pmem_query_stats() function.
>
> Resend:
> None
>
> v2:
> * Restructure code in papr_pdsm_fuel_gauge() to handle error case
> first [ Ira ]
> * Ignore the return value of papr_pdsm_fuel_gauge() in
> papr_pdsm_health() [ Ira ]
> ---
> arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +++++
> arch/powerpc/platforms/pseries/papr_scm.c | 49 +++++++++++++++++++++++
> 2 files changed, 58 insertions(+)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> index 9ccecc1d6840..50ef95e2f5b1 100644
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> @@ -72,6 +72,11 @@
> #define PAPR_PDSM_DIMM_CRITICAL 2
> #define PAPR_PDSM_DIMM_FATAL 3
>
> +/* struct nd_papr_pdsm_health.extension_flags field flags */
> +
> +/* Indicate that the 'dimm_fuel_gauge' field is valid */
> +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
> +
> /*
> * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
> * Various flags indicate the health status of the dimm.
> @@ -84,6 +89,7 @@
> * dimm_locked : Contents of the dimm cant be modified until CEC reboot
> * dimm_encrypted : Contents of dimm are encrypted.
> * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_XXXX
> + * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
> */
> struct nd_papr_pdsm_health {
> union {
> @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
> __u8 dimm_locked;
> __u8 dimm_encrypted;
> __u16 dimm_health;
> +
> + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
> + __u16 dimm_fuel_gauge;
> };
> __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
> };
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index f37f3f70007d..f439f0dfea7d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
> return 0;
> }
>
> +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
> + union nd_pdsm_payload *payload)
> +{
> + int rc, size;
> + u64 statval;
> + struct papr_scm_perf_stat *stat;
> + struct papr_scm_perf_stats *stats;
> +
> + /* Silently fail if fetching performance metrics isn't supported */
> + if (!p->stat_buffer_len)
> + return 0;
> +
> + /* Allocate request buffer enough to hold single performance stat */
> + size = sizeof(struct papr_scm_perf_stats) +
> + sizeof(struct papr_scm_perf_stat);
> +
> + stats = kzalloc(size, GFP_KERNEL);
> + if (!stats)
> + return -ENOMEM;
> +
> + stat = &stats->scm_statistic[0];
> + memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
> + stat->stat_val = 0;
> +
> + /* Fetch the fuel gauge and populate it in payload */
> + rc = drc_pmem_query_stats(p, stats, 1);
> + if (rc < 0) {
> + dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
> + goto free_stats;
> + }
> +
> + statval = be64_to_cpu(stat->stat_val);
> + dev_dbg(&p->pdev->dev,
> + "Fetched fuel-gauge %llu", statval);
> + payload->health.extension_flags |=
> + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
> + payload->health.dimm_fuel_gauge = statval;
> +
> + rc = sizeof(struct nd_papr_pdsm_health);
> +
> +free_stats:
> + kfree(stats);
> + return rc;
> +}
> +
> /* Fetch the DIMM health info and populate it in provided package. */
> static int papr_pdsm_health(struct papr_scm_priv *p,
> union nd_pdsm_payload *payload)
> @@ -558,6 +603,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
>
> /* struct populated hence can release the mutex now */
> mutex_unlock(&p->health_mutex);
> +
> + /* Populate the fuel gauge meter in the payload */
> + papr_pdsm_fuel_gauge(p, payload);
> +
> rc = sizeof(struct nd_papr_pdsm_health);
>
> out:
> --
> 2.26.2
* Re: [PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
From: Aneesh Kumar K.V @ 2020-07-31 7:12 UTC
To: Vaibhav Jain, linuxppc-dev, linux-nvdimm
Cc: Santosh Sivaraj, Oliver O'Halloran, Vaibhav Jain,
Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-2-vaibhav@linux.ibm.com>
Vaibhav Jain <vaibhav@linux.ibm.com> writes:
> Update papr_scm.c to query dimm performance statistics from PHYP via
> H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
> specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
> provides sysfs ABI documentation for the stats being reported and
> their meanings.
>
> During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
> of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
> performance statistics is supported or not. If successful then PHYP
> returns the maximum possible buffer length needed to read all
> performance stats. This returned value is stored in a per-nvdimm
> attribute 'stat_buffer_len'.
>
> The layout of request buffer for reading NVDIMM performance stats from
> PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
> papr_scm_perf_stat'. These structs are used in newly introduced
> drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.
>
> The sysfs access function perf_stats_show() uses value
> 'stat_buffer_len' to allocate a buffer large enough to hold all
> possible NVDIMM performance stats and passes it to
> drc_pmem_query_stats() to populate. Finally statistics reported in the
> buffer are formatted into the sysfs access function output buffer.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
> Changelog:
>
> v4:
> * Fixed a build issue with this patch by moving a hunk from second
> patch in series to this patch. [ Aneesh ]
>
> v3:
> * Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
> args to the function. Instead 'buff_size' is calculated from
> 'num_stats' and instead of populating 'R4' in arg 'out' the value is
> returned from the function in case 'R4' represents
> 'max-buffer-size'.
>
> Resend:
> None
>
> v2:
> * Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
> to use big-endian types. [ Aneesh ]
> * s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
> * s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
> * Conversion from Big endian to cpu endian happens later rather than
> just after it's fetched from PHYP.
> * Changed a log statement to unambiguously report dimm performance
> stats are not available for the given nvdimm [ Ira ]
> * Restructured some code to handle error case first [ Ira ]
> ---
> Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 ++++
> arch/powerpc/platforms/pseries/papr_scm.c | 150 ++++++++++++++++++
> 2 files changed, 177 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 5b10d036a8d4..c1a67275c43f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -25,3 +25,30 @@ Description:
> NVDIMM have been scrubbed.
> * "locked" : Indicating that NVDIMM contents cant
> be modified until next power cycle.
> +
> +What: /sys/bus/nd/devices/nmemX/papr/perf_stats
> +Date: May, 2020
> +KernelVersion: v5.9
> +Contact: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
> +Description:
> + (RO) Report various performance stats related to papr-scm NVDIMM
> + device. Each stat is reported on a new line with each line
> + composed of a stat-identifier followed by it value. Below are
> + currently known dimm performance stats which are reported:
> +
> + * "CtlResCt" : Controller Reset Count
> + * "CtlResTm" : Controller Reset Elapsed Time
> + * "PonSecs " : Power-on Seconds
> + * "MemLife " : Life Remaining
> + * "CritRscU" : Critical Resource Utilization
> + * "HostLCnt" : Host Load Count
> + * "HostSCnt" : Host Store Count
> + * "HostSDur" : Host Store Duration
> + * "HostLDur" : Host Load Duration
> + * "MedRCnt " : Media Read Count
> + * "MedWCnt " : Media Write Count
> + * "MedRDur " : Media Read Duration
> + * "MedWDur " : Media Write Duration
> + * "CchRHCnt" : Cache Read Hit Count
> + * "CchWHCnt" : Cache Write Hit Count
> + * "FastWCnt" : Fast Write Count
> \ No newline at end of file
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 3d1235a76ba9..f37f3f70007d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -64,6 +64,26 @@
> PAPR_PMEM_HEALTH_FATAL | \
> PAPR_PMEM_HEALTH_UNHEALTHY)
>
> +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
> +#define PAPR_SCM_PERF_STATS_VERSION 0x1
> +
> +/* Struct holding a single performance metric */
> +struct papr_scm_perf_stat {
> + u8 stat_id[8];
> + __be64 stat_val;
> +} __packed;
> +
> +/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
> +struct papr_scm_perf_stats {
> + u8 eye_catcher[8];
> + /* Should be PAPR_SCM_PERF_STATS_VERSION */
> + __be32 stats_version;
> + /* Number of stats following */
> + __be32 num_statistics;
> + /* zero or more performance matrics */
> + struct papr_scm_perf_stat scm_statistic[];
> +} __packed;
> +
> /* private struct associated with each region */
> struct papr_scm_priv {
> struct platform_device *pdev;
> @@ -92,6 +112,9 @@ struct papr_scm_priv {
>
> /* Health information for the dimm */
> u64 health_bitmap;
> +
> + /* length of the stat buffer as expected by phyp */
> + size_t stat_buffer_len;
> };
>
> static LIST_HEAD(papr_nd_regions);
> @@ -200,6 +223,79 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
> return drc_pmem_bind(p);
> }
>
> +/*
> + * Query the Dimm performance stats from PHYP and copy them (if returned) to
> + * provided struct papr_scm_perf_stats instance 'stats' that can hold atleast
> + * (num_stats + header) bytes.
> + * - If buff_stats == NULL the return value is the size in byes of the buffer
> + * needed to hold all supported performance-statistics.
> + * - If buff_stats != NULL and num_stats == 0 then we copy all known
> + * performance-statistics to 'buff_stat' and expect to be large enough to
> + * hold them.
> + * - if buff_stats != NULL and num_stats > 0 then copy the requested
> + * performance-statistics to buff_stats.
> + */
> +static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
> + struct papr_scm_perf_stats *buff_stats,
> + unsigned int num_stats)
> +{
> + unsigned long ret[PLPAR_HCALL_BUFSIZE];
> + size_t size;
> + s64 rc;
> +
> + /* Setup the out buffer */
> + if (buff_stats) {
> + memcpy(buff_stats->eye_catcher,
> + PAPR_SCM_PERF_STATS_EYECATCHER, 8);
> + buff_stats->stats_version =
> + cpu_to_be32(PAPR_SCM_PERF_STATS_VERSION);
> + buff_stats->num_statistics =
> + cpu_to_be32(num_stats);
> +
> + /*
> + * Calculate the buffer size based on num-stats provided
> + * or use the prefetched max buffer length
> + */
> + if (num_stats)
> + /* Calculate size from the num_stats */
> + size = sizeof(struct papr_scm_perf_stats) +
> + num_stats * sizeof(struct papr_scm_perf_stat);
> + else
> + size = p->stat_buffer_len;
> + } else {
> + /* In case of no out buffer ignore the size */
> + size = 0;
> + }
> +
> + /* Do the HCALL asking PHYP for info */
> + rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
> + buff_stats ? virt_to_phys(buff_stats) : 0,
> + size);
> +
> + /* Check if the error was due to an unknown stat-id */
> + if (rc == H_PARTIAL) {
> + dev_err(&p->pdev->dev,
> + "Unknown performance stats, Err:0x%016lX\n", ret[0]);
> + return -ENOENT;
> + } else if (rc != H_SUCCESS) {
> + dev_err(&p->pdev->dev,
> + "Failed to query performance stats, Err:%lld\n", rc);
> + return -EIO;
> +
> + } else if (!size) {
> + /* Handle case where stat buffer size was requested */
> + dev_dbg(&p->pdev->dev,
> + "Performance stats size %ld\n", ret[0]);
> + return ret[0];
> + }
> +
> + /* Successfully fetched the requested stats from phyp */
> + dev_dbg(&p->pdev->dev,
> + "Performance stats returned %d stats\n",
> + be32_to_cpu(buff_stats->num_statistics));
> + return 0;
> +}
> +
> /*
> * Issue hcall to retrieve dimm health info and populate papr_scm_priv with the
> * health information.
> @@ -637,6 +733,48 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
> return 0;
> }
>
> +static ssize_t perf_stats_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + int index, rc;
> + struct seq_buf s;
> + struct papr_scm_perf_stat *stat;
> + struct papr_scm_perf_stats *stats;
> + struct nvdimm *dimm = to_nvdimm(dev);
> + struct papr_scm_priv *p = nvdimm_provider_data(dimm);
> +
> + if (!p->stat_buffer_len)
> + return -ENOENT;
> +
> + /* Allocate the buffer for phyp where stats are written */
> + stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
> + if (!stats)
> + return -ENOMEM;
> +
> + /* Ask phyp to return all dimm perf stats */
> + rc = drc_pmem_query_stats(p, stats, 0);
> + if (rc)
> + goto free_stats;
So we end up making an HCALL for each read of the sysfs file? You do
throttle that for the PAPR_HEALTH hcall (flags sysfs file). Do we need
to do that here? If not, should we restrict this to CAP_SYS_ADMIN? You
could possibly add an is_visible callback to the papr group and then
restrict all of this to CAP_SYS_ADMIN?
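A minimal user-space model of the throttling raised above, assuming a cached copy of the stats is acceptable for some minimum interval. The struct, the function names and the fake query are all hypothetical, and this is not the kernel's health-query code; it only shows the "re-query only when the cache is old enough" idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache wrapped around an expensive query. */
struct stats_cache {
	bool valid;
	uint64_t fetched_at;		/* time of last real query, in seconds */
	uint64_t value;			/* last value returned by the query */
	unsigned int backend_calls;	/* how often the real query ran */
};

typedef uint64_t (*fetch_fn)(void);

/* Stand-in for the expensive call; returns 1, 2, 3, ... on each invocation. */
static uint64_t fake_expensive_query(void)
{
	static uint64_t calls;

	return ++calls;
}

/*
 * Return the cached value unless it is missing or at least min_interval
 * seconds old, in which case run the real query and refresh the cache.
 * Assumes `now` never goes backwards.
 */
static uint64_t cached_fetch(struct stats_cache *c, uint64_t now,
			     uint64_t min_interval, fetch_fn fetch)
{
	assert(c != NULL && fetch != NULL);
	if (!c->valid || now - c->fetched_at >= min_interval) {
		c->value = fetch();
		c->fetched_at = now;
		c->valid = true;
		c->backend_calls++;
	}
	return c->value;
}
```

The trade-off is staleness: readers inside the interval see the previous values, which may or may not be acceptable for counters such as these.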
> + /*
> + * Go through the returned output buffer and print stats and
> + * values. Since stat_id is essentially a char string of
> + * 8 bytes, simply use the string format specifier to print it.
> + */
> + seq_buf_init(&s, buf, PAGE_SIZE);
> + for (index = 0, stat = stats->scm_statistic;
> + index < be32_to_cpu(stats->num_statistics);
> + ++index, ++stat) {
> + seq_buf_printf(&s, "%.8s = 0x%016llX\n",
> + stat->stat_id,
> + be64_to_cpu(stat->stat_val));
> + }
> +
> +free_stats:
> + kfree(stats);
> + return rc ? rc : seq_buf_used(&s);
> +}
> +DEVICE_ATTR_RO(perf_stats);
> +
> static ssize_t flags_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> @@ -682,6 +820,7 @@ DEVICE_ATTR_RO(flags);
> /* papr_scm specific dimm attributes */
> static struct attribute *papr_nd_attributes[] = {
> &dev_attr_flags.attr,
> + &dev_attr_perf_stats.attr,
> NULL,
> };
>
> @@ -702,6 +841,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> struct nd_region_desc ndr_desc;
> unsigned long dimm_flags;
> int target_nid, online_nid;
> + ssize_t stat_size;
>
> p->bus_desc.ndctl = papr_scm_ndctl;
> p->bus_desc.module = THIS_MODULE;
> @@ -769,6 +909,16 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> list_add_tail(&p->region_list, &papr_nd_regions);
> mutex_unlock(&papr_ndr_lock);
>
> + /* Try retriving the stat buffer and see if its supported */
> + stat_size = drc_pmem_query_stats(p, NULL, 0);
> + if (stat_size > 0) {
> + p->stat_buffer_len = stat_size;
> + dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
> + p->stat_buffer_len);
> + } else {
> + dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
> + }
> +
> return 0;
>
> err: nvdimm_bus_unregister(p->bus);
> --
> 2.26.2
* [PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
From: Vaibhav Jain @ 2020-07-31 6:41 UTC
To: linuxppc-dev, linux-nvdimm
Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
Vaibhav Jain, Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-1-vaibhav@linux.ibm.com>
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS hcall.
The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by the
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.
The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocates an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'.
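As an illustration of how a consumer of the payload might honour the extension flag, here is a user-space sketch. The struct is a simplified stand-in for 'struct nd_papr_pdsm_health' (field widths and order are assumptions, and most fields are omitted), and read_fuel_gauge() is a hypothetical helper rather than anything in ndctl or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1

/* Simplified stand-in for struct nd_papr_pdsm_health (assumed layout). */
struct health_payload {
	uint32_t extension_flags;
	uint16_t dimm_health;
	uint16_t dimm_fuel_gauge;	/* only meaningful if the flag is set */
};

/*
 * Copy the fuel gauge into *percent and return true only when the
 * producer marked the field as valid; otherwise leave *percent alone.
 */
static bool read_fuel_gauge(const struct health_payload *h, uint16_t *percent)
{
	assert(h != NULL && percent != NULL);
	if (!(h->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID))
		return false;
	*percent = h->dimm_fuel_gauge;
	return true;
}
```

A payload filled in without the flag (for example when performance stats are unsupported) is simply treated as having no fuel gauge, which is what keeps the extension backward compatible.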
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
Changelog:
v4:
* Moved a hunk from this patch to previous patch in the series.
[ Aneesh ]
v3:
* Updated papr_pdsm_fuel_gauge() to use the updated
drc_pmem_query_stats() function.
Resend:
None
v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_pdsm_health() [ Ira ]
---
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +++++
arch/powerpc/platforms/pseries/papr_scm.c | 49 +++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
#define PAPR_PDSM_DIMM_CRITICAL 2
#define PAPR_PDSM_DIMM_FATAL 3
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
/*
* Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
* Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
* dimm_locked : Contents of the dimm cant be modified until CEC reboot
* dimm_encrypted : Contents of dimm are encrypted.
* dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_XXXX
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
*/
struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+ /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+ __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index f37f3f70007d..f439f0dfea7d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
return 0;
}
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+ union nd_pdsm_payload *payload)
+{
+ int rc, size;
+ u64 statval;
+ struct papr_scm_perf_stat *stat;
+ struct papr_scm_perf_stats *stats;
+
+ /* Silently fail if fetching performance metrics isn't supported */
+ if (!p->stat_buffer_len)
+ return 0;
+
+ /* Allocate request buffer enough to hold single performance stat */
+ size = sizeof(struct papr_scm_perf_stats) +
+ sizeof(struct papr_scm_perf_stat);
+
+ stats = kzalloc(size, GFP_KERNEL);
+ if (!stats)
+ return -ENOMEM;
+
+ stat = &stats->scm_statistic[0];
+ memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
+ stat->stat_val = 0;
+
+ /* Fetch the fuel gauge and populate it in payload */
+ rc = drc_pmem_query_stats(p, stats, 1);
+ if (rc < 0) {
+ dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+ goto free_stats;
+ }
+
+ statval = be64_to_cpu(stat->stat_val);
+ dev_dbg(&p->pdev->dev,
+ "Fetched fuel-gauge %llu", statval);
+ payload->health.extension_flags |=
+ PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+ payload->health.dimm_fuel_gauge = statval;
+
+ rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+ kfree(stats);
+ return rc;
+}
+
/* Fetch the DIMM health info and populate it in provided package. */
static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -558,6 +603,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
/* struct populated hence can release the mutex now */
mutex_unlock(&p->health_mutex);
+
+ /* Populate the fuel gauge meter in the payload */
+ papr_pdsm_fuel_gauge(p, payload);
+
rc = sizeof(struct nd_papr_pdsm_health);
out:
--
2.26.2
* [PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
From: Vaibhav Jain @ 2020-07-31 6:41 UTC
To: linuxppc-dev, linux-nvdimm
Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
Vaibhav Jain, Dan Williams, Ira Weiny
In-Reply-To: <20200731064153.182203-1-vaibhav@linux.ibm.com>
Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provides sysfs ABI documentation for the stats being reported and
their meanings.
During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
performance statistics is supported or not. If successful then PHYP
returns the maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.
The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.
The sysfs access function perf_stats_show() uses value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally statistics reported in the
buffer are formatted into the sysfs access function output buffer.
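To make the request-buffer layout concrete, here is a user-space sketch of building a request for a single stat. The struct and function names are hypothetical stand-ins: the kernel code uses __be32/__be64 with cpu_to_be32() and kzalloc(), which are replaced here by fixed-width integers, a hand-written to_be32() and calloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* User-space stand-ins for the two request structs. */
struct perf_stat {
	uint8_t stat_id[8];
	uint64_t stat_val;		/* big-endian on the wire */
} __attribute__((packed));

struct perf_stats {
	uint8_t eye_catcher[8];
	uint32_t stats_version;		/* big-endian on the wire */
	uint32_t num_statistics;	/* big-endian on the wire */
	struct perf_stat scm_statistic[];
} __attribute__((packed));

/* Header plus one entry per requested stat. */
static size_t request_size(unsigned int num_stats)
{
	return sizeof(struct perf_stats) +
	       num_stats * sizeof(struct perf_stat);
}

/* Lay a 32-bit value out in big-endian byte order on any host. */
static uint32_t to_be32(uint32_t v)
{
	uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16),
			 (uint8_t)(v >> 8), (uint8_t)v };
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Allocate a zeroed request asking for exactly one stat, e.g. "MemLife ". */
static struct perf_stats *alloc_single_stat_request(const char *id)
{
	struct perf_stats *req;

	assert(id != NULL);
	req = calloc(1, request_size(1));
	if (!req)
		return NULL;
	memcpy(req->eye_catcher, "SCMSTATS", 8);
	req->stats_version = to_be32(1);
	req->num_statistics = to_be32(1);
	memcpy(req->scm_statistic[0].stat_id, id, 8);
	return req;
}
```

Stat identifiers are exactly eight bytes and space padded rather than NUL terminated, so they are copied with memcpy() and a fixed length instead of a string function.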
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
Changelog:
v4:
* Fixed a build issue with this patch by moving a hunk from second
patch in series to this patch. [ Aneesh ]
v3:
* Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
args to the function. Instead 'buff_size' is calculated from
'num_stats' and instead of populating 'R4' in arg 'out' the value is
returned from the function in case 'R4' represents
'max-buffer-size'.
Resend:
None
v2:
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from Big endian to cpu endian happens later rather than
just after it's fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructured some code to handle error case first [ Ira ]
---
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 ++++
arch/powerpc/platforms/pseries/papr_scm.c | 150 ++++++++++++++++++
2 files changed, 177 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
NVDIMM have been scrubbed.
* "locked" : Indicating that NVDIMM contents cant
be modified until next power cycle.
+
+What: /sys/bus/nd/devices/nmemX/papr/perf_stats
+Date: May, 2020
+KernelVersion: v5.9
+Contact: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, linux-nvdimm@lists.01.org,
+Description:
+ (RO) Report various performance stats related to papr-scm NVDIMM
+ device. Each stat is reported on a new line with each line
+ composed of a stat-identifier followed by it value. Below are
+ currently known dimm performance stats which are reported:
+
+ * "CtlResCt" : Controller Reset Count
+ * "CtlResTm" : Controller Reset Elapsed Time
+ * "PonSecs " : Power-on Seconds
+ * "MemLife " : Life Remaining
+ * "CritRscU" : Critical Resource Utilization
+ * "HostLCnt" : Host Load Count
+ * "HostSCnt" : Host Store Count
+ * "HostSDur" : Host Store Duration
+ * "HostLDur" : Host Load Duration
+ * "MedRCnt " : Media Read Count
+ * "MedWCnt " : Media Write Count
+ * "MedRDur " : Media Read Duration
+ * "MedWDur " : Media Write Duration
+ * "CchRHCnt" : Cache Read Hit Count
+ * "CchWHCnt" : Cache Write Hit Count
+ * "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 3d1235a76ba9..f37f3f70007d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -64,6 +64,26 @@
PAPR_PMEM_HEALTH_FATAL | \
PAPR_PMEM_HEALTH_UNHEALTHY)
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+ u8 stat_id[8];
+ __be64 stat_val;
+} __packed;
+
+/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
+struct papr_scm_perf_stats {
+ u8 eye_catcher[8];
+ /* Should be PAPR_SCM_PERF_STATS_VERSION */
+ __be32 stats_version;
+ /* Number of stats following */
+ __be32 num_statistics;
+ /* zero or more performance matrics */
+ struct papr_scm_perf_stat scm_statistic[];
+} __packed;
+
/* private struct associated with each region */
struct papr_scm_priv {
struct platform_device *pdev;
@@ -92,6 +112,9 @@ struct papr_scm_priv {
/* Health information for the dimm */
u64 health_bitmap;
+
+ /* length of the stat buffer as expected by phyp */
+ size_t stat_buffer_len;
};
static LIST_HEAD(papr_nd_regions);
@@ -200,6 +223,79 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
return drc_pmem_bind(p);
}
+/*
+ * Query the Dimm performance stats from PHYP and copy them (if returned) to
+ * provided struct papr_scm_perf_stats instance 'stats' that can hold at least
+ * (num_stats + header) bytes.
+ * - If buff_stats == NULL the return value is the size in bytes of the buffer
+ * needed to hold all supported performance-statistics.
+ * - If buff_stats != NULL and num_stats == 0 then we copy all known
+ * performance-statistics to 'buff_stats' and expect it to be large enough to
+ * hold them.
+ * - if buff_stats != NULL and num_stats > 0 then copy the requested
+ * performance-statistics to buff_stats.
+ */
+static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
+ struct papr_scm_perf_stats *buff_stats,
+ unsigned int num_stats)
+{
+ unsigned long ret[PLPAR_HCALL_BUFSIZE];
+ size_t size;
+ s64 rc;
+
+ /* Setup the out buffer */
+ if (buff_stats) {
+ memcpy(buff_stats->eye_catcher,
+ PAPR_SCM_PERF_STATS_EYECATCHER, 8);
+ buff_stats->stats_version =
+ cpu_to_be32(PAPR_SCM_PERF_STATS_VERSION);
+ buff_stats->num_statistics =
+ cpu_to_be32(num_stats);
+
+ /*
+ * Calculate the buffer size based on num-stats provided
+ * or use the prefetched max buffer length
+ */
+ if (num_stats)
+ /* Calculate size from the num_stats */
+ size = sizeof(struct papr_scm_perf_stats) +
+ num_stats * sizeof(struct papr_scm_perf_stat);
+ else
+ size = p->stat_buffer_len;
+ } else {
+ /* In case of no out buffer ignore the size */
+ size = 0;
+ }
+
+ /* Do the HCALL asking PHYP for info */
+ rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
+ buff_stats ? virt_to_phys(buff_stats) : 0,
+ size);
+
+ /* Check if the error was due to an unknown stat-id */
+ if (rc == H_PARTIAL) {
+ dev_err(&p->pdev->dev,
+ "Unknown performance stats, Err:0x%016lX\n", ret[0]);
+ return -ENOENT;
+ } else if (rc != H_SUCCESS) {
+ dev_err(&p->pdev->dev,
+ "Failed to query performance stats, Err:%lld\n", rc);
+ return -EIO;
+
+ } else if (!size) {
+ /* Handle case where stat buffer size was requested */
+ dev_dbg(&p->pdev->dev,
+ "Performance stats size %ld\n", ret[0]);
+ return ret[0];
+ }
+
+ /* Successfully fetched the requested stats from phyp */
+ dev_dbg(&p->pdev->dev,
+ "Performance stats returned %d stats\n",
+ be32_to_cpu(buff_stats->num_statistics));
+ return 0;
+}
+
/*
* Issue hcall to retrieve dimm health info and populate papr_scm_priv with the
* health information.
@@ -637,6 +733,48 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
return 0;
}
+static ssize_t perf_stats_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int index, rc;
+ struct seq_buf s;
+ struct papr_scm_perf_stat *stat;
+ struct papr_scm_perf_stats *stats;
+ struct nvdimm *dimm = to_nvdimm(dev);
+ struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+ if (!p->stat_buffer_len)
+ return -ENOENT;
+
+ /* Allocate the buffer for phyp where stats are written */
+ stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
+ if (!stats)
+ return -ENOMEM;
+
+ /* Ask phyp to return all dimm perf stats */
+ rc = drc_pmem_query_stats(p, stats, 0);
+ if (rc)
+ goto free_stats;
+ /*
+ * Go through the returned output buffer and print stats and
+ * values. Since stat_id is essentially a char string of
+ * 8 bytes, simply use the string format specifier to print it.
+ */
+ seq_buf_init(&s, buf, PAGE_SIZE);
+ for (index = 0, stat = stats->scm_statistic;
+ index < be32_to_cpu(stats->num_statistics);
+ ++index, ++stat) {
+ seq_buf_printf(&s, "%.8s = 0x%016llX\n",
+ stat->stat_id,
+ be64_to_cpu(stat->stat_val));
+ }
+
+free_stats:
+ kfree(stats);
+ return rc ? rc : seq_buf_used(&s);
+}
+DEVICE_ATTR_RO(perf_stats);
+
static ssize_t flags_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -682,6 +820,7 @@ DEVICE_ATTR_RO(flags);
/* papr_scm specific dimm attributes */
static struct attribute *papr_nd_attributes[] = {
&dev_attr_flags.attr,
+ &dev_attr_perf_stats.attr,
NULL,
};
@@ -702,6 +841,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
int target_nid, online_nid;
+ ssize_t stat_size;
p->bus_desc.ndctl = papr_scm_ndctl;
p->bus_desc.module = THIS_MODULE;
@@ -769,6 +909,16 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
list_add_tail(&p->region_list, &papr_nd_regions);
mutex_unlock(&papr_ndr_lock);
+ /* Try retrieving the stat buffer and see if it's supported */
+ stat_size = drc_pmem_query_stats(p, NULL, 0);
+ if (stat_size > 0) {
+ p->stat_buffer_len = stat_size;
+ dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
+ p->stat_buffer_len);
+ } else {
+ dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
+ }
+
return 0;
err: nvdimm_bus_unregister(p->bus);
--
2.26.2
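For readers following the buffer handling in drc_pmem_query_stats(), the
size calculation and big-endian decoding can be sketched in user space
roughly as below. This is not the kernel code: the names perf_stat,
perf_stats_hdr, stats_buf_size and be64_decode are hypothetical, and plain
byte arrays stand in for the __be32/__be64 fields of the packed structs.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of one stat entry: 8-byte id, 8-byte value. */
struct perf_stat {
	uint8_t stat_id[8];
	uint8_t stat_val[8];		/* big-endian 64-bit value */
};

/* Illustrative mirror of the header that precedes the entries. */
struct perf_stats_hdr {
	uint8_t eye_catcher[8];
	uint8_t stats_version[4];	/* big-endian 32-bit */
	uint8_t num_statistics[4];	/* big-endian 32-bit */
};

/* Bytes needed for the header plus num_stats entries. */
static size_t stats_buf_size(size_t num_stats)
{
	return sizeof(struct perf_stats_hdr) +
	       num_stats * sizeof(struct perf_stat);
}

/* Decode a big-endian 64-bit value regardless of host endianness. */
static uint64_t be64_decode(const uint8_t b[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return v;
}
```

Because both structs consist only of byte arrays, they carry no padding,
so the header and each entry are 16 bytes in this sketch.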
* [PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
From: Vaibhav Jain @ 2020-07-31 6:41 UTC (permalink / raw)
To: linuxppc-dev, linux-nvdimm
Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
Vaibhav Jain, Dan Williams, Ira Weiny
Changes since v3[1]:
* Fixed a rebase issue pointed out by Aneesh in first patch in the series.
[1] https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaibhav@linux.ibm.com
---
This small patchset implements kernel side support for reporting
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL-side changes, the output
should look like:
$ sudo ndctl list -DH
[
{
"dev":"nmem0",
"health":{
"health_state":"ok",
"life_used_percentage":0,
"shutdown_state":"clean"
}
}
]
PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
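The conversion described above is a one-liner; a minimal sketch follows.
The function name and the choice to return -1 for an out-of-range input
are assumptions made for illustration, not something the patchset defines.

```c
/*
 * fuel_gauge is the remaining usable life in percent (0..100).
 * Returns the life used in percent, or -1 if the input is out of
 * range (the error convention here is an assumption).
 */
static int life_used_percentage(unsigned int fuel_gauge)
{
	if (fuel_gauge > 100)
		return -1;
	return 100 - (int)fuel_gauge;
}
```

A fuel_gauge of 100 gives 0, which matches the "life_used_percentage":0
value in the sample ndctl output.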
Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.
Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.
Vaibhav Jain (2):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +
arch/powerpc/platforms/pseries/papr_scm.c | 199 ++++++++++++++++++
3 files changed, 235 insertions(+)
--
2.26.2
* [PATCH] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
From: Shengjiu Wang @ 2020-07-31 6:28 UTC (permalink / raw)
To: timur, nicoleotsuka, Xiubo.Lee, festevam, lgirdwood, broonie,
perex, tiwai, alsa-devel, linuxppc-dev, linux-kernel
The fifo_depth is 64 on i.MX8QM/i.MX8QXP, 128 on i.MX8MQ, 16 on
i.MX7ULP.
The original FSL_SAI_CR1_RFW_MASK value 0x1F is not suitable for
these platforms; the FIFO watermark mask should be updated
according to the fifo_depth.
Fixes: a860fac42097 ("ASoC: fsl_sai: Add support for imx7ulp/imx8mq")
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
---
sound/soc/fsl/fsl_sai.c | 5 +++--
sound/soc/fsl/fsl_sai.h | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
index a22562f2df47..cdff739924e2 100644
--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -680,10 +680,11 @@ static int fsl_sai_dai_probe(struct snd_soc_dai *cpu_dai)
regmap_write(sai->regmap, FSL_SAI_RCSR(ofs), 0);
regmap_update_bits(sai->regmap, FSL_SAI_TCR1(ofs),
- FSL_SAI_CR1_RFW_MASK,
+ FSL_SAI_CR1_RFW_MASK(sai->soc_data->fifo_depth),
sai->soc_data->fifo_depth - FSL_SAI_MAXBURST_TX);
regmap_update_bits(sai->regmap, FSL_SAI_RCR1(ofs),
- FSL_SAI_CR1_RFW_MASK, FSL_SAI_MAXBURST_RX - 1);
+ FSL_SAI_CR1_RFW_MASK(sai->soc_data->fifo_depth),
+ FSL_SAI_MAXBURST_RX - 1);
snd_soc_dai_init_dma_data(cpu_dai, &sai->dma_params_tx,
&sai->dma_params_rx);
diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h
index 76b15deea80c..6aba7d28f5f3 100644
--- a/sound/soc/fsl/fsl_sai.h
+++ b/sound/soc/fsl/fsl_sai.h
@@ -94,7 +94,7 @@
#define FSL_SAI_CSR_FRDE BIT(0)
/* SAI Transmit and Receive Configuration 1 Register */
-#define FSL_SAI_CR1_RFW_MASK 0x1f
+#define FSL_SAI_CR1_RFW_MASK(x) ((x) - 1)
/* SAI Transmit and Receive Configuration 2 Register */
#define FSL_SAI_CR2_SYNC BIT(30)
--
2.27.0
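To see why the fixed 0x1f mask matters, the effect of a masked field write
can be sketched in plain C. update_bits() below is a hypothetical stand-in
that mimics the usual "clear the masked bits, then OR in the masked value"
behaviour; the watermark value 100 is an arbitrary example, not taken from
the driver. Note that ((x) - 1) only yields a contiguous mask when the
FIFO depth is a power of two, which holds for the depths listed above.

```c
#include <stdint.h>

#define OLD_CR1_RFW_MASK		0x1f
#define NEW_CR1_RFW_MASK(depth)		((depth) - 1)

/* Only the bits selected by mask are taken from val. */
static uint32_t update_bits(uint32_t reg, uint32_t mask, uint32_t val)
{
	return (reg & ~mask) | (val & mask);
}
```

With a FIFO depth of 128, a watermark of 100 is truncated to 4 by the old
mask but stored intact with NEW_CR1_RFW_MASK(128), i.e. 0x7f.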
* Re: [PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall
From: Bharata B Rao @ 2020-07-31 4:33 UTC (permalink / raw)
To: Ram Pai
Cc: ldufour, linux-doc, corbet, kvm-ppc, Julia Lawall, sathnaga,
sukadev, linuxppc-dev, david
In-Reply-To: <20200730232101.GB5882@oc0525413822.ibm.com>
On Thu, Jul 30, 2020 at 04:21:01PM -0700, Ram Pai wrote:
> H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
> way in which a page will be treated. H_PAGE_IN_NONSHARED indicates
> that the page will be shared with the Secure VM, and H_PAGE_IN_SHARED
> indicates that the page will not be shared but its contents will
> be copied.
Looks like you got the definitions of shared and non-shared interchanged.
>
> However H_PAGE_IN_NONSHARED is not defined in the header file, though
> it is defined and documented in the API captured in
> Documentation/powerpc/ultravisor.rst
>
> Define H_PAGE_IN_NONSHARED in the header file.
What is the use of defining this? Is this used directly in any place?
Or, are you planning to introduce such a usage?
Regards,
Bharata.
* Re: [PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()
From: Bharata B Rao @ 2020-07-31 4:29 UTC (permalink / raw)
To: Ram Pai
Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
linuxppc-dev, bauerman, david
In-Reply-To: <1596151526-4374-1-git-send-email-linuxram@us.ibm.com>
On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> Observed the following oops while stress-testing, using multiple
> secureVM on a distro kernel. However this issue theoretically exists in
> 5.5 kernel and later.
>
> This issue occurs when the total number of requested device-PFNs exceeds
> the total-number of available device-PFNs. PFN migration fails to
> allocate a device-pfn, which causes migrate_vma_finalize() to trigger
> kvmppc_uvmem_page_free() on a page, that is not associated with any
> device-pfn. kvmppc_uvmem_page_free() blindly tries to access the
> contents of the private data which can be null, leading to the following
> kernel fault.
>
> --------------------------------------------------------------------------
> Unable to handle kernel paging request for data at address 0x00000011
> Faulting instruction address: 0xc00800000e36e110
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA PowerNV
> ....
> MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
> CR: 24424822 XER: 00000000
> CFAR: c000000000e3d764 DAR: 0000000000000011 DSISR: 40000000 IRQMASK: 0
> GPR00: c00800000e36e0a4 c000001f1d59f610 c00800000e38a400 0000000000000000
> GPR04: c000001fa5000000 fffffffffffffffe ffffffffffffffff c000201fffeaf300
> GPR08: 00000000000001f0 0000000000000000 0000000000000f80 c00800000e373608
> GPR12: c000000000e3d710 c000201fffeaf300 0000000000000001 00007fef87360000
> GPR16: 00007fff97db4410 c000201c3b66a578 ffffffffffffffff 0000000000000000
> GPR20: 0000000119db9ad0 000000000000000a fffffffffffffffc 0000000000000001
> GPR24: c000201c3b660000 c000001f1d59f7a0 c0000000004cffb0 0000000000000001
> GPR28: 0000000000000000 c00a001ff003e000 c00800000e386150 0000000000000f80
> NIP [c00800000e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
> LR [c00800000e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
> Call Trace:
> [c000000000512010] free_devmap_managed_page+0xd0/0x100
> [c0000000003f71d0] put_devmap_managed_page+0xa0/0xc0
> [c0000000004d24bc] migrate_vma_finalize+0x32c/0x410
> [c00800000e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
> [c00800000e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
> [c00800000e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
> [c00800000e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
> [c00800000e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
> [c00800000e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
> [c00800000e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
> [c00800000e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
> [c0000000005451d0] do_vfs_ioctl+0xe0/0xaa0
> [c000000000545d64] sys_ioctl+0xc4/0x160
> [c00000000000b408] system_call+0x5c/0x70
> Instruction dump:
> a12d1174 2f890000 409e0158 a1271172 3929ffff b1271172 7c2004ac 39200000
> 913e0140 39200000 e87d0010 f93d0010 <89230011> e8c30000 e9030008 2f890000
> --------------------------------------------------------------------------
>
> Fix the oops..
>
> fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
> arch/powerpc/kvm/book3s_hv_uvmem.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 2806983..f4002bf 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
> {
> unsigned long pfn = page_to_pfn(page) -
> (kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
> - struct kvmppc_uvmem_page_pvt *pvt;
> + struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
> +
> + if (!pvt)
> + return;
>
> spin_lock(&kvmppc_uvmem_bitmap_lock);
> bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
> spin_unlock(&kvmppc_uvmem_bitmap_lock);
>
> - pvt = page->zone_device_data;
> page->zone_device_data = NULL;
> if (pvt->remove_gfn)
> kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
In our case, device pages that are in use are always associated with a valid
pvt member. See kvmppc_uvmem_get_page() which returns failure if it
runs out of device pfns and that will result in proper failure of
page-in calls.
For the case where we run out of device pfns, migrate_vma_finalize() will
restore the original PTE and will not replace the PTE with device private PTE.
Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
called for non-device-private pages.
This could be a use-after-free case possibly arising out of the new state
changes in HV. If so, this fix will only mask the bug and not address the
original problem.
Regards,
Bharata.
* Re: [PATCH 1/2 v2] powerpc/dma: Define map/unmap mmio resource callbacks
From: Oliver O'Halloran @ 2020-07-31 3:30 UTC (permalink / raw)
To: Max Gurtovoy
Cc: vladimirk, Carol L Soto, linux-pci, shlomin, israelr,
Frederic Barrat, idanw, linuxppc-dev, Christoph Hellwig, aneela
In-Reply-To: <20200430131520.51211-1-maxg@mellanox.com>
On Thu, Apr 30, 2020 at 11:15 PM Max Gurtovoy <maxg@mellanox.com> wrote:
>
> Define the map_resource/unmap_resource callbacks for the dma_iommu_ops
> used by several powerpc platforms. The map_resource callback is called
> when trying to map a mmio resource through the dma_map_resource()
> driver API.
>
> For now, the callback returns an invalid address for devices using
> translations, but will "direct" map the resource when in bypass
> mode. Previous behavior for dma_map_resource() was to always return an
> invalid address.
>
> We also call an optional platform-specific controller op in
> case some setup is needed for the platform.
Hey Max,
Sorry for not getting to this sooner. Fred has been dutifully nagging
me to look at it, but people are constantly throwing stuff at me so
it's slipped through the cracks.
Anyway, the changes here are fine IMO. The only real suggestion I have
is that we might want to move the direct / bypass mode check out of
the arch/powerpc/kernel/dma-iommu.c and into the PHB specific function
in pci_controller_ops. I don't see any real reason p2p support should
be limited to devices using bypass mode since the data path is the
same for translated and untranslated DMAs. We do need to impose that
restriction for OPAL / PowerNV IODA PHBs because the implementation of
opal_pci_set_p2p() has the side effect of forcing the TVE into
no-translate mode. However, that's a platform issue so the restriction
should be imposed in platform code.
I'd like to fix that, but I'd prefer to do it as a follow up change
since I need to have a think about how to fix the firmware bits.
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
From: Valentin Schneider @ 2020-07-31 1:05 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
LKML, Nicholas Piggin, Morten Rasmussen, Oliver O'Halloran,
Jordan Niethe, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200729061355.GA14603@linux.vnet.ibm.com>
(+Cc Morten)
On 29/07/20 07:13, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>
> Hi Valentin,
>
> Thanks for looking into the patches.
>
>> On 27/07/20 06:32, Srikar Dronamraju wrote:
>> > Add percpu coregroup maps and masks to create coregroup domain.
>> > If a coregroup doesn't exist, the coregroup domain will be degenerated
>> > in favour of SMT/CACHE domain.
>> >
>>
>> So there's at least one arm64 platform out there with the same "pairs of
>> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
>> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
>> domain, and the whole system is covered by DIE.
>>
>> Now arguably it's not a perfect representation; DIE doesn't have
>> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
>> will impact all callsites using cpus_share_cache(): in the eMAG case, only
>> pairs of cores will be seen as sharing cache, even though *all* cores share
>> the same L3.
>>
>
> Okay, it's good to know that we have a chip which is similar to P9 in
> topology.
>
>> I'm trying to paint a picture of what the P9 topology looks like (the one
>> you showcase in your cover letter) to see if there are any similarities;
>> from what I gather in [1], wikichips and your cover letter, with P9 you can
>> have something like this in a single DIE (somewhat unsure about L3 setup;
>> it looks to be distributed?)
>>
>> +---------------------------------------------------------------------+
>> | L3 |
>> +---------------+-+---------------+-+---------------+-+---------------+
>> | L2 | | L2 | | L2 | | L2 |
>> +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
>> | L1 | | L1 | | L1 | | L1 | | L1 | | L1 | | L1 | | L1 |
>> +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>> |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
>> +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>>
>> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>>
>> NUMA [ ...
>> DIE [ ]
>> MC [ ] [ ] [ ] [ ]
>> BIGCORE [ ] [ ] [ ] [ ]
>> SMT [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ]
>> 00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31 <other node here>
>>
>
> What you have summed up is perfectly what a P9 topology looks like. I don't
> think I could have explained it better than this.
>
Yay!
>> This however has MC == BIGCORE; what makes it you can have different spans
>> for these two domains? If it's not too much to ask, I'd love to have a P9
>> topology diagram.
>>
>> [1]: 20200722081822.GG9290@linux.vnet.ibm.com
>
> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>
Right, that one's fair enough.
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>
Did you test this? WRT load balance we do try to balance "load" over the
different domain spans, so if you represent quads as their own MC domain,
you would AFAICT end up spreading tasks over the quads (rather than packing
them) when balancing at e.g. DIE level. The desired behaviour might be
hackable with some more ASYM_PACKING, but I'm not sure I should be
suggesting that :-)
> I have found some latency-sensitive benchmarks that benefit from
> having a grouping at quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>
IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
binding thing.
That's also where things get interesting (for me) because I experienced
something similar on another arm64 platform (ThunderX1). This was more
about cache bandwidth than cache latency, but IMO it's in the same bag of
fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
gave up on it given the TX1 was the only (arm64) platform where I could get
both significant and reproducible results.
Now, if you folks are seeing this on completely different hardware and have
"real" workloads that truly benefit from this kind of domain partitioning,
this might be another incentive to try and sort of generalize this. That's
outside the scope of your series, but your findings give me some hope!
I think what I had in mind back then was that if enough folks cared about
it, we might get some bits added to the ACPI spec; something along the
lines of proximity domains for the caches described in PPTT, IOW a cache
distance matrix. I don't really know what it'll take to get there, but I
figured I'd dump this in case someone's listening :-)
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as the MC
> domain need not be the LLC domain for Power.
From what I understood your MC domain does seem to map to LLC; but in any
case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
your changes your sd_llc is gonna be SMT, and that's not going to be a very
big mask. IMO you do want to correctly reflect your LLC situation via this
flag to make cpus_share_cache() work properly.
[1]: https://linuxplumbersconf.org/event/4/contributions/484/
* Re: [PATCH v2] powerpc/vio: drop bus_type from parent device
From: Michael Ellerman @ 2020-07-31 0:53 UTC (permalink / raw)
To: Greg KH
Cc: Stephen Rothwell, Thadeu Lima de Souza Cascardo, Peter Rajnoha,
linuxppc-dev
In-Reply-To: <20200730053716.GA3862178@kroah.com>
Greg KH <gregkh@linuxfoundation.org> writes:
> On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote:
>> [ Added Peter & Greg to Cc ]
>>
>> Thadeu Lima de Souza Cascardo <cascardo@canonical.com> writes:
>> > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
>> > code if writing /sys/.../uevent fails") started returning failure when
>> > writing to /sys/devices/vio/uevent.
>> >
>> > This causes an early udevadm trigger to fail. On some installer versions of
>> > Ubuntu, this will cause init to exit, thus panicking the system very early
>> > during boot.
>> >
>> > Removing the bus_type from the parent device will remove some of the extra
>> > empty files from /sys/devices/vio/, but will keep the rest of the layout
>> > for vio devices, keeping them under /sys/devices/vio/.
>>
>> What exactly does it change?
>>
>> I'm finding it hard to evaluate if this change is going to cause a
>> regression somehow.
>>
>> I'm also not clear on why removing the bus type is correct, apart from
>> whether it fixes the bug you're seeing.
>>
>> > It has been tested that uevents for vio devices don't change after this
>> > fix, they still contain MODALIAS.
>> >
>> > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
>> > Fixes: df44b479654f ("kobject: return error code if writing /sys/.../uevent fails")
>>
>> AFAICS there haven't been any other fixes for that commit. Do we know
>> why it is only vio that was affected? (possibly because it's a fake bus
>> to begin with?)
>
> So there was an error previously, the core was ignoring it, and now it
> isn't and to fix that you want to remove describing what bus a device is
> on?
>
> Huh???
Right.
Not to mention there are existing unfixed kernels out there, so whatever
userspace is crashing will need to be fixed for those anyway.
>> > diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
>> > index 37f1f25ba804..a94dab3972a0 100644
>> > --- a/arch/powerpc/platforms/pseries/vio.c
>> > +++ b/arch/powerpc/platforms/pseries/vio.c
>> > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device = { /* fake "parent" device */
>> > .name = "vio",
>> > .type = "",
>> > .dev.init_name = "vio",
>> > - .dev.bus = &vio_bus_type,
>> > };
>
> Wait, a static 'struct device'? You all are playing with fire there.
> That's a reference counted object, and should never be declared like
> that at all.
Since 2005 :)
AC33c9bcf1 ("[PATCH] ppc64: tidy up vio devices fake parent")
> I see you register it, but never unregister it, why? Why is it even
> needed?
I don't remember, if I ever knew.
The code says:
/*
* The fake parent of all vio devices, just to give us
* a nice directory
*/
err = device_register(&vio_bus_device.dev);
But I suspect that may no longer be true.
ie. the devices show up in /sys/bus/vio/devices because they have
dev.bus = vio_bus_type, the fake parent doesn't seem to determine the
location.
> And if you remove the bus type of it, it will show up in a different
> part of sysfs, so I think this patch will show a user-visible change,
> right?
Yes I think so. But because it's a fake device to begin with that's
possibly OK.
I think we really need to get to the bottom of whether we need that
device at all, it seems like it might be left over cruft from the
ancient past.
I'll try and find time to work it out.
cheers
* Re: OF: Can't handle multiple dma-ranges with different offsets
From: Chris Packham @ 2020-07-31 0:10 UTC (permalink / raw)
To: robh+dt@kernel.org, frowand.list@gmail.com, mpe@ellerman.id.au,
benh@kernel.crashing.org, paulus@samba.org,
christophe.leroy@c-s.fr
Cc: devicetree@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org
In-Reply-To: <961bc990-c815-1a19-c349-8b03065d5aab@alliedtelesis.co.nz>
On 23/07/20 10:11 am, Chris Packham wrote:
>
> On 22/07/20 4:19 pm, Chris Packham wrote:
>> Hi,
>>
>> I've just fired up linux kernel v5.7 on a p2040 based system and I'm
>> getting the following new warning
>>
>> OF: Can't handle multiple dma-ranges with different offsets on
>> node(/pcie@ffe202000)
>> OF: Can't handle multiple dma-ranges with different offsets on
>> node(/pcie@ffe202000)
>>
>> The warning itself was added in commit 9d55bebd9816 ("of/address:
>> Support multiple 'dma-ranges' entries") but I gather it's pointing
>> out something about the dts. My boards dts is based heavily on
>> p2041rdb.dts and the relevant pci2 section is identical (reproduced
>> below for reference).
>>
>> pci2: pcie@ffe202000 {
>> reg = <0xf 0xfe202000 0 0x1000>;
>> ranges = <0x02000000 0 0xe0000000 0xc 0x40000000 0 0x20000000
>> 0x01000000 0 0x00000000 0xf 0xf8020000 0 0x00010000>;
>> pcie@0 {
>> ranges = <0x02000000 0 0xe0000000
>> 0x02000000 0 0xe0000000
>> 0 0x20000000
>>
>> 0x01000000 0 0x00000000
>> 0x01000000 0 0x00000000
>> 0 0x00010000>;
>> };
>> };
>>
>> I haven't noticed any ill effect (aside from the scary message). I'm
>> not sure if there's something missing in the dts or in the code that
>> checks the ranges. Any guidance would be appreciated.
>
> I've also just checked the T2080RDB on v5.7.9 which shows a similar issue
>
> OF: Can't handle multiple dma-ranges with different offsets on
> node(/pcie@ffe250000)
> OF: Can't handle multiple dma-ranges with different offsets on
> node(/pcie@ffe250000)
> pcieport 0000:00:00.0: Invalid size 0xfffff9 for dma-range
> pcieport 0000:00:00.0: AER: enabled with IRQ 21
> OF: Can't handle multiple dma-ranges with different offsets on
> node(/pcie@ffe270000)
> OF: Can't handle multiple dma-ranges with different offsets on
> node(/pcie@ffe270000)
> pcieport 0001:00:00.0: Invalid size 0xfffff9 for dma-range
> pcieport 0001:00:00.0: AER: enabled with IRQ 23
I've been doing a bit more digging. The dma-ranges property is not in
the dts/dtb. It's actually inserted by u-boot via ft_fsl_pci_setup().
Here's some output from my T2080RDB
[root@linuxbox ~]# xxd -g4 /sys/firmware/devicetree/base/pcie@ffe240000/dma-ranges
0000000: 02000000 00000000 df000007 0000000f ................
0000010: fe000000 00000000 00fffff9 42000000 ............B...
0000020: 00000000 00000000 00000000 00000000 ................
0000030: 00000000 df000007 43000000 00000010 ........C.......
0000040: 00000000 00000000 00000000 00000001 ................
0000050: 00000000 ....
I'm still wondering how best to deal with this. Hopefully without
needing to deploy a u-boot update.
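For anyone wanting to read the dump above, here is a rough sketch of how
the 21 cells can be split into entries. It assumes the usual PCI layout of
3 child address cells, 2 parent address cells and 2 size cells (7 cells
per entry); the cell counts are an assumption, since the node's
#address-cells/#size-cells are not shown in this thread. The struct and
function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CELLS_PER_ENTRY 7	/* assumed: 3 PCI + 2 parent + 2 size */

struct dma_range {
	uint32_t flags;		/* first PCI address cell */
	uint64_t pci_addr;	/* remaining two PCI address cells */
	uint64_t cpu_addr;	/* two parent address cells */
	uint64_t size;		/* two size cells */
};

/* The cells from the xxd output, already in host byte order. */
static const uint32_t dump[] = {
	0x02000000, 0x00000000, 0xdf000007, 0x0000000f,
	0xfe000000, 0x00000000, 0x00fffff9, 0x42000000,
	0x00000000, 0x00000000, 0x00000000, 0x00000000,
	0x00000000, 0xdf000007, 0x43000000, 0x00000010,
	0x00000000, 0x00000000, 0x00000000, 0x00000001,
	0x00000000,
};

static uint64_t join(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Split cells into entries; returns the number of entries decoded. */
static size_t decode_dma_ranges(const uint32_t *c, size_t ncells,
				struct dma_range *out, size_t max)
{
	size_t n = 0;

	while (ncells >= CELLS_PER_ENTRY && n < max) {
		out[n].flags = c[0];
		out[n].pci_addr = join(c[1], c[2]);
		out[n].cpu_addr = join(c[3], c[4]);
		out[n].size = join(c[5], c[6]);
		c += CELLS_PER_ENTRY;
		ncells -= CELLS_PER_ENTRY;
		n++;
	}
	return n;
}

/* CPU-to-PCI offset of one entry (wraps modulo 2^64 if negative). */
static uint64_t dma_offset(const struct dma_range *r)
{
	return r->cpu_addr - r->pci_addr;
}
```

Under these assumptions the dump decodes to three entries with sizes
0xfffff9, 0xdf000007 and 0x100000000, and the three offsets all differ,
which would be consistent with both the "Invalid size 0xfffff9" and the
"different offsets" messages quoted earlier.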
* [PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()
From: Ram Pai @ 2020-07-30 23:25 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
Observed the following oops while stress-testing, using multiple
secureVM on a distro kernel. However this issue theoretically exists in
5.5 kernel and later.
This issue occurs when the total number of requested device-PFNs exceeds
the total-number of available device-PFNs. PFN migration fails to
allocate a device-pfn, which causes migrate_vma_finalize() to trigger
kvmppc_uvmem_page_free() on a page, that is not associated with any
device-pfn. kvmppc_uvmem_page_free() blindly tries to access the
contents of the private data which can be null, leading to the following
kernel fault.
--------------------------------------------------------------------------
Unable to handle kernel paging request for data at address 0x00000011
Faulting instruction address: 0xc00800000e36e110
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
....
MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
CR: 24424822 XER: 00000000
CFAR: c000000000e3d764 DAR: 0000000000000011 DSISR: 40000000 IRQMASK: 0
GPR00: c00800000e36e0a4 c000001f1d59f610 c00800000e38a400 0000000000000000
GPR04: c000001fa5000000 fffffffffffffffe ffffffffffffffff c000201fffeaf300
GPR08: 00000000000001f0 0000000000000000 0000000000000f80 c00800000e373608
GPR12: c000000000e3d710 c000201fffeaf300 0000000000000001 00007fef87360000
GPR16: 00007fff97db4410 c000201c3b66a578 ffffffffffffffff 0000000000000000
GPR20: 0000000119db9ad0 000000000000000a fffffffffffffffc 0000000000000001
GPR24: c000201c3b660000 c000001f1d59f7a0 c0000000004cffb0 0000000000000001
GPR28: 0000000000000000 c00a001ff003e000 c00800000e386150 0000000000000f80
NIP [c00800000e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
LR [c00800000e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
Call Trace:
[c000000000512010] free_devmap_managed_page+0xd0/0x100
[c0000000003f71d0] put_devmap_managed_page+0xa0/0xc0
[c0000000004d24bc] migrate_vma_finalize+0x32c/0x410
[c00800000e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
[c00800000e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
[c00800000e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
[c00800000e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
[c00800000e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
[c00800000e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c00800000e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
[c00800000e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
[c0000000005451d0] do_vfs_ioctl+0xe0/0xaa0
[c000000000545d64] sys_ioctl+0xc4/0x160
[c00000000000b408] system_call+0x5c/0x70
Instruction dump:
a12d1174 2f890000 409e0158 a1271172 3929ffff b1271172 7c2004ac 39200000
913e0140 39200000 e87d0010 f93d0010 <89230011> e8c30000 e9030008 2f890000
--------------------------------------------------------------------------
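The failure mode described above can be sketched in user space. This is a
minimal illustration, not the kernel code: struct fake_page,
struct uvmem_page_pvt, page_free_checked() and the gfn_removals counter are
hypothetical stand-ins, and the field layout is an assumption modeled on the
fields the patch dereferences (kvm, gpa, remove_gfn). Under that assumed
layout on an LP64 target, remove_gfn sits at offset 0x11, which would be
consistent with the faulting data address 0x00000011 in the log when the
private-data pointer is NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed layout; a stand-in for the kernel's per-page private data. */
struct uvmem_page_pvt {
	void *kvm;            /* owning VM (opaque in this sketch) */
	unsigned long gpa;    /* guest physical address */
	bool skip_page_out;
	bool remove_gfn;      /* offset 8 + 8 + 1 = 0x11 on LP64 */
};

/* Hypothetical stand-in for struct page. */
struct fake_page {
	void *zone_device_data;
};

/* Counts how often the "remove gfn" path was taken (illustration only). */
static int gfn_removals;

/*
 * Free callback with the NULL guard: returns false when the page never
 * had private data attached, true when private data was released.
 */
static bool page_free_checked(struct fake_page *page)
{
	struct uvmem_page_pvt *pvt = page->zone_device_data;

	if (!pvt)
		return false;     /* nothing to release, avoid the NULL dereference */

	page->zone_device_data = NULL;
	if (pvt->remove_gfn)
		gfn_removals++;
	return true;
}
```

Without the guard, the read of pvt->remove_gfn would be a load from a small
offset off a NULL pointer, which is the kind of access the oops reports.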
Fix the oops by returning early from kvmppc_uvmem_page_free() when the
page has no private data.
Fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 2806983..f4002bf 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
{
unsigned long pfn = page_to_pfn(page) -
(kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
- struct kvmppc_uvmem_page_pvt *pvt;
+ struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
+
+ if (!pvt)
+ return;
spin_lock(&kvmppc_uvmem_bitmap_lock);
bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
spin_unlock(&kvmppc_uvmem_bitmap_lock);
- pvt = page->zone_device_data;
page->zone_device_data = NULL;
if (pvt->remove_gfn)
kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
--
1.8.3.1
^ permalink raw reply related
* [PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall
From: Ram Pai @ 2020-07-30 23:21 UTC (permalink / raw)
To: Julia Lawall
Cc: ldufour, linux-doc, corbet, kvm-ppc, bharata, sathnaga, sukadev,
linuxppc-dev, david
In-Reply-To: <alpine.DEB.2.22.394.2007301231140.2548@hadrien>
The H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
way in which a page will be treated. H_PAGE_IN_SHARED indicates
that the page will be shared with the Secure VM, and H_PAGE_IN_NONSHARED
indicates that the page will not be shared but its contents will
be copied.
However, H_PAGE_IN_NONSHARED is not defined in the header file, though
it is defined and documented in the API captured in
Documentation/powerpc/ultravisor.rst.
Define H_PAGE_IN_NONSHARED in the header file.
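The flag handling this patch sets up can be sketched as follows. This is a
user-space illustration only: the three flag values are the ones in the
hvcall.h hunk of this patch (with a UL suffix added here), while
enum page_in_action and classify_page_in_flags() are hypothetical names that
stand in for the H_P2 return, the kvmppc_share_page() call and the normal
page-in path.

```c
/* Flag values as defined by this patch. */
#define H_PAGE_IN_NONSHARED 0x0UL /* page is not shared with the UV */
#define H_PAGE_IN_SHARED    0x1UL /* page is shared with the UV */
#define H_PAGE_IN_MASK      0x1UL

/* Hypothetical outcome codes, for illustration only. */
enum page_in_action {
	PAGE_IN_REJECT,  /* the kernel returns H_P2 here */
	PAGE_IN_SHARE,   /* the kernel calls kvmppc_share_page() here */
	PAGE_IN_MIGRATE  /* H_PAGE_IN_NONSHARED: regular page-in path */
};

static enum page_in_action classify_page_in_flags(unsigned long flags)
{
	/* Any bit outside the mask is an invalid parameter. */
	if (flags & ~H_PAGE_IN_MASK)
		return PAGE_IN_REJECT;

	if (flags & H_PAGE_IN_SHARED)
		return PAGE_IN_SHARE;

	/*
	 * H_PAGE_IN_NONSHARED is 0, so it cannot be tested with '&';
	 * it is simply what remains once the mask check has passed.
	 */
	return PAGE_IN_MIGRATE;
}
```

Because the non-shared value is zero, masking against H_PAGE_IN_MASK rather
than against H_PAGE_IN_SHARED changes no behaviour; it only makes the set of
accepted flags explicit.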
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
arch/powerpc/include/asm/hvcall.h | 4 +++-
arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index e90c073..43e3f8d 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -343,7 +343,9 @@
#define H_COPY_TOFROM_GUEST 0xF80C
/* Flags for H_SVM_PAGE_IN */
-#define H_PAGE_IN_SHARED 0x1
+#define H_PAGE_IN_NONSHARED 0x0 /* Page is not shared with the UV */
+#define H_PAGE_IN_SHARED 0x1 /* Page is shared with UV */
+#define H_PAGE_IN_MASK 0x1
/* Platform-specific hcalls used by the Ultravisor */
#define H_SVM_PAGE_IN 0xEF00
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 2dde0fb..2806983 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -947,12 +947,13 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (page_shift != PAGE_SHIFT)
return H_P3;
- if (flags & ~H_PAGE_IN_SHARED)
+ if (flags & ~H_PAGE_IN_MASK)
return H_P2;
if (flags & H_PAGE_IN_SHARED)
return kvmppc_share_page(kvm, gpa, page_shift);
+ /* handle H_PAGE_IN_NONSHARED */
ret = H_PARAMETER;
srcu_idx = srcu_read_lock(&kvm->srcu);
mmap_read_lock(kvm->mm);
--
1.8.3.1
--
Ram Pai
^ permalink raw reply related
* Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc
From: Jiri Olsa @ 2020-07-30 19:50 UTC (permalink / raw)
To: Athira Rajeev
Cc: Ravi Bangoria, Michael Neuling, maddy, Arnaldo Carvalho de Melo,
Jiri Olsa, kjain, linuxppc-dev
In-Reply-To: <27D1CE26-A506-4CFF-B1C2-E0545F26E637@linux.vnet.ibm.com>
On Thu, Jul 30, 2020 at 01:24:40PM +0530, Athira Rajeev wrote:
>
>
> > On 27-Jul-2020, at 10:46 PM, Athira Rajeev <atrajeev@linux.vnet.ibm.com> wrote:
> >
> > Patch set to add support for perf extended register capability in
> > powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to
> > indicate the PMUs which support extended registers. The generic code
> > defines the mask of extended registers as 0 for unsupported architectures.
> >
> > Patches 1 and 2 are the kernel side changes needed to include
> > base support for extended regs in powerpc and in power10.
> > Patches 3 and 4 are the perf tools side changes needed to support the
> > extended registers.
> >
>
> Hi Arnaldo, Jiri
>
> please let me know if you have any comments/suggestions on this patch series to add support for perf extended regs.
hi,
can't really tell for powerpc, but in general
perf tool changes look ok
jirka
^ permalink raw reply
* Re: [PATCH v4 00/10] Coregroup support on Powerpc
From: Srikar Dronamraju @ 2020-07-30 17:22 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
Michael Ellerman, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
LKML, Ingo Molnar, Nick Piggin, linuxppc-dev, Valentin Schneider
In-Reply-To: <20200727053230.19753-1-srikar@linux.vnet.ibm.com>
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2020-07-27 11:02:20]:
> Changelog v3 ->v4:
> v3: https://lore.kernel.org/lkml/20200723085116.4731-1-srikar@linux.vnet.ibm.com/t/#u
>
Here is a summary of some of the testing done with the coregroup v4 patchset.
It includes ebizzy, schbench, perf bench sched pipe and topology verification.
On the left side are results from the powerpc/next tree and on the right are the
results with the patchset applied. Topology verification clearly shows that
there is no change in topology with and without the patches on all 3 classes
of systems that were tested.
On PowerPc/Next On Powerpc/next + Coregroup Support v4 patchset
Power 9 PowerNV (2 Node/ 160 Cpu System)
---------------------------------
ebizzy (throughput of 100 iterations of 30 seconds; higher throughput is better)
N Min Max Median Avg Stddev N Min Max Median Avg Stddev
100 993884 1276090 1173476 1165914 54867.201 100 910470 1279820 1171095 1162091 67363.28
schbench (latency hence lower is better)
Latency percentiles (usec) Latency percentiles (usec)
50.0th: 455 50.0th: 454
75.0th: 533 75.0th: 543
90.0th: 683 90.0th: 701
95.0th: 743 95.0th: 737
*99.0th: 815 *99.0th: 805
99.5th: 839 99.5th: 835
99.9th: 913 99.9th: 893
min=0, max=1011 min=0, max=2833
perf bench sched pipe (lower time and higher ops/sec are better)
# Running 'sched/pipe' benchmark: # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes # Executed 1000000 pipe operations between two processes
Total time: 6.083 [sec] Total time: 6.303 [sec]
6.083576 usecs/op 6.303318 usecs/op
164377 ops/sec 158646 ops/sec
Power 9 LPAR (2 Node/ 128 Cpu System)
---------------------------------
ebizzy (throughput of 100 iterations of 30 seconds; higher throughput is better)
N Min Max Median Avg Stddev N Min Max Median Avg Stddev
100 1058029 1295393 1200414 1188306.7 56786.538 100 943264 1287619 1180522 1168473.2 64469.955
schbench (latency hence lower is better)
Latency percentiles (usec) Latency percentiles (usec)
50.0000th: 34 50.0000th: 39
75.0000th: 46 75.0000th: 52
90.0000th: 53 90.0000th: 68
95.0000th: 56 95.0000th: 77
*99.0000th: 61 *99.0000th: 89
99.5000th: 63 99.5000th: 94
99.9000th: 81 99.9000th: 169
min=0, max=8405 min=0, max=23674
perf bench sched pipe (lower time and higher ops/sec are better)
# Running 'sched/pipe' benchmark: # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes # Executed 1000000 pipe operations between two processes
Total time: 8.768 [sec] Total time: 5.217 [sec]
8.768400 usecs/op 5.217625 usecs/op
114045 ops/sec 191658 ops/sec
Power 8 LPAR (8 Node/ 256 Cpu System)
---------------------------------
ebizzy (throughput of 100 iterations of 30 seconds; higher throughput is better)
N Min Max Median Avg Stddev N Min Max Median Avg Stddev
100 1267615 1965234 1707423 1689137.6 144363.29 100 1175357 1924262 1691104 1664792.1 145876.4
schbench (latency hence lower is better)
Latency percentiles (usec) Latency percentiles (usec)
50.0th: 37 50.0th: 36
75.0th: 51 75.0th: 48
90.0th: 59 90.0th: 55
95.0th: 63 95.0th: 59
*99.0th: 71 *99.0th: 67
99.5th: 75 99.5th: 72
99.9th: 105 99.9th: 170
min=0, max=18560 min=0, max=27031
perf bench sched pipe (lower time and higher ops/sec are better)
# Running 'sched/pipe' benchmark: # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes # Executed 1000000 pipe operations between two processes
Total time: 6.013 [sec] Total time: 5.930 [sec]
6.013963 usecs/op 5.930724 usecs/op
166279 ops/sec 168613 ops/sec
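To put the before/after pairs above on a common scale, the relative change
can be computed as below. This is a small sketch; percent_change() is a
hypothetical helper, not part of any benchmark tool. Applied to the
usecs/op figures from the three sched/pipe runs above, it gives roughly
+3.6% (Power 9 PowerNV), -40.5% (Power 9 LPAR) and -1.4% (Power 8 LPAR),
where negative means less time per operation with the patchset. Each figure
comes from a single run, so these deltas say nothing about run-to-run
variance.

```c
/*
 * Relative change of 'after' with respect to 'before', in percent.
 * Positive means the value grew, negative means it shrank.
 */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, percent_change(8.768400, 5.217625) corresponds to the Power 9
LPAR sched/pipe result.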
Topology verification on Power9
Power9/ PowerNV / SMT4
tail -f /proc/cpuinfo
---------------------
cpu : POWER9, altivec supported
clock : 3600.000000MHz
revision : 2.2 (pvr 004e 1202)
timebase : 512000000
platform : PowerNV
model : 9006-22P
machine : PowerNV 9006-22P
firmware : OPAL
MMU : Radix
On PowerPc/Next On Powerpc/next + Coregroup Support v4 patchset
lscpu lscpu
------ ------
Architecture: ppc64le Architecture: ppc64le
Byte Order: Little Endian Byte Order: Little Endian
CPU(s): 160 CPU(s): 160
On-line CPU(s) list: 0-159 On-line CPU(s) list: 0-159
Thread(s) per core: 4 Thread(s) per core: 4
Core(s) per socket: 20 Core(s) per socket: 20
Socket(s): 2 Socket(s): 2
NUMA node(s): 2 NUMA node(s): 2
Model: 2.2 (pvr 004e 1202) Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported Model name: POWER9, altivec supported
CPU max MHz: 3800.0000 CPU max MHz: 3800.0000
CPU min MHz: 2166.0000 CPU min MHz: 2166.0000
L1d cache: 32K L1d cache: 32K
L1i cache: 32K L1i cache: 32K
L2 cache: 512K L2 cache: 512K
L3 cache: 10240K L3 cache: 10240K
NUMA node0 CPU(s): 0-79 NUMA node0 CPU(s): 0-79
NUMA node8 CPU(s): 80-159 NUMA node8 CPU(s): 80-159
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
----------------------------------------------------- -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
/proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------ ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801 /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801
On PowerPc/Next
head /proc/schedstat
--------------------
version 15
timestamp 4295043536
cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 4941435230 11106132 1583
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
On Powerpc/next + Coregroup Support v4 patchset
head /proc/schedstat
--------------------
version 15
timestamp 4296311826
cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Post sudo ppc64_cpu --smt=1 Post sudo ppc64_cpu --smt=1
--------------------- ---------------------
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
----------------------------------------------------- -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------ ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801
On Powerpc/next
head /proc/schedstat
--------------------
version 15
timestamp 4295046242
cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 5408663896 95701034 7697
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
On Powerpc/next + Coregroup Support v4 patchset
head /proc/schedstat
--------------------
version 15
timestamp 4296314905
cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Similar verification was done on Power 8 (8 node, 256 CPU LPAR) and Power 9 (2
node, 128 CPU LPAR) and they showed the topology before and after the patch to be
identical. If interested, I can provide the same.
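When comparing the domain lines in the /proc/schedstat output above, it can
help to turn each cpumask into a CPU count. The following is a minimal
sketch with a hypothetical helper, cpumask_weight_str(), which counts the
set bits in a comma-separated hex mask as printed in those lines. For the
SMT=4 output it gives 4 CPUs for domain0 (0000000f), 8 for domain1
(000000ff) and 80 for domain2, which agrees with the lscpu output of 4
threads per core and CPUs 0-79 on NUMA node0.

```c
#include <ctype.h>

/*
 * Count the set bits in a comma-separated hex cpumask string such as
 * "00000000,0000ffff,ffffffff". Parsing stops at the first character
 * that is neither a hex digit nor a comma, so the mask may be followed
 * by the rest of the schedstat line.
 */
static unsigned int cpumask_weight_str(const char *mask)
{
	unsigned int weight = 0;

	for (; *mask; mask++) {
		unsigned char c = (unsigned char)*mask;
		unsigned int nibble;

		if (c == ',')
			continue;
		if (!isxdigit(c))
			break;

		nibble = isdigit(c) ? (unsigned int)(c - '0')
				    : (unsigned int)(tolower(c) - 'a') + 10;

		/* Clear the lowest set bit until none remain. */
		for (; nibble; nibble &= nibble - 1)
			weight++;
	}
	return weight;
}
```

Running the same helper on the masks from both columns is a quick way to
confirm that the domain spans match with and without the patchset.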
^ permalink raw reply
* Re: Documentation/powerpc: Ultravisor API
From: Ram Pai @ 2020-07-30 16:48 UTC (permalink / raw)
To: Julia Lawall; +Cc: sukadev, linuxppc-dev, linux-doc, corbet
In-Reply-To: <alpine.DEB.2.22.394.2007301231140.2548@hadrien>
On Thu, Jul 30, 2020 at 12:35:38PM +0200, Julia Lawall wrote:
> The file Documentation/powerpc/ultravisor.rst contains:
>
> Only valid value(s) in ``flags`` are:
>
> * H_PAGE_IN_SHARED which indicates that the page is to be shared
> with the Ultravisor.
>
> * H_PAGE_IN_NONSHARED indicates that the UV is not anymore
> interested in the page. Applicable if the page is a shared page.
>
> The flag H_PAGE_IN_SHARED exists in the Linux kernel
> (arch/powerpc/include/asm/hvcall.h), but the flag H_PAGE_IN_NONSHARED does
> not. Should the documentation be changed in some way?
Currently the code assumes H_PAGE_IN_NONSHARED as !H_PAGE_IN_SHARED.
We need to patch the kernel to explicitly define the flag.
I will submit a patch towards this.
Thanks,
RP
^ permalink raw reply
* Re: [PATCH -next] PCI: rpadlpar: Make some functions static
From: Bjorn Helgaas @ 2020-07-30 16:16 UTC (permalink / raw)
To: Wei Yongjun
Cc: Tyrel Datwyler, linux-pci, Hulk Robot, Bjorn Helgaas,
linuxppc-dev
In-Reply-To: <20200721151735.41181-1-weiyongjun1@huawei.com>
On Tue, Jul 21, 2020 at 11:17:35PM +0800, Wei Yongjun wrote:
> The sparse tool reports build warnings as follows:
>
> drivers/pci/hotplug/rpadlpar_core.c:355:5: warning:
> symbol 'dlpar_remove_pci_slot' was not declared. Should it be static?
> drivers/pci/hotplug/rpadlpar_core.c:461:12: warning:
> symbol 'rpadlpar_io_init' was not declared. Should it be static?
> drivers/pci/hotplug/rpadlpar_core.c:473:6: warning:
> symbol 'rpadlpar_io_exit' was not declared. Should it be static?
>
> Those functions are not used outside of this file, so mark them
> static.
> Also mark rpadlpar_io_exit() as __exit.
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Applied to pci/hotplug for v5.9, thanks!
> ---
> drivers/pci/hotplug/rpadlpar_core.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
> index c5eb509c72f0..f979b7098acf 100644
> --- a/drivers/pci/hotplug/rpadlpar_core.c
> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> @@ -352,7 +352,7 @@ static int dlpar_remove_vio_slot(char *drc_name, struct device_node *dn)
> * -ENODEV Not a valid drc_name
> * -EIO Internal PCI Error
> */
> -int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
> +static int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
> {
> struct pci_bus *bus;
> struct slot *slot;
> @@ -458,7 +458,7 @@ static inline int is_dlpar_capable(void)
> return (int) (rc != RTAS_UNKNOWN_SERVICE);
> }
>
> -int __init rpadlpar_io_init(void)
> +static int __init rpadlpar_io_init(void)
> {
>
> if (!is_dlpar_capable()) {
> @@ -470,7 +470,7 @@ int __init rpadlpar_io_init(void)
> return dlpar_sysfs_init();
> }
>
> -void rpadlpar_io_exit(void)
> +static void __exit rpadlpar_io_exit(void)
> {
> dlpar_sysfs_exit();
> }
>
^ permalink raw reply
* [powerpc:next] BUILD SUCCESS cf1ae052e073c7ef6cf1a783a6427f7228253bd3
From: kernel test robot @ 2020-07-30 16:12 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
branch HEAD: cf1ae052e073c7ef6cf1a783a6427f7228253bd3 powerpc/powernv/sriov: Remove unused but set variable 'phb'
elapsed time: 1486m
configs tested: 54
configs skipped: 1
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
nios2 allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc allyesconfig
powerpc allmodconfig
powerpc allnoconfig
i386 randconfig-a016-20200730
i386 randconfig-a012-20200730
i386 randconfig-a014-20200730
i386 randconfig-a015-20200730
i386 randconfig-a011-20200730
i386 randconfig-a013-20200730
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64 defconfig
x86_64 rhel-8.3
x86_64 kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* Re: [PATCH v2] powerpc/vio: drop bus_type from parent device
From: Thadeu Lima de Souza Cascardo @ 2020-07-30 15:35 UTC (permalink / raw)
To: Greg KH; +Cc: Peter Rajnoha, linuxppc-dev
In-Reply-To: <20200730053716.GA3862178@kroah.com>
On Thu, Jul 30, 2020 at 07:37:16AM +0200, Greg KH wrote:
> On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote:
> > [ Added Peter & Greg to Cc ]
> >
> > Thadeu Lima de Souza Cascardo <cascardo@canonical.com> writes:
> > > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
> > > code if writing /sys/.../uevent fails") started returning failure when
> > > writing to /sys/devices/vio/uevent.
> > >
> > > This causes an early udevadm trigger to fail. On some installer versions of
> > > Ubuntu, this will cause init to exit, thus panicking the system very early
> > > during boot.
> > >
> > > Removing the bus_type from the parent device will remove some of the extra
> > > empty files from /sys/devices/vio/, but will keep the rest of the layout
> > > for vio devices, keeping them under /sys/devices/vio/.
> >
> > What exactly does it change?
> >
> > I'm finding it hard to evaluate if this change is going to cause a
> > regression somehow.
> >
> > I'm also not clear on why removing the bus type is correct, apart from
> > whether it fixes the bug you're seeing.
> >
> > > It has been tested that uevents for vio devices don't change after this
> > > fix, they still contain MODALIAS.
> > >
> > > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> > > Fixes: df44b479654f ("kobject: return error code if writing /sys/.../uevent fails")
> >
> > AFAICS there haven't been any other fixes for that commit. Do we know
> > why it is only vio that was affected? (possibly because it's a fake bus
> > to begin with?)
>
> So there was an error previously, the core was ignoring it, and now it
> isn't and to fix that you want to remove describing what bus a device is
> on?
>
> Huh???
>
> >
> > cheers
> >
> > > diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
> > > index 37f1f25ba804..a94dab3972a0 100644
> > > --- a/arch/powerpc/platforms/pseries/vio.c
> > > +++ b/arch/powerpc/platforms/pseries/vio.c
> > > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device = { /* fake "parent" device */
> > > .name = "vio",
> > > .type = "",
> > > .dev.init_name = "vio",
> > > - .dev.bus = &vio_bus_type,
> > > };
>
> Wait, a static 'struct device'? You all are playing with fire there.
> That's a reference counted object, and should never be declared like
> that at all.
>
> I see you register it, but never unregister it, why? Why is it even
> needed?
>
> And if you remove the bus type of it, it will show up in a different
> part of sysfs, so I think this patch will show a user-visible change,
> right?
>
> thanks,
>
> greg k-h
As the comment says, it's a "fake parent device". There is a user-visible
change, which is removing some attributes from the object, but it's still
showing up on the same path.
Returning an error code like df44b479654f does is also a user-visible change,
and it breaks installer images, which panic early on boot.
I could investigate an alternative here, which would be to not fail when writing
to uevent for this specific fake device.
Cascardo.
^ permalink raw reply
* Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
From: Vladis Dronov @ 2020-07-30 15:34 UTC (permalink / raw)
To: Michael Ellerman
Cc: Aneesh Kumar K . V, Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <87ft995hv8.fsf@mpe.ellerman.id.au>
Hello, Michael,
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> Subject: Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
>
...
> >> > So what changed? These functions were inlined with older compilers, but
> >> > not anymore?
> >>
> >> Yes, exactly. Gcc-10 does not inline them anymore. If this is because of
> >> my
> >> build system, this can happen to others also.
> >>
> >> The same thing was fixed by Linus in e99332e7b4cd ("gcc-10: mark more
> >> functions
> >> __init to avoid section mismatch warnings").
> >
> > It sounds like this is part of "-finline-functions was retuned" on
> > <https://gcc.gnu.org/gcc-10/changes.html>? So everyone should see it
> > (no matter what config or build system), and it is a good thing too :-)
>
> I haven't seen it in my GCC 10 builds, so there must be some other
> subtlety. Probably it depends on details of the .config.
>
I've just hit this while building the latest upstream for ppc64le with a derivative
of the RHEL-8 config. This could be down to a compiler/linker setting, like -O2
versus -O3.
> cheers
Best regards,
Vladis Dronov | Red Hat, Inc. | The Core Kernel | Senior Software Engineer
^ permalink raw reply
* [powerpc:fixes-test] BUILD SUCCESS 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74
From: kernel test robot @ 2020-07-30 15:30 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74 powerpc/64s/hash: Fix hash_preload running with interrupts enabled
elapsed time: 4429m
configs tested: 102
configs skipped: 3
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
sh r7785rp_defconfig
mips tb0226_defconfig
mips loongson3_defconfig
um kunit_defconfig
nds32 alldefconfig
arm imx_v4_v5_defconfig
mips gcw0_defconfig
mips fuloong2e_defconfig
arm pxa255-idp_defconfig
s390 defconfig
arm prima2_defconfig
arm footbridge_defconfig
mips nlm_xlr_defconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k defconfig
m68k allmodconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
nios2 allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc allyesconfig
powerpc allmodconfig
powerpc allnoconfig
powerpc defconfig
x86_64 randconfig-a005-20200727
x86_64 randconfig-a004-20200727
x86_64 randconfig-a003-20200727
x86_64 randconfig-a006-20200727
x86_64 randconfig-a002-20200727
x86_64 randconfig-a001-20200727
i386 randconfig-a003-20200728
i386 randconfig-a004-20200728
i386 randconfig-a005-20200728
i386 randconfig-a002-20200728
i386 randconfig-a006-20200728
i386 randconfig-a001-20200728
i386 randconfig-a003-20200727
i386 randconfig-a005-20200727
i386 randconfig-a004-20200727
i386 randconfig-a006-20200727
i386 randconfig-a002-20200727
i386 randconfig-a001-20200727
x86_64 randconfig-a014-20200728
x86_64 randconfig-a012-20200728
x86_64 randconfig-a015-20200728
x86_64 randconfig-a016-20200728
x86_64 randconfig-a013-20200728
x86_64 randconfig-a011-20200728
i386 randconfig-a016-20200728
i386 randconfig-a012-20200728
i386 randconfig-a013-20200728
i386 randconfig-a014-20200728
i386 randconfig-a011-20200728
i386 randconfig-a015-20200728
i386 randconfig-a016-20200727
i386 randconfig-a013-20200727
i386 randconfig-a012-20200727
i386 randconfig-a015-20200727
i386 randconfig-a011-20200727
i386 randconfig-a014-20200727
i386 randconfig-a016-20200730
i386 randconfig-a012-20200730
i386 randconfig-a014-20200730
i386 randconfig-a015-20200730
i386 randconfig-a011-20200730
i386 randconfig-a013-20200730
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64 defconfig
x86_64 rhel-8.3
x86_64 kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Nathan Lynch @ 2020-07-30 15:01 UTC (permalink / raw)
To: Michael Ellerman, Laurent Dufour; +Cc: tyreld, cheloha, linuxppc-dev
In-Reply-To: <87lfj16cql.fsf@mpe.ellerman.id.au>
Michael Ellerman <mpe@ellerman.id.au> writes:
> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> Laurent Dufour <ldufour@linux.ibm.com> writes:
>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>> unfortunately lookups take the form of linear searches. As long as
>>>> this is the case, traversals have the potential to monopolize the CPU
>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>> they explicitly yield.
>>>>
>>>> Rather than placing cond_resched() calls within various
>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>> expression of the loop macro itself so users can't omit it.
>>>
>>> Is that not too much to call cond_resched() on every LMB?
>>>
>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>
>> Everything done within for_each_drmem_lmb is relatively heavyweight
>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>> of milliseconds. I don't think cond_resched() is an expensive check in
>> this context.
>
> Hmm, mostly.
>
> But there are quite a few cases like drmem_update_dt_v1():
>
> for_each_drmem_lmb(lmb) {
> dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
> dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
> dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
> dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>
> dr_cell++;
> }
>
> Which will compile to a pretty tight loop at the moment.
>
> Or drmem_update_dt_v2() which has two loops over all lmbs.
>
> And although the actual TIF check is cheap the function call to do it is
> not free.
>
> So I worry this is going to make some of those long loops take even
> longer.
That's fair, and I was wrong - some of the loop bodies are relatively
simple, not doing allocations or taking locks, etc.
One way to deal with this is to keep for_each_drmem_lmb() as-is and add a new
iterator that can reschedule, e.g. for_each_drmem_lmb_slow().
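As a rough user-space sketch of that idea (not the actual patch): the names
struct lmb, maybe_yield(), for_each_lmb_in_range() and
for_each_lmb_in_range_slow() below are hypothetical, and maybe_yield() only
counts calls where the kernel version would call cond_resched().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct drmem_lmb. */
struct lmb {
	unsigned long base_addr;
};

/* Counts yield points reached (the kernel would call cond_resched()). */
static unsigned long yield_calls;

static inline void maybe_yield(void)
{
	yield_calls++;
}

/* Plain iterator: compiles to a tight loop for simple bodies. */
#define for_each_lmb_in_range(lmb, start, end) \
	for ((lmb) = (start); (lmb) < (end); (lmb)++)

/*
 * "Slow" iterator: the yield sits in the iteration expression, so a
 * caller cannot forget it, at the cost of one call per element.
 */
#define for_each_lmb_in_range_slow(lmb, start, end) \
	for ((lmb) = (start); (lmb) < (end); maybe_yield(), (lmb)++)

static unsigned long sum_base_addrs(struct lmb *lmbs, size_t n)
{
	struct lmb *lmb;
	unsigned long sum = 0;

	for_each_lmb_in_range(lmb, lmbs, lmbs + n)
		sum += lmb->base_addr;
	return sum;
}

static unsigned long sum_base_addrs_slow(struct lmb *lmbs, size_t n)
{
	struct lmb *lmb;
	unsigned long sum = 0;

	for_each_lmb_in_range_slow(lmb, lmbs, lmbs + n)
		sum += lmb->base_addr;
	return sum;
}
```

Keeping two iterators lets tight loops such as the device-tree update stay
cheap, while heavyweight callers opt in to rescheduling.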
On the other hand... it's probably not too strong to say that the
drmem/hotplug code is in crisis with respect to correctness and
algorithmic complexity, so those are my overriding concerns right
now. Yes, this change will pessimize loops that are reinitializing the
entire drmem_lmb array on every DLPAR operation, but:
1. it doesn't make any user of for_each_drmem_lmb() less correct;
2. why is this code doing that in the first place, other than to
accommodate a poor data structure choice?
The duration of the system calls where this code runs are measured in
minutes or hours on large configurations because of all the behaviors
that are at best O(n) with the amount of memory assigned to the
partition. For simplicity's sake I'd rather defer lower-level
performance considerations like this until the drmem data structures'
awful lookup properties are fixed -- hopefully in the 5.10 timeframe.
Thoughts?
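
For illustration only, here is a minimal userspace sketch of the two
ideas discussed above: putting the yield in the iteration expression of
the loop macro so callers cannot omit it, and yielding only every N
entries so tight loop bodies pay less. Everything in it is an
assumption made for the example: struct lmb, maybe_yield(), lmb_next(),
for_each_lmb_resched() and count_yields() are hypothetical names,
sched_yield() merely stands in for cond_resched(), and none of this is
the code from the patch under discussion.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct drmem_lmb. */
struct lmb {
	unsigned long base_addr;
};

/* Counts yields so the effect of the interval can be observed. */
static unsigned long yield_count;

/* Userspace stand-in for cond_resched() (assumption for this sketch). */
static void maybe_yield(void)
{
	yield_count++;
	sched_yield();
}

/* Advance the cursor; yield once every `interval` entries. */
static struct lmb *lmb_next(struct lmb *lmb, const struct lmb *start,
			    size_t interval)
{
	lmb++;
	if ((size_t)(lmb - start) % interval == 0)
		maybe_yield();
	return lmb;
}

/* The yield lives in the iteration expression, so users can't omit it. */
#define for_each_lmb_resched(lmb, start, end, interval)		\
	for ((lmb) = (start); (lmb) < (end);			\
	     (lmb) = lmb_next((lmb), (start), (interval)))

/* Walk n entries and report how many times the loop yielded. */
static unsigned long count_yields(size_t n, size_t interval)
{
	struct lmb *lmbs = calloc(n ? n : 1, sizeof(*lmbs));
	struct lmb *lmb;
	size_t visited = 0;

	assert(lmbs);
	yield_count = 0;
	for_each_lmb_resched(lmb, lmbs, lmbs + n, interval)
		visited++;
	assert(visited == n);
	free(lmbs);
	return yield_count;
}
```

With an interval of 1 this behaves like yielding on every LMB; a larger
interval trades scheduling latency for less per-iteration overhead in
loops such as the one in drmem_update_dt_v1(). Calling
count_yields(100, 1) and count_yields(100, 16) shows the difference in
how often the loop yields.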
* [powerpc:next-test] BUILD SUCCESS 2e6bd221d96fcfd9bd1eed5cd9c008e7959daed7
From: kernel test robot @ 2020-07-30 14:42 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: 2e6bd221d96fcfd9bd1eed5cd9c008e7959daed7 powerpc/kexec_file: Enable early kernel OPAL calls
elapsed time: 1395m
configs tested: 52
configs skipped: 1
The following configs have been built successfully.
More configs may be tested in the coming days.
arm       defconfig
arm64     allyesconfig
arm64     defconfig
arm       allyesconfig
arm       allmodconfig
ia64      allmodconfig
ia64      defconfig
ia64      allyesconfig
m68k      allmodconfig
m68k      defconfig
m68k      allyesconfig
nds32     defconfig
nios2     allyesconfig
csky      defconfig
alpha     defconfig
alpha     allyesconfig
xtensa    allyesconfig
h8300     allyesconfig
arc       defconfig
sh        allmodconfig
parisc    defconfig
s390      allyesconfig
parisc    allyesconfig
s390      defconfig
i386      allyesconfig
sparc     allyesconfig
sparc     defconfig
i386      defconfig
nios2     defconfig
arc       allyesconfig
nds32     allnoconfig
c6x       allyesconfig
mips      allyesconfig
mips      allmodconfig
powerpc   defconfig
powerpc   allyesconfig
powerpc   allmodconfig
powerpc   allnoconfig
i386      randconfig-a016-20200730
i386      randconfig-a012-20200730
i386      randconfig-a014-20200730
i386      randconfig-a015-20200730
riscv     allyesconfig
riscv     allnoconfig
riscv     defconfig
riscv     allmodconfig
x86_64    rhel
x86_64    allyesconfig
x86_64    rhel-7.6-kselftests
x86_64    defconfig
x86_64    rhel-8.3
x86_64    kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org