LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
From: Srikar Dronamraju @ 2020-07-31  9:49 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <87zh7g3yvk.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:52:15]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > If allocated earlier and the search fails, then cpumask need to be
> > freed. However cpu_l1_cache_map can be allocated after we search thread
> > group.
> 
> It's not freed anywhere AFAICS?
> 

Yes, its never freed. Infact we are never checking if
zalloc_cpumask_var_node fails. Its not just this cpumask, but historically
all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
freed/checked. I did dig into this a bit and it appears that ..
(Please do correct me if I am wrong!! )

Powerpc using cpumask_var_t for all of the percpu variables. And it dont seem
to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

So from include/linux/cpumask.h

typedef struct cpumask cpumask_var_t[1];
and
zalloc_cpumask_var_node ends up being cpumask_clear

So I think we are historically we seem to assume we are always
!CPUMASK_OFFSTACK and hence we dont need to check for return as well as
free..

I would look forward to your comments on how we should handle this going
forward. But I would keep this the same for this patchset.

One of the questions that I have is if we most likely are to be in
!CONFIG_CPUMASK_OFFSTACK, then should be migrate to cpumask_t for percpu
variables. 
 
The reason being we end up using NR_CPU cpumask for each percpu cpumask
variable instead of using NR_CPU cpumask_t pointer.

> And even after this change there's still an error path that doesn't free
> it, isn't there?
> 
> cheers
> 
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Anton Blanchard <anton@ozlabs.org>
> > Cc: Oliver O'Halloran <oohall@gmail.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@neuling.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Jordan Niethe <jniethe5@gmail.com>
> > Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 698000c7f76f..dab96a1203ec 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
> >  	if (err)
> >  		goto out;
> >  
> > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -				GFP_KERNEL,
> > -				cpu_to_node(cpu));
> > -
> >  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> >  
> >  	if (unlikely(cpu_group_start == -1)) {
> > @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		goto out;
> >  	}
> >  
> > +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > +				GFP_KERNEL, cpu_to_node(cpu));
> > +
> >  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> >  		int i_group_start = get_cpu_thread_group_start(i, &tg);
> >  
> > -- 
> > 2.17.1

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
From: Srikar Dronamraju @ 2020-07-31  9:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <87wo2k3yeq.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 18:02:21]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Lookup the coregroup id from the associativity array.
> 

Thanks Michael for all your comments and inputs.

> It's slightly strange that this is called in patch 9, but only properly
> implemented here in patch 10.
> 
> I'm not saying you have to squash them together, but it would be good if
> the change log for patch 9 mentioned that a subsequent commit will
> complete the implementation and how that affects the behaviour.
> 

I probably got influenced by few LKML community members who always add a
stub and implement the gory details in a subsequent patch. I will surely add
the change log in patch 9 about the subsequent patches.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH] KVM: PPC: Book3S HV: fix a oops in kvmppc_uvmem_page_free()
From: Bharata B Rao @ 2020-07-31 10:02 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <20200731083700.GB5787@oc0525413822.ibm.com>

On Fri, Jul 31, 2020 at 01:37:00AM -0700, Ram Pai wrote:
> On Fri, Jul 31, 2020 at 09:59:40AM +0530, Bharata B Rao wrote:
> > On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> > In our case, device pages that are in use are always associated with a valid
> > pvt member. See kvmppc_uvmem_get_page() which returns failure if it
> > runs out of device pfns and that will result in proper failure of
> > page-in calls.
> 
> looked at the code, and yes that code path looks correct. So my
> reasoning behind the root cause of this bug is incorrect. However the
> bug is surfacing and there must be a reason.
> 
> > 
> > For the case where we run out of device pfns, migrate_vma_finalize() will
> > restore the original PTE and will not replace the PTE with device private PTE.
> > 
> > Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
> > called for non-device-private pages.
> 
> Yes. it should not be called. But as seen above in the stack trace, it is called. 
> 
> What would cause the HMM to call ->page_free() on a page that is not
> associated with that device's pfn?

I believe it is being called for a device private page, you can verify
it when you hit it next time?

> 
> > 
> > This could be a use-after-free case possibly arising out of the new state
> > changes in HV. If so, this fix will only mask the bug and not address the
> > original problem.
> 
> I can verify by rerunning the tests, without the new state changes. But
> I do not see how those changes can cause this fault?
> 
> This could also be caused by a duplicate ->page_free() call due to some
> bug in the migrate_page path? Could there be a race between
> migrate_page() and a page_fault ?
> 
> 
> Regardless, kvmppc_uvmem_page_free() needs to be fixed. It should not
> access contents of pvt, without verifing pvt is valid.

We don't expect pvt to be NULL here. Checking for NULL and returning
isn't the right fix, I think.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
From: Fabio Estevam @ 2020-07-31 10:38 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: Linux-ALSA, Timur Tabi, Xiubo Li, linux-kernel, Takashi Iwai,
	Liam Girdwood, Jaroslav Kysela, Nicolin Chen, Mark Brown,
	linuxppc-dev
In-Reply-To: <1596176895-28724-1-git-send-email-shengjiu.wang@nxp.com>

Hi Shengjiu,

On Fri, Jul 31, 2020 at 3:32 AM Shengjiu Wang <shengjiu.wang@nxp.com> wrote:
>
> The fifo_depth is 64 on i.MX8QM/i.MX8QXP, 128 on i.MX8MQ, 16 on
> i.MX7ULP.
>
> Original FSL_SAI_CR1_RFW_MASK value 0x1F is not suitable for
> these platform, the FIFO watermark mask should be updated
> according to the fifo_depth.
>
> Fixes: a860fac42097 ("ASoC: fsl_sai: Add support for imx7ulp/imx8mq")
> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>

Reviewed-by: Fabio Estevam <festevam@gmail.com>

^ permalink raw reply

* [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
From: Aneesh Kumar K.V @ 2020-07-31 11:19 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V, Srikar Dronamraju

We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
node numbers. These device tree properties are firmware indicated grouping of
resources based on their hierarchy in the platform. These numbers (group id) are
not sequential and hypervisor/firmware can follow different numbering schemes.
For ex: on powernv platforms, we group them in the below order.

 *     - CCM node ID
 *     - HW card ID
 *     - HW module ID
 *     - Chip ID
 *     - Core ID

Based on ibm,associativity-reference-points we use one of the above group ids as
Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
in Linux reporting non-linear NUMA node id and which also results in Linux
reporting empty node 0 NUMA nodes.

This can  be resolved by mapping the firmware provided group id to a logical Linux
NUMA id. In this patch, we do this only for pseries platforms considering the
firmware group id is a virtualized entity and users would not have drawn any
conclusion based on the Linux Numa Node id.

On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
id, we keep the existing Linux NUMA node id numbering.

Before Fix:
 # numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 50912 MB
node 1 free: 45248 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

after fix
 # numactl  -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 50912 MB
node 0 free: 49724 MB
node distances:
node   0
  0:  10

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h |  1 +
 arch/powerpc/mm/numa.c              | 49 ++++++++++++++++++++++-------
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..15b0424a27a8 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -118,5 +118,6 @@ int get_physical_package_id(int cpu);
 #endif
 #endif
 
+int firmware_group_id_to_nid(int firmware_gid);
 #endif /* __KERNEL__ */
 #endif	/* _ASM_POWERPC_TOPOLOGY_H */
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index e437a9ac4956..6c659aada55b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
 	}
 }
 
+static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
+
+int firmware_group_id_to_nid(int firmware_gid)
+{
+	static int last_nid = 0;
+
+	/*
+	 * For PowerNV we don't change the node id. This helps to avoid
+	 * confusion w.r.t the expected node ids. On pseries, node numbers
+	 * are virtualized. Hence do logical node id for pseries.
+	 */
+	if (!firmware_has_feature(FW_FEATURE_LPAR))
+		return firmware_gid;
+
+	if (firmware_gid ==  -1)
+		return NUMA_NO_NODE;
+
+	if (nid_map[firmware_gid] == NUMA_NO_NODE)
+		nid_map[firmware_gid] = last_nid++;
+
+	return nid_map[firmware_gid];
+}
+
 /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
 {
 	int nid = NUMA_NO_NODE;
+	int firmware_gid = -1;
 
 	if (!numa_enabled)
 		goto out;
 
 	if (of_read_number(associativity, 1) >= min_common_depth)
-		nid = of_read_number(&associativity[min_common_depth], 1);
+		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= MAX_NUMNODES)
-		nid = NUMA_NO_NODE;
+	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
+		firmware_gid = -1;
+
+	nid = firmware_group_id_to_nid(firmware_gid);
 
 	if (nid > 0 &&
-		of_read_number(associativity, 1) >= distance_ref_points_depth) {
+	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
 		/*
 		 * Skip the length field and send start of associativity array
 		 */
@@ -432,24 +458,25 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
 static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 {
 	struct assoc_arrays aa = { .arrays = NULL };
-	int default_nid = NUMA_NO_NODE;
-	int nid = default_nid;
+	int nid = NUMA_NO_NODE, firmware_gid;
 	int rc, index;
 
 	if ((min_common_depth < 0) || !numa_enabled)
-		return default_nid;
+		return NUMA_NO_NODE;
 
 	rc = of_get_assoc_arrays(&aa);
 	if (rc)
-		return default_nid;
+		return NUMA_NO_NODE;
 
 	if (min_common_depth <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
 		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
-		nid = of_read_number(&aa.arrays[index], 1);
+		firmware_gid = of_read_number(&aa.arrays[index], 1);
 
-		if (nid == 0xffff || nid >= MAX_NUMNODES)
-			nid = default_nid;
+		if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
+			firmware_gid = -1;
+
+		nid = firmware_group_id_to_nid(firmware_gid);
 
 		if (nid > 0) {
 			index = lmb->aa_index * aa.array_sz;
-- 
2.26.2


^ permalink raw reply related

* [RFC PATCH 2/2] powerpc/powernv/cpufreq: Don't assume chip id is same as Linux node id
From: Aneesh Kumar K.V @ 2020-07-31 11:22 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V, Srikar Dronamraju
In-Reply-To: <20200731111916.243569-1-aneesh.kumar@linux.ibm.com>

On PowerNV platforms we always have 1:1 mapping between chip ID and
firmware group id. Use the helper to convert firmware group id to
node id instead of directly using chip ID as Linux node id.

NOTE: This doesn't have any functional change. On PowerNV platforms
we continue to have 1:1 mapping between firmware group id and
Linux node id.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/cpufreq/powernv-cpufreq.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 8646eb197cd9..ca82b6ac0989 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -27,6 +27,7 @@
 #include <asm/reg.h>
 #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
 #include <asm/opal.h>
+#include <asm/topology.h>
 #include <linux/timer.h>
 
 #define POWERNV_MAX_PSTATES_ORDER  8
@@ -1069,8 +1070,14 @@ static int init_chip_info(void)
 	}
 
 	for (i = 0; i < nr_chips; i++) {
+		unsigned int nid;
+
 		chips[i].id = chip[i];
-		cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
+		/*
+		 * On powervn platforms firmware group id is same as chipd id.
+		 */
+		nid = firmware_group_id_to_nid(chip[i]);
+		cpumask_copy(&chips[i].mask, cpumask_of_node(nid));
 		INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
 		for_each_cpu(cpu, &chips[i].mask)
 			per_cpu(chip_info, cpu) =  &chips[i];
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
From: Michael Ellerman @ 2020-07-31 11:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200731095808.GB18776@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 18:02:21]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Lookup the coregroup id from the associativity array.
>
> Thanks Michael for all your comments and inputs.
>
>> It's slightly strange that this is called in patch 9, but only properly
>> implemented here in patch 10.
>> 
>> I'm not saying you have to squash them together, but it would be good if
>> the change log for patch 9 mentioned that a subsequent commit will
>> complete the implementation and how that affects the behaviour.
>
> I probably got influenced by few LKML community members who always add a
> stub and implement the gory details in a subsequent patch. I will surely add
> the change log in patch 9 about the subsequent patches.

That's OK, it's a valid way to do things, and can be good for keeping
the size of individual patches down to make them easier to review.

But yeah a mention in the change log of the preceeding patch is helpful
for anyone looking at that commit on its own in the future.

cheers

^ permalink raw reply

* Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
From: Michael Ellerman @ 2020-07-31 11:31 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200731091859.GG14603@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:49:55]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Add support for grouping cores based on the device-tree classification.
>> > - The last domain in the associativity domains always refers to the
>> > core.
>> > - If primary reference domain happens to be the penultimate domain in
>> > the associativity domains device-tree property, then there are no
>> > coregroups. However if its not a penultimate domain, then there are
>> > coregroups. There can be more than one coregroup. For now we would be
>> > interested in the last or the smallest coregroups.
>> 
>> This still doesn't tell me what a coregroup actually represents.
>> 
>> I get that it's a grouping of cores, and that the device tree specifies
>> it for us, but grouping based on what?
>
> We have just abstracted the fact that we are creating a sub-group of cores
> within a DIE. We are limiting to one sub-group per core. However this would
> allow the firmware the flexibility to vary the grouping. Once the firmware
> starts using this group, we could add more code to detect the type of
> grouping and adjust the sd domain flags accordingly.

OK. That's good info to have in the change log.

>> I think the answer is we aren't being told by firmware, it's just a
>> grouping based on some opaque performance characteristic and we just
>> have to take that as given.
>> 
>
> This is partially true. At this time, we dont have firmwares that can
> exploit this code. Once the firmwares start using this grouping, we could
> add more code to align the grouping to the scheduler topology.
>
>> But please explain that clearly in the change log and the code comments.
>> 
>
> Okay, I will do the needful.

Thanks.

cheers

^ permalink raw reply

* [PATCH 1/2] powerpc/vmemmap: Fix memory leak with vmemmap list allocation failures.
From: Aneesh Kumar K.V @ 2020-07-31 11:34 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V

If we fail to allocate vmemmap list, we don't keep track of allocated
vmemmap block buf. Hence on section deactivate we skip vmemmap block
buf free. This results in memory leak.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/init_64.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 152aa0200cef..8a198504c0e1 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -162,16 +162,16 @@ static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
 	return next++;
 }
 
-static __meminit void vmemmap_list_populate(unsigned long phys,
-					    unsigned long start,
-					    int node)
+static __meminit int vmemmap_list_populate(unsigned long phys,
+					   unsigned long start,
+					   int node)
 {
 	struct vmemmap_backing *vmem_back;
 
 	vmem_back = vmemmap_list_alloc(node);
 	if (unlikely(!vmem_back)) {
-		WARN_ON(1);
-		return;
+		pr_debug("vmemap list allocation failed\n");
+		return -ENOMEM;
 	}
 
 	vmem_back->phys = phys;
@@ -179,6 +179,7 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
 	vmem_back->list = vmemmap_list;
 
 	vmemmap_list = vmem_back;
+	return 0;
 }
 
 static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
@@ -199,6 +200,7 @@ static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long star
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
+	bool altmap_alloc;
 	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
 	/* Align to the page size of the linear mapping. */
@@ -228,13 +230,32 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			p = altmap_alloc_block_buf(page_size, altmap);
 			if (!p)
 				pr_debug("altmap block allocation failed, falling back to system memory");
+			else
+				altmap_alloc = true;
 		}
-		if (!p)
+		if (!p) {
 			p = vmemmap_alloc_block_buf(page_size, node);
+			altmap_alloc = false;
+		}
 		if (!p)
 			return -ENOMEM;
 
-		vmemmap_list_populate(__pa(p), start, node);
+		if (vmemmap_list_populate(__pa(p), start, node)) {
+			/*
+			 * If we don't populate vmemap list, we don't have
+			 * the ability to free the allocated vmemmap
+			 * pages in section_deactivate. Hence free them
+			 * here.
+			 */
+			int nr_pfns = page_size >> PAGE_SHIFT;
+			unsigned long page_order = get_order(page_size);
+
+			if (altmap_alloc)
+				vmem_altmap_free(altmap, nr_pfns);
+			else
+				free_pages((unsigned long)p, page_order);
+			return -ENOMEM;
+		}
 
 		pr_debug("      * %016lx..%016lx allocated at %p\n",
 			 start, start + page_size, p);
-- 
2.26.2


^ permalink raw reply related

* [PATCH 2/2] powerpc/vmemmap: Don't warn if we don't find a mapping vmemmap list entry
From: Aneesh Kumar K.V @ 2020-07-31 11:35 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V
In-Reply-To: <20200731113500.248306-1-aneesh.kumar@linux.ibm.com>

Now that we are handling vmemmap list allocation failure correctly, don't
WARN in section deactivate when we don't find a mapping vmemmap list entry.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/init_64.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 8a198504c0e1..3ffe8b84aab2 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -285,10 +285,8 @@ static unsigned long vmemmap_list_free(unsigned long start)
 		vmem_back_prev = vmem_back;
 	}
 
-	if (unlikely(!vmem_back)) {
-		WARN_ON(1);
+	if (unlikely(!vmem_back))
 		return 0;
-	}
 
 	/* remove it from vmemmap_list */
 	if (vmem_back == vmemmap_list) /* remove head */
-- 
2.26.2


^ permalink raw reply related

* [PATCH v4 00/12] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed O. Bolarinwa @ 2020-07-31 11:02 UTC (permalink / raw)
  To: helgaas, Mike Marciniszyn, Dennis Dalessandro, Doug Ledford,
	Jason Gunthorpe, Arnd Bergmann, Greg Kroah-Hartman,
	David S. Miller, Kalle Valo, Jakub Kicinski, Rafael J. Wysocki,
	Len Brown, Russell Currey, Sam Bobroff, Oliver O'Halloran
  Cc: linuxppc-dev, linux-rdma, Saheed O. Bolarinwa, skhan,
	linux-wireless, linux-kernel, QCA ath9k Development, linux-acpi,
	netdev, linux-pci, bjorn, linux-kernel-mentees

v4 CHANGES:
- Drop uses of pcie_capability_read_*() return value. This related to
  [1] which is pointing towards making the accessors return void.
- Remove patches found to be unnecessary
- Reword some commit messages

v3 CHANGES:
- Split previous PATCH 6/13 into two : PATCH 6/14 and PATCH 7/14
- Fix commit message of PATCH 5/14
- Update Patch numbering and Commit messages
- Add 'Acked by Greg KH' to PATCH 2/14
- Add PATCH version

v2 CHANGES:
- Fix missing comma, causing the email cc error
- Fix typos and numbering errors in commit messages
- Add commit message to 13/13
- Add two more patches: PATCH 3/13 and PATCH 4/13

MERGING:
- Patch 6/12 depends on Patch 5/12. However Patch 5/12 has no dependency.
  Please, merge PATCH 6/12 only after Patch 5/12.
- Patch 12/12 depends on all preceding patches. Please merge Patch 12/12
  only after other patches in this series have been merged.
- All other patches have no dependencies besides those mentioned above and
  can be merge individually.

PATCH 5/12:
Set the default case in the switch-statement to set status
to "Power On".

PATCH 1/12 to 11/12:
Use the value read by pcie_capability_read_*() to determine success or
failure. This is done by checking if it is ~0, while maintaining the
functions' behaviour. This ensures that the changes in PATCH 12/12 does
not introduce any bug.

PATCH 12/12:
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour if further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
	 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*() 
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of  pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if 
dev->error_state = pci_channel_io_perm_failure (i.e. 
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the 
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Check the return value of pcie_capability_read_dword() to ensure success
and avoid bug as a result of Patch 14/14.
Remove the reset of *val to 0 when pci_read_config_*() fails.

[1] https://lore.kernel.org/linux-pci/20200714234625.GA428442@bjorn-Precision-5520/


Saheed O. Bolarinwa (12):
  IB/hfi1: Check if pcie_capability_read_*() reads ~0
  misc: rtsx: Check if pcie_capability_read_*() reads ~0
  ath9k: Check if pcie_capability_read_*() reads ~0
  iwlegacy: Check if pcie_capability_read_*() reads ~0
  PCI: pciehp: Set "Power On" as the default get_power_status
  PCI: pciehp: Check if pcie_capability_read_*() reads ~0
  PCI/ACPI: Check if pcie_capability_read_*() reads ~0
  PCI: Check if pcie_capability_read_*() reads ~0
  PCI/PM: Check if pcie_capability_read_*() reads ~0
  PCI/AER: Check if pcie_capability_read_*() reads ~0
  PCI/ASPM: Check if pcie_capability_read_*() reads ~0
  PCI: Remove '*val = 0' from pcie_capability_read_*()

 drivers/infiniband/hw/hfi1/aspm.c            |  6 ++--
 drivers/misc/cardreader/rts5227.c            |  2 +-
 drivers/misc/cardreader/rts5249.c            |  2 +-
 drivers/misc/cardreader/rts5260.c            |  2 +-
 drivers/misc/cardreader/rts5261.c            |  2 +-
 drivers/net/wireless/ath/ath9k/pci.c         |  3 +-
 drivers/net/wireless/intel/iwlegacy/common.c |  2 +-
 drivers/pci/access.c                         | 14 --------
 drivers/pci/hotplug/pciehp_hpc.c             | 13 +++++---
 drivers/pci/pci-acpi.c                       |  4 +--
 drivers/pci/pci.c                            | 34 ++++++++++++++------
 drivers/pci/pcie/aer.c                       |  2 +-
 drivers/pci/pcie/aspm.c                      | 10 +++---
 drivers/pci/probe.c                          | 12 +++----
 14 files changed, 56 insertions(+), 52 deletions(-)

-- 
2.18.4


^ permalink raw reply

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
From: Michael Ellerman @ 2020-07-31 12:14 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200731094938.GA18776@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:52:15]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > If allocated earlier and the search fails, then cpumask need to be
>> > freed. However cpu_l1_cache_map can be allocated after we search thread
>> > group.
>> 
>> It's not freed anywhere AFAICS?
>
> Yes, its never freed. Infact we are never checking if
> zalloc_cpumask_var_node fails. Its not just this cpumask, but historically
> all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
> freed/checked. I did dig into this a bit and it appears that ..
> (Please do correct me if I am wrong!! )

That's correct.

> Powerpc using cpumask_var_t for all of the percpu variables. And it dont seem
> to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

I remember Rusty adding that code, but I don't know if we ever
considered enabling CPUMASK_OFFSTACK.

Probably we meant to but never got around to doing it.

> So from include/linux/cpumask.h
>
> typedef struct cpumask cpumask_var_t[1];
> and
> zalloc_cpumask_var_node ends up being cpumask_clear
>
> So I think we are historically we seem to assume we are always
> !CPUMASK_OFFSTACK and hence we dont need to check for return as well as
> free..

Right.

> I would look forward to your comments on how we should handle this going
> forward. But I would keep this the same for this patchset.

Agreed, just clarify in the change log that it's not freed at the moment
because of CPU_MASK_OFFSTACK=n

> One of the questions that I have is if we most likely are to be in
> !CONFIG_CPUMASK_OFFSTACK, then should be migrate to cpumask_t for percpu
> variables. 

I don't think so, cpumask_t is semi-deprecated AIUI.
  
> The reason being we end up using NR_CPU cpumask for each percpu cpumask
> variable instead of using NR_CPU cpumask_t pointer.

Our current defconfigs have NR_CPUS=2048, which is probably just small
enough to continue using OFFSTACK=n.

But we allow configuring NR_CPUS up to 8192, which surely would need
OFFSTACK=y in order to work.

So I think we need to stick with cpumask_var_t, but we should test with
OFFSTACK=y, and should probably be a bit more careful with checking the
allocations succeed.

And then we should select OFFSTACK=y for NR_CPUS above some threshold.

cheers

^ permalink raw reply

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
From: Michael Ellerman @ 2020-07-31 12:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200731092901.GH14603@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:45:37]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Currently "CACHE" domain happens to be the 2nd sched domain as per
>> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
>> > same as SMT domain. However we could generalize this domain such that it
>> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
>> >
>> > While setting up the "CACHE" domain, check if shared_cache is already
>> > set.
>> 
>> PeterZ asked for some overview of what you're doing and why, you
>> responded to his mail, but I was expecting to see that text incorporated
>> here somewhere.
>> 
>
> Okay, do you want that as part of the code or documentation dir or the
> changelog?

I guess a comment is best, as that's most likely to be seen by people
looking at the code in future.

A little bit of overview in the change log is also good.

>> He also asked for some comments, which I would also like to see.
>> 
>> 
>> I'm also not clear why we want to rename it to "bigcore", that's not a
>> commonly understood term, I don't think it's clear to new readers what
>> it means.
>> 
>> Leaving it as the shared cache domain, and having a comment mentioning
>> that "bigcores" share a cache, would be clearer I think.
>> 
>
> Today, Shared cache is equal to Big Core. However in not too distant future,
> Shared cache domain and Big Core may not be the same. For example lets
> assume that L2 cache were to Shrink per small core with the firmware
> exposing the core as a bigcore. Then with the current design, we have a SMT
> == SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
> thread level. Keeping the Bigcore as a domain and mapping the shared
> cache, (I am resetting the domain name as CACHE if BIGCORE==SHARED_CACHE),
> helps us in this scenario.

Yeah OK.

In that scenario it's not really clear what the 8 thread level domain
expresses, if the two "halves" of the bigcore have separate L2s. But
presumably there is still some benefit to exposing it.

cheers

^ permalink raw reply

* [PATCH v4 10/12] PCI/AER: Check if pcie_capability_read_*() reads ~0
From: Saheed O. Bolarinwa @ 2020-07-31 11:43 UTC (permalink / raw)
  To: helgaas, Russell Currey, Sam Bobroff, Oliver O'Halloran
  Cc: Saheed O. Bolarinwa, skhan, linux-kernel, linux-pci, bjorn,
	linuxppc-dev, linux-kernel-mentees
In-Reply-To: <20200731114329.100848-1-refactormyself@gmail.com>

On failure pcie_capability_read_*() sets it's last parameter, val
to 0. However, with Patch 12/12, it is possible that val is set
to ~0 on failure. This would introduce a bug because
(x & x) == (~0 & x).

Since ~0 is an invalid value in here,

Add extra check for ~0 to the if condition to confirm failure.

Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
---
 drivers/pci/pcie/aer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3acf56683915..dbeabc370efc 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -829,7 +829,7 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 
 	/* Check if AER is enabled */
 	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &reg16);
-	if (!(reg16 & PCI_EXP_AER_FLAGS))
+	if ((reg16 == (u16)~0) || !(reg16 & PCI_EXP_AER_FLAGS))
 		return false;
 
 	if (!aer)
-- 
2.18.4


^ permalink raw reply related

* [PATCH v4 12/12] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed O. Bolarinwa @ 2020-07-31 11:43 UTC (permalink / raw)
  To: helgaas
  Cc: Greg Kroah-Hartman, linux-wireless, QCA ath9k Development,
	Oliver O'Halloran, linux-acpi, linux-rdma, Jason Gunthorpe,
	Doug Ledford, linux-pci, Jakub Kicinski, linux-kernel-mentees,
	Len Brown, Arnd Bergmann, skhan, bjorn, Kalle Valo,
	Mike Marciniszyn, Sam Bobroff, Saheed O. Bolarinwa,
	Dennis Dalessandro, Rafael J. Wysocki, linux-kernel, netdev,
	linuxppc-dev, David S. Miller
In-Reply-To: <20200731114329.100848-1-refactormyself@gmail.com>

There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour if further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
	 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*()
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of  pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if
dev->error_state = pci_channel_io_perm_failure (i.e.
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Remove the reset of *val to 0 when pci_read_config_*() fails.

Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
---
 drivers/pci/access.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 79c4a2ef269a..ec95edbb1ac8 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -413,13 +413,6 @@ int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *val)
 
 	if (pcie_capability_reg_implemented(dev, pos)) {
 		ret = pci_read_config_word(dev, pci_pcie_cap(dev) + pos, val);
-		/*
-		 * Reset *val to 0 if pci_read_config_word() fails, it may
-		 * have been written as 0xFFFF if hardware error happens
-		 * during pci_read_config_word().
-		 */
-		if (ret)
-			*val = 0;
 		return ret;
 	}
 
@@ -448,13 +441,6 @@ int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 *val)
 
 	if (pcie_capability_reg_implemented(dev, pos)) {
 		ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
-		/*
-		 * Reset *val to 0 if pci_read_config_dword() fails, it may
-		 * have been written as 0xFFFFFFFF if hardware error happens
-		 * during pci_read_config_dword().
-		 */
-		if (ret)
-			*val = 0;
 		return ret;
 	}
 
-- 
2.18.4


^ permalink raw reply related

* [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-8 tag
From: Michael Ellerman @ 2020-07-31 13:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Aneesh Kumar K.V, linuxppc-dev, linux-kernel, npiggin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull one more powerpc fix for 5.8:

The following changes since commit f0479c4bcbd92d1a457d4a43bcab79f29d11334a:

  selftests/powerpc: Use proper error code to check fault address (2020-07-15 23:10:17 +1000)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-8

for you to fetch changes up to 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74:

  powerpc/64s/hash: Fix hash_preload running with interrupts enabled (2020-07-27 17:02:09 +1000)

- ------------------------------------------------------------------
powerpc fixes for 5.8 #8

Fix a bug introduced by the changes we made to lockless page table walking this
cycle. When using the hash MMU, and perf with callchain recording, we can
deadlock if the PMI interrupts a hash fault, and the callchain recording then
takes a hash fault on the same page.

Thanks to:
  Nicholas Piggin, Aneesh Kumar K.V, Anton Blanchard, Athira Rajeev.

- ------------------------------------------------------------------
Nicholas Piggin (1):
      powerpc/64s/hash: Fix hash_preload running with interrupts enabled


 arch/powerpc/kernel/exceptions-64s.S  | 14 ++++++++---
 arch/powerpc/mm/book3s64/hash_utils.c | 25 ++++++++++++++++++++
 arch/powerpc/perf/core-book3s.c       |  6 +++++
 3 files changed, 42 insertions(+), 3 deletions(-)
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8kFsQACgkQUevqPMjh
pYAkbg/9HzJB8KbjAVJpoXHOf9+aHylhZKU2pcVtdoYE07VV6JabWnlI5flvTSEY
3Dvr3h/X0k2Z3n52+s1kYlW27ErU+o9Hbz1j4O41Ardiy14GmQ02FSi5iXjhKxuk
AjX2MN76n1Te+1ee2Ib/Ubu8KOrbxLIya223tJ0TNu7RWuYVaCO/lNXPPTwm2JWI
s9lryRAHNc4qkPMytsWi/gzNml3IP6cqnGVt1H5fmnOwQXUYbSjdmhBa7AwjCsA1
xJBMkxBxWOnUSknRCxaCCUTQ3sD+sJMzp6verlwEIb+HRh08iJyVzuWCpT9m0ZOG
5F3CvhIxwLatDsJ9N5EeGI2V4qBstBlHEUJJAjkwKgZxzmEm7j9H+ItAc421eo2A
t/SYJTK4JltllCDyB22jCxdEgJh+opv8rki7U+INI01I8gHzqsr/0CHleGtMN5Cq
fUBf25LOMUCpjeAO207j4DzeY6rTXz9H07XWrR4su//4S9bNkYUYpFfeE/3Kwwdq
vZCxjm+fxxLPyAW+E2D2EGsyzIJootrAzly5T7gun8qwgCSgoCxI+YV9/DLmYaGf
OjcG6+X58sfn1FTc8cD6udKCIMKh0JmSccZRqw1Ijp4EiGMnDZUbP9jg/aSciD8a
+oco9QGvvxYQTMut4XUkXtqqZqaTISgWFD/QB9f4KGJ1Rs8tW5Y=
=oaTB
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Michael Ellerman @ 2020-07-31 13:16 UTC (permalink / raw)
  To: Nathan Lynch, Laurent Dufour; +Cc: tyreld, cheloha, linuxppc-dev
In-Reply-To: <875za511z6.fsf@linux.ibm.com>

Nathan Lynch <nathanl@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>>> Laurent Dufour <ldufour@linux.ibm.com> writes:
>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>> they explicitly yield.
>>>>> 
>>>>> Rather than placing cond_resched() calls within various
>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>> expression of the loop macro itself so users can't omit it.
>>>>
>>>> Is that not too much to call cond_resched() on every LMB?
>>>>
>>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>>
>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>> this context.
>>
>> Hmm, mostly.
>>
>> But there are quite a few cases like drmem_update_dt_v1():
>>
>> 	for_each_drmem_lmb(lmb) {
>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>
>> 		dr_cell++;
>> 	}
>>
>> Which will compile to a pretty tight loop at the moment.
>>
>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>
>> And although the actual TIF check is cheap the function call to do it is
>> not free.
>>
>> So I worry this is going to make some of those long loops take even
>> longer.
>
> That's fair, and I was wrong - some of the loop bodies are relatively
> simple, not doing allocations or taking locks, etc.
>
> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().

If we did that, how many call-sites would need converting?
Is it ~2 or ~20 or ~200?

> On the other hand... it's probably not too strong to say that the
> drmem/hotplug code is in crisis with respect to correctness and
> algorithmic complexity, so those are my overriding concerns right
> now. Yes, this change will pessimize loops that are reinitializing the
> entire drmem_lmb array on every DLPAR operation, but:
>
> 1. it doesn't make any user of for_each_drmem_lmb() less correct;
> 2. why is this code doing that in the first place, other than to
>    accommodate a poor data structure choice?
>
> The duration of the system calls where this code runs are measured in
> minutes or hours on large configurations because of all the behaviors
> that are at best O(n) with the amount of memory assigned to the
> partition. For simplicity's sake I'd rather defer lower-level
> performance considerations like this until the drmem data structures'
> awful lookup properties are fixed -- hopefully in the 5.10 timeframe.

Yeah fair enough.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Nathan Lynch @ 2020-07-31 13:52 UTC (permalink / raw)
  To: Michael Ellerman, Laurent Dufour; +Cc: tyreld, cheloha, linuxppc-dev
In-Reply-To: <87ft974yf7.fsf@mpe.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:

> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> Michael Ellerman <mpe@ellerman.id.au> writes:
>>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>>>> Laurent Dufour <ldufour@linux.ibm.com> writes:
>>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>>> they explicitly yield.
>>>>>> 
>>>>>> Rather than placing cond_resched() calls within various
>>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>>> expression of the loop macro itself so users can't omit it.
>>>>>
>>>>> Is that not too much to call cond_resched() on every LMB?
>>>>>
>>>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>>>
>>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>>> this context.
>>>
>>> Hmm, mostly.
>>>
>>> But there are quite a few cases like drmem_update_dt_v1():
>>>
>>> 	for_each_drmem_lmb(lmb) {
>>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>>
>>> 		dr_cell++;
>>> 	}
>>>
>>> Which will compile to a pretty tight loop at the moment.
>>>
>>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>>
>>> And although the actual TIF check is cheap the function call to do it is
>>> not free.
>>>
>>> So I worry this is going to make some of those long loops take even
>>> longer.
>>
>> That's fair, and I was wrong - some of the loop bodies are relatively
>> simple, not doing allocations or taking locks, etc.
>>
>> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
>> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().
>
> If we did that, how many call-sites would need converting?
> Is it ~2 or ~20 or ~200?

At a glance I would convert 15-20 out of the 24 users in the tree I'm
looking at. Let me know if I should do a v2 with that approach.

^ permalink raw reply

* Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-8 tag
From: pr-tracker-bot @ 2020-07-31 16:45 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linuxppc-dev, Linus Torvalds, linux-kernel, npiggin,
	Aneesh Kumar K.V
In-Reply-To: <87ime34yya.fsf@mpe.ellerman.id.au>

The pull request you sent on Fri, 31 Jul 2020 23:05:17 +1000:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-8

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/deacdb3e3979979016fcd0ffd518c320a62ad166

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker

^ permalink raw reply

* Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc
From: Athira Rajeev @ 2020-07-31 17:34 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ravi Bangoria, Michael Neuling, maddy, kjain,
	Arnaldo Carvalho de Melo, Jiri Olsa, linuxppc-dev
In-Reply-To: <20200730195048.GA1484375@krava>



> On 31-Jul-2020, at 1:20 AM, Jiri Olsa <jolsa@redhat.com> wrote:
> 
> On Thu, Jul 30, 2020 at 01:24:40PM +0530, Athira Rajeev wrote:
>> 
>> 
>>> On 27-Jul-2020, at 10:46 PM, Athira Rajeev <atrajeev@linux.vnet.ibm.com> wrote:
>>> 
>>> Patch set to add support for perf extended register capability in
>>> powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to
>>> indicate the PMU which support extended registers. The generic code
>>> define the mask of extended registers as 0 for non supported architectures.
>>> 
>>> Patches 1 and 2 are the kernel side changes needed to include
>>> base support for extended regs in powerpc and in power10.
>>> Patches 3 and 4 are the perf tools side changes needed to support the
>>> extended registers.
>>> 
>> 
>> Hi Arnaldo, Jiri
>> 
>> please let me know if you have any comments/suggestions on this patch series to add support for perf extended regs.
> 
> hi,
> can't really tell for powerpc, but in general
> perf tool changes look ok
> 

Hi Jiri,
Thanks for checking the patchset.

Athira.

> jirka


^ permalink raw reply

* Re: [PATCH] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
From: Mark Brown @ 2020-07-31 18:54 UTC (permalink / raw)
  To: alsa-devel, tiwai, nicoleotsuka, linux-kernel, festevam,
	Shengjiu Wang, linuxppc-dev, Xiubo.Lee, lgirdwood, perex, timur
In-Reply-To: <1596176895-28724-1-git-send-email-shengjiu.wang@nxp.com>

On Fri, 31 Jul 2020 14:28:15 +0800, Shengjiu Wang wrote:
> The fifo_depth is 64 on i.MX8QM/i.MX8QXP, 128 on i.MX8MQ, 16 on
> i.MX7ULP.
> 
> Original FSL_SAI_CR1_RFW_MASK value 0x1F is not suitable for
> these platform, the FIFO watermark mask should be updated
> according to the fifo_depth.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
      commit: 5aef1ff2397d021f93d874b57dff032fdfac73de

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
From: Srikar Dronamraju @ 2020-08-01  5:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, linuxppc-dev
In-Reply-To: <20200731111916.243569-1-aneesh.kumar@linux.ibm.com>

* Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:

> We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
> node numbers. These device tree properties are firmware indicated grouping of
> resources based on their hierarchy in the platform. These numbers (group id) are
> not sequential and hypervisor/firmware can follow different numbering schemes.
> For ex: on powernv platforms, we group them in the below order.
> 
>  *     - CCM node ID
>  *     - HW card ID
>  *     - HW module ID
>  *     - Chip ID
>  *     - Core ID
> 
> Based on ibm,associativity-reference-points we use one of the above group ids as
> Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
> in Linux reporting non-linear NUMA node id and which also results in Linux
> reporting empty node 0 NUMA nodes.
> 

If its just to eliminate node 0, then we have 2 other probably better
solutions.
1. Dont mark node 0 as spl (currently still in mm-tree and a result in
linux-next)
2. powerpc specific: explicitly clear node 0 during numa bringup.

> This can  be resolved by mapping the firmware provided group id to a logical Linux
> NUMA id. In this patch, we do this only for pseries platforms considering the

On PowerVM, as you would know the nid is already a logical or a flattened
chip-id and not the actual hardware chip-id.

> firmware group id is a virtualized entity and users would not have drawn any
> conclusion based on the Linux Numa Node id.
> 
> On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
> id, we keep the existing Linux NUMA node id numbering.
> 
> Before Fix:
>  # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 1 size: 50912 MB
> node 1 free: 45248 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> 
> after fix
>  # numactl  -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 0 size: 50912 MB
> node 0 free: 49724 MB
> node distances:
> node   0
>   0:  10
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/topology.h |  1 +
>  arch/powerpc/mm/numa.c              | 49 ++++++++++++++++++++++-------
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..15b0424a27a8 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -118,5 +118,6 @@ int get_physical_package_id(int cpu);
>  #endif
>  #endif
> 
> +int firmware_group_id_to_nid(int firmware_gid);
>  #endif /* __KERNEL__ */
>  #endif	/* _ASM_POWERPC_TOPOLOGY_H */
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index e437a9ac4956..6c659aada55b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>  	}
>  }
> 
> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
> +
> +int firmware_group_id_to_nid(int firmware_gid)
> +{
> +	static int last_nid = 0;
> +
> +	/*
> +	 * For PowerNV we don't change the node id. This helps to avoid
> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
> +	 * are virtualized. Hence do logical node id for pseries.
> +	 */
> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
> +		return firmware_gid;
> +
> +	if (firmware_gid ==  -1)
> +		return NUMA_NO_NODE;
> +
> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
> +		nid_map[firmware_gid] = last_nid++;

How do we ensure 2 simultaneous firmware_group_id_to_nid() calls dont end up
at this place in parallel?

> +
> +	return nid_map[firmware_gid];
> +}
> +
>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>   * info is found.
>   */
>  static int associativity_to_nid(const __be32 *associativity)
>  {
>  	int nid = NUMA_NO_NODE;
> +	int firmware_gid = -1;
> 
>  	if (!numa_enabled)
>  		goto out;
> 
>  	if (of_read_number(associativity, 1) >= min_common_depth)
> -		nid = of_read_number(&associativity[min_common_depth], 1);
> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
> 
>  	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
> -		nid = NUMA_NO_NODE;
> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> +		firmware_gid = -1;

Lets assume two or more invocations of associativity_to_nid for the same
associativity, end up with -1, In each case aren't giving different
nids?


> +
> +	nid = firmware_group_id_to_nid(firmware_gid);
> 
>  	if (nid > 0 &&
> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
> +	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
>  		/*
>  		 * Skip the length field and send start of associativity array
>  		 */
> @@ -432,24 +458,25 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
>  static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  {
>  	struct assoc_arrays aa = { .arrays = NULL };
> -	int default_nid = NUMA_NO_NODE;
> -	int nid = default_nid;
> +	int nid = NUMA_NO_NODE, firmware_gid;
>  	int rc, index;
> 
>  	if ((min_common_depth < 0) || !numa_enabled)
> -		return default_nid;
> +		return NUMA_NO_NODE;
> 
>  	rc = of_get_assoc_arrays(&aa);
>  	if (rc)
> -		return default_nid;
> +		return NUMA_NO_NODE;

https://lore.kernel.org/linuxppc-dev/87lfjc1b5f.fsf@linux.ibm.com/t/#u

> 
>  	if (min_common_depth <= aa.array_sz &&
>  	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
>  		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
> -		nid = of_read_number(&aa.arrays[index], 1);
> +		firmware_gid = of_read_number(&aa.arrays[index], 1);
> 
> -		if (nid == 0xffff || nid >= MAX_NUMNODES)
> -			nid = default_nid;
> +		if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> +			firmware_gid = -1;

Same case as above, How do we ensure that we return unique nid for a
similar assoc_array?

> +
> +		nid = firmware_group_id_to_nid(firmware_gid);
> 
>  		if (nid > 0) {
>  			index = lmb->aa_index * aa.array_sz;
> -- 
> 2.26.2
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v2 15/16] powerpc/powernv/sriov: Make single PE mode a per-BAR setting
From: Nathan Chancellor @ 2020-08-01  6:18 UTC (permalink / raw)
  To: Oliver O'Halloran, Michael Ellerman; +Cc: clang-built-linux, linuxppc-dev
In-Reply-To: <20200722065715.1432738-15-oohall@gmail.com>

On Wed, Jul 22, 2020 at 04:57:14PM +1000, Oliver O'Halloran wrote:
> Using single PE BARs to map an SR-IOV BAR is really a choice about what
> strategy to use when mapping a BAR. It doesn't make much sense for this to
> be a global setting since a device might have one large BAR which needs to
> be mapped with single PE windows and another smaller BAR that can be mapped
> with a regular segmented window. Make the segmented vs single decision a
> per-BAR setting and clean up the logic that decides which mode to use.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> v2: Dropped unused total_vfs variables in pnv_pci_ioda_fixup_iov_resources()
>     Dropped bar_no from pnv_pci_iov_resource_alignment()
>     Minor re-wording of comments.
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 131 ++++++++++-----------
>  arch/powerpc/platforms/powernv/pci.h       |  11 +-
>  2 files changed, 73 insertions(+), 69 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> index ce8ad6851d73..76215d01405b 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -146,21 +146,17 @@
>  static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  {
>  	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
> -	const resource_size_t gate = phb->ioda.m64_segsize >> 2;
>  	struct resource *res;
>  	int i;
> -	resource_size_t size, total_vf_bar_sz;
> +	resource_size_t vf_bar_sz;
>  	struct pnv_iov_data *iov;
> -	int mul, total_vfs;
> +	int mul;
>  
>  	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
>  	if (!iov)
>  		goto disable_iov;
>  	pdev->dev.archdata.iov_data = iov;
> -
> -	total_vfs = pci_sriov_get_totalvfs(pdev);
>  	mul = phb->ioda.total_pe_num;
> -	total_vf_bar_sz = 0;
>  
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> @@ -173,50 +169,50 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  			goto disable_iov;
>  		}
>  
> -		total_vf_bar_sz += pci_iov_resource_size(pdev,
> -				i + PCI_IOV_RESOURCES);
> +		vf_bar_sz = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>  
>  		/*
> -		 * If bigger than quarter of M64 segment size, just round up
> -		 * power of two.
> +		 * Generally, one segmented M64 BAR maps one IOV BAR. However,
> +		 * if a VF BAR is too large we end up wasting a lot of space.
> +		 * If each VF needs more than 1/4 of the default m64 segment
> +		 * then each VF BAR should be mapped in single-PE mode to reduce
> +		 * the amount of space required. This does however limit the
> +		 * number of VFs we can support.
>  		 *
> -		 * Generally, one M64 BAR maps one IOV BAR. To avoid conflict
> -		 * with other devices, IOV BAR size is expanded to be
> -		 * (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
> -		 * segment size , the expanded size would equal to half of the
> -		 * whole M64 space size, which will exhaust the M64 Space and
> -		 * limit the system flexibility.  This is a design decision to
> -		 * set the boundary to quarter of the M64 segment size.
> +		 * The 1/4 limit is arbitrary and can be tweaked.
>  		 */
> -		if (total_vf_bar_sz > gate) {
> -			mul = roundup_pow_of_two(total_vfs);
> -			dev_info(&pdev->dev,
> -				"VF BAR Total IOV size %llx > %llx, roundup to %d VFs\n",
> -				total_vf_bar_sz, gate, mul);
> -			iov->m64_single_mode = true;
> -			break;
> -		}
> -	}
> +		if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
> +			/*
> +			 * On PHB3, the minimum size alignment of M64 BAR in
> +			 * single mode is 32MB. If this VF BAR is smaller than
> +			 * 32MB, but still too large for a segmented window
> +			 * then we can't map it and need to disable SR-IOV for
> +			 * this device.
> +			 */
> +			if (vf_bar_sz < SZ_32M) {
> +				pci_err(pdev, "VF BAR%d: %pR can't be mapped in single PE mode\n",
> +					i, res);
> +				goto disable_iov;
> +			}
>  
> -	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> -		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> -		if (!res->flags || res->parent)
> +			iov->m64_single_mode[i] = true;
>  			continue;
> +		}
>  
> -		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>  		/*
> -		 * On PHB3, the minimum size alignment of M64 BAR in single
> -		 * mode is 32MB.
> +		 * This BAR can be mapped with one segmented window, so adjust
> +		 * te resource size to accommodate.
>  		 */
> -		if (iov->m64_single_mode && (size < SZ_32M))
> -			goto disable_iov;
> +		pci_dbg(pdev, " Fixing VF BAR%d: %pR to\n", i, res);
> +		res->end = res->start + vf_bar_sz * mul - 1;
> +		pci_dbg(pdev, "                       %pR\n", res);
>  
> -		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
> -		res->end = res->start + size * mul - 1;
> -		dev_dbg(&pdev->dev, "                       %pR\n", res);
> -		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
> +		pci_info(pdev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>  			 i, res, mul);
> +
> +		iov->need_shift = true;
>  	}
> +
>  	iov->vfs_expanded = mul;
>  
>  	return;
> @@ -260,42 +256,40 @@ void pnv_pci_ioda_fixup_iov(struct pci_dev *pdev)
>  resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
>  						      int resno)
>  {
> -	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct pnv_iov_data *iov = pnv_iov_get(pdev);
>  	resource_size_t align;
>  
> +	/*
> +	 * iov can be null if we have an SR-IOV device with IOV BAR that can't
> +	 * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
> +	 * In that case we don't allow VFs to be enabled since one of their
> +	 * BARs would not be placed in the correct PE.
> +	 */
> +	if (!iov)
> +		return align;
> +	if (!iov->vfs_expanded)
> +		return align;
> +
> +	align = pci_iov_resource_size(pdev, resno);

I am not sure if it has been reported yet but clang points out that
align is initialized after its use:

arch/powerpc/platforms/powernv/pci-sriov.c:267:10: warning: variable 'align' is uninitialized when used here [-Wuninitialized]
                return align;
                       ^~~~~
arch/powerpc/platforms/powernv/pci-sriov.c:258:23: note: initialize the variable 'align' to silence this warning
        resource_size_t align;
                             ^
                              = 0
1 warning generated.

> +	/*
> +	 * If we're using single mode then we can just use the native VF BAR
> +	 * alignment. We validated that it's possible to use a single PE
> +	 * window above when we did the fixup.
> +	 */
> +	if (iov->m64_single_mode[resno - PCI_IOV_RESOURCES])
> +		return align;
> +
>  	/*
>  	 * On PowerNV platform, IOV BAR is mapped by M64 BAR to enable the
>  	 * SR-IOV. While from hardware perspective, the range mapped by M64
>  	 * BAR should be size aligned.
>  	 *
> -	 * When IOV BAR is mapped with M64 BAR in Single PE mode, the extra
> -	 * powernv-specific hardware restriction is gone. But if just use the
> -	 * VF BAR size as the alignment, PF BAR / VF BAR may be allocated with
> -	 * in one segment of M64 #15, which introduces the PE conflict between
> -	 * PF and VF. Based on this, the minimum alignment of an IOV BAR is
> -	 * m64_segsize.
> -	 *
>  	 * This function returns the total IOV BAR size if M64 BAR is in
>  	 * Shared PE mode or just VF BAR size if not.
>  	 * If the M64 BAR is in Single PE mode, return the VF BAR size or
>  	 * M64 segment size if IOV BAR size is less.
>  	 */
> -	align = pci_iov_resource_size(pdev, resno);
> -
> -	/*
> -	 * iov can be null if we have an SR-IOV device with IOV BAR that can't
> -	 * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
> -	 * In that case we don't allow VFs to be enabled so just return the
> -	 * default alignment.
> -	 */
> -	if (!iov)
> -		return align;
> -	if (!iov->vfs_expanded)
> -		return align;
> -	if (iov->m64_single_mode)
> -		return max(align, (resource_size_t)phb->ioda.m64_segsize);
> -
>  	return iov->vfs_expanded * align;
>  }
>  
> @@ -450,7 +444,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
>  			continue;
>  
>  		/* don't need single mode? map everything in one go! */
> -		if (!iov->m64_single_mode) {
> +		if (!iov->m64_single_mode[i]) {
>  			win = pnv_pci_alloc_m64_bar(phb, iov);
>  			if (win < 0)
>  				goto m64_failed;
> @@ -543,6 +537,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
>  		if (!res->flags || !res->parent)
>  			continue;
> +		if (iov->m64_single_mode[i])
> +			continue;
>  
>  		/*
>  		 * The actual IOV BAR range is determined by the start address
> @@ -574,6 +570,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
>  		if (!res->flags || !res->parent)
>  			continue;
> +		if (iov->m64_single_mode[i])
> +			continue;
>  
>  		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>  		res2 = *res;
> @@ -619,8 +617,8 @@ static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>  	/* Release VF PEs */
>  	pnv_ioda_release_vf_PE(pdev);
>  
> -	/* Un-shift the IOV BAR resources */
> -	if (!iov->m64_single_mode)
> +	/* Un-shift the IOV BARs if we need to */
> +	if (iov->need_shift)
>  		pnv_pci_vf_resource_shift(pdev, -base_pe);
>  
>  	/* Release M64 windows */
> @@ -738,9 +736,8 @@ static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  	 * the IOV BAR according to the PE# allocated to the VFs.
>  	 * Otherwise, the PE# for the VF will conflict with others.
>  	 */
> -	if (!iov->m64_single_mode) {
> -		ret = pnv_pci_vf_resource_shift(pdev,
> -						base_pe->pe_number);
> +	if (iov->need_shift) {
> +		ret = pnv_pci_vf_resource_shift(pdev, base_pe->pe_number);
>  		if (ret)
>  			goto shift_failed;
>  	}
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 41a6f4e938e4..902e928c7c22 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -243,8 +243,15 @@ struct pnv_iov_data {
>  	/* pointer to the array of VF PEs. num_vfs long*/
>  	struct pnv_ioda_pe *vf_pe_arr;
>  
> -	/* Did we map the VF BARs with single-PE IODA BARs? */
> -	bool    m64_single_mode;
> +	/* Did we map the VF BAR with single-PE IODA BARs? */
> +	bool    m64_single_mode[PCI_SRIOV_NUM_BARS];
> +
> +	/*
> +	 * True if we're using any segmented windows. In that case we need
> +	 * shift the start of the IOV resource the segment corresponding to
> +	 * the allocated PE.
> +	 */
> +	bool    need_shift;
>  
>  	/*
>  	 * Bit mask used to track which m64 windows are used to map the
> -- 
> 2.26.2
> 

^ permalink raw reply

* Re: [PATCH v2] ASoC: fsl-asoc-card: Remove fsl_asoc_card_set_bias_level function
From: Nicolin Chen @ 2020-08-01  7:59 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: alsa-devel, timur, Xiubo.Lee, linuxppc-dev, tiwai, lgirdwood,
	perex, broonie, festevam, linux-kernel
In-Reply-To: <1596102422-14010-1-git-send-email-shengjiu.wang@nxp.com>

Hi,

Having two nits and one question, inline:

On Thu, Jul 30, 2020 at 05:47:02PM +0800, Shengjiu Wang wrote:
> @@ -182,6 +180,69 @@ static int fsl_asoc_card_hw_params(struct snd_pcm_substream *substream,
>  					       cpu_priv->slot_width);
>  		if (ret && ret != -ENOTSUPP) {
>  			dev_err(dev, "failed to set TDM slot for cpu dai\n");
> +			goto out;
> +		}
> +	}
> +
> +	/* Specific configuration for PLL */
> +	if (codec_priv->pll_id && codec_priv->fll_id) {
> +		if (priv->sample_format == SNDRV_PCM_FORMAT_S24_LE)
> +			pll_out = priv->sample_rate * 384;
> +		else
> +			pll_out = priv->sample_rate * 256;
> +
> +		ret = snd_soc_dai_set_pll(asoc_rtd_to_codec(rtd, 0),
> +					  codec_priv->pll_id,
> +					  codec_priv->mclk_id,
> +					  codec_priv->mclk_freq, pll_out);
> +		if (ret) {
> +			dev_err(dev, "failed to start FLL: %d\n", ret);
> +			goto out;
> +		}
> +
> +		ret = snd_soc_dai_set_sysclk(asoc_rtd_to_codec(rtd, 0),
> +					     codec_priv->fll_id,
> +					     pll_out, SND_SOC_CLOCK_IN);

Just came into my mind: do we need some protection here to prevent
PLL/SYSCLK reconfiguration if TX/RX end up with different values?

> +	return 0;
> +
> +out:
> +	priv->streams &= ~BIT(substream->stream);
> +	return ret;

Rather than "out:" which doesn't explicitly indicate an error-out,
"fail:" would be better, following what we used in probe().

> +static int fsl_asoc_card_hw_free(struct snd_pcm_substream *substream)
> +{
> +	struct snd_soc_pcm_runtime *rtd = substream->private_data;
> +	struct fsl_asoc_card_priv *priv = snd_soc_card_get_drvdata(rtd->card);
> +	struct codec_priv *codec_priv = &priv->codec_priv;
> +	struct device *dev = rtd->card->dev;
> +	int ret;
> +
> +	priv->streams &= ~BIT(substream->stream);
> +

> +	if (!priv->streams && codec_priv->pll_id &&
> +	    codec_priv->fll_id) {

This now can fit into single line :)

^ permalink raw reply

* Re: [PATCH 06/15] powerpc: fadamp: simplify fadump_reserve_crash_area()
From: Mike Rapoport @ 2020-08-01 10:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, Hari Bathini, linux-mips,
	Max Filippov, Paul Mackerras, sparclinux, linux-riscv,
	Will Deacon, Stafford Horne, Marek Szyprowski, linux-s390,
	linux-c6x-dev, Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <87d04d5hda.fsf@mpe.ellerman.id.au>

On Thu, Jul 30, 2020 at 10:15:13PM +1000, Michael Ellerman wrote:
> Mike Rapoport <rppt@kernel.org> writes:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > fadump_reserve_crash_area() reserves memory from a specified base address
> > till the end of the RAM.
> >
> > Replace iteration through the memblock.memory with a single call to
> > memblock_reserve() with appropriate  that will take care of proper memory
>                                      ^
>                                      parameters?
> > reservation.
> >
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/powerpc/kernel/fadump.c | 20 +-------------------
> >  1 file changed, 1 insertion(+), 19 deletions(-)
> 
> I think this looks OK to me, but I don't have a setup to test it easily.
> I've added Hari to Cc who might be able to.
> 
> But I'll give you an ack in the hope that it works :)

Actually, I did some digging in the git log and the traversal was added
there on purpose by the commit b71a693d3db3 ("powerpc/fadump: exclude
memory holes while reserving memory in second kernel")
Presuming this is still reqruired I'm going to drop this patch and will
simply replace for_each_memblock() with for_each_mem_range() in v2.
 
> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
> 
> 
> > diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> > index 78ab9a6ee6ac..2446a61e3c25 100644
> > --- a/arch/powerpc/kernel/fadump.c
> > +++ b/arch/powerpc/kernel/fadump.c
> > @@ -1658,25 +1658,7 @@ int __init fadump_reserve_mem(void)
> >  /* Preserve everything above the base address */
> >  static void __init fadump_reserve_crash_area(u64 base)
> >  {
> > -	struct memblock_region *reg;
> > -	u64 mstart, msize;
> > -
> > -	for_each_memblock(memory, reg) {
> > -		mstart = reg->base;
> > -		msize  = reg->size;
> > -
> > -		if ((mstart + msize) < base)
> > -			continue;
> > -
> > -		if (mstart < base) {
> > -			msize -= (base - mstart);
> > -			mstart = base;
> > -		}
> > -
> > -		pr_info("Reserving %lluMB of memory at %#016llx for preserving crash data",
> > -			(msize >> 20), mstart);
> > -		memblock_reserve(mstart, msize);
> > -	}
> > +	memblock_reserve(base, memblock_end_of_DRAM() - base);
> >  }
> >  
> >  unsigned long __init arch_reserved_kernel_pages(void)
> > -- 
> > 2.26.2

-- 
Sincerely yours,
Mike.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox