LinuxPPC-Dev Archive on lore.kernel.org
* [PATCH] powerpc/perf: Account for interrupts during PMC overflow for an invalid SIAR check
From: Athira Rajeev @ 2020-08-06 12:46 UTC
  To: mpe; +Cc: aik, maddy, linuxppc-dev

The performance monitor interrupt handler checks whether any counter has
overflowed and calls `record_and_restart` in core-book3s, which invokes
`perf_event_overflow` to record the sample information.
Apart from creating the sample, perf_event_overflow also performs the
interrupt and period checks via perf_event_account_interrupt.

Currently we record information, and hence do the interrupt check, only
if the SIAR valid bit is set (using the `siar_valid` check).
But it is possible that we sample some events that do not generate a
valid SIAR, and then there is no chance to disable the event if the
number of interrupts exceeds max_samples_per_tick. This leads to a soft
lockup.

Fix this by calling perf_event_account_interrupt in the invalid-SIAR
code path for a sampling event, i.e. if SIAR is invalid, just do the
interrupt check and don't record the sample information.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/perf/core-book3s.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 01d7028..626e587 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2101,6 +2101,10 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 
 		if (perf_event_overflow(event, &data, regs))
 			power_pmu_stop(event, 0);
+	} else if (period) {
+		/* Account for interrupt in case of invalid SIAR */
+		if (perf_event_account_interrupt(event))
+			power_pmu_stop(event, 0);
 	}
 }
 
-- 
1.8.3.1

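For illustration only, the gap this patch closes can be modelled in userspace. This is a hypothetical sketch, not kernel code: `sim_event`, `sim_account_interrupt()` and the limit of 3 stand in for `struct perf_event`, perf_event_account_interrupt() and max_samples_per_tick. The `record` test is simplified to `period && siar_valid`, and the real bookkeeping (perf_throttled_seq, frequency adjustment, unthrottling on the next tick) is left out.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified model of the patched record_and_restart()
 * branch structure. None of these names are kernel APIs.
 */
#define SIM_MAX_SAMPLES_PER_TICK 3u

static unsigned long sim_tick_seq = 1;	/* advanced once per simulated tick */

struct sim_event {
	unsigned long interrupts_seq;	/* tick in which the count was last reset */
	unsigned int interrupts;	/* interrupts taken during that tick */
	bool stopped;			/* set once the event is throttled */
};

/* Mimics the per-tick check: true once too many interrupts hit this tick. */
static bool sim_account_interrupt(struct sim_event *ev)
{
	if (ev->interrupts_seq != sim_tick_seq) {
		ev->interrupts_seq = sim_tick_seq;
		ev->interrupts = 1;
		return false;
	}
	return ++ev->interrupts >= SIM_MAX_SAMPLES_PER_TICK;
}

/* One PMU overflow interrupt; with_fix selects the patched behaviour. */
static void sim_record_and_restart(struct sim_event *ev, unsigned long period,
				   bool siar_valid, bool with_fix)
{
	bool record = period && siar_valid;

	if (record) {
		/* real code: fill perf_sample_data, then perf_event_overflow() */
		if (sim_account_interrupt(ev))
			ev->stopped = true;
	} else if (with_fix && period) {
		/* the new branch: invalid SIAR still counts toward throttling */
		if (sim_account_interrupt(ev))
			ev->stopped = true;
	}
}
```

Feeding ten invalid-SIAR overflows in one tick through the model never sets `stopped` when `with_fix` is false; with the extra `else if (period)` branch the event is throttled on the third interrupt.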


* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
From: Michael Ellerman @ 2020-08-06 12:51 UTC
  To: Michael Roth, Greg Kurz
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater,
	Thiago Jung Bauermann
In-Reply-To: <159666656828.15440.9097316124875217814@sif>

Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> Quoting Michael Roth (2020-08-04 23:37:32)
>> Quoting Michael Ellerman (2020-08-04 22:07:08)
>> > Greg Kurz <groug@kaod.org> writes:
>> > > On Tue, 04 Aug 2020 23:35:10 +1000
>> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
>> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
>> > >> least twice now, continuing when we don't know the state of the other
>> > >> CPU can lead to straight up crashes.
>> > >> 
>> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
>> > >> spinning rather than oopsing.
>> > >> 
>> > >
>> > > +1
>> > >
>> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
>> > >> good. But then we will probably hit that on some big and/or heavily
>> > >> loaded machine.
>> > >> 
>> > >> So possibly we should call cond_resched() but have some custom logic in
>> > >> the loop to print a warning if we are stuck for more than some
>> > >> sufficiently long amount of time.
>> > >
>> > > How long should that be ?
>> > 
>> > Yeah good question.
>> > 
>> > I guess step one would be seeing how long it can take on the 384 vcpu
>> > machine. And we can probably test on some other big machines.
>> > 
>> > Hopefully Nathan can give us some idea of how long he's seen it take on
>> > large systems? I know he was concerned about the 20s timeout of the
>> > softlockup detector.
>> > 
>> > Maybe a minute or two?
>> 
>> Hmm, so I took a stab at this where I called cond_resched() after
>> every 5 seconds of polling and printed a warning at the same time (FWIW
>> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
>> system using KVM running the 384vcpu unplug loop)
>> 
>> But it sounds like that's not quite what you had in mind. How frequently
>> do you think we should call cond_resched()? Maybe after 25 iterations
>> of polling smp_query_cpu_stopped() to keep original behavior somewhat
>> similar?

I think we can just call it on every iteration, it should be cheap
compared to an RTAS call.

The concern was just by doing that you effectively prevent the
softlockup detector from reporting you as stuck in that path. Hence the
desire to manually print a warning after ~60s or something.

cheers

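A rough userspace sketch of the approach suggested above, under stated assumptions: `cpu_stopped_fn` is a hypothetical stand-in for smp_query_cpu_stopped(), sched_yield() plays the role of cond_resched() and is called on every iteration, and the loop prints its own one-time warning because yielding would otherwise keep the soft-lockup detector quiet. It is not taken from any posted patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical predicate: returns true once the target CPU has stopped. */
typedef bool (*cpu_stopped_fn)(void *ctx);

static double now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Poll until the CPU reports stopped, yielding on every pass. Since the
 * yield hides the loop from a soft-lockup style detector, warn once if the
 * wait exceeds warn_after_s. Returns 1 if the warning was printed, else 0.
 */
static int wait_for_cpu_stop(cpu_stopped_fn stopped, void *ctx, double warn_after_s)
{
	double start = now_seconds();
	int warned = 0;

	while (!stopped(ctx)) {
		sched_yield();
		if (!warned && now_seconds() - start > warn_after_s) {
			fprintf(stderr, "CPU still not stopped after %.0f seconds\n",
				warn_after_s);
			warned = 1;
		}
	}
	return warned;
}

/* Trivial fake for experimenting: reports "stopped" after *ctx polls. */
static bool countdown_stopped(void *ctx)
{
	int *remaining = ctx;

	return --(*remaining) <= 0;
}
```

A caller could pass 60.0 as `warn_after_s` to match the ~60s figure mentioned above; `countdown_stopped()` is only a fake predicate for trying the loop out.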

* [RFC PATCH] powerpc/drmem: use global variable instead of fetching again
From: Aneesh Kumar K.V @ 2020-08-06 12:52 UTC
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V, Hari Bathini
In-Reply-To: <20200806123604.248361-1-aneesh.kumar@linux.ibm.com>

Use mem_addr_cells/mem_size_cells instead of fetching the values
again from the device tree.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/drmem.c | 24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index b2eeea39684c..f533a7b04ab9 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -14,8 +14,6 @@
 #include <asm/prom.h>
 #include <asm/drmem.h>
 
-static int n_root_addr_cells, n_root_size_cells;
-
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -196,8 +194,8 @@ static void read_drconf_v1_cell(struct drmem_lmb *lmb,
 {
 	const __be32 *p = *prop;
 
-	lmb->base_addr = of_read_number(p, n_root_addr_cells);
-	p += n_root_addr_cells;
+	lmb->base_addr = of_read_number(p, mem_addr_cells);
+	p += mem_addr_cells;
 	lmb->drc_index = of_read_number(p++, 1);
 
 	p++; /* skip reserved field */
@@ -233,8 +231,8 @@ static void read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
 	const __be32 *p = *prop;
 
 	dr_cell->seq_lmbs = of_read_number(p++, 1);
-	dr_cell->base_addr = of_read_number(p, n_root_addr_cells);
-	p += n_root_addr_cells;
+	dr_cell->base_addr = of_read_number(p, mem_addr_cells);
+	p += mem_addr_cells;
 	dr_cell->drc_index = of_read_number(p++, 1);
 	dr_cell->aa_index = of_read_number(p++, 1);
 	dr_cell->flags = of_read_number(p++, 1);
@@ -285,10 +283,6 @@ int __init walk_drmem_lmbs_early(unsigned long node, void *data,
 	if (!prop || len < dt_root_size_cells * sizeof(__be32))
 		return ret;
 
-	/* Get the address & size cells */
-	n_root_addr_cells = dt_root_addr_cells;
-	n_root_size_cells = dt_root_size_cells;
-
 	drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, &prop);
 
 	usm = of_get_flat_dt_prop(node, "linux,drconf-usable-memory", &len);
@@ -318,12 +312,12 @@ static int init_drmem_lmb_size(struct device_node *dn)
 		return 0;
 
 	prop = of_get_property(dn, "ibm,lmb-size", &len);
-	if (!prop || len < n_root_size_cells * sizeof(__be32)) {
+	if (!prop || len < mem_size_cells * sizeof(__be32)) {
 		pr_info("Could not determine LMB size\n");
 		return -1;
 	}
 
-	drmem_info->lmb_size = of_read_number(prop, n_root_size_cells);
+	drmem_info->lmb_size = of_read_number(prop, mem_size_cells);
 	return 0;
 }
 
@@ -353,12 +347,6 @@ int walk_drmem_lmbs(struct device_node *dn, void *data,
 	if (!of_root)
 		return ret;
 
-	/* Get the address & size cells */
-	of_node_get(of_root);
-	n_root_addr_cells = of_n_addr_cells(of_root);
-	n_root_size_cells = of_n_size_cells(of_root);
-	of_node_put(of_root);
-
 	if (init_drmem_lmb_size(dn))
 		return ret;
 
-- 
2.26.2

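To show why the saved cell counts matter, here is a hedged userspace sketch of parsing one v1 ibm,dynamic-memory entry the way read_drconf_v1_cell() does. `read_cells()` mirrors the behaviour of of_read_number(), ntohl() stands in for be32_to_cpu(), and `struct lmb_sketch` is a hypothetical mirror of struct drmem_lmb; `addr_cells` plays the role of mem_addr_cells.

```c
#include <arpa/inet.h>	/* ntohl() as a stand-in for be32_to_cpu() */
#include <stdint.h>

/* Same shape as of_read_number(): join `size` big-endian cells. */
static uint64_t read_cells(const uint32_t *cell, int size)
{
	uint64_t r = 0;

	for (; size--; cell++)
		r = (r << 32) | ntohl(*cell);
	return r;
}

/* Hypothetical mirror of struct drmem_lmb, for illustration only. */
struct lmb_sketch {
	uint64_t base_addr;
	uint32_t drc_index;
	uint32_t aa_index;
	uint32_t flags;
};

/* Parse one v1 entry; returns the cursor just past the entry. */
static const uint32_t *parse_v1_cell(struct lmb_sketch *lmb,
				     const uint32_t *p, int addr_cells)
{
	lmb->base_addr = read_cells(p, addr_cells);
	p += addr_cells;
	lmb->drc_index = (uint32_t)read_cells(p++, 1);
	p++;				/* skip reserved field */
	lmb->aa_index = (uint32_t)read_cells(p++, 1);
	lmb->flags = (uint32_t)read_cells(p++, 1);
	return p;
}
```

With two address cells, the entry {0x1, 0x20000000, 7, 0, 2, 8} yields base 0x120000000, DRC index 7, associativity index 2 and flags 8, and the cursor advances by six cells.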


* Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
From: Srikar Dronamraju @ 2020-08-06 12:53 UTC
  To: peterz
  Cc: Gautham R Shenoy, Michael Neuling, Vincent Guittot, Rik van Riel,
	linuxppc-dev, LKML, Valentin Schneider, Thomas Gleixner,
	Mel Gorman, Ingo Molnar, Dietmar Eggemann
In-Reply-To: <20200806085429.GX2674@hirez.programming.kicks-ass.net>

* peterz@infradead.org <peterz@infradead.org> [2020-08-06 10:54:29]:

> On Thu, Aug 06, 2020 at 03:32:25PM +1000, Michael Ellerman wrote:
> 
> > That brings with it a bunch of problems, such as existing software that
> > has been developed/configured for Power8 and expects to see SMT8.
> > 
> > We also allow LPARs to be live migrated from Power8 to Power9 (and back), so
> > maintaining the illusion of SMT8 is considered a requirement to make that work.
> 
> So how does that work if the kernel booted on P9 and demuxed the SMT8
> into 2xSMT4? If you migrate that state onto a P8 with actual SMT8 you're
> toast again.
> 

To add to what Michael already said, the reason we don't expose the demux of
SMT8 into 2xSMT4 to userspace is to make userspace believe it is running on
an SMT8 core. When the kernel is live migrated from P8 to P9, it keeps the
older P8 topology until the next reboot. After the reboot the kernel topology
changes, but userspace is still made to believe it is running on an SMT8
core, because sibling_cpumask is kept at the SMT8 core level.

-- 
Thanks and Regards
Srikar Dronamraju

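To make the two views concrete, here is an illustrative sketch of the masks involved. It assumes the interleaved layout POWER9 firmware describes through ibm,thread-groups (even threads of an SMT8 core in one SMT4 small core, odd threads in the other) and uses plain 64-bit masks for CPUs 0-63; the function names are hypothetical, and the kernel itself works with cpumask_t.

```c
#include <stdint.h>

/* Bit n = CPU n; valid for cpu < 64, 8 hardware threads per SMT8 core. */

/* Siblings as presented to userspace: the whole SMT8 core. */
static uint64_t smt8_siblings(unsigned int cpu)
{
	unsigned int first = cpu & ~7u;		/* first CPU of the SMT8 core */

	return 0xFFull << first;
}

/* Siblings the scheduler should use: only the SMT4 small core (assumed
 * to be the even or the odd threads of the SMT8 core). */
static uint64_t smt4_siblings(unsigned int cpu)
{
	unsigned int first = cpu & ~7u;
	uint64_t pattern = (cpu & 1) ? 0xAAull : 0x55ull;	/* odd : even threads */

	return pattern << first;
}
```

For CPU 9, userspace keeps seeing all of CPUs 8-15 as siblings (0xFF00), while the scheduler would only treat CPUs 9, 11, 13 and 15 (0xAA00) as sharing an SMT4 core.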

* Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
From: peterz @ 2020-08-06 13:15 UTC
  To: Michael Ellerman
  Cc: Gautham R Shenoy, Michael Neuling, Vincent Guittot,
	Srikar Dronamraju, Rik van Riel, linuxppc-dev, LKML,
	Valentin Schneider, Thomas Gleixner, Mel Gorman, Ingo Molnar,
	Dietmar Eggemann
In-Reply-To: <87d044yn9z.fsf@mpe.ellerman.id.au>

On Thu, Aug 06, 2020 at 10:25:12PM +1000, Michael Ellerman wrote:
> peterz@infradead.org writes:
> > On Thu, Aug 06, 2020 at 03:32:25PM +1000, Michael Ellerman wrote:
> >
> >> That brings with it a bunch of problems, such as existing software that
> >> has been developed/configured for Power8 and expects to see SMT8.
> >> 
> >> We also allow LPARs to be live migrated from Power8 to Power9 (and back), so
> >> maintaining the illusion of SMT8 is considered a requirement to make that work.
> >
> > So how does that work if the kernel booted on P9 and demuxed the SMT8
> > into 2xSMT4? If you migrate that state onto a P8 with actual SMT8 you're
> > toast again.
> 
> The SMT mask would be inaccurate on the P8, rather than the current case
> where it's inaccurate on the P9.
> 
> Which would be our preference, because the backward migration case is
> not common AIUI.
> 
> Or am I missing a reason we'd be even more toast than that?

Well, the scheduler might do a wee bit funny. We just had a patch that
increased load-balancing opportunities between SMT siblings because they
all share L1 anyway.

But yeah, nothing terminal.

> Under PowerVM the kernel does know it's being migrated, so we could
> actually update the mask, but I'm not sure if that's really feasible.

As long as you get a notification, rebuilding the sched domains isn't
terribly hard to do, there's more code that does that.

> >> Yeah I agree the naming is confusing.
> >> 
> >> Let's call them "SMT4 cores" and "SMT8 cores"?
> >
> > Works for me, thanks!
> >
> >> The problem is we are already lying to userspace, because firmware lies to us.
> >> 
> >> ie. the firmware on these systems shows us an SMT8 core, and so current kernels
> >> show SMT8 to userspace. I don't think we can realistically change that fact now,
> >> as these systems are already out in the field.
> >> 
> >> What this patch tries to do is undo some of the mess, and at least give the
> >> scheduler the right information.
> >
> > What a mess... I think it depends on what you do with that P9 to P8
> > migration case. Does it make sense to have a "p8_compat" boot arg for
> > the case where you want LPAR migration back onto P8 systems -- in which
> > case it simply takes the firmware's word as gospel and doesn't untangle
> > things, because it can actually land on a P8.
> 
> We already get told by firmware that we're running in "p8 compat" mode,
> because we have to pretend to userspace that it's running on a P8. So we
> could use that as a signal to leave things alone.
> 
> But my understanding is most LPARs don't get migrated back and forth,
> they'll start life on a P8 and only get migrated to a P9 once when the
> customer gets a P9. They might then run for a long time (months to
> years) on the P9 in P8 compat mode, not because they ever want to
> migrate back to a real P8, but because the software in the LPAR is still
> expecting to be on a P8.
> 
> I'm not a real expert on all the Enterprisey stuff though, so someone
> else might be able to give us a better picture.
> 
> But the point of mentioning the migration stuff was mainly just to
> explain why we feel we need to present SMT8 to userspace even on P9.

OK, fair enough. The patch wasn't particularly onerous, I was just
wondering why etc..

The case of starting on a P8 and being migrated to a P9 makes sense to
me; in that case you'd like to rebuild your sched domains, but can't go
about changing user visible topology information.

I suppose:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

An updated Changelog that recaps some of this discussion might also be
nice.


* Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
From: Srikar Dronamraju @ 2020-08-06 14:09 UTC
  To: peterz
  Cc: Gautham R Shenoy, Michael Neuling, Vincent Guittot, Rik van Riel,
	linuxppc-dev, LKML, Valentin Schneider, Thomas Gleixner,
	Mel Gorman, Ingo Molnar, Dietmar Eggemann
In-Reply-To: <20200806131547.GC2674@hirez.programming.kicks-ass.net>

* peterz@infradead.org <peterz@infradead.org> [2020-08-06 15:15:47]:

> > But my understanding is most LPARs don't get migrated back and forth,
> > they'll start life on a P8 and only get migrated to a P9 once when the
> > customer gets a P9. They might then run for a long time (months to
> > years) on the P9 in P8 compat mode, not because they ever want to
> > migrate back to a real P8, but because the software in the LPAR is still
> > expecting to be on a P8.
> > 
> > I'm not a real expert on all the Enterprisey stuff though, so someone
> > else might be able to give us a better picture.
> > 
> > But the point of mentioning the migration stuff was mainly just to
> > explain why we feel we need to present SMT8 to userspace even on P9.
> 
> OK, fair enough. The patch wasn't particularly onerous, I was just
> wondering why etc..
> 
> The case of starting on a P8 and being migrated to a P9 makes sense to
> me; in that case you'd like to rebuild your sched domains, but can't go
> about changing user visible topology information.
> 
> I suppose:
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> An updated Changelog that recaps some of this discussion might also be
> nice.

Okay, will surely do the needful.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
From: Nathan Lynch @ 2020-08-06 15:17 UTC
  To: Michael Ellerman; +Cc: Tyrel Datwyler, Scott Cheloha, linuxppc-dev
In-Reply-To: <871rkkymd5.fsf@mpe.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:
> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> On 7/27/20 11:46 AM, Scott Cheloha wrote:
>>> The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall,
>>> Affinity_Domain_Info_By_Partition, which returns, among other things,
>>> a "partition affinity score" for a given LPAR.  This score, a value on
>>> [0-100], represents the processor-memory affinity for the LPAR in
>>> question.  A score of 0 indicates the worst possible affinity while a
>>> score of 100 indicates perfect affinity.  The score can be used to
>>> reason about performance.
>>> 
>>> This patch adds the score for the local LPAR to the lparcfg procfile
>>> under a new 'partition_affinity_score' key.
>>> 
>>> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
>>
>> I was hoping Michael would chime in the first time around on this patch series
>> about adding another key/value pair to lparcfg.
>
> That guy is so unreliable.
>
> I don't love adding new stuff in lparcfg, but given the file already
> exists and there's no prospect of removing it, it's probably not worth
> the effort to put the new field anywhere else.
>
> My other query with this was how on earth anyone is meant to interpret
> the metric. ie. if my metric is 50, what does that mean? If it's 90
> should I worry?

Here's some more background.

This interface is just passing up what the platform provides, and it's
identical to the partition affinity score described in the documentation
for the management console's lsmemopt command:

https://www.ibm.com/support/knowledgecenter/POWER9/p9edm/lsmemopt.html

The score is 0-100, higher values are better. To illustrate: I believe a
partition's score will be 100 (or very close to it) if all of its CPUs
and memory reside within one node. It will be lower than that when a
partition has some memory without local CPUs, and lower still when there
is no CPU-memory affinity within the partition. Beyond that I don't have
more specific information; the algorithm and scale are set by the
platform.

The intent is for this to be a metric to gather during problem
determination e.g. via sosreport or similar, but as far as Linux is
concerned this should be treated as an opaque value.


* Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
From: Nathan Lynch @ 2020-08-06 15:18 UTC
  To: Scott Cheloha, linuxppc-dev; +Cc: Tyrel Datwylder
In-Reply-To: <20200727184605.2945095-2-cheloha@linux.ibm.com>

Scott Cheloha <cheloha@linux.ibm.com> writes:
> The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall,
> Affinity_Domain_Info_By_Partition, which returns, among other things,
> a "partition affinity score" for a given LPAR.  This score, a value on
> [0-100], represents the processor-memory affinity for the LPAR in
> question.  A score of 0 indicates the worst possible affinity while a
> score of 100 indicates perfect affinity.  The score can be used to
> reason about performance.
>
> This patch adds the score for the local LPAR to the lparcfg procfile
> under a new 'partition_affinity_score' key.
>
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/lparcfg.c | 35 ++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/lparcfg.c b/arch/powerpc/platforms/pseries/lparcfg.c
> index b8d28ab88178..e278390ab28d 100644
> --- a/arch/powerpc/platforms/pseries/lparcfg.c
> +++ b/arch/powerpc/platforms/pseries/lparcfg.c
> @@ -136,6 +136,39 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data)
>  	return rc;
>  }
>  
> +static void show_gpci_data(struct seq_file *m)
> +{
> +	struct hv_gpci_request_buffer *buf;
> +	unsigned int affinity_score;
> +	long ret;
> +
> +	buf = kmalloc(sizeof(*buf), GFP_KERNEL);
> +	if (buf == NULL)
> +		return;
> +
> +	/*
> +	 * Show the local LPAR's affinity score.
> +	 *
> +	 * 0xB1 selects the Affinity_Domain_Info_By_Partition subcall.
> +	 * The score is at byte 0xB in the output buffer.
> +	 */
> +	memset(&buf->params, 0, sizeof(buf->params));
> +	buf->params.counter_request = cpu_to_be32(0xB1);
> +	buf->params.starting_index = cpu_to_be32(-1);	/* local LPAR */
> +	buf->params.counter_info_version_in = 0x5;	/* v5+ for score */
> +	ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf),
> +				 sizeof(*buf));
> +	if (ret != H_SUCCESS) {
> +		pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n",
> +			 ret, be32_to_cpu(buf->params.detail_rc));
> +		goto out;
> +	}
> +	affinity_score = buf->bytes[0xB];
> +	seq_printf(m, "partition_affinity_score=%u\n", affinity_score);
> +out:
> +	kfree(buf);
> +}
> +
>  static unsigned h_pic(unsigned long *pool_idle_time,
>  		      unsigned long *num_procs)
>  {
> @@ -487,6 +520,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
>  			   partition_active_processors * 100);
>  	}
>  
> +	show_gpci_data(m);
> +
>  	seq_printf(m, "partition_active_processors=%d\n",
>  		   partition_active_processors);

Acked-by: Nathan Lynch <nathanl@linux.ibm.com>

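For anyone collecting this value during problem determination, a minimal sketch of a userspace reader follows. It assumes the key appears exactly as printed by the patch (`partition_affinity_score=%u`) in /proc/powerpc/lparcfg, treats a missing key as "not available" (an older kernel, or the hcall failed), and, in line with the discussion above, does not try to interpret the value.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look for "partition_affinity_score=<n>" in lparcfg-style key=value text.
 * Returns 0 and stores the value on success, -1 if the key is absent or
 * has no number after it. The value itself is treated as opaque.
 */
static int parse_affinity_score(FILE *f, unsigned int *score)
{
	static const char key[] = "partition_affinity_score=";
	const size_t key_len = sizeof(key) - 1;
	char line[256];

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, key, key_len) == 0) {
			char *end;
			unsigned long v = strtoul(line + key_len, &end, 10);

			if (end == line + key_len)
				return -1;	/* no digits after '=' */
			*score = (unsigned int)v;
			return 0;
		}
	}
	return -1;
}
```

Typical use would be opening the file with `fopen("/proc/powerpc/lparcfg", "r")` and passing the stream to `parse_affinity_score()`.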

* Re: [PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
From: Nathan Lynch @ 2020-08-06 15:19 UTC
  To: Scott Cheloha, linuxppc-dev; +Cc: Tyrel Datwylder
In-Reply-To: <20200727184605.2945095-1-cheloha@linux.ibm.com>

Scott Cheloha <cheloha@linux.ibm.com> writes:

> The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are
> useful to modules outside of perf/, so move them into asm/hvcall.h to live
> alongside the other powerpc hypercall structs.
>
> Leave the perf-specific GPCI stuff in perf/hv-gpci.h.
>
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>

Acked-by: Nathan Lynch <nathanl@linux.ibm.com>


* Re: [PATCH][V2] macintosh: windfarm: remove detatch debug containing spelling mistakes
From: Wolfram Sang @ 2020-08-06 11:24 UTC
  To: Colin King; +Cc: Wolfram Sang, kernel-janitors, linuxppc-dev, linux-kernel
In-Reply-To: <20200806102901.44988-1-colin.king@canonical.com>

On Thu, Aug 06, 2020 at 11:29:01AM +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> There are spelling mistakes in two debug messages. As recommended
> by Wolfram Sang, these can be removed as there is plenty of debug
> in the driver core.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Wolfram Sang <wsa@kernel.org>


* [PATCH v2 1/4] powerpc/drmem: Make lmb_size 64 bit
From: Aneesh Kumar K.V @ 2020-08-06 16:23 UTC
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V, stable

Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic"),
make sure the different variables tracking lmb_size are updated to be 64 bit.

This was found by code audit.

Cc: stable@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 17ccc6474ab6..d719cbac34b2 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -21,7 +21,7 @@ struct drmem_lmb {
 struct drmem_lmb_info {
 	struct drmem_lmb        *lmbs;
 	int                     n_lmbs;
-	u32                     lmb_size;
+	u64                     lmb_size;
 };
 
 extern struct drmem_lmb_info *drmem_info;
@@ -67,7 +67,7 @@ struct of_drconf_cell_v2 {
 #define DRCONF_MEM_RESERVED	0x00000080
 #define DRCONF_MEM_HOTREMOVABLE	0x00000100
 
-static inline u32 drmem_lmb_size(void)
+static inline u64 drmem_lmb_size(void)
 {
 	return drmem_info->lmb_size;
 }
-- 
2.26.2


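A small illustration of the truncation this patch guards against, assuming a hypothetical LMB size of 4 GiB, which does not fit in 32 bits. `struct lmb_info_u32` and `struct lmb_info_u64` are hypothetical before/after mirrors of the lmb_size field, and `lmbs_needed()` is an invented helper showing how a truncated size breaks later arithmetic.

```c
#include <stdint.h>

/* Hypothetical before/after mirrors of drmem_lmb_info.lmb_size. */
struct lmb_info_u32 { uint32_t lmb_size; };
struct lmb_info_u64 { uint64_t lmb_size; };

/* Number of LMBs needed to cover mem_bytes; 0 if lmb_size is unusable. */
static uint64_t lmbs_needed(uint64_t mem_bytes, uint64_t lmb_size)
{
	return lmb_size ? mem_bytes / lmb_size : 0;
}
```

Storing 4 GiB (0x100000000) in the 32-bit field silently leaves 0, so covering 64 GiB of memory appears to need 0 LMBs instead of 16.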

* [PATCH v2 2/4] powerpc/mem: Store the dt_root_size/addr cell values for later usage
From: Aneesh Kumar K.V @ 2020-08-06 16:23 UTC
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V
In-Reply-To: <20200806162329.276534-1-aneesh.kumar@linux.ibm.com>

dt_root_addr_cells and dt_root_size_cells are __initdata variables,
so make copies of them that can be used after init.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h | 2 ++
 arch/powerpc/kernel/prom.c       | 7 +++++++
 arch/powerpc/mm/numa.c           | 1 +
 3 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index d719cbac34b2..ffb59caa88ee 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -123,4 +123,6 @@ static inline void lmb_clear_nid(struct drmem_lmb *lmb)
 }
 #endif
 
+extern int mem_addr_cells, mem_size_cells;
+
 #endif /* _ASM_POWERPC_LMB_H */
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index d8a2fb87ba0c..9a1701e85747 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -73,6 +73,7 @@ u64 ppc64_rma_size;
 #endif
 static phys_addr_t first_memblock_size;
 static int __initdata boot_cpu_count;
+int mem_addr_cells, mem_size_cells;
 
 static int __init early_parse_mem(char *p)
 {
@@ -536,6 +537,12 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
 						const char *uname,
 						int depth, void *data)
 {
+	/*
+	 * Make a copy from __initdata variable
+	 */
+	mem_addr_cells = dt_root_addr_cells;
+	mem_size_cells = dt_root_size_cells;
+
 #ifdef CONFIG_PPC_PSERIES
 	if (depth == 1 &&
 	    strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 058fee9a0835..77d41d9775d2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -368,6 +368,7 @@ static void __init get_n_mem_cells(int *n_addr_cells, int *n_size_cells)
 	of_node_put(memory);
 }
 
+/*  dt_mem_next_cell is __init  */
 static unsigned long read_n_cells(int n, const __be32 **buf)
 {
 	unsigned long result = 0;
-- 
2.26.2



* [PATCH v2 3/4] powerpc/memhotplug: Make lmb size 64bit
From: Aneesh Kumar K.V @ 2020-08-06 16:23 UTC
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V, stable
In-Reply-To: <20200806162329.276534-1-aneesh.kumar@linux.ibm.com>

Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic"),
make sure the different variables tracking lmb_size are updated to be 64 bit.

This was found by code audit.

Cc: stable@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 .../platforms/pseries/hotplug-memory.c        | 37 +++++++++++--------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b78111f..1fe3204c843a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -277,7 +277,7 @@ static int dlpar_offline_lmb(struct drmem_lmb *lmb)
 	return dlpar_change_lmb_state(lmb, false);
 }
 
-static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
+static int pseries_remove_memblock(unsigned long base, unsigned long memblock_size)
 {
 	unsigned long block_sz, start_pfn;
 	int sections_per_block;
@@ -308,9 +308,9 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_siz
 
 static int pseries_remove_mem_node(struct device_node *np)
 {
-	const __be32 *regs;
+	const __be32 *prop;
 	unsigned long base;
-	unsigned int lmb_size;
+	unsigned long lmb_size;
 	int ret = -EINVAL;
 
 	/*
@@ -322,12 +322,16 @@ static int pseries_remove_mem_node(struct device_node *np)
 	/*
 	 * Find the base address and size of the memblock
 	 */
-	regs = of_get_property(np, "reg", NULL);
-	if (!regs)
+	prop = of_get_property(np, "reg", NULL);
+	if (!prop)
 		return ret;
 
-	base = be64_to_cpu(*(unsigned long *)regs);
-	lmb_size = be32_to_cpu(regs[3]);
+	/*
+	 * "reg" property represents (addr,size) tuple.
+	 */
+	base = of_read_number(prop, mem_addr_cells);
+	prop += mem_addr_cells;
+	lmb_size = of_read_number(prop, mem_size_cells);
 
 	pseries_remove_memblock(base, lmb_size);
 	return 0;
@@ -557,7 +561,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 
 #else
 static inline int pseries_remove_memblock(unsigned long base,
-					  unsigned int memblock_size)
+					  unsigned long memblock_size)
 {
 	return -EOPNOTSUPP;
 }
@@ -878,9 +882,9 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
 
 static int pseries_add_mem_node(struct device_node *np)
 {
-	const __be32 *regs;
+	const __be32 *prop;
 	unsigned long base;
-	unsigned int lmb_size;
+	unsigned long lmb_size;
 	int ret = -EINVAL;
 
 	/*
@@ -892,12 +896,15 @@ static int pseries_add_mem_node(struct device_node *np)
 	/*
 	 * Find the base and size of the memblock
 	 */
-	regs = of_get_property(np, "reg", NULL);
-	if (!regs)
+	prop = of_get_property(np, "reg", NULL);
+	if (!prop)
 		return ret;
-
-	base = be64_to_cpu(*(unsigned long *)regs);
-	lmb_size = be32_to_cpu(regs[3]);
+	/*
+	 * "reg" property represents (addr,size) tuple.
+	 */
+	base = of_read_number(prop, mem_addr_cells);
+	prop += mem_addr_cells;
+	lmb_size = of_read_number(prop, mem_size_cells);
 
 	/*
 	 * Update memory region to represent the memory add
-- 
2.26.2


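The difference between the old and new "reg" parsing can be sketched in userspace as follows. `read_cells()` mirrors of_read_number(), ntohl() stands in for be32_to_cpu(), and the function names are hypothetical. The example assumes a node with two address cells and two size cells whose size is 4 GiB: the old code's `regs[3]` sees only the low 32 bits of the size.

```c
#include <arpa/inet.h>	/* ntohl() as a stand-in for be32_to_cpu() */
#include <stdint.h>

/* Same shape as of_read_number(): join `size` big-endian cells. */
static uint64_t read_cells(const uint32_t *cell, int size)
{
	uint64_t r = 0;

	for (; size--; cell++)
		r = (r << 32) | ntohl(*cell);
	return r;
}

/* New approach: walk the (addr, size) tuple using the cell counts. */
static void parse_reg(const uint32_t *prop, int addr_cells, int size_cells,
		      uint64_t *base, uint64_t *size)
{
	*base = read_cells(prop, addr_cells);
	prop += addr_cells;
	*size = read_cells(prop, size_cells);
}

/* Old approach, for comparison: assumes 2+2 cells and keeps only the
 * low 32 bits of the size. */
static uint32_t old_lmb_size(const uint32_t *regs)
{
	return ntohl(regs[3]);
}
```

For a 256 MiB block both approaches agree; for a 4 GiB block the old one yields 0 while `parse_reg()` returns 0x100000000.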

* [PATCH v2 4/4] powerpc/book3s64/radix: Make radix_mem_block_size 64bit
From: Aneesh Kumar K.V @ 2020-08-06 16:23 UTC
  To: linuxppc-dev, mpe; +Cc: Nathan Lynch, Aneesh Kumar K.V
In-Reply-To: <20200806162329.276534-1-aneesh.kumar@linux.ibm.com>

Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic"),
make sure the different variables tracking lmb_size are updated to be 64 bit.

Fixes: af9d00e93a4f ("powerpc/mm/radix: Create separate mappings for hot-plugged memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu.h | 2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 55442d45c597..1a0c9d09950f 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -85,7 +85,7 @@ extern unsigned int mmu_base_pid;
 /*
  * memory block size used with radix translation.
  */
-extern unsigned int __ro_after_init radix_mem_block_size;
+extern unsigned long __ro_after_init radix_mem_block_size;
 
 #define PRTB_SIZE_SHIFT	(mmu_pid_bits + 4)
 #define PRTB_ENTRIES	(1ul << mmu_pid_bits)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..ca76d9d6372a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -34,7 +34,7 @@
 
 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
-unsigned int radix_mem_block_size __ro_after_init;
+unsigned long radix_mem_block_size __ro_after_init;
 
 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
 			unsigned long region_start, unsigned long region_end)
-- 
2.26.2



* [PATCH v1 2/5] powerpc/fault: Unnest definition of page_fault_is_write() and page_fault_is_bad()
From: Christophe Leroy @ 2020-08-06 17:15 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.leroy@csgroup.eu>

To make it more readable, separate page_fault_is_write() and page_fault_is_bad()
to avoid several levels of #ifdefs.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/fault.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2efa34d7e644..9ef9ee244f72 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -363,17 +363,19 @@ static void sanity_check_fault(bool is_write, bool is_user,
  */
 #if (defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
 #define page_fault_is_write(__err)	((__err) & ESR_DST)
-#define page_fault_is_bad(__err)	(0)
 #else
 #define page_fault_is_write(__err)	((__err) & DSISR_ISSTORE)
-#if defined(CONFIG_PPC_8xx)
+#endif
+
+#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
+#define page_fault_is_bad(__err)	(0)
+#elif defined(CONFIG_PPC_8xx)
 #define page_fault_is_bad(__err)	((__err) & DSISR_NOEXEC_OR_G)
 #elif defined(CONFIG_PPC64)
 #define page_fault_is_bad(__err)	((__err) & DSISR_BAD_FAULT_64S)
 #else
 #define page_fault_is_bad(__err)	((__err) & DSISR_BAD_FAULT_32S)
 #endif
-#endif
 
 /*
  * For 600- and 800-family processors, the error_code parameter is DSISR
-- 
2.25.0



* [PATCH v1 1/5] powerpc/mm: sanity_check_fault() should work for all,  not only BOOK3S
From: Christophe Leroy @ 2020-08-06 17:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel

The verification and message introduced by commit 374f3f5979f9
("powerpc/mm/hash: Handle user access of kernel address gracefully")
apply to all platforms; they should not be limited to BOOK3S.

Make the BOOK3S version of sanity_check_fault() the one used by all
platforms, and bail out early, after the common check, when not on BOOK3S.

Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address gracefully")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/fault.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 925a7231abb3..2efa34d7e644 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -303,7 +303,6 @@ static inline void cmo_account_page_fault(void)
 static inline void cmo_account_page_fault(void) { }
 #endif /* CONFIG_PPC_SMLPAR */
 
-#ifdef CONFIG_PPC_BOOK3S
 static void sanity_check_fault(bool is_write, bool is_user,
 			       unsigned long error_code, unsigned long address)
 {
@@ -320,6 +319,9 @@ static void sanity_check_fault(bool is_write, bool is_user,
 		return;
 	}
 
+	if (!IS_ENABLED(CONFIG_PPC_BOOK3S))
+		return;
+
 	/*
 	 * For hash translation mode, we should never get a
 	 * PROTFAULT. Any update to pte to reduce access will result in us
@@ -354,10 +356,6 @@ static void sanity_check_fault(bool is_write, bool is_user,
 
 	WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 }
-#else
-static void sanity_check_fault(bool is_write, bool is_user,
-			       unsigned long error_code, unsigned long address) { }
-#endif /* CONFIG_PPC_BOOK3S */
 
 /*
  * Define the correct "is_write" bit in error_code based
-- 
2.25.0
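
The patch turns the user-access-of-kernel-address check into common code and skips the remaining BOOK3S-only checks with an IS_ENABLED() test instead of an #ifdef'd empty stub. A minimal host-side sketch of that pattern, assuming hypothetical stand-ins: DEMO_CONFIG_BOOK3S replaces the Kconfig option, DEMO_TASK_SIZE replaces TASK_SIZE, and the function returns which path ran instead of printing warnings as the real code does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a Kconfig option. The kernel's IS_ENABLED()
 * (from <linux/kconfig.h>) also copes with options that are not defined
 * at all; this sketch simply requires the value 0 or 1. */
#ifndef DEMO_CONFIG_BOOK3S
#define DEMO_CONFIG_BOOK3S 0
#endif

/* Hypothetical user/kernel address split, standing in for TASK_SIZE. */
#define DEMO_TASK_SIZE 0x80000000UL

enum demo_check {
	DEMO_NOTHING,       /* no check fired */
	DEMO_USER_KADDR,    /* common check: user access of a kernel address */
	DEMO_BOOK3S_CHECKS  /* the BOOK3S-only checks would run */
};

static enum demo_check demo_sanity_check_fault(bool is_user,
					       unsigned long address)
{
	/* After the patch this check applies to every platform. */
	if (is_user && address >= DEMO_TASK_SIZE)
		return DEMO_USER_KADDR;

	/* Replaces the old "#else" empty stub: the code below is still
	 * parsed and type-checked on every configuration, and the compiler
	 * drops it when the option is 0. */
	if (!DEMO_CONFIG_BOOK3S)
		return DEMO_NOTHING;

	return DEMO_BOOK3S_CHECKS;
}
```

Compared with the #ifdef version, the BOOK3S-only code can no longer silently stop compiling on non-BOOK3S configurations, since it is always built.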



* [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Christophe Leroy @ 2020-08-06 17:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.leroy@csgroup.eu>

search_exception_tables() is a heavy operation, so we should avoid it.
When KUAP is selected, we know whether the fault has been blocked by KUAP.
Otherwise, the access behaves just as if the address were already in the
TLBs and no fault had been generated.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/fault.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 525e0c2b5406..edde169ba3a6 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 	if (address >= TASK_SIZE)
 		return true;
 
-	if (!is_exec && (error_code & DSISR_PROTFAULT) &&
-	    !search_exception_tables(regs->nip)) {
+	// Read/write fault blocked by KUAP is bad, it can never succeed.
+	if (bad_kuap_fault(regs, address, is_write)) {
 		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
-				    address,
-				    from_kuid(&init_user_ns, current_uid()));
-	}
-
-	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
-	if (!search_exception_tables(regs->nip))
-		return true;
-
-	// Read/write fault in a valid region (the exception table search passed
-	// above), but blocked by KUAP is bad, it can never succeed.
-	if (bad_kuap_fault(regs, address, is_write))
+				    address, from_kuid(&init_user_ns, current_uid()));
 		return true;
+	}
 
-	// What's left? Kernel fault on user in well defined regions (extable
-	// matched), and allowed by KUAP in the faulting context.
+	// What's left? Kernel fault on user and allowed by KUAP in the faulting context.
 	return false;
 }
 
-- 
2.25.0
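
A boolean sketch of the tail of bad_kernel_fault() before and after this patch, to show which exception-table lookup disappears. All demo_* names are hypothetical: demo_search_exception_tables() stands in for search_exception_tables(regs->nip) and only counts calls, kernel_addr models "address >= TASK_SIZE", kuap_blocked models bad_kuap_fault(), and the warning path of the old code (which could trigger a second lookup) is omitted.

```c
#include <stdbool.h>

/* How many times the hypothetical expensive lookup ran. */
static unsigned int demo_lookups;

/* Stand-in for search_exception_tables(): the answer is passed in and
 * each call is counted as one expensive lookup. */
static bool demo_search_exception_tables(bool nip_in_extable)
{
	demo_lookups++;
	return nip_in_extable;
}

/* Tail of bad_kernel_fault() before this patch (warning path omitted). */
static bool demo_bad_before(bool kernel_addr, bool nip_in_extable,
			    bool kuap_blocked)
{
	if (kernel_addr)
		return true;
	if (!demo_search_exception_tables(nip_in_extable))
		return true;
	return kuap_blocked;
}

/* Tail after this patch: only the KUAP verdict matters. */
static bool demo_bad_after(bool kernel_addr, bool kuap_blocked)
{
	if (kernel_addr)
		return true;
	return kuap_blocked;
}
```

The two versions differ for an access from outside the exception-table regions that KUAP does not block: it is no longer reported as bad here and, as the commit message says, proceeds as if the translation had been present. Patch 5/5 then moves the exception-table fixup into do_page_fault().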



* [PATCH v1 3/5] powerpc/fault: Reorder tests in bad_kernel_fault()
From: Christophe Leroy @ 2020-08-06 17:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.leroy@csgroup.eu>

Check address earlier to simplify the following test.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/fault.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 9ef9ee244f72..525e0c2b5406 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -210,17 +210,17 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 		return true;
 	}
 
-	if (!is_exec && address < TASK_SIZE && (error_code & DSISR_PROTFAULT) &&
+	// Kernel fault on kernel address is bad
+	if (address >= TASK_SIZE)
+		return true;
+
+	if (!is_exec && (error_code & DSISR_PROTFAULT) &&
 	    !search_exception_tables(regs->nip)) {
 		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
 				    address,
 				    from_kuid(&init_user_ns, current_uid()));
 	}
 
-	// Kernel fault on kernel address is bad
-	if (address >= TASK_SIZE)
-		return true;
-
 	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
 	if (!search_exception_tables(regs->nip))
 		return true;
-- 
2.25.0
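
Testing "address >= TASK_SIZE" first only removes the now-redundant "address < TASK_SIZE" term from the warning condition, so no outcome should change. A small sketch that checks this exhaustively on a boolean model of the part of bad_kernel_fault() shown in the hunk (after the earlier instruction-fetch check). The struct, the function names and the mapping of inputs to the real tests are assumptions of the sketch, which ignores rate limiting and message contents.

```c
#include <stdbool.h>

/* Simplified outcome: would the "exploit attempt?" message be printed,
 * and is the fault treated as bad? */
struct demo_verdict {
	bool warned;
	bool bad;
};

/* Inputs stand in for the real tests:
 *   kernel_addr -> address >= TASK_SIZE
 *   protfault   -> error_code & DSISR_PROTFAULT
 *   in_extable  -> search_exception_tables(regs->nip) != NULL
 *   kuap        -> bad_kuap_fault(regs, address, is_write) */
static struct demo_verdict demo_old(bool is_exec, bool kernel_addr,
				    bool protfault, bool in_extable, bool kuap)
{
	struct demo_verdict v = { false, false };

	if (!is_exec && !kernel_addr && protfault && !in_extable)
		v.warned = true;
	v.bad = kernel_addr || !in_extable || kuap;
	return v;
}

static struct demo_verdict demo_new(bool is_exec, bool kernel_addr,
				    bool protfault, bool in_extable, bool kuap)
{
	struct demo_verdict v = { false, true };

	if (kernel_addr)
		return v;	/* bad, and no message */
	if (!is_exec && protfault && !in_extable)
		v.warned = true;
	v.bad = !in_extable || kuap;
	return v;
}

/* Compare both orders on all 2^5 input combinations. */
static bool demo_orders_agree(void)
{
	for (unsigned int m = 0; m < 32; m++) {
		bool is_exec = m & 1, kernel_addr = m & 2, protfault = m & 4;
		bool in_extable = m & 8, kuap = m & 16;
		struct demo_verdict o = demo_old(is_exec, kernel_addr, protfault,
						 in_extable, kuap);
		struct demo_verdict n = demo_new(is_exec, kernel_addr, protfault,
						 in_extable, kuap);

		if (o.warned != n.warned || o.bad != n.bad)
			return false;
	}
	return true;
}
```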



* [PATCH v1 5/5] powerpc/fault: Perform exception fixup in do_page_fault()
From: Christophe Leroy @ 2020-08-06 17:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.leroy@csgroup.eu>

Exception fixup doesn't require the heavy full register save, so do it
from do_page_fault() directly.

To do that, split bad_page_fault() in two parts.

As bad_page_fault() can also be called from places other than
handle_page_fault(), it still performs exception fixup and falls
back on __bad_page_fault().

handle_page_fault() calls __bad_page_fault() directly, as the
exception fixup is now done by do_page_fault().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/entry_32.S       |  2 +-
 arch/powerpc/kernel/exceptions-64e.S |  2 +-
 arch/powerpc/kernel/exceptions-64s.S |  2 +-
 arch/powerpc/mm/fault.c              | 33 ++++++++++++++++++++--------
 4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index f4d0af8e1136..c198786591f9 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -678,7 +678,7 @@ handle_page_fault:
 	mr	r5,r3
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	lwz	r4,_DAR(r1)
-	bl	bad_page_fault
+	bl	__bad_page_fault
 	b	ret_from_except_full
 
 #ifdef CONFIG_PPC_BOOK3S_32
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index d9ed79415100..dd9161ea5da8 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1024,7 +1024,7 @@ storage_fault_common:
 	mr	r5,r3
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	ld	r4,_DAR(r1)
-	bl	bad_page_fault
+	bl	__bad_page_fault
 	b	ret_from_except
 
 /*
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index f7d748b88705..2cb3bcfb896d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -3254,7 +3254,7 @@ handle_page_fault:
 	mr	r5,r3
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	ld	r4,_DAR(r1)
-	bl	bad_page_fault
+	bl	__bad_page_fault
 	b	interrupt_return
 
 /* We have a data breakpoint exception - handle it */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index edde169ba3a6..bd6e397eb84a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -542,10 +542,20 @@ NOKPROBE_SYMBOL(__do_page_fault);
 int do_page_fault(struct pt_regs *regs, unsigned long address,
 		  unsigned long error_code)
 {
+	const struct exception_table_entry *entry;
 	enum ctx_state prev_state = exception_enter();
 	int rc = __do_page_fault(regs, address, error_code);
 	exception_exit(prev_state);
-	return rc;
+	if (likely(!rc))
+		return 0;
+
+	entry = search_exception_tables(regs->nip);
+	if (unlikely(!entry))
+		return rc;
+
+	instruction_pointer_set(regs, extable_fixup(entry));
+
+	return 0;
 }
 NOKPROBE_SYMBOL(do_page_fault);
 
@@ -554,17 +564,10 @@ NOKPROBE_SYMBOL(do_page_fault);
  * It is called from the DSI and ISI handlers in head.S and from some
  * of the procedures in traps.c.
  */
-void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
+void __bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
 {
-	const struct exception_table_entry *entry;
 	int is_write = page_fault_is_write(regs->dsisr);
 
-	/* Are we prepared to handle this fault?  */
-	if ((entry = search_exception_tables(regs->nip)) != NULL) {
-		regs->nip = extable_fixup(entry);
-		return;
-	}
-
 	/* kernel has accessed a bad area */
 
 	switch (TRAP(regs)) {
@@ -598,3 +601,15 @@ void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
 
 	die("Kernel access of bad area", regs, sig);
 }
+
+void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
+{
+	const struct exception_table_entry *entry;
+
+	/* Are we prepared to handle this fault?  */
+	entry = search_exception_tables(instruction_pointer(regs));
+	if (entry)
+		instruction_pointer_set(regs, extable_fixup(entry));
+	else
+		__bad_page_fault(regs, address, sig);
+}
-- 
2.25.0
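
A user-space sketch of the new control flow in do_page_fault(): return early when the fault was handled, otherwise look up the faulting instruction in the exception table and, if an entry exists, resume at its fixup address. The table layout, the addresses and every demo_* name are hypothetical; the kernel's real entries are laid out differently and are searched by search_exception_tables(). The non-zero return value models what __do_page_fault() hands back, which the assembly passes on to __bad_page_fault() as its sig argument (mr r5,r3).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified exception-table entry. */
struct demo_extable_entry {
	unsigned long insn;	/* address of an instruction allowed to fault */
	unsigned long fixup;	/* address to resume at instead */
};

struct demo_regs {
	unsigned long nip;	/* faulting instruction address */
};

/* Hypothetical table, sorted by insn so bsearch() can be used. */
static const struct demo_extable_entry demo_extable[] = {
	{ 0x1000UL, 0x9000UL },
	{ 0x2000UL, 0x9100UL },
	{ 0x3000UL, 0x9200UL },
};

static int demo_cmp_entry(const void *key, const void *elt)
{
	unsigned long addr = *(const unsigned long *)key;
	const struct demo_extable_entry *e = elt;

	if (addr < e->insn)
		return -1;
	return addr > e->insn;	/* 1 if greater, 0 if equal */
}

static const struct demo_extable_entry *demo_search(unsigned long addr)
{
	return bsearch(&addr, demo_extable,
		       sizeof(demo_extable) / sizeof(demo_extable[0]),
		       sizeof(demo_extable[0]), demo_cmp_entry);
}

/* rc is what the fault handler proper returned (0 means handled). */
static int demo_do_page_fault(struct demo_regs *regs, int rc)
{
	const struct demo_extable_entry *entry;

	if (rc == 0)
		return 0;

	entry = demo_search(regs->nip);
	if (!entry)
		return rc;	/* caller goes on to __bad_page_fault() */

	regs->nip = entry->fixup;	/* resume at the fixup code */
	return 0;
}
```

Doing the lookup here means only faults with no fixup reach the assembly path that saves the full register set, which is the saving the commit message describes.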



* [Bug 207359] MegaRAID SAS 9361 controller hang/reset
From: bugzilla-daemon @ 2020-08-06 17:56 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-207359-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=207359

--- Comment #4 from Cameron (cam@neo-zeon.de) ---
I converted the box's filesystems from BTRFS to XFS and switched the page size
from 4k to 64k. The problem appears to be entirely gone now: I can run 5.7.13
without issue, whereas I had verified that the same kernel showed the
megaraid_sas controller hang with my previous BTRFS + 4k page configuration.

Unfortunately, the conversion took a great deal of time, and I couldn't keep
the box down even longer to test whether switching to XFS or to 64k pages alone
resolves the issue. All I can say for certain is that switching to XFS, to a
64k page size, or both has fixed the problem for me.

The backup volume is a single SATA disk that still uses BTRFS (for
snapshotting), and it is not giving me any trouble. But if this is related
to https://bugzilla.kernel.org/show_bug.cgi?id=206123, that may not be
conclusive, since SATA disks may not trigger the issue. The single disk also
can't push as much I/O as the RAID10 volume, which may be another reason.

My quasi-educated, non-kernel-dev guess is that this is a bug related to the
4k page size. It is possible, but unknown, whether the regular behavior of
BTRFS exacerbates it (making it easier to reproduce).

Hopefully someone else encountering this issue will find this helpful.



* Re: [PATCH v8 5/8] powerpc/vdso: Prepare for switching VDSO to generic C implementation.
From: Segher Boessenkool @ 2020-08-06 18:33 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, nathanl, arnd, linux-kernel,
	Tulio Magno Quites Machado Filho, Paul Mackerras, luto,
	linux-arch, tglx, vincenzo.frascino, linuxppc-dev
In-Reply-To: <87r1sky1hm.fsf@mpe.ellerman.id.au>

Hi!

On Thu, Aug 06, 2020 at 12:03:33PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Wed, Aug 05, 2020 at 04:24:16PM +1000, Michael Ellerman wrote:
> >> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> >> > Indeed, 32-bit doesn't have a redzone, so I believe it needs a stack 
> >> > frame whenever it has anything to same.

^^^

> >> >     fbb60:	94 21 ff e0 	stwu    r1,-32(r1)
> >
> > This is the *only* place where you can use a negative offset from r1:
> > in the stwu to extend the stack (set up a new stack frame, or make the
> > current one bigger).
> 
> (You're talking about 32-bit code here right?)

The "SYSV" ELF binding, yeah, which is used for 32-bit on Linux (give or
take, ho hum).

The ABIs that have a red zone are much nicer here (but less simple) :-)

> >> At the same time it's much safer for us to just save/restore r2, and
> >> probably in the noise performance wise.
> >
> > If you want a function to be able to work with ABI-compliant code safely
> > (in all cases), you'll have to make it itself ABI-compliant as well,
> > yes :-)
> 
> True. Except this is the VDSO which has previously been a bit wild west
> as far as ABI goes :)

It could get away with many things because it was guaranteed to be a
leaf function.  Some of those things even violate the ABIs, but you can
get away with that easily because the scope is much reduced.  Now if this
is generated code, violating the rules will catch up with you sooner rather
than later ;-)


Segher


* Re: [PATCH v1 5/5] powerpc/fault: Perform exception fixup in do_page_fault()
From: kernel test robot @ 2020-08-06 21:07 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, kbuild-all, linux-kernel
In-Reply-To: <5748c8f5cf0a9b3686169e2c7709107e6aaec408.1596734105.git.christophe.leroy@csgroup.eu>

[-- Attachment #1: Type: text/plain, Size: 3059 bytes --]

Hi Christophe,

I love your patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v5.8 next-20200806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-mm-sanity_check_fault-should-work-for-all-not-only-BOOK3S/20200807-012433
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/powerpc/mm/fault.c:567:6: warning: no previous prototype for '__bad_page_fault' [-Wmissing-prototypes]
     567 | void __bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
         |      ^~~~~~~~~~~~~~~~

vim +/__bad_page_fault +567 arch/powerpc/mm/fault.c

   561	
   562	/*
   563	 * bad_page_fault is called when we have a bad access from the kernel.
   564	 * It is called from the DSI and ISI handlers in head.S and from some
   565	 * of the procedures in traps.c.
   566	 */
 > 567	void __bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
   568	{
   569		int is_write = page_fault_is_write(regs->dsisr);
   570	
   571		/* kernel has accessed a bad area */
   572	
   573		switch (TRAP(regs)) {
   574		case 0x300:
   575		case 0x380:
   576		case 0xe00:
   577			pr_alert("BUG: %s on %s at 0x%08lx\n",
   578				 regs->dar < PAGE_SIZE ? "Kernel NULL pointer dereference" :
   579				 "Unable to handle kernel data access",
   580				 is_write ? "write" : "read", regs->dar);
   581			break;
   582		case 0x400:
   583		case 0x480:
   584			pr_alert("BUG: Unable to handle kernel instruction fetch%s",
   585				 regs->nip < PAGE_SIZE ? " (NULL pointer?)\n" : "\n");
   586			break;
   587		case 0x600:
   588			pr_alert("BUG: Unable to handle kernel unaligned access at 0x%08lx\n",
   589				 regs->dar);
   590			break;
   591		default:
   592			pr_alert("BUG: Unable to handle unknown paging fault at 0x%08lx\n",
   593				 regs->dar);
   594			break;
   595		}
   596		printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n",
   597			regs->nip);
   598	
   599		if (task_stack_end_corrupted(current))
   600			printk(KERN_ALERT "Thread overran stack, or stack corrupted\n");
   601	
   602		die("Kernel access of bad area", regs, sig);
   603	}
   604	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 69763 bytes --]
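
-Wmissing-prototypes fires for a function with external linkage that is defined without an earlier declaration. __bad_page_fault() cannot simply be made static, because entry_32.S and exceptions-64*.S now call it with "bl __bad_page_fault", so the usual remedy is a prototype in a header that fault.c includes. Which header a follow-up version uses is not shown in this thread; the sketch keeps the declaration in the same file only to stay self-contained, and demo_bad_page_fault and struct demo_regs are hypothetical names.

```c
#include <stdio.h>

struct demo_regs {
	unsigned long nip;
};

/* Prototype that would normally live in a shared header. Because it is
 * visible before the definition, gcc -Wmissing-prototypes stays quiet even
 * though the function keeps external linkage (callable from assembly). */
void demo_bad_page_fault(struct demo_regs *regs, unsigned long address, int sig);

void demo_bad_page_fault(struct demo_regs *regs, unsigned long address, int sig)
{
	printf("bad access at 0x%08lx from 0x%08lx (sig %d)\n",
	       address, regs->nip, sig);
}
```

Compiling this with gcc -Wmissing-prototypes should produce no warning; deleting the prototype line should reproduce the same class of warning as the one reported above.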


* Re: [PATCH] powerpc/perf: Account for interrupts during PMC overflow for an invalid SIAR check
From: Alexey Kardashevskiy @ 2020-08-06 23:51 UTC (permalink / raw)
  To: Athira Rajeev, mpe; +Cc: maddy, linuxppc-dev
In-Reply-To: <1596717992-7321-1-git-send-email-atrajeev@linux.vnet.ibm.com>



On 06/08/2020 22:46, Athira Rajeev wrote:
> Performance monitor interrupt handler checks if any counter has overflown
> and calls `record_and_restart` in core-book3s which invokes
> `perf_event_overflow` to record the sample information.
> Apart from creating sample, perf_event_overflow also does the interrupt
> and period checks via perf_event_account_interrupt.
> 
> Currently we record information only if the SIAR valid bit is set
> ( using `siar_valid` check ) and hence the interrupt check.
> But it is possible that we do sampling for some events that are not
> generating valid SIAR and hence there is no chance to disable the event
> if interrupts is more than max_samples_per_tick. This leads to soft lockup.
> 
> Fix this by adding perf_event_account_interrupt in the invalid siar
> code path for a sampling event. ie if siar is invalid, just do interrupt
> check and don't record the sample information.
> 
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>



Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
>  arch/powerpc/perf/core-book3s.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 01d7028..626e587 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2101,6 +2101,10 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>  
>  		if (perf_event_overflow(event, &data, regs))
>  			power_pmu_stop(event, 0);
> +	} else if (period) {
> +		/* Account for interrupt incase of invalid siar */
> +		if (perf_event_account_interrupt(event))
> +			power_pmu_stop(event, 0);
>  	}
>  }
>  
> 

-- 
Alexey
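
The fix relies on every overflow interrupt of a sampling event counting against the per-tick limit, even when no sample is recorded. A simplified model, not the kernel's perf core: demo_event, demo_account_interrupt() and max_per_tick are hypothetical stand-ins for the event state, perf_event_account_interrupt() and max_samples_per_tick. The event is assumed to be a sampling event (non-zero period), and the per-tick counter is never reset because the model covers a single tick.

```c
#include <stdbool.h>

struct demo_event {
	unsigned int interrupts;	/* interrupts seen in the current tick */
	unsigned int max_per_tick;	/* stand-in for max_samples_per_tick */
	bool stopped;			/* set once the event is throttled */
};

/* Returns true when the event should be stopped (throttled). */
static bool demo_account_interrupt(struct demo_event *ev)
{
	return ++ev->interrupts >= ev->max_per_tick;
}

/* One PMC overflow, following the shape of record_and_restart().
 * patched selects whether the new "else if (period)" branch exists. */
static void demo_overflow(struct demo_event *ev, bool siar_valid, bool patched)
{
	if (ev->stopped)
		return;

	if (siar_valid) {
		/* A sample would be recorded here; perf_event_overflow()
		 * also accounts the interrupt. */
		if (demo_account_interrupt(ev))
			ev->stopped = true;
	} else if (patched) {
		/* New branch: no sample, but the interrupt still counts. */
		if (demo_account_interrupt(ev))
			ev->stopped = true;
	}
}
```

Without the new branch, an event whose overflows never produce a valid SIAR is never counted and therefore never throttled, which is the soft-lockup scenario described in the patch.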


* Re: [PATCH] powerpc/book3s64/radix: Make radix_mem_block_size 64bit
From: Michael Ellerman @ 2020-08-07  1:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev
In-Reply-To: <878sesq6yl.fsf@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> Similar to commit: 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic")
>>> make sure the different variables tracking lmb_size are updated
>>> to be 64 bit.
>>
>> That commit went to all stable releases, should this one also?
>>
>
> radix_mem_block_size got added recently and it is not yet upstream. But
> the drmem_lmb_info change can be a stable candidate. We also need this:
>
> Shall I split this into two patches?

Yes, sounds good.

cheers

> modified   arch/powerpc/include/asm/drmem.h
> @@ -67,7 +67,7 @@ struct of_drconf_cell_v2 {
>  #define DRCONF_MEM_RESERVED	0x00000080
>  #define DRCONF_MEM_HOTREMOVABLE	0x00000100
>  
> -static inline u32 drmem_lmb_size(void)
> +static inline u64 drmem_lmb_size(void)
>  {
>  	return drmem_info->lmb_size;
>  }
>
> -aneesh
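
Why the width matters: a block size that does not fit in 32 bits is silently truncated when stored in a u32 or unsigned int. A minimal user-space sketch, assuming a hypothetical 4 GiB block size chosen only because it exceeds UINT32_MAX; it does not use the kernel's variables. On 64-bit powerpc, unsigned long is 64 bits wide, which is why it can hold such a value.

```c
#include <stdint.h>

/* Hypothetical 4 GiB block size: one more than UINT32_MAX. */
#define DEMO_BLOCK_SIZE (UINT64_C(1) << 32)

/* Conversion to a 32-bit unsigned type keeps only the low 32 bits,
 * which are all zero for this value. */
static uint32_t demo_store_u32(uint64_t size)
{
	return (uint32_t)size;
}

/* A 64-bit variable keeps the full value. */
static uint64_t demo_store_u64(uint64_t size)
{
	return size;
}

/* Number of blocks covering mem_size; guarded so the sketch itself never
 * divides by zero when handed a truncated size. */
static uint64_t demo_nr_blocks(uint64_t mem_size, uint64_t block_size)
{
	return block_size ? mem_size / block_size : 0;
}
```

With a 64 GiB range, the 64-bit size yields 16 blocks, while the truncated 32-bit size yields none, so any later size arithmetic built on it is meaningless.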


* Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc
From: Athira Rajeev @ 2020-08-07  2:11 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ravi Bangoria, Michael Neuling, maddy, kajoljain, linuxppc-dev,
	Jiri Olsa, Jiri Olsa
In-Reply-To: <20200806122052.GC71359@kernel.org>



> On 06-Aug-2020, at 5:50 PM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> Em Fri, Jul 31, 2020 at 11:04:14PM +0530, Athira Rajeev escreveu:
>> 
>> 
>>> On 31-Jul-2020, at 1:20 AM, Jiri Olsa <jolsa@redhat.com> wrote:
>>> 
>>> On Thu, Jul 30, 2020 at 01:24:40PM +0530, Athira Rajeev wrote:
>>>> 
>>>> 
>>>>> On 27-Jul-2020, at 10:46 PM, Athira Rajeev <atrajeev@linux.vnet.ibm.com> wrote:
>>>>> 
>>>>> Patch set to add support for perf extended register capability in
>>>>> powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to
>>>>> indicate a PMU that supports extended registers. The generic code
>>>>> defines the mask of extended registers as 0 for unsupported architectures.
>>>>> 
>>>>> Patches 1 and 2 are the kernel side changes needed to include
>>>>> base support for extended regs in powerpc and in power10.
>>>>> Patches 3 and 4 are the perf tools side changes needed to support the
>>>>> extended registers.
>>>>> 
>>>> 
>>>> Hi Arnaldo, Jiri
>>>> 
>>>> please let me know if you have any comments/suggestions on this patch series to add support for perf extended regs.
>>> 
>>> hi,
>>> can't really tell for powerpc, but in general
>>> perf tool changes look ok
>>> 
>> 
>> Hi Jiri,
>> Thanks for checking the patchset.
> 
> So I'd say you submit a v6, split into the kernel part, that probably
> should go via the PPC arch tree, and I can pick the tooling part, ok?
> 
> - Arnaldo

Sure Arnaldo, I will send a v6.

Thanks,
Athira
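
The cover letter quoted above describes two pieces: a PMU capability flag advertising extended-register support, and a generic mask of extended registers that is 0 on architectures without support. A simplified sketch of how those two pieces can combine when a sample-register request is validated; the flag value, the function name and the register bit are hypothetical, not the kernel's definitions, and other validation of register bits is out of scope.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit, not the kernel's value. */
#define DEMO_PMU_CAP_EXTENDED_REGS 0x8u

/* extended_mask models the per-architecture mask of extended registers;
 * an architecture without support leaves it at 0, so no requested bit
 * ever counts as "extended" there. */
static int demo_check_sample_regs(unsigned int pmu_caps, uint64_t sample_regs,
				  uint64_t extended_mask)
{
	bool wants_extended = (sample_regs & extended_mask) != 0;

	/* Reject extended registers unless the PMU advertises support. */
	if (wants_extended && !(pmu_caps & DEMO_PMU_CAP_EXTENDED_REGS))
		return -EOPNOTSUPP;
	return 0;
}
```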


