LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v10 2/5] powerpc/vdso: Prepare for switching VDSO to generic C implementation.
From: Segher Boessenkool @ 2020-08-05 20:55 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: nathanl, linux-arch, vincenzo.frascino, arnd, linux-kernel,
	Paul Mackerras, luto, tglx, linuxppc-dev
In-Reply-To: <7d409421-6396-8eba-8250-b6c9ff8232d9@csgroup.eu>

Hi!

On Wed, Aug 05, 2020 at 06:51:44PM +0200, Christophe Leroy wrote:
> Le 05/08/2020 à 16:03, Segher Boessenkool a écrit :
> >On Wed, Aug 05, 2020 at 07:09:23AM +0000, Christophe Leroy wrote:
> >>+/*
> >>+ * The macros sets two stack frames, one for the caller and one for the 
> >>callee
> >>+ * because there are no requirement for the caller to set a stack frame 
> >>when
> >>+ * calling VDSO so it may have omitted to set one, especially on PPC64
> >>+ */
> >
> >If the caller follows the ABI, there always is a stack frame.  So what
> >is going on?
> 
> Looks like it is not the case. See discussion at 
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2a67c333893454868bbfda773ba4b01c20272a5d.1588079622.git.christophe.leroy@c-s.fr/
> 
> Seems like GCC uses the redzone and doesn't set a stack frame. I guess 
> it doesn't know that the inline assembly contains a function call so it 
> doesn't set the frame.

Yes, that is the problem.  See
https://gcc.gnu.org/onlinedocs/gcc-10.2.0/gcc/Extended-Asm.html#AssemblerTemplate
where this is (briefly) discussed:
  "Accessing data from C programs without using input/output operands
  (such as by using global symbols directly from the assembler
  template) may not work as expected. Similarly, calling functions
  directly from an assembler template requires a detailed understanding
  of the target assembler and ABI."

I don't know of a good way to tell GCC some function needs a frame (that
is, one that doesn't result in extra code other than to set up the
frame).  I'll think about it.

Segher

^ permalink raw reply

* How would I code a Write to a proc file from within the kernel without reading anything from user space?
From: thefirst ECS @ 2020-08-05 21:52 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <3103222.25048.1596318084416.JavaMail.Dan@DanHP>

In order to help debug a certain discrepancy, I need to "simulate" an "echo 1 > /proc/file" but doing it from kernel even when root file system is unavailable. 

I have simulated it just fine via call_usermodehelper (with argv etc of "echo 1 > /proc/file") from inside the kernel which triggers:
[ee897ec0] [c0122704] proc_reg_write+0x80/0xb4
[ee897ef0] [c00d3b7c] vfs_write+0xb4/0x184

just as I had wanted. But now I need to trigger the "vfs_write" and "proc_reg_write" but without using call_usermodehelper since I will be doing it when root "/" is unavailable and so I can no longer access /bin/echo and call the usermodehelper etc. So my question is how can I do that in kernel?

Not sure if I'm supposed to look in fs read_write.c and fs.h and write a method based on those or if there's some other way etc.

Thanks,
-Dan

^ permalink raw reply

* Re: [PATCH] macintosh: windfarm: fix spelling mistake "detatch" -> "detach"
From: Wolfram Sang @ 2020-08-05 22:08 UTC (permalink / raw)
  To: Colin King; +Cc: kernel-janitors, linuxppc-dev, linux-kernel
In-Reply-To: <20200805104337.16104-1-colin.king@canonical.com>

[-- Attachment #1: Type: text/plain, Size: 375 bytes --]

On Wed, Aug 05, 2020 at 11:43:37AM +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> There are spelling mistakes in DBG messages. Fix them.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

These comments can go entirely. i2c_detach is long history. And for
remove, we have debugging output in the driver core meanwhile.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
From: Michael Roth @ 2020-08-05 22:29 UTC (permalink / raw)
  To: Greg Kurz, Michael Ellerman
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater,
	Thiago Jung Bauermann
In-Reply-To: <159660225263.15440.2633856149684894440@sif>

Quoting Michael Roth (2020-08-04 23:37:32)
> Quoting Michael Ellerman (2020-08-04 22:07:08)
> > Greg Kurz <groug@kaod.org> writes:
> > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > >> least twice now, continuing when we don't know the state of the other
> > >> CPU can lead to straight up crashes.
> > >> 
> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > >> spinning rather than oopsing.
> > >> 
> > >
> > > +1
> > >
> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > >> good. But then we will probably hit that on some big and/or heavily
> > >> loaded machine.
> > >> 
> > >> So possibly we should call cond_resched() but have some custom logic in
> > >> the loop to print a warning if we are stuck for more than some
> > >> sufficiently long amount of time.
> > >
> > > How long should that be ?
> > 
> > Yeah good question.
> > 
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> > 
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> > 
> > Maybe a minute or two?
> 
> Hmm, so I took a stab at this where I called cond_resched() after
> every 5 seconds of polling and printed a warning at the same time (FWIW
> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> system using KVM running the 384vcpu unplug loop)
> 
> But it sounds like that's not quite what you had in mind. How frequently
> do you think we should call cond_resched()? Maybe after 25 iterations
> of polling smp_query_cpu_stopped() to keep original behavior somewhat
> similar?
> 
> I'll let the current patch run on the mihawk system overnight in the
> meantime so we at least have that data point, but would be good to
> know what things look like a large pHyp machine.

At one point I did manage to get the system in a state where unplug
operations were taking 1-2s, but still not enough to trigger any
5s warning, and I wasn't able to reproduce that in subsequent runs.

I also tried reworking the patch so that we print a warning and
cond_resched() after 200 ms to make sure that path gets executed, but
only managed to trigger the warning twice after a few hours.

So, if we print a warning after a couple minutes, that seems pretty
conservative as far as avoiding spurious warnings. And if we
cond_resched() after 25 loops of polling (~0.1 ms in the cases
that caused the original crash), that would avoid most of the
default RCU/lockup warnings.

But having a second timeout to trigger the cond_resched() after some
set interval like 2s seems more deterministic since we're less
susceptible to longer delays due to things like the RTAS calls
contending for QEMU's global mutex in the the KVM case.


> 
> Thanks!
> 
> > 
> > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > >> 
> > >> This is not public.
> > >
> > > I'll have a look at changing that.
> > 
> > Thanks.
> > 
> > cheers

^ permalink raw reply

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
From: Michael Roth @ 2020-08-05 22:31 UTC (permalink / raw)
  To: Greg Kurz, Michael Ellerman
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater,
	Thiago Jung Bauermann
In-Reply-To: <159666656828.15440.9097316124875217814@sif>

Quoting Michael Roth (2020-08-05 17:29:28)
> Quoting Michael Roth (2020-08-04 23:37:32)
> > Quoting Michael Ellerman (2020-08-04 22:07:08)
> > > Greg Kurz <groug@kaod.org> writes:
> > > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > > >> least twice now, continuing when we don't know the state of the other
> > > >> CPU can lead to straight up crashes.
> > > >> 
> > > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > > >> spinning rather than oopsing.
> > > >> 
> > > >
> > > > +1
> > > >
> > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > > >> good. But then we will probably hit that on some big and/or heavily
> > > >> loaded machine.
> > > >> 
> > > >> So possibly we should call cond_resched() but have some custom logic in
> > > >> the loop to print a warning if we are stuck for more than some
> > > >> sufficiently long amount of time.
> > > >
> > > > How long should that be ?
> > > 
> > > Yeah good question.
> > > 
> > > I guess step one would be seeing how long it can take on the 384 vcpu
> > > machine. And we can probably test on some other big machines.
> > > 
> > > Hopefully Nathan can give us some idea of how long he's seen it take on
> > > large systems? I know he was concerned about the 20s timeout of the
> > > softlockup detector.
> > > 
> > > Maybe a minute or two?
> > 
> > Hmm, so I took a stab at this where I called cond_resched() after
> > every 5 seconds of polling and printed a warning at the same time (FWIW
> > that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> > system using KVM running the 384vcpu unplug loop)
> > 
> > But it sounds like that's not quite what you had in mind. How frequently
> > do you think we should call cond_resched()? Maybe after 25 iterations
> > of polling smp_query_cpu_stopped() to keep original behavior somewhat
> > similar?
> > 
> > I'll let the current patch run on the mihawk system overnight in the
> > meantime so we at least have that data point, but would be good to
> > know what things look like a large pHyp machine.
> 
> At one point I did manage to get the system in a state where unplug
> operations were taking 1-2s, but still not enough to trigger any
> 5s warning, and I wasn't able to reproduce that in subsequent runs.
> 
> I also tried reworking the patch so that we print a warning and
> cond_resched() after 200 ms to make sure that path gets executed, but
> only managed to trigger the warning twice after a few hours.
> 
> So, if we print a warning after a couple minutes, that seems pretty
> conservative as far as avoiding spurious warnings. And if we
> cond_resched() after 25 loops of polling (~0.1 ms in the cases

~0.1 seconds I mean

> that caused the original crash), that would avoid most of the
> default RCU/lockup warnings.
> 
> But having a second timeout to trigger the cond_resched() after some
> set interval like 2s seems more deterministic since we're less
> susceptible to longer delays due to things like the RTAS calls
> contending for QEMU's global mutex in the the KVM case.
> 
> 
> > 
> > Thanks!
> > 
> > > 
> > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > > >> 
> > > >> This is not public.
> > > >
> > > > I'll have a look at changing that.
> > > 
> > > Thanks.
> > > 
> > > cheers

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
From: Tyrel Datwyler @ 2020-08-05 22:37 UTC (permalink / raw)
  To: Scott Cheloha, linuxppc-dev; +Cc: Nathan Lynch
In-Reply-To: <20200727184605.2945095-1-cheloha@linux.ibm.com>

On 7/27/20 11:46 AM, Scott Cheloha wrote:
> The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are
> useful to modules outside of perf/, so move them into asm/hvcall.h to live
> alongside the other powerpc hypercall structs.
> 
> Leave the perf-specific GPCI stuff in perf/hv-gpci.h.
> 
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/hvcall.h | 36 +++++++++++++++++++++++++++++++
>  arch/powerpc/perf/hv-gpci.c       |  9 --------
>  arch/powerpc/perf/hv-gpci.h       | 27 -----------------------
>  3 files changed, 36 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index e90c073e437e..c338480b4551 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -527,6 +527,42 @@ struct hv_guest_state {
>  /* Latest version of hv_guest_state structure */
>  #define HV_GUEST_STATE_VERSION	1
>  
> +/*
> + * From the document "H_GetPerformanceCounterInfo Interface" v1.07
> + *
> + * H_GET_PERF_COUNTER_INFO argument
> + */
> +struct hv_get_perf_counter_info_params {
> +	__be32 counter_request; /* I */
> +	__be32 starting_index;  /* IO */
> +	__be16 secondary_index; /* IO */
> +	__be16 returned_values; /* O */
> +	__be32 detail_rc; /* O, only needed when called via *_norets() */
> +
> +	/*
> +	 * O, size each of counter_value element in bytes, only set for version
> +	 * >= 0x3
> +	 */
> +	__be16 cv_element_size;
> +
> +	/* I, 0 (zero) for versions < 0x3 */
> +	__u8 counter_info_version_in;
> +
> +	/* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */
> +	__u8 counter_info_version_out;
> +	__u8 reserved[0xC];
> +	__u8 counter_value[];
> +} __packed;
> +
> +#define HGPCI_REQ_BUFFER_SIZE	4096
> +#define HGPCI_MAX_DATA_BYTES \
> +	(HGPCI_REQ_BUFFER_SIZE - sizeof(struct hv_get_perf_counter_info_params))
> +
> +struct hv_gpci_request_buffer {
> +	struct hv_get_perf_counter_info_params params;
> +	uint8_t bytes[HGPCI_MAX_DATA_BYTES];
> +} __packed;
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_HVCALL_H */
> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
> index 6884d16ec19b..1667315b82e9 100644
> --- a/arch/powerpc/perf/hv-gpci.c
> +++ b/arch/powerpc/perf/hv-gpci.c
> @@ -123,17 +123,8 @@ static const struct attribute_group *attr_groups[] = {
>  	NULL,
>  };
>  
> -#define HGPCI_REQ_BUFFER_SIZE	4096
> -#define HGPCI_MAX_DATA_BYTES \
> -	(HGPCI_REQ_BUFFER_SIZE - sizeof(struct hv_get_perf_counter_info_params))
> -
>  static DEFINE_PER_CPU(char, hv_gpci_reqb[HGPCI_REQ_BUFFER_SIZE]) __aligned(sizeof(uint64_t));
>  
> -struct hv_gpci_request_buffer {
> -	struct hv_get_perf_counter_info_params params;
> -	uint8_t bytes[HGPCI_MAX_DATA_BYTES];
> -} __packed;
> -
>  static unsigned long single_gpci_request(u32 req, u32 starting_index,
>  		u16 secondary_index, u8 version_in, u32 offset, u8 length,
>  		u64 *value)
> diff --git a/arch/powerpc/perf/hv-gpci.h b/arch/powerpc/perf/hv-gpci.h
> index a3053eda5dcc..4d108262bed7 100644
> --- a/arch/powerpc/perf/hv-gpci.h
> +++ b/arch/powerpc/perf/hv-gpci.h
> @@ -2,33 +2,6 @@
>  #ifndef LINUX_POWERPC_PERF_HV_GPCI_H_
>  #define LINUX_POWERPC_PERF_HV_GPCI_H_
>  
> -#include <linux/types.h>
> -
> -/* From the document "H_GetPerformanceCounterInfo Interface" v1.07 */
> -
> -/* H_GET_PERF_COUNTER_INFO argument */
> -struct hv_get_perf_counter_info_params {
> -	__be32 counter_request; /* I */
> -	__be32 starting_index;  /* IO */
> -	__be16 secondary_index; /* IO */
> -	__be16 returned_values; /* O */
> -	__be32 detail_rc; /* O, only needed when called via *_norets() */
> -
> -	/*
> -	 * O, size each of counter_value element in bytes, only set for version
> -	 * >= 0x3
> -	 */
> -	__be16 cv_element_size;
> -
> -	/* I, 0 (zero) for versions < 0x3 */
> -	__u8 counter_info_version_in;
> -
> -	/* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */
> -	__u8 counter_info_version_out;
> -	__u8 reserved[0xC];
> -	__u8 counter_value[];
> -} __packed;
> -

Hmm, this pretty much guts this header which normally I'd be inclined to suggest
moving the whole thing. The remainder of that header for context:

>  /*
>   * counter info version => fw version/reference (spec version)
>   *
>
 * 8 => power8 (1.07)
 * [7 is skipped by spec 1.07]
 * 6 => TLBIE (1.07)
 * 5 => v7r7m0.phyp (1.05)
 * [4 skipped]
 * 3 => v7r6m0.phyp (?)
 * [1,2 skipped]
 * 0 => v7r{2,3,4}m0.phyp (?)
 */
#define COUNTER_INFO_VERSION_CURRENT 0x8

/* capability mask masks. */
enum {
        HV_GPCI_CM_GA = (1 << 7),
        HV_GPCI_CM_EXPANDED = (1 << 6),
        HV_GPCI_CM_LAB = (1 << 5)
};

#define REQUEST_FILE "../hv-gpci-requests.h"
#define NAME_LOWER hv_gpci
#define NAME_UPPER HV_GPCI
#include "req-gen/perf.h"
#undef REQUEST_FILE
#undef NAME_LOWER
#undef NAME_UPPER

The side effect of moving seems that we would have to drag "hv-gpci-requests.h"
along as well. So, maybe its best just moving the struct as you've done so it
can be used by code outside of perf.

-Tyrel

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
From: Tyrel Datwyler @ 2020-08-05 22:42 UTC (permalink / raw)
  To: Scott Cheloha, linuxppc-dev; +Cc: Nathan Lynch
In-Reply-To: <20200727184605.2945095-2-cheloha@linux.ibm.com>

On 7/27/20 11:46 AM, Scott Cheloha wrote:
> The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall,
> Affinity_Domain_Info_By_Partition, which returns, among other things,
> a "partition affinity score" for a given LPAR.  This score, a value on
> [0-100], represents the processor-memory affinity for the LPAR in
> question.  A score of 0 indicates the worst possible affinity while a
> score of 100 indicates perfect affinity.  The score can be used to
> reason about performance.
> 
> This patch adds the score for the local LPAR to the lparcfg procfile
> under a new 'partition_affinity_score' key.
> 
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>

I was hoping Michael would chime in the first time around on this patch series
about adding another key/value pair to lparcfg. So, barring a NACK from mpe:

Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>

> ---
>  arch/powerpc/platforms/pseries/lparcfg.c | 35 ++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/lparcfg.c b/arch/powerpc/platforms/pseries/lparcfg.c
> index b8d28ab88178..e278390ab28d 100644
> --- a/arch/powerpc/platforms/pseries/lparcfg.c
> +++ b/arch/powerpc/platforms/pseries/lparcfg.c
> @@ -136,6 +136,39 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data)
>  	return rc;
>  }
>  
> +static void show_gpci_data(struct seq_file *m)
> +{
> +	struct hv_gpci_request_buffer *buf;
> +	unsigned int affinity_score;
> +	long ret;
> +
> +	buf = kmalloc(sizeof(*buf), GFP_KERNEL);
> +	if (buf == NULL)
> +		return;
> +
> +	/*
> +	 * Show the local LPAR's affinity score.
> +	 *
> +	 * 0xB1 selects the Affinity_Domain_Info_By_Partition subcall.
> +	 * The score is at byte 0xB in the output buffer.
> +	 */
> +	memset(&buf->params, 0, sizeof(buf->params));
> +	buf->params.counter_request = cpu_to_be32(0xB1);
> +	buf->params.starting_index = cpu_to_be32(-1);	/* local LPAR */
> +	buf->params.counter_info_version_in = 0x5;	/* v5+ for score */
> +	ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf),
> +				 sizeof(*buf));
> +	if (ret != H_SUCCESS) {
> +		pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n",
> +			 ret, be32_to_cpu(buf->params.detail_rc));
> +		goto out;
> +	}
> +	affinity_score = buf->bytes[0xB];
> +	seq_printf(m, "partition_affinity_score=%u\n", affinity_score);
> +out:
> +	kfree(buf);
> +}
> +
>  static unsigned h_pic(unsigned long *pool_idle_time,
>  		      unsigned long *num_procs)
>  {
> @@ -487,6 +520,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
>  			   partition_active_processors * 100);
>  	}
>  
> +	show_gpci_data(m);
> +
>  	seq_printf(m, "partition_active_processors=%d\n",
>  		   partition_active_processors);
>  
> 


^ permalink raw reply

* Re: linux-next: manual merge of the char-misc tree with the powerpc tree
From: Stephen Rothwell @ 2020-08-05 23:40 UTC (permalink / raw)
  To: Greg KH, Arnd Bergmann, Michael Ellerman, PowerPC
  Cc: Linux Next Mailing List, Lee Jones, Linux Kernel Mailing List,
	Alastair D'Silva
In-Reply-To: <20200803165546.6ab5ab6f@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1029 bytes --]

Hi all,

On Mon, 3 Aug 2020 16:55:46 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Today's linux-next merge of the char-misc tree got a conflict in:
> 
>   drivers/misc/ocxl/config.c
> 
> between commit:
> 
>   3591538a31af ("ocxl: Address kernel doc errors & warnings")
> 
> from the powerpc tree and commit:
> 
>   28fc491e9be6 ("misc: ocxl: config: Provide correct formatting to function headers")
> 
> from the char-misc tree.
> 
> I fixed it up (as it was just differences in comments, I just arbitrarily
> chose the latter version) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

This is now a conflict between the powerpc tree and Linus' tree.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v2] powerpc: Warn about use of smt_snooze_delay
From: Joel Stanley @ 2020-08-05 23:57 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Tyrel Datwyler, linuxppc-dev, Gautham R Shenoy
In-Reply-To: <87eeomzknj.fsf@mpe.ellerman.id.au>

On Tue, 4 Aug 2020 at 11:59, Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Joel Stanley <joel@jms.id.au> writes:
> > It's not done anything for a long time. Save the percpu variable, and
> > emit a warning to remind users to not expect it to do anything.
> >
> > Fixes: 3fa8cad82b94 ("powerpc/pseries/cpuidle: smt-snooze-delay cleanup.")
> > Cc: stable@vger.kernel.org # v3.14
> > Signed-off-by: Joel Stanley <joel@jms.id.au>
> > --
> > v2:
> >  Use pr_warn instead of WARN
> >  Reword and print proccess name with pid in message
> >  Leave CPU_FTR_SMT test in
> >  Add Fixes line
> >
> > mpe, if you don't agree then feel free to drop the cc stable.
> >
> > Testing 'ppc64_cpu --smt=off' on a 24 core / 4 SMT system it's quite noisy
> > as the online/offline loop that ppc64_cpu runs is slow.
>
> Hmm, that is pretty spammy.
>
> I know I suggested the ratelimit, but I was thinking it would print once
> for each ppc64_cpu invocation, not ~30 times.
>
> How about pr_warn_once(), that should still be sufficient for people to
> notice if they're looking for it.

I think that's a reasonable suggestion.

>
> ...
> > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > index 571b3259697e..ba6d4cee19ef 100644
> > --- a/arch/powerpc/kernel/sysfs.c
> > +++ b/arch/powerpc/kernel/sysfs.c
> > @@ -32,29 +32,26 @@
> >
> >  static DEFINE_PER_CPU(struct cpu, cpu_devices);
> >
> > -/*
> > - * SMT snooze delay stuff, 64-bit only for now
> > - */
> > -
> >  #ifdef CONFIG_PPC64
> >
> > -/* Time in microseconds we delay before sleeping in the idle loop */
> > -static DEFINE_PER_CPU(long, smt_snooze_delay) = { 100 };
> > +/*
> > + * Snooze delay has not been hooked up since 3fa8cad82b94 ("powerpc/pseries/cpuidle:
> > + * smt-snooze-delay cleanup.") and has been broken even longer. As was foretold in
> > + * 2014:
> > + *
> > + *  "ppc64_util currently utilises it. Once we fix ppc64_util, propose to clean
> > + *  up the kernel code."
> > + *
> > + * At some point in the future this code should be removed.
> > + */
> >
> >  static ssize_t store_smt_snooze_delay(struct device *dev,
> >                                     struct device_attribute *attr,
> >                                     const char *buf,
> >                                     size_t count)
> >  {
> > -     struct cpu *cpu = container_of(dev, struct cpu, dev);
> > -     ssize_t ret;
> > -     long snooze;
> > -
> > -     ret = sscanf(buf, "%ld", &snooze);
> > -     if (ret != 1)
> > -             return -EINVAL;
> > -
> > -     per_cpu(smt_snooze_delay, cpu->dev.id) = snooze;
> > +     pr_warn_ratelimited("%s (%d) used unsupported smt_snooze_delay, this has no effect\n",
> > +                         current->comm, current->pid);
>
> Can we make this:
>
>         "%s (%d) stored to unsupported smt_snooze_delay, which has no effect.\n",

ack

>
>
> >       return count;
> >  }
> >
> > @@ -62,9 +59,9 @@ static ssize_t show_smt_snooze_delay(struct device *dev,
> >                                    struct device_attribute *attr,
> >                                    char *buf)
> >  {
> > -     struct cpu *cpu = container_of(dev, struct cpu, dev);
> > -
> > -     return sprintf(buf, "%ld\n", per_cpu(smt_snooze_delay, cpu->dev.id));
> > +     pr_warn_ratelimited("%s (%d) used unsupported smt_snooze_delay, this has no effect\n",
> > +                         current->comm, current->pid);
>
> It has as much effect as it ever did :)
>
> So maybe:
>
>         "%s (%d) read from unsupported smt_snooze_delay.\n",
>
>
> I can do those changes when applying if you like rather than making you
> do a v3.

Yes please! Your suggested changes lgtm.

Cheers,

Joel

^ permalink raw reply

* Re: [PATCH v2 1/5] powerpc/mm: Introduce temporary mm
From: Daniel Axtens @ 2020-08-06  1:27 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev; +Cc: kernel-hardening
In-Reply-To: <20200709040316.12789-2-cmr@informatik.wtf>

Hi Chris,
  
>  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
>  bool ppc_breakpoint_available(void);
>  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
>  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 1a474f6b1992..9269c7c7b04e 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -10,6 +10,7 @@
>  #include <asm/mmu.h>	
>  #include <asm/cputable.h>
>  #include <asm/cputhreads.h>
> +#include <asm/debug.h>
>  
>  /*
>   * Most if the context management is out of line
> @@ -300,5 +301,68 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
>  	return 0;
>  }
>  
> +struct temp_mm {
> +	struct mm_struct *temp;
> +	struct mm_struct *prev;
> +	bool is_kernel_thread;
> +	struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> +};

This is on the nitpicky end, but I wonder if this should be named
temp_mm, or should be labelled something else to capture its broader
purpose as a context for code patching? I'm thinking that a store of
breakpoints is perhaps unusual in a memory-managment structure?

I don't have a better suggestion off the top of my head and I'm happy
for you to leave it, I just wanted to flag it as a possible way we could
be clearer.

> +
> +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> +{
> +	temp_mm->temp = mm;
> +	temp_mm->prev = NULL;
> +	temp_mm->is_kernel_thread = false;
> +	memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> +}
> +
> +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	temp_mm->is_kernel_thread = current->mm == NULL;
> +	if (temp_mm->is_kernel_thread)
> +		temp_mm->prev = current->active_mm;

You don't seem to restore active_mm below. I don't know what active_mm
does, so I don't know if this is a problem.

> +	else
> +		temp_mm->prev = current->mm;
> +
> +	/*
> +	 * Hash requires a non-NULL current->mm to allocate a userspace address
> +	 * when handling a page fault. Does not appear to hurt in Radix either.
> +	 */
> +	current->mm = temp_mm->temp;
> +	switch_mm_irqs_off(NULL, temp_mm->temp, current);
> +
> +	if (ppc_breakpoint_available()) {

I wondered if this could be changed during a text-patching operation.
AIUI, it potentially can on a P9 via "dawr_enable_dangerous" in debugfs.

I don't know if that's a problem. My concern is that you could turn off
breakpoints, call 'use_temporary_mm', then turn them back on again
before 'unuse_temporary_mm' and get a breakpoint while that can access
the temporary mm. Is there something else that makes that safe?
disabling IRQs maybe?

> +		struct arch_hw_breakpoint null_brk = {0};
> +		int i = 0;
> +
> +		for (; i < nr_wp_slots(); ++i) {

super nitpicky, and I'm not sure if this is actually documented, but I'd
usually see this written as:

for (i = 0; i < nr_wp_slots(); i++) {

Not sure if there's any reason that it _shouldn't_ be written the way
you've written it (and I do like initialising the variable when it's
defined!), I'm just not used to it. (Likewise with the unuse function.)

> +			__get_breakpoint(i, &temp_mm->brk[i]);
> +			if (temp_mm->brk[i].type != 0)
> +				__set_breakpoint(i, &null_brk);
> +		}
> +	}
> +}
> +

Kind regards,
Daniel

> +static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	if (temp_mm->is_kernel_thread)
> +		current->mm = NULL;
> +	else
> +		current->mm = temp_mm->prev;
> +	switch_mm_irqs_off(NULL, temp_mm->prev, current);
> +
> +	if (ppc_breakpoint_available()) {
> +		int i = 0;
> +
> +		for (; i < nr_wp_slots(); ++i)
> +			if (temp_mm->brk[i].type != 0)
> +				__set_breakpoint(i, &temp_mm->brk[i]);
> +	}
> +}
> +
>  #endif /* __KERNEL__ */
>  #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 4650b9bb217f..b6c123bf5edd 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -824,6 +824,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
>  	return 0;
>  }
>  
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> +{
> +	memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> +}
> +
>  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
>  {
>  	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> -- 
> 2.27.0

^ permalink raw reply

* Re: How would I code a Write to a proc file from within the kernel without reading anything from user space?
From: Michael Ellerman @ 2020-08-06  1:30 UTC (permalink / raw)
  To: thefirst ECS, linuxppc-dev
In-Reply-To: <32112832.26959.1596664366002.JavaMail.Dan@DanHP>

thefirst ECS <ecs_dn@yahoo.com> writes:
> In order to help debug a certain discrepancy, I need to "simulate" an "echo 1 > /proc/file" but doing it from kernel even when root file system is unavailable. 
>
> I have simulated it just fine via call_usermodehelper (with argv etc of "echo 1 > /proc/file") from inside the kernel which triggers:
> [ee897ec0] [c0122704] proc_reg_write+0x80/0xb4
> [ee897ef0] [c00d3b7c] vfs_write+0xb4/0x184
>
> just as I had wanted. But now I need to trigger the "vfs_write" and "proc_reg_write" but without using call_usermodehelper since I will be doing it when root "/" is unavailable and so I can no longer access /bin/echo and call the usermodehelper etc. So my question is how can I do that in kernel?
>
> Not sure if I'm supposed to look in fs read_write.c and fs.h and write a method based on those or if there's some other way etc.

You don't say which proc file, which would be useful information.

I'll assume it's /proc/sysrq-trigger based on your previous mail.

Rather than trying to invoke the write path from inside the kernel, the
simplest option is to just call __handle_sysrq() directly.

cheers

^ permalink raw reply

* Re: [PATCH v8 5/8] powerpc/vdso: Prepare for switching VDSO to generic C implementation.
From: Michael Ellerman @ 2020-08-06  2:03 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Christophe Leroy, nathanl, arnd, linux-kernel,
	Tulio Magno Quites Machado Filho, Paul Mackerras, luto,
	linux-arch, tglx, vincenzo.frascino, linuxppc-dev
In-Reply-To: <20200805133505.GN6753@gate.crashing.org>

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Aug 05, 2020 at 04:24:16PM +1000, Michael Ellerman wrote:
>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> > Indeed, 32-bit doesn't have a redzone, so I believe it needs a stack 
>> > frame whenever it has anything to same.
>> 
>> Yeah OK that would explain it.
>> 
>> > Here is what I have in libc.so:
>> >
>> > 000fbb60 <__clock_gettime>:
>> >     fbb60:	94 21 ff e0 	stwu    r1,-32(r1)
>
> This is the *only* place where you can use a negative offset from r1:
> in the stwu to extend the stack (set up a new stack frame, or make the
> current one bigger).

(You're talking about 32-bit code here right?)

>> > diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
>> > b/arch/powerpc/include/asm/vdso/gettimeofday.h
>> > index a0712a6e80d9..0b6fa245d54e 100644
>> > --- a/arch/powerpc/include/asm/vdso/gettimeofday.h
>> > +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
>> > @@ -10,6 +10,7 @@
>> >     .cfi_startproc
>> >   	PPC_STLU	r1, -STACK_FRAME_OVERHEAD(r1)
>> >   	mflr		r0
>> > +	PPC_STLU	r1, -STACK_FRAME_OVERHEAD(r1)
>> >     .cfi_register lr, r0
>> 
>> The cfi_register should come directly after the mflr I think.
>
> That is the idiomatic way to write it, and most obviously correct.  But
> as long as the value in LR at function entry is available in multiple
> places (like, in LR and in R0 here), it is fine to use either for
> unwinding.  Sometimes you can use this to optimise the unwind tables a
> bit -- not really worth it in hand-written code, it's more important to
> make it legible ;-)

OK. Because LR still holds the LR value until it's clobbered later, by
which point the cfi_register has taken effect.

But yeah I think for readability it's best to keep the cfi_register next
to the mflr.

>> >> There's also no code to load/restore the TOC pointer on BE, which I
>> >> think we'll need to handle.
>> >
>> > I see no code in the generated vdso64.so doing anything with r2, but if 
>> > you think that's needed, just let's do it:
>> 
>> Hmm, true.
>> 
>> The compiler will use the toc for globals (and possibly also for large
>> constants?)
>
> And anything else it bloody well wants to, yeah :-)

Haha yeah OK.

>> AFAIK there's no way to disable use of the toc, or make it a build error
>> if it's needed.
>
> Yes.
>
>> At the same time it's much safer for us to just save/restore r2, and
>> probably in the noise performance wise.
>
> If you want a function to be able to work with ABI-compliant code safely
> (in all cases), you'll have to make it itself ABI-compliant as well,
> yes :-)

True. Except this is the VDSO which has previously been a bit wild west
as far as ABI goes :)

>> So yeah we should probably do as below.
>
> [ snip ]
>
> Looks good yes.

Thanks for reviewing.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/40x: Fix assembler warning about r0
From: Michael Ellerman @ 2020-08-06  2:18 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <bc70ddd8-d48d-d939-dce0-5aa8b8cf87d8@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Le 22/07/2020 à 04:24, Michael Ellerman a écrit :
>> The assembler says:
>>    arch/powerpc/kernel/head_40x.S:623: Warning: invalid register expression
>
> I get exactly the same with head_32.S, for the exact same reason.

Ah yep, I see it. I mostly build pmac32_defconfig which doesn't have
BDI_SWITCH enabled.

Send a patch? :)

Do we still need the BDI_SWITCH code? Is it likely anyone still has one,
that works?

cheers

^ permalink raw reply

* Re: [PATCH v2 2/5] powerpc/lib: Initialize a temporary mm for code patching
From: Daniel Axtens @ 2020-08-06  3:24 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev; +Cc: kernel-hardening
In-Reply-To: <20200709040316.12789-3-cmr@informatik.wtf>

"Christopher M. Riedl" <cmr@informatik.wtf> writes:

> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped with permissive memory
> protections. Currently, a per-cpu vmalloc patch area is used for this
> purpose. While the patch area is per-cpu, the temporary page mapping is
> inserted into the kernel page tables for the duration of the patching.
> The mapping is exposed to CPUs other than the patching CPU - this is
> undesirable from a hardening perspective.
>
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> portion. The next patch uses the temporary mm and patching address for
> code patching.
>
> Based on x86 implementation:
>
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
>
> Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
> ---
>  arch/powerpc/lib/code-patching.c | 33 ++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 0a051dfeb177..8ae1a9e5fe6e 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -11,6 +11,8 @@
>  #include <linux/cpuhotplug.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/sched/task.h>
> +#include <linux/random.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/page.h>
> @@ -44,6 +46,37 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
>  }
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +static struct mm_struct *patching_mm __ro_after_init;
> +static unsigned long patching_addr __ro_after_init;
> +
> +void __init poking_init(void)
> +{
> +	spinlock_t *ptl; /* for protecting pte table */
> +	pte_t *ptep;
> +
> +	/*
> +	 * Some parts of the kernel (static keys for example) depend on
> +	 * successful code patching. Code patching under STRICT_KERNEL_RWX
> +	 * requires this setup - otherwise we cannot patch at all. We use
> +	 * BUG_ON() here and later since an early failure is preferred to
> +	 * buggy behavior and/or strange crashes later.
> +	 */
> +	patching_mm = copy_init_mm();
> +	BUG_ON(!patching_mm);
> +
> +	/*
> +	 * In hash we cannot go above DEFAULT_MAP_WINDOW easily.
> +	 * XXX: Do we want additional bits of entropy for radix?
> +	 */
> +	patching_addr = (get_random_long() & PAGE_MASK) %
> +		(DEFAULT_MAP_WINDOW - PAGE_SIZE);

It took me a while to understand this calculation. I see that it's
calculating a base address for a page in which to do patching. It does
the following:

 - get a random long

 - mask with PAGE_MASK so as to get a page aligned value

 - make sure that the base address is at least one PAGE_SIZE below
   DEFAULT_MAP_WINDOW so we have a clear page between the base and
   DEFAULT_MAP_WINDOW.

On 64-bit Book3S with 64K pages, that works out to be

PAGE_SIZE = 0x0000 0000 0001 0000
PAGE_MASK = 0xFFFF FFFF FFFF 0000

DEFAULT_MAP_WINDOW = DEFAULT_MAP_WINDOW_USER64 = TASK_SIZE_128TB
                   = 0x0000_8000_0000_0000

DEFAULT_MAP_WINDOW - PAGE_SIZE = 0x0000 7FFF FFFF 0000

It took a while (and a conversation with my wife who studied pure
maths!) but I am convinced that the modulo preserves the page-alignement
of the patching address.

One thing I did realise is that patching_addr can be zero at the end of
this process. That seems dubious and slightly error-prone to me - is
the patching process robust to that or should we exclude it?

Anyway, if I have the maths right, that there are 0x7fffffff or ~2
billion possible locations for the patching page, which is just shy of
31 bits of entropy.

I think this compares pretty favourably to most (K)ASLR implementations?

What's the range if built with 4k pages?

Kind regards,
Daniel

> +
> +	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> +	BUG_ON(!ptep);
> +	pte_unmap_unlock(ptep, ptl);
> +}
> +
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>  
>  static int text_area_cpu_up(unsigned int cpu)
> -- 
> 2.27.0

^ permalink raw reply

* Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
From: Aneesh Kumar K.V @ 2020-08-06 10:44 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Nathan Lynch, linuxppc-dev
In-Reply-To: <20200804072507.GI24375@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-02 19:51:41]:
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:
>> >
>> >
>> > If its just to eliminate node 0, then we have 2 other probably better
>> > solutions.
>> > 1. Dont mark node 0 as spl (currently still in mm-tree and a result in
>> > linux-next)
>> > 2. powerpc specific: explicitly clear node 0 during numa bringup.
>> >
>> 
>> 
>> I am not sure I consider them better. But yes, those patches are good
>> and also resolves the node 0 initialization when the firmware didn't
>> indicate the presence of such a node.
>> 
>> This patch in addition make sure that we get the same topolgy report
>> across reboot on a virtualized partitions as longs as the cpu/memory
>> ratio per powervm domains remain the same. This should also help to
>> avoid confusion after an LPM migration once we start applying topology
>> updates. 
>> 
>
> What do we mean by cpu/memory ratio. The topology across reboot would have
> changed only if PowerVM would have allocated resources differently by
> scrambling/unscrambling. We are no more processing topology updates at
> runtime. As far as I know, after LPM, the source topology is maintained.

A LPAR running with one numa node and 10GB of memory on PowerVM domain
10 will report node 10 and 10GB memory in the current scheme. After LPM
migration or a CEC shutdown/reboot if the domain from which the resource
allocated becomes 11, then the LPAR will report node 11 and 10GB memory.

Having a logical node number means in the both the above cases we report
node 0, 10GB memory. 


>
>> >> This can  be resolved by mapping the firmware provided group id to a logical Linux
>> >> NUMA id. In this patch, we do this only for pseries platforms considering the
>> >
>> > On PowerVM, as you would know the nid is already a logical or a flattened
>> > chip-id and not the actual hardware chip-id.
>> 
>> Yes. But then they are derived based on PowerVM resources AKA domains.
>> Now based on the available resource on a system, we could end up with
>> different node numbers with same toplogy across reboots. Making it
>> logical at OS level prevent that. 
>
> The above statement kind of gives an impression, that topology changes
> across every reboot.  We only end up with different node numbers if and only
> if the underlying topology has changed and that case is very rare. Or am I
> missing something?

IIUC it also depends on availability of resources within the
domain at the time of LPAR start.

>
>> 
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index e437a9ac4956..6c659aada55b 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>> >>  	}
>> >>  }
>> >> 
>> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
>> >> +
>> >> +int firmware_group_id_to_nid(int firmware_gid)
>> >> +{
>> >> +	static int last_nid = 0;
>> >> +
>> >> +	/*
>> >> +	 * For PowerNV we don't change the node id. This helps to avoid
>> >> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
>> >> +	 * are virtualized. Hence do logical node id for pseries.
>> >> +	 */
>> >> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
>> >> +		return firmware_gid;
>> >> +
>> >> +	if (firmware_gid ==  -1)
>> >> +		return NUMA_NO_NODE;
>> >> +
>> >> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
>> >> +		nid_map[firmware_gid] = last_nid++;
>> >
>> > How do we ensure 2 simultaneous firmware_group_id_to_nid() calls dont end up
>> > at this place in parallel?
>> 
>> Do we have a code path where we do that? All the node id init should
>> happen early and there should not be two cpus doing node init at the
>> same time. I might be mistaken. Can you point to the code path where you
>> expect this to be called in parallel?
>> 
>
> associativity_to_nid gets called the first time a cpu is being made present
> from offline. So it need not be in boot path. We may to verify if cpu
> hotplug, dlpar, operations are synchronized. For example a memory hotadd and
> cpu hotplug are they synchronized? I am not sure if they are synchronized at
> this time.

But you don't online cpu or memory to a non existent node post boot
right?. If the node is existent we have already initialized the nid_map.

However i am not sure whether we do a parallel initialization of devices. ie,
of_device_add getting called in parallel. if it can then we need the
below?

@@ -226,6 +226,7 @@ static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
 int firmware_group_id_to_nid(int firmware_gid)
 {
        static int last_nid = 0;
+       static DEFINE_SPINLOCK(node_id_lock);
 
        /*
         * For PowerNV we don't change the node id. This helps to avoid
@@ -238,8 +239,13 @@ int firmware_group_id_to_nid(int firmware_gid)
        if (firmware_gid ==  -1)
                return NUMA_NO_NODE;
 
-       if (nid_map[firmware_gid] == NUMA_NO_NODE)
-               nid_map[firmware_gid] = last_nid++;
+       if (nid_map[firmware_gid] == NUMA_NO_NODE) {
+               spin_lock(&node_id_lock);
+               /*  recheck with lock held */
+               if (nid_map[firmware_gid] == NUMA_NO_NODE)
+                       nid_map[firmware_gid] = last_nid++;
+               spin_unlock(&node_id_lock);
+       }
 
        return nid_map[firmware_gid];
 }



>
>> >
>> >> +
>> >> +	return nid_map[firmware_gid];
>> >> +}
>> >> +
>> >>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> >>   * info is found.
>> >>   */
>> >>  static int associativity_to_nid(const __be32 *associativity)
>> >>  {
>> >>  	int nid = NUMA_NO_NODE;
>> >> +	int firmware_gid = -1;
>> >> 
>> >>  	if (!numa_enabled)
>> >>  		goto out;
>> >> 
>> >>  	if (of_read_number(associativity, 1) >= min_common_depth)
>> >> -		nid = of_read_number(&associativity[min_common_depth], 1);
>> >> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
>> >> 
>> >>  	/* POWER4 LPAR uses 0xffff as invalid node */
>> >> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
>> >> -		nid = NUMA_NO_NODE;
>> >> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> >> +		firmware_gid = -1;
>> >
>> > Lets assume two or more invocations of associativity_to_nid for the same
>> > associativity, end up with -1, In each case aren't giving different
>> > nids?
>> 
>> 
>> I didn't quiet get the comment here. But I assume you are indicating the
>> same one you mentioned above?
>> 
>
> No its not related to the above comment.
> We are incrementing the nid_map table for every unique firmware_gid or for
> every -1 (aka invalid associativities). If there are sufficiently large
> number of associativities that end up returning invalid associativities,
> then don't we quickly overflow the nid_map table? Not only about the
> overflow but a 8 node machine may soon look like a 80 node machine.

Not sure I follow. What does a large number of associativies imply? Are
you looking at ibm,associativity-lookup-arrays that got entries which
are invalid? Even there we are not parsing the full array, we lookup
only a specific firmware_gid (in case of lookup-arrays we use aa_index
value from drmem_lmb).

I will also add a las_nid > MAX_NUMNODES check in
firmware_group_id_to_nid() to handle the case where we find more numa
nodes than MAX_NUMANODES in device tree.

-aneesh

^ permalink raw reply

* [PATCH][V2] macintosh: windfarm: remove detatch debug containing spelling mistakes
From: Colin King @ 2020-08-06 10:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Wolfram Sang, linuxppc-dev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There are spelling mistakes in two debug messages. As recommended
by Wolfram Sang, these can be removed as there is plenty of debug
in the driver core.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---

V2: remove the debug rather than fixing the spelling

---
 drivers/macintosh/windfarm_lm75_sensor.c | 2 --
 drivers/macintosh/windfarm_lm87_sensor.c | 2 --
 2 files changed, 4 deletions(-)

diff --git a/drivers/macintosh/windfarm_lm75_sensor.c b/drivers/macintosh/windfarm_lm75_sensor.c
index 1e5fa09845e7..29f48c2028b6 100644
--- a/drivers/macintosh/windfarm_lm75_sensor.c
+++ b/drivers/macintosh/windfarm_lm75_sensor.c
@@ -152,8 +152,6 @@ static int wf_lm75_remove(struct i2c_client *client)
 {
 	struct wf_lm75_sensor *lm = i2c_get_clientdata(client);
 
-	DBG("wf_lm75: i2c detatch called for %s\n", lm->sens.name);
-
 	/* Mark client detached */
 	lm->i2c = NULL;
 
diff --git a/drivers/macintosh/windfarm_lm87_sensor.c b/drivers/macintosh/windfarm_lm87_sensor.c
index d011899c0a8a..9fab0b47cd3d 100644
--- a/drivers/macintosh/windfarm_lm87_sensor.c
+++ b/drivers/macintosh/windfarm_lm87_sensor.c
@@ -149,8 +149,6 @@ static int wf_lm87_remove(struct i2c_client *client)
 {
 	struct wf_lm87_sensor *lm = i2c_get_clientdata(client);
 
-	DBG("wf_lm87: i2c detatch called for %s\n", lm->sens.name);
-
 	/* Mark client detached */
 	lm->i2c = NULL;
 
-- 
2.27.0


^ permalink raw reply related

* Re: [PATCH v3 3/3] powerpc/uaccess: simplify the get_fs() set_fs() logic
From: Christophe Leroy @ 2020-08-06  9:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <20200806091758.GA653@infradead.org>



Le 06/08/2020 à 11:17, Christoph Hellwig a écrit :
> Do you urgently need this?  My plan for 5.10 is to rebased and submit
> the remaining bits of this branch:
> 
>      http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/set_fs-removal
> 
> which will kill off set_fs/get_fs entirely.
> 

No this isn't needed urgently at all I think.

It was sleeping in Patchwork since January, and I received comments from 
Michael a few days ago asking me to re-submit, see 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/dd2876b808ea38eb7b7f760ecd6ce06096c61fb5.1580295551.git.christophe.leroy@c-s.fr/

But if you are killing set_fs/get_fs entirely, that's even better I 
guess. Thanks for the hands up.

Christophe

^ permalink raw reply

* [PATCH] powerpc/signal: Move and simplify get_clean_sp()
From: Christophe Leroy @ 2020-08-06  8:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

get_clean_sp() is only used in kernel/signal.c . Move it there.

And GCC is smart enough to reduce the function when on PPC32, no
need of a special PPC32 simple version.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/processor.h | 14 --------------
 arch/powerpc/kernel/signal.c         |  7 +++++++
 2 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index ed0d633ab5aa..5c20b6d509ae 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -404,20 +404,6 @@ static inline void prefetchw(const void *x)
 
 #define HAVE_ARCH_PICK_MMAP_LAYOUT
 
-#ifdef CONFIG_PPC64
-static inline unsigned long get_clean_sp(unsigned long sp, int is_32)
-{
-	if (is_32)
-		return sp & 0x0ffffffffUL;
-	return sp;
-}
-#else
-static inline unsigned long get_clean_sp(unsigned long sp, int is_32)
-{
-	return sp;
-}
-#endif
-
 /* asm stubs */
 extern unsigned long isa300_idle_stop_noloss(unsigned long psscr_val);
 extern unsigned long isa300_idle_stop_mayloss(unsigned long psscr_val);
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index d15a98c758b8..bd0ba7c5e2cf 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -171,6 +171,13 @@ inline unsigned long copy_ckfpr_from_user(struct task_struct *task,
 
 int show_unhandled_signals = 1;
 
+static unsigned long get_clean_sp(unsigned long sp, int is_32)
+{
+	if (is_32)
+		return sp & 0x0ffffffffUL;
+	return sp;
+}
+
 /*
  * Allocate space for the signal frame
  */
-- 
2.25.0


^ permalink raw reply related

* [PATCH] ASoC: fsl-asoc-card: Get "extal" clock rate by clk_get_rate
From: Shengjiu Wang @ 2020-08-06  7:39 UTC (permalink / raw)
  To: timur, nicoleotsuka, Xiubo.Lee, festevam, lgirdwood, broonie,
	perex, tiwai, alsa-devel, linuxppc-dev, linux-kernel

On some platform(.e.g. i.MX8QM MEK), the "extal" clock is different
with the mclk of codec, then the clock rate is also different.
So it is better to get clock rate of "extal" rate by clk_get_rate,
don't reuse the clock rate of mclk.

Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
---
 sound/soc/fsl/fsl-asoc-card.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/sound/soc/fsl/fsl-asoc-card.c b/sound/soc/fsl/fsl-asoc-card.c
index 52adedc03245..2c92a5efad61 100644
--- a/sound/soc/fsl/fsl-asoc-card.c
+++ b/sound/soc/fsl/fsl-asoc-card.c
@@ -696,6 +696,13 @@ static int fsl_asoc_card_probe(struct platform_device *pdev)
 			goto asrc_fail;
 		}
 	} else if (of_node_name_eq(cpu_np, "esai")) {
+		struct clk *esai_clk = clk_get(&cpu_pdev->dev, "extal");
+
+		if (!IS_ERR(esai_clk)) {
+			priv->cpu_priv.sysclk_freq[TX] = clk_get_rate(esai_clk);
+			priv->cpu_priv.sysclk_freq[RX] = clk_get_rate(esai_clk);
+			clk_put(esai_clk);
+		}
 		priv->cpu_priv.sysclk_id[1] = ESAI_HCKT_EXTAL;
 		priv->cpu_priv.sysclk_id[0] = ESAI_HCKR_EXTAL;
 	} else if (of_node_name_eq(cpu_np, "sai")) {
-- 
2.27.0


^ permalink raw reply related

* [powerpc:merge] BUILD SUCCESS 3cd2184115b85cc8242fec3d42529cd112962984
From: kernel test robot @ 2020-08-06  4:14 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  merge
branch HEAD: 3cd2184115b85cc8242fec3d42529cd112962984  Automatic merge of 'master', 'next' and 'fixes' (2020-08-05 22:24)

elapsed time: 844m

configs tested: 91
configs skipped: 6

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
sh                          polaris_defconfig
mips                  mips_paravirt_defconfig
microblaze                      mmu_defconfig
mips                 pnx8335_stb225_defconfig
arm                       aspeed_g5_defconfig
mips                      bmips_stb_defconfig
sh                   sh7770_generic_defconfig
m68k                       m5249evb_defconfig
alpha                               defconfig
sh                        dreamcast_defconfig
m68k                             alldefconfig
arm                     am200epdkit_defconfig
arm                         nhk8815_defconfig
powerpc                     mpc5200_defconfig
mips                         tb0219_defconfig
powerpc                      ppc6xx_defconfig
arm                       imx_v4_v5_defconfig
arm                     eseries_pxa_defconfig
arm                            pleb_defconfig
arm                        keystone_defconfig
arm                          collie_defconfig
arc                        vdk_hs38_defconfig
arm                      tct_hammer_defconfig
mips                     loongson1b_defconfig
powerpc                       ppc64_defconfig
arm                             ezx_defconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
nios2                               defconfig
arc                              allyesconfig
nds32                             allnoconfig
c6x                              allyesconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
powerpc                             defconfig
i386                 randconfig-a005-20200805
i386                 randconfig-a004-20200805
i386                 randconfig-a001-20200805
i386                 randconfig-a003-20200805
i386                 randconfig-a002-20200805
i386                 randconfig-a006-20200805
x86_64               randconfig-a013-20200805
x86_64               randconfig-a011-20200805
x86_64               randconfig-a012-20200805
x86_64               randconfig-a016-20200805
x86_64               randconfig-a015-20200805
x86_64               randconfig-a014-20200805
i386                 randconfig-a011-20200805
i386                 randconfig-a012-20200805
i386                 randconfig-a013-20200805
i386                 randconfig-a014-20200805
i386                 randconfig-a015-20200805
i386                 randconfig-a016-20200805
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
x86_64                                   rhel
x86_64                           allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                                  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* [powerpc:next] BUILD SUCCESS a7aaa2f26bfd932a654706b19859e7adf802bee2
From: kernel test robot @ 2020-08-06  4:14 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next
branch HEAD: a7aaa2f26bfd932a654706b19859e7adf802bee2  selftests/powerpc: Fix pkey syscall redefinitions

elapsed time: 843m

configs tested: 134
configs skipped: 14

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
arm64                            allyesconfig
arm                       mainstone_defconfig
sh                   sh7770_generic_defconfig
mips                       rbtx49xx_defconfig
ia64                         bigsur_defconfig
arm                          simpad_defconfig
sh                          polaris_defconfig
mips                  mips_paravirt_defconfig
microblaze                      mmu_defconfig
arm                         axm55xx_defconfig
xtensa                    xip_kc705_defconfig
m68k                       m5475evb_defconfig
powerpc                       holly_defconfig
powerpc                     pq2fads_defconfig
c6x                                 defconfig
arm                           u8500_defconfig
powerpc                    amigaone_defconfig
sh                           se7712_defconfig
arc                           tb10x_defconfig
sh                           se7780_defconfig
mips                   sb1250_swarm_defconfig
mips                         tb0287_defconfig
arc                              allyesconfig
sh                        dreamcast_defconfig
m68k                       m5249evb_defconfig
alpha                               defconfig
arc                      axs103_smp_defconfig
mips                      fuloong2e_defconfig
powerpc                    mvme5100_defconfig
mips                        bcm47xx_defconfig
m68k                             alldefconfig
arm                     am200epdkit_defconfig
arm                         nhk8815_defconfig
powerpc                      pasemi_defconfig
arm                             rpc_defconfig
m68k                        mvme147_defconfig
m68k                          multi_defconfig
powerpc                  mpc885_ads_defconfig
sh                   secureedge5410_defconfig
ia64                          tiger_defconfig
arm                         mv78xx0_defconfig
powerpc                     mpc5200_defconfig
mips                         tb0219_defconfig
powerpc                      ppc6xx_defconfig
s390                          debug_defconfig
powerpc                     powernv_defconfig
arm                        spear6xx_defconfig
mips                         db1xxx_defconfig
sh                         apsh4a3a_defconfig
mips                            gpr_defconfig
arm                       imx_v4_v5_defconfig
arm                     eseries_pxa_defconfig
arm                            pleb_defconfig
arm                        keystone_defconfig
arm                          collie_defconfig
arm                         shannon_defconfig
arm                     davinci_all_defconfig
xtensa                generic_kc705_defconfig
arm                          moxart_defconfig
sh                           sh2007_defconfig
sh                           se7751_defconfig
m68k                       m5275evb_defconfig
arc                        vdk_hs38_defconfig
arm                      tct_hammer_defconfig
mips                     loongson1b_defconfig
powerpc                       ppc64_defconfig
arm                             ezx_defconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                            allyesconfig
nios2                               defconfig
nds32                             allnoconfig
c6x                              allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
i386                             allyesconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
i386                 randconfig-a005-20200805
i386                 randconfig-a004-20200805
i386                 randconfig-a001-20200805
i386                 randconfig-a003-20200805
i386                 randconfig-a002-20200805
i386                 randconfig-a006-20200805
x86_64               randconfig-a013-20200805
x86_64               randconfig-a011-20200805
x86_64               randconfig-a012-20200805
x86_64               randconfig-a016-20200805
x86_64               randconfig-a015-20200805
x86_64               randconfig-a014-20200805
i386                 randconfig-a011-20200805
i386                 randconfig-a012-20200805
i386                 randconfig-a013-20200805
i386                 randconfig-a014-20200805
i386                 randconfig-a015-20200805
i386                 randconfig-a016-20200805
i386                 randconfig-a011-20200806
i386                 randconfig-a012-20200806
i386                 randconfig-a015-20200806
i386                 randconfig-a016-20200806
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
x86_64                                   rhel
x86_64                           allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                                  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
From: peterz @ 2020-08-06  8:54 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Gautham R Shenoy, Michael Neuling, Vincent Guittot,
	Srikar Dronamraju, Rik van Riel, linuxppc-dev, LKML,
	Valentin Schneider, Thomas Gleixner, Mel Gorman, Ingo Molnar,
	Dietmar Eggemann
In-Reply-To: <87ft90z6dy.fsf@mpe.ellerman.id.au>

On Thu, Aug 06, 2020 at 03:32:25PM +1000, Michael Ellerman wrote:

> That brings with it a bunch of problems, such as existing software that
> has been developed/configured for Power8 and expects to see SMT8.
> 
> We also allow LPARs to be live migrated from Power8 to Power9 (and back), so
> maintaining the illusion of SMT8 is considered a requirement to make that work.

So how does that work if the kernel booted on P9 and demuxed the SMT8
into 2xSMT4? If you migrate that state onto a P8 with actual SMT8 you're
toast again.

> Yeah I agree the naming is confusing.
> 
> Let's call them "SMT4 cores" and "SMT8 cores"?

Works for me, thanks!

> The problem is we are already lying to userspace, because firmware lies to us.
> 
> ie. the firmware on these systems shows us an SMT8 core, and so current kernels
> show SMT8 to userspace. I don't think we can realistically change that fact now,
> as these systems are already out in the field.
> 
> What this patch tries to do is undo some of the mess, and at least give the
> scheduler the right information.

What a mess... I think it depends on what you do with that P9 to P8
migration case. Does it make sense to have a "p8_compat" boot arg for
the case where you want LPAR migration back onto P8 systems -- in which
case it simply takes the firmware's word as gospel and doesn't untangle
things, because it can actually land on a P8.

^ permalink raw reply

* Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
From: Michael Ellerman @ 2020-08-06  5:32 UTC (permalink / raw)
  To: peterz, Srikar Dronamraju
  Cc: Gautham R Shenoy, Michael Neuling, Vincent Guittot, Rik van Riel,
	linuxppc-dev, LKML, Ingo Molnar, Thomas Gleixner, Mel Gorman,
	Valentin Schneider, Dietmar Eggemann
In-Reply-To: <20200804124755.GJ2674@hirez.programming.kicks-ass.net>

peterz@infradead.org writes:
> On Tue, Aug 04, 2020 at 05:40:07PM +0530, Srikar Dronamraju wrote:
>> * peterz@infradead.org <peterz@infradead.org> [2020-08-04 12:45:20]:
>> 
>> > On Tue, Aug 04, 2020 at 09:03:06AM +0530, Srikar Dronamraju wrote:
>> > > cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
>> > > most architectures. One of the users of cpu_smt_mask(), would be to
>> > > identify idle-cores. On Power9, a pair of cores can be presented by the
>> > > firmware as a big-core for backward compatibility reasons.
>> > > 
>> > > In order to maintain userspace backward compatibility with previous
>> > > versions of processor, (since Power8 had SMT8 cores), Power9 onwards there
>> > > is option to the firmware to advertise a pair of SMT4 cores as a fused
>> > > cores (referred to as the big_core mode in the Linux Kernel). On Power9
>> > > this pair shares the L2 cache as well. However, from the scheduler's point
>> > > of view, a core should be determined by SMT4. The load-balancer already
>> > > does this. Hence allow PowerPc architecture to override the default
>> > > cpu_smt_mask() to point to the SMT4 cores in a big_core mode.
>> > 
>> > I'm utterly confused.
>> > 
>> > Why can't you set your regular siblings mask to the smt4 thing? Who
>> > cares about the compat stuff, I thought that was an LPAR/AIX thing.

To be clear this stuff is all for when we're running on the PowerVM machines,
ie. as LPARs.

That brings with it a bunch of problems, such as existing software that
has been developed/configured for Power8 and expects to see SMT8.

We also allow LPARs to be live migrated from Power8 to Power9 (and back), so
maintaining the illusion of SMT8 is considered a requirement to make that work.

>> There are no technical challenges to set the sibling mask to SMT4.
>> This is for Linux running on PowerVM. When these Power9 boxes are sold /
>> marketed as X core boxes (where X stand for SMT8 cores).  Since on PowerVM
>> world, everything is in SMT8 mode, the device tree properties still mark
>> the system to be running on 8 thread core. There are a number of utilities
>> like ppc64_cpu that directly read from device-tree. They would get core
>> count and thread count which is SMT8 based.
>> 
>> If the sibling_mask is set to small core, then same user when looking at
>
> FWIW, I find the small/big core naming utterly confusing vs the
> big/little naming from ARM. When you say small, I'm thinking of
> asymmetric cores, not SMT4/SMT8.

Yeah I agree the naming is confusing.

Let's call them "SMT4 cores" and "SMT8 cores"?

>> output from lscpu and other utilities that look at sysfs will start seeing
>> 2x number of cores to what he had provisioned and what the utilities from
>> the device-tree show. This can gets users confused.
>
> One will report SMT8 and the other SMT4, right? So only users that
> cannot read will be confused, but if you can't read, confusion is
> guaranteed anyway.

It's partly users, but also software that would see different values depending
on where it looks.

> Also, by exposing the true (SMT4) topology to userspace, userspace
> applications could behave better -- for those few that actually parse
> the topology information.

Agreed, though as you say there aren't that many that actually use the low-level
topology information.

>> So to keep the device-tree properties, utilities depending on device-tree,
>> sysfs and utilities depending on sysfs on the same page, userspace are only
>> exposed as SMT8.
>
> I'm not convinced it makes sense to lie to userspace just to accomodate
> a few users that cannot read.

The problem is we are already lying to userspace, because firmware lies to us.

ie. the firmware on these systems shows us an SMT8 core, and so current kernels
show SMT8 to userspace. I don't think we can realistically change that fact now,
as these systems are already out in the field.

What this patch tries to do is undo some of the mess, and at least give the
scheduler the right information.

cheers

^ permalink raw reply

* Re: [PATCH v10 2/5] powerpc/vdso: Prepare for switching VDSO to generic C implementation.
From: Christophe Leroy @ 2020-08-06  5:46 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: nathanl, linux-arch, vincenzo.frascino, arnd, linux-kernel,
	Paul Mackerras, luto, tglx, linuxppc-dev
In-Reply-To: <20200805184054.GQ6753@gate.crashing.org>

Hi,

On 08/05/2020 06:40 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Aug 05, 2020 at 04:40:16PM +0000, Christophe Leroy wrote:
>>> It cannot optimise it because it does not know shift < 32.  The code
>>> below is incorrect for shift equal to 32, fwiw.
>>
>> Is there a way to tell it ?
> 
> Sure, for example the &31 should work (but it doesn't, with the GCC
> version you used -- which version is that?)

GCC 10.1

> 
>>> What does the compiler do for just
>>>
>>> static __always_inline u64 vdso_shift_ns(u64 ns, unsigned long shift)
>>> 	return ns >> (shift & 31);
>>> }
>>>
>>
>> Worse:
> 
> I cannot make heads or tails of all that branch spaghetti, sorry.
> 
>>   73c:	55 8c 06 fe 	clrlwi  r12,r12,27
>>   740:	7f c8 f0 14 	addc    r30,r8,r30
>>   744:	7c c6 4a 14 	add     r6,r6,r9
>>   748:	7c c6 e1 14 	adde    r6,r6,r28
>>   74c:	34 6c ff e0 	addic.  r3,r12,-32
>>   750:	41 80 00 70 	blt     7c0 <__c_kernel_clock_gettime+0x114>
> 
> This branch is always true.  Hrm.

As a standalone function:

With your suggestion:

000006ac <vdso_shift_ns>:
  6ac:	54 a5 06 fe 	clrlwi  r5,r5,27
  6b0:	35 25 ff e0 	addic.  r9,r5,-32
  6b4:	41 80 00 10 	blt     6c4 <vdso_shift_ns+0x18>
  6b8:	7c 64 4c 30 	srw     r4,r3,r9
  6bc:	38 60 00 00 	li      r3,0
  6c0:	4e 80 00 20 	blr
  6c4:	54 69 08 3c 	rlwinm  r9,r3,1,0,30
  6c8:	21 45 00 1f 	subfic  r10,r5,31
  6cc:	7c 84 2c 30 	srw     r4,r4,r5
  6d0:	7d 29 50 30 	slw     r9,r9,r10
  6d4:	7c 63 2c 30 	srw     r3,r3,r5
  6d8:	7d 24 23 78 	or      r4,r9,r4
  6dc:	4e 80 00 20 	blr


With the version as is in my series:

000006ac <vdso_shift_ns>:
  6ac:	21 25 00 20 	subfic  r9,r5,32
  6b0:	7c 69 48 30 	slw     r9,r3,r9
  6b4:	7c 84 2c 30 	srw     r4,r4,r5
  6b8:	7d 24 23 78 	or      r4,r9,r4
  6bc:	7c 63 2c 30 	srw     r3,r3,r5
  6c0:	4e 80 00 20 	blr


Christophe

^ permalink raw reply

* Re: [PATCH v3 3/3] powerpc/uaccess: simplify the get_fs() set_fs() logic
From: Christoph Hellwig @ 2020-08-06  9:17 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <cf39cb8e42cffe323393b8cecdc59a7230298eab.1596702117.git.christophe.leroy@csgroup.eu>

Do you urgently need this?  My plan for 5.10 is to rebased and submit
the remaining bits of this branch:

    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/set_fs-removal

which will kill off set_fs/get_fs entirely.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox