LinuxPPC-Dev Archive on lore.kernel.org


* [PATCH v3 07/30] ABI: sysfs-class-cxl: place "not in a guest" at description
From: Mauro Carvalho Chehab @ 2021-09-16  8:59 UTC
  To: Linux Doc Mailing List, Greg Kroah-Hartman
  Cc: Andrew Donnellan, Jonathan Corbet, Mauro Carvalho Chehab,
	linux-kernel, Frederic Barrat, linuxppc-dev
In-Reply-To: <cover.1631782432.git.mchehab+huawei@kernel.org>

The What: field should have just the location of the ABI.
Anything else should be inside the description.

This fixes its parsing by the get_abi.pl script.

Acked-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 Documentation/ABI/testing/sysfs-class-cxl | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 818f55970efb..3c77677e0ca7 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -166,10 +166,11 @@ Description:    read only
                 Decimal value of the Per Process MMIO space length.
 Users:		https://github.com/ibm-capi/libcxl
 
-What:           /sys/class/cxl/<afu>m/pp_mmio_off (not in a guest)
+What:           /sys/class/cxl/<afu>m/pp_mmio_off
 Date:           September 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    read only
+                (not in a guest)
                 Decimal value of the Per Process MMIO space offset.
 Users:		https://github.com/ibm-capi/libcxl
 
@@ -190,28 +191,31 @@ Description:    read only
                 Identifies the revision level of the PSL.
 Users:		https://github.com/ibm-capi/libcxl
 
-What:           /sys/class/cxl/<card>/base_image (not in a guest)
+What:           /sys/class/cxl/<card>/base_image
 Date:           September 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    read only
+                (not in a guest)
                 Identifies the revision level of the base image for devices
                 that support loadable PSLs. For FPGAs this field identifies
                 the image contained in the on-adapter flash which is loaded
                 during the initial program load.
 Users:		https://github.com/ibm-capi/libcxl
 
-What:           /sys/class/cxl/<card>/image_loaded (not in a guest)
+What:           /sys/class/cxl/<card>/image_loaded
 Date:           September 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    read only
+                (not in a guest)
                 Will return "user" or "factory" depending on the image loaded
                 onto the card.
 Users:		https://github.com/ibm-capi/libcxl
 
-What:           /sys/class/cxl/<card>/load_image_on_perst (not in a guest)
+What:           /sys/class/cxl/<card>/load_image_on_perst
 Date:           December 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    read/write
+                (not in a guest)
                 Valid entries are "none", "user", and "factory".
                 "none" means PERST will not cause image to be loaded to the
                 card.  A power cycle is required to load the image.
@@ -235,10 +239,11 @@ Description:    write only
                 contexts on the card AFUs.
 Users:		https://github.com/ibm-capi/libcxl
 
-What:		/sys/class/cxl/<card>/perst_reloads_same_image (not in a guest)
+What:		/sys/class/cxl/<card>/perst_reloads_same_image
 Date:		July 2015
 Contact:	linuxppc-dev@lists.ozlabs.org
 Description:	read/write
+                (not in a guest)
 		Trust that when an image is reloaded via PERST, it will not
 		have changed.
 
-- 
2.31.1



* [PATCH v3 00/30] Change wildcards on ABI files
From: Mauro Carvalho Chehab @ 2021-09-16  8:59 UTC
  To: Linux Doc Mailing List, Greg Kroah-Hartman
  Cc: Heikki Krogerus, Kees Cook, Jonathan Corbet,
	Mauro Carvalho Chehab, netdev, Richard Cochran, Anton Vorontsov,
	linux-kernel, Johan Hovold, Tony Luck, Colin Cross, linux-usb,
	linuxppc-dev, Peter Rosin

The ABI files are meant to be parsed by a script (scripts/get_abi.pl).

An upcoming improvement will let the script detect when an ABI description
is missing, or when a What: field doesn't match the actual location of the
symbol.

For get_abi.pl to convert What: into a regex, existing ABI files need
changes, as the conversion must not be ambiguous.

One alternative would be to convert everything into regexes, but this
would generate a huge number of patches/changes. So, instead, let's
touch only the ABI files that don't follow the de facto wildcard
conventions already found in most of the ABI files, e.g.:

	/.../
	*
	<foo>
	(option1|option2)
	X
	Y
	Z
	[0-9] (and variants)
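
As a rough illustration of why the wildcards have to be unambiguous, the
sketch below converts a What: path into a POSIX extended regex. It is not
how get_abi.pl works (that script is Perl); what_to_regex() and
what_matches() are hypothetical names, and only three of the forms listed
above are handled: <foo>, * and /.../. Everything else, including
parentheses and [0-9], is matched literally here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Translate a What: path into an anchored POSIX extended regex.
 * Hypothetical helper: handles only "<name>", "*" and "/.../".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int what_to_regex(const char *what, char *out, size_t outlen)
{
	const char *p = what;
	size_t n = 0;

	assert(what != NULL && out != NULL);
	if (outlen < 3)
		return -1;
	out[n++] = '^';
	while (*p) {
		const char *rep;
		char lit[3];
		size_t skip = 1;

		if (!strncmp(p, "/.../", 5)) {
			rep = "/.*/";		/* any number of path levels */
			skip = 5;
		} else if (*p == '<' && strchr(p, '>')) {
			rep = "[^/]+";		/* one named path component */
			skip = (size_t)(strchr(p, '>') - p) + 1;
		} else if (*p == '*') {
			rep = "[^/]*";		/* glob within a component */
		} else if (strchr(".[()|+?{^$\\", *p)) {
			lit[0] = '\\';		/* escape regex metacharacters */
			lit[1] = *p;
			lit[2] = '\0';
			rep = lit;
		} else {
			lit[0] = *p;
			lit[1] = '\0';
			rep = lit;
		}
		/* Leave room for the trailing '$' and the terminator. */
		if (n + strlen(rep) + 2 > outlen)
			return -1;
		memcpy(out + n, rep, strlen(rep));
		n += strlen(rep);
		p += skip;
	}
	out[n++] = '$';
	out[n] = '\0';
	return 0;
}

/* Return 1 if the sysfs path matches the What: pattern, 0 otherwise. */
static int what_matches(const char *what, const char *path)
{
	char pattern[512];
	regex_t re;
	int rc;

	if (what_to_regex(what, pattern, sizeof(pattern)) != 0)
		return 0;
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return 0;
	rc = regexec(&re, path, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With this sketch, a What: value that carries extra text after the path,
such as " (not in a guest)", stops matching the real sysfs path, which is
the kind of mismatch the series removes.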

---

v3:
   - Added a new patch for sysfs-class-rapidio;
   - sysfs-class-typec had a typo, instead of a wildcard;
   - sysfs-bus-soundwire-* had some additional What fields to be fixed;
   - added some reviewed-by/acked-by tags.

v2:
   - Added several patches to handle uppercase "N" being used
     as a wildcard.

Mauro Carvalho Chehab (30):
  ABI: sysfs-bus-usb: better document variable argument
  ABI: sysfs-tty: better document module name parameter
  ABI: sysfs-kernel-slab: use a wildcard for the cache name
  ABI: security: fix location for evm and ima_policy
  ABI: sysfs-class-tpm: use wildcards for pcr-* nodes
  ABI: sysfs-bus-rapidio: use wildcards on What definitions
  ABI: sysfs-class-cxl: place "not in a guest" at description
  ABI: sysfs-class-devfreq-event: use the right wildcards on What
  ABI: sysfs-class-mic: use the right wildcards on What definitions
  ABI: pstore: Fix What field
  ABI: sysfs-class-typec: fix a typo on a What field
  ABI: sysfs-ata: use a proper wildcard for ata_*
  ABI: sysfs-class-infiniband: use wildcards on What definitions
  ABI: sysfs-bus-pci: use wildcards on What definitions
  ABI: sysfs-bus-soundwire-master: use wildcards on What definitions
  ABI: sysfs-bus-soundwire-slave: use wildcards on What definitions
  ABI: sysfs-class-gnss: use wildcards on What definitions
  ABI: sysfs-class-mei: use wildcards on What definitions
  ABI: sysfs-class-mux: use wildcards on What definitions
  ABI: sysfs-class-pwm: use wildcards on What definitions
  ABI: sysfs-class-rc: use wildcards on What definitions
  ABI: sysfs-class-rc-nuvoton: use wildcards on What definitions
  ABI: sysfs-class-uwb_rc: use wildcards on What definitions
  ABI: sysfs-class-uwb_rc-wusbhc: use wildcards on What definitions
  ABI: sysfs-devices-platform-dock: use wildcards on What definitions
  ABI: sysfs-devices-system-cpu: use wildcards on What definitions
  ABI: sysfs-firmware-efi-esrt: use wildcards on What definitions
  ABI: sysfs-platform-sst-atom: use wildcards on What definitions
  ABI: sysfs-ptp: use wildcards on What definitions
  ABI: sysfs-class-rapidio: use wildcards on What definitions

 .../ABI/stable/sysfs-class-infiniband         | 64 ++++++-------
 Documentation/ABI/stable/sysfs-class-tpm      |  2 +-
 Documentation/ABI/testing/evm                 |  4 +-
 Documentation/ABI/testing/ima_policy          |  2 +-
 Documentation/ABI/testing/pstore              |  3 +-
 Documentation/ABI/testing/sysfs-ata           |  2 +-
 Documentation/ABI/testing/sysfs-bus-pci       |  2 +-
 Documentation/ABI/testing/sysfs-bus-rapidio   | 32 +++----
 .../ABI/testing/sysfs-bus-soundwire-master    | 20 ++--
 .../ABI/testing/sysfs-bus-soundwire-slave     | 60 ++++++------
 Documentation/ABI/testing/sysfs-bus-usb       | 16 ++--
 Documentation/ABI/testing/sysfs-class-cxl     | 15 ++-
 .../ABI/testing/sysfs-class-devfreq-event     | 12 +--
 Documentation/ABI/testing/sysfs-class-gnss    |  2 +-
 Documentation/ABI/testing/sysfs-class-mei     | 18 ++--
 Documentation/ABI/testing/sysfs-class-mic     | 24 ++---
 Documentation/ABI/testing/sysfs-class-mux     |  2 +-
 Documentation/ABI/testing/sysfs-class-pwm     | 20 ++--
 Documentation/ABI/testing/sysfs-class-rapidio |  4 +-
 Documentation/ABI/testing/sysfs-class-rc      | 14 +--
 .../ABI/testing/sysfs-class-rc-nuvoton        |  2 +-
 Documentation/ABI/testing/sysfs-class-typec   |  2 +-
 Documentation/ABI/testing/sysfs-class-uwb_rc  | 26 ++---
 .../ABI/testing/sysfs-class-uwb_rc-wusbhc     | 10 +-
 .../ABI/testing/sysfs-devices-platform-dock   | 10 +-
 .../ABI/testing/sysfs-devices-system-cpu      | 16 ++--
 .../ABI/testing/sysfs-firmware-efi-esrt       | 16 ++--
 Documentation/ABI/testing/sysfs-kernel-slab   | 94 +++++++++----------
 .../ABI/testing/sysfs-platform-sst-atom       |  2 +-
 Documentation/ABI/testing/sysfs-ptp           | 30 +++---
 Documentation/ABI/testing/sysfs-tty           | 32 +++----
 31 files changed, 282 insertions(+), 276 deletions(-)

-- 
2.31.1




* Re: [PATCH] serial: 8250: SERIAL_8250_FSL should not default to y when compile-testing
From: Geert Uytterhoeven @ 2021-09-16  8:55 UTC
  To: Johan Hovold
  Cc: Linux Kernel Mailing List, Greg Kroah-Hartman, linuxppc-dev,
	Li Yang, Scott Wood, open list:SERIAL DRIVERS, Shawn Guo,
	Jiri Slaby, Linux ARM
In-Reply-To: <YUMESxr907YHM3ZT@hovoldconsulting.com>

Hi Johan,

On Thu, Sep 16, 2021 at 10:46 AM Johan Hovold <johan@kernel.org> wrote:
> On Wed, Sep 15, 2021 at 02:56:52PM +0200, Geert Uytterhoeven wrote:
> > Commit b1442c55ce8977aa ("serial: 8250: extend compile-test coverage")
> > added compile-test support to the Freescale 16550 driver.  However, as
> > SERIAL_8250_FSL is an invisible symbol, merely enabling COMPILE_TEST now
> > enables this driver.
> >
> > Fix this by making SERIAL_8250_FSL visible.  Tighten the dependencies to
> > prevent asking the user about this driver when configuring a kernel
> > without appropriate Freescale SoC or ACPI support.
>
> This tightening is arguably a separate change which risks introducing
> regressions if you get it wrong, and it should go in a separate patch at
> least.

Getting it wrong would indeed be a regression, but not tightening
that at the same time would mean I have to send a separate patch with
a Fixes tag referring to this fix, following this template:

    foo should depend on bar

    The foo hardware is only present on bar SoCs.  Hence add a
    dependency on bar, to prevent asking the user about this driver
    when configuring a kernel without bar support.

> > Fixes: b1442c55ce8977aa ("serial: 8250: extend compile-test coverage")
> > Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> > ---
> > Yes, it's ugly, but I see no better solution. Do you?
> >
> >  drivers/tty/serial/8250/Kconfig | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
> > index 808268edd2e82a45..a2978b31144e94f2 100644
> > --- a/drivers/tty/serial/8250/Kconfig
> > +++ b/drivers/tty/serial/8250/Kconfig
> > @@ -361,9 +361,13 @@ config SERIAL_8250_BCM2835AUX
> >         If unsure, say N.
> >
> >  config SERIAL_8250_FSL
> > -     bool
> > +     bool "Freescale 16550-style UART support (8250 based driver)"
> >       depends on SERIAL_8250_CONSOLE
> > -     default PPC || ARM || ARM64 || COMPILE_TEST
> > +     depends on FSL_SOC || ARCH_LAYERSCAPE || SOC_LS1021A || (ARM64 && ACPI) || COMPILE_TEST
> > +     default FSL_SOC || ARCH_LAYERSCAPE || SOC_LS1021A || (ARM64 && ACPI)
>
> I'd suggest just doing
>
>         bool "Freescale 16550-style UART support (8250 based driver)"
>         depends on SERIAL_8250_CONSOLE
>         default PPC || ARM || ARM64
>
> since none of the symbols you add to that "depends on" line is an
> actual build or runtime dependency.

They are.

> Then you can refine the "default" line in a follow up (or argue why you
> think there should be a "depends on FSL_SOC || ...").
>
> > +     help
> > +       Selecting this option will add support for the 16550-style serial
> > +       port hardware found on Freescale SoCs.
> >
> >  config SERIAL_8250_DW
> >       tristate "Support for Synopsys DesignWare 8250 quirks"

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [PATCH] serial: 8250: SERIAL_8250_FSL should not default to y when compile-testing
From: Johan Hovold @ 2021-09-16  8:46 UTC
  To: Geert Uytterhoeven
  Cc: linux-kernel, Greg Kroah-Hartman, linuxppc-dev, Li Yang,
	Scott Wood, linux-serial, Shawn Guo, Jiri Slaby, linux-arm-kernel
In-Reply-To: <c5f8aa5c081755f3c960b86fc61c2baaa33edcd9.1631710216.git.geert+renesas@glider.be>

On Wed, Sep 15, 2021 at 02:56:52PM +0200, Geert Uytterhoeven wrote:
> Commit b1442c55ce8977aa ("serial: 8250: extend compile-test coverage")
> added compile-test support to the Freescale 16550 driver.  However, as
> SERIAL_8250_FSL is an invisible symbol, merely enabling COMPILE_TEST now
> enables this driver.
> 
> Fix this by making SERIAL_8250_FSL visible.  Tighten the dependencies to
> prevent asking the user about this driver when configuring a kernel
> without appropriate Freescale SoC or ACPI support.

This tightening is arguably a separate change which risks introducing
regressions if you get it wrong, and it should go in a separate patch at
least.

> Fixes: b1442c55ce8977aa ("serial: 8250: extend compile-test coverage")
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
> Yes, it's ugly, but I see no better solution. Do you?
> 
>  drivers/tty/serial/8250/Kconfig | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
> index 808268edd2e82a45..a2978b31144e94f2 100644
> --- a/drivers/tty/serial/8250/Kconfig
> +++ b/drivers/tty/serial/8250/Kconfig
> @@ -361,9 +361,13 @@ config SERIAL_8250_BCM2835AUX
>  	  If unsure, say N.
>  
>  config SERIAL_8250_FSL
> -	bool
> +	bool "Freescale 16550-style UART support (8250 based driver)"
>  	depends on SERIAL_8250_CONSOLE
> -	default PPC || ARM || ARM64 || COMPILE_TEST
> +	depends on FSL_SOC || ARCH_LAYERSCAPE || SOC_LS1021A || (ARM64 && ACPI) || COMPILE_TEST
> +	default FSL_SOC || ARCH_LAYERSCAPE || SOC_LS1021A || (ARM64 && ACPI)

I'd suggest just doing

	bool "Freescale 16550-style UART support (8250 based driver)"
	depends on SERIAL_8250_CONSOLE
	default PPC || ARM || ARM64

since none of the symbols you add to that "depends on" line is an
actual build or runtime dependency.

Then you can refine the "default" line in a follow up (or argue why you
think there should be a "depends on FSL_SOC || ...").

> +	help
> +	  Selecting this option will add support for the 16550-style serial
> +	  port hardware found on Freescale SoCs.
>  
>  config SERIAL_8250_DW
>  	tristate "Support for Synopsys DesignWare 8250 quirks"

Johan


* Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
From: Christoph Hellwig @ 2021-09-16  7:35 UTC
  To: Christophe Leroy
  Cc: Sathyanarayanan Kuppuswamy, linux-efi, Brijesh Singh, kvm,
	dri-devel, platform-driver-x86, Paul Mackerras, linux-s390,
	Andi Kleen, Joerg Roedel, x86, amd-gfx, Christoph Hellwig,
	linux-graphics-maintainer, Tom Lendacky, Tianyu Lan,
	Borislav Petkov, kexec, linux-kernel, iommu, linux-fsdevel,
	linuxppc-dev
In-Reply-To: <f8388f18-5e90-5d0f-d681-0b17f8307dd4@csgroup.eu>

On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
> Could you please provide a more explicit explanation of why inlining such a
> helper is considered bad practice and messy?

Because now all the architectures subtly differ.  Look at the mess
for ioremap and the ioremap* variants.

The only good reason to allow inlines is if they are used in a hot
path.  Which cc_platform_has is not, especially not on powerpc.


* Re: [PATCH] powerpc: warn on emulation of dcbz instruction
From: Christophe Leroy @ 2021-09-16  7:23 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: Finn Thain, linuxppc-dev, linux-kernel, Stan Johnson
In-Reply-To: <eb1a39368401bf46e805ca64256604cc649f771e.camel@kernel.crashing.org>



Le 16/09/2021 à 09:16, Benjamin Herrenschmidt a écrit :
> On Thu, 2021-09-16 at 17:15 +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2021-09-15 at 16:31 +0200, Christophe Leroy wrote:
>>> The dcbz instruction shouldn't be used on non-cached memory. Using
>>> it on non-cached memory can result in an alignment exception, which
>>> implies heavy handling.
>>>
>>> Instead of silently emulating the instruction and incurring a high
>>> performance degradation, warn whenever an alignment exception is
>>> taken due to dcbz, so that the user is made aware that the dcbz
>>> instruction has been used unexpectedly.
>>>
>>> Reported-by: Stan Johnson <userm57@yahoo.com>
>>> Cc: Finn Thain <fthain@linux-m68k.org>
>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>> ---
>>>   arch/powerpc/kernel/align.c | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
>>> index bbb4181621dd..adc3a4a9c6e4 100644
>>> --- a/arch/powerpc/kernel/align.c
>>> +++ b/arch/powerpc/kernel/align.c
>>> @@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
>>>   		if (op.type != CACHEOP + DCBZ)
>>>   			return -EINVAL;
>>>   		PPC_WARN_ALIGNMENT(dcbz, regs);
>>> +		WARN_ON_ONCE(1);
>>
>> This is heavy-handed ... It will be treated as an oops by various
>> things and uselessly spit out a kernel backtrace. Isn't
>> PPC_WARN_ALIGNMENT enough?


PPC_WARN_ALIGNMENT() only warns if explicitly activated. I want to
catch uses of 'dcbz' on non-cached memory all the time, as they are most
often the result of using memset() instead of memset_io().

> 
> Ah, I saw your other one about fbdev...  OK, what about doing that in an
> if (!user_mode(regs))?

Yes I can do WARN_ON_ONCE(!user_mode(regs)); instead.
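
To make the intended behaviour concrete, here is a userspace mock, not
kernel code. struct fake_regs, fake_user_mode(), MOCK_WARN_ON_ONCE() and
emulate_dcbz() are hypothetical stand-ins; unlike the kernel macro, the
mock neither returns the condition nor prints a backtrace. It only shows
that a kernel-mode fault warns once per call site and a user-mode fault
never does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct pt_regs: records only the mode. */
struct fake_regs {
	bool user;
};

static bool fake_user_mode(const struct fake_regs *regs)
{
	assert(regs != NULL);
	return regs->user;
}

/* Counts how many warnings were actually emitted. */
static int warnings_printed;

/*
 * Imitation of a warn-once macro: the condition is evaluated on every
 * call, but it is reported only the first time it is true at this
 * call site.
 */
#define MOCK_WARN_ON_ONCE(cond) do {					\
	static bool warned_;						\
	if ((cond) && !warned_) {					\
		warned_ = true;						\
		warnings_printed++;					\
		fprintf(stderr, "WARNING: %s:%d\n", __FILE__, __LINE__); \
	}								\
} while (0)

/* Stand-in for the dcbz branch of fix_alignment(). */
static void emulate_dcbz(const struct fake_regs *regs)
{
	MOCK_WARN_ON_ONCE(!fake_user_mode(regs));
	/* ... the emulation itself would happen here ... */
}
```

Calling emulate_dcbz() with user-mode registers leaves warnings_printed
untouched; repeated kernel-mode calls raise it exactly once.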

> 
> Indeed the kernel should not do that.


Does userspace access non-cached memory directly?

Christophe


* Re: [PATCH] powerpc: warn on emulation of dcbz instruction
From: Benjamin Herrenschmidt @ 2021-09-16  7:16 UTC
  To: Christophe Leroy, Paul Mackerras, Michael Ellerman
  Cc: Finn Thain, linuxppc-dev, linux-kernel, Stan Johnson
In-Reply-To: <2c0fd775625c76c4dd09b3e923da4405a003f3bd.camel@kernel.crashing.org>

On Thu, 2021-09-16 at 17:15 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2021-09-15 at 16:31 +0200, Christophe Leroy wrote:
> > The dcbz instruction shouldn't be used on non-cached memory. Using
> > it on non-cached memory can result in an alignment exception, which
> > implies heavy handling.
> > 
> > Instead of silently emulating the instruction and incurring a high
> > performance degradation, warn whenever an alignment exception is
> > taken due to dcbz, so that the user is made aware that the dcbz
> > instruction has been used unexpectedly.
> > 
> > Reported-by: Stan Johnson <userm57@yahoo.com>
> > Cc: Finn Thain <fthain@linux-m68k.org>
> > Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> > ---
> >  arch/powerpc/kernel/align.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
> > index bbb4181621dd..adc3a4a9c6e4 100644
> > --- a/arch/powerpc/kernel/align.c
> > +++ b/arch/powerpc/kernel/align.c
> > @@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
> >  		if (op.type != CACHEOP + DCBZ)
> >  			return -EINVAL;
> >  		PPC_WARN_ALIGNMENT(dcbz, regs);
> > +		WARN_ON_ONCE(1);
> 
> This is heavy-handed ... It will be treated as an oops by various
> things and uselessly spit out a kernel backtrace. Isn't
> PPC_WARN_ALIGNMENT enough?

Ah, I saw your other one about fbdev...  OK, what about doing that in an
if (!user_mode(regs))?

Indeed the kernel should not do that.

Cheers,
Ben.




* Re: [PATCH] powerpc: warn on emulation of dcbz instruction
From: Benjamin Herrenschmidt @ 2021-09-16  7:15 UTC
  To: Christophe Leroy, Paul Mackerras, Michael Ellerman
  Cc: Finn Thain, linuxppc-dev, linux-kernel, Stan Johnson
In-Reply-To: <62b33ca839f3d1d7d4b64b6f56af0bbe4d2c9057.1631716292.git.christophe.leroy@csgroup.eu>

On Wed, 2021-09-15 at 16:31 +0200, Christophe Leroy wrote:
> The dcbz instruction shouldn't be used on non-cached memory. Using
> it on non-cached memory can result in an alignment exception, which
> implies heavy handling.
> 
> Instead of silently emulating the instruction and incurring a high
> performance degradation, warn whenever an alignment exception is
> taken due to dcbz, so that the user is made aware that the dcbz
> instruction has been used unexpectedly.
> 
> Reported-by: Stan Johnson <userm57@yahoo.com>
> Cc: Finn Thain <fthain@linux-m68k.org>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/kernel/align.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
> index bbb4181621dd..adc3a4a9c6e4 100644
> --- a/arch/powerpc/kernel/align.c
> +++ b/arch/powerpc/kernel/align.c
> @@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
>  		if (op.type != CACHEOP + DCBZ)
>  			return -EINVAL;
>  		PPC_WARN_ALIGNMENT(dcbz, regs);
> +		WARN_ON_ONCE(1);

This is heavy-handed ... It will be treated as an oops by various
things and uselessly spit out a kernel backtrace. Isn't PPC_WARN_ALIGNMENT
enough?

Ben.




* Re: [PATCH v4] lockdown,selinux: fix wrong subject in some SELinux lockdown checks
From: Ondrej Mosnacek @ 2021-09-16  6:57 UTC
  To: Paul Moore
  Cc: linux-efi, Linux PCI, linux-cxl, Steffen Klassert, Herbert Xu,
	X86 ML, James Morris, Linux ACPI, Ingo Molnar, linux-serial,
	Linux-pm mailing list, SElinux list, Steven Rostedt,
	Casey Schaufler, Dan Williams, network dev, Stephen Smalley,
	Kexec Mailing List, Linux kernel mailing list,
	Linux Security Module list, Linux FS Devel, bpf, linuxppc-dev,
	David S . Miller
In-Reply-To: <CAHC9VhQyejnmLn0NHQiWzikHs8ZdzAUdZ2WqNxgGM6xhJ4mvMQ@mail.gmail.com>

On Thu, Sep 16, 2021 at 4:59 AM Paul Moore <paul@paul-moore.com> wrote:
> On Mon, Sep 13, 2021 at 5:05 PM Paul Moore <paul@paul-moore.com> wrote:
> >
> > On Mon, Sep 13, 2021 at 10:02 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > >
> > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > > lockdown") added an implementation of the locked_down LSM hook to
> > > SELinux, with the aim to restrict which domains are allowed to perform
> > > operations that would breach lockdown.
> > >
> > > However, in several places the security_locked_down() hook is called in
> > > situations where the current task isn't doing any action that would
> > > directly breach lockdown, leading to SELinux checks that are basically
> > > bogus.
> > >
> > > > To fix this, add an explicit struct cred pointer argument to
> > > > security_locked_down() and define NULL as a special value to pass instead
> > > of current_cred() in such situations. LSMs that take the subject
> > > credentials into account can then fall back to some default or ignore
> > > such calls altogether. In the SELinux lockdown hook implementation, use
> > > SECINITSID_KERNEL in case the cred argument is NULL.
> > >
> > > Most of the callers are updated to pass current_cred() as the cred
> > > pointer, thus maintaining the same behavior. The following callers are
> > > modified to pass NULL as the cred pointer instead:
> > > 1. arch/powerpc/xmon/xmon.c
> > >      Seems to be some interactive debugging facility. It appears that
> > >      the lockdown hook is called from interrupt context here, so it
> > >      should be more appropriate to request a global lockdown decision.
> > > 2. fs/tracefs/inode.c:tracefs_create_file()
> > >      Here the call is used to prevent creating new tracefs entries when
> > >      the kernel is locked down. Assumes that locking down is one-way -
> > >      i.e. if the hook returns non-zero once, it will never return zero
> > >      again, thus no point in creating these files. Also, the hook is
> > >      often called by a module's init function when it is loaded by
> > >      userspace, where it doesn't make much sense to do a check against
> > >      the current task's creds, since the task itself doesn't actually
> > >      use the tracing functionality (i.e. doesn't breach lockdown), just
> > >      indirectly makes some new tracepoints available to whoever is
> > >      authorized to use them.
> > > 3. net/xfrm/xfrm_user.c:copy_to_user_*()
> > >      Here a cryptographic secret is redacted based on the value returned
> > >      from the hook. There are two possible actions that may lead here:
> > >      a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > >         task context is relevant, since the dumped data is sent back to
> > >         the current task.
> > >      b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> > >         dumped SA is broadcasted to tasks subscribed to XFRM events -
> > >         here the current task context is not relevant as it doesn't
> > >         represent the tasks that could potentially see the secret.
> > >      It doesn't seem worth it to try to keep using the current task's
> > >      context in the a) case, since the eventual data leak can be
> > >      circumvented anyway via b), plus there is no way for the task to
> > >      indicate that it doesn't care about the actual key value, so the
> > >      check could generate a lot of "false alert" denials with SELinux.
> > >      Thus, let's pass NULL instead of current_cred() here faute de
> > >      mieux.
> > >
> > > Improvements-suggested-by: Casey Schaufler <casey@schaufler-ca.com>
> > > Improvements-suggested-by: Paul Moore <paul@paul-moore.com>
> > > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> > > Acked-by: Dan Williams <dan.j.williams@intel.com>         [cxl]
> > > Acked-by: Steffen Klassert <steffen.klassert@secunet.com> [xfrm]
> > > Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > ---
> > >
> > > v4:
> > > - rebase on top of TODO
> > > - fix rebase conflicts:
> > >   * drivers/cxl/pci.c
> > >     - trivial: the lockdown reason was corrected in mainline
> > >   * kernel/bpf/helpers.c, kernel/trace/bpf_trace.c
> > >     - trivial: LOCKDOWN_BPF_READ was renamed to LOCKDOWN_BPF_READ_KERNEL
> > >       in mainline
> > >   * kernel/power/hibernate.c
> > >     - trivial: !secretmem_active() was added to the condition in
> > >       hibernation_available()
> > > - cover new security_locked_down() call in kernel/bpf/helpers.c
> > >   (LOCKDOWN_BPF_WRITE_USER in BPF_FUNC_probe_write_user case)
> > >
> > > v3: https://lore.kernel.org/lkml/20210616085118.1141101-1-omosnace@redhat.com/
> > > - add the cred argument to security_locked_down() and adapt all callers
> > > - keep using current_cred() in BPF, as the hook calls have been shifted
> > >   to program load time (commit ff40e51043af ("bpf, lockdown, audit: Fix
> > >   buggy SELinux lockdown permission checks"))
> > > - in SELinux, don't ignore hook calls where cred == NULL, but use
> > >   SECINITSID_KERNEL as the subject instead
> > > - update explanations in the commit message
> > >
> > > v2: https://lore.kernel.org/lkml/20210517092006.803332-1-omosnace@redhat.com/
> > > - change to a single hook based on suggestions by Casey Schaufler
> > >
> > > v1: https://lore.kernel.org/lkml/20210507114048.138933-1-omosnace@redhat.com/
> >
> > The changes between v3 and v4 all seem sane to me, but I'm going to
> > let this sit for a few days in hopes that we can collect a few more
> > Reviewed-bys and ACKs.  If I don't see any objections I'll merge it
> > mid-week(ish) into selinux/stable-5.15 and plan on sending it to Linus
> > after it goes through a build/test cycle.
>
> Time's up, I just merged this into selinux/stable-5.15 and I'll send
> this to Linus once it passes testing.

Thanks!
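
For readers skimming the thread, the NULL-cred convention from the quoted
commit message can be sketched in plain C. This is a userspace mock under
stated assumptions: struct mock_cred, MOCK_SID_KERNEL (standing in for
SECINITSID_KERNEL), lockdown_subject_sid() and mock_lockdown_check() are
hypothetical names, and the one-SID "policy" is invented for illustration.
It is not the SELinux implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cred: carries only a security ID. */
struct mock_cred {
	unsigned int sid;
};

/* Stand-in for SECINITSID_KERNEL; the value is arbitrary in this mock. */
#define MOCK_SID_KERNEL 1u

/* Pick the subject: the caller's SID, or the kernel SID when cred is NULL. */
static unsigned int lockdown_subject_sid(const struct mock_cred *cred)
{
	/* SID 0 is treated as invalid in this mock. */
	assert(cred == NULL || cred->sid != 0);
	return cred ? cred->sid : MOCK_SID_KERNEL;
}

/*
 * Invented policy: the kernel SID and one explicitly allowed SID pass
 * the check; every other subject gets -EPERM.
 */
static int mock_lockdown_check(const struct mock_cred *cred,
			       unsigned int allowed_sid)
{
	unsigned int sid = lockdown_subject_sid(cred);

	if (sid == MOCK_SID_KERNEL || sid == allowed_sid)
		return 0;
	return -EPERM;
}
```

A caller that has no meaningful task context passes NULL and is judged as
the kernel subject, while callers acting on behalf of the current task
keep passing their own credentials.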

-- 
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.



* Re: [PATCH v4] lockdown,selinux: fix wrong subject in some SELinux lockdown checks
From: Paul Moore @ 2021-09-16  2:59 UTC
  To: Ondrej Mosnacek
  Cc: linux-efi, linux-pci, linux-cxl, Steffen Klassert, Herbert Xu,
	x86, James Morris, linux-acpi, Ingo Molnar, linux-serial,
	linux-pm, selinux, Steven Rostedt, Casey Schaufler, Dan Williams,
	netdev, Stephen Smalley, kexec, linux-kernel,
	linux-security-module, linux-fsdevel, bpf, linuxppc-dev,
	David S . Miller
In-Reply-To: <CAHC9VhRw-S+zZUFz5QFFLMBATjo+YbPAiR21jX6p7cT0T+MVLA@mail.gmail.com>

On Mon, Sep 13, 2021 at 5:05 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Mon, Sep 13, 2021 at 10:02 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> >
> > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > lockdown") added an implementation of the locked_down LSM hook to
> > SELinux, with the aim to restrict which domains are allowed to perform
> > operations that would breach lockdown.
> >
> > However, in several places the security_locked_down() hook is called in
> > situations where the current task isn't doing any action that would
> > directly breach lockdown, leading to SELinux checks that are basically
> > bogus.
> >
> > > To fix this, add an explicit struct cred pointer argument to
> > > security_locked_down() and define NULL as a special value to pass instead
> > of current_cred() in such situations. LSMs that take the subject
> > credentials into account can then fall back to some default or ignore
> > such calls altogether. In the SELinux lockdown hook implementation, use
> > SECINITSID_KERNEL in case the cred argument is NULL.
> >
> > Most of the callers are updated to pass current_cred() as the cred
> > pointer, thus maintaining the same behavior. The following callers are
> > modified to pass NULL as the cred pointer instead:
> > 1. arch/powerpc/xmon/xmon.c
> >      Seems to be some interactive debugging facility. It appears that
> >      the lockdown hook is called from interrupt context here, so it
> >      should be more appropriate to request a global lockdown decision.
> > 2. fs/tracefs/inode.c:tracefs_create_file()
> >      Here the call is used to prevent creating new tracefs entries when
> >      the kernel is locked down. Assumes that locking down is one-way -
> >      i.e. if the hook returns non-zero once, it will never return zero
> >      again, thus no point in creating these files. Also, the hook is
> >      often called by a module's init function when it is loaded by
> >      userspace, where it doesn't make much sense to do a check against
> >      the current task's creds, since the task itself doesn't actually
> >      use the tracing functionality (i.e. doesn't breach lockdown), just
> >      indirectly makes some new tracepoints available to whoever is
> >      authorized to use them.
> > 3. net/xfrm/xfrm_user.c:copy_to_user_*()
> >      Here a cryptographic secret is redacted based on the value returned
> >      from the hook. There are two possible actions that may lead here:
> >      a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> >         task context is relevant, since the dumped data is sent back to
> >         the current task.
> >      b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> >         dumped SA is broadcasted to tasks subscribed to XFRM events -
> >         here the current task context is not relevant as it doesn't
> >         represent the tasks that could potentially see the secret.
> >      It doesn't seem worth it to try to keep using the current task's
> >      context in the a) case, since the eventual data leak can be
> >      circumvented anyway via b), plus there is no way for the task to
> >      indicate that it doesn't care about the actual key value, so the
> >      check could generate a lot of "false alert" denials with SELinux.
> >      Thus, let's pass NULL instead of current_cred() here faute de
> >      mieux.
> >
> > Improvements-suggested-by: Casey Schaufler <casey@schaufler-ca.com>
> > Improvements-suggested-by: Paul Moore <paul@paul-moore.com>
> > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> > Acked-by: Dan Williams <dan.j.williams@intel.com>         [cxl]
> > Acked-by: Steffen Klassert <steffen.klassert@secunet.com> [xfrm]
> > Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
> > ---
> >
> > v4:
> > - rebase on top of TODO
> > - fix rebase conflicts:
> >   * drivers/cxl/pci.c
> >     - trivial: the lockdown reason was corrected in mainline
> >   * kernel/bpf/helpers.c, kernel/trace/bpf_trace.c
> >     - trivial: LOCKDOWN_BPF_READ was renamed to LOCKDOWN_BPF_READ_KERNEL
> >       in mainline
> >   * kernel/power/hibernate.c
> >     - trivial: !secretmem_active() was added to the condition in
> >       hibernation_available()
> > - cover new security_locked_down() call in kernel/bpf/helpers.c
> >   (LOCKDOWN_BPF_WRITE_USER in BPF_FUNC_probe_write_user case)
> >
> > v3: https://lore.kernel.org/lkml/20210616085118.1141101-1-omosnace@redhat.com/
> > - add the cred argument to security_locked_down() and adapt all callers
> > - keep using current_cred() in BPF, as the hook calls have been shifted
> >   to program load time (commit ff40e51043af ("bpf, lockdown, audit: Fix
> >   buggy SELinux lockdown permission checks"))
> > - in SELinux, don't ignore hook calls where cred == NULL, but use
> >   SECINITSID_KERNEL as the subject instead
> > - update explanations in the commit message
> >
> > v2: https://lore.kernel.org/lkml/20210517092006.803332-1-omosnace@redhat.com/
> > - change to a single hook based on suggestions by Casey Schaufler
> >
> > v1: https://lore.kernel.org/lkml/20210507114048.138933-1-omosnace@redhat.com/
>
> The changes between v3 and v4 all seem sane to me, but I'm going to
> let this sit for a few days in hopes that we can collect a few more
> Reviewed-bys and ACKs.  If I don't see any objections I'll merge it
> mid-week(ish) into selinux/stable-5.15 and plan on sending it to Linus
> after it goes through a build/test cycle.

Time's up, I just merged this into selinux/stable-5.15 and I'll send
this to Linus once it passes testing.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
From: Jordan Niethe @ 2021-09-16  2:04 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <CEAW7GNXW96H.18ANPMC01JA2C@wrwlf0000>

On Thu, Sep 16, 2021 at 10:40 AM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> > On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> > <cmr@bluescreens.de> wrote:
> > > ...
> > > +/*
> > > + * This can be called for kernel text or a module.
> > > + */
> > > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > > +{
> > > +       struct page *page;
> > > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +       if (is_vmalloc_or_module_addr(addr))
> > > +               page = vmalloc_to_page(addr);
> > > +       else
> > > +               page = virt_to_page(addr);
> > > +
> > > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > > +                                            &patch_mapping->ptl);
> > > +       if (unlikely(!patch_mapping->ptep)) {
> > > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > > +               return -1;
> > > +       }
> > > +
> > > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> >
> > > I think a ptesync would be needed here, because switch_mm_irqs_off()
> > > will not necessarily have a barrier.
> > A spurious fault here from __patch_instruction() would not be handled
> > correctly.
>
> Sorry I don't quite follow - can you explain this to me in a bit more
> detail?

radix__set_pte_at() skips calling ptesync as an optimization.
If there is no ordering between changing the pte and then accessing
the page with __patch_instruction(), a spurious fault could be raised.
I think such a fault would end up with bad_kernel_fault() returning
true, so it would not be fixed up.

I thought there might be a barrier in switch_mm_irqs_off() that would
provide this ordering but afaics that is not always the case.

So I think that we need to have a ptesync after set_pte_at().
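
A rough userspace model of the ordering argued for above, only to make
the sequence concrete. Everything named stub_* is a hypothetical
stand-in that records its own name; none of it is kernel code, and the
real barrier would be the powerpc ptesync instruction issued from
map_patch_mm(). Where exactly it lands relative to
start_using_temporary_mm() is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace stand-ins: they only record the call order. */
#define MAX_EVENTS 8
static const char *events[MAX_EVENTS];
static int nr_events;

static void record(const char *name)
{
	assert(nr_events < MAX_EVENTS);
	events[nr_events++] = name;
}

static void stub_set_pte_at(void)               { record("set_pte_at"); }
static void stub_ptesync(void)                  { record("ptesync"); }
static void stub_start_using_temporary_mm(void) { record("start_using_temporary_mm"); }
static void stub_patch_instruction(void)        { record("__patch_instruction"); }

/*
 * Model of the proposed sequence: the PTE update is followed by a
 * ptesync before anything touches the page through the new mapping.
 */
static void map_and_patch_model(void)
{
	nr_events = 0;
	stub_set_pte_at();
	stub_ptesync();                  /* proposed addition */
	stub_start_using_temporary_mm();
	stub_patch_instruction();        /* first access via the new PTE */
}

/* Position of a recorded call, or -1 if it was never made. */
static int event_index(const char *name)
{
	for (int i = 0; i < nr_events; i++)
		if (strcmp(events[i], name) == 0)
			return i;
	return -1;
}
```

After map_and_patch_model() runs, event_index("ptesync") falls between
the index of "set_pte_at" and the index of "__patch_instruction", which
is the only property the model is meant to show.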

^ permalink raw reply

* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
From: Jordan Niethe @ 2021-09-16  1:52 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <CEAVVEORU7UL.1ZDGQIF33JSOX@wrwlf0000>

On Thu, Sep 16, 2021 at 10:38 AM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> > On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> > <cmr@bluescreens.de> wrote:
> > >
> > > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > > address to be patched is temporarily mapped as writeable. Currently, a
> > > per-cpu vmalloc patch area is used for this purpose. While the patch
> > > area is per-cpu, the temporary page mapping is inserted into the kernel
> > > page tables for the duration of patching. The mapping is exposed to CPUs
> > > other than the patching CPU - this is undesirable from a hardening
> > > perspective. Use a temporary mm instead which keeps the mapping local to
> > > the CPU doing the patching.
> > >
> > > Use the `poking_init` init hook to prepare a temporary mm and patching
> > > address. Initialize the temporary mm by copying the init mm. Choose a
> > > randomized patching address inside the temporary mm userspace address
> > > space. The patching address is randomized between PAGE_SIZE and
> > > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> > >
> > > Bits of entropy with 64K page size on BOOK3S_64:
> > >
> > >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > >
> > >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > >         bits of entropy = log2(128TB / 64K)
> > >         bits of entropy = 31
> > >
> > > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > > available. Currently the Hash MMU does not use a temporary mm so
> > > technically this upper limit isn't necessary; however, a larger
> > > randomization range does not further "harden" this overall approach and
> > > future work may introduce patching with a temporary mm on Hash as well.
> > >
> > > Randomization occurs only once during initialization at boot for each
> > > possible CPU in the system.
> > >
> > > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > > respectively create and remove the temporary mapping with write
> > > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> > >
> > > Based on x86 implementation:
> > >
> > > commit 4fc19708b165
> > > ("x86/alternatives: Initialize temporary mm for patching")
> > >
> > > and:
> > >
> > > commit b3fd8e83ada0
> > > ("x86/alternatives: Use temporary mm for text poking")
> > >
> > > Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> > >
> > > ---
> > >
> > > v6:  * Small clean-ups (naming, formatting, style, etc).
> > >      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> > >        patching.
> > >      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> > >
> > > v5:  * Only support Book3s64 Radix MMU for now.
> > >      * Use a per-cpu datastructure to hold the patching_addr and
> > >        patching_mm to avoid the need for a synchronization lock/mutex.
> > >
> > > v4:  * In the previous series this was two separate patches: one to init
> > >        the temporary mm in poking_init() (unused in powerpc at the time)
> > >        and the other to use it for patching (which removed all the
> > >        per-cpu vmalloc code). Now that we use poking_init() in the
> > >        existing per-cpu vmalloc approach, that separation doesn't work
> > >        as nicely anymore so I just merged the two patches into one.
> > >      * Preload the SLB entry and hash the page for the patching_addr
> > >        when using Hash on book3s64 to avoid taking an SLB and Hash fault
> > >        during patching. The previous implementation was a hack which
> > >        changed current->mm to allow the SLB and Hash fault handlers to
> > >        work with the temporary mm since both of those code-paths always
> > >        assume mm == current->mm.
> > >      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> > >        have to manage the mm->context.active_cpus counter and mm cpumask
> > >        since they determine (via mm_is_thread_local()) if the TLB flush
> > >        in pte_clear() is local or not - it should always be local when
> > >        we're using the temporary mm. On book3s64's Radix MMU we can
> > >        just call local_flush_tlb_mm().
> > >      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> > >        KUAP.
> > > ---
> > >  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
> > >  1 file changed, 112 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > > index e802e42c2789..af8e2a02a9dd 100644
> > > --- a/arch/powerpc/lib/code-patching.c
> > > +++ b/arch/powerpc/lib/code-patching.c
> > > @@ -11,6 +11,7 @@
> > >  #include <linux/cpuhotplug.h>
> > >  #include <linux/slab.h>
> > >  #include <linux/uaccess.h>
> > > +#include <linux/random.h>
> > >
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/page.h>
> > > @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> > >
> > >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > >  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> > > +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> > >
> > >  static int text_area_cpu_up(unsigned int cpu)
> > >  {
> > > @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
> > >         return 0;
> > >  }
> > >
> > > +static __always_inline void __poking_init_temp_mm(void)
> > > +{
> > > +       int cpu;
> > > +       spinlock_t *ptl; /* for protecting pte table */
> >
> > ptl is just used so we don't have to open code allocating a pte in
> > patching_mm isn't it?
>
> Yup - I think that comment was a copy-pasta... I'll improve it.
>
> >
> > > +       pte_t *ptep;
> > > +       struct mm_struct *patching_mm;
> > > +       unsigned long patching_addr;
> > > +
> > > +       for_each_possible_cpu(cpu) {
> > > +               patching_mm = copy_init_mm();
> > > +               WARN_ON(!patching_mm);
> >
> > Would it be okay to just let the mmu handle null pointer dereferences?
>
> In general I think yes; however, the NULL dereference wouldn't occur
> until later during actual patching so I thought an early WARN here is
> appropriate.
>
> >
> > > +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> > > +
> > > +               /*
> > > +                * Choose a randomized, page-aligned address from the range:
> > > +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> > > +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> > > +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> > > +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> > > +                */
> > > +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> > > +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> > > +               per_cpu(cpu_patching_addr, cpu) = patching_addr;
> >
> > On x86 the randomization depends on CONFIG_RANDOMIZE_BASE. Should it
> > be controllable here too?
>
> IIRC CONFIG_RANDOMIZE_BASE is for KASLR which IMO doesn't really have
> much to do with this.
>
> >
> > > +
> > > +               /*
> > > +                * PTE allocation uses GFP_KERNEL which means we need to
> > > +                * pre-allocate the PTE here because we cannot do the
> > > +                * allocation during patching when IRQs are disabled.
> > > +                */
> > > +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > > +               WARN_ON(!ptep);
> > > +               pte_unmap_unlock(ptep, ptl);
> > > +       }
> > > +}
> > > +
> > >  void __init poking_init(void)
> > >  {
> > > +       if (radix_enabled()) {
> > > +               __poking_init_temp_mm();
> >
> > Should this also be done with cpuhp_setup_state()?
>
> I think I prefer doing the setup ahead of time during boot.

It does lose the ability to free up memory after a cpu is hot
unplugged but I'm not sure if that's a big problem.

>
> >
> > > +               return;
> > > +       }
> > > +
> > >         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > >                 "powerpc/text_poke:online", text_area_cpu_up,
> > >                 text_area_cpu_down) < 0);
> > > @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
> > >         return 0;
> > >  }
> > >
> > > +struct patch_mapping {
> > > +       spinlock_t *ptl; /* for protecting pte table */
> > > +       pte_t *ptep;
> > > +       struct temp_mm temp_mm;
> > > +};
> > > +
> > > +/*
> > > + * This can be called for kernel text or a module.
> > > + */
> > > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > > +{
> > > +       struct page *page;
> > > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +       if (is_vmalloc_or_module_addr(addr))
> > > +               page = vmalloc_to_page(addr);
> > > +       else
> > > +               page = virt_to_page(addr);
> > > +
> > > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > > +                                            &patch_mapping->ptl);
> > > +       if (unlikely(!patch_mapping->ptep)) {
> > > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > > +               return -1;
> > > +       }
> > > +
> > > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> > > +
> > > +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > > +       start_using_temporary_mm(&patch_mapping->temp_mm);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> > > +{
> > > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> > > +
> > > +       local_flush_tlb_mm(patching_mm);
> > > +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> > > +
> > > +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
> > >  {
> > >         int err, rc = 0;
> > >         u32 *patch_addr = NULL;
> > >         unsigned long flags;
> > > +       struct patch_mapping patch_mapping;
> > >
> > >         /*
> > > -        * During early early boot patch_instruction is called
> > > -        * when text_poke_area is not ready, but we still need
> > > -        * to allow patching. We just do the plain old patching
> > > +        * During early early boot patch_instruction is called when the
> > > +        * patching_mm/text_poke_area is not ready, but we still need to allow
> > > +        * patching. We just do the plain old patching.
> > >          */
> > > -       if (!this_cpu_read(text_poke_area))
> > > -               return raw_patch_instruction(addr, instr);
> > > +       if (radix_enabled()) {
> > > +               if (!this_cpu_read(cpu_patching_mm))
> > > +                       return raw_patch_instruction(addr, instr);
> > > +       } else {
> > > +               if (!this_cpu_read(text_poke_area))
> > > +                       return raw_patch_instruction(addr, instr);
> > > +       }
> >
> > > Would testing cpu_patching_addr handle both of these cases?
> >
> > Then I think it might be clearer to do something like this:
> > if (radix_enabled()) {
> > return patch_instruction_mm(addr, instr);
> > }
> >
> > patch_instruction_mm() would combine map_patch_mm(), then patching and
> > unmap_patch_mm() into one function.
> >
> > IMO, a bit of code duplication would be cleaner than checking multiple
> > times for radix_enabled() and having struct patch_mapping especially
> > for maintaining state.
>
> Hmm, I think it's a good idea - I'll give it a go for the next version.
> Thanks for the suggestion!
>
> >
> > >
> > >         local_irq_save(flags);
> > >
> > > -       err = map_patch_area(addr);
> > > +       if (radix_enabled())
> > > +               err = map_patch_mm(addr, &patch_mapping);
> > > +       else
> > > +               err = map_patch_area(addr);
> > >         if (err)
> > >                 goto out;
> > >
> > >         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
> > >         rc = __patch_instruction(addr, instr, patch_addr);
> > >
> > > -       err = unmap_patch_area();
> > > +       if (radix_enabled())
> > > +               err = unmap_patch_mm(&patch_mapping);
> > > +       else
> > > +               err = unmap_patch_area();
> > >
> > >  out:
> > >         local_irq_restore(flags);
> > > --
> > > 2.32.0
> > >
> > Thanks,
> > Jordan
>

^ permalink raw reply

* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
From: Christopher M. Riedl @ 2021-09-16  0:45 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <CACzsE9qr6QK_Xm6yVXT061sxR9SXaeFx5fkjiNAXFBFr6WDQOw@mail.gmail.com>

On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> <cmr@bluescreens.de> wrote:
> > ... 
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > +{
> > +       struct page *page;
> > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +       if (is_vmalloc_or_module_addr(addr))
> > +               page = vmalloc_to_page(addr);
> > +       else
> > +               page = virt_to_page(addr);
> > +
> > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +                                            &patch_mapping->ptl);
> > +       if (unlikely(!patch_mapping->ptep)) {
> > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > +               return -1;
> > +       }
> > +
> > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
>
> I think a ptesync would be needed here, because switch_mm_irqs_off()
> will not necessarily have a barrier.
> A spurious fault here from __patch_instruction() would not be handled
> correctly.

Sorry I don't quite follow - can you explain this to me in a bit more
detail?

^ permalink raw reply

* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
From: Christopher M. Riedl @ 2021-09-16  0:29 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <CACzsE9rHnN9hY4b926r6Fc5tC0Tc7cvkF8cgVODunz7ZZYNFyA@mail.gmail.com>

On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> <cmr@bluescreens.de> wrote:
> >
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> >
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> >
> > Bits of entropy with 64K page size on BOOK3S_64:
> >
> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >
> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >         bits of entropy = log2(128TB / 64K)
> >         bits of entropy = 31
> >
> > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > available. Currently the Hash MMU does not use a temporary mm so
> > technically this upper limit isn't necessary; however, a larger
> > randomization range does not further "harden" this overall approach and
> > future work may introduce patching with a temporary mm on Hash as well.
> >
> > Randomization occurs only once during initialization at boot for each
> > possible CPU in the system.
> >
> > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> >
> > Based on x86 implementation:
> >
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> >
> > and:
> >
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> >
> > Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> >
> > ---
> >
> > v6:  * Small clean-ups (naming, formatting, style, etc).
> >      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> >        patching.
> >      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> >
> > v5:  * Only support Book3s64 Radix MMU for now.
> >      * Use a per-cpu datastructure to hold the patching_addr and
> >        patching_mm to avoid the need for a synchronization lock/mutex.
> >
> > v4:  * In the previous series this was two separate patches: one to init
> >        the temporary mm in poking_init() (unused in powerpc at the time)
> >        and the other to use it for patching (which removed all the
> >        per-cpu vmalloc code). Now that we use poking_init() in the
> >        existing per-cpu vmalloc approach, that separation doesn't work
> >        as nicely anymore so I just merged the two patches into one.
> >      * Preload the SLB entry and hash the page for the patching_addr
> >        when using Hash on book3s64 to avoid taking an SLB and Hash fault
> >        during patching. The previous implementation was a hack which
> >        changed current->mm to allow the SLB and Hash fault handlers to
> >        work with the temporary mm since both of those code-paths always
> >        assume mm == current->mm.
> >      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> >        have to manage the mm->context.active_cpus counter and mm cpumask
> >        since they determine (via mm_is_thread_local()) if the TLB flush
> >        in pte_clear() is local or not - it should always be local when
> >        we're using the temporary mm. On book3s64's Radix MMU we can
> >        just call local_flush_tlb_mm().
> >      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> >        KUAP.
> > ---
> >  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > index e802e42c2789..af8e2a02a9dd 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/cpuhotplug.h>
> >  #include <linux/slab.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/random.h>
> >
> >  #include <asm/tlbflush.h>
> >  #include <asm/page.h>
> > @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> >
> >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> > +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> >
> >  static int text_area_cpu_up(unsigned int cpu)
> >  {
> > @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
> >         return 0;
> >  }
> >
> > +static __always_inline void __poking_init_temp_mm(void)
> > +{
> > +       int cpu;
> > +       spinlock_t *ptl; /* for protecting pte table */
>
> ptl is just used so we don't have to open code allocating a pte in
> patching_mm isn't it?

Yup - I think that comment was a copy-pasta... I'll improve it.

>
> > +       pte_t *ptep;
> > +       struct mm_struct *patching_mm;
> > +       unsigned long patching_addr;
> > +
> > +       for_each_possible_cpu(cpu) {
> > +               patching_mm = copy_init_mm();
> > +               WARN_ON(!patching_mm);
>
> Would it be okay to just let the mmu handle null pointer dereferences?

In general I think yes; however, the NULL dereference wouldn't occur
until later during actual patching so I thought an early WARN here is
appropriate. 

>
> > +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> > +
> > +               /*
> > +                * Choose a randomized, page-aligned address from the range:
> > +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> > +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> > +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> > +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> > +                */
> > +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> > +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> > +               per_cpu(cpu_patching_addr, cpu) = patching_addr;
>
> On x86 the randomization depends on CONFIG_RANDOMIZE_BASE. Should it
> be controllable here too?

IIRC CONFIG_RANDOMIZE_BASE is for KASLR which IMO doesn't really have
much to do with this.
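
For reference, a small userspace sketch of the address selection and the
entropy figure from the commit message. The SKETCH_* constants are
assumptions that mirror the 64K page / 128TB window case quoted there,
and the random value is passed in as a parameter rather than taken from
get_random_long(), so this is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: Book3S-64 with 64K pages, as in the commit message. */
#define SKETCH_PAGE_SHIFT  16
#define SKETCH_PAGE_SIZE   (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK   (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_MAP_WINDOW  (UINT64_C(1) << 47)	/* 128TB */

/* Same arithmetic as the patch, with the random input made explicit. */
static uint64_t pick_patching_addr(uint64_t rnd)
{
	uint64_t addr = SKETCH_PAGE_SIZE +
		((rnd & SKETCH_PAGE_MASK) %
		 (SKETCH_MAP_WINDOW - 2 * SKETCH_PAGE_SIZE));

	/* Page-aligned, above the zero page, below the map window. */
	assert((addr & ~SKETCH_PAGE_MASK) == 0);
	assert(addr >= SKETCH_PAGE_SIZE);
	assert(addr <= SKETCH_MAP_WINDOW - 2 * SKETCH_PAGE_SIZE);
	return addr;
}

/* floor(log2(window / page_size)): number of random page-slot bits. */
static unsigned int entropy_bits(uint64_t window, uint64_t page_size)
{
	uint64_t slots = window / page_size;
	unsigned int bits = 0;

	while (slots > 1) {
		slots >>= 1;
		bits++;
	}
	return bits;
}
```

With these assumed constants, entropy_bits(SKETCH_MAP_WINDOW,
SKETCH_PAGE_SIZE) gives 31, matching the commit message. If I read the
arithmetic right, the largest address the expression can produce is
DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE, which is still inside the range the
code comment documents.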

>
> > +
> > +               /*
> > +                * PTE allocation uses GFP_KERNEL which means we need to
> > +                * pre-allocate the PTE here because we cannot do the
> > +                * allocation during patching when IRQs are disabled.
> > +                */
> > +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > +               WARN_ON(!ptep);
> > +               pte_unmap_unlock(ptep, ptl);
> > +       }
> > +}
> > +
> >  void __init poking_init(void)
> >  {
> > +       if (radix_enabled()) {
> > +               __poking_init_temp_mm();
>
> Should this also be done with cpuhp_setup_state()?

I think I prefer doing the setup ahead of time during boot.

>
> > +               return;
> > +       }
> > +
> >         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> >                 "powerpc/text_poke:online", text_area_cpu_up,
> >                 text_area_cpu_down) < 0);
> > @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
> >         return 0;
> >  }
> >
> > +struct patch_mapping {
> > +       spinlock_t *ptl; /* for protecting pte table */
> > +       pte_t *ptep;
> > +       struct temp_mm temp_mm;
> > +};
> > +
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> > +{
> > +       struct page *page;
> > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +       if (is_vmalloc_or_module_addr(addr))
> > +               page = vmalloc_to_page(addr);
> > +       else
> > +               page = virt_to_page(addr);
> > +
> > +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +                                            &patch_mapping->ptl);
> > +       if (unlikely(!patch_mapping->ptep)) {
> > +               pr_warn("map patch: failed to allocate pte for patching\n");
> > +               return -1;
> > +       }
> > +
> > +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> > +
> > +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > +       start_using_temporary_mm(&patch_mapping->temp_mm);
> > +
> > +       return 0;
> > +}
> > +
> > +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> > +{
> > +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> > +
> > +       local_flush_tlb_mm(patching_mm);
> > +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> > +
> > +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> > +
> > +       return 0;
> > +}
> > +
> >  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
> >  {
> >         int err, rc = 0;
> >         u32 *patch_addr = NULL;
> >         unsigned long flags;
> > +       struct patch_mapping patch_mapping;
> >
> >         /*
> > -        * During early early boot patch_instruction is called
> > -        * when text_poke_area is not ready, but we still need
> > -        * to allow patching. We just do the plain old patching
> > +        * During early early boot patch_instruction is called when the
> > +        * patching_mm/text_poke_area is not ready, but we still need to allow
> > +        * patching. We just do the plain old patching.
> >          */
> > -       if (!this_cpu_read(text_poke_area))
> > -               return raw_patch_instruction(addr, instr);
> > +       if (radix_enabled()) {
> > +               if (!this_cpu_read(cpu_patching_mm))
> > +                       return raw_patch_instruction(addr, instr);
> > +       } else {
> > +               if (!this_cpu_read(text_poke_area))
> > +                       return raw_patch_instruction(addr, instr);
> > +       }
>
> Would testing cpu_patching_addr handle both of these cases?
>
> Then I think it might be clearer to do something like this:
> if (radix_enabled()) {
> return patch_instruction_mm(addr, instr);
> }
>
> patch_instruction_mm() would combine map_patch_mm(), then patching and
> unmap_patch_mm() into one function.
>
> IMO, a bit of code duplication would be cleaner than checking multiple
> times for radix_enabled() and having struct patch_mapping especially
> for maintaining state.

Hmm, I think it's a good idea - I'll give it a go for the next version.
Thanks for the suggestion!
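
For reference, the suggested shape could look roughly like the userspace
sketch below. It only models the control flow: the "radix" flag stands in
for radix_enabled(), patch_instruction_area() is a hypothetical name for
the existing text_poke_area path, and both helpers just store the word
instead of mapping, patching and unmapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of this is kernel code. */
static bool radix;                 /* models radix_enabled() */
static int mm_calls, area_calls;   /* count which path was taken */

/* Would combine map_patch_mm(), the patching and unmap_patch_mm(). */
static int patch_instruction_mm(unsigned int *addr, unsigned int instr)
{
	mm_calls++;
	*addr = instr;
	return 0;
}

/* Would combine map_patch_area(), the patching and unmap_patch_area(). */
static int patch_instruction_area(unsigned int *addr, unsigned int instr)
{
	area_calls++;
	*addr = instr;
	return 0;
}

/* A single radix check up front, no struct patch_mapping carried around. */
static int do_patch_instruction(unsigned int *addr, unsigned int instr)
{
	if (radix)
		return patch_instruction_mm(addr, instr);

	return patch_instruction_area(addr, instr);
}
```

The cost is some duplication between the two helpers; the gain is that
the mapping state stays local to each of them.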

>
> >
> >         local_irq_save(flags);
> >
> > -       err = map_patch_area(addr);
> > +       if (radix_enabled())
> > +               err = map_patch_mm(addr, &patch_mapping);
> > +       else
> > +               err = map_patch_area(addr);
> >         if (err)
> >                 goto out;
> >
> >         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
> >         rc = __patch_instruction(addr, instr, patch_addr);
> >
> > -       err = unmap_patch_area();
> > +       if (radix_enabled())
> > +               err = unmap_patch_mm(&patch_mapping);
> > +       else
> > +               err = unmap_patch_area();
> >
> >  out:
> >         local_irq_restore(flags);
> > --
> > 2.32.0
> >
> Thanks,
> Jordan


^ permalink raw reply

* Re: [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU
From: Christopher M. Riedl @ 2021-09-16  0:24 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <CACzsE9rThU0JBACJoeeHOyEOA8CbFwRExrOhTsySaOH44yJa6g@mail.gmail.com>

On Sat Sep 11, 2021 at 3:26 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:35 PM Christopher M. Riedl
> <cmr@bluescreens.de> wrote:
> >
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. Another benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> >
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> >
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use - this may include breakpoints set by perf.
>
> I had thought CPUs with a DAWR might not need to do this because the
> privilege level that breakpoints trigger on can be configured. But it
> turns out in ptrace, etc we use HW_BRK_TYPE_PRIV_ALL.

Thanks for double checking :)

>
> >
> > Based on x86 implementation:
> >
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> >
> > Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
> >
> > ---
> >
> > v6:  * Use {start,stop}_using_temporary_mm() instead of
> >        {use,unuse}_temporary_mm() as suggested by Christophe.
> >
> > v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
> >
> > v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
> >        using/unusing the temp mm as suggested by Jann Horn to keep
> >        the context.active counter in-sync on mm/nohash.
> >      * Disable SLB preload in the temporary mm when initializing the
> >        temp_mm struct.
> >      * Include asm/debug.h header to fix build issue with
> >        ppc44x_defconfig.
> > ---
> >  arch/powerpc/include/asm/debug.h |  1 +
> >  arch/powerpc/kernel/process.c    |  5 +++
> >  arch/powerpc/lib/code-patching.c | 56 ++++++++++++++++++++++++++++++++
> >  3 files changed, 62 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
> > index 86a14736c76c..dfd82635ea8b 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
> >  #endif
> >
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> >  bool ppc_breakpoint_available(void);
> >  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> >  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 50436b52c213..6aa1f5c4d520 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
> >         return 0;
> >  }
> >
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > +{
> > +       memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> > +}
>
> The breakpoint code is already a little hard to follow. I'm worried
> doing this might spread breakpoint handling into more places in the
> future.
> What about something like having a breakpoint_pause() function which
> clears the hardware registers only and then a breakpoint_resume()
> function that copies from current_brk[] back to the hardware
> registers?
> Then we don't have to make another copy of the breakpoint state.

I think that sounds reasonable - I'll add those functions instead with
the next spin.
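
As a rough userspace model of that idea (a sketch only: the hw_regs[]
array, the struct layout and NR_SLOTS are assumptions for illustration,
not the powerpc implementation), pausing clears just the "hardware"
copy and resuming reloads it from current_brk[]:

```c
#include <assert.h>
#include <string.h>

#define NR_SLOTS 2	/* assumed slot count for this model */

/* Simplified stand-in for struct arch_hw_breakpoint. */
struct brk {
	unsigned long address;
	unsigned int type;
};

static struct brk current_brk[NR_SLOTS];	/* models the per-CPU saved state */
static struct brk hw_regs[NR_SLOTS];		/* models the hardware registers */

/* Like __set_breakpoint(): record the state, then program the hardware. */
static void set_breakpoint(int nr, const struct brk *b)
{
	current_brk[nr] = *b;
	hw_regs[nr] = *b;
}

/* Clear the hardware registers only; current_brk[] is left untouched. */
static void breakpoint_pause(void)
{
	memset(hw_regs, 0, sizeof(hw_regs));
}

/* Reprogram the hardware registers from the saved state. */
static void breakpoint_resume(void)
{
	memcpy(hw_regs, current_brk, sizeof(hw_regs));
}
```

With this shape struct temp_mm would not need its own brk[] copy, since
current_brk[] already holds everything needed for the restore.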

>
> > +
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> >  {
> >         memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > index f9a3019e37b4..8d61a7d35b89 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
>
> Sorry I might have missed it, but what was the reason for not putting
> this stuff in mmu_context.h?

x86 ended up moving this code into their code-patching file as well. I
suppose nobody has thought of another use for a temporary mm like this
yet :)

>
> > @@ -17,6 +17,9 @@
> >  #include <asm/code-patching.h>
> >  #include <asm/setup.h>
> >  #include <asm/inst.h>
> > +#include <asm/mmu_context.h>
> > +#include <asm/debug.h>
> > +#include <asm/tlb.h>
> >
> >  static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 *patch_addr)
> >  {
> > @@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
> >  }
> >
> >  #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +struct temp_mm {
> > +       struct mm_struct *temp;
> > +       struct mm_struct *prev;
> > +       struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> ^ Then we wouldn't need this.
> > +};
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> > +{
> > +       /* We currently only support temporary mm on the Book3s64 Radix MMU */
> > +       WARN_ON(!radix_enabled());
> > +
> > +       temp_mm->temp = mm;
> > +       temp_mm->prev = NULL;
> > +       memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +       lockdep_assert_irqs_disabled();
> > +
> > +       temp_mm->prev = current->active_mm;
> > +       switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
>
> Now that we only support radix it should be fine again to have it like
> this:
> switch_mm_irqs_off(NULL, temp_mm->temp, current);?
> It was changed from that because it would cause issues on nohash I
> thought.

That's true - but if we want to support the other MMUs in the future
I'd rather just keep it as-is. AFAICS there's no harm in passing
temp_mm->prev here instead of NULL.

>
> > +
> > +       WARN_ON(!mm_is_thread_local(temp_mm->temp));
> > +
> > +       if (ppc_breakpoint_available()) {
> > +               struct arch_hw_breakpoint null_brk = {0};
> > +               int i = 0;
> > +
> > +               for (; i < nr_wp_slots(); ++i) {
> > +                       __get_breakpoint(i, &temp_mm->brk[i]);
> > +                       if (temp_mm->brk[i].type != 0)
> > +                               __set_breakpoint(i, &null_brk);
> > +               }
> > +       }
> > +}
> > +
> > +static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +       lockdep_assert_irqs_disabled();
> > +
> > +       switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
> > +
> > +       if (ppc_breakpoint_available()) {
> > +               int i = 0;
> > +
> > +               for (; i < nr_wp_slots(); ++i)
> > +                       if (temp_mm->brk[i].type != 0)
> > +                               __set_breakpoint(i, &temp_mm->brk[i]);
> > +       }
> > +}
> > +
> >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >
> >  static int text_area_cpu_up(unsigned int cpu)
> > --
> > 2.32.0
> >
> Thanks,
> Jordan


^ permalink raw reply

* Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
From: Borislav Petkov @ 2021-09-15 18:47 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Sathyanarayanan Kuppuswamy, linux-efi, Brijesh Singh, kvm,
	dri-devel, platform-driver-x86, Paul Mackerras, linux-s390,
	Andi Kleen, Joerg Roedel, x86, amd-gfx, Christoph Hellwig,
	linux-graphics-maintainer, Tom Lendacky, Tianyu Lan, kexec,
	linux-kernel, iommu, linux-fsdevel, linuxppc-dev
In-Reply-To: <f8388f18-5e90-5d0f-d681-0b17f8307dd4@csgroup.eu>

On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
> Could you please provide a more explicit explanation of why inlining such a
> helper is considered bad practice and messy?

Tom already told you to look at the previous threads. Let's read them
together. This one, for example:

https://lore.kernel.org/lkml/YSScWvpXeVXw%2Fed5@infradead.org/

| > To take it out of line, I'm leaning towards the latter, creating a new
| > file that is built based on the ARCH_HAS_PROTECTED_GUEST setting.
| 
| Yes.  In general everytime architectures have to provide the prototype
| and not just the implementation of something we end up with a giant mess
| sooner or later.  In a few cases that is still warranted due to
| performance concerns, but i don't think that is the case here.

So I think what Christoph means here is that you want to have the
generic prototype defined in a header and arches get to implement it
exactly to the letter so that there's no mess.

As to what mess exactly, I'd let him explain that.

> Because, as demonstrated in my previous response some days ago, taking that
> out of line ends up with unnecessarily ugly generated code and we don't
> benefit from GCC's capability to fold in and optimise out unreachable code.

And this is real fast path where a couple of instructions matter or what?

set_memory_encrypted/_decrypted doesn't look like one to me.

> I can't see your point here. Inlining the function wouldn't add any
> ifdeffery as far as I can see.

If the function is touching defines etc, they all need to be visible.
If that function needs to call other functions - which is the case on
x86, perhaps not so much on power - then you need to either ifdef around
them or provide stubs with ifdeffery in the headers. And you need to
make them global functions instead of keeping them static to the same
compilation unit, etc, etc.

With a separate compilation unit, you don't need any of that and it is
all kept in that single file.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH v3 0/8] Implement generic cc_platform_has() helper function
From: Kuppuswamy, Sathyanarayanan @ 2021-09-15 17:26 UTC (permalink / raw)
  To: Borislav Petkov, Tom Lendacky
  Cc: linux-efi, Brijesh Singh, kvm, David Airlie, Dave Hansen,
	dri-devel, platform-driver-x86, Paul Mackerras, Will Deacon,
	Ard Biesheuvel, linux-s390, Andi Kleen, Baoquan He, Joerg Roedel,
	x86, amd-gfx, Christoph Hellwig, Christian Borntraeger,
	Ingo Molnar, linux-graphics-maintainer, Dave Young, Tianyu Lan,
	Thomas Zimmermann, Vasily Gorbik, Heiko Carstens,
	Maarten Lankhorst, Maxime Ripard, Andy Lutomirski,
	Thomas Gleixner, Peter Zijlstra, kexec, linux-kernel, iommu,
	Daniel Vetter, linux-fsdevel, linuxppc-dev
In-Reply-To: <YUIjS6lKEY5AadZx@zn.tnic>



On 9/15/21 9:46 AM, Borislav Petkov wrote:
> Sathya,
> 
> if you want to prepare the Intel variant intel_cc_platform_has() ontop
> of those and send it to me, that would be good because then I can
> integrate it all in one branch which can be used to base future work
> ontop.

I have an Intel variant patch (please check the following patch), but it
includes TDX changes as well. Shall I move the TDX changes to a different
patch and just create a separate patch for adding intel_cc_platform_has()?


commit fc5f98a0ed94629d903827c5b44ee9295f835831
Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Date:   Wed May 12 11:35:13 2021 -0700

     x86/tdx: Add confidential guest support for TDX guest

     TDX architecture provides a way for VM guests to be highly secure and
     isolated (from untrusted VMM). To achieve this requirement, any data
     coming from VMM cannot be completely trusted. TDX guest fixes this
     issue by hardening the IO drivers against the attack from the VMM.
     So, when adding hardening fixes to the generic drivers, to protect
     custom fixes use cc_platform_has() API.

     Also add TDX guest support to cc_platform_has() API to protect the
     TDX specific fixes.

     Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a5b14de03458..2e78358923a1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -871,6 +871,7 @@ config INTEL_TDX_GUEST
         depends on SECURITY
         select X86_X2APIC
         select SECURITY_LOCKDOWN_LSM
+       select ARCH_HAS_CC_PLATFORM
         help
           Provide support for running in a trusted domain on Intel processors
           equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/intel_cc_platform.h b/arch/x86/include/asm/intel_cc_platform.h
new file mode 100644
index 000000000000..472c3174beac
--- /dev/null
+++ b/arch/x86/include/asm/intel_cc_platform.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021 Intel Corporation */
+#ifndef _ASM_X86_INTEL_CC_PLATFORM_H
+#define _ASM_X86_INTEL_CC_PLATFORM_H
+
+#if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_ARCH_HAS_CC_PLATFORM)
+bool intel_cc_platform_has(unsigned int flag);
+#else
+static inline bool intel_cc_platform_has(unsigned int flag) { return false; }
+#endif
+
+#endif /* _ASM_X86_INTEL_CC_PLATFORM_H */
+
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 3c9bacd3c3f3..e83bc2f48efe 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -10,11 +10,16 @@
  #include <linux/export.h>
  #include <linux/cc_platform.h>
  #include <linux/mem_encrypt.h>
+#include <linux/processor.h>
+
+#include <asm/intel_cc_platform.h>

  bool cc_platform_has(enum cc_attr attr)
  {
         if (sme_me_mask)
                 return amd_cc_platform_has(attr);
+       else if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+               return intel_cc_platform_has(attr);

         return false;
  }
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..ab486a3b1eb0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -11,6 +11,7 @@
  #include <linux/init.h>
  #include <linux/uaccess.h>
  #include <linux/delay.h>
+#include <linux/cc_platform.h>

  #include <asm/cpufeature.h>
  #include <asm/msr.h>
@@ -60,6 +61,21 @@ static u64 msr_test_ctrl_cache __ro_after_init;
   */
  static bool cpu_model_supports_sld __ro_after_init;

+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+bool intel_cc_platform_has(enum cc_attr attr)
+{
+       switch (attr) {
+       case CC_ATTR_GUEST_TDX:
+               return cpu_feature_enabled(X86_FEATURE_TDX_GUEST);
+       default:
+               return false;
+       }
+
+       return false;
+}
+EXPORT_SYMBOL_GPL(intel_cc_platform_has);
+#endif
+
  /*
   * Processors which have self-snooping capability can handle conflicting
   * memory type across CPUs by snooping its own cache. However, there exists
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 253f3ea66cd8..e38430e6e396 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -61,6 +61,15 @@ enum cc_attr {
          * Examples include SEV-ES.
          */
         CC_ATTR_GUEST_STATE_ENCRYPT,
+
+       /**
+        * @CC_ATTR_GUEST_TDX: Trusted Domain Extension Support
+        *
+        * The platform/OS is running as a TDX guest/virtual machine.
+        *
+        * Examples include SEV-ES.
+        */
+       CC_ATTR_GUEST_TDX,
  };

  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related

* Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
From: Christophe Leroy @ 2021-09-15 17:18 UTC (permalink / raw)
  To: Borislav Petkov, Michael Ellerman
  Cc: linux-s390, Sathyanarayanan Kuppuswamy, linux-efi, Brijesh Singh,
	kvm, Tom Lendacky, Tianyu Lan, Joerg Roedel, x86, kexec,
	linux-kernel, dri-devel, platform-driver-x86, Christoph Hellwig,
	iommu, Andi Kleen, Paul Mackerras, amd-gfx, linux-fsdevel,
	linux-graphics-maintainer, linuxppc-dev
In-Reply-To: <YUHGDbtiGrDz5+NS@zn.tnic>

Le 15/09/2021 à 12:08, Borislav Petkov a écrit :
> On Wed, Sep 15, 2021 at 10:28:59AM +1000, Michael Ellerman wrote:
>> I don't love it, a new C file and an out-of-line call to then call back
>> to a static inline that for most configuration will return false ... but
>> whatever :)
> 
> Yeah, hch thinks it'll cause a big mess otherwise:
> 
> https://lore.kernel.org/lkml/YSScWvpXeVXw%2Fed5@infradead.org/

Could you please provide a more explicit explanation of why inlining such
a helper is considered bad practice and messy?

Because, as demonstrated in my previous response some days ago, taking
that out of line ends up with unnecessarily ugly generated code and we
don't benefit from GCC's capability to fold in and optimise out
unreachable code.

As pointed out by Michael, in most cases the function will just return
false, so beyond the performance concern there is also the code size and
code coverage topic to take into account. And even when the function
doesn't return false, the only thing it does folds into a single powerpc
instruction, so there is really no point in making a dedicated
out-of-line function for that, suffering the cost and size of a function
call, and having to justify the addition of a dedicated C file.

> 
> I guess less ifdeffery is nice too.

I can't see your point here. Inlining the function wouldn't add any 
ifdeffery as far as I can see.

So, would you mind reconsidering your approach and allowing architectures
to provide an inline implementation by just not enforcing a generic
prototype? Or otherwise provide more details and examples of why the
cons outweigh the pros?

Thanks
Christophe

^ permalink raw reply

* Re: [PATCH v3 0/8] Implement generic cc_platform_has() helper function
From: Borislav Petkov @ 2021-09-15 16:46 UTC (permalink / raw)
  To: Tom Lendacky, Sathyanarayanan Kuppuswamy
  Cc: linux-efi, Brijesh Singh, kvm, David Airlie, Dave Hansen,
	dri-devel, platform-driver-x86, Paul Mackerras, Will Deacon,
	Ard Biesheuvel, linux-s390, Andi Kleen, Baoquan He, Joerg Roedel,
	x86, amd-gfx, Christoph Hellwig, Christian Borntraeger,
	Ingo Molnar, linux-graphics-maintainer, Dave Young, Tianyu Lan,
	Thomas Zimmermann, Vasily Gorbik, Heiko Carstens,
	Maarten Lankhorst, Maxime Ripard, Andy Lutomirski,
	Thomas Gleixner, Peter Zijlstra, kexec, linux-kernel, iommu,
	Daniel Vetter, linux-fsdevel, linuxppc-dev
In-Reply-To: <cover.1631141919.git.thomas.lendacky@amd.com>

On Wed, Sep 08, 2021 at 05:58:31PM -0500, Tom Lendacky wrote:
> This patch series provides a generic helper function, cc_platform_has(),
> to replace the sme_active(), sev_active(), sev_es_active() and
> mem_encrypt_active() functions.
> 
> It is expected that as new confidential computing technologies are
> added to the kernel, they can all be covered by a single function call
> instead of a collection of specific function calls all called from the
> same locations.
> 
> The powerpc and s390 patches have been compile tested only. Can the
> folks copied on this series verify that nothing breaks for them. Also,
> a new file, arch/powerpc/platforms/pseries/cc_platform.c, has been
> created for powerpc to hold the out of line function.

...

> 
> Tom Lendacky (8):
>   x86/ioremap: Selectively build arch override encryption functions
>   mm: Introduce a function to check for confidential computing features
>   x86/sev: Add an x86 version of cc_platform_has()
>   powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
>   x86/sme: Replace occurrences of sme_active() with cc_platform_has()
>   x86/sev: Replace occurrences of sev_active() with cc_platform_has()
>   x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
>   treewide: Replace the use of mem_encrypt_active() with
>     cc_platform_has()

Ok, modulo the minor things the plan is to take this through tip after
-rc2 releases in order to pick up the powerpc build fix and have a clean
base (-rc2) to base stuff on, at the same time.

Pls holler if something's still amiss.

Sathya,

if you want to prepare the Intel variant intel_cc_platform_has() ontop
of those and send it to me, that would be good because then I can
integrate it all in one branch which can be used to base future work
ontop.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
From: Vincent Guittot @ 2021-09-15 15:43 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Juri Lelli, Aubrey Li, Srikar Dronamraju, Ravi V. Shankar,
	Peter Zijlstra (Intel), Ricardo Neri, Ben Segall,
	Srinivas Pandruvada, Joel Fernandes (Google), Ingo Molnar,
	Rafael J . Wysocki, Steven Rostedt, Mel Gorman, Len Brown,
	Nicholas Piggin, Aubrey Li, Dietmar Eggemann, Tim Chen,
	Quentin Perret, Daniel Bristot de Oliveira, linux-kernel,
	linuxppc-dev
In-Reply-To: <20210911011819.12184-7-ricardo.neri-calderon@linux.intel.com>

On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li <aubrey.li@intel.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Quentin Perret <qperret@google.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v4:
>   * Use sg_lb_stats::sum_nr_running the idle state of a scheduling group.
>     (Vincent, Peter)
>   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
>   * Updated function documentation and corrected a typo.
>
> Changes since v3:
>   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
>     powerpc folks showed that this patch should not impact them. Also, more
>     recent powerpc processor no longer use asym_packing. (PeterZ)
>   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
>   * Removed unnecessary check for local CPUs when the local group has zero
>     utilization. (Joel)
>   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
>     the fact that it deals with SMT cases.
>   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
>     that callers can deal with non-SMT cases.
>
> Changes since v2:
>   * Reworded the commit message to reflect updates in code.
>   * Corrected misrepresentation of dst_cpu as the CPU doing the load
>     balancing. (PeterZ)
>   * Removed call to arch_asym_check_smt_siblings() as it is now called in
>     sched_asym().
>
> Changes since v1:
>   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
>     tasks. Instead, reclassify the candidate busiest group, as it
>     may still be selected. (PeterZ)
>   * Avoid an expensive and unnecessary call to cpumask_weight() when
>     determining if a sched_group is comprised of SMT siblings.
>     (PeterZ).
> ---
>  kernel/sched/fair.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26db017c14a3..8d763dd0174b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int imbalance_pct,
>         return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu:   Destination CPU of the load balancing
> + * @sds:       Load-balancing data with statistics of the local group
> + * @sgs:       Load-balancing statistics of the candidate busiest group
> + * @sg:                The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks.
> + *
> + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
> + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
> + * only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
> + * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
> + * Bigger imbalances in the number of busy CPUs will be dealt with in
> + * update_sd_pick_busiest().
> + *
> + * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
> + * of @dst_cpu are idle and @sg has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +                                   struct sg_lb_stats *sgs,
> +                                   struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +       bool local_is_smt, sg_is_smt;
> +       int sg_busy_cpus;
> +
> +       local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +       sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +       sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +       if (!local_is_smt) {
> +               /*
> +                * If we are here, @dst_cpu is idle and does not have SMT
> +                * siblings. Pull tasks if candidate group has two or more
> +                * busy CPUs.
> +                */
> +               if (sg_is_smt && sg_busy_cpus >= 2)

Do you really need to test sg_is_smt? If sg_busy_cpus >= 2 then
sg_is_smt must be true?

Also, this is the default behavior, where we want to even out the number
of busy CPUs. Shouldn't you return false and fall back to the default
behavior?

That being said, the default behavior tries to even out the number of
idle CPUs, which is easier to compute and is equivalent to evening out
the number of busy CPUs in a "normal" system with the same number of
CPUs in each group, but that is not the case here. It could be good to
change the default behavior to even out the number of busy CPUs, so that
you can use the default behavior here. An additional condition would
then be used to select the busiest group, such as more busy CPUs or a
higher number of running tasks.

> +                       return true;
> +
> +               /*
> +                * @dst_cpu does not have SMT siblings. @sg may have SMT
> +                * siblings and only one is busy. In such case, @dst_cpu
> +                * can help if it has higher priority and is idle (i.e.,
> +                * it has no running tasks).

The previous comment above assumes that "@dst_cpu is idle", but now you
need to check that sds->local_stat.sum_nr_running == 0.

> +                */
> +               return !sds->local_stat.sum_nr_running &&
> +                      sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> +       }
> +
> +       /* @dst_cpu has SMT siblings. */
> +
> +       if (sg_is_smt) {
> +               int local_busy_cpus = sds->local->group_weight -
> +                                     sds->local_stat.idle_cpus;
> +               int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> +
> +               if (busy_cpus_delta == 1)
> +                       return sched_asym_prefer(dst_cpu,
> +                                                sg->asym_prefer_cpu);
> +
> +               return false;
> +       }
> +
> +       /*
> +        * @sg does not have SMT siblings. Ensure that @sds::local does not end
> +        * up with more than one busy SMT sibling and only pull tasks if there
> +        * are not busy CPUs (i.e., no CPU has running tasks).
> +        */
> +       if (!sds->local_stat.sum_nr_running)
> +               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> +
> +       return false;
> +#else
> +       /* Always return false so that callers deal with non-SMT cases. */
> +       return false;
> +#endif
> +}
> +
>  static inline bool
>  sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
>            struct sched_group *group)
>  {
> +       /* Only do SMT checks if either local or candidate have SMT siblings */
> +       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> +           (group->flags & SD_SHARE_CPUCAPACITY))
> +               return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
>         return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>  }
>
> @@ -9606,6 +9694,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                     nr_running == 1)
>                         continue;
>
> +               /* Make sure we only pull tasks from a CPU of lower priority */
> +               if ((env->sd->flags & SD_ASYM_PACKING) &&
> +                   sched_asym_prefer(i, env->dst_cpu) &&
> +                   nr_running == 1)
> +                       continue;
> +
>                 switch (env->migration_type) {
>                 case migrate_load:
>                         /*
> --
> 2.17.1
>
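
To make the cases under discussion easier to follow, here is a userspace
model of the decision logic as it reads in the quoted patch. This is a
sketch under assumptions: struct group_model and can_pull() are
hypothetical names, and the integer "prio" comparison merely stands in
for sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened view of the statistics the patch looks at. */
struct group_model {
	bool is_smt;	/* group has SD_SHARE_CPUCAPACITY set */
	int weight;	/* number of CPUs in the group */
	int idle_cpus;	/* number of idle CPUs in the group */
	int nr_running;	/* models sum_nr_running */
	int prio;	/* higher wins; stands in for sched_asym_prefer() */
};

/* Mirrors the branches of asym_smt_can_pull_tasks() in the quoted patch. */
static bool can_pull(const struct group_model *local,
		     const struct group_model *sg)
{
	int sg_busy = sg->weight - sg->idle_cpus;
	bool dst_preferred = local->prio > sg->prio;

	if (!local->is_smt) {
		/* Two or more busy SMT siblings in @sg: always help. */
		if (sg->is_smt && sg_busy >= 2)
			return true;

		/* Otherwise help only if idle and of higher priority. */
		return local->nr_running == 0 && dst_preferred;
	}

	if (sg->is_smt) {
		int local_busy = local->weight - local->idle_cpus;

		/* Exactly one more busy CPU in @sg: priority decides. */
		if (sg_busy - local_busy == 1)
			return dst_preferred;

		return false;
	}

	/* @sg is not SMT: pull only if the whole local core is idle. */
	return local->nr_running == 0 && dst_preferred;
}
```

In this model the first branch is the only one that ignores priority,
which is the case the "fall back to the default behavior" question is
about.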

^ permalink raw reply

* [PATCH] powerpc: warn on emulation of dcbz instruction
From: Christophe Leroy @ 2021-09-15 14:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: Stan Johnson, Finn Thain, linuxppc-dev, linux-kernel

The dcbz instruction shouldn't be used on non-cached memory. Using
it on non-cached memory can result in an alignment exception, which
implies heavy handling.

Instead of silently emulating the instruction and incurring a high
performance degradation, warn whenever an alignment exception is
taken due to dcbz, so that the user is made aware that the dcbz
instruction has been used unexpectedly.

Reported-by: Stan Johnson <userm57@yahoo.com>
Cc: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/align.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index bbb4181621dd..adc3a4a9c6e4 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
 		if (op.type != CACHEOP + DCBZ)
 			return -EINVAL;
 		PPC_WARN_ALIGNMENT(dcbz, regs);
+		WARN_ON_ONCE(1);
 		r = emulate_dcbz(op.ea, regs);
 	} else {
 		if (type == LARX || type == STCX)
-- 
2.31.1


^ permalink raw reply related

* [PATCH] powerpc/32s: Fix kuap_kernel_restore()
From: Christophe Leroy @ 2021-09-15 14:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: Stan Johnson, Finn Thain, linuxppc-dev, linux-kernel

At interrupt exit, kuap_kernel_restore() calls kuap_unlock() with the
value contained in regs->kuap. However, when regs->kuap contains
0xffffffff it means that KUAP was not unlocked, so calling
kuap_unlock() is irrelevant and jeopardises the contents
of the kernel space segment registers.

So check that regs->kuap doesn't contain KUAP_NONE before calling
kuap_unlock(). At the same time, it also means that if KUAP has not
been correctly locked back at interrupt exit, it must be locked
before continuing. This is done by checking the content of
current->thread.kuap, which was returned by kuap_get_and_assert_locked().
Fixes: 16132529cee5 ("powerpc/32s: Rework Kernel Userspace Access Protection")
Reported-by: Stan Johnson <userm57@yahoo.com>
Cc: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/book3s/32/kup.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h
index d4b145b279f6..9f38040f0641 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -136,6 +136,14 @@ static inline void kuap_kernel_restore(struct pt_regs *regs, unsigned long kuap)
 	if (kuap_is_disabled())
 		return;

+	if (unlikely(kuap != KUAP_NONE)) {
+		current->thread.kuap = KUAP_NONE;
+		kuap_lock(kuap, false);
+	}
+
+	if (likely(regs->kuap == KUAP_NONE))
+		return;
+
 	current->thread.kuap = regs->kuap;

 	kuap_unlock(regs->kuap, false);
-- 
2.31.1
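
A user-space model of the control flow in the hunk above, for illustration only. The struct and its counters are hypothetical stand-ins for the segment-register updates done by kuap_lock() and kuap_unlock(); only KUAP_NONE and the ordering of the two checks come from the patch.

```c
/* All ones; 0xffffffff on the 32-bit target described in the commit message. */
#define KUAP_NONE (~0UL)

/* Hypothetical model state: not a kernel structure. */
struct kuap_model {
	unsigned long thread_kuap;	/* models current->thread.kuap */
	int locks;			/* number of modelled kuap_lock() calls */
	int unlocks;			/* number of modelled kuap_unlock() calls */
};

/*
 * regs_kuap: state saved at interrupt entry (regs->kuap).
 * kuap:      state observed at interrupt exit.
 */
static void kuap_kernel_restore_model(struct kuap_model *m,
				      unsigned long regs_kuap,
				      unsigned long kuap)
{
	/* KUAP was left unlocked by the handler: lock it back first. */
	if (kuap != KUAP_NONE) {
		m->thread_kuap = KUAP_NONE;
		m->locks++;
	}

	/* The interrupted context had nothing unlocked: nothing to restore. */
	if (regs_kuap == KUAP_NONE)
		return;

	/* Restore the window the interrupted context had open. */
	m->thread_kuap = regs_kuap;
	m->unlocks++;
}
```

In the common case both values are KUAP_NONE and the model touches nothing, which corresponds to the likely() early return in the patch.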

^ permalink raw reply related

* Re: [PATCH] swiotlb: set IO TLB segment size via cmdline
From: Christoph Hellwig @ 2021-09-15 13:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roman Skakun, linux-doc, Peter Zijlstra, Viresh Kumar,
	Linux Kernel Mailing List, Paul Mackerras, Will Deacon,
	Boris Ostrovsky, Marek Szyprowski, Stefano Stabellini,
	Jonathan Corbet, Christoph Hellwig, xen-devel, Paul E. McKenney,
	Konrad Rzeszutek Wilk, Muchun Song, Thomas Gleixner,
	Juergen Gross, Thomas Bogendoerfer, Andrii Anisov, linuxppc-dev,
	Randy Dunlap, linux-mips, iommu, Roman Skakun, Andrew Morton,
	Lu Baolu, Robin Murphy, Mike Rapoport, Maciej W. Rozycki
In-Reply-To: <84ef7ff7-2c9c-113a-4a2c-cef54a6ded51@suse.com>

On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
> But the question remains: Why does the framebuffer need to be mapped
> in a single giant chunk?

More importantly: if you use dynamic dma mappings for your framebuffer
you're doing something wrong.

^ permalink raw reply

* Re: [PATCH] swiotlb: set IO TLB segment size via cmdline
From: Jan Beulich @ 2021-09-15 13:49 UTC (permalink / raw)
  To: Roman Skakun
  Cc: linux-doc, Peter Zijlstra, Viresh Kumar,
	Linux Kernel Mailing List, Paul Mackerras, Will Deacon,
	Boris Ostrovsky, Marek Szyprowski, Stefano Stabellini,
	Jonathan Corbet, Christoph Hellwig, xen-devel, Paul E. McKenney,
	Konrad Rzeszutek Wilk, Muchun Song, Thomas Gleixner,
	Juergen Gross, Thomas Bogendoerfer, Andrii Anisov, linuxppc-dev,
	Randy Dunlap, linux-mips, iommu, Roman Skakun, Andrew Morton,
	Lu Baolu, Robin Murphy, Mike Rapoport, Maciej W. Rozycki
In-Reply-To: <CADu_u-Ou08tMFm5xU871ae8ct+2YOuvn4rQ=83CMTbg2bx87Pg@mail.gmail.com>

On 15.09.2021 15:37, Roman Skakun wrote:
>>> From: Roman Skakun <roman_skakun@epam.com>
>>>
>>> It is possible when default IO TLB size is not
>>> enough to fit a long buffers as described here [1].
>>>
>>> This patch makes a way to set this parameter
>>> using cmdline instead of recompiling a kernel.
>>>
>>> [1] https://www.xilinx.com/support/answers/72694.html
>>
>>  I'm not convinced the swiotlb use describe there falls under "intended
>>  use" - mapping a 1280x720 framebuffer in a single chunk?
> 
> I had the same issue while mapping a DMA chunk of ~4MB for a GEM fb when
> using xen vdispl.
> I got the following log:
> [ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> 3686400 bytes), total 32768 (slots), used 32 (slots)
> 
> It happened when I tried to map a large bounce buffer.
> The default size is 128 (IO_TLB_SEGSIZE) * 2048 (1 << IO_TLB_SHIFT) = 262144
> bytes, but we requested 3686400 bytes.
> When I change IO_TLB_SEGSIZE to 2048, the limit becomes 2048 (IO_TLB_SEGSIZE) *
> 2048 (1 << IO_TLB_SHIFT) = 4194304 bytes,
> which makes it possible to get a bounce buffer of the requested size.
> After changing this value, the problem is gone.

But the question remains: Why does the framebuffer need to be mapped
in a single giant chunk?

>>  In order to be sure to catch all uses like this one (including ones
>>  which make it upstream in parallel to yours), I think you will want
>>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
> 
> I don't understand your point. Can you clarify this?

There's a concrete present example: I have a patch pending adding
another use of IO_TLB_SEGSIZE. This use would need to be replaced
like you do here in several places. The need for the additional
replacement would be quite obvious (from a build failure) if you
renamed the manifest constant. Without renaming, it'll take
someone running into an issue on a live system, which I consider
far worse. This is because a simple re-basing of one of the
patches on top of the other will not point out the need for the
extra replacement, nor would a test build (with both patches in
place).
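
The effect of the rename can be sketched in a few lines of user-space C. This is an illustration under assumptions: io_tlb_seg_size and swiotlb_io_seg_size() are named after the quoted patch, but their bodies here are a model, and align_nslabs() is a hypothetical caller.

```c
/* After the suggested rename, only the default keeps a compile-time name. */
#define IO_TLB_DEFAULT_SEGSIZE 128UL

/* Model of the runtime value that the command line may override. */
static unsigned long io_tlb_seg_size = IO_TLB_DEFAULT_SEGSIZE;

static unsigned long swiotlb_io_seg_size(void)
{
	return io_tlb_seg_size;
}

/* Hypothetical caller that has been converted to the runtime accessor. */
static unsigned long align_nslabs(unsigned long nslabs)
{
	unsigned long seg = swiotlb_io_seg_size();

	return (nslabs + seg - 1) / seg * seg;
}

/*
 * A patch written against the old name, for example
 *	nslabs = ALIGN(nslabs, IO_TLB_SEGSIZE);
 * now fails to build because IO_TLB_SEGSIZE is undefined, so the
 * missing conversion is caught at compile time instead of at run time.
 */
#ifdef IO_TLB_SEGSIZE
#error "stale use of IO_TLB_SEGSIZE"
#endif
```

Without the rename, the stale use would still compile and would silently keep aligning to 128 even after the segment size was changed at boot.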

Jan


^ permalink raw reply

* Re: [PATCH] swiotlb: set IO TLB segment size via cmdline
From: Roman Skakun @ 2021-09-15 13:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roman Skakun, linux-doc, Peter Zijlstra, Viresh Kumar,
	Linux Kernel Mailing List, Paul Mackerras, Will Deacon,
	Boris Ostrovsky, Marek Szyprowski, Stefano Stabellini,
	Jonathan Corbet, Christoph Hellwig, xen-devel, Paul E. McKenney,
	Konrad Rzeszutek Wilk, Muchun Song, Thomas Gleixner,
	Juergen Gross, Thomas Bogendoerfer, Andrii Anisov, linuxppc-dev,
	Randy Dunlap, linux-mips, iommu, Roman Skakun, Andrew Morton,
	Lu Baolu, Robin Murphy, Mike Rapoport, Maciej W. Rozycki
In-Reply-To: <7c04db79-7de1-93ff-0908-9bad60a287b9@suse.com>

Hi Jan,

Thanks for the answer.

>> From: Roman Skakun <roman_skakun@epam.com>
>>
>> It is possible when default IO TLB size is not
>> enough to fit a long buffers as described here [1].
>>
>> This patch makes a way to set this parameter
>> using cmdline instead of recompiling a kernel.
>>
>> [1] https://www.xilinx.com/support/answers/72694.html
>
>  I'm not convinced the swiotlb use describe there falls under "intended
>  use" - mapping a 1280x720 framebuffer in a single chunk?

I had the same issue while mapping a DMA chunk of ~4MB for a GEM fb when
using xen vdispl.
I got the following log:
[ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 32768 (slots), used 32 (slots)

It happened when I tried to map a large bounce buffer.
The default size is 128 (IO_TLB_SEGSIZE) * 2048 (1 << IO_TLB_SHIFT) = 262144
bytes, but we requested 3686400 bytes.
When I change IO_TLB_SEGSIZE to 2048, the limit becomes 2048 (IO_TLB_SEGSIZE) *
2048 (1 << IO_TLB_SHIFT) = 4194304 bytes,
which makes it possible to get a bounce buffer of the requested size.
After changing this value, the problem is gone.
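
The arithmetic above can be checked with a small user-space sketch. The constants mirror the kernel's IO_TLB_SHIFT (11, i.e. 2048-byte slots); the macro and helper names below are hypothetical and are not kernel API.

```c
#include <stddef.h>

#define SLOT_SHIFT 11u				/* same value as IO_TLB_SHIFT */
#define SLOT_SIZE  ((size_t)1 << SLOT_SHIFT)	/* 2048 bytes per slot */

/* Largest contiguous bounce buffer for a segment of seg_slots slots. */
static size_t max_contig_bytes(size_t seg_slots)
{
	return seg_slots * SLOT_SIZE;
}

/* Number of slots a request of the given size occupies (rounded up). */
static size_t slots_needed(size_t bytes)
{
	return (bytes + SLOT_SIZE - 1) >> SLOT_SHIFT;
}

/* Non-zero if the request fits into a single segment. */
static int fits_in_segment(size_t bytes, size_t seg_slots)
{
	return slots_needed(bytes) <= seg_slots;
}
```

The 3686400-byte request from the log needs 1800 slots, so it cannot fit in a 128-slot segment but does fit once the segment size is 2048 slots.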

>  the bottom of this page is also confusing, as following "Then we can
>  confirm the modified swiotlb size in the boot log:" there is a log
>  fragment showing the same original size of 64Mb.

I suspect this is a mistake in the article.
According to https://elixir.bootlin.com/linux/v5.14.4/source/kernel/dma/swiotlb.c#L214
and
https://elixir.bootlin.com/linux/v5.15-rc1/source/kernel/dma/swiotlb.c#L182
IO_TLB_SEGSIZE is not used to calculate the total size of the swiotlb area.
This explains why the total size is the same before and after changing the
TLB segment size.

>  In order to be sure to catch all uses like this one (including ones
>  which make it upstream in parallel to yours), I think you will want
>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.

I don't understand your point. Can you clarify this?

>> +     /* get max IO TLB segment size */
>> +     if (isdigit(*str)) {
>> +             tmp = simple_strtoul(str, &str, 0);
>> +             if (tmp)
>> +                     io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
>
> From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
> aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.

Yes, right, thanks!
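
To illustrate the point with a user-space sketch: aligning to a multiple of 128 can produce values such as 384, which is not a power of two. ALIGN_UP below is a simplified stand-in for the kernel's ALIGN(), and is_pow2() and round_up_pow2() are hypothetical helpers (in the kernel, roundup_pow_of_two() from <linux/log2.h> serves this purpose).

```c
#include <assert.h>
#include <limits.h>

#define SEGSIZE 128UL	/* same value as IO_TLB_SEGSIZE */

/* Simplified stand-in for the kernel's ALIGN(); a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

static int is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Smallest power of two that is >= n. */
static unsigned long round_up_pow2(unsigned long n)
{
	unsigned long p = 1;

	/* Reject 0 and values whose result would not fit in unsigned long. */
	assert(n != 0 && n <= (ULONG_MAX >> 1) + 1);
	while (p < n)
		p <<= 1;
	return p;
}
```

For an input of 300, ALIGN_UP(300UL, SEGSIZE) yields 384, while round_up_pow2(384) yields 512, which is the kind of value the segment size needs to be.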

Cheers,
Roman.

On Tue, Sep 14, 2021 at 18:29, Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.09.2021 17:10, Roman Skakun wrote:
> > From: Roman Skakun <roman_skakun@epam.com>
> >
> > It is possible when default IO TLB size is not
> > enough to fit a long buffers as described here [1].
> >
> > This patch makes a way to set this parameter
> > using cmdline instead of recompiling a kernel.
> >
> > [1] https://www.xilinx.com/support/answers/72694.html
>
> I'm not convinced the swiotlb use describe there falls under "intended
> use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> the bottom of this page is also confusing, as following "Then we can
> confirm the modified swiotlb size in the boot log:" there is a log
> fragment showing the same original size of 64Mb.
>
> > --- a/arch/mips/cavium-octeon/dma-octeon.c
> > +++ b/arch/mips/cavium-octeon/dma-octeon.c
> > @@ -237,7 +237,7 @@ void __init plat_swiotlb_setup(void)
> >               swiotlbsize = 64 * (1<<20);
> >  #endif
> >       swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
> > -     swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
> > +     swiotlb_nslabs = ALIGN(swiotlb_nslabs, swiotlb_io_seg_size());
>
> In order to be sure to catch all uses like this one (including ones
> which make it upstream in parallel to yours), I think you will want
> to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
>
> > @@ -81,15 +86,30 @@ static unsigned int max_segment;
> >  static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> >
> >  static int __init
> > -setup_io_tlb_npages(char *str)
> > +setup_io_tlb_params(char *str)
> >  {
> > +     unsigned long tmp;
> > +
> >       if (isdigit(*str)) {
> > -             /* avoid tail segment of size < IO_TLB_SEGSIZE */
> > -             default_nslabs =
> > -                     ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
> > +             default_nslabs = simple_strtoul(str, &str, 0);
> >       }
> >       if (*str == ',')
> >               ++str;
> > +
> > +     /* get max IO TLB segment size */
> > +     if (isdigit(*str)) {
> > +             tmp = simple_strtoul(str, &str, 0);
> > +             if (tmp)
> > +                     io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
>
> From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
> aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.
>
> Jan
>


-- 
Best Regards, Roman.

^ permalink raw reply
