LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: PowerNV PCI & SR-IOV cleanups
From: Christoph Hellwig @ 2020-07-10  6:45 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linux-pci, linuxppc-dev
In-Reply-To: <20200710052340.737567-1-oohall@gmail.com>

On Fri, Jul 10, 2020 at 03:23:25PM +1000, Oliver O'Halloran wrote:
> This is largely prep work for supporting VFs in the 32bit MMIO window.
> This is an unfortunate necessity due to how the Linux BAR allocator
> handles BARs marked as non-prefetchable. The distinction
> between prefetch and non-prefetchable BARs was made largely irrelevant
> with the introduction of PCIe, but the BAR allocator is overly
> conservative. It will always place non-pref bars in the prefetchable
> window, which is 32bit only. This results in us being unable to use VFs
> from NVMe drives and a few different RAID cards.

How about fixing that in the core PCI code?

(nothing against this series through, as it seems like a massive
cleanup)

^ permalink raw reply

* Re: [mm/debug_vm_pgtable] a97a171093: BUG:unable_to_handle_page_fault_for_address
From: Anshuman Khandual @ 2020-07-10  6:47 UTC (permalink / raw)
  To: kernel test robot
  Cc: Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	linux-riscv, Will Deacon, linux-arch, linux-s390, x86,
	Mike Rapoport, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, linux-snps-arc, Vasily Gorbik, lkp,
	Borislav Petkov, Paul Walmsley, Kirill A . Shutemov,
	Thomas Gleixner, linux-arm-kernel, Vineet Gupta, linux-kernel,
	Palmer Dabbelt, Andrew Morton, linuxppc-dev
In-Reply-To: <20200709061122.GN3874@shao2-debian>



On 07/09/2020 11:41 AM, kernel test robot wrote:
> [   94.349598] BUG: unable to handle page fault for address: ffffed10a7ffddff
> [   94.351039] #PF: supervisor read access in kernel mode
> [   94.352172] #PF: error_code(0x0000) - not-present page
> [   94.353256] PGD 43ffed067 P4D 43ffed067 PUD 43fdee067 PMD 0 
> [   94.354484] Oops: 0000 [#1] SMP KASAN
> [   94.355238] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.8.0-rc4-00002-ga97a17109332c #1
> [   94.360456] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [   94.361950] RIP: 0010:hugetlb_advanced_tests+0x137/0x699
> [   94.363026] Code: 8b 13 4d 85 f6 75 0b 48 ff 05 2c e4 6a 01 31 ed eb 41 bf f8 ff ff ff ba ff ff 37 00 4c 01 f7 48 c1 e2 2a 48 89 f9 48 c1 e9 03 <80> 3c 11 00 74 05 e8 cd c0 67 fa ba f8 ff ff ff 49 8b 2c 16 48 85
> [   94.366592] RSP: 0000:ffffc90000047d30 EFLAGS: 00010a06
> [   94.367693] RAX: 1ffffffff1049b80 RBX: ffff888380525308 RCX: 1ffff110a7ffddff
> [   94.369215] RDX: dffffc0000000000 RSI: 1ffff11087ffdc00 RDI: ffff88853ffeeff8
> [   94.370693] RBP: 000000000018e510 R08: 0000000000000025 R09: 0000000000000001
> [   94.372165] R10: ffff888380523c07 R11: ffffed10700a4780 R12: ffff88843208e510
> [   94.373674] R13: 0000000000000025 R14: ffff88843ffef000 R15: 000031e01ae61000
> [   94.375147] FS:  0000000000000000(0000) GS:ffff8883a3800000(0000) knlGS:0000000000000000
> [   94.376883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   94.378051] CR2: ffffed10a7ffddff CR3: 0000000004e15000 CR4: 00000000000406a0
> [   94.379522] Call Trace:
> [   94.380073]  debug_vm_pgtable+0xd81/0x2029
> [   94.380871]  ? pmd_advanced_tests+0x621/0x621
> [   94.381819]  do_one_initcall+0x1eb/0xbd0
> [   94.382551]  ? trace_event_raw_event_initcall_finish+0x240/0x240
> [   94.383634]  ? rcu_read_lock_sched_held+0xb9/0x110
> [   94.388727]  ? rcu_read_lock_held+0xd0/0xd0
> [   94.389604]  ? __kasan_check_read+0x1d/0x30
> [   94.390485]  kernel_init_freeable+0x430/0x4f8
> [   94.391416]  ? rest_init+0x3f8/0x3f8
> [   94.392185]  kernel_init+0x14/0x1e8
> [   94.392918]  ret_from_fork+0x22/0x30
> [   94.393662] Modules linked in:
> [   94.394289] CR2: ffffed10a7ffddff
> [   94.395000] ---[ end trace 8ca5a1655dfb8c39 ]---

This bug is caused from here.

static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
        if (!mem_section)
                return NULL;
#endif
        if (!mem_section[SECTION_NR_TO_ROOT(nr)]) <-------- BUG
                return NULL;
        return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}

static inline struct mem_section *__pfn_to_section(unsigned long pfn)
{
        return __nr_to_section(pfn_to_section_nr(pfn));
}

#define __pfn_to_page(pfn)                              \
({      unsigned long __pfn = (pfn);                    \
        struct mem_section *__sec = __pfn_to_section(__pfn);    \
        __section_mem_map_addr(__sec) + __pfn;          \
})

which is called via hugetlb_advanced_tests().

paddr = (__pfn_to_phys(pfn) | RANDOM_ORVALUE) & PMD_MASK;
pte = pte_mkhuge(mk_pte(pfn_to_page(PHYS_PFN(paddr)), prot));

Primary reason being RANDOM_ORVALUE, which is added to the paddr before
being masked with PMD_MASK. This clobbers up the pfn value which cannot
be searched in relevant memory sections. This problem stays hidden on
other configs where pfn_to_page() does not go via memory section search.
Dropping off RANDOM_ORVALUE solves the problem. Probably, just wanted to
drop that off during V2 series (https://lkml.org/lkml/2020/4/8/997) but
dont remember why ended up keeping it again.

^ permalink raw reply

* Re: [PATCH v6 0/5] clean up redundant 'kvm_run' parameters
From: Tianjia Zhang @ 2020-07-10  7:32 UTC (permalink / raw)
  To: pbonzini, tsbogend, paulus, mpe, benh, borntraeger, frankja,
	david, cohuck, heiko.carstens, gor, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, x86, hpa,
	maz, james.morse, julien.thierry.kdev, suzuki.poulose,
	christoffer.dall, peterx, thuth, chenhuacai
  Cc: linux-s390, kvm, linux-mips, kvm-ppc, linux-kernel, linuxppc-dev,
	kvmarm, linux-arm-kernel
In-Reply-To: <20200623131418.31473-1-tianjia.zhang@linux.alibaba.com>

Hi Paolo,

Any opinion on this series patches? Can I help with this patchset ?

Thanks and best,
Tianjia

On 2020/6/23 21:14, Tianjia Zhang wrote:
> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
> structure. For historical reasons, many kvm-related function parameters
> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
> patch does a unified cleanup of these remaining redundant parameters.
> 
> This series of patches has completely cleaned the architecture of
> arm64, mips, ppc, and s390 (no such redundant code on x86). Due to
> the large number of modified codes, a separate patch is made for each
> platform. On the ppc platform, there is also a redundant structure
> pointer of 'kvm_run' in 'vcpu_arch', which has also been cleaned
> separately.
> 
> ---
> v6 changes:
>    Rearrange patch sets, only keep the unmerged patch.
>    rebase on mainline.
> 
> v5 change:
>    ppc: fix for review.
> 
> v4 change:
>    mips: fixes two errors in entry.c.
> 
> v3 change:
>    Keep the existing `vcpu->run` in the function body unchanged.
> 
> v2 change:
>    s390 retains the original variable name and minimizes modification.
> 
> Tianjia Zhang (5):
>    KVM: s390: clean up redundant 'kvm_run' parameters
>    KVM: arm64: clean up redundant 'kvm_run' parameters
>    KVM: PPC: clean up redundant kvm_run parameters in assembly
>    KVM: MIPS: clean up redundant 'kvm_run' parameters
>    KVM: MIPS: clean up redundant kvm_run parameters in assembly
> 
>   arch/arm64/include/asm/kvm_coproc.h   |  12 +--
>   arch/arm64/include/asm/kvm_host.h     |  11 +--
>   arch/arm64/include/asm/kvm_mmu.h      |   2 +-
>   arch/arm64/kvm/arm.c                  |   6 +-
>   arch/arm64/kvm/handle_exit.c          |  36 ++++----
>   arch/arm64/kvm/mmio.c                 |  11 +--
>   arch/arm64/kvm/mmu.c                  |   5 +-
>   arch/arm64/kvm/sys_regs.c             |  13 ++-
>   arch/mips/include/asm/kvm_host.h      |  32 ++------
>   arch/mips/kvm/emulate.c               |  59 +++++--------
>   arch/mips/kvm/entry.c                 |  21 ++---
>   arch/mips/kvm/mips.c                  |  14 ++--
>   arch/mips/kvm/trap_emul.c             | 114 +++++++++++---------------
>   arch/mips/kvm/vz.c                    |  26 +++---
>   arch/powerpc/include/asm/kvm_ppc.h    |   2 +-
>   arch/powerpc/kvm/book3s_interrupts.S  |  22 +++--
>   arch/powerpc/kvm/book3s_pr.c          |   9 +-
>   arch/powerpc/kvm/booke.c              |   9 +-
>   arch/powerpc/kvm/booke_interrupts.S   |   9 +-
>   arch/powerpc/kvm/bookehv_interrupts.S |  10 +--
>   arch/s390/kvm/kvm-s390.c              |  23 ++++--
>   21 files changed, 188 insertions(+), 258 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH v6 1/5] KVM: s390: clean up redundant 'kvm_run' parameters
From: Paolo Bonzini @ 2020-07-10  7:48 UTC (permalink / raw)
  To: Christian Borntraeger, Tianjia Zhang, tsbogend, paulus, mpe, benh,
	frankja, david, cohuck, heiko.carstens, gor,
	sean.j.christopherson, vkuznets, wanpengli, jmattson, joro, tglx,
	mingo, bp, x86, hpa, maz, james.morse, julien.thierry.kdev,
	suzuki.poulose, christoffer.dall, peterx, thuth, chenhuacai
  Cc: linux-s390, kvm, linux-mips, kvm-ppc, linux-kernel, linuxppc-dev,
	kvmarm, linux-arm-kernel
In-Reply-To: <c49f8814-c7ea-6884-91c5-3dcd40c6509f@de.ibm.com>

On 23/06/20 17:31, Christian Borntraeger wrote:
> 
> I have trouble seeing value in this particular patch. We add LOCs
> without providing any noticable benefit. All other patches in this series at
> least reduce the amount of code. So I would defer this to Paolo if he prefers
> to have this way across all architectures. 

Yes, it adds lines of code but they're just

+	struct kvm_run *kvm_run = vcpu->run;

You could avoid the LoC increase by repeating vcpu->run over and over,
but I think the code overall is clearer.

Paolo


^ permalink raw reply

* Re: [PATCH v2 02/12] powerpc/kexec_file: mark PPC64 specific code
From: Laurent Dufour @ 2020-07-10  7:52 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <159371966340.21555.11937124937064648539.stgit@hbathini.in.ibm.com>

Le 02/07/2020 à 21:54, Hari Bathini a écrit :
> Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
> specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
> rename purgatory/trampoline.S to purgatory/trampoline_64.S in the
> same spirit.

FWIW and despite a minor comment below,

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>

> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> ---
> 
> Changes in v2:
> * No changes.
> 
> 
>   arch/powerpc/include/asm/kexec.h       |   11 +++
>   arch/powerpc/kexec/Makefile            |    2 -
>   arch/powerpc/kexec/elf_64.c            |    7 +-
>   arch/powerpc/kexec/file_load.c         |   37 ++--------
>   arch/powerpc/kexec/file_load_64.c      |  108 ++++++++++++++++++++++++++++++
>   arch/powerpc/purgatory/Makefile        |    4 +
>   arch/powerpc/purgatory/trampoline.S    |  117 --------------------------------
>   arch/powerpc/purgatory/trampoline_64.S |  117 ++++++++++++++++++++++++++++++++
>   8 files changed, 248 insertions(+), 155 deletions(-)
>   create mode 100644 arch/powerpc/kexec/file_load_64.c
>   delete mode 100644 arch/powerpc/purgatory/trampoline.S
>   create mode 100644 arch/powerpc/purgatory/trampoline_64.S
> 
> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
> index c684768..7008ea1 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -114,8 +114,17 @@ int setup_purgatory(struct kimage *image, const void *slave_code,
>   		    unsigned long fdt_load_addr);
>   int setup_new_fdt(const struct kimage *image, void *fdt,
>   		  unsigned long initrd_load_addr, unsigned long initrd_len,
> -		  const char *cmdline);
> +		  const char *cmdline, int *node);
>   int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
> +
> +#ifdef CONFIG_PPC64
> +int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
> +			  const void *fdt, unsigned long kernel_load_addr,
> +			  unsigned long fdt_load_addr);
> +int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
> +			unsigned long initrd_load_addr,
> +			unsigned long initrd_len, const char *cmdline);
> +#endif /* CONFIG_PPC64 */
>   #endif /* CONFIG_KEXEC_FILE */
>   
>   #else /* !CONFIG_KEXEC_CORE */
> diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
> index 86380c6..67c3553 100644
> --- a/arch/powerpc/kexec/Makefile
> +++ b/arch/powerpc/kexec/Makefile
> @@ -7,7 +7,7 @@ obj-y				+= core.o crash.o core_$(BITS).o
>   
>   obj-$(CONFIG_PPC32)		+= relocate_32.o
>   
> -obj-$(CONFIG_KEXEC_FILE)	+= file_load.o elf_$(BITS).o
> +obj-$(CONFIG_KEXEC_FILE)	+= file_load.o file_load_$(BITS).o elf_$(BITS).o
>   
>   ifdef CONFIG_HAVE_IMA_KEXEC
>   ifdef CONFIG_IMA
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index 3072fd6..23ad04c 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -88,7 +88,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>   		goto out;
>   	}
>   
> -	ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);
> +	ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
> +				  initrd_len, cmdline);
>   	if (ret)
>   		goto out;
>   
> @@ -107,8 +108,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>   	pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
>   
>   	slave_code = elf_info.buffer + elf_info.proghdrs[0].p_offset;
> -	ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
> -			      fdt_load_addr);
> +	ret = setup_purgatory_ppc64(image, slave_code, fdt, kernel_load_addr,
> +				    fdt_load_addr);
>   	if (ret)
>   		pr_err("Error setting up the purgatory.\n");
>   
> diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
> index 143c917..99a2c4d 100644
> --- a/arch/powerpc/kexec/file_load.c
> +++ b/arch/powerpc/kexec/file_load.c
> @@ -1,6 +1,6 @@
>   // SPDX-License-Identifier: GPL-2.0-only
>   /*
> - * ppc64 code to implement the kexec_file_load syscall
> + * powerpc code to implement the kexec_file_load syscall
>    *
>    * Copyright (C) 2004  Adam Litke (agl@us.ibm.com)
>    * Copyright (C) 2004  IBM Corp.
> @@ -16,26 +16,10 @@
>   
>   #include <linux/slab.h>
>   #include <linux/kexec.h>
> -#include <linux/of_fdt.h>
>   #include <linux/libfdt.h>
>   #include <asm/ima.h>
>   
> -#define SLAVE_CODE_SIZE		256
> -
> -const struct kexec_file_ops * const kexec_file_loaders[] = {
> -	&kexec_elf64_ops,
> -	NULL
> -};
> -
> -int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> -				  unsigned long buf_len)
> -{
> -	/* We don't support crash kernels yet. */
> -	if (image->type == KEXEC_TYPE_CRASH)
> -		return -EOPNOTSUPP;
> -
> -	return kexec_image_probe_default(image, buf, buf_len);
> -}
> +#define SLAVE_CODE_SIZE		256	/* First 0x100 bytes */
>   
>   /**
>    * setup_purgatory - initialize the purgatory's global variables
> @@ -127,24 +111,17 @@ int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size)
>    * @initrd_len:		Size of the next initrd, or 0 if there will be none.
>    * @cmdline:		Command line for the next kernel, or NULL if there will
>    *			be none.
> + * @chosen_node:        Set this output parameter to chosen_node.

minor:   ^ seems to be "node"

>    *
>    * Return: 0 on success, or negative errno on error.
>    */
>   int setup_new_fdt(const struct kimage *image, void *fdt,
>   		  unsigned long initrd_load_addr, unsigned long initrd_len,
> -		  const char *cmdline)
> +		  const char *cmdline, int *node)
>   {
>   	int ret, chosen_node;
>   	const void *prop;
>   
> -	/* Remove memory reservation for the current device tree. */
> -	ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
> -				 fdt_totalsize(initial_boot_params));
> -	if (ret == 0)
> -		pr_debug("Removed old device tree reservation.\n");
> -	else if (ret != -ENOENT)
> -		return ret;
> -
>   	chosen_node = fdt_path_offset(fdt, "/chosen");
>   	if (chosen_node == -FDT_ERR_NOTFOUND) {
>   		chosen_node = fdt_add_subnode(fdt, fdt_path_offset(fdt, "/"),
> @@ -157,6 +134,8 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
>   		pr_err("Malformed device tree: error reading /chosen.\n");
>   		return -EINVAL;
>   	}
> +	if (node)
> +		*node = chosen_node;
>   
>   	/* Did we boot using an initrd? */
>   	prop = fdt_getprop(fdt, chosen_node, "linux,initrd-start", NULL);
> @@ -242,10 +221,6 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
>   		return ret;
>   	}
>   
> -	ret = fdt_setprop(fdt, chosen_node, "linux,booted-from-kexec", NULL, 0);
> -	if (ret)
> -		goto err;
> -
>   	return 0;
>   
>   err:
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> new file mode 100644
> index 0000000..e6bff960
> --- /dev/null
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * ppc64 code to implement the kexec_file_load syscall
> + *
> + * Copyright (C) 2004  Adam Litke (agl@us.ibm.com)
> + * Copyright (C) 2004  IBM Corp.
> + * Copyright (C) 2004,2005  Milton D Miller II, IBM Corporation
> + * Copyright (C) 2005  R Sharada (sharada@in.ibm.com)
> + * Copyright (C) 2006  Mohan Kumar M (mohan@in.ibm.com)
> + * Copyright (C) 2020  IBM Corporation
> + *
> + * Based on kexec-tools' kexec-ppc64.c, kexec-elf-rel-ppc64.c, fs2dt.c.
> + * Heavily modified for the kernel by
> + * Hari Bathini <hbathini@linux.ibm.com>.
> + */
> +
> +#include <linux/kexec.h>
> +#include <linux/of_fdt.h>
> +#include <linux/libfdt.h>
> +
> +const struct kexec_file_ops * const kexec_file_loaders[] = {
> +	&kexec_elf64_ops,
> +	NULL
> +};
> +
> +/**
> + * setup_purgatory_ppc64 - initialize PPC64 specific purgatory's global
> + *                         variables and call setup_purgatory() to initialize
> + *                         common global variable.
> + * @image:                 kexec image.
> + * @slave_code:            Slave code for the purgatory.
> + * @fdt:                   Flattened device tree for the next kernel.
> + * @kernel_load_addr:      Address where the kernel is loaded.
> + * @fdt_load_addr:         Address where the flattened device tree is loaded.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
> +			  const void *fdt, unsigned long kernel_load_addr,
> +			  unsigned long fdt_load_addr)
> +{
> +	int ret;
> +
> +	ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
> +			      fdt_load_addr);
> +	if (ret)
> +		pr_err("Failed to setup purgatory symbols");
> +	return ret;
> +}
> +
> +/**
> + * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel
> + *                       being loaded.
> + * @image:               kexec image being loaded.
> + * @fdt:                 Flattened device tree for the next kernel.
> + * @initrd_load_addr:    Address where the next initrd will be loaded.
> + * @initrd_len:          Size of the next initrd, or 0 if there will be none.
> + * @cmdline:             Command line for the next kernel, or NULL if there will
> + *                       be none.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
> +			unsigned long initrd_load_addr,
> +			unsigned long initrd_len, const char *cmdline)
> +{
> +	int chosen_node, ret;
> +
> +	/* Remove memory reservation for the current device tree. */
> +	ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
> +				 fdt_totalsize(initial_boot_params));
> +	if (ret == 0)
> +		pr_debug("Removed old device tree reservation.\n");
> +	else if (ret != -ENOENT) {
> +		pr_err("Failed to remove old device-tree reservation.\n");
> +		return ret;
> +	}
> +
> +	ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len,
> +			    cmdline, &chosen_node);
> +	if (ret)
> +		return ret;
> +
> +	ret = fdt_setprop(fdt, chosen_node, "linux,booted-from-kexec", NULL, 0);
> +	if (ret)
> +		pr_err("Failed to update device-tree with linux,booted-from-kexec\n");
> +
> +	return ret;
> +}
> +
> +/**
> + * arch_kexec_kernel_image_probe - Does additional handling needed to setup
> + *                                 kexec segments.
> + * @image:                         kexec image being loaded.
> + * @buf:                           Buffer pointing to elf data.
> + * @buf_len:                       Length of the buffer.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> +				  unsigned long buf_len)
> +{
> +	/* We don't support crash kernels yet. */
> +	if (image->type == KEXEC_TYPE_CRASH)
> +		return -EOPNOTSUPP;
> +
> +	return kexec_image_probe_default(image, buf, buf_len);
> +}
> diff --git a/arch/powerpc/purgatory/Makefile b/arch/powerpc/purgatory/Makefile
> index 7c6d8b1..348f5958 100644
> --- a/arch/powerpc/purgatory/Makefile
> +++ b/arch/powerpc/purgatory/Makefile
> @@ -2,11 +2,11 @@
>   
>   KASAN_SANITIZE := n
>   
> -targets += trampoline.o purgatory.ro kexec-purgatory.c
> +targets += trampoline_$(BITS).o purgatory.ro kexec-purgatory.c
>   
>   LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined
>   
> -$(obj)/purgatory.ro: $(obj)/trampoline.o FORCE
> +$(obj)/purgatory.ro: $(obj)/trampoline_$(BITS).o FORCE
>   		$(call if_changed,ld)
>   
>   quiet_cmd_bin2c = BIN2C   $@
> diff --git a/arch/powerpc/purgatory/trampoline.S b/arch/powerpc/purgatory/trampoline.S
> deleted file mode 100644
> index a5a83c3..0000000
> --- a/arch/powerpc/purgatory/trampoline.S
> +++ /dev/null
> @@ -1,117 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * kexec trampoline
> - *
> - * Based on code taken from kexec-tools and kexec-lite.
> - *
> - * Copyright (C) 2004 - 2005, Milton D Miller II, IBM Corporation
> - * Copyright (C) 2006, Mohan Kumar M, IBM Corporation
> - * Copyright (C) 2013, Anton Blanchard, IBM Corporation
> - */
> -
> -#include <asm/asm-compat.h>
> -
> -	.machine ppc64
> -	.balign 256
> -	.globl purgatory_start
> -purgatory_start:
> -	b	master
> -
> -	/* ABI: possible run_at_load flag at 0x5c */
> -	.org purgatory_start + 0x5c
> -	.globl run_at_load
> -run_at_load:
> -	.long 0
> -	.size run_at_load, . - run_at_load
> -
> -	/* ABI: slaves start at 60 with r3=phys */
> -	.org purgatory_start + 0x60
> -slave:
> -	b .
> -	/* ABI: end of copied region */
> -	.org purgatory_start + 0x100
> -	.size purgatory_start, . - purgatory_start
> -
> -/*
> - * The above 0x100 bytes at purgatory_start are replaced with the
> - * code from the kernel (or next stage) by setup_purgatory().
> - */
> -
> -master:
> -	or	%r1,%r1,%r1	/* low priority to let other threads catchup */
> -	isync
> -	mr	%r17,%r3	/* save cpu id to r17 */
> -	mr	%r15,%r4	/* save physical address in reg15 */
> -
> -	or	%r3,%r3,%r3	/* ok now to high priority, lets boot */
> -	lis	%r6,0x1
> -	mtctr	%r6		/* delay a bit for slaves to catch up */
> -	bdnz	.		/* before we overwrite 0-100 again */
> -
> -	bl	0f		/* Work out where we're running */
> -0:	mflr	%r18
> -
> -	/* load device-tree address */
> -	ld	%r3, (dt_offset - 0b)(%r18)
> -	mr	%r16,%r3	/* save dt address in reg16 */
> -	li	%r4,20
> -	LWZX_BE	%r6,%r3,%r4	/* fetch __be32 version number at byte 20 */
> -	cmpwi	%cr0,%r6,2	/* v2 or later? */
> -	blt	1f
> -	li	%r4,28
> -	STWX_BE	%r17,%r3,%r4	/* Store my cpu as __be32 at byte 28 */
> -1:
> -	/* load the kernel address */
> -	ld	%r4,(kernel - 0b)(%r18)
> -
> -	/* load the run_at_load flag */
> -	/* possibly patched by kexec */
> -	ld	%r6,(run_at_load - 0b)(%r18)
> -	/* and patch it into the kernel */
> -	stw	%r6,(0x5c)(%r4)
> -
> -	mr	%r3,%r16	/* restore dt address */
> -
> -	li	%r5,0		/* r5 will be 0 for kernel */
> -
> -	mfmsr	%r11
> -	andi.	%r10,%r11,1	/* test MSR_LE */
> -	bne	.Little_endian
> -
> -	mtctr	%r4		/* prepare branch to */
> -	bctr			/* start kernel */
> -
> -.Little_endian:
> -	mtsrr0	%r4		/* prepare branch to */
> -
> -	clrrdi	%r11,%r11,1	/* clear MSR_LE */
> -	mtsrr1	%r11
> -
> -	rfid			/* update MSR and start kernel */
> -
> -
> -	.balign 8
> -	.globl kernel
> -kernel:
> -	.8byte  0x0
> -	.size kernel, . - kernel
> -
> -	.balign 8
> -	.globl dt_offset
> -dt_offset:
> -	.8byte  0x0
> -	.size dt_offset, . - dt_offset
> -
> -
> -	.data
> -	.balign 8
> -.globl purgatory_sha256_digest
> -purgatory_sha256_digest:
> -	.skip	32
> -	.size purgatory_sha256_digest, . - purgatory_sha256_digest
> -
> -	.balign 8
> -.globl purgatory_sha_regions
> -purgatory_sha_regions:
> -	.skip	8 * 2 * 16
> -	.size purgatory_sha_regions, . - purgatory_sha_regions
> diff --git a/arch/powerpc/purgatory/trampoline_64.S b/arch/powerpc/purgatory/trampoline_64.S
> new file mode 100644
> index 0000000..a5a83c3
> --- /dev/null
> +++ b/arch/powerpc/purgatory/trampoline_64.S
> @@ -0,0 +1,117 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * kexec trampoline
> + *
> + * Based on code taken from kexec-tools and kexec-lite.
> + *
> + * Copyright (C) 2004 - 2005, Milton D Miller II, IBM Corporation
> + * Copyright (C) 2006, Mohan Kumar M, IBM Corporation
> + * Copyright (C) 2013, Anton Blanchard, IBM Corporation
> + */
> +
> +#include <asm/asm-compat.h>
> +
> +	.machine ppc64
> +	.balign 256
> +	.globl purgatory_start
> +purgatory_start:
> +	b	master
> +
> +	/* ABI: possible run_at_load flag at 0x5c */
> +	.org purgatory_start + 0x5c
> +	.globl run_at_load
> +run_at_load:
> +	.long 0
> +	.size run_at_load, . - run_at_load
> +
> +	/* ABI: slaves start at 60 with r3=phys */
> +	.org purgatory_start + 0x60
> +slave:
> +	b .
> +	/* ABI: end of copied region */
> +	.org purgatory_start + 0x100
> +	.size purgatory_start, . - purgatory_start
> +
> +/*
> + * The above 0x100 bytes at purgatory_start are replaced with the
> + * code from the kernel (or next stage) by setup_purgatory().
> + */
> +
> +master:
> +	or	%r1,%r1,%r1	/* low priority to let other threads catchup */
> +	isync
> +	mr	%r17,%r3	/* save cpu id to r17 */
> +	mr	%r15,%r4	/* save physical address in reg15 */
> +
> +	or	%r3,%r3,%r3	/* ok now to high priority, lets boot */
> +	lis	%r6,0x1
> +	mtctr	%r6		/* delay a bit for slaves to catch up */
> +	bdnz	.		/* before we overwrite 0-100 again */
> +
> +	bl	0f		/* Work out where we're running */
> +0:	mflr	%r18
> +
> +	/* load device-tree address */
> +	ld	%r3, (dt_offset - 0b)(%r18)
> +	mr	%r16,%r3	/* save dt address in reg16 */
> +	li	%r4,20
> +	LWZX_BE	%r6,%r3,%r4	/* fetch __be32 version number at byte 20 */
> +	cmpwi	%cr0,%r6,2	/* v2 or later? */
> +	blt	1f
> +	li	%r4,28
> +	STWX_BE	%r17,%r3,%r4	/* Store my cpu as __be32 at byte 28 */
> +1:
> +	/* load the kernel address */
> +	ld	%r4,(kernel - 0b)(%r18)
> +
> +	/* load the run_at_load flag */
> +	/* possibly patched by kexec */
> +	ld	%r6,(run_at_load - 0b)(%r18)
> +	/* and patch it into the kernel */
> +	stw	%r6,(0x5c)(%r4)
> +
> +	mr	%r3,%r16	/* restore dt address */
> +
> +	li	%r5,0		/* r5 will be 0 for kernel */
> +
> +	mfmsr	%r11
> +	andi.	%r10,%r11,1	/* test MSR_LE */
> +	bne	.Little_endian
> +
> +	mtctr	%r4		/* prepare branch to */
> +	bctr			/* start kernel */
> +
> +.Little_endian:
> +	mtsrr0	%r4		/* prepare branch to */
> +
> +	clrrdi	%r11,%r11,1	/* clear MSR_LE */
> +	mtsrr1	%r11
> +
> +	rfid			/* update MSR and start kernel */
> +
> +
> +	.balign 8
> +	.globl kernel
> +kernel:
> +	.8byte  0x0
> +	.size kernel, . - kernel
> +
> +	.balign 8
> +	.globl dt_offset
> +dt_offset:
> +	.8byte  0x0
> +	.size dt_offset, . - dt_offset
> +
> +
> +	.data
> +	.balign 8
> +.globl purgatory_sha256_digest
> +purgatory_sha256_digest:
> +	.skip	32
> +	.size purgatory_sha256_digest, . - purgatory_sha256_digest
> +
> +	.balign 8
> +.globl purgatory_sha_regions
> +purgatory_sha_regions:
> +	.skip	8 * 2 * 16
> +	.size purgatory_sha_regions, . - purgatory_sha_regions
> 


^ permalink raw reply

* Re: [PATCH v6 3/5] KVM: PPC: clean up redundant kvm_run parameters in assembly
From: Paolo Bonzini @ 2020-07-10  7:57 UTC (permalink / raw)
  To: Tianjia Zhang, tsbogend, paulus, mpe, benh, borntraeger, frankja,
	david, cohuck, heiko.carstens, gor, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, x86, hpa,
	maz, james.morse, julien.thierry.kdev, suzuki.poulose,
	christoffer.dall, peterx, thuth, chenhuacai
  Cc: linux-s390, kvm, linux-mips, kvm-ppc, linux-kernel, linuxppc-dev,
	kvmarm, linux-arm-kernel
In-Reply-To: <20200623131418.31473-4-tianjia.zhang@linux.alibaba.com>

On 23/06/20 15:14, Tianjia Zhang wrote:
>  
>  	/* Load non-volatile guest state from the vcpu */
> -	VCPU_LOAD_NVGPRS(r4)
> +	VCPU_LOAD_NVGPRS(r3)
>  
>  kvm_start_lightweight:
>  	/* Copy registers into shadow vcpu so we can access them in real mode */
> -	mr	r3, r4
>  	bl	FUNC(kvmppc_copy_to_svcpu)
>  	nop
> -	REST_GPR(4, r1)
> +	REST_GPR(3, r1)
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	/* Get the dcbz32 flag */
> @@ -146,7 +144,7 @@ after_sprg3_load:

Below, there are a bunch of references to r3 and r4 left

        rldicl  r3, r3, 0, 63           /* r3 &= 1 */
        stb     r3, HSTATE_RESTORE_HID5(r13)

        /* Load up guest SPRG3 value, since it's user readable */
        lwz     r3, VCPU_SHAREDBE(r4)				  <<<
        cmpwi   r3, 0
        ld      r5, VCPU_SHARED(r4)				  <<<

where r3 is also destroyed.  So I'd rather have three patches:

- one that is as small as possible, changing the prototypes and adding

	mr r4, r3

- one that entirely swaps out r3 and r4.  This would be the hard one to
review!

- one that cleans up the prologue and epilogue

But overall I think it's simplest to just leave out this patch.

Paolo


^ permalink raw reply

* Re: [PATCH v6 0/5] clean up redundant 'kvm_run' parameters
From: Paolo Bonzini @ 2020-07-10  8:06 UTC (permalink / raw)
  To: Tianjia Zhang, tsbogend, paulus, mpe, benh, borntraeger, frankja,
	david, cohuck, heiko.carstens, gor, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, x86, hpa,
	maz, james.morse, julien.thierry.kdev, suzuki.poulose,
	christoffer.dall, peterx, thuth, chenhuacai
  Cc: linux-s390, kvm, linux-mips, kvm-ppc, linux-kernel, linuxppc-dev,
	kvmarm, linux-arm-kernel
In-Reply-To: <6604e273-d7b1-5007-8721-75c4a4dec68e@linux.alibaba.com>

On 10/07/20 09:32, Tianjia Zhang wrote:
> Hi Paolo,
> 
> Any opinion on this series patches? Can I help with this patchset ?

I was hoping to have some Tested-by, for now I'm queuing patches 1 and
2.  Thanks,

Paolo

> Thanks and best,
> Tianjia
> 
> On 2020/6/23 21:14, Tianjia Zhang wrote:
>> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
>> structure. For historical reasons, many kvm-related function parameters
>> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
>> patch does a unified cleanup of these remaining redundant parameters.
>>
>> This series of patches has completely cleaned the architecture of
>> arm64, mips, ppc, and s390 (no such redundant code on x86). Due to
>> the large number of modified codes, a separate patch is made for each
>> platform. On the ppc platform, there is also a redundant structure
>> pointer of 'kvm_run' in 'vcpu_arch', which has also been cleaned
>> separately.
>>
>> ---
>> v6 changes:
>>    Rearrange patch sets, only keep the unmerged patch.
>>    rebase on mainline.
>>
>> v5 change:
>>    ppc: fix for review.
>>
>> v4 change:
>>    mips: fixes two errors in entry.c.
>>
>> v3 change:
>>    Keep the existing `vcpu->run` in the function body unchanged.
>>
>> v2 change:
>>    s390 retains the original variable name and minimizes modification.
>>
>> Tianjia Zhang (5):
>>    KVM: s390: clean up redundant 'kvm_run' parameters
>>    KVM: arm64: clean up redundant 'kvm_run' parameters
>>    KVM: PPC: clean up redundant kvm_run parameters in assembly
>>    KVM: MIPS: clean up redundant 'kvm_run' parameters
>>    KVM: MIPS: clean up redundant kvm_run parameters in assembly
>>
>>   arch/arm64/include/asm/kvm_coproc.h   |  12 +--
>>   arch/arm64/include/asm/kvm_host.h     |  11 +--
>>   arch/arm64/include/asm/kvm_mmu.h      |   2 +-
>>   arch/arm64/kvm/arm.c                  |   6 +-
>>   arch/arm64/kvm/handle_exit.c          |  36 ++++----
>>   arch/arm64/kvm/mmio.c                 |  11 +--
>>   arch/arm64/kvm/mmu.c                  |   5 +-
>>   arch/arm64/kvm/sys_regs.c             |  13 ++-
>>   arch/mips/include/asm/kvm_host.h      |  32 ++------
>>   arch/mips/kvm/emulate.c               |  59 +++++--------
>>   arch/mips/kvm/entry.c                 |  21 ++---
>>   arch/mips/kvm/mips.c                  |  14 ++--
>>   arch/mips/kvm/trap_emul.c             | 114 +++++++++++---------------
>>   arch/mips/kvm/vz.c                    |  26 +++---
>>   arch/powerpc/include/asm/kvm_ppc.h    |   2 +-
>>   arch/powerpc/kvm/book3s_interrupts.S  |  22 +++--
>>   arch/powerpc/kvm/book3s_pr.c          |   9 +-
>>   arch/powerpc/kvm/booke.c              |   9 +-
>>   arch/powerpc/kvm/booke_interrupts.S   |   9 +-
>>   arch/powerpc/kvm/bookehv_interrupts.S |  10 +--
>>   arch/s390/kvm/kvm-s390.c              |  23 ++++--
>>   21 files changed, 188 insertions(+), 258 deletions(-)
>>
> 


^ permalink raw reply

* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
From: Peter Zijlstra @ 2020-07-10  9:35 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Arnd Bergmann, x86, linux-kernel, linux-mm,
	Mathieu Desnoyers, linuxppc-dev
In-Reply-To: <20200710015646.2020871-8-npiggin@gmail.com>

On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote:
> On big systems, the mm refcount can become highly contented when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
> 
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.

That's mighty impressive, however:

> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> +		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> +		do_shoot_lazy_tlb(mm);
> +	}
> +}

IIRC you (power) never clear a CPU from that mask, so for other
workloads I can see this resulting in massive amounts of IPIs.

For instance, take as many processes as you have CPUs. For each,
manually walk the task across all CPUs and exit. Again.

Clearly, that's an extreme, but still...



^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Peter Zijlstra @ 2020-07-10  9:42 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Arnd Bergmann, x86, linux-kernel, linux-mm,
	Mathieu Desnoyers, linuxppc-dev
In-Reply-To: <20200710015646.2020871-5-npiggin@gmail.com>

On Fri, Jul 10, 2020 at 11:56:43AM +1000, Nicholas Piggin wrote:
> And get rid of the generic sync_core_before_usermode facility.
> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
> 
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
> 
> This makes lazy tlb code a bit more modular.

Hurmph, I know I've been staring at this at some point. I think I meant
to have a TIF to force the IRET path in the case of MEMBAR_SYNC_CORE.
But I was discouraged by amluto.


^ permalink raw reply

* Re: [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions
From: Peter Zijlstra @ 2020-07-10  9:48 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Arnd Bergmann, x86, linux-kernel, linux-mm,
	Mathieu Desnoyers, linuxppc-dev
In-Reply-To: <20200710015646.2020871-6-npiggin@gmail.com>

On Fri, Jul 10, 2020 at 11:56:44AM +1000, Nicholas Piggin wrote:

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 73199470c265..ad95812d2a3f 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1253,7 +1253,7 @@ void start_secondary(void *unused)
>  	unsigned int cpu = smp_processor_id();
>  	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
>  
> -	mmgrab(&init_mm);
> +	mmgrab(&init_mm); /* XXX: where is the mmput for this? */
>  	current->active_mm = &init_mm;
>  
>  	smp_store_cpu_info(cpu);

Right; so IIRC it should be this one:

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 134688d79589..ff9fcbc4e76b 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -578,7 +578,7 @@ static int finish_cpu(unsigned int cpu)
>  	 */
>  	if (mm != &init_mm)
>  		idle->active_mm = &init_mm;
> -	mmdrop(mm);
> +	mmdrop_lazy_tlb(mm);
>  	return 0;
>  }


^ permalink raw reply

* Re: PowerNV PCI & SR-IOV cleanups
From: Oliver O'Halloran @ 2020-07-10 12:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-pci, linuxppc-dev
In-Reply-To: <20200710064535.GA8354@infradead.org>

On Fri, Jul 10, 2020 at 4:45 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Jul 10, 2020 at 03:23:25PM +1000, Oliver O'Halloran wrote:
> > This is largely prep work for supporting VFs in the 32bit MMIO window.
> > This is an unfortunate necessity due to how the Linux BAR allocator
> > handles BARs marked as non-prefetchable. The distinction
> > between prefetch and non-prefetchable BARs was made largely irrelevant
> > with the introduction of PCIe, but the BAR allocator is overly
> > conservative. It will always place non-pref bars in the prefetchable
> > window, which is 32bit only. This results in us being unable to use VFs
> > from NVMe drives and a few different RAID cards.
>
> How about fixing that in the core PCI code?

I've been kicking around the idea but I've never managed to convince
myself that ignoring the non-prefetchable bit is a safe thing to do in
generic code. Since Gen3 at least the PCIe Base spec has provided some
guidance about when you can put non-prefetchable BARs in the
prefetchable window and as of the Gen5 spec it lists these conditions:

> 1) The entire path from the host to the adapter is over PCI Express.
> 2) No conventional PCI or PCI-X devices do peer-to-peer reads to the range mapped by the BAR.
> 3) The PCI Express Host Bridge does no byte merging. (This is believed to be true on most platforms.)
> 4) Any locations with read side-effects are never the target of Memory Reads with the TH bit Set.
> 5) The range mapped by the BAR is never the target of a speculative Memory Read, either Host initiated or peer-to-peer.

1) Is easy enough to verify.
2) Is probably true, but who knows.
3) I know this is true for the platforms I'm looking at since the HW
designers assure me there is no merging happening at the host-bridge
level. Merging of MMIO ops does seem like an insane thing to do so
it's probably true in general too, but there's no real way to tell.
4) Is also *probably* true since the TH bit is only set when it's
explicitly enabled via the TLP Processing Hints extended capability in
config space. I guess it's possible firmware might enable that without
Linux realising, but in that case Linux is probably not doing BAR
allocation.
5) I have no idea about, but it seems difficult to make any kind of
general statement about.

I might just be being paranoid.

Oliver

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-10 14:02 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
	linux-mm, linuxppc-dev
In-Reply-To: <20200710015646.2020871-5-npiggin@gmail.com>

----- On Jul 9, 2020, at 9:56 PM, Nicholas Piggin npiggin@gmail.com wrote:

> And get rid of the generic sync_core_before_usermode facility.
> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
> 
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
> 
> This makes lazy tlb code a bit more modular.

I agree that moving this logic to exit_lazy_tlb is much more modular and
cleaner.

Thanks,

Mathieu


> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> .../membarrier-sync-core/arch-support.txt     |  6 +++-
> arch/x86/include/asm/mmu_context.h            | 35 +++++++++++++++++++
> arch/x86/include/asm/sync_core.h              | 28 ---------------
> include/linux/sched/mm.h                      | 14 --------
> include/linux/sync_core.h                     | 21 -----------
> kernel/cpu.c                                  |  4 ++-
> kernel/kthread.c                              |  2 +-
> kernel/sched/core.c                           | 16 ++++-----
> 8 files changed, 51 insertions(+), 75 deletions(-)
> delete mode 100644 arch/x86/include/asm/sync_core.h
> delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> index 52ad74a25f54..bd43fb1f5986 100644
> --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> @@ -5,6 +5,10 @@
> #
> # Architecture requirements
> #
> +# If your architecture returns to user-space through non-core-serializing
> +# instructions, you need to ensure these are done in switch_mm and
> exit_lazy_tlb
> +# (if lazy tlb switching is implemented).
> +#
> # * arm/arm64/powerpc
> #
> # Rely on implicit context synchronization as a result of exception return
> @@ -24,7 +28,7 @@
> # instead on write_cr3() performed by switch_mm() to provide core serialization
> # after changing the current mm, and deal with the special case of kthread ->
> # uthread (temporarily keeping current mm into active_mm) by issuing a
> -# sync_core_before_usermode() in that specific case.
> +# serializing instruction in exit_lazy_mm() in that specific case.
> #
>     -----------------------
>     |         arch |status|
> diff --git a/arch/x86/include/asm/mmu_context.h
> b/arch/x86/include/asm/mmu_context.h
> index 255750548433..5263863a9be8 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -6,6 +6,7 @@
> #include <linux/atomic.h>
> #include <linux/mm_types.h>
> #include <linux/pkeys.h>
> +#include <linux/sched/mm.h>
> 
> #include <trace/events/tlb.h>
> 
> @@ -95,6 +96,40 @@ static inline void switch_ldt(struct mm_struct *prev, struct
> mm_struct *next)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
> 
> +#ifdef CONFIG_MEMBARRIER
> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysrel, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> +{
> +	/* Switching mm is serializing with write_cr3 */
> +        if (tsk->mm != mm)
> +                return;
> +
> +        if (likely(!(atomic_read(&mm->membarrier_state) &
> +                     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
> +                return;
> +
> +	/* With PTI, we unconditionally serialize before running user code. */
> +	if (static_cpu_has(X86_FEATURE_PTI))
> +		return;
> +	/*
> +	 * Return from interrupt and NMI is done through iret, which is core
> +	 * serializing.
> +	 */
> +	if (in_irq() || in_nmi())
> +		return;
> +	sync_core();
> +}
> +#endif
> +
> /*
>  * Init a new mm.  Used on mm copies, like at fork()
>  * and on mm's that are brand-new, like at execve().
> diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
> deleted file mode 100644
> index c67caafd3381..000000000000
> --- a/arch/x86/include/asm/sync_core.h
> +++ /dev/null
> @@ -1,28 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _ASM_X86_SYNC_CORE_H
> -#define _ASM_X86_SYNC_CORE_H
> -
> -#include <linux/preempt.h>
> -#include <asm/processor.h>
> -#include <asm/cpufeature.h>
> -
> -/*
> - * Ensure that a core serializing instruction is issued before returning
> - * to user-mode. x86 implements return to user-space through sysexit,
> - * sysrel, and sysretq, which are not core serializing.
> - */
> -static inline void sync_core_before_usermode(void)
> -{
> -	/* With PTI, we unconditionally serialize before running user code. */
> -	if (static_cpu_has(X86_FEATURE_PTI))
> -		return;
> -	/*
> -	 * Return from interrupt and NMI is done through iret, which is core
> -	 * serializing.
> -	 */
> -	if (in_irq() || in_nmi())
> -		return;
> -	sync_core();
> -}
> -
> -#endif /* _ASM_X86_SYNC_CORE_H */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 480a4d1b7dd8..9b026264b445 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -7,7 +7,6 @@
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> #include <linux/gfp.h>
> -#include <linux/sync_core.h>
> 
> /*
>  * Routines for handling mm_structs
> @@ -364,16 +363,6 @@ enum {
> #include <asm/membarrier.h>
> #endif
> 
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct
> *mm)
> -{
> -	if (current->mm != mm)
> -		return;
> -	if (likely(!(atomic_read(&mm->membarrier_state) &
> -		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
> -		return;
> -	sync_core_before_usermode();
> -}
> -
> extern void membarrier_exec_mmap(struct mm_struct *mm);
> 
> #else
> @@ -387,9 +376,6 @@ static inline void membarrier_arch_switch_mm(struct
> mm_struct *prev,
> static inline void membarrier_exec_mmap(struct mm_struct *mm)
> {
> }
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct
> *mm)
> -{
> -}
> #endif
> 
> #endif /* _LINUX_SCHED_MM_H */
> diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
> deleted file mode 100644
> index 013da4b8b327..000000000000
> --- a/include/linux/sync_core.h
> +++ /dev/null
> @@ -1,21 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _LINUX_SYNC_CORE_H
> -#define _LINUX_SYNC_CORE_H
> -
> -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> -#include <asm/sync_core.h>
> -#else
> -/*
> - * This is a dummy sync_core_before_usermode() implementation that can be used
> - * on all architectures which return to user-space through core serializing
> - * instructions.
> - * If your architecture returns to user-space through non-core-serializing
> - * instructions, you need to write your own functions.
> - */
> -static inline void sync_core_before_usermode(void)
> -{
> -}
> -#endif
> -
> -#endif /* _LINUX_SYNC_CORE_H */
> -
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6ff2578ecf17..134688d79589 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu)
> 
> 	/*
> 	 * idle_task_exit() will have switched to &init_mm, now
> -	 * clean up any remaining active_mm state.
> +	 * clean up any remaining active_mm state. exit_lazy_tlb
> +	 * is not done, if an arch did any accounting in these
> +	 * functions it would have to be added.
> 	 */
> 	if (mm != &init_mm)
> 		idle->active_mm = &init_mm;
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index e813d92f2eab..6f93c649aa97 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1251,9 +1251,9 @@ void kthread_use_mm(struct mm_struct *mm)
> 	finish_arch_post_lock_switch();
> #endif
> 
> +	exit_lazy_tlb(active_mm, tsk);
> 	if (active_mm != mm)
> 		mmdrop(active_mm);
> -	exit_lazy_tlb(active_mm, tsk);
> 
> 	to_kthread(tsk)->oldfs = get_fs();
> 	set_fs(USER_DS);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index debc917bc69b..31e22c79826c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3294,22 +3294,19 @@ static struct rq *finish_task_switch(struct task_struct
> *prev)
> 	kcov_finish_switch(current);
> 
> 	fire_sched_in_preempt_notifiers(current);
> +
> 	/*
> 	 * When switching through a kernel thread, the loop in
> 	 * membarrier_{private,global}_expedited() may have observed that
> 	 * kernel thread and not issued an IPI. It is therefore possible to
> 	 * schedule between user->kernel->user threads without passing though
> -	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, so provide them here:
> -	 *
> -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -	 *   provided by mmdrop(),
> -	 * - a sync_core for SYNC_CORE.
> +	 * switch_mm(). Membarrier requires a full barrier after storing to
> +	 * rq->curr, before returning to userspace, for
> +	 * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop().
> 	 */
> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> +	if (mm)
> 		mmdrop(mm);
> -	}
> +
> 	if (unlikely(prev_state == TASK_DEAD)) {
> 		if (prev->sched_class->task_dead)
> 			prev->sched_class->task_dead(prev);
> @@ -6292,6 +6289,7 @@ void idle_task_exit(void)
> 	BUG_ON(current != this_rq()->idle);
> 
> 	if (mm != &init_mm) {
> +		/* enter_lazy_tlb is not done because we're about to go down */
> 		switch_mm(mm, &init_mm, current);
> 		finish_arch_post_lock_switch();
> 	}
> --
> 2.23.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH] powerpc/rtas: Restrict RTAS requests from userspace
From: Sasha Levin @ 2020-07-10 14:02 UTC (permalink / raw)
  To: Sasha Levin, Andrew Donnellan, linuxppc-dev; +Cc: nathanl, leobras.c, stable
In-Reply-To: <20200702161932.18176-1-ajd@linux.ibm.com>

Hi

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.7.7, v5.4.50, v4.19.131, v4.14.187, v4.9.229, v4.4.229.

v5.7.7: Build OK!
v5.4.50: Failed to apply! Possible dependencies:
    1a8916ee3ac29 ("powerpc: Detect the secure boot mode of the system")
    4238fad366a66 ("powerpc/ima: Add support to initialize ima policy rules")
    9155e2341aa8b ("powerpc/powernv: Add OPAL API interface to access secure variable")
    bd5d9c743d38f ("powerpc: expose secure variables to userspace via sysfs")

v4.19.131: Failed to apply! Possible dependencies:
    1a8916ee3ac29 ("powerpc: Detect the secure boot mode of the system")
    4238fad366a66 ("powerpc/ima: Add support to initialize ima policy rules")
    6f5f193e84d3d ("powerpc/opal: add MPIPL interface definitions")
    75d9fc7fd94eb ("powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C")
    8139046a5a347 ("powerpc/powernv: Make possible for user to force a full ipl cec reboot")
    88ec6b93c8e7d ("powerpc/xive: add OPAL extensions for the XIVE native exploitation support")
    9155e2341aa8b ("powerpc/powernv: Add OPAL API interface to access secure variable")
    bd5d9c743d38f ("powerpc: expose secure variables to userspace via sysfs")
    fb0b0a73b223f ("powerpc: Enable kcov")

v4.14.187: Failed to apply! Possible dependencies:
    04baaf28f40c6 ("powerpc/powernv: Add support to enable sensor groups")
    10dac04c79b18 ("mips: fix an off-by-one in dma_capable")
    1a8916ee3ac29 ("powerpc: Detect the secure boot mode of the system")
    32ce3862af3c4 ("powerpc/lib: Implement PMEM API")
    5cdcb01e0af5a ("powernv: opal-sensor: Add support to read 64bit sensor values")
    656ecc16e8fc2 ("crypto/nx: Initialize 842 high and normal RxFIFO control registers")
    6f5f193e84d3d ("powerpc/opal: add MPIPL interface definitions")
    74d656d219b98 ("powerpc/powernv: Add opal calls for opencapi")
    77adbd2207e85 ("powerpc/powernv: Add OPAL_BUSY to opal_error_code()")
    88ec6b93c8e7d ("powerpc/xive: add OPAL extensions for the XIVE native exploitation support")
    8c4cdce8b1ab0 ("mtd: nand: qcom: add command elements in BAM transaction")
    9155e2341aa8b ("powerpc/powernv: Add OPAL API interface to access secure variable")
    92e3da3cf193f ("powerpc: initial pkey plumbing")
    bd5d9c743d38f ("powerpc: expose secure variables to userspace via sysfs")
    d6a90bb83b508 ("powerpc/powernv: Enable tunneled operations")
    e36d0a2ed5019 ("powerpc/powernv: Implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET")
    ea8c64ace8664 ("dma-mapping: move swiotlb arch helpers to a new header")
    fb0b0a73b223f ("powerpc: Enable kcov")

v4.9.229: Failed to apply! Possible dependencies:
    1515ab9321562 ("powerpc/mm: Dump hash table")
    1a8916ee3ac29 ("powerpc: Detect the secure boot mode of the system")
    40565b5aedd6d ("sched/cputime, powerpc, s390: Make scaled cputime arch specific")
    4ad8622dc5489 ("powerpc/8xx: Implement hw_breakpoint")
    51c9c08439935 ("powerpc/kprobes: Implement Optprobes")
    5b9ff02785986 ("powerpc: Build-time sort the exception table")
    6533b7c16ee57 ("powerpc: Initial stack protector (-fstack-protector) support")
    65c059bcaa731 ("powerpc: Enable support for GCC plugins")
    8eb07b187000d ("powerpc/mm: Dump linux pagetables")
    92e3da3cf193f ("powerpc: initial pkey plumbing")
    981ee2d444408 ("sched/cputime, powerpc: Remove cputime_to_scaled()")
    a7d2475af7aed ("powerpc: Sort the selects under CONFIG_PPC")
    bd5d9c743d38f ("powerpc: expose secure variables to userspace via sysfs")
    d6c569b99558b ("powerpc/64: Move HAVE_CONTEXT_TRACKING from pseries to common Kconfig")
    dd5ac03e09554 ("powerpc/mm: Fix page table dump build on non-Book3S")
    ea8c64ace8664 ("dma-mapping: move swiotlb arch helpers to a new header")
    ebfa50df435ee ("powerpc: Add helper to check if offset is within relative branch range")
    fa769d3f58e6b ("powerpc/32: Enable HW_BREAKPOINT on BOOK3S")
    fb0b0a73b223f ("powerpc: Enable kcov")

v4.4.229: Failed to apply! Possible dependencies:
    019132ff3daf3 ("x86/mm/pkeys: Fill in pkey field in siginfo")
    01c8f1c44b83a ("mm, dax, gpu: convert vm_insert_mixed to pfn_t")
    0e749e54244ee ("dax: increase granularity of dax_clear_blocks() operations")
    1874f6895c92d ("x86/mm/gup: Simplify get_user_pages() PTE bit handling")
    1a8916ee3ac29 ("powerpc: Detect the secure boot mode of the system")
    33a709b25a760 ("mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys")
    34c0fd540e79f ("mm, dax, pmem: introduce pfn_t")
    3565fce3a6597 ("mm, x86: get_user_pages() for dax mappings")
    52db400fcd502 ("pmem, dax: clean up clear_pmem()")
    7b2d0dbac4890 ("x86/mm/pkeys: Pass VMA down in to fault signal generation code")
    8f62c883222c9 ("x86/mm/pkeys: Add arch-specific VMA protection bits")
    92e3da3cf193f ("powerpc: initial pkey plumbing")
    b2e0d1625e193 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
    bd5d9c743d38f ("powerpc: expose secure variables to userspace via sysfs")
    f25748e3c34eb ("mm, dax: convert vmf_insert_pfn_pmd() to pfn_t")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks
Sasha

^ permalink raw reply

* Re: [PATCH 0/2] ASoC: fsl_spdif: Clear the validity bit for TX
From: Mark Brown @ 2020-07-10 15:39 UTC (permalink / raw)
  To: timur, alsa-devel, perex, festevam, Shengjiu Wang, nicoleotsuka,
	tiwai, Xiubo.Lee
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1594112066-31297-1-git-send-email-shengjiu.wang@nxp.com>

On Tue, 7 Jul 2020 16:54:24 +0800, Shengjiu Wang wrote:
> Clear the validity bit for TX
> Add kctl for configuring TX validity bit
> 
> Shengjiu Wang (2):
>   ASoC: fsl_spdif: Clear the validity bit for TX
>   ASoC: fsl_spdif: Add kctl for configuring TX validity bit
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/2] ASoC: fsl_spdif: Clear the validity bit for TX
      commit: 055b082156704b85a059770359d6cdedfb24831d
[2/2] ASoC: fsl_spdif: Add kctl for configuring TX validity bit
      commit: aa3fce5cd454db551a4ea1656bab9c2bacd2d1f4

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Andy Lutomirski @ 2020-07-10 17:04 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, X86 ML, LKML, Linux-MM,
	Mathieu Desnoyers, linuxppc-dev
In-Reply-To: <20200710015646.2020871-5-npiggin@gmail.com>

On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> And get rid of the generic sync_core_before_usermode facility.
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
>
> This makes lazy tlb code a bit more modular.

The mm-switching and TLB-management has often had the regrettable
property that it gets wired up in a way that seems to work at the time
but doesn't have clear semantics, and I'm a bit concerned that this
patch is in that category.  If I'm understanding right, you're trying
to enforce the property that exiting lazy TLB mode will promise to
sync the core eventually.  But this has all kinds of odd properties:

 - Why is exit_lazy_tlb() getting called at all in the relevant cases?
 When is it permissible to call it?  I look at your new code and see:

> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysrel, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> +{
> +       /* Switching mm is serializing with write_cr3 */
> +        if (tsk->mm != mm)
> +                return;

And my brain says WTF?  Surely you meant something like if
(WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened?  how do
we try to recover well enough to get a crashed logged at least? */ }

So this needs actual documentation, preferably in comments near the
function, of what the preconditions are and what this mm parameter is.
Once that's done, then we could consider whether it's appropriate to
have this function promise to sync the core under some conditions.

 - This whole structure seems to rely on the idea that switching mm
syncs something.  I periodically ask chip vendor for non-serializing
mm switches.  Specifically, in my dream world, we have totally
separate user and kernel page tables.  Changing out the user tables
doesn't serialize or even create a fence.  Instead it creates the
minimum required pipeline hazard such that user memory access and
switches to usermode will make sure they access memory through the
correct page tables.  I haven't convinced a chip vendor yet, but there
are quite a few hundreds of cycles to be saved here.  With this in
mind, I see the fencing aspects of the TLB handling code as somewhat
of an accident.  I'm fine with documenting them and using them to
optimize other paths, but I think it should be explicit.  For example:

/* Also does a full barrier?  (Or a sync_core()-style barrier.)
However, if you rely on this, you must document it in a comment where
you call this function. *?
void switch_mm_irqs_off()
{
}

This is kind of like how we strongly encourage anyone using smp_?mb()
to document what they are fencing against.

Also, as it stands, I can easily see in_irq() ceasing to promise to
serialize.  There are older kernels for which it does not promise to
serialize.  And I have plans to make it stop serializing in the
nearish future.

--Andy

^ permalink raw reply

* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Mimi Zohar @ 2020-07-10 17:23 UTC (permalink / raw)
  To: Bruno Meneguele, linux-kernel, x86, linuxppc-dev, linux-s390,
	linux-integrity
  Cc: erichte, nayna, stable
In-Reply-To: <20200709164647.45153-1-bmeneg@redhat.com>

On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> time, enforcing the appraisal whenever the kernel had the arch policy option
> enabled.

> However it breaks systems where the option is set but the system didn't
> boot in a "secure boot" platform. In this scenario, anytime an appraisal
> policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> giving the user the opportunity to label the filesystem, before enforcing
> integrity.
> 
> Considering the ARCH_POLICY is only effective when secure boot is actually
> enabled this patch remove the compile time dependency and move it to a
> runtime decision, based on the secure boot state of that platform.

Perhaps we could simplify this patch description a bit?

The IMA_APPRAISE_BOOTPARAM config allows enabling different
"ima_appraise=" modes - log, fix, enforce - at run time, but not when
IMA architecture specific policies are enabled.  This prevents
properly labeling the filesystem on systems where secure boot is
supported, but not enabled on the platform.  Only when secure boot is
enabled, should these IMA appraise modes be disabled.

This patch removes the compile time dependency and makes it a runtime
decision, based on the secure boot state of that platform.

<snip>

> diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> index a9649b04b9f1..884de471b38a 100644
> --- a/security/integrity/ima/ima_appraise.c
> +++ b/security/integrity/ima/ima_appraise.c
> @@ -19,6 +19,11 @@
>  static int __init default_appraise_setup(c

> har *str)
>  {
>  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> +	if (arch_ima_get_secureboot()) {
> +		pr_info("appraise boot param ignored: secure boot enabled");

Instead of a generic statement, is it possible to include the actual
option being denied?  Perhaps something like: "Secure boot enabled,
ignoring %s boot command line option"

Mimi

> +		return 1;
> +	}
> +
>  	if (strncmp(str, "off", 3) == 0)
>  		ima_appraise = 0;
>  	else if (strncmp(str, "log", 3) == 0)

^ permalink raw reply

* Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
From: Nathan Lynch @ 2020-07-10 17:41 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: linuxppc-dev, Bharata B Rao
In-Reply-To: <20200707084203.GC874@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-07 15:02:17]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
>> > /proc/device-tree/rtas/ibm,current-associativity-domains
>> > 		 00000005 00000001 00000002 00000002 00000002 00000010
>> > /proc/device-tree/rtas/ibm,max-associativity-domains
>> > 		 00000005 00000001 00000008 00000020 00000020 00000100
>> >
>> > $ cat /sys/devices/system/node/possible ##Before patch
>> > 0-31
>> >
>> > $ cat /sys/devices/system/node/possible ##After patch
>> > 0-1
>> >
>> > Note the maximum nodes this platform can support is only 2 but the
>> > possible nodes is set to 32.
>> 
>> But what about LPM to a system with more nodes?
>> 
>
> I have very less info on LPM, so I checked with Nathan Lynch before posting
> and as per Nathan in the current design of LPM, Linux wouldn't use the new
> node numbers.

(I see a v2 has been posted already but I needed a little time to check
with our hypervisor people on this point.)

It's less of a design and more of a least-bad option in the absence of a
more flexible NUMA architecture in Linux.

For now, the "rule" with LPM and NUMA has to be that Linux uses the NUMA
information from the device tree that it was booted with, and it must
disregard ibm,associativity and similar information after LPM or certain
other platform events. Historically there has been code that tried to
honor changes in NUMA information but it caused much worse problems than
degraded performance. That code has been disabled by default since last
year and is now subject to removal:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897

Most NUMA-aware code happens to follow that rule because the device tree
associativity information tends to get cached on first access in Linux's
own data structures. It all feels a little fragile to me though,
especially since we can DLPAR add processors and memory after LPM with
"new" associativity properties which don't relate to the logical
topology Linux has already built. However, on currently available hw, as
long as we're using ibm,max-associativity-domains to limit the set of
possible nodes, I believe such resources will always receive valid (but
possibly suboptimal) NUMA assignments. That's because as of this
writing ibm,max-associativity-domains has the same contents on all
currently available PowerVM systems.

Now if we change to using ibm,current-associativity-domains, which we
*can* expect to differ between differently configured systems, post-LPM
DLPAR additions can yield resources with node assignments that
fall outside of the possible range, especially when we've migrated from
a smaller system to a larger one.

Is the current code robust against that possibility? I don't think so:
it looks like of_node_to_nid_single(), of_drconf_to_nid_single() and
possibly more code need to guard against this in order to prevent
NODE_DATA() null dereferences and the like. Probably those functions
should be made to clamp the nid assignment at num_possible_nodes()
instead of MAX_NUMNODES, which strikes me as more correct regardless of
your patch.

^ permalink raw reply

* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-10 18:03 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <1594401804.14405.8.camel@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]

On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > time, enforcing the appraisal whenever the kernel had the arch policy option
> > enabled.
> 
> > However it breaks systems where the option is set but the system didn't
> > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > giving the user the opportunity to label the filesystem, before enforcing
> > integrity.
> > 
> > Considering the ARCH_POLICY is only effective when secure boot is actually
> > enabled this patch remove the compile time dependency and move it to a
> > runtime decision, based on the secure boot state of that platform.
> 
> Perhaps we could simplify this patch description a bit?
> 
> The IMA_APPRAISE_BOOTPARAM config allows enabling different
> "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> IMA architecture specific policies are enabled.  This prevents
> properly labeling the filesystem on systems where secure boot is
> supported, but not enabled on the platform.  Only when secure boot is
> enabled, should these IMA appraise modes be disabled.
> 
> This patch removes the compile time dependency and makes it a runtime
> decision, based on the secure boot state of that platform.
> 

Sounds good to me.

> <snip>
> 
> > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > index a9649b04b9f1..884de471b38a 100644
> > --- a/security/integrity/ima/ima_appraise.c
> > +++ b/security/integrity/ima/ima_appraise.c
> > @@ -19,6 +19,11 @@
> >  static int __init default_appraise_setup(c
> 
> > har *str)
> >  {
> >  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > +	if (arch_ima_get_secureboot()) {
> > +		pr_info("appraise boot param ignored: secure boot enabled");
> 
> Instead of a generic statement, is it possible to include the actual
> option being denied?  Perhaps something like: "Secure boot enabled,
> ignoring %s boot command line option"
> 
> Mimi
> 

Yes, sure.

Thanks!

> > +		return 1;
> > +	}
> > +
> >  	if (strncmp(str, "off", 3) == 0)
> >  		ima_appraise = 0;
> >  	else if (strncmp(str, "log", 3) == 0)
> 

-- 
bmeneg 
PGP Key: http://bmeneg.com/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-10 18:34 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <20200710180338.GA10547@glitch>

[-- Attachment #1: Type: text/plain, Size: 4095 bytes --]

On Fri, Jul 10, 2020 at 03:03:38PM -0300, Bruno Meneguele wrote:
> On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> > On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > > time, enforcing the appraisal whenever the kernel had the arch policy option
> > > enabled.
> > 
> > > However it breaks systems where the option is set but the system didn't
> > > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > > giving the user the opportunity to label the filesystem, before enforcing
> > > integrity.
> > > 
> > > Considering the ARCH_POLICY is only effective when secure boot is actually
> > > enabled this patch remove the compile time dependency and move it to a
> > > runtime decision, based on the secure boot state of that platform.
> > 
> > Perhaps we could simplify this patch description a bit?
> > 
> > The IMA_APPRAISE_BOOTPARAM config allows enabling different
> > "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> > IMA architecture specific policies are enabled.  This prevents
> > properly labeling the filesystem on systems where secure boot is
> > supported, but not enabled on the platform.  Only when secure boot is
> > enabled, should these IMA appraise modes be disabled.
> > 
> > This patch removes the compile time dependency and makes it a runtime
> > decision, based on the secure boot state of that platform.
> > 
> 
> Sounds good to me.
> 
> > <snip>
> > 
> > > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > > index a9649b04b9f1..884de471b38a 100644
> > > --- a/security/integrity/ima/ima_appraise.c
> > > +++ b/security/integrity/ima/ima_appraise.c
> > > @@ -19,6 +19,11 @@
> > >  static int __init default_appraise_setup(c
> > 
> > > har *str)
> > >  {
> > >  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > > +	if (arch_ima_get_secureboot()) {
> > > +		pr_info("appraise boot param ignored: secure boot enabled");
> > 
> > Instead of a generic statement, is it possible to include the actual
> > option being denied?  Perhaps something like: "Secure boot enabled,
> > ignoring %s boot command line option"
> > 
> > Mimi
> > 
> 
> Yes, sure.
> 

Btw, would it make sense to first make sure we have a valid "str"
option and not something random to print?
 
diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
index a9649b04b9f1..1f1175531d3e 100644
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -25,6 +25,16 @@ static int __init default_appraise_setup(char *str)
                ima_appraise = IMA_APPRAISE_LOG;
        else if (strncmp(str, "fix", 3) == 0)
                ima_appraise = IMA_APPRAISE_FIX;
+       else
+               pr_info("invalid \"%s\" appraise option");
+
+       if (arch_ima_get_secureboot()) {
+               if (!is_ima_appraise_enabled()) {
+                       pr_info("Secure boot enabled: ignoring ima_appraise=%s boot parameter option",
+                               str);
+                       ima_appraise = IMA_APPRAISE_ENFORCE;
+               }
+       }
 #endif
        return 1;
 }


The "else" there I think would make sense as well, at least to give the
user some feedback about a possible mispelling of him (as a separate
patch).

And "if(!is_ima_appraise_enabled())" would avoid to print anything about
"ignoring the option" to the user in case he explicitly set "enforce",
which we know there isn't any real effect but is allowed and shown in
kernel-parameters.txt.

> Thanks!
> 
> > > +		return 1;
> > > +	}
> > > +
> > >  	if (strncmp(str, "off", 3) == 0)
> > >  		ima_appraise = 0;
> > >  	else if (strncmp(str, "log", 3) == 0)
> > 
> 
> -- 
> bmeneg 
> PGP Key: http://bmeneg.com/pubkey.txt



-- 
bmeneg 
PGP Key: http://bmeneg.com/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Mimi Zohar @ 2020-07-10 18:54 UTC (permalink / raw)
  To: Bruno Meneguele
  Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <20200710183420.GB10547@glitch>

On Fri, 2020-07-10 at 15:34 -0300, Bruno Meneguele wrote:
> On Fri, Jul 10, 2020 at 03:03:38PM -0300, Bruno Meneguele wrote:
> > On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> > > On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > > > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > > > time, enforcing the appraisal whenever the kernel had the arch policy option
> > > > enabled.
> > > 
> > > > However it breaks systems where the option is set but the system didn't
> > > > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > > > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > > > giving the user the opportunity to label the filesystem, before enforcing
> > > > integrity.
> > > > 
> > > > Considering the ARCH_POLICY is only effective when secure boot is actually
> > > > enabled this patch remove the compile time dependency and move it to a
> > > > runtime decision, based on the secure boot state of that platform.
> > > 
> > > Perhaps we could simplify this patch description a bit?
> > > 
> > > The IMA_APPRAISE_BOOTPARAM config allows enabling different
> > > "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> > > IMA architecture specific policies are enabled.  This prevents
> > > properly labeling the filesystem on systems where secure boot is
> > > supported, but not enabled on the platform.  Only when secure boot is
> > > enabled, should these IMA appraise modes be disabled.
> > > 
> > > This patch removes the compile time dependency and makes it a runtime
> > > decision, based on the secure boot state of that platform.
> > > 
> > 
> > Sounds good to me.
> > 
> > > <snip>
> > > 
> > > > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > > > index a9649b04b9f1..884de471b38a 100644
> > > > --- a/security/integrity/ima/ima_appraise.c
> > > > +++ b/security/integrity/ima/ima_appraise.c
> > > > @@ -19,6 +19,11 @@
> > > >  static int __init default_appraise_setup(c
> > > 
> > > > har *str)
> > > >  {
> > > >  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > > > +	if (arch_ima_get_secureboot()) {
> > > > +		pr_info("appraise boot param ignored: secure boot enabled");
> > > 
> > > Instead of a generic statement, is it possible to include the actual
> > > option being denied?  Perhaps something like: "Secure boot enabled,
> > > ignoring %s boot command line option"
> > > 
> > > Mimi
> > > 
> > 
> > Yes, sure.
> > 
> 
> Btw, would it make sense to first make sure we have a valid "str"
> option and not something random to print?
>  
> diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> index a9649b04b9f1..1f1175531d3e 100644
> --- a/security/integrity/ima/ima_appraise.c
> +++ b/security/integrity/ima/ima_appraise.c
> @@ -25,6 +25,16 @@ static int __init default_appraise_setup(char *str)
>                 ima_appraise = IMA_APPRAISE_LOG;
>         else if (strncmp(str, "fix", 3) == 0)
>                 ima_appraise = IMA_APPRAISE_FIX;
> +       else
> +               pr_info("invalid \"%s\" appraise option");
> +
> +       if (arch_ima_get_secureboot()) {
> +               if (!is_ima_appraise_enabled()) {
> +                       pr_info("Secure boot enabled: ignoring ima_appraise=%s boot parameter option",
> +                               str);
> +                       ima_appraise = IMA_APPRAISE_ENFORCE;
> +               }
> +       }

Providing feedback is probably a good idea.  However, the
"arch_ima_get_secureboot" test can't come after setting
"ima_appraise."

Mimi

>  #endif
>         return 1;
>  }
> 
> 
> The "else" there I think would make sense as well, at least to give the
> user some feedback about a possible mispelling of him (as a separate
> patch).
> 
> And "if(!is_ima_appraise_enabled())" would avoid to print anything about
> "ignoring the option" to the user in case he explicitly set "enforce",
> which we know there isn't any real effect but is allowed and shown in
> kernel-parameters.txt.
> 
> > Thanks!
> > 
> > > > +		return 1;
> > > > +	}
> > > > +
> > > >  	if (strncmp(str, "off", 3) == 0)
> > > >  		ima_appraise = 0;
> > > >  	else if (strncmp(str, "log", 3) == 0)
> > > 
> > 
> > -- 
> > bmeneg 
> > PGP Key: http://bmeneg.com/pubkey.txt
> 
> 
> 


^ permalink raw reply

* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-10 19:25 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <1594407288.14405.36.camel@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 5131 bytes --]

On Fri, Jul 10, 2020 at 02:54:48PM -0400, Mimi Zohar wrote:
> On Fri, 2020-07-10 at 15:34 -0300, Bruno Meneguele wrote:
> > On Fri, Jul 10, 2020 at 03:03:38PM -0300, Bruno Meneguele wrote:
> > > On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> > > > On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > > > > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > > > > time, enforcing the appraisal whenever the kernel had the arch policy option
> > > > > enabled.
> > > > 
> > > > > However it breaks systems where the option is set but the system didn't
> > > > > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > > > > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > > > > giving the user the opportunity to label the filesystem, before enforcing
> > > > > integrity.
> > > > > 
> > > > > Considering the ARCH_POLICY is only effective when secure boot is actually
> > > > > enabled this patch remove the compile time dependency and move it to a
> > > > > runtime decision, based on the secure boot state of that platform.
> > > > 
> > > > Perhaps we could simplify this patch description a bit?
> > > > 
> > > > The IMA_APPRAISE_BOOTPARAM config allows enabling different
> > > > "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> > > > IMA architecture specific policies are enabled.  This prevents
> > > > properly labeling the filesystem on systems where secure boot is
> > > > supported, but not enabled on the platform.  Only when secure boot is
> > > > enabled, should these IMA appraise modes be disabled.
> > > > 
> > > > This patch removes the compile time dependency and makes it a runtime
> > > > decision, based on the secure boot state of that platform.
> > > > 
> > > 
> > > Sounds good to me.
> > > 
> > > > <snip>
> > > > 
> > > > > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > > > > index a9649b04b9f1..884de471b38a 100644
> > > > > --- a/security/integrity/ima/ima_appraise.c
> > > > > +++ b/security/integrity/ima/ima_appraise.c
> > > > > @@ -19,6 +19,11 @@
> > > > >  static int __init default_appraise_setup(c
> > > > 
> > > > > har *str)
> > > > >  {
> > > > >  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > > > > +	if (arch_ima_get_secureboot()) {
> > > > > +		pr_info("appraise boot param ignored: secure boot enabled");
> > > > 
> > > > Instead of a generic statement, is it possible to include the actual
> > > > option being denied?  Perhaps something like: "Secure boot enabled,
> > > > ignoring %s boot command line option"
> > > > 
> > > > Mimi
> > > > 
> > > 
> > > Yes, sure.
> > > 
> > 
> > Btw, would it make sense to first make sure we have a valid "str"
> > option and not something random to print?
> >  
> > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > index a9649b04b9f1..1f1175531d3e 100644
> > --- a/security/integrity/ima/ima_appraise.c
> > +++ b/security/integrity/ima/ima_appraise.c
> > @@ -25,6 +25,16 @@ static int __init default_appraise_setup(char *str)
> >                 ima_appraise = IMA_APPRAISE_LOG;
> >         else if (strncmp(str, "fix", 3) == 0)
> >                 ima_appraise = IMA_APPRAISE_FIX;
> > +       else
> > +               pr_info("invalid \"%s\" appraise option");
> > +
> > +       if (arch_ima_get_secureboot()) {
> > +               if (!is_ima_appraise_enabled()) {
> > +                       pr_info("Secure boot enabled: ignoring ima_appraise=%s boot parameter option",
> > +                               str);
> > +                       ima_appraise = IMA_APPRAISE_ENFORCE;
> > +               }
> > +       }
> 
> Providing feedback is probably a good idea.  However, the
> "arch_ima_get_secureboot" test can't come after setting
> "ima_appraise."
> 

Sorry, but I'm not sure if I got the reason to why it can't be done
after: would it be basically to prevent any further processing about
ima_appraise as a matter of security principle? Or maybe to keep the
dependency between secureboot and bootparam truly strict? 

Or are there something else I'm missing?

> Mimi
> 
> >  #endif
> >         return 1;
> >  }
> > 
> > 
> > The "else" there I think would make sense as well, at least to give the
> > user some feedback about a possible mispelling of him (as a separate
> > patch).
> > 
> > And "if(!is_ima_appraise_enabled())" would avoid to print anything about
> > "ignoring the option" to the user in case he explicitly set "enforce",
> > which we know there isn't any real effect but is allowed and shown in
> > kernel-parameters.txt.
> > 
> > > Thanks!
> > > 
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > >  	if (strncmp(str, "off", 3) == 0)
> > > > >  		ima_appraise = 0;
> > > > >  	else if (strncmp(str, "log", 3) == 0)
> > > > 
> > > 
> > > -- 
> > > bmeneg 
> > > PGP Key: http://bmeneg.com/pubkey.txt
> > 
> > 
> > 
> 

-- 
bmeneg 
PGP Key: http://bmeneg.com/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH 0/14 v3] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed Olayemi Bolarinwa @ 2020-07-10 21:20 UTC (permalink / raw)
  To: helgaas
  Cc: linux-wireless, linux-pci, QCA ath9k Development, netdev,
	Oliver O'Halloran, Stanislaw Gruszka, linux-acpi, linux-rdma,
	Jason Gunthorpe, Doug Ledford, Jakub Kicinski,
	linux-kernel-mentees, Len Brown, Arnd Bergmann, skhan, bjorn,
	Kalle Valo, Mike Marciniszyn, Sam Bobroff, Greg Kroah-Hartman,
	Dennis Dalessandro, Rafael J. Wysocki, linux-kernel, Lukas Wunner,
	Bolarinwa Olayemi Saheed, linuxppc-dev, David S. Miller

From: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>

v3 CHANGES:
- Split previous PATCH 6/13 into two : PATCH 6/14 and PATCH 7/14
- Fix commit message of PATCH 5/14
- Update Patch numbering and Commit messages
- Add 'Acked by Greg KH' to PATCH 2/14
- Add PATCH version

v2 CHANGES:
- Fix missing comma, causing the email cc error
- Fix typos and numbering errors in commit messages
- Add commit message to 13/13
- Add two more patches: PATCH 3/13 and PATCH 4/13

MERGING:
Patch 7/14 depends on Patch 6/14. However Patch 6/14 has no dependency.
Please, merge PATCH 7/14 only after Patch 6/14.
Patch 14/14 depend on all preceeding patchs. Except for Patch 6/14 and
Patch 7/14, all other patches are independent of one another. Hence,
please merge Patch 14/14 only after other patches in this series have
been merged.

PATCH 6/14:
Make the function set status to "Power On" by default and only set to
Set "Power Off" only if pcie_capability_read_word() is successful and
(slot_ctrl & PCI_EXP_SLTCTL_PCC) == PCI_EXP_SLTCTL_PWR_OFF. 

PATCH 1/14 to 13/14:
Check the return value of pcie_capability_read_*() to ensure success or
confirm failure. While maintaining these functions, this ensures that the
changes in PATCH 14/14 does not introduce any bug. 

PATCH 14/14:
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour if further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
	 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*() 
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of  pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if 
dev->error_state = pci_channel_io_perm_failure (i.e. 
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the 
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Check the return value of pcie_capability_read_dword() to ensure success
and avoid bug as a result of Patch 14/14.
Remove the reset of *val to 0 when pci_read_config_*() fails.

Bolarinwa Olayemi Saheed (14):
  IB/hfi1: Check the return value of pcie_capability_read_*()
  misc: rtsx: Check the return value of pcie_capability_read_*()
  ath9k: Check the return value of pcie_capability_read_*()
  iwlegacy: Check the return value of pcie_capability_read_*()
  PCI: pciehp: Check the return value of pcie_capability_read_*()
  PCI: pciehp: Make "Power On" the default 
  PCI: pciehp: Check the return value of pcie_capability_read_*()
  PCI/ACPI: Check the return value of pcie_capability_read_*()
  PCI: pciehp: Check the return value of pcie_capability_read_*()
  PCI: Check the return value of pcie_capability_read_*()
  PCI/PM: Check return value of pcie_capability_read_*()
  PCI/AER: Check the return value of pcie_capability_read_*()
  PCI/ASPM: Check the return value of pcie_capability_read_*()
  PCI: Remove '*val = 0' from pcie_capability_read_*()

 drivers/net/wireless/ath/ath9k/pci.c         | 5 +++--
 drivers/net/wireless/intel/iwlegacy/common.c | 4 ++--
 drivers/infiniband/hw/hfi1/aspm.c | 7 ++++---
 drivers/misc/cardreader/rts5227.c | 5 +++--
 drivers/misc/cardreader/rts5249.c | 5 +++--
 drivers/misc/cardreader/rts5260.c | 5 +++--
 drivers/misc/cardreader/rts5261.c | 5 +++--
 drivers/pci/pcie/aer.c  |  5 +++--
 drivers/pci/pcie/aspm.c | 33 +++++++++++++++++----------------
 drivers/pci/hotplug/pciehp_hpc.c | 47 ++++++++++++++++----------------
 drivers/pci/pci-acpi.c           | 10 ++++---
 drivers/pci/probe.c              | 29 ++++++++++++--------
 drivers/pci/access.c | 14 --------------
 13 files changed, 87 insertions(+), 87 deletions(-)

-- 
2.18.2

^ permalink raw reply

* [PATCH 14/14 v3] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed Olayemi Bolarinwa @ 2020-07-10 21:20 UTC (permalink / raw)
  To: helgaas
  Cc: linux-wireless, linux-pci, QCA ath9k Development, netdev,
	Oliver O'Halloran, Stanislaw Gruszka, linux-acpi, linux-rdma,
	Jason Gunthorpe, Doug Ledford, Jakub Kicinski,
	linux-kernel-mentees, Len Brown, Arnd Bergmann, skhan, bjorn,
	Kalle Valo, Mike Marciniszyn, Sam Bobroff, Greg Kroah-Hartman,
	Dennis Dalessandro, Rafael J. Wysocki, linux-kernel, Lukas Wunner,
	Bolarinwa Olayemi Saheed, linuxppc-dev, David S. Miller
In-Reply-To: <20200710212026.27136-1-refactormyself@gmail.com>

From: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>

There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour if further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
	 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*() 
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of  pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if 
dev->error_state = pci_channel_io_perm_failure (i.e. 
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the 
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Remove the reset of *val to 0 when pci_read_config_*() fails.

Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>
---
This patch  depends on all of the preceeding patches in this series,
otherwise it will introduce bugs as pointed out in the commit message
of each.
 drivers/pci/access.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 79c4a2ef269a..ec95edbb1ac8 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -413,13 +413,6 @@ int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *val)

 	if (pcie_capability_reg_implemented(dev, pos)) {
 		ret = pci_read_config_word(dev, pci_pcie_cap(dev) + pos, val);
-		/*
-		 * Reset *val to 0 if pci_read_config_word() fails, it may
-		 * have been written as 0xFFFF if hardware error happens
-		 * during pci_read_config_word().
-		 */
-		if (ret)
-			*val = 0;
 		return ret;
 	}

@@ -448,13 +441,6 @@ int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 *val)

 	if (pcie_capability_reg_implemented(dev, pos)) {
 		ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
-		/*
-		 * Reset *val to 0 if pci_read_config_dword() fails, it may
-		 * have been written as 0xFFFFFFFF if hardware error happens
-		 * during pci_read_config_dword().
-		 */
-		if (ret)
-			*val = 0;
 		return ret;
 	}

-- 
2.18.2

^ permalink raw reply related

* Re: generic DMA bypass flag v4
From: Jesper Dangaard Brouer @ 2020-07-10 13:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Borkmann, Alexey Kardashevskiy, Greg Kroah-Hartman,
	Joerg Roedel, brouer, Robin Murphy, linux-kernel, iommu,
	Björn Töpel, linuxppc-dev, Lu Baolu
In-Reply-To: <20200708152449.316476-1-hch@lst.de>

On Wed,  8 Jul 2020 17:24:44 +0200
Christoph Hellwig <hch@lst.de> wrote:

> Note that as-is this breaks the XSK buffer pool, which unfortunately
> poked directly into DMA internals.  A fix for that is already queued
> up in the netdev tree.
> 
> Jesper and XDP gang: this should not regress any performance as
> the dma-direct calls are now inlined into the out of line DMA mapping
> calls.  But if you can verify the performance numbers that would be
> greatly appreciated.

From a superficial review of the patches, they look okay to me. I don't
have time to run a performance benchmark (before I go on vacation).

I hoped Björn could test/benchmark this(?), given (as mentioned) this
also affect XSK / AF_XDP performance.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply

* [PATCH V2 1/2] powerpc/vas: Report proper error code for address translation failure
From: Haren Myneni @ 2020-07-10 23:47 UTC (permalink / raw)
  To: mpe; +Cc: tulioqm, abali, linuxppc-dev, rzinsly


P9 DD2 NX workbook (Table 4-36) says DMA controller uses CC=5
internally for translation fault handling. NX reserves CC=250 for
OS to notify user space when NX encounters address translation
failure on the request buffer. Not an issue in earlier releases
as NX does not get faults on kernel addresses.

This patch defines CSB_CC_FAULT_ADDRESS(250) and updates CSB.CC with
this proper error code for user space.

Signed-off-by: Haren Myneni <haren@linux.ibm.com>

Changes in V2:
- Use CSB_CC_FAULT_ADDRESS instead of CSB_CC_ADDRESS_TRANSLATION
  to distinguish from other error codes.
- Add NX workbook reference in the comment.

---
 Documentation/powerpc/vas-api.rst          | 2 +-
 arch/powerpc/include/asm/icswx.h           | 4 ++++
 arch/powerpc/platforms/powernv/vas-fault.c | 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/powerpc/vas-api.rst
index 1217c2f..788dc83 100644
--- a/Documentation/powerpc/vas-api.rst
+++ b/Documentation/powerpc/vas-api.rst
@@ -213,7 +213,7 @@ request buffers are not in memory. The operating system handles the fault by
 updating CSB with the following data:
 
 	csb.flags = CSB_V;
-	csb.cc = CSB_CC_TRANSLATION;
+	csb.cc = CSB_CC_FAULT_ADDRESS;
 	csb.ce = CSB_CE_TERMINATION;
 	csb.address = fault_address;
 
diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
index 965b1f3..9bc7c58 100644
--- a/arch/powerpc/include/asm/icswx.h
+++ b/arch/powerpc/include/asm/icswx.h
@@ -77,6 +77,10 @@ struct coprocessor_completion_block {
 #define CSB_CC_CHAIN		(37)
 #define CSB_CC_SEQUENCE		(38)
 #define CSB_CC_HW		(39)
+/*
+ * P9 DD NX Workbook 3.2 (Table 4-36): Address translation fault
+ */
+#define	CSB_CC_FAULT_ADDRESS	(250)
 
 #define CSB_SIZE		(0x10)
 #define CSB_ALIGN		CSB_SIZE
diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index 266a6ca..3d21fce 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
 	csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
 
 	memset(&csb, 0, sizeof(csb));
-	csb.cc = CSB_CC_TRANSLATION;
+	csb.cc = CSB_CC_FAULT_ADDRESS;
 	csb.ce = CSB_CE_TERMINATION;
 	csb.cs = 0;
 	csb.count = 0;
-- 
1.8.3.1



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox