LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 19:05 UTC (permalink / raw)
  To: alex
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <7cb2285e-68ba-6827-5e61-e33a4b65ac03@ghiti.fr>

On Tue, 21 Jul 2020 11:36:10 PDT (-0700), alex@ghiti.fr wrote:
> Let's try to make progress here: I add linux-mm in CC to get feedback on
> this patch as it blocks sv48 support too.

Sorry for being slow here.  I haven't replied because I hadn't really fleshed
out the design yet, but just so everyone's on the same page my problems with
this are:

* We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
* On 64-bit systems the VA space around the kernel is precious because it's the
  only place we can place text (modules, BPF, whatever).  If we start putting
  the kernel in the vmalloc space then we either have to pre-allocate a bunch
  of space around it (essentially making it a fixed mapping anyway) or it
  becomes likely that we won't be able to find space for modules as they're
  loaded into running systems.
* Relying on a relocatable kernel for sv48 support introduces a fairly large
  performance hit.

Roughly, my proposal would be to:

* Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
  anywhere and we only have one VA width, so we're not really solving any
  problems with these changes.
* Staticly allocate a 2GiB portion of the VA space for all our text, as its own
  region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
  which would decouple the kernel from the physical memory layout of the system.
  This would have the side effect of sorting out a bunch of bootloader headaches
  that we currently have.
* Sort out how to maintain a linear map as the canonical hole moves around
  between the VA widths without adding a bunch of overhead to the virt2phys and
  friends.  This is probably going to be the trickiest part, but I think if we
  just change the page table code to essentially lie about VAs when an sv39
  system runs an sv48+sv39 kernel we could make it work -- there'd be some
  logical complexity involved, but it would remain fast.

This doesn't solve the problem of virtually relocatable kernels, but it does
let us decouple that from the sv48 stuff.  It also lets us stop relying on a
fixed physical address the kernel is loaded into, which is another thing I
don't like.

I know this may be a more complicated approach, but there aren't any sv48
systems around right now so I just don't see the rush to support them,
particularly when there's a cost to what already exists (for those who haven't
been watching, so far all the sv48 patch sets have imposed a significant
performance penalty on all systems).

>
> Alex
>
> Le 7/9/20 à 7:11 AM, Alex Ghiti a écrit :
>> Hi Palmer,
>>
>> Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :
>>> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), alex@ghiti.fr wrote:
>>>> This is a preparatory patch for relocatable kernel.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address and used to be
>>>> loaded
>>>> physically at the beginning of the main memory. Therefore, we could use
>>>> the linear mapping for the kernel mapping.
>>>>
>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>> and since in the linear mapping, two different virtual addresses cannot
>>>> point to the same physical address, the kernel mapping needs to lie
>>>> outside
>>>> the linear mapping.
>>>
>>> I know it's been a while, but I keep opening this up to review it and
>>> just
>>> can't get over how ugly it is to put the kernel's linear map in the
>>> vmalloc
>>> region.
>>>
>>> I guess I don't understand why this is necessary at all.
>>> Specifically: why
>>> can't we just relocate the kernel within the linear map?  That would
>>> let the
>>> bootloader put the kernel wherever it wants, modulo the physical
>>> memory size we
>>> support.  We'd need to handle the regions that are coupled to the
>>> kernel's
>>> execution address, but we could just put them in an explicit memory
>>> region
>>> which is what we should probably be doing anyway.
>>
>> Virtual relocation in the linear mapping requires to move the kernel
>> physically too. Zong implemented this physical move in its KASLR RFC
>> patchset, which is cumbersome since finding an available physical spot
>> is harder than just selecting a virtual range in the vmalloc range.
>>
>> In addition, having the kernel mapping in the linear mapping prevents
>> the use of hugepage for the linear mapping resulting in performance loss
>> (at least for the GB that encompasses the kernel).
>>
>> Why do you find this "ugly" ? The vmalloc region is just a bunch of
>> available virtual addresses to whatever purpose we want, and as noted by
>> Zong, arm64 uses the same scheme.
>>
>>>
>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>> allocations from the kernel could get all the +-2GB window around the
>>>> kernel which would prevent new modules and BPF programs to be loaded.
>>>
>>> Well, that's not enough to make sure this doesn't happen -- it's just
>>> enough to
>>> make sure it doesn't happen very quickily.  That's the same boat we're
>>> already
>>> in, though, so it's not like it's worse.
>>
>> Indeed, that's not worse, I haven't found a way to reserve vmalloc area
>> without actually allocating it.
>>
>>>
>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>> Reviewed-by: Zong Li <zong.li@sifive.com>
>>>> ---
>>>>  arch/riscv/boot/loader.lds.S     |  3 +-
>>>>  arch/riscv/include/asm/page.h    | 10 +++++-
>>>>  arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
>>>>  arch/riscv/kernel/head.S         |  3 +-
>>>>  arch/riscv/kernel/module.c       |  4 +--
>>>>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>>  arch/riscv/mm/physaddr.c         |  2 +-
>>>>  8 files changed, 88 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>>  /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>>  #include <asm/page.h>
>>>> +#include <asm/pgtable.h>
>>>>
>>>>  OUTPUT_ARCH(riscv)
>>>>  ENTRY(_start)
>>>>
>>>>  SECTIONS
>>>>  {
>>>> -    . = PAGE_OFFSET;
>>>> +    . = KERNEL_LINK_ADDR;
>>>>
>>>>      .payload : {
>>>>          *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h
>>>> b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>>  #ifdef CONFIG_MMU
>>>>  extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>>  extern unsigned long pfn_base;
>>>>  #define ARCH_PFN_OFFSET        (pfn_base)
>>>>  #else
>>>>  #define va_pa_offset        0
>>>> +#define va_kernel_pa_offset    0
>>>>  #define ARCH_PFN_OFFSET        (PAGE_OFFSET >> PAGE_SHIFT)
>>>>  #endif /* CONFIG_MMU */
>>>>
>>>>  extern unsigned long max_low_pfn;
>>>>  extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) +
>>>> va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x)    ((unsigned long)(x) -
>>>> va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x)    \
>>>> +    ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x)        \
>>>> +    (((x) >= PAGE_OFFSET) ?        \
>>>> +        linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>>  #ifdef CONFIG_DEBUG_VIRTUAL
>>>>  extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h
>>>> b/arch/riscv/include/asm/pgtable.h
>>>> index 35b60035b6b0..94ef3b49dfb6 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,29 @@
>>>>
>>>>  #include <asm/pgtable-bits.h>
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> -#include <asm/page.h>
>>>> -#include <asm/tlbflush.h>
>>>> -#include <linux/mm_types.h>
>>>> -
>>>> -#ifdef CONFIG_MMU
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR    PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR    PAGE_OFFSET
>>>> +#else
>>>> +/*
>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range
>>>> around
>>>> + * the kernel.
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR    (VMALLOC_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR    KERNEL_VIRT_ADDR
>>>
>>> At a bare minimum this is going to make a mess of the 32-bit port, as
>>> non-relocatable kernels are now going to get linked at 1GiB which is
>>> where user
>>> code is supposed to live.  That's an easy fix, though, as the 32-bit
>>> stuff
>>> doesn't need any module address restrictions.
>>
>> Indeed, I will take a look at that.
>>
>>>
>>>>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>> -#define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END    (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START    PFN_ALIGN((unsigned long)&_end)
>>>> +#define BPF_JIT_REGION_END    (BPF_JIT_REGION_START +
>>>> BPF_JIT_REGION_SIZE)
>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START    BPF_JIT_REGION_END
>>>> +#define VMALLOC_MODULE_END    (((unsigned long)&_start & PAGE_MASK)
>>>> + SZ_2G)
>>>> +#endif
>>>>
>>>>  /*
>>>>   * Roughly size the vmemmap space to be large enough to fit enough
>>>> @@ -57,9 +63,16 @@
>>>>  #define FIXADDR_SIZE     PGDIR_SIZE
>>>>  #endif
>>>>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>> -
>>>>  #endif
>>>>
>>>> +#ifndef __ASSEMBLY__
>>>> +
>>>> +/* Page Upper Directory not used in RISC-V */
>>>> +#include <asm-generic/pgtable-nopud.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/tlbflush.h>
>>>> +#include <linux/mm_types.h>
>>>> +
>>>>  #ifdef CONFIG_64BIT
>>>>  #include <asm/pgtable-64.h>
>>>>  #else
>>>> @@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page
>>>> *page, int numpages, int enabl
>>>>
>>>>  #define kern_addr_valid(addr)   (1) /* FIXME */
>>>>
>>>> +extern char _start[];
>>>>  extern void *dtb_early_va;
>>>>  void setup_bootmem(void);
>>>>  void paging_init(void);
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index 98a406474e7d..8f5bb7731327 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>  #ifdef CONFIG_MMU
>>>>  relocate:
>>>>      /* Relocate return address */
>>>> -    li a1, PAGE_OFFSET
>>>> +    la a1, kernel_virt_addr
>>>> +    REG_L a1, 0(a1)
>>>>      la a2, _start
>>>>      sub a1, a1, a2
>>>>      add ra, ra, a1
>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>> --- a/arch/riscv/kernel/module.c
>>>> +++ b/arch/riscv/kernel/module.c
>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const
>>>> char *strtab,
>>>>  }
>>>>
>>>>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>> -#define VMALLOC_MODULE_START \
>>>> -     max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>  void *module_alloc(unsigned long size)
>>>>  {
>>>>      return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>> -                    VMALLOC_END, GFP_KERNEL,
>>>> +                    VMALLOC_MODULE_END, GFP_KERNEL,
>>>>                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>>                      __builtin_return_address(0));
>>>>  }
>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S
>>>> b/arch/riscv/kernel/vmlinux.lds.S
>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>> @@ -4,7 +4,8 @@
>>>>   * Copyright (C) 2017 SiFive
>>>>   */
>>>>
>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>> +#include <asm/pgtable.h>
>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>  #include <asm/vmlinux.lds.h>
>>>>  #include <asm/page.h>
>>>>  #include <asm/cache.h>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index 736de6c8739f..71da78914645 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -22,6 +22,9 @@
>>>>
>>>>  #include "../kernel/head.h"
>>>>
>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>> +
>>>>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>                              __page_aligned_bss;
>>>>  EXPORT_SYMBOL(empty_zero_page);
>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>  }
>>>>
>>>>  #ifdef CONFIG_MMU
>>>> +/* Offset between linear mapping virtual address and kernel load
>>>> address */
>>>>  unsigned long va_pa_offset;
>>>>  EXPORT_SYMBOL(va_pa_offset);
>>>> +/* Offset between kernel mapping virtual address and kernel load
>>>> address */
>>>> +unsigned long va_kernel_pa_offset;
>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>  unsigned long pfn_base;
>>>>  EXPORT_SYMBOL(pfn_base);
>>>>
>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>      if (mmu_enabled)
>>>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>
>>>> -    pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>> +    pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>      BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>      return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>  }
>>>> @@ -372,14 +379,30 @@ static uintptr_t __init
>>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>>  #error "setup_vm() is called from head.S before relocate so it
>>>> should not use absolute addressing."
>>>>  #endif
>>>>
>>>> +static uintptr_t load_pa, load_sz;
>>>> +
>>>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t
>>>> map_size)
>>>> +{
>>>> +    uintptr_t va, end_va;
>>>> +
>>>> +    end_va = kernel_virt_addr + load_sz;
>>>> +    for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>> +        create_pgd_mapping(pgdir, va,
>>>> +                   load_pa + (va - kernel_virt_addr),
>>>> +                   map_size, PAGE_KERNEL_EXEC);
>>>> +}
>>>> +
>>>>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>  {
>>>>      uintptr_t va, end_va;
>>>> -    uintptr_t load_pa = (uintptr_t)(&_start);
>>>> -    uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>      uintptr_t map_size = best_map_size(load_pa,
>>>> MAX_EARLY_MAPPING_SIZE);
>>>>
>>>> +    load_pa = (uintptr_t)(&_start);
>>>> +    load_sz = (uintptr_t)(&_end) - load_pa;
>>>> +
>>>>      va_pa_offset = PAGE_OFFSET - load_pa;
>>>> +    va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>> +
>>>>      pfn_base = PFN_DOWN(load_pa);
>>>>
>>>>      /*
>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>      /* Setup trampoline PGD and PMD */
>>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                 (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>> -    create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>> +    create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>  #else
>>>>      /* Setup trampoline PGD */
>>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                 load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>  #endif
>>>>
>>>>      /*
>>>> -     * Setup early PGD covering entire kernel which will allows
>>>> +     * Setup early PGD covering entire kernel which will allow
>>>>       * us to reach paging_init(). We map all memory banks later
>>>>       * in setup_vm_final() below.
>>>>       */
>>>> -    end_va = PAGE_OFFSET + load_sz;
>>>> -    for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>> -        create_pgd_mapping(early_pg_dir, va,
>>>> -                   load_pa + (va - PAGE_OFFSET),
>>>> -                   map_size, PAGE_KERNEL_EXEC);
>>>> +    create_kernel_page_table(early_pg_dir, map_size);
>>>>
>>>>      /* Create fixed mapping for early FDT parsing */
>>>>      end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>      uintptr_t va, map_size;
>>>>      phys_addr_t pa, start, end;
>>>>      struct memblock_region *reg;
>>>> +    static struct vm_struct vm_kernel = { 0 };
>>>>
>>>>      /* Set mmu_enabled flag */
>>>>      mmu_enabled = true;
>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>          for (pa = start; pa < end; pa += map_size) {
>>>>              va = (uintptr_t)__va(pa);
>>>>              create_pgd_mapping(swapper_pg_dir, va, pa,
>>>> -                       map_size, PAGE_KERNEL_EXEC);
>>>> +                       map_size, PAGE_KERNEL);
>>>>          }
>>>>      }
>>>>
>>>> +    /* Map the kernel */
>>>> +    create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>> +
>>>> +    /* Reserve the vmalloc area occupied by the kernel */
>>>> +    vm_kernel.addr = (void *)kernel_virt_addr;
>>>> +    vm_kernel.phys_addr = load_pa;
>>>> +    vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>>>> +    vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>> +    vm_kernel.caller = __builtin_return_address(0);
>>>> +
>>>> +    vm_area_add_early(&vm_kernel);
>>>> +
>>>>      /* Clear fixmap PTE and PMD mappings */
>>>>      clear_fixmap(FIX_PTE);
>>>>      clear_fixmap(FIX_PMD);
>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>> --- a/arch/riscv/mm/physaddr.c
>>>> +++ b/arch/riscv/mm/physaddr.c
>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>
>>>>  phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>  {
>>>> -    unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>> +    unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>      unsigned long kernel_end = (unsigned long)_end;
>>>>
>>>>      /*
>>
>> Alex

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Alex Ghiti @ 2020-07-21 18:36 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <d7e3cbb7-c12a-bce2-f1db-c336d15f74bd@ghiti.fr>

Let's try to make progress here: I add linux-mm in CC to get feedback on 
this patch as it blocks sv48 support too.

Alex

Le 7/9/20 à 7:11 AM, Alex Ghiti a écrit :
> Hi Palmer,
> 
> Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :
>> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), alex@ghiti.fr wrote:
>>> This is a preparatory patch for relocatable kernel.
>>>
>>> The kernel used to be linked at PAGE_OFFSET address and used to be 
>>> loaded
>>> physically at the beginning of the main memory. Therefore, we could use
>>> the linear mapping for the kernel mapping.
>>>
>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>> and since in the linear mapping, two different virtual addresses cannot
>>> point to the same physical address, the kernel mapping needs to lie 
>>> outside
>>> the linear mapping.
>>
>> I know it's been a while, but I keep opening this up to review it and 
>> just
>> can't get over how ugly it is to put the kernel's linear map in the 
>> vmalloc
>> region.
>>
>> I guess I don't understand why this is necessary at all.  
>> Specifically: why
>> can't we just relocate the kernel within the linear map?  That would 
>> let the
>> bootloader put the kernel wherever it wants, modulo the physical 
>> memory size we
>> support.  We'd need to handle the regions that are coupled to the 
>> kernel's
>> execution address, but we could just put them in an explicit memory 
>> region
>> which is what we should probably be doing anyway.
> 
> Virtual relocation in the linear mapping requires to move the kernel 
> physically too. Zong implemented this physical move in its KASLR RFC 
> patchset, which is cumbersome since finding an available physical spot 
> is harder than just selecting a virtual range in the vmalloc range.
> 
> In addition, having the kernel mapping in the linear mapping prevents 
> the use of hugepage for the linear mapping resulting in performance loss 
> (at least for the GB that encompasses the kernel).
> 
> Why do you find this "ugly" ? The vmalloc region is just a bunch of 
> available virtual addresses to whatever purpose we want, and as noted by 
> Zong, arm64 uses the same scheme.
> 
>>
>>> In addition, because modules and BPF must be close to the kernel (inside
>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>> placed at the beginning of the vmalloc zone since other vmalloc
>>> allocations from the kernel could get all the +-2GB window around the
>>> kernel which would prevent new modules and BPF programs to be loaded.
>>
>> Well, that's not enough to make sure this doesn't happen -- it's just 
>> enough to
>> make sure it doesn't happen very quickily.  That's the same boat we're 
>> already
>> in, though, so it's not like it's worse.
> 
> Indeed, that's not worse, I haven't found a way to reserve vmalloc area 
> without actually allocating it.
> 
>>
>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>> Reviewed-by: Zong Li <zong.li@sifive.com>
>>> ---
>>>  arch/riscv/boot/loader.lds.S     |  3 +-
>>>  arch/riscv/include/asm/page.h    | 10 +++++-
>>>  arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
>>>  arch/riscv/kernel/head.S         |  3 +-
>>>  arch/riscv/kernel/module.c       |  4 +--
>>>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>  arch/riscv/mm/physaddr.c         |  2 +-
>>>  8 files changed, 88 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>> index 47a5003c2e28..62d94696a19c 100644
>>> --- a/arch/riscv/boot/loader.lds.S
>>> +++ b/arch/riscv/boot/loader.lds.S
>>> @@ -1,13 +1,14 @@
>>>  /* SPDX-License-Identifier: GPL-2.0 */
>>>
>>>  #include <asm/page.h>
>>> +#include <asm/pgtable.h>
>>>
>>>  OUTPUT_ARCH(riscv)
>>>  ENTRY(_start)
>>>
>>>  SECTIONS
>>>  {
>>> -    . = PAGE_OFFSET;
>>> +    . = KERNEL_LINK_ADDR;
>>>
>>>      .payload : {
>>>          *(.payload)
>>> diff --git a/arch/riscv/include/asm/page.h 
>>> b/arch/riscv/include/asm/page.h
>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>> --- a/arch/riscv/include/asm/page.h
>>> +++ b/arch/riscv/include/asm/page.h
>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>
>>>  #ifdef CONFIG_MMU
>>>  extern unsigned long va_pa_offset;
>>> +extern unsigned long va_kernel_pa_offset;
>>>  extern unsigned long pfn_base;
>>>  #define ARCH_PFN_OFFSET        (pfn_base)
>>>  #else
>>>  #define va_pa_offset        0
>>> +#define va_kernel_pa_offset    0
>>>  #define ARCH_PFN_OFFSET        (PAGE_OFFSET >> PAGE_SHIFT)
>>>  #endif /* CONFIG_MMU */
>>>
>>>  extern unsigned long max_low_pfn;
>>>  extern unsigned long min_low_pfn;
>>> +extern unsigned long kernel_virt_addr;
>>>
>>>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + 
>>> va_pa_offset))
>>> -#define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>>> +#define linear_mapping_va_to_pa(x)    ((unsigned long)(x) - 
>>> va_pa_offset)
>>> +#define kernel_mapping_va_to_pa(x)    \
>>> +    ((unsigned long)(x) - va_kernel_pa_offset)
>>> +#define __va_to_pa_nodebug(x)        \
>>> +    (((x) >= PAGE_OFFSET) ?        \
>>> +        linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>
>>>  #ifdef CONFIG_DEBUG_VIRTUAL
>>>  extern phys_addr_t __virt_to_phys(unsigned long x);
>>> diff --git a/arch/riscv/include/asm/pgtable.h 
>>> b/arch/riscv/include/asm/pgtable.h
>>> index 35b60035b6b0..94ef3b49dfb6 100644
>>> --- a/arch/riscv/include/asm/pgtable.h
>>> +++ b/arch/riscv/include/asm/pgtable.h
>>> @@ -11,23 +11,29 @@
>>>
>>>  #include <asm/pgtable-bits.h>
>>>
>>> -#ifndef __ASSEMBLY__
>>> -
>>> -/* Page Upper Directory not used in RISC-V */
>>> -#include <asm-generic/pgtable-nopud.h>
>>> -#include <asm/page.h>
>>> -#include <asm/tlbflush.h>
>>> -#include <linux/mm_types.h>
>>> -
>>> -#ifdef CONFIG_MMU
>>> +#ifndef CONFIG_MMU
>>> +#define KERNEL_VIRT_ADDR    PAGE_OFFSET
>>> +#define KERNEL_LINK_ADDR    PAGE_OFFSET
>>> +#else
>>> +/*
>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range 
>>> around
>>> + * the kernel.
>>> + */
>>> +#define KERNEL_VIRT_ADDR    (VMALLOC_END - SZ_2G + 1)
>>> +#define KERNEL_LINK_ADDR    KERNEL_VIRT_ADDR
>>
>> At a bare minimum this is going to make a mess of the 32-bit port, as
>> non-relocatable kernels are now going to get linked at 1GiB which is 
>> where user
>> code is supposed to live.  That's an easy fix, though, as the 32-bit 
>> stuff
>> doesn't need any module address restrictions.
> 
> Indeed, I will take a look at that.
> 
>>
>>>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>
>>>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>> -#define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>> -#define BPF_JIT_REGION_END    (VMALLOC_END)
>>> +#define BPF_JIT_REGION_START    PFN_ALIGN((unsigned long)&_end)
>>> +#define BPF_JIT_REGION_END    (BPF_JIT_REGION_START + 
>>> BPF_JIT_REGION_SIZE)
>>> +
>>> +#ifdef CONFIG_64BIT
>>> +#define VMALLOC_MODULE_START    BPF_JIT_REGION_END
>>> +#define VMALLOC_MODULE_END    (((unsigned long)&_start & PAGE_MASK) 
>>> + SZ_2G)
>>> +#endif
>>>
>>>  /*
>>>   * Roughly size the vmemmap space to be large enough to fit enough
>>> @@ -57,9 +63,16 @@
>>>  #define FIXADDR_SIZE     PGDIR_SIZE
>>>  #endif
>>>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>> -
>>>  #endif
>>>
>>> +#ifndef __ASSEMBLY__
>>> +
>>> +/* Page Upper Directory not used in RISC-V */
>>> +#include <asm-generic/pgtable-nopud.h>
>>> +#include <asm/page.h>
>>> +#include <asm/tlbflush.h>
>>> +#include <linux/mm_types.h>
>>> +
>>>  #ifdef CONFIG_64BIT
>>>  #include <asm/pgtable-64.h>
>>>  #else
>>> @@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page 
>>> *page, int numpages, int enabl
>>>
>>>  #define kern_addr_valid(addr)   (1) /* FIXME */
>>>
>>> +extern char _start[];
>>>  extern void *dtb_early_va;
>>>  void setup_bootmem(void);
>>>  void paging_init(void);
>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>> index 98a406474e7d..8f5bb7731327 100644
>>> --- a/arch/riscv/kernel/head.S
>>> +++ b/arch/riscv/kernel/head.S
>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>  #ifdef CONFIG_MMU
>>>  relocate:
>>>      /* Relocate return address */
>>> -    li a1, PAGE_OFFSET
>>> +    la a1, kernel_virt_addr
>>> +    REG_L a1, 0(a1)
>>>      la a2, _start
>>>      sub a1, a1, a2
>>>      add ra, ra, a1
>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>> --- a/arch/riscv/kernel/module.c
>>> +++ b/arch/riscv/kernel/module.c
>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const 
>>> char *strtab,
>>>  }
>>>
>>>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>> -#define VMALLOC_MODULE_START \
>>> -     max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>  void *module_alloc(unsigned long size)
>>>  {
>>>      return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>> -                    VMALLOC_END, GFP_KERNEL,
>>> +                    VMALLOC_MODULE_END, GFP_KERNEL,
>>>                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>                      __builtin_return_address(0));
>>>  }
>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S 
>>> b/arch/riscv/kernel/vmlinux.lds.S
>>> index 0339b6bbe11a..a9abde62909f 100644
>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>> @@ -4,7 +4,8 @@
>>>   * Copyright (C) 2017 SiFive
>>>   */
>>>
>>> -#define LOAD_OFFSET PAGE_OFFSET
>>> +#include <asm/pgtable.h>
>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>  #include <asm/vmlinux.lds.h>
>>>  #include <asm/page.h>
>>>  #include <asm/cache.h>
>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>> index 736de6c8739f..71da78914645 100644
>>> --- a/arch/riscv/mm/init.c
>>> +++ b/arch/riscv/mm/init.c
>>> @@ -22,6 +22,9 @@
>>>
>>>  #include "../kernel/head.h"
>>>
>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>> +
>>>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>                              __page_aligned_bss;
>>>  EXPORT_SYMBOL(empty_zero_page);
>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>  }
>>>
>>>  #ifdef CONFIG_MMU
>>> +/* Offset between linear mapping virtual address and kernel load 
>>> address */
>>>  unsigned long va_pa_offset;
>>>  EXPORT_SYMBOL(va_pa_offset);
>>> +/* Offset between kernel mapping virtual address and kernel load 
>>> address */
>>> +unsigned long va_kernel_pa_offset;
>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>  unsigned long pfn_base;
>>>  EXPORT_SYMBOL(pfn_base);
>>>
>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>      if (mmu_enabled)
>>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>
>>> -    pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>> +    pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>      BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>      return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>  }
>>> @@ -372,14 +379,30 @@ static uintptr_t __init 
>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>  #error "setup_vm() is called from head.S before relocate so it 
>>> should not use absolute addressing."
>>>  #endif
>>>
>>> +static uintptr_t load_pa, load_sz;
>>> +
>>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t 
>>> map_size)
>>> +{
>>> +    uintptr_t va, end_va;
>>> +
>>> +    end_va = kernel_virt_addr + load_sz;
>>> +    for (va = kernel_virt_addr; va < end_va; va += map_size)
>>> +        create_pgd_mapping(pgdir, va,
>>> +                   load_pa + (va - kernel_virt_addr),
>>> +                   map_size, PAGE_KERNEL_EXEC);
>>> +}
>>> +
>>>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>  {
>>>      uintptr_t va, end_va;
>>> -    uintptr_t load_pa = (uintptr_t)(&_start);
>>> -    uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>      uintptr_t map_size = best_map_size(load_pa, 
>>> MAX_EARLY_MAPPING_SIZE);
>>>
>>> +    load_pa = (uintptr_t)(&_start);
>>> +    load_sz = (uintptr_t)(&_end) - load_pa;
>>> +
>>>      va_pa_offset = PAGE_OFFSET - load_pa;
>>> +    va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>> +
>>>      pfn_base = PFN_DOWN(load_pa);
>>>
>>>      /*
>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>      /* Setup trampoline PGD and PMD */
>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>                 (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>> -    create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>> +    create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>  #else
>>>      /* Setup trampoline PGD */
>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>                 load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>  #endif
>>>
>>>      /*
>>> -     * Setup early PGD covering entire kernel which will allows
>>> +     * Setup early PGD covering entire kernel which will allow
>>>       * us to reach paging_init(). We map all memory banks later
>>>       * in setup_vm_final() below.
>>>       */
>>> -    end_va = PAGE_OFFSET + load_sz;
>>> -    for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>> -        create_pgd_mapping(early_pg_dir, va,
>>> -                   load_pa + (va - PAGE_OFFSET),
>>> -                   map_size, PAGE_KERNEL_EXEC);
>>> +    create_kernel_page_table(early_pg_dir, map_size);
>>>
>>>      /* Create fixed mapping for early FDT parsing */
>>>      end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>      uintptr_t va, map_size;
>>>      phys_addr_t pa, start, end;
>>>      struct memblock_region *reg;
>>> +    static struct vm_struct vm_kernel = { 0 };
>>>
>>>      /* Set mmu_enabled flag */
>>>      mmu_enabled = true;
>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>          for (pa = start; pa < end; pa += map_size) {
>>>              va = (uintptr_t)__va(pa);
>>>              create_pgd_mapping(swapper_pg_dir, va, pa,
>>> -                       map_size, PAGE_KERNEL_EXEC);
>>> +                       map_size, PAGE_KERNEL);
>>>          }
>>>      }
>>>
>>> +    /* Map the kernel */
>>> +    create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>> +
>>> +    /* Reserve the vmalloc area occupied by the kernel */
>>> +    vm_kernel.addr = (void *)kernel_virt_addr;
>>> +    vm_kernel.phys_addr = load_pa;
>>> +    vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>>> +    vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>> +    vm_kernel.caller = __builtin_return_address(0);
>>> +
>>> +    vm_area_add_early(&vm_kernel);
>>> +
>>>      /* Clear fixmap PTE and PMD mappings */
>>>      clear_fixmap(FIX_PTE);
>>>      clear_fixmap(FIX_PMD);
>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>> --- a/arch/riscv/mm/physaddr.c
>>> +++ b/arch/riscv/mm/physaddr.c
>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>
>>>  phys_addr_t __phys_addr_symbol(unsigned long x)
>>>  {
>>> -    unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>> +    unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>      unsigned long kernel_end = (unsigned long)_end;
>>>
>>>      /*
> 
> Alex

^ permalink raw reply

* [powerpc:next] BUILD SUCCESS 9a77c4a0a12597c661be374b8d566516c0341570
From: kernel test robot @ 2020-07-21 18:37 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next
branch HEAD: 9a77c4a0a12597c661be374b8d566516c0341570  powerpc/prom: Enable Radix GTSE in cpu pa-features

elapsed time: 1700m

configs tested: 117
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm                              allyesconfig
arm                              allmodconfig
arm                               allnoconfig
arm64                            allyesconfig
arm64                               defconfig
arm64                            allmodconfig
arm64                             allnoconfig
c6x                                 defconfig
c6x                        evmc6474_defconfig
sh                               alldefconfig
mips                         bigsur_defconfig
arm                           sama5_defconfig
arm                           omap1_defconfig
mips                            e55_defconfig
arm                         shannon_defconfig
sparc64                           allnoconfig
arm                      footbridge_defconfig
s390                             alldefconfig
arm                            zeus_defconfig
sh                ecovec24-romimage_defconfig
arm                       netwinder_defconfig
mips                        nlm_xlr_defconfig
powerpc                        cell_defconfig
s390                          debug_defconfig
arm                          pxa3xx_defconfig
m68k                        m5407c3_defconfig
sh                          sdk7780_defconfig
c6x                         dsk6455_defconfig
arm                           h5000_defconfig
arm                         s3c6400_defconfig
mips                       rbtx49xx_defconfig
arm                            u300_defconfig
ia64                             alldefconfig
i386                              allnoconfig
i386                             allyesconfig
i386                                defconfig
i386                              debian-10.3
ia64                             allmodconfig
ia64                                defconfig
ia64                              allnoconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                              allnoconfig
m68k                           sun3_defconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
nios2                            allyesconfig
openrisc                            defconfig
c6x                              allyesconfig
c6x                               allnoconfig
openrisc                         allyesconfig
nds32                               defconfig
nds32                             allnoconfig
csky                             allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
h8300                            allmodconfig
xtensa                              defconfig
arc                                 defconfig
arc                              allyesconfig
sh                               allmodconfig
sh                                allnoconfig
microblaze                        allnoconfig
mips                             allyesconfig
mips                              allnoconfig
mips                             allmodconfig
parisc                            allnoconfig
parisc                              defconfig
parisc                           allyesconfig
parisc                           allmodconfig
powerpc                          allyesconfig
powerpc                          rhel-kconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
powerpc                             defconfig
i386                 randconfig-a001-20200719
i386                 randconfig-a006-20200719
i386                 randconfig-a002-20200719
i386                 randconfig-a005-20200719
i386                 randconfig-a003-20200719
i386                 randconfig-a004-20200719
i386                 randconfig-a015-20200719
i386                 randconfig-a011-20200719
i386                 randconfig-a016-20200719
i386                 randconfig-a012-20200719
i386                 randconfig-a013-20200719
i386                 randconfig-a014-20200719
x86_64               randconfig-a005-20200719
x86_64               randconfig-a002-20200719
x86_64               randconfig-a006-20200719
x86_64               randconfig-a001-20200719
x86_64               randconfig-a003-20200719
x86_64               randconfig-a004-20200719
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
s390                             allyesconfig
s390                              allnoconfig
s390                             allmodconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
sparc64                             defconfig
sparc64                          allyesconfig
sparc64                          allmodconfig
x86_64                    rhel-7.6-kselftests
x86_64                               rhel-8.3
x86_64                                  kexec
x86_64                                   rhel
x86_64                                    lkp
x86_64                              fedora-25

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH v6] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Mimi Zohar @ 2020-07-21 17:26 UTC (permalink / raw)
  To: Bruno Meneguele
  Cc: linux-s390, Nayna, erichte, nayna, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <20200720153841.GG10323@glitch>

On Mon, 2020-07-20 at 12:38 -0300, Bruno Meneguele wrote:
> On Mon, Jul 20, 2020 at 10:56:55AM -0400, Mimi Zohar wrote:
> > On Mon, 2020-07-20 at 10:40 -0400, Nayna wrote:
> > > On 7/13/20 12:48 PM, Bruno Meneguele wrote:
> > > > The IMA_APPRAISE_BOOTPARAM config allows enabling different "ima_appraise="
> > > > modes - log, fix, enforce - at run time, but not when IMA architecture
> > > > specific policies are enabled.  This prevents properly labeling the
> > > > filesystem on systems where secure boot is supported, but not enabled on the
> > > > platform.  Only when secure boot is actually enabled should these IMA
> > > > appraise modes be disabled.
> > > >
> > > > This patch removes the compile time dependency and makes it a runtime
> > > > decision, based on the secure boot state of that platform.
> > > >
> > > > Test results as follows:
> > > >
> > > > -> x86-64 with secure boot enabled
> > > >
> > > > [    0.015637] Kernel command line: <...> ima_policy=appraise_tcb ima_appraise=fix
> > > > [    0.015668] ima: Secure boot enabled: ignoring ima_appraise=fix boot parameter option
> > > >
> > 
> > Is it common to have two colons in the same line?  Is the colon being
> > used as a delimiter when parsing the kernel logs?  Should the second
> > colon be replaced with a hyphen?  (No need to repost.  I'll fix it
> > up.)
> >  
> 
> AFAICS it has been used without any limitations, e.g:
> 
> PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
> clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
> microcode: CPU0: patch_level=0x08701013
> Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7
> ...
> 
> I'd say we're fine using it.

Ok.  FYI, it's now in next-integrity.

Mimi

^ permalink raw reply

* [PATCH v4 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <20200721153708.89057-1-psampat@linux.ibm.com>

Replace the variable name from using "pnv_first_spr_loss_level" to
"deep_spr_loss_state".

pnv_first_spr_loss_level is supposed to be the earliest state that
has OPAL_PM_LOSE_FULL_CONTEXT set, in other places the kernel uses the
"deep" states as terminology. Hence renaming the variable to be coherent
to its semantics.

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/idle.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 642abf0b8329..28462d59a8e1 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -48,7 +48,7 @@ static bool default_stop_found;
  * First stop state levels when SPR and TB loss can occur.
  */
 static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+static u64 deep_spr_loss_state = MAX_STOP_STATE + 1;
 
 /*
  * psscr value and mask of the deepest stop idle state.
@@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 		  */
 		mmcr0		= mfspr(SPRN_MMCR0);
 	}
-	if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
+	if ((psscr & PSSCR_RL_MASK) >= deep_spr_loss_state) {
 		sprs.lpcr	= mfspr(SPRN_LPCR);
 		sprs.hfscr	= mfspr(SPRN_HFSCR);
 		sprs.fscr	= mfspr(SPRN_FSCR);
@@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 	 * just always test PSSCR for SPR/TB state loss.
 	 */
 	pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
-	if (likely(pls < pnv_first_spr_loss_level)) {
+	if (likely(pls < deep_spr_loss_state)) {
 		if (sprs_saved)
 			atomic_stop_thread_idle();
 		goto out;
@@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void)
 	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 	 */
 	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-	pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+	deep_spr_loss_state = MAX_STOP_STATE + 1;
 	for (i = 0; i < nr_pnv_idle_states; i++) {
 		int err;
 		struct pnv_idle_states_t *state = &pnv_idle_states[i];
@@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void)
 			pnv_first_tb_loss_level = psscr_rl;
 
 		if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
-		     (pnv_first_spr_loss_level > psscr_rl))
-			pnv_first_spr_loss_level = psscr_rl;
+		     (deep_spr_loss_state > psscr_rl))
+			deep_spr_loss_state = psscr_rl;
 
 		/*
 		 * The idle code does not deal with TB loss occurring
@@ -1111,8 +1111,8 @@ static void __init pnv_power9_idle_init(void)
 		 * compatibility.
 		 */
 		if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
-		     (pnv_first_spr_loss_level > psscr_rl))
-			pnv_first_spr_loss_level = psscr_rl;
+		     (deep_spr_loss_state > psscr_rl))
+			deep_spr_loss_state = psscr_rl;
 
 		err = validate_psscr_val_mask(&state->psscr_val,
 					      &state->psscr_mask,
@@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void)
 	}
 
 	pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%llx\n",
-		pnv_first_spr_loss_level);
+		deep_spr_loss_state);
 
 	pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%llx\n",
 		pnv_first_tb_loss_level);
-- 
2.25.4


^ permalink raw reply related

* [PATCH v4 3/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <20200721153708.89057-1-psampat@linux.ibm.com>

POWER9 onwards the support for the registers HID1, HID4, HID5 has been
receded.
Although mfspr on the above registers worked in Power9, In Power10
simulator is unrecognized. Moving their assignment under the
check for machines lower than Power9

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/platforms/powernv/idle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 28462d59a8e1..92098d6106be 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
 	 */
 	uint64_t lpcr_val	= mfspr(SPRN_LPCR);
 	uint64_t hid0_val	= mfspr(SPRN_HID0);
-	uint64_t hid1_val	= mfspr(SPRN_HID1);
-	uint64_t hid4_val	= mfspr(SPRN_HID4);
-	uint64_t hid5_val	= mfspr(SPRN_HID5);
 	uint64_t hmeer_val	= mfspr(SPRN_HMEER);
 	uint64_t msr_val = MSR_IDLE;
 	uint64_t psscr_val = pnv_deepest_stop_psscr_val;
@@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
 
 			/* Only p8 needs to set extra HID regiters */
 			if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+				uint64_t hid1_val = mfspr(SPRN_HID1);
+				uint64_t hid4_val = mfspr(SPRN_HID4);
+				uint64_t hid5_val = mfspr(SPRN_HID5);
 
 				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
 				if (rc != 0)
-- 
2.25.4


^ permalink raw reply related

* [PATCH v4 1/3] powerpc/powernv/idle: Replace CPU features check with PVR check
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <20200721153708.89057-1-psampat@linux.ibm.com>

As the idle framework's architecture is incomplete, hence instead of
checking for just the processor type advertised in the device tree CPU
features; check for the Processor Version Register (PVR) so that finer
granularity can be leveraged while making processor checks.

Hence, making the PVR check on the outer level function, subsequently in
the hierarchy keeping the CPU_FTR_ARCH_300 check intact as it is a
faster check to do because of static branches

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/idle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 2dd467383a88..642abf0b8329 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void)
 		return;
 	}
 
-	if (cpu_has_feature(CPU_FTR_ARCH_300))
+	if (pvr_version_is(PVR_POWER9))
 		pnv_power9_idle_init();
 
 	for (i = 0; i < nr_pnv_idle_states; i++)
-- 
2.25.4


^ permalink raw reply related

* [PATCH v4 0/3] powernv/idle: Power9 idle cleanup
From: Pratik Rajesh Sampat @ 2020-07-21 15:37 UTC (permalink / raw)
  To: mpe, npiggin, benh, paulus, mikey, ego, svaidy, psampat,
	pratik.r.sampat, linuxppc-dev, linux-kernel

v3: https://lkml.org/lkml/2020/7/17/1093
Changelog v3-->v4:
Based on comments from Nicholas Piggin and Gautham Shenoy,
1. Changed the naming of pnv_first_spr_loss_level from
pnv_first_fullstate_loss_level to deep_spr_loss_state
2. Make the P9 PVR check only on the top level function
pnv_probe_idle_states and let the rest of the checks be DT based because
it is faster to do so

Pratik Rajesh Sampat (3):
  powerpc/powernv/idle: Replace CPU features check with PVR check
  powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
  powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above

 arch/powerpc/platforms/powernv/idle.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

-- 
2.25.4

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-21 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Nicholas Piggin, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <20200721151947.GD10769@hirez.programming.kicks-ass.net>

----- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:
>> 
>> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
>> > 
>> >> That being said, the x86 sync core gap that I imagined could be fixed
>> >> by changing to rq->curr == rq->idle test does not actually exist because
>> >> the global membarrier does not have a sync core option. So fixing the
>> >> exit_lazy_tlb points that this series does *should* fix that. So
>> >> PF_KTHREAD may be less problematic than I thought from implementation
>> >> point of view, only semantics.
>> > 
>> > So I've been trying to figure out where that PF_KTHREAD comes from,
>> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
>> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
>> > 
>> > So the first version:
>> > 
>> >  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
>> > 
>> > appears to unconditionally send the IPI and checks p->mm in the IPI
>> > context, but then v2:
>> > 
>> >  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
>> > 
>> > has the current code. But I've been unable to find the reason the
>> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
>> 
>> Looking back at my inbox, it seems like you are the one who proposed to
>> skip all kthreads:
>> 
>> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net
> 
> I had a feeling it might've been me ;-) I just couldn't find the email.
> 
>> > The comment doesn't really help either; sure we have the whole lazy mm
>> > thing, but that's ->active_mm, not ->mm.
>> > 
>> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
>> > make the remote p->mm test work? Or were we over-eager with the !p->mm
>> > doesn't imply kthread 'cleanups' at the time?
>> 
>> The nice thing about adding back kthreads to the threads considered for
>> membarrier
>> IPI is that it has no observable effect on the user-space ABI. No pre-existing
>> kthread
>> rely on this, and we just provide an additional guarantee for future kthread
>> implementations.
>> 
>> > Also, I just realized, I still have a fix for use_mm() now
>> > kthread_use_mm() that seems to have been lost.
>> 
>> I suspect we need to at least document the memory barriers in kthread_use_mm and
>> kthread_unuse_mm to state that they are required by membarrier if we want to
>> ipi kthreads as well.
> 
> Right, so going by that email you found it was mostly a case of being
> lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
> any other bits that might be needed, covering kthreads should be
> possible.
> 
> No objections from me for making it so.

I'm OK on making membarrier cover kthreads using mm as well, provided we
audit kthread_{,un}use_mm() to make sure the proper barriers are in place
after setting task->mm and before clearing it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Peter Zijlstra @ 2020-07-21 15:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Nicholas Piggin, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <616209816.22376.1595344513051.JavaMail.zimbra@efficios.com>

On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:
> 
> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> > 
> >> That being said, the x86 sync core gap that I imagined could be fixed
> >> by changing to rq->curr == rq->idle test does not actually exist because
> >> the global membarrier does not have a sync core option. So fixing the
> >> exit_lazy_tlb points that this series does *should* fix that. So
> >> PF_KTHREAD may be less problematic than I thought from implementation
> >> point of view, only semantics.
> > 
> > So I've been trying to figure out where that PF_KTHREAD comes from,
> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> > 
> > So the first version:
> > 
> >  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
> > 
> > appears to unconditionally send the IPI and checks p->mm in the IPI
> > context, but then v2:
> > 
> >  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
> > 
> > has the current code. But I've been unable to find the reason the
> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
> 
> Looking back at my inbox, it seems like you are the one who proposed to
> skip all kthreads: 
> 
> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net

I had a feeling it might've been me ;-) I just couldn't find the email.

> > The comment doesn't really help either; sure we have the whole lazy mm
> > thing, but that's ->active_mm, not ->mm.
> > 
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
> > make the remote p->mm test work? Or were we over-eager with the !p->mm
> > doesn't imply kthread 'cleanups' at the time?
> 
> The nice thing about adding back kthreads to the threads considered for membarrier
> IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread
> rely on this, and we just provide an additional guarantee for future kthread
> implementations.
> 
> > Also, I just realized, I still have a fix for use_mm() now
> > kthread_use_mm() that seems to have been lost.
> 
> I suspect we need to at least document the memory barriers in kthread_use_mm and
> kthread_unuse_mm to state that they are required by membarrier if we want to
> ipi kthreads as well.

Right, so going by that email you found it was mostly a case of being
lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
any other bits that might be needed, covering kthreads should be
possible.

No objections from me for making it so.

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-21 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Nicholas Piggin, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <20200721150656.GN119549@hirez.programming.kicks-ass.net>

----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> So I've been trying to figure out where that PF_KTHREAD comes from,
> commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> 
> So the first version:
> 
>  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
> 
> appears to unconditionally send the IPI and checks p->mm in the IPI
> context, but then v2:
> 
>  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
> 
> has the current code. But I've been unable to find the reason the
> 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

Looking back at my inbox, it seems like you are the one who proposed to
skip all kthreads: 

https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net

> 
> The comment doesn't really help either; sure we have the whole lazy mm
> thing, but that's ->active_mm, not ->mm.
> 
> Possibly it is because {,un}use_mm() do not have sufficient barriers to
> make the remote p->mm test work? Or were we over-eager with the !p->mm
> doesn't imply kthread 'cleanups' at the time?

The nice thing about adding back kthreads to the threads considered for membarrier
IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread
rely on this, and we just provide an additional guarantee for future kthread
implementations.

> Also, I just realized, I still have a fix for use_mm() now
> kthread_use_mm() that seems to have been lost.

I suspect we need to at least document the memory barriers in kthread_use_mm and
kthread_unuse_mm to state that they are required by membarrier if we want to
ipi kthreads as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH -next] PCI: rpadlpar: Make some functions static
From: Wei Yongjun @ 2020-07-21 15:17 UTC (permalink / raw)
  To: Hulk Robot, Michael Ellerman, Tyrel Datwyler, Bjorn Helgaas
  Cc: linux-pci, Wei Yongjun, linuxppc-dev

The sparse tool report build warnings as follows:

drivers/pci/hotplug/rpadlpar_core.c:355:5: warning:
 symbol 'dlpar_remove_pci_slot' was not declared. Should it be static?
drivers/pci/hotplug/rpadlpar_core.c:461:12: warning:
 symbol 'rpadlpar_io_init' was not declared. Should it be static?
drivers/pci/hotplug/rpadlpar_core.c:473:6: warning:
 symbol 'rpadlpar_io_exit' was not declared. Should it be static?

Those functions are not used outside of this file, so marks them
static.
Also mark rpadlpar_io_exit() as __exit.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/pci/hotplug/rpadlpar_core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
index c5eb509c72f0..f979b7098acf 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -352,7 +352,7 @@ static int dlpar_remove_vio_slot(char *drc_name, struct device_node *dn)
  * -ENODEV		Not a valid drc_name
  * -EIO			Internal PCI Error
  */
-int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
+static int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
 {
 	struct pci_bus *bus;
 	struct slot *slot;
@@ -458,7 +458,7 @@ static inline int is_dlpar_capable(void)
 	return (int) (rc != RTAS_UNKNOWN_SERVICE);
 }
 
-int __init rpadlpar_io_init(void)
+static int __init rpadlpar_io_init(void)
 {
 
 	if (!is_dlpar_capable()) {
@@ -470,7 +470,7 @@ int __init rpadlpar_io_init(void)
 	return dlpar_sysfs_init();
 }
 
-void rpadlpar_io_exit(void)
+static void __exit rpadlpar_io_exit(void)
 {
 	dlpar_sysfs_exit();
 }


^ permalink raw reply related

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: peterz @ 2020-07-21 15:06 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Andy Lutomirski, linux-mm, Mathieu Desnoyers, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <1595324577.x3bf55tpgu.astroid@bobo.none>

On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:

> That being said, the x86 sync core gap that I imagined could be fixed 
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

So I've been trying to figure out where that PF_KTHREAD comes from,
commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.

So the first version:

  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com

appears to unconditionally send the IPI and checks p->mm in the IPI
context, but then v2:

  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com

has the current code. But I've been unable to find the reason the
'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

The comment doesn't really help either; sure we have the whole lazy mm
thing, but that's ->active_mm, not ->mm.

Possibly it is because {,un}use_mm() do not have sufficient barriers to
make the remote p->mm test work? Or were we over-eager with the !p->mm
doesn't imply kthread 'cleanups' at the time?

Also, I just realized, I still have a fix for use_mm() now
kthread_use_mm() that seems to have been lost.

^ permalink raw reply

* Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Nicholas Piggin @ 2020-07-21 15:00 UTC (permalink / raw)
  To: kvm-ppc, linuxppc-dev, Ram Pai
  Cc: sukadev, aik, bharata, sathnaga, ldufour, bauerman, david
In-Reply-To: <1594888333-9370-1-git-send-email-linuxram@us.ibm.com>

Excerpts from Ram Pai's message of July 16, 2020 6:32 pm:
> An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> fault is delivered to the ultravisor.

Why not a ucall if you're paraultravizing it anyway?

> 
> Unfortunately the Ultravisor has no correct-way to fetch the faulting
> instruction. The PEF architecture does not allow Ultravisor to enable MMU
> translation. Walking the two level page table to read the instruction can race
> with other vcpus modifying the SVM's process scoped page table.
> 
> This problem can be correctly solved with some help from the kernel.
> 
> Capture the faulting instruction in SPRG0 register, before executing the
> faulting instruction. This enables the ultravisor to easily procure the
> faulting instruction and emulate it.
> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/include/asm/io.h | 85 ++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 75 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index 635969b..7ef663d 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -35,6 +35,7 @@
>  #include <asm/mmu.h>
>  #include <asm/ppc_asm.h>
>  #include <asm/pgtable.h>
> +#include <asm/svm.h>
>  
>  #define SIO_CONFIG_RA	0x398
>  #define SIO_CONFIG_RD	0x399
> @@ -105,34 +106,98 @@
>  static inline u##size name(const volatile u##size __iomem *addr)	\
>  {									\
>  	u##size ret;							\
> -	__asm__ __volatile__("sync;"#insn" %0,%y1;twi 0,%0,0;isync"	\
> -		: "=r" (ret) : "Z" (*addr) : "memory");			\
> +	if (is_secure_guest()) {					\
> +		__asm__ __volatile__("mfsprg0 %3;"			\
> +				"lnia %2;"				\
> +				"ld %2,12(%2);"				\
> +				"mtsprg0 %2;"				\
> +				"sync;"					\
> +				#insn" %0,%y1;"				\
> +				"twi 0,%0,0;"				\
> +				"isync;"				\
> +				"mtsprg0 %3"				\

We prefer to use mtspr in new code, and the nia offset should be 
calculated with a label I think "(1f - .)(%2)" should work.

SPRG usage is documented in arch/powerpc/include/asm/reg.h if this 
goes past RFC stage. Looks like SPRG0 probably could be used for this.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 0/2] Selftest for cpuidle latency measurement
From: Daniel Lezcano @ 2020-07-21 14:57 UTC (permalink / raw)
  To: Pratik Rajesh Sampat, rjw, mpe, benh, paulus, srivatsa, shuah,
	npiggin, ego, svaidy, pratik.r.sampat, linux-pm, linuxppc-dev,
	linux-kernel, linux-kselftest
In-Reply-To: <20200721124300.65615-1-psampat@linux.ibm.com>

On 21/07/2020 14:42, Pratik Rajesh Sampat wrote:
> v2: https://lkml.org/lkml/2020/7/17/369
> Changelog v2-->v3
> Based on comments from Gautham R. Shenoy adding the following in the
> selftest,
> 1. Grepping modules to determine if already loaded
> 2. Wrapper to enable/disable states
> 3. Preventing any operation/test on offlined CPUs 
> ---
> 
> The patch series introduces a mechanism to measure wakeup latency for
> IPI and timer based interrupts
> The motivation behind this series is to find significant deviations
> behind advertised latency and resisdency values

Why do you want to measure for the timer and the IPI ? Whatever the
source of the wakeup, the exit latency remains the same, no ?

Is all this kernel-ish code really needed ?

What about using a highres periodic timer and make it expires every eg.
50ms x 2400, so it is 120 secondes and measure the deviation. Repeat the
operation for each idle states.

And in order to make it as much accurate as possible, set the program
affinity on a CPU and isolate this one by preventing other processes to
be scheduled on and migrate the interrupts on the other CPUs.

That will be all userspace code, no?

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply

* Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Gautham R Shenoy @ 2020-07-21 14:55 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, mikey, pratik.r.sampat, linux-kernel, Pratik Sampat, paulus,
	linuxppc-dev
In-Reply-To: <1595341835.8ad8mjl9hm.astroid@bobo.none>

Hi,

On Wed, Jul 22, 2020 at 12:37:41AM +1000, Nicholas Piggin wrote:
> Excerpts from Pratik Sampat's message of July 21, 2020 8:29 pm:
> > 
> > 
> > On 20/07/20 5:27 am, Nicholas Piggin wrote:
> >> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
> >>> Replace the variable name from using "pnv_first_spr_loss_level" to
> >>> "pnv_first_fullstate_loss_level".
> >>>
> >>> As pnv_first_spr_loss_level is supposed to be the earliest state that
> >>> has OPAL_PM_LOSE_FULL_CONTEXT set, however as shallow states too loose
> >>> SPR values, render an incorrect terminology.
> >> It also doesn't lose "full" state at this loss level though. From the
> >> architecture it could be called "hv state loss level", but in POWER10
> >> even that is not strictly true.
> >>
> > Right. Just discovered that deep stop states won't loose full state
> > P10 onwards.
> > Would it better if we rename it as "pnv_all_spr_loss_state" instead
> > so that it stays generic enough while being semantically coherent?
> 
> It doesn't lose all SPRs. It does physically, but for Linux it appears 
> at least timebase SPRs are retained and that's mostly how it's 
> documented.
> 
> Maybe there's no really good name for it, but we do call it "deep" stop 
> in other places, you could call it deep_spr_loss maybe. I don't mind too 
> much though, whatever Gautham is happy with.

Nick's suggestion is fine by me. We can call it deep_spr_loss_state.

> 
> Thanks,
> Nick

--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Nicholas Piggin @ 2020-07-21 14:37 UTC (permalink / raw)
  To: benh, ego, linux-kernel, linuxppc-dev, mikey, mpe, paulus,
	pratik.r.sampat, Pratik Sampat, svaidy
In-Reply-To: <81dcf34e-870d-b3a1-7876-a6a2f0b37d1f@linux.ibm.com>

Excerpts from Pratik Sampat's message of July 21, 2020 8:29 pm:
> 
> 
> On 20/07/20 5:27 am, Nicholas Piggin wrote:
>> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
>>> Replace the variable name from using "pnv_first_spr_loss_level" to
>>> "pnv_first_fullstate_loss_level".
>>>
>>> As pnv_first_spr_loss_level is supposed to be the earliest state that
>>> has OPAL_PM_LOSE_FULL_CONTEXT set, however as shallow states too loose
>>> SPR values, render an incorrect terminology.
>> It also doesn't lose "full" state at this loss level though. From the
>> architecture it could be called "hv state loss level", but in POWER10
>> even that is not strictly true.
>>
> Right. Just discovered that deep stop states won't loose full state
> P10 onwards.
> Would it better if we rename it as "pnv_all_spr_loss_state" instead
> so that it stays generic enough while being semantically coherent?

It doesn't lose all SPRs. It does physically, but for Linux it appears 
at least timebase SPRs are retained and that's mostly how it's 
documented.

Maybe there's no really good name for it, but we do call it "deep" stop 
in other places, you could call it deep_spr_loss maybe. I don't mind too 
much though, whatever Gautham is happy with.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
From: Waiman Long @ 2020-07-21 14:36 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra
  Cc: linux-arch, Will Deacon, Boqun Feng, linux-kernel, kvm-ppc,
	virtualization, Ingo Molnar, linuxppc-dev
In-Reply-To: <1595327263.lk78cqolxm.astroid@bobo.none>

On 7/21/20 7:08 AM, Nicholas Piggin wrote:
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index b752d34517b3..26d8766a1106 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
>   
>   #else
>   extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
> +extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
>   #endif
>   
>   static __always_inline void queued_spin_lock(struct qspinlock *lock)
>   {
> -	u32 val = 0;
> -
> -	if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
> +	atomic_t *a = &lock->val;
> +	u32 val;
> +
> +again:
> +	asm volatile(
> +"1:\t"	PPC_LWARX(%0,0,%1,1) "	# queued_spin_lock			\n"
> +	: "=&r" (val)
> +	: "r" (&a->counter)
> +	: "memory");
> +
> +	if (likely(val == 0)) {
> +		asm_volatile_goto(
> +	"	stwcx.	%0,0,%1							\n"
> +	"	bne-	%l[again]						\n"
> +	"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +		:
> +		: "r"(_Q_LOCKED_VAL), "r" (&a->counter)
> +		: "cr0", "memory"
> +		: again );
>   		return;
> -
> -	queued_spin_lock_slowpath(lock, val);
> +	}
> +
> +	if (likely(val == _Q_LOCKED_VAL)) {
> +		asm_volatile_goto(
> +	"	stwcx.	%0,0,%1							\n"
> +	"	bne-	%l[again]						\n"
> +		:
> +		: "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
> +		: "cr0", "memory"
> +		: again );
> +
> +		atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
> +//		clear_pending_set_locked(lock);
> +		WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
> +//		lockevent_inc(lock_pending);
> +		return;
> +	}
> +
> +	if (val == _Q_PENDING_VAL) {
> +		int cnt = _Q_PENDING_LOOPS;
> +		val = atomic_cond_read_relaxed(a,
> +					       (VAL != _Q_PENDING_VAL) || !cnt--);
> +		if (!(val & ~_Q_LOCKED_MASK))
> +			goto again;
> +        }
> +	queued_spin_lock_slowpath_queue(lock);
>   }
>   #define queued_spin_lock queued_spin_lock
>   

I am fine with the arch code override some part of the generic code.


> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index b9515fcc9b29..ebcc6f5d99d5 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -287,10 +287,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>   
>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
>   #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
> +#define queued_spin_lock_slowpath_queue	native_queued_spin_lock_slowpath_queue
>   #endif
>   
>   #endif /* _GEN_PV_LOCK_SLOWPATH */
>   
> +void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
> +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
> +
>   /**
>    * queued_spin_lock_slowpath - acquire the queued spinlock
>    * @lock: Pointer to queued spinlock structure
> @@ -314,12 +318,6 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>    */
>   void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   {
> -	struct mcs_spinlock *prev, *next, *node;
> -	u32 old, tail;
> -	int idx;
> -
> -	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> -
>   	if (pv_enabled())
>   		goto pv_queue;
>   
> @@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   queue:
>   	lockevent_inc(lock_slowpath);
>   pv_queue:
> +	__queued_spin_lock_slowpath_queue(lock);
> +}
> +EXPORT_SYMBOL(queued_spin_lock_slowpath);
> +
> +void queued_spin_lock_slowpath_queue(struct qspinlock *lock)
> +{
> +	lockevent_inc(lock_slowpath);
> +	__queued_spin_lock_slowpath_queue(lock);
> +}
> +EXPORT_SYMBOL(queued_spin_lock_slowpath_queue);
> +
> +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock)
> +{
> +	struct mcs_spinlock *prev, *next, *node;
> +	u32 old, tail;
> +	u32 val;
> +	int idx;
> +
> +	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
>   	node = this_cpu_ptr(&qnodes[0].mcs);
>   	idx = node->count++;
>   	tail = encode_tail(smp_processor_id(), idx);
> @@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   	 */
>   	__this_cpu_dec(qnodes[0].mcs.count);
>   }
> -EXPORT_SYMBOL(queued_spin_lock_slowpath);
>   
>   /*
>    * Generate the paravirt code for queued_spin_unlock_slowpath().
>
I would prefer to extract out the pending bit handling code out into a 
separate helper function which can be overridden by the arch code 
instead of breaking the slowpath into 2 pieces.

Cheers,
Longman


^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Nicholas Piggin @ 2020-07-21 14:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, Peter Zijlstra, x86,
	linux-kernel, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <470490605.22057.1595337118562.JavaMail.zimbra@efficios.com>

Excerpts from Mathieu Desnoyers's message of July 21, 2020 11:11 pm:
> ----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote:
> 
>> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
> [...]
>> 
>> Yeah you're probably right in this case I think. Quite likely most kernel
>> tasks that asynchronously write to user memory would at least have some
>> kind of producer-consumer barriers.
>> 
>> But is that restriction of all async modifications documented and enforced
>> anywhere?
>> 
>>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>>> still ordering requirement there for membarrier,
>>> 
>>> Please provide an example case with memory accesses via kthread_use_mm where
>>> ordering matters to support your concern.
>> 
>> I think the concern Andy raised with io_uring was less a specific
>> problem he saw and more a general concern that we have these memory
>> accesses which are not synchronized with membarrier.
>> 
>>>> so I really think
>>>> it's a fragile interface with no real way for the user to know how
>>>> kernel threads may use its mm for any particular reason, so membarrier
>>>> should synchronize all possible kernel users as well.
>>> 
>>> I strongly doubt so, but perhaps something should be clarified in the
>>> documentation
>>> if you have that feeling.
>> 
>> I'd rather go the other way and say if you have reasoning or numbers for
>> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
>> then we could think about keeping this subtlety with appropriate
>> documentation added, otherwise we can just kill it and remove all doubt.
>> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> Today, the membarrier global expedited command explicitly skips kernel threads,
> but it happens that membarrier private expedited considers those with the
> same mm as target for the IPI.
> 
> So we already implement a semantic which differs between private and global
> expedited membarriers.

Which is not a good thing.

> This can be explained in part by the fact that
> kthread_use_mm was introduced after 4.16, where the most recent membarrier
> commands where introduced. It seems that the effect on membarrier was not
> considered when kthread_use_mm was introduced.

No it was just renamed, it used to be called use_mm and has been in the 
kernel for ~ever.

That you hadn't considered this is actually weight for my point, which 
is that there's so much subtle behaviour that's easy to miss we're 
better off with simpler and fewer special cases until it's proven 
they're needed. Not the other way around.

> 
> Looking at membarrier(2) documentation, it states that IPIs are only sent to
> threads belonging to the same process as the calling thread. If my understanding
> of the notion of process is correct, this should rule out sending the IPI to
> kernel threads, given they are not "part" of the same process, only borrowing
> the mm. But I agree that the distinction is moot, and should be clarified.

It does if you read it in a user-hostile legalistic way. The reality is 
userspace shouldn't and can't know about how the kernel might implement 
functionality.

> Without a clear use-case to justify adding a constraint on membarrier, I am
> tempted to simply clarify documentation of current membarrier commands,
> stating clearly that they are not guaranteed to affect kernel threads. Then,
> if we have a compelling use-case to implement a different behavior which covers
> kthreads, this could be added consistently across membarrier commands with a
> flag (or by adding new commands).
> 
> Does this approach make sense ?

The other position is without a clear use case for PF_KTHREAD, seeing as 
async kernel accesses had not been considered before now, we limit the 
optimision to only skipping the idle thread. I think that makes more 
sense (unless you have a reason for PF_KTHREAD but it doesn't seem like 
there is much of one).

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Ravi Bangoria @ 2020-07-21 14:16 UTC (permalink / raw)
  To: Michael Ellerman, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <87d04prmgc.fsf@mpe.ellerman.id.au>



On 7/21/20 7:37 PM, Michael Ellerman wrote:
> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>> On 7/21/20 4:59 PM, Michael Ellerman wrote:
>>> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>>>> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>>>>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>>>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>>>
>>>>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>>>>> 2nd DAWR is supported, otherwise not.
>>>>>>
>>>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>>>> ---
>>>>>>     arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>>>>     arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>>>>     2 files changed, 12 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>>>> index e506d429b1af..3445c86e1f6f 100644
>>>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>>     #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>>>>     #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>>>>     #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>>>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>>>>
>>>>>>     #ifndef __ASSEMBLY__
>>>>>>
>>>>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>>     #define CPU_FTRS_POSSIBLE      \
>>>>>>                (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>>>>                 CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>>> +            CPU_FTR_DAWR1)
>>>>>>     #else
>>>>>>     #define CPU_FTRS_POSSIBLE      \
>>>>>>                (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>>>>                 CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>>>>                 CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>>>>                 CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>>> +            CPU_FTR_DAWR1)
>>>
>>>>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>>>>> into CPU_FTRS_POWER10?
>>>>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>>>>
>>>> I remember a discussion about this with Mikey and we decided to do it
>>>> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
>>>> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
>>>> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
>>>> even when device-tree property is not present or pa-feature bit it not set,
>>>> because we do:
>>>>
>>>>          {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>>>>                  .pvr_mask               = 0xffffffff,
>>>>                  .pvr_value              = 0x0f000006,
>>>>                  .cpu_name               = "POWER10 (architected)",
>>>>                  .cpu_features           = CPU_FTRS_POWER10,
>>>
>>> The pa-features logic will turn it off if the feature bit is not set.
>>>
>>> So you should be able to put it in CPU_FTRS_POWER10.
>>>
>>> See for example CPU_FTR_NOEXECUTE.
>>
>> Ah ok. scan_features() clears the feature if the bit is not set in
>> pa-features. So it should work find for powervm. I'll verify the same
>> thing happens in case of baremetal where we use cpu-features not
>> pa-features. If it works in baremetal as well, will put it in
>> CPU_FTRS_POWER10.
> 
> When we use DT CPU features we don't use CPU_FTRS_POWER10 at all.
> 
> We construct a cpu_spec from scratch with just the base set of features:
> 
> static struct cpu_spec __initdata base_cpu_spec = {
> 	.cpu_name		= NULL,
> 	.cpu_features		= CPU_FTRS_DT_CPU_BASE,
> 
> 
> And then individual features are enabled via the device tree flags.

Ah good. I was under a wrong impression that we use cpu_specs[] for all
the cases. Thanks mpe for explaining in detail :)

Ravi

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Michael Ellerman @ 2020-07-21 14:07 UTC (permalink / raw)
  To: Ravi Bangoria, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <62daa2d1-4e11-dcc1-cb1d-805ee4a156e0@linux.ibm.com>

Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
> On 7/21/20 4:59 PM, Michael Ellerman wrote:
>> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>>> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>>>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>>
>>>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>>>> 2nd DAWR is supported, otherwise not.
>>>>>
>>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>>> ---
>>>>>    arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>>>    arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>>>    2 files changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>>> index e506d429b1af..3445c86e1f6f 100644
>>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>    #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>>>    #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>>>    #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>>>
>>>>>    #ifndef __ASSEMBLY__
>>>>>
>>>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>>               (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>>>                CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>> +            CPU_FTR_DAWR1)
>>>>>    #else
>>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>>               (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>>>                CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>>>                CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>>>                CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>> +            CPU_FTR_DAWR1)
>> 
>>>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>>>> into CPU_FTRS_POWER10?
>>>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>>>
>>> I remember a discussion about this with Mikey and we decided to do it
>>> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
>>> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
>>> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
>>> even when device-tree property is not present or pa-feature bit it not set,
>>> because we do:
>>>
>>>         {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>>>                 .pvr_mask               = 0xffffffff,
>>>                 .pvr_value              = 0x0f000006,
>>>                 .cpu_name               = "POWER10 (architected)",
>>>                 .cpu_features           = CPU_FTRS_POWER10,
>> 
>> The pa-features logic will turn it off if the feature bit is not set.
>> 
>> So you should be able to put it in CPU_FTRS_POWER10.
>> 
>> See for example CPU_FTR_NOEXECUTE.
>
> Ah ok. scan_features() clears the feature if the bit is not set in
> pa-features. So it should work find for powervm. I'll verify the same
> thing happens in case of baremetal where we use cpu-features not
> pa-features. If it works in baremetal as well, will put it in
> CPU_FTRS_POWER10.

When we use DT CPU features we don't use CPU_FTRS_POWER10 at all.

We construct a cpu_spec from scratch with just the base set of features:

static struct cpu_spec __initdata base_cpu_spec = {
	.cpu_name		= NULL,
	.cpu_features		= CPU_FTRS_DT_CPU_BASE,


And then individual features are enabled via the device tree flags.

cheers

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Ravi Bangoria @ 2020-07-21 13:42 UTC (permalink / raw)
  To: Michael Ellerman, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <87mu3trtri.fsf@mpe.ellerman.id.au>



On 7/21/20 4:59 PM, Michael Ellerman wrote:
> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>
>>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>>> 2nd DAWR is supported, otherwise not.
>>>>
>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>> ---
>>>>    arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>>    arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>>    2 files changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>> index e506d429b1af..3445c86e1f6f 100644
>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>>    #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>>    #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>>    #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>>
>>>>    #ifndef __ASSEMBLY__
>>>>
>>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>               (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>>                CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>> +            CPU_FTR_DAWR1)
>>>>    #else
>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>               (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>>                CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>>                CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>>                CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>> +            CPU_FTR_DAWR1)
> 
>>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>>> into CPU_FTRS_POWER10?
>>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>>
>> I remember a discussion about this with Mikey and we decided to do it
>> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
>> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
>> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
>> even when device-tree property is not present or pa-feature bit it not set,
>> because we do:
>>
>>         {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>>                 .pvr_mask               = 0xffffffff,
>>                 .pvr_value              = 0x0f000006,
>>                 .cpu_name               = "POWER10 (architected)",
>>                 .cpu_features           = CPU_FTRS_POWER10,
> 
> The pa-features logic will turn it off if the feature bit is not set.
> 
> So you should be able to put it in CPU_FTRS_POWER10.
> 
> See for example CPU_FTR_NOEXECUTE.

Ah ok. scan_features() clears the feature if the bit is not set in
pa-features. So it should work find for powervm. I'll verify the same
thing happens in case of baremetal where we use cpu-features not
pa-features. If it works in baremetal as well, will put it in
CPU_FTRS_POWER10.

Thanks for the clarification,
Ravi

^ permalink raw reply

* Re: [PATCH v4 09/10] powerpc/watchpoint: Return available watchpoints dynamically
From: Ravi Bangoria @ 2020-07-21 13:33 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, Jordan Niethe,
	oleg, Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa,
	fweisbec, pedromfc, naveen.n.rao, linuxppc-dev, mingo,
	Ravi Bangoria
In-Reply-To: <87k0yxrtex.fsf@mpe.ellerman.id.au>



On 7/21/20 5:06 PM, Michael Ellerman wrote:
> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>> On 7/20/20 9:12 AM, Jordan Niethe wrote:
>>> On Fri, Jul 17, 2020 at 2:11 PM Ravi Bangoria
>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>
>>>> So far Book3S Powerpc supported only one watchpoint. Power10 is
>>>> introducing 2nd DAWR. Enable 2nd DAWR support for Power10.
>>>> Availability of 2nd DAWR will depend on CPU_FTR_DAWR1.
>>>>
>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>> ---
>>>>    arch/powerpc/include/asm/cputable.h      | 4 +++-
>>>>    arch/powerpc/include/asm/hw_breakpoint.h | 5 +++--
>>>>    2 files changed, 6 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>> index 3445c86e1f6f..36a0851a7a9b 100644
>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>> @@ -633,7 +633,9 @@ enum {
>>>>     * Maximum number of hw breakpoint supported on powerpc. Number of
>>>>     * breakpoints supported by actual hw might be less than this.
>>>>     */
>>>> -#define HBP_NUM_MAX    1
>>>> +#define HBP_NUM_MAX    2
>>>> +#define HBP_NUM_ONE    1
>>>> +#define HBP_NUM_TWO    2
> 
>>> I wonder if these defines are necessary - has it any advantage over
>>> just using the literal?
>>
>> No, not really. Initially I had something like:
>>
>> #define HBP_NUM_MAX    2
>> #define HBP_NUM_P8_P9  1
>> #define HBP_NUM_P10    2
>>
>> But then I thought it's also not right. So I made it _ONE and _TWO.
>> Now the function that decides nr watchpoints dynamically (nr_wp_slots)
>> is in different file, I thought to keep it like this so it would be
>> easier to figure out why _MAX is 2.
> 
> I don't think it makes anything clearer.
> 
> I had to stare at it thinking there was some sort of mapping or
> indirection going on, before I realised it's just literally the number
> of breakpoints.
> 
> So please just do:
> 
> static inline int nr_wp_slots(void)
> {
>         return cpu_has_feature(CPU_FTR_DAWR1) ? 2 : 1;
> }
> 
> If you think HBP_NUM_MAX needs explanation then do that with a comment,
> it can refer to nr_wp_slots() if that's helpful.

Agreed. By adding a comment, we can remove those macros. Will change it.

Thanks,
Ravi

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-21 13:11 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, Peter Zijlstra, x86,
	linux-kernel, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <1595324577.x3bf55tpgu.astroid@bobo.none>

----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
[...]
> 
> Yeah you're probably right in this case I think. Quite likely most kernel
> tasks that asynchronously write to user memory would at least have some
> kind of producer-consumer barriers.
> 
> But is that restriction of all async modifications documented and enforced
> anywhere?
> 
>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>> still ordering requirement there for membarrier,
>> 
>> Please provide an example case with memory accesses via kthread_use_mm where
>> ordering matters to support your concern.
> 
> I think the concern Andy raised with io_uring was less a specific
> problem he saw and more a general concern that we have these memory
> accesses which are not synchronized with membarrier.
> 
>>> so I really think
>>> it's a fragile interface with no real way for the user to know how
>>> kernel threads may use its mm for any particular reason, so membarrier
>>> should synchronize all possible kernel users as well.
>> 
>> I strongly doubt so, but perhaps something should be clarified in the
>> documentation
>> if you have that feeling.
> 
> I'd rather go the other way and say if you have reasoning or numbers for
> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
> then we could think about keeping this subtlety with appropriate
> documentation added, otherwise we can just kill it and remove all doubt.
> 
> That being said, the x86 sync core gap that I imagined could be fixed
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

Today, the membarrier global expedited command explicitly skips kernel threads,
but it happens that membarrier private expedited considers those with the
same mm as target for the IPI.

So we already implement a semantic which differs between private and global
expedited membarriers. This can be explained in part by the fact that
kthread_use_mm was introduced after 4.16, where the most recent membarrier
commands where introduced. It seems that the effect on membarrier was not
considered when kthread_use_mm was introduced.

Looking at membarrier(2) documentation, it states that IPIs are only sent to
threads belonging to the same process as the calling thread. If my understanding
of the notion of process is correct, this should rule out sending the IPI to
kernel threads, given they are not "part" of the same process, only borrowing
the mm. But I agree that the distinction is moot, and should be clarified.

Without a clear use-case to justify adding a constraint on membarrier, I am
tempted to simply clarify documentation of current membarrier commands,
stating clearly that they are not guaranteed to affect kernel threads. Then,
if we have a compelling use-case to implement a different behavior which covers
kthreads, this could be added consistently across membarrier commands with a
flag (or by adding new commands).

Does this approach make sense ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH v3 0/2] Selftest for cpuidle latency measurement
From: Pratik Rajesh Sampat @ 2020-07-21 12:42 UTC (permalink / raw)
  To: rjw, daniel.lezcano, mpe, benh, paulus, srivatsa, shuah, npiggin,
	ego, svaidy, pratik.r.sampat, psampat, linux-pm, linuxppc-dev,
	linux-kernel, linux-kselftest

v2: https://lkml.org/lkml/2020/7/17/369
Changelog v2-->v3
Based on comments from Gautham R. Shenoy adding the following in the
selftest,
1. Grepping modules to determine if already loaded
2. Wrapper to enable/disable states
3. Preventing any operation/test on offlined CPUs 
---

The patch series introduces a mechanism to measure wakeup latency for
IPI and timer based interrupts
The motivation behind this series is to find significant deviations
behind advertised latency and resisdency values

To achieve this, we introduce a kernel module and expose its control
knobs through the debugfs interface that the selftests can engage with.

The kernel module provides the following interfaces within
/sys/kernel/debug/latency_test/ for,
1. IPI test:
  ipi_cpu_dest   # Destination CPU for the IPI
  ipi_cpu_src    # Origin of the IPI
  ipi_latency_ns # Measured latency time in ns
2. Timeout test:
  timeout_cpu_src     # CPU on which the timer to be queued
  timeout_expected_ns # Timer duration
  timeout_diff_ns     # Difference of actual duration vs expected timer
To include the module, check option and include as module
kernel hacking -> Cpuidle latency selftests

The selftest inserts the module, disables all the idle states and
enables them one by one testing the following:
1. Keeping source CPU constant, iterates through all the CPUS measuring
   IPI latency for baseline (CPU is busy with
   "cat /dev/random > /dev/null" workload) and the when the CPU is
   allowed to be at rest
2. Iterating through all the CPUs, sending expected timer durations to
   be equivalent to the residency of the the deepest idle state
   enabled and extracting the difference in time between the time of
   wakeup and the expected timer duration

Usage
-----
Can be used in conjuction to the rest of the selftests.
Default Output location in: tools/testing/cpuidle/cpuidle.log

To run this test specifically:
$ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

There are a few optinal arguments too that the script can take
	[-h <help>]
	[-m <location of the module>]
	[-o <location of the output>]

Sample output snippet
---------------------
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU IPI_Latency(ns)
...
0            8         1996
0            9         2125
0           10         1264
0           11         1788
0           12         2045
Baseline Average IPI latency(ns): 1843
---Enabling state: 5---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0            8       621719
0            9       624752
0           10       622218
0           11       623968
0           12       621303
Expected IPI latency(ns): 100000
Observed Average IPI latency(ns): 622792

--Timeout Latency Test--
--Baseline Timeout Latency measurement: CPU Busy--
Wakeup_src Baseline_delay(ns) 
...
8            2249
9            2226
10           2211
11           2183
12           2263
Baseline Average timeout diff(ns): 2226
---Enabling state: 5---
8           10749                   
9           10911                   
10          10912                   
11          12100                   
12          73276                   
Expected timeout(ns): 10000200
Observed Average timeout diff(ns): 23589

Pratik Rajesh Sampat (2):
  cpuidle: Trace IPI based and timer based wakeup latency from idle
    states
  selftest/cpuidle: Add support for cpuidle latency measurement

 drivers/cpuidle/Makefile                   |   1 +
 drivers/cpuidle/test-cpuidle_latency.c     | 150 ++++++++++
 lib/Kconfig.debug                          |  10 +
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/cpuidle/Makefile   |   6 +
 tools/testing/selftests/cpuidle/cpuidle.sh | 310 +++++++++++++++++++++
 tools/testing/selftests/cpuidle/settings   |   1 +
 7 files changed, 479 insertions(+)
 create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
 create mode 100644 tools/testing/selftests/cpuidle/Makefile
 create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
 create mode 100644 tools/testing/selftests/cpuidle/settings

-- 
2.25.4


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox