LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 19:05 UTC (permalink / raw)
  To: alex
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <7cb2285e-68ba-6827-5e61-e33a4b65ac03@ghiti.fr>

On Tue, 21 Jul 2020 11:36:10 PDT (-0700), alex@ghiti.fr wrote:
> Let's try to make progress here: I add linux-mm in CC to get feedback on
> this patch as it blocks sv48 support too.

Sorry for being slow here.  I haven't replied because I hadn't really fleshed
out the design yet, but just so everyone's on the same page my problems with
this are:

* We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
* On 64-bit systems the VA space around the kernel is precious because it's the
  only place we can place text (modules, BPF, whatever).  If we start putting
  the kernel in the vmalloc space then we either have to pre-allocate a bunch
  of space around it (essentially making it a fixed mapping anyway) or it
  becomes likely that we won't be able to find space for modules as they're
  loaded into running systems.
* Relying on a relocatable kernel for sv48 support introduces a fairly large
  performance hit.

Roughly, my proposal would be to:

* Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
  anywhere and we only have one VA width, so we're not really solving any
  problems with these changes.
* Staticly allocate a 2GiB portion of the VA space for all our text, as its own
  region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
  which would decouple the kernel from the physical memory layout of the system.
  This would have the side effect of sorting out a bunch of bootloader headaches
  that we currently have.
* Sort out how to maintain a linear map as the canonical hole moves around
  between the VA widths without adding a bunch of overhead to the virt2phys and
  friends.  This is probably going to be the trickiest part, but I think if we
  just change the page table code to essentially lie about VAs when an sv39
  system runs an sv48+sv39 kernel we could make it work -- there'd be some
  logical complexity involved, but it would remain fast.

This doesn't solve the problem of virtually relocatable kernels, but it does
let us decouple that from the sv48 stuff.  It also lets us stop relying on a
fixed physical address the kernel is loaded into, which is another thing I
don't like.

I know this may be a more complicated approach, but there aren't any sv48
systems around right now so I just don't see the rush to support them,
particularly when there's a cost to what already exists (for those who haven't
been watching, so far all the sv48 patch sets have imposed a significant
performance penalty on all systems).

>
> Alex
>
> Le 7/9/20 à 7:11 AM, Alex Ghiti a écrit :
>> Hi Palmer,
>>
>> Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :
>>> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), alex@ghiti.fr wrote:
>>>> This is a preparatory patch for relocatable kernel.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address and used to be
>>>> loaded
>>>> physically at the beginning of the main memory. Therefore, we could use
>>>> the linear mapping for the kernel mapping.
>>>>
>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>> and since in the linear mapping, two different virtual addresses cannot
>>>> point to the same physical address, the kernel mapping needs to lie
>>>> outside
>>>> the linear mapping.
>>>
>>> I know it's been a while, but I keep opening this up to review it and
>>> just
>>> can't get over how ugly it is to put the kernel's linear map in the
>>> vmalloc
>>> region.
>>>
>>> I guess I don't understand why this is necessary at all.
>>> Specifically: why
>>> can't we just relocate the kernel within the linear map?  That would
>>> let the
>>> bootloader put the kernel wherever it wants, modulo the physical
>>> memory size we
>>> support.  We'd need to handle the regions that are coupled to the
>>> kernel's
>>> execution address, but we could just put them in an explicit memory
>>> region
>>> which is what we should probably be doing anyway.
>>
>> Virtual relocation in the linear mapping requires to move the kernel
>> physically too. Zong implemented this physical move in its KASLR RFC
>> patchset, which is cumbersome since finding an available physical spot
>> is harder than just selecting a virtual range in the vmalloc range.
>>
>> In addition, having the kernel mapping in the linear mapping prevents
>> the use of hugepage for the linear mapping resulting in performance loss
>> (at least for the GB that encompasses the kernel).
>>
>> Why do you find this "ugly" ? The vmalloc region is just a bunch of
>> available virtual addresses to whatever purpose we want, and as noted by
>> Zong, arm64 uses the same scheme.
>>
>>>
>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>> allocations from the kernel could get all the +-2GB window around the
>>>> kernel which would prevent new modules and BPF programs to be loaded.
>>>
>>> Well, that's not enough to make sure this doesn't happen -- it's just
>>> enough to
>>> make sure it doesn't happen very quickily.  That's the same boat we're
>>> already
>>> in, though, so it's not like it's worse.
>>
>> Indeed, that's not worse, I haven't found a way to reserve vmalloc area
>> without actually allocating it.
>>
>>>
>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>> Reviewed-by: Zong Li <zong.li@sifive.com>
>>>> ---
>>>>  arch/riscv/boot/loader.lds.S     |  3 +-
>>>>  arch/riscv/include/asm/page.h    | 10 +++++-
>>>>  arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
>>>>  arch/riscv/kernel/head.S         |  3 +-
>>>>  arch/riscv/kernel/module.c       |  4 +--
>>>>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>>  arch/riscv/mm/physaddr.c         |  2 +-
>>>>  8 files changed, 88 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>>  /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>>  #include <asm/page.h>
>>>> +#include <asm/pgtable.h>
>>>>
>>>>  OUTPUT_ARCH(riscv)
>>>>  ENTRY(_start)
>>>>
>>>>  SECTIONS
>>>>  {
>>>> -    . = PAGE_OFFSET;
>>>> +    . = KERNEL_LINK_ADDR;
>>>>
>>>>      .payload : {
>>>>          *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h
>>>> b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>>  #ifdef CONFIG_MMU
>>>>  extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>>  extern unsigned long pfn_base;
>>>>  #define ARCH_PFN_OFFSET        (pfn_base)
>>>>  #else
>>>>  #define va_pa_offset        0
>>>> +#define va_kernel_pa_offset    0
>>>>  #define ARCH_PFN_OFFSET        (PAGE_OFFSET >> PAGE_SHIFT)
>>>>  #endif /* CONFIG_MMU */
>>>>
>>>>  extern unsigned long max_low_pfn;
>>>>  extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) +
>>>> va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x)    ((unsigned long)(x) -
>>>> va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x)    \
>>>> +    ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x)        \
>>>> +    (((x) >= PAGE_OFFSET) ?        \
>>>> +        linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>>  #ifdef CONFIG_DEBUG_VIRTUAL
>>>>  extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h
>>>> b/arch/riscv/include/asm/pgtable.h
>>>> index 35b60035b6b0..94ef3b49dfb6 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,29 @@
>>>>
>>>>  #include <asm/pgtable-bits.h>
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> -#include <asm/page.h>
>>>> -#include <asm/tlbflush.h>
>>>> -#include <linux/mm_types.h>
>>>> -
>>>> -#ifdef CONFIG_MMU
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR    PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR    PAGE_OFFSET
>>>> +#else
>>>> +/*
>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range
>>>> around
>>>> + * the kernel.
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR    (VMALLOC_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR    KERNEL_VIRT_ADDR
>>>
>>> At a bare minimum this is going to make a mess of the 32-bit port, as
>>> non-relocatable kernels are now going to get linked at 1GiB which is
>>> where user
>>> code is supposed to live.  That's an easy fix, though, as the 32-bit
>>> stuff
>>> doesn't need any module address restrictions.
>>
>> Indeed, I will take a look at that.
>>
>>>
>>>>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>> -#define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END    (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START    PFN_ALIGN((unsigned long)&_end)
>>>> +#define BPF_JIT_REGION_END    (BPF_JIT_REGION_START +
>>>> BPF_JIT_REGION_SIZE)
>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START    BPF_JIT_REGION_END
>>>> +#define VMALLOC_MODULE_END    (((unsigned long)&_start & PAGE_MASK)
>>>> + SZ_2G)
>>>> +#endif
>>>>
>>>>  /*
>>>>   * Roughly size the vmemmap space to be large enough to fit enough
>>>> @@ -57,9 +63,16 @@
>>>>  #define FIXADDR_SIZE     PGDIR_SIZE
>>>>  #endif
>>>>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>> -
>>>>  #endif
>>>>
>>>> +#ifndef __ASSEMBLY__
>>>> +
>>>> +/* Page Upper Directory not used in RISC-V */
>>>> +#include <asm-generic/pgtable-nopud.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/tlbflush.h>
>>>> +#include <linux/mm_types.h>
>>>> +
>>>>  #ifdef CONFIG_64BIT
>>>>  #include <asm/pgtable-64.h>
>>>>  #else
>>>> @@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page
>>>> *page, int numpages, int enabl
>>>>
>>>>  #define kern_addr_valid(addr)   (1) /* FIXME */
>>>>
>>>> +extern char _start[];
>>>>  extern void *dtb_early_va;
>>>>  void setup_bootmem(void);
>>>>  void paging_init(void);
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index 98a406474e7d..8f5bb7731327 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>  #ifdef CONFIG_MMU
>>>>  relocate:
>>>>      /* Relocate return address */
>>>> -    li a1, PAGE_OFFSET
>>>> +    la a1, kernel_virt_addr
>>>> +    REG_L a1, 0(a1)
>>>>      la a2, _start
>>>>      sub a1, a1, a2
>>>>      add ra, ra, a1
>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>> --- a/arch/riscv/kernel/module.c
>>>> +++ b/arch/riscv/kernel/module.c
>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const
>>>> char *strtab,
>>>>  }
>>>>
>>>>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>> -#define VMALLOC_MODULE_START \
>>>> -     max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>  void *module_alloc(unsigned long size)
>>>>  {
>>>>      return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>> -                    VMALLOC_END, GFP_KERNEL,
>>>> +                    VMALLOC_MODULE_END, GFP_KERNEL,
>>>>                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>>                      __builtin_return_address(0));
>>>>  }
>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S
>>>> b/arch/riscv/kernel/vmlinux.lds.S
>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>> @@ -4,7 +4,8 @@
>>>>   * Copyright (C) 2017 SiFive
>>>>   */
>>>>
>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>> +#include <asm/pgtable.h>
>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>  #include <asm/vmlinux.lds.h>
>>>>  #include <asm/page.h>
>>>>  #include <asm/cache.h>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index 736de6c8739f..71da78914645 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -22,6 +22,9 @@
>>>>
>>>>  #include "../kernel/head.h"
>>>>
>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>> +
>>>>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>                              __page_aligned_bss;
>>>>  EXPORT_SYMBOL(empty_zero_page);
>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>  }
>>>>
>>>>  #ifdef CONFIG_MMU
>>>> +/* Offset between linear mapping virtual address and kernel load
>>>> address */
>>>>  unsigned long va_pa_offset;
>>>>  EXPORT_SYMBOL(va_pa_offset);
>>>> +/* Offset between kernel mapping virtual address and kernel load
>>>> address */
>>>> +unsigned long va_kernel_pa_offset;
>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>  unsigned long pfn_base;
>>>>  EXPORT_SYMBOL(pfn_base);
>>>>
>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>      if (mmu_enabled)
>>>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>
>>>> -    pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>> +    pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>      BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>      return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>  }
>>>> @@ -372,14 +379,30 @@ static uintptr_t __init
>>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>>  #error "setup_vm() is called from head.S before relocate so it
>>>> should not use absolute addressing."
>>>>  #endif
>>>>
>>>> +static uintptr_t load_pa, load_sz;
>>>> +
>>>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t
>>>> map_size)
>>>> +{
>>>> +    uintptr_t va, end_va;
>>>> +
>>>> +    end_va = kernel_virt_addr + load_sz;
>>>> +    for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>> +        create_pgd_mapping(pgdir, va,
>>>> +                   load_pa + (va - kernel_virt_addr),
>>>> +                   map_size, PAGE_KERNEL_EXEC);
>>>> +}
>>>> +
>>>>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>  {
>>>>      uintptr_t va, end_va;
>>>> -    uintptr_t load_pa = (uintptr_t)(&_start);
>>>> -    uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>      uintptr_t map_size = best_map_size(load_pa,
>>>> MAX_EARLY_MAPPING_SIZE);
>>>>
>>>> +    load_pa = (uintptr_t)(&_start);
>>>> +    load_sz = (uintptr_t)(&_end) - load_pa;
>>>> +
>>>>      va_pa_offset = PAGE_OFFSET - load_pa;
>>>> +    va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>> +
>>>>      pfn_base = PFN_DOWN(load_pa);
>>>>
>>>>      /*
>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>      /* Setup trampoline PGD and PMD */
>>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                 (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>> -    create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>> +    create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>  #else
>>>>      /* Setup trampoline PGD */
>>>> -    create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +    create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                 load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>  #endif
>>>>
>>>>      /*
>>>> -     * Setup early PGD covering entire kernel which will allows
>>>> +     * Setup early PGD covering entire kernel which will allow
>>>>       * us to reach paging_init(). We map all memory banks later
>>>>       * in setup_vm_final() below.
>>>>       */
>>>> -    end_va = PAGE_OFFSET + load_sz;
>>>> -    for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>> -        create_pgd_mapping(early_pg_dir, va,
>>>> -                   load_pa + (va - PAGE_OFFSET),
>>>> -                   map_size, PAGE_KERNEL_EXEC);
>>>> +    create_kernel_page_table(early_pg_dir, map_size);
>>>>
>>>>      /* Create fixed mapping for early FDT parsing */
>>>>      end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>      uintptr_t va, map_size;
>>>>      phys_addr_t pa, start, end;
>>>>      struct memblock_region *reg;
>>>> +    static struct vm_struct vm_kernel = { 0 };
>>>>
>>>>      /* Set mmu_enabled flag */
>>>>      mmu_enabled = true;
>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>          for (pa = start; pa < end; pa += map_size) {
>>>>              va = (uintptr_t)__va(pa);
>>>>              create_pgd_mapping(swapper_pg_dir, va, pa,
>>>> -                       map_size, PAGE_KERNEL_EXEC);
>>>> +                       map_size, PAGE_KERNEL);
>>>>          }
>>>>      }
>>>>
>>>> +    /* Map the kernel */
>>>> +    create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>> +
>>>> +    /* Reserve the vmalloc area occupied by the kernel */
>>>> +    vm_kernel.addr = (void *)kernel_virt_addr;
>>>> +    vm_kernel.phys_addr = load_pa;
>>>> +    vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>>>> +    vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>> +    vm_kernel.caller = __builtin_return_address(0);
>>>> +
>>>> +    vm_area_add_early(&vm_kernel);
>>>> +
>>>>      /* Clear fixmap PTE and PMD mappings */
>>>>      clear_fixmap(FIX_PTE);
>>>>      clear_fixmap(FIX_PMD);
>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>> --- a/arch/riscv/mm/physaddr.c
>>>> +++ b/arch/riscv/mm/physaddr.c
>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>
>>>>  phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>  {
>>>> -    unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>> +    unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>      unsigned long kernel_end = (unsigned long)_end;
>>>>
>>>>      /*
>>
>> Alex

^ permalink raw reply

* Re: [PATCH v6] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-21 19:38 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: linux-s390, Nayna, erichte, nayna, x86, linux-kernel, stable,
	linux-integrity, linuxppc-dev
In-Reply-To: <1595352376.5311.8.camel@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2046 bytes --]

On Tue, Jul 21, 2020 at 01:26:16PM -0400, Mimi Zohar wrote:
> On Mon, 2020-07-20 at 12:38 -0300, Bruno Meneguele wrote:
> > On Mon, Jul 20, 2020 at 10:56:55AM -0400, Mimi Zohar wrote:
> > > On Mon, 2020-07-20 at 10:40 -0400, Nayna wrote:
> > > > On 7/13/20 12:48 PM, Bruno Meneguele wrote:
> > > > > The IMA_APPRAISE_BOOTPARAM config allows enabling different "ima_appraise="
> > > > > modes - log, fix, enforce - at run time, but not when IMA architecture
> > > > > specific policies are enabled.  This prevents properly labeling the
> > > > > filesystem on systems where secure boot is supported, but not enabled on the
> > > > > platform.  Only when secure boot is actually enabled should these IMA
> > > > > appraise modes be disabled.
> > > > >
> > > > > This patch removes the compile time dependency and makes it a runtime
> > > > > decision, based on the secure boot state of that platform.
> > > > >
> > > > > Test results as follows:
> > > > >
> > > > > -> x86-64 with secure boot enabled
> > > > >
> > > > > [    0.015637] Kernel command line: <...> ima_policy=appraise_tcb ima_appraise=fix
> > > > > [    0.015668] ima: Secure boot enabled: ignoring ima_appraise=fix boot parameter option
> > > > >
> > > 
> > > Is it common to have two colons in the same line?  Is the colon being
> > > used as a delimiter when parsing the kernel logs?  Should the second
> > > colon be replaced with a hyphen?  (No need to repost.  I'll fix it
> > > up.)
> > >  
> > 
> > AFAICS it has been used without any limitations, e.g:
> > 
> > PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
> > clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
> > microcode: CPU0: patch_level=0x08701013
> > Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7
> > ...
> > 
> > I'd say we're fine using it.
> 
> Ok.  FYI, it's now in next-integrity.
> 
> Mimi
> 

Thanks Mimi.

-- 
bmeneg 
PGP Key: http://bmeneg.com/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [powerpc:next-test] BUILD REGRESSION 08b8bb849948ff5e2305d1115ce8bbdd55364a70
From: kernel test robot @ 2020-07-21 20:01 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next-test
branch HEAD: 08b8bb849948ff5e2305d1115ce8bbdd55364a70  powerpc test_emulate_step: move extern declaration to sstep.h

Error/Warning in current branch:

arch/powerpc/include/asm/ppc-opcode.h:277:22: error: called object is not a function or function pointer
arch/powerpc/include/asm/ppc-opcode.h:281:22: error: called object is not a function or function pointer
arch/powerpc/lib/test_emulate_step.c:28:4: error: 'PPC_INST_LWZ' undeclared (first use in this function); did you mean 'PPC_INST_LSWI'?
arch/powerpc/lib/test_emulate_step.c:29:4: error: 'PPC_INST_LWZ' undeclared (first use in this function); did you mean 'PPC_INST_LFD'?

Error/Warning ids grouped by kconfigs:

recent_errors
|-- powerpc-allmodconfig
|   |-- arch-powerpc-include-asm-ppc-opcode.h:error:called-object-is-not-a-function-or-function-pointer
|   `-- arch-powerpc-lib-test_emulate_step.c:error:PPC_INST_LWZ-undeclared-(first-use-in-this-function)
`-- powerpc-allyesconfig
    |-- arch-powerpc-include-asm-ppc-opcode.h:error:called-object-is-not-a-function-or-function-pointer
    `-- arch-powerpc-lib-test_emulate_step.c:error:PPC_INST_LWZ-undeclared-(first-use-in-this-function)

elapsed time: 1784m

configs tested: 97
configs skipped: 2

arm64                            allyesconfig
arm64                               defconfig
arm64                            allmodconfig
arm64                             allnoconfig
arm                                 defconfig
arm                              allyesconfig
arm                              allmodconfig
arm                               allnoconfig
arm                            zeus_defconfig
sh                ecovec24-romimage_defconfig
arm                       netwinder_defconfig
mips                        nlm_xlr_defconfig
powerpc                        cell_defconfig
s390                          debug_defconfig
arm                          pxa3xx_defconfig
m68k                        m5407c3_defconfig
sh                          sdk7780_defconfig
c6x                         dsk6455_defconfig
arm                           h5000_defconfig
i386                             allyesconfig
i386                                defconfig
i386                              debian-10.3
i386                              allnoconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                              allnoconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                              allnoconfig
m68k                           sun3_defconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
nios2                            allyesconfig
openrisc                            defconfig
c6x                              allyesconfig
c6x                               allnoconfig
openrisc                         allyesconfig
nds32                               defconfig
nds32                             allnoconfig
csky                             allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
h8300                            allmodconfig
xtensa                              defconfig
arc                                 defconfig
arc                              allyesconfig
sh                               allmodconfig
sh                                allnoconfig
microblaze                        allnoconfig
mips                             allyesconfig
mips                              allnoconfig
mips                             allmodconfig
parisc                            allnoconfig
parisc                              defconfig
parisc                           allyesconfig
parisc                           allmodconfig
powerpc                          allyesconfig
powerpc                          rhel-kconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
powerpc                             defconfig
i386                 randconfig-a001-20200719
i386                 randconfig-a006-20200719
i386                 randconfig-a002-20200719
i386                 randconfig-a005-20200719
i386                 randconfig-a003-20200719
i386                 randconfig-a004-20200719
x86_64               randconfig-a005-20200719
x86_64               randconfig-a002-20200719
x86_64               randconfig-a006-20200719
x86_64               randconfig-a001-20200719
x86_64               randconfig-a003-20200719
x86_64               randconfig-a004-20200719
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
s390                             allyesconfig
s390                              allnoconfig
s390                             allmodconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
sparc64                             defconfig
sparc64                           allnoconfig
sparc64                          allyesconfig
sparc64                          allmodconfig
x86_64                    rhel-7.6-kselftests
x86_64                               rhel-8.3
x86_64                                  kexec
x86_64                                   rhel
x86_64                                    lkp
x86_64                              fedora-25

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH v2 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
From: Ram Pai @ 2020-07-21 21:28 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linux-kernel, kvm-ppc, bharata, paulus, sukadev, linuxppc-dev,
	bauerman
In-Reply-To: <20200721104202.15727-2-ldufour@linux.ibm.com>

On Tue, Jul 21, 2020 at 12:42:01PM +0200, Laurent Dufour wrote:
> kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
> so move it upper in this file.
> 
> Furthermore it will be interesting to call this function when already
> holding the kvm->arch.uvmem_lock, so prefix the original function with __
> and remove the locking in it, and introduce a wrapper which call that
> function with the lock held.
> 
> There is no functional change.

Reviewed-by: Ram Pai <linuxram@us.ibm.com>

> 
> Cc: Ram Pai <linuxram@us.ibm.com>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---

RP

^ permalink raw reply

* Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Ram Pai @ 2020-07-21 21:37 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linux-kernel, kvm-ppc, bharata, paulus, sukadev, linuxppc-dev,
	bauerman
In-Reply-To: <20200721104202.15727-3-ldufour@linux.ibm.com>

On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:
> When a secure memslot is dropped, all the pages backed in the secure device
> (aka really backed by secure memory by the Ultravisor) should be paged out
> to a normal page. Previously, this was achieved by triggering the page
> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
> 
> This can't work when hot unplugging a memory slot because the memory slot
> is flagged as invalid and gfn_to_pfn() is then not trying to access the
> page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to directly calling it instead of triggering such a mechanism. This
            ^^ call directly instead of triggering..

> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.
> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.
> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> addition, the mmap_sem is help in read mode during that time, not in write
		          ^^ held

> mode since the virual memory layout is not impacted, and
> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> 
> Cc: Ram Pai <linuxram@us.ibm.com>

Reviewed-by: Ram Pai <linuxram@us.ibm.com>

RP

^ permalink raw reply

* Re: [PATCH v4 6/7] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
From: Leonardo Bras @ 2020-07-21 21:39 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Alexey Kardashevskiy, Joel Stanley, Christophe Leroy,
	Thiago Jung Bauermann, Ram Pai, Brian King
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200716071658.467820-7-leobras.c@gmail.com>

On Thu, 2020-07-16 at 04:16 -0300, Leonardo Bras wrote:
> +static void iommu_pseries_table_update(struct pci_dev *dev,
> +                                      struct device_node *pdn)
> +{
> +       const struct dynamic_dma_window_prop *ddw;
> +       struct pci_dn *pci;
> +       int len;
> +
> +       ddw = of_get_property(pdn, DMA64_PROPNAME, &len);
> +       if (!ddw  || len < sizeof(struct dynamic_dma_window_prop))
> +               return;
> +
> +       iommu_table_update(pci->table_group->tables[0], pci->phb->node,
> +                          ddw->liobn, ddw->dma_base, ddw->tce_shift,
> +                          ddw->window_shift);
> +}
> +
>  static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>  {
>         struct device_node *pdn, *dn;
> @@ -1382,6 +1403,7 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
>                 pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
>                 if (pdev->dev.archdata.dma_offset)
>                         return true;
> +               iommu_pseries_table_update(pdev, pdn);
>         }
> 

Noticed a bug in this one: pci is not getting assigned. 
My bad, there must have been a merge error.

Also, I will refactor the function to make use of pdn only, as I can do
pci = PCI_DN(pdn) (I think it's better this way).

Sorry for the buggy patch.

Best regards,


^ permalink raw reply

* Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean
From: Leonardo Bras @ 2020-07-21 22:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Joel Stanley, Christophe Leroy,
	Thiago Jung Bauermann, Ram Pai, Brian King
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <51235292-a571-8792-c693-d0dc6faeb21c@ozlabs.ru>

On Tue, 2020-07-21 at 14:59 +1000, Alexey Kardashevskiy wrote:
> 
> On 16/07/2020 17:16, Leonardo Bras wrote:
> > Move the part of iommu_table_free() that does struct iommu_table cleaning
> > into iommu_table_clean, so we can invoke it separately.
> > 
> > This new function is useful for cleaning struct iommu_table before
> > initializing it again with a new DMA window, without having it freed and
> > allocated again.
> > 
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > ---
> >  arch/powerpc/kernel/iommu.c | 30 ++++++++++++++++++------------
> >  1 file changed, 18 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > index 9704f3f76e63..c3242253a4e7 100644
> > --- a/arch/powerpc/kernel/iommu.c
> > +++ b/arch/powerpc/kernel/iommu.c
> > @@ -735,21 +735,10 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> >  	return tbl;
> >  }
> >  
> > -static void iommu_table_free(struct kref *kref)
> > +static void iommu_table_clean(struct iommu_table *tbl)
> 
> iommu_table_free() + iommu_init_table() + set_iommu_table_base() should
> work too, why new helper?

iommu_table_free() also frees the tbl, which would need allocate it
again (new address) and to fill it up again, unnecessarily. 
I think it's a better approach to only change what is needed.

> There is also iommu_table_clear() which does a different thing so you
> need a better name.

I agree.
I had not noticed this other function before sending the patchset. What
would be a better name though? __iommu_table_free()? 

> Second, iommu_table_free
> use and it would be ok as we would only see this when hot-unplugging a
> PE because we always kept the default window.
> Btw you must be seeing these warnings now every time you create DDW with
> these patches as at least the first page is reserved, do not you?

It does not print a warning.
I noticed other warnings, but not this one from iommu_table_free():
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_ma
p, tbl->it_size))
	pr_warn("%s: Unexpected TCEs\n", __func__);

Before that, iommu_table_release_pages(tbl) is supposed to clear the 
bitmap, so this only tests for a tce that is created in this short period.

> Since we are replacing a table for a device which is still in the
> system, we should not try messing with its DMA if it already has
> mappings so the warning should become an error preventing DDW. It is
> rather hard to trigger in practice but I could hack a driver to ask for
> 32bit DMA mask first, map few pages and then ask for 64bit DMA mask, it
> is not illegal, I think. So this needs a new helper - "bool
> iommu_table_in_use(tbl)" - to use in enable_ddw(). Or I am overthinking
> this?... Thanks,

As of today, there seems to be nothing like that happening in the
driver I am testing. 
I spoke to Brian King on slack, and he mentioned that at the point DDW
is created there should be no allocations in place.

But I suppose some driver could try to do this.

Maybe a better approach would be removing the mapping only if the
default window is removed (at the end of enable_ddw, as an else to
resetting the default DMA window), and having a way to add more
mappings to those pools. But this last part doesn't look so simple, and
it would be better to understand if it's necessary investing work in
this.

What do you think?

Best regards,



^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Benjamin Herrenschmidt @ 2020-07-21 23:11 UTC (permalink / raw)
  To: Alex Ghiti, Palmer Dabbelt
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <7cb2285e-68ba-6827-5e61-e33a4b65ac03@ghiti.fr>

On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
> > > I guess I don't understand why this is necessary at all.  
> > > Specifically: why
> > > can't we just relocate the kernel within the linear map?  That would 
> > > let the
> > > bootloader put the kernel wherever it wants, modulo the physical 
> > > memory size we
> > > support.  We'd need to handle the regions that are coupled to the 
> > > kernel's
> > > execution address, but we could just put them in an explicit memory 
> > > region
> > > which is what we should probably be doing anyway.
> > 
> > Virtual relocation in the linear mapping requires to move the kernel 
> > physically too. Zong implemented this physical move in its KASLR RFC 
> > patchset, which is cumbersome since finding an available physical spot 
> > is harder than just selecting a virtual range in the vmalloc range.
> > 
> > In addition, having the kernel mapping in the linear mapping prevents 
> > the use of hugepage for the linear mapping resulting in performance loss 
> > (at least for the GB that encompasses the kernel).
> > 
> > Why do you find this "ugly" ? The vmalloc region is just a bunch of 
> > available virtual addresses to whatever purpose we want, and as noted by 
> > Zong, arm64 uses the same scheme.

I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?

At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.

If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.

Cheers,
Ben.
 


^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Benjamin Herrenschmidt @ 2020-07-21 23:12 UTC (permalink / raw)
  To: Palmer Dabbelt, alex
  Cc: aou, linux-mm, Anup Patel, linux-kernel, Atish Patra, paulus,
	zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <mhng-08bff01a-ca15-4bbc-8454-2ca3e823fef8@palmerdabbelt-glaptop1>

On Tue, 2020-07-21 at 12:05 -0700, Palmer Dabbelt wrote:
> 
> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
> * On 64-bit systems the VA space around the kernel is precious because it's the
>   only place we can place text (modules, BPF, whatever). 

Why ? Branch distance limits ? You can't use trampolines ?

>  If we start putting
>   the kernel in the vmalloc space then we either have to pre-allocate a bunch
>   of space around it (essentially making it a fixed mapping anyway) or it
>   becomes likely that we won't be able to find space for modules as they're
>   loaded into running systems.

I dislike the kernel being in the vmalloc space (see my other email)
but I don't understand the specific issue with modules.

> * Relying on a relocatable kernel for sv48 support introduces a fairly large
>   performance hit.

Out of curiosity why would relocatable kernels introduce a significant
hit ? Where about do you see the overhead coming from ?

> Roughly, my proposal would be to:
> 
> * Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
>   anywhere and we only have one VA width, so we're not really solving any
>   problems with these changes.
> * Staticly allocate a 2GiB portion of the VA space for all our text, as its own
>   region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
>   which would decouple the kernel from the physical memory layout of the system.
>   This would have the side effect of sorting out a bunch of bootloader headaches
>   that we currently have.
> * Sort out how to maintain a linear map as the canonical hole moves around
>   between the VA widths without adding a bunch of overhead to the virt2phys and
>   friends.  This is probably going to be the trickiest part, but I think if we
>   just change the page table code to essentially lie about VAs when an sv39
>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>   logical complexity involved, but it would remain fast.
> 
> This doesn't solve the problem of virtually relocatable kernels, but it does
> let us decouple that from the sv48 stuff.  It also lets us stop relying on a
> fixed physical address the kernel is loaded into, which is another thing I
> don't like.
> 
> I know this may be a more complicated approach, but there aren't any sv48
> systems around right now so I just don't see the rush to support them,
> particularly when there's a cost to what already exists (for those who haven't
> been watching, so far all the sv48 patch sets have imposed a significant
> performance penalty on all systems).


^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 23:36 UTC (permalink / raw)
  To: benh
  Cc: aou, alex, linux-mm, Anup Patel, linux-kernel, Atish Patra,
	paulus, zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <54af168083aee9dbda1b531227521a26b77ba2c8.camel@kernel.crashing.org>

On Tue, 21 Jul 2020 16:11:02 PDT (-0700), benh@kernel.crashing.org wrote:
> On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
>> > > I guess I don't understand why this is necessary at all.
>> > > Specifically: why
>> > > can't we just relocate the kernel within the linear map?  That would
>> > > let the
>> > > bootloader put the kernel wherever it wants, modulo the physical
>> > > memory size we
>> > > support.  We'd need to handle the regions that are coupled to the
>> > > kernel's
>> > > execution address, but we could just put them in an explicit memory
>> > > region
>> > > which is what we should probably be doing anyway.
>> >
>> > Virtual relocation in the linear mapping requires to move the kernel
>> > physically too. Zong implemented this physical move in its KASLR RFC
>> > patchset, which is cumbersome since finding an available physical spot
>> > is harder than just selecting a virtual range in the vmalloc range.
>> >
>> > In addition, having the kernel mapping in the linear mapping prevents
>> > the use of hugepage for the linear mapping resulting in performance loss
>> > (at least for the GB that encompasses the kernel).
>> >
>> > Why do you find this "ugly" ? The vmalloc region is just a bunch of
>> > available virtual addresses to whatever purpose we want, and as noted by
>> > Zong, arm64 uses the same scheme.
>
> I don't get it :-)
>
> At least on powerpc we move the kernel in the linear mapping and it
> works fine with huge pages, what is your problem there ? You rely on
> punching small-page size holes in there ?

That was my original suggestion, and I'm not actually sure it's invalid.  It
would mean that both the kernel's physical and virtual addresses are set by the
bootloader, which may or may not be workable if we want to have an sv48+sv39
kernel.  My initial approach to sv48+sv39 kernels would be to just throw away
the sv39 memory on sv48 kernels, which would preserve the linear map but mean
that there is no single physical address that's accessible for both.  That
would require some coordination between the bootloader and the kernel as to
where it should be loaded, but maybe there's a better way to design the linear
map.  Right now we have a bunch of unwritten rules about where things need to
be loaded, which is a recipe for disaster.

We could copy the kernel around, but I'm not sure I really like that idea.  We
do zero the BSS right now, so it's not like we entirely rely on the bootloader
to set up the kernel image, but with the hart race boot scheme we have right
now we'd at least need to leave a stub sitting around.  Maybe we just throw
away SBI v0.1, though, that's why we called it all legacy in the first place.

My bigger worry is that anything that involves running the kernel at arbitrary
virtual addresses means we need a PIC kernel, which means every global symbol
needs an indirection.  That's probably not so bad for shared libraries, but the
kernel has a lot of global symbols.  PLT references probably aren't so scary,
as we have an incoherent instruction cache so the virtual function predictor
isn't that hard to build, but making all global data accesses GOT-relative
seems like a disaster for performance.  This fixed-VA thing really just exists
so we don't have to be full-on PIC.

In theory I think we could just get away with pretending that medany is PIC,
which I believe works as long as the data and text offset stays constant, you
you don't have any symbols between 2GiB and -2GiB (as those may stay fixed,
even in medany), and you deal with GP accordingly (which should work itself out
in the current startup code).  We rely on this for some of the early boot code
(and will soon for kexec), but that's a very controlled code base and we've
already had some issues.  I'd be much more comfortable adding an explicit
semi-PIC code model, as I tend to miss something when doing these sorts of
things and then we could at least add it to the GCC test runs and guarantee it
actually works.  Not really sure I want to deal with that, though.  It would,
however, be the only way to get random virtual addresses during kernel
execution.

> At least in the old days, there were a number of assumptions that
> the kernel text/data/bss resides in the linear mapping.

Ya, it terrified me as well.  Alex says arm64 puts the kernel in the vmalloc
region, so assuming that's the case it must be possible.  I didn't get that
from reading the arm64 port (I guess it's no secret that pretty much all I do
is copy their code)

> If you change that you need to ensure that it's still physically
> contiguous and you'll have to tweak __va and __pa, which might induce
> extra overhead.

I'm operating under the assumption that we don't want to add an additional load
to virt2phys conversions.  arm64 bends over backwards to avoid the load, and
I'm assuming they have a reason for doing so.  Of course, if we're PIC then
maybe performance just doesn't matter, but I'm not sure I want to just give up.
Distros will probably build the sv48+sv39 kernels as soon as they show up, even
if there's no sv48 hardware for a while.

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Palmer Dabbelt @ 2020-07-21 23:48 UTC (permalink / raw)
  To: benh
  Cc: aou, alex, linux-mm, Anup Patel, linux-kernel, Atish Patra,
	paulus, zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <6fbea8347bdb8434d91cf3ec2b95b134bd66cfe3.camel@kernel.crashing.org>

On Tue, 21 Jul 2020 16:12:58 PDT (-0700), benh@kernel.crashing.org wrote:
> On Tue, 2020-07-21 at 12:05 -0700, Palmer Dabbelt wrote:
>>
>> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
>> * On 64-bit systems the VA space around the kernel is precious because it's the
>>   only place we can place text (modules, BPF, whatever).
>
> Why ? Branch distance limits ? You can't use trampolines ?

Nothing fundamental, it's just that we don't have a large code model in the C
compiler.  As a result all the global symbols are resolved as 32-bit
PC-relative accesses.  We could fix this with a fast large code model, but then
the kernel would need to relax global symbol references in modules and we don't
even do that for the simple code models we have now.  FWIW, some of the
proposed large code models are essentially just split-PLT/GOT and therefor
don't require relaxation, but at that point we're essentially PIC until we
have more that 2GiB of kernel text -- and even then, we keep all the
performance issues.

>>  If we start putting
>>   the kernel in the vmalloc space then we either have to pre-allocate a bunch
>>   of space around it (essentially making it a fixed mapping anyway) or it
>>   becomes likely that we won't be able to find space for modules as they're
>>   loaded into running systems.
>
> I dislike the kernel being in the vmalloc space (see my other email)
> but I don't understand the specific issue with modules.

Essentially what's above, the modules smell the same as the rest of the
kernel's code and therefor have a similar set of restrictions.  If we build PIC
modules and have the PLT entries do GOT loads (as do our shared libraries) then
we could break this restriction, but that comes with some performance
implications.  Like I said in the other email, I'm less worried about the
instruction side of things so maybe that's the right way to go.

>> * Relying on a relocatable kernel for sv48 support introduces a fairly large
>>   performance hit.
>
> Out of curiosity why would relocatable kernels introduce a significant
> hit ? Where about do you see the overhead coming from ?

Our PIC codegen, probably better addressed by my other email and above.

>
>> Roughly, my proposal would be to:
>>
>> * Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
>>   anywhere and we only have one VA width, so we're not really solving any
>>   problems with these changes.
>> * Staticly allocate a 2GiB portion of the VA space for all our text, as its own
>>   region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
>>   which would decouple the kernel from the physical memory layout of the system.
>>   This would have the side effect of sorting out a bunch of bootloader headaches
>>   that we currently have.
>> * Sort out how to maintain a linear map as the canonical hole moves around
>>   between the VA widths without adding a bunch of overhead to the virt2phys and
>>   friends.  This is probably going to be the trickiest part, but I think if we
>>   just change the page table code to essentially lie about VAs when an sv39
>>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>>   logical complexity involved, but it would remain fast.
>>
>> This doesn't solve the problem of virtually relocatable kernels, but it does
>> let us decouple that from the sv48 stuff.  It also lets us stop relying on a
>> fixed physical address the kernel is loaded into, which is another thing I
>> don't like.
>>
>> I know this may be a more complicated approach, but there aren't any sv48
>> systems around right now so I just don't see the rush to support them,
>> particularly when there's a cost to what already exists (for those who haven't
>> been watching, so far all the sv48 patch sets have imposed a significant
>> performance penalty on all systems).

^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc: inline doorbell sending functions
From: Michael Ellerman @ 2020-07-22  0:52 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
  Cc: Nicholas Piggin, kvm-ppc, Paul Mackerras, Cédric Le Goater,
	Anton Blanchard, David Gibson
In-Reply-To: <20200630115034.137050-2-npiggin@gmail.com>

Nicholas Piggin <npiggin@gmail.com> writes:
> These are only called in one place for a given platform, so inline them
> for performance.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/include/asm/dbell.h | 63 ++++++++++++++++++++++++++++++--
>  arch/powerpc/kernel/dbell.c      | 55 ----------------------------
>  2 files changed, 60 insertions(+), 58 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
> index 4ce6808deed3..f19d2282e3f8 100644
> --- a/arch/powerpc/include/asm/dbell.h
> +++ b/arch/powerpc/include/asm/dbell.h
> @@ -13,6 +13,7 @@
>  
>  #include <asm/ppc-opcode.h>
>  #include <asm/feature-fixups.h>
> +#include <asm/kvm_ppc.h>
>  
>  #define PPC_DBELL_MSG_BRDCAST	(0x04000000)
>  #define PPC_DBELL_TYPE(x)	(((x) & 0xf) << (63-36))


This somehow breaks ppc40x_defconfig and others:

In file included from /home/michael/linux/arch/powerpc/include/asm/kvm_ppc.h:24,
                 from /home/michael/linux/arch/powerpc/include/asm/dbell.h:16,
                 from /home/michael/linux/arch/powerpc/kernel/asm-offsets.c:38:
/home/michael/linux/arch/powerpc/include/asm/kvm_booke.h: In function ‘kvmppc_get_fault_dar’:
/home/michael/linux/arch/powerpc/include/asm/kvm_booke.h:94:19: error: ‘struct kvm_vcpu_arch’ has no member named ‘fault_dear’
   94 |  return vcpu->arch.fault_dear;
      |                   ^
make[2]: *** [/home/michael/linux/scripts/Makefile.build:114: arch/powerpc/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/michael/linux/Makefile:1175: prepare0] Error 2
make: *** [Makefile:185: __sub-make] Error 2

cheers

^ permalink raw reply

* Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean
From: Brian King @ 2020-07-22  0:52 UTC (permalink / raw)
  To: Leonardo Bras, Alexey Kardashevskiy, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Joel Stanley,
	Christophe Leroy, Thiago Jung Bauermann, Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <0f4c2d84d0958e98e7ada53c25750fe548cadf0b.camel@gmail.com>

On 7/21/20 5:13 PM, Leonardo Bras wrote:
> On Tue, 2020-07-21 at 14:59 +1000, Alexey Kardashevskiy wrote:
>>
>> On 16/07/2020 17:16, Leonardo Bras wrote:
>>> Move the part of iommu_table_free() that does struct iommu_table cleaning
>>> into iommu_table_clean, so we can invoke it separately.
>>>
>>> This new function is useful for cleaning struct iommu_table before
>>> initializing it again with a new DMA window, without having it freed and
>>> allocated again.
>>>
>>> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
>>> ---
>>>  arch/powerpc/kernel/iommu.c | 30 ++++++++++++++++++------------
>>>  1 file changed, 18 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 9704f3f76e63..c3242253a4e7 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -735,21 +735,10 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>>>  	return tbl;
>>>  }
>>>  
>>> -static void iommu_table_free(struct kref *kref)
>>> +static void iommu_table_clean(struct iommu_table *tbl)
>>
>> iommu_table_free() + iommu_init_table() + set_iommu_table_base() should
>> work too, why new helper?
> 
> iommu_table_free() also frees the tbl, which would need allocate it
> again (new address) and to fill it up again, unnecessarily. 
> I think it's a better approach to only change what is needed.
> 
>> There is also iommu_table_clear() which does a different thing so you
>> need a better name.
> 
> I agree.
> I had not noticed this other function before sending the patchset. What
> would be a better name though? __iommu_table_free()? 
> 
>> Second, iommu_table_free
>> use and it would be ok as we would only see this when hot-unplugging a
>> PE because we always kept the default window.
>> Btw you must be seeing these warnings now every time you create DDW with
>> these patches as at least the first page is reserved, do not you?
> 
> It does not print a warning.
> I noticed other warnings, but not this one from iommu_table_free():
> /* verify that table contains no entries */
> if (!bitmap_empty(tbl->it_ma
> p, tbl->it_size))
> 	pr_warn("%s: Unexpected TCEs\n", __func__);
> 
> Before that, iommu_table_release_pages(tbl) is supposed to clear the 
> bitmap, so this only tests for a tce that is created in this short period.
> 
>> Since we are replacing a table for a device which is still in the
>> system, we should not try messing with its DMA if it already has
>> mappings so the warning should become an error preventing DDW. It is
>> rather hard to trigger in practice but I could hack a driver to ask for
>> 32bit DMA mask first, map few pages and then ask for 64bit DMA mask, it
>> is not illegal, I think. So this needs a new helper - "bool
>> iommu_table_in_use(tbl)" - to use in enable_ddw(). Or I am overthinking
>> this?... Thanks,
> 
> As of today, there seems to be nothing like that happening in the
> driver I am testing. 
> I spoke to Brian King on slack, and he mentioned that at the point DDW
> is created there should be no allocations in place.

I think there are a couple of scenarios here. One is where there is a DMA
allocation prior to a call to set the DMA mask. Second scenario is if the
driver makes multiple calls to set the DMA mask. I would argue that a properly
written driver should tell the IOMMU subsystem what DMA mask it supports prior
to allocating DMA memroy. Documentation/core-api/dma-api-howto.rst should
describe what is legal and what is not.

It might be reasonable to declare its not allowed to allocate DMA memory
and then later change the DMA mask and clearly call this out in the documentation
if its not already.

-Brian


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center


^ permalink raw reply

* Re: [PATCH 5/5] powerpc sstep: Add tests for Prefixed Add Immediate
From: Jordan Niethe @ 2020-07-22  1:26 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Balamuruhan S
In-Reply-To: <20200525025923.19843-5-jniethe5@gmail.com>

On Mon, May 25, 2020 at 1:00 PM Jordan Niethe <jniethe5@gmail.com> wrote:
>
> Use the existing support for testing compute type instructions to test
> Prefixed Add Immediate (paddi).  The R bit of the paddi instruction
> controls whether current instruction address is used. Add test cases for
> when R=1 and for R=0. paddi has a 34 bit immediate field formed by
> concatenating si0 and si1. Add tests for the extreme values of this
> field.
>
> Skip the paddi tests if ISA v3.1 is unsupported.
>
> Some of these test cases were added by Balamuruhan S.
>
> Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
> ---
>  arch/powerpc/lib/test_emulate_step.c          | 127 ++++++++++++++++++
>  .../lib/test_emulate_step_exec_instr.S        |   1 +
>  2 files changed, 128 insertions(+)
>
> diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
> index 579b5db80674..33a72b7d2764 100644
> --- a/arch/powerpc/lib/test_emulate_step.c
> +++ b/arch/powerpc/lib/test_emulate_step.c
> @@ -105,6 +105,13 @@
>                                         ___PPC_RA(a) | ___PPC_RB(b))
>  #define TEST_ADDC_DOT(t, a, b) ppc_inst(PPC_INST_ADDC | ___PPC_RT(t) |         \
>                                         ___PPC_RA(a) | ___PPC_RB(b) | 0x1)
> +#define TEST_PADDI(t, a, i, pr)        ppc_inst_prefix(PPC_PREFIX_MLS | __PPC_PRFX_R(pr) |     \
> +                                       IMM_H(i),                       \
> +                                       PPC_INST_ADDI |                 \
> +                                       ___PPC_RT(t) | ___PPC_RA(a) |   \
> +                                       IMM_L(i))
> +
> +
>
>  #define MAX_SUBTESTS   16
>
> @@ -699,6 +706,11 @@ struct compute_test {
>         } subtests[MAX_SUBTESTS + 1];
>  };
>
> +/* Extreme values for si0||si1 (the MLS:D-form 34 bit immediate field) */
> +#define SI_MIN BIT(33)
> +#define SI_MAX (BIT(33) - 1)
> +#define SI_UMAX (BIT(34) - 1)
> +
>  static struct compute_test compute_tests[] = {
>         {
>                 .mnemonic = "nop",
> @@ -1071,6 +1083,121 @@ static struct compute_test compute_tests[] = {
>                                 }
>                         }
>                 }
> +       },
> +       {
> +               .mnemonic = "paddi",
> +               .cpu_feature = CPU_FTR_ARCH_31,
> +               .subtests = {
> +                       {
> +                               .descr = "RA = LONG_MIN, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = LONG_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = LONG_MIN, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = LONG_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = LONG_MAX, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = LONG_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = ULONG_MAX, SI = SI_UMAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_UMAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = ULONG_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = ULONG_MAX, SI = 0x1, R = 0",
> +                               .instr = TEST_PADDI(21, 22, 0x1, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = ULONG_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = INT_MIN, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = INT_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = INT_MIN, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = INT_MIN,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = INT_MAX, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = INT_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = UINT_MAX, SI = 0x1, R = 0",
> +                               .instr = TEST_PADDI(21, 22, 0x1, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = UINT_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = UINT_MAX, SI = SI_MAX, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MAX, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                                       .gpr[22] = UINT_MAX,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA is r0, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 0, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0x0,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA = 0, SI = SI_MIN, R = 0",
> +                               .instr = TEST_PADDI(21, 22, SI_MIN, 0),
> +                               .regs = {
> +                                       .gpr[21] = 0x0,
> +                                       .gpr[22] = 0x0,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA is r0, SI = 0, R = 1",
> +                               .instr = TEST_PADDI(21, 0, 0, 1),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                               }
> +                       },
> +                       {
> +                               .descr = "RA is r0, SI = SI_MIN, R = 1",
> +                               .instr = TEST_PADDI(21, 0, SI_MIN, 1),
> +                               .regs = {
> +                                       .gpr[21] = 0,
> +                               }
> +                       }
> +               }
>         }
>  };
>
> diff --git a/arch/powerpc/lib/test_emulate_step_exec_instr.S b/arch/powerpc/lib/test_emulate_step_exec_instr.S
> index 1580f34f4f4f..aef53ee77a43 100644
> --- a/arch/powerpc/lib/test_emulate_step_exec_instr.S
> +++ b/arch/powerpc/lib/test_emulate_step_exec_instr.S
> @@ -81,6 +81,7 @@ _GLOBAL(exec_instr)
>
>         /* Placeholder for the test instruction */
>  1:     nop
> +       nop
>         patch_site 1b patch__exec_instr
>
>         /*
> --
> 2.17.1
>

Because of the alignment requirements of prefixed instructions, the
noops to be patched need to be aligned.
mpe, want me to send a new version?
--- a/arch/powerpc/lib/test_emulate_step_exec_instr.S
+++ b/arch/powerpc/lib/test_emulate_step_exec_instr.S
@@ -80,6 +80,7 @@ _GLOBAL(exec_instr)
        REST_NVGPRS(r31)

        /* Placeholder for the test instruction */
+.align 6
 1:     nop
        nop
        patch_site 1b patch__exec_instr

^ permalink raw reply

* Re: [PATCH v4 5/7] powerpc/iommu: Move iommu_table cleaning routine to iommu_table_clean
From: Alexey Kardashevskiy @ 2020-07-22  1:28 UTC (permalink / raw)
  To: Leonardo Bras, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Joel Stanley, Christophe Leroy,
	Thiago Jung Bauermann, Ram Pai, Brian King
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <0f4c2d84d0958e98e7ada53c25750fe548cadf0b.camel@gmail.com>



On 22/07/2020 08:13, Leonardo Bras wrote:
> On Tue, 2020-07-21 at 14:59 +1000, Alexey Kardashevskiy wrote:
>>
>> On 16/07/2020 17:16, Leonardo Bras wrote:
>>> Move the part of iommu_table_free() that does struct iommu_table cleaning
>>> into iommu_table_clean, so we can invoke it separately.
>>>
>>> This new function is useful for cleaning struct iommu_table before
>>> initializing it again with a new DMA window, without having it freed and
>>> allocated again.
>>>
>>> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
>>> ---
>>>  arch/powerpc/kernel/iommu.c | 30 ++++++++++++++++++------------
>>>  1 file changed, 18 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 9704f3f76e63..c3242253a4e7 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -735,21 +735,10 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>>>  	return tbl;
>>>  }
>>>  
>>> -static void iommu_table_free(struct kref *kref)
>>> +static void iommu_table_clean(struct iommu_table *tbl)
>>
>> iommu_table_free() + iommu_init_table() + set_iommu_table_base() should
>> work too, why new helper?
> 
> iommu_table_free() also frees the tbl, which would need allocate it
> again (new address) and to fill it up again, unnecessarily. 

It is a new table in fact, everything is new there. You are only saving
kfree+kzalloc which does not seem a huge win.

Also, iommu_table_update() simply assumes 64bit window by passing
res_start=res_end=0 to iommu_init_table() which is not horribly robust
either. Yeah, I know, iommu_init_table() is always called with zeroes in
pseries but this is somewhat ok as those tables are from the device tree
and those windows don't overlap with 32bit MMIO but under KVM they will
(well, if we hack QEMU to advertise a single window).

I suggest removing iommu_pseries_table_update() from 6/7 and do
iommu_table_free() + iommu_init_table() + set_iommu_table_base() with a
WARN_ON(pdev->dev.archdata.dma_offset>=SZ_4G), may be even do this all
in enable_ddw() where we know for sure if it is 1:1 mapping or just a
big window.

Out of curiosity - what page sizes does pHyp advertise in "query"?


> I think it's a better approach to only change what is needed.
> 
>> There is also iommu_table_clear() which does a different thing so you
>> need a better name.
> 
> I agree.
> I had not noticed this other function before sending the patchset. What
> would be a better name though? __iommu_table_free()? 
> 
>> Second, iommu_table_free
>> use and it would be ok as we would only see this when hot-unplugging a
>> PE because we always kept the default window.
>> Btw you must be seeing these warnings now every time you create DDW with
>> these patches as at least the first page is reserved, do not you?
> 
> It does not print a warning.
> I noticed other warnings,

And what are these?

> but not this one from iommu_table_free():
> /* verify that table contains no entries */
> if (!bitmap_empty(tbl->it_ma
> p, tbl->it_size))
> 	pr_warn("%s: Unexpected TCEs\n", __func__);
> 
> Before that, iommu_table_release_pages(tbl) is supposed to clear the 
> bitmap, so this only tests for a tce that is created in this short period.


iommu_table_release_pages() only clears reserved pages - page 0 (just a
protection against NULL DMA pointers) and 32bit MMIO (these should not
be set for 64bit window). The "%s: Unexpected TCEs\n" is what checks for
actual mapped TCEs.


>> Since we are replacing a table for a device which is still in the
>> system, we should not try messing with its DMA if it already has
>> mappings so the warning should become an error preventing DDW. It is
>> rather hard to trigger in practice but I could hack a driver to ask for
>> 32bit DMA mask first, map few pages and then ask for 64bit DMA mask, it
>> is not illegal, I think. So this needs a new helper - "bool
>> iommu_table_in_use(tbl)" - to use in enable_ddw(). Or I am overthinking
>> this?... Thanks,
> 
> As of today, there seems to be nothing like that happening in the
> driver I am testing. 
> I spoke to Brian King on slack, and he mentioned that at the point DDW
> is created there should be no allocations in place.

Correct, there should not be. But it is also not a huge effort to fall
back if there are.


> But I suppose some driver could try to do this.
> 
> Maybe a better approach would be removing the mapping only if the
> default window is removed (at the end of enable_ddw, as an else to
> resetting the default DMA window), and having a way to add more
> mappings to those pools. But this last part doesn't look so simple, and
> it would be better to understand if it's necessary investing work in
> this.
> 
> What do you think?

Add iommu_table_in_use(tbl) and fail DDW if that says "yes".



-- 
Alexey

^ permalink raw reply

* [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's
From: Pingfan Liu @ 2020-07-22  1:52 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nathan Lynch, kexec, Hari Bathini, Pingfan Liu

This patch prepares for the incoming patch which swaps the order of KOBJ_
uevent and dt's updating.

It has no functional effect, just groups lmb operation and memblock's in
order to insert dt updating operation easily, and makes it easier to
review.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org
Cc: kexec@lists.infradead.org
---
v2 -> v3: rebase onto ppc next-test branch
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 ++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..1a3ac3b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
-	int rc;
+	phys_addr_t base_addr;
+	int rc, nid;
 
 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	if (rc)
 		return rc;
 
+	base_addr = lmb->base_addr;
+	nid = lmb->nid;
 	block_sz = pseries_memory_block_size();
 
-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-	/* Update memory regions for memory remove */
-	memblock_remove(lmb->base_addr, block_sz);
-
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+	__remove_memory(nid, base_addr, block_sz);
+
+	/* Update memory regions for memory remove */
+	memblock_remove(base_addr, block_sz);
+
 	return 0;
 }
 
@@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	}
 
 	lmb_set_nid(lmb);
+	lmb->flags |= DRCONF_MEM_ASSIGNED;
+
 	block_sz = memory_block_size_bytes();
 
 	/* Add the memory */
@@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+		int nid = lmb->nid;
+		phys_addr_t base_addr = lmb->base_addr;
+
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
-	} else {
-		lmb->flags |= DRCONF_MEM_ASSIGNED;
+		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+		__remove_memory(nid, base_addr, block_sz);
 	}
 
 	return rc;
-- 
2.7.5


^ permalink raw reply related

* [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
From: Pingfan Liu @ 2020-07-22  1:52 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nathan Lynch, kexec, Hari Bathini, Pingfan Liu
In-Reply-To: <1595382730-10565-1-git-send-email-kernelfans@gmail.com>

A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes                         : [  0.0 %] /                   Checking for memory holes                         : [100.0 %] |                   Excluding unnecessary pages                       : [100.0 %] \                   Copying data                                      : [  0.3 %] -          eta: 38s[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba400000 access=0x8000000000000004 current=makedumpfile
[   44.337663] hash-mmu:     trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc000000050000504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba400000 access=0x8000000000000004 current=makedumpfile
[   44.337692] hash-mmu:     trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc000000050000504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 00007fffba400000 nip 00007fffbbc4d7fc lr 000000011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error               $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
  After analyzing, it turns out that in the current implementation,
when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as
the code __remove_memory() comes before drmem_update_dt().
So in kdump kernel, when read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it
can be observed "Bus error"

From a viewpoint of listener and publisher, the publisher notifies the
listener before data is ready.  This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
updating. And in capture kernel, makedumpfile will access the memory based
on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.

* Fix *
  In order to fix this issue, update dt before __remove_memory(), and
accordingly the same rule in hot-add path.

This will introduce extra dt updating payload for each involved lmb when hotplug.
But it should be fine since drmem_update_dt() is memory based operation and
hotplug is not a hot path.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org
Cc: kexec@lists.infradead.org
---
v2 -> v3: rebase onto ppc next-test branch
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1a3ac3b..def8cb3f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -372,6 +372,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+	drmem_update_dt();
 
 	__remove_memory(nid, base_addr, block_sz);
 
@@ -607,6 +608,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	lmb_set_nid(lmb);
 	lmb->flags |= DRCONF_MEM_ASSIGNED;
+	drmem_update_dt();
 
 	block_sz = memory_block_size_bytes();
 
@@ -625,6 +627,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
 		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+		drmem_update_dt();
 
 		__remove_memory(nid, base_addr, block_sz);
 	}
@@ -877,9 +880,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
 		break;
 	}
 
-	if (!rc)
-		rc = drmem_update_dt();
-
 	unlock_device_hotplug();
 	return rc;
 }
-- 
2.7.5


^ permalink raw reply related

* Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Michael Ellerman @ 2020-07-22  2:06 UTC (permalink / raw)
  To: Ram Pai, kvm-ppc, linuxppc-dev
  Cc: sukadev, aik, linuxram, bharata, sathnaga, ldufour, bauerman,
	david
In-Reply-To: <1594888333-9370-1-git-send-email-linuxram@us.ibm.com>

Ram Pai <linuxram@us.ibm.com> writes:
> An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> fault is delivered to the ultravisor.
>
> Unfortunately the Ultravisor has no correct-way to fetch the faulting
> instruction. The PEF architecture does not allow Ultravisor to enable MMU
> translation. Walking the two level page table to read the instruction can race
> with other vcpus modifying the SVM's process scoped page table.

You're trying to read the guest's kernel text IIUC, that mapping should
be stable. Possibly permissions on it could change over time, but the
virtual -> real mapping should not.

> This problem can be correctly solved with some help from the kernel.
>
> Capture the faulting instruction in SPRG0 register, before executing the
> faulting instruction. This enables the ultravisor to easily procure the
> faulting instruction and emulate it.

This is not something I'm going to merge. Sorry.

cheers

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Benjamin Herrenschmidt @ 2020-07-22  2:21 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: aou, alex, linux-mm, Anup Patel, linux-kernel, Atish Patra,
	paulus, zong.li, Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <mhng-cd9a74ea-2edf-47e4-aade-b090f1a069f1@palmerdabbelt-glaptop1>

On Tue, 2020-07-21 at 16:48 -0700, Palmer Dabbelt wrote:
> > Why ? Branch distance limits ? You can't use trampolines ?
> 
> Nothing fundamental, it's just that we don't have a large code model in the C
> compiler.  As a result all the global symbols are resolved as 32-bit
> PC-relative accesses.  We could fix this with a fast large code model, but then
> the kernel would need to relax global symbol references in modules and we don't
> even do that for the simple code models we have now.  FWIW, some of the
> proposed large code models are essentially just split-PLT/GOT and therefor
> don't require relaxation, but at that point we're essentially PIC until we
> have more that 2GiB of kernel text -- and even then, we keep all the
> performance issues.

My memory might be out of date but I *think* we do it on powerpc
without going to a large code model, but just having the in-kernel
linker insert trampolines.

Cheers,
Ben.



^ permalink raw reply

* Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Benjamin Herrenschmidt @ 2020-07-22  2:23 UTC (permalink / raw)
  To: Michael Ellerman, Ram Pai, kvm-ppc, linuxppc-dev
  Cc: sukadev, aik, bharata, sathnaga, ldufour, bauerman, david
In-Reply-To: <875zags3qp.fsf@mpe.ellerman.id.au>

On Wed, 2020-07-22 at 12:06 +1000, Michael Ellerman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > fault is delivered to the ultravisor.
> > 
> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > translation. Walking the two level page table to read the instruction can race
> > with other vcpus modifying the SVM's process scoped page table.
> 
> You're trying to read the guest's kernel text IIUC, that mapping should
> be stable. Possibly permissions on it could change over time, but the
> virtual -> real mapping should not.

This will of course no longer work if userspace tries to access MMIO
but as long as you are ok limiting MMIO usage to the kernel, that's one
way.

iirc MMIO emulation in KVM on powerpc already has that race because of
HW TLB inval broadcast and it hasn't hurt anybody ... so far.

> > This problem can be correctly solved with some help from the kernel.
> > 
> > Capture the faulting instruction in SPRG0 register, before executing the
> > faulting instruction. This enables the ultravisor to easily procure the
> > faulting instruction and emulate it.
> 
> This is not something I'm going to merge. Sorry.

Another approach is to have the guest explicitly switch to using UV
calls for MMIO.

Cheers,
Ben.



^ permalink raw reply

* [PATCH] powerpc/40x: Fix assembler warning about r0
From: Michael Ellerman @ 2020-07-22  2:24 UTC (permalink / raw)
  To: linuxppc-dev

The assembler says:
  arch/powerpc/kernel/head_40x.S:623: Warning: invalid register expression

It's objecting to the use of r0 as the RA argument. That's because
when RA = 0 the literal value 0 is used, rather than the content of
r0, making the use of r0 in the source potentially confusing.

Fix it to use a literal 0, the generated code is identical.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/kernel/head_40x.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S
index 926bfa73586a..5b282d9965a5 100644
--- a/arch/powerpc/kernel/head_40x.S
+++ b/arch/powerpc/kernel/head_40x.S
@@ -620,7 +620,7 @@ _ENTRY(saved_ksp_limit)
 	ori	r6, r6, swapper_pg_dir@l
 	lis	r5, abatron_pteptrs@h
 	ori	r5, r5, abatron_pteptrs@l
-	stw	r5, 0xf0(r0)	/* Must match your Abatron config file */
+	stw	r5, 0xf0(0)	/* Must match your Abatron config file */
 	tophys(r5,r5)
 	stw	r6, 0(r5)
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 1/2] powerpc/numa: Limit possible nodes to within num_possible_nodes
From: Nathan Lynch @ 2020-07-22  3:14 UTC (permalink / raw)
  To: Srikar Dronamraju, Michael Ellerman
  Cc: Tyrel Datwyler, linuxppc-dev, Srikar Dronamraju, Michael Ellerman
In-Reply-To: <20200715120534.3673-1-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> MAX_NUMNODES is a theoretical maximum number of nodes thats is supported
> by the kernel. Device tree properties exposes the number of possible
> nodes on the current platform. The kernel would detected this and would
> use it for most of its resource allocations.  If the platform now
> increases the nodes to over what was already exposed, then it may lead
> to inconsistencies. Hence limit it to the already exposed nodes.
>
> Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: Michael Ellerman <michaele@au1.ibm.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Tyrel Datwyler <tyreld@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8ec7ff05ae47..a2c5fe0d0cad 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -221,7 +221,7 @@ static void initialize_distance_lookup_table(int nid,
>  	}
>  }
>  
> -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
> +/* Returns nid in the range [0..num_possible_nodes()-1], or -1 if no useful numa
>   * info is found.
>   */
>  static int associativity_to_nid(const __be32 *associativity)
> @@ -235,7 +235,7 @@ static int associativity_to_nid(const __be32 *associativity)
>  		nid = of_read_number(&associativity[min_common_depth], 1);
>  
>  	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
> +	if (nid == 0xffff || nid >= num_possible_nodes())
>  		nid = NUMA_NO_NODE;
>  
>  	if (nid > 0 &&
> @@ -448,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
>  		nid = of_read_number(&aa.arrays[index], 1);
>  
> -		if (nid == 0xffff || nid >= MAX_NUMNODES)
> +		if (nid == 0xffff || nid >= num_possible_nodes())
>  			nid = default_nid;
>  
>  		if (nid > 0) {

Yes, looks fine to me.

Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>

^ permalink raw reply

* Re: [v3 02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR
From: Athira Rajeev @ 2020-07-22  2:09 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: ego, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan, acme, jolsa,
	linuxppc-dev
In-Reply-To: <20200721035420.GA3819606@thinks.paulus.ozlabs.org>

[-- Attachment #1: Type: text/plain, Size: 3591 bytes --]



> On 21-Jul-2020, at 9:24 AM, Paul Mackerras <paulus@ozlabs.org> wrote:
> 
> On Fri, Jul 17, 2020 at 10:38:14AM -0400, Athira Rajeev wrote:
>> Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
>> in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs
>> Split this to give mmcra and mmcrs its own entries in vcpu and
>> use a flat array for mmcr0 to mmcr2. This patch implements this
>> cleanup to make code easier to read.
> 
> Changing the way KVM stores these values internally is fine, but
> changing the user ABI is not.  This part:
> 
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 264e266..e55d847 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -510,8 +510,8 @@ struct kvm_ppc_cpu_char {
>> 
>> #define KVM_REG_PPC_MMCR0	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
>> #define KVM_REG_PPC_MMCR1	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
>> -#define KVM_REG_PPC_MMCRA	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
>> -#define KVM_REG_PPC_MMCR2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
>> +#define KVM_REG_PPC_MMCR2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
>> +#define KVM_REG_PPC_MMCRA	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
> 
> means that existing userspace programs that used to work would now be
> broken.  That is not acceptable (breaking the user ABI is only ever
> acceptable with a very compelling reason).  So NAK to this part of the
> patch.

Hi Paul

Thanks for checking the patch. I understood your point on user ABI breakage that this particular change can cause.
I will retain original KVM_REG_PPC_MMCRA and KVM_REG_PPC_MMCR2 order in `kvm.h`
And with that, additionally I will need below change ( on top of current patch ) for my clean up updates for kvm cpu MMCR to work,
Because now mmcra and mmcrs will have its own entries in vcpu and is not part of the mmcr[] array
Please suggest if this looks good

Thanks
Athira 

—
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3f90eee261fc..b10bb404f0d5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1679,10 +1679,13 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
        case KVM_REG_PPC_UAMOR:
                *val = get_reg_val(id, vcpu->arch.uamor);
                break;
-       case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR2:
+       case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR1:
                i = id - KVM_REG_PPC_MMCR0;
                *val = get_reg_val(id, vcpu->arch.mmcr[i]);
                break;
+       case KVM_REG_PPC_MMCR2:
+               *val = get_reg_val(id, vcpu->arch.mmcr[2]);
+               break;
        case KVM_REG_PPC_MMCRA:
                *val = get_reg_val(id, vcpu->arch.mmcra);
                break;
@@ -1906,10 +1909,13 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
        case KVM_REG_PPC_UAMOR:
                vcpu->arch.uamor = set_reg_val(id, *val);
                break;
-       case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR2:
+       case KVM_REG_PPC_MMCR0 ... KVM_REG_PPC_MMCR1:
                i = id - KVM_REG_PPC_MMCR0;
                vcpu->arch.mmcr[i] = set_reg_val(id, *val);
                break;
+       case KVM_REG_PPC_MMCR2:
+               vcpu->arch.mmcr[2] = set_reg_val(id, *val);
+               break;
        case KVM_REG_PPC_MMCRA:
                vcpu->arch.mmcra = set_reg_val(id, *val);
                break;
—


> 
> Regards,
> Paul.


[-- Attachment #2: Type: text/html, Size: 11048 bytes --]

^ permalink raw reply related

* Re: [v3 01/15] powerpc/perf: Update cpu_hw_event to use `struct` for storing MMCR registers
From: Athira Rajeev @ 2020-07-22  2:15 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan,
	acme, jolsa, linuxppc-dev
In-Reply-To: <CACzsE9rnE_z-uEw2tyvdFDVzkOeBKJoZiTHD9fSRm9d8fVBr5w@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 25536 bytes --]



> On 21-Jul-2020, at 9:12 AM, Jordan Niethe <jniethe5@gmail.com> wrote:
> 
> On Sat, Jul 18, 2020 at 12:48 AM Athira Rajeev
> <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> wrote:
>> 
>> core-book3s currently uses array to store the MMCR registers as part
>> of per-cpu `cpu_hw_events`. This patch does a clean up to use `struct`
>> to store mmcr regs instead of array. This will make code easier to read
>> and reduces chance of any subtle bug that may come in the future, say
>> when new registers are added. Patch updates all relevant code that was
>> using MMCR array ( cpuhw->mmcr[x]) to use newly introduced `struct`.
>> This includes the PMU driver code for supported platforms (power5
>> to power9) and ISA macros for counter support functions.
>> 
>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/perf_event_server.h | 10 ++++--
>> arch/powerpc/perf/core-book3s.c              | 53 +++++++++++++---------------
>> arch/powerpc/perf/isa207-common.c            | 20 +++++------
>> arch/powerpc/perf/isa207-common.h            |  4 +--
>> arch/powerpc/perf/mpc7450-pmu.c              | 21 +++++++----
>> arch/powerpc/perf/power5+-pmu.c              | 17 ++++-----
>> arch/powerpc/perf/power5-pmu.c               | 17 ++++-----
>> arch/powerpc/perf/power6-pmu.c               | 16 ++++-----
>> arch/powerpc/perf/power7-pmu.c               | 17 ++++-----
>> arch/powerpc/perf/ppc970-pmu.c               | 24 ++++++-------
>> 10 files changed, 105 insertions(+), 94 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
>> index 3e9703f..f9a3668 100644
>> --- a/arch/powerpc/include/asm/perf_event_server.h
>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>> @@ -17,6 +17,12 @@
>> 
>> struct perf_event;
>> 
>> +struct mmcr_regs {
>> +       unsigned long mmcr0;
>> +       unsigned long mmcr1;
>> +       unsigned long mmcr2;
>> +       unsigned long mmcra;
>> +};
>> /*
>>  * This struct provides the constants and functions needed to
>>  * describe the PMU on a particular POWER-family CPU.
>> @@ -28,7 +34,7 @@ struct power_pmu {
>>        unsigned long   add_fields;
>>        unsigned long   test_adder;
>>        int             (*compute_mmcr)(u64 events[], int n_ev,
>> -                               unsigned int hwc[], unsigned long mmcr[],
>> +                               unsigned int hwc[], struct mmcr_regs *mmcr,
>>                                struct perf_event *pevents[]);
>>        int             (*get_constraint)(u64 event_id, unsigned long *mskp,
>>                                unsigned long *valp);
>> @@ -41,7 +47,7 @@ struct power_pmu {
>>        unsigned long   group_constraint_val;
>>        u64             (*bhrb_filter_map)(u64 branch_sample_type);
>>        void            (*config_bhrb)(u64 pmu_bhrb_filter);
>> -       void            (*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
>> +       void            (*disable_pmc)(unsigned int pmc, struct mmcr_regs *mmcr);
>>        int             (*limited_pmc_event)(u64 event_id);
>>        u32             flags;
>>        const struct attribute_group    **attr_groups;
>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index cd6a742..18b1b6a 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -37,12 +37,7 @@ struct cpu_hw_events {
>>        struct perf_event *event[MAX_HWEVENTS];
>>        u64 events[MAX_HWEVENTS];
>>        unsigned int flags[MAX_HWEVENTS];
>> -       /*
>> -        * The order of the MMCR array is:
>> -        *  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2
>> -        *  - 32-bit, MMCR0, MMCR1, MMCR2
>> -        */
>> -       unsigned long mmcr[4];
>> +       struct mmcr_regs mmcr;
>>        struct perf_event *limited_counter[MAX_LIMITED_HWCOUNTERS];
>>        u8  limited_hwidx[MAX_LIMITED_HWCOUNTERS];
>>        u64 alternatives[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
>> @@ -121,7 +116,7 @@ static void ebb_event_add(struct perf_event *event) { }
>> static void ebb_switch_out(unsigned long mmcr0) { }
>> static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
>> {
>> -       return cpuhw->mmcr[0];
>> +       return cpuhw->mmcr.mmcr0;
>> }
>> 
>> static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
>> @@ -590,7 +585,7 @@ static void ebb_switch_out(unsigned long mmcr0)
>> 
>> static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
>> {
>> -       unsigned long mmcr0 = cpuhw->mmcr[0];
>> +       unsigned long mmcr0 = cpuhw->mmcr.mmcr0;
>> 
>>        if (!ebb)
>>                goto out;
>> @@ -624,7 +619,7 @@ static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
>>         * unfreeze counters, it should not set exclude_xxx in its events and
>>         * instead manage the MMCR2 entirely by itself.
>>         */
>> -       mtspr(SPRN_MMCR2, cpuhw->mmcr[3] | current->thread.mmcr2);
>> +       mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2 | current->thread.mmcr2);
>> out:
>>        return mmcr0;
>> }
>> @@ -1232,9 +1227,9 @@ static void power_pmu_disable(struct pmu *pmu)
>>                /*
>>                 * Disable instruction sampling if it was enabled
>>                 */
>> -               if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) {
>> +               if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE) {
>>                        mtspr(SPRN_MMCRA,
>> -                             cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
>> +                             cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>>                        mb();
>>                        isync();
>>                }
>> @@ -1308,18 +1303,18 @@ static void power_pmu_enable(struct pmu *pmu)
>>         * (possibly updated for removal of events).
>>         */
>>        if (!cpuhw->n_added) {
>> -               mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
>> -               mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
>> +               mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>> +               mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
>>                goto out_enable;
>>        }
>> 
>>        /*
>>         * Clear all MMCR settings and recompute them for the new set of events.
>>         */
>> -       memset(cpuhw->mmcr, 0, sizeof(cpuhw->mmcr));
>> +       memset(&cpuhw->mmcr, 0, sizeof(cpuhw->mmcr));
>> 
>>        if (ppmu->compute_mmcr(cpuhw->events, cpuhw->n_events, hwc_index,
>> -                              cpuhw->mmcr, cpuhw->event)) {
>> +                              &cpuhw->mmcr, cpuhw->event)) {
>>                /* shouldn't ever get here */
>>                printk(KERN_ERR "oops compute_mmcr failed\n");
>>                goto out;
>> @@ -1333,11 +1328,11 @@ static void power_pmu_enable(struct pmu *pmu)
>>                 */
>>                event = cpuhw->event[0];
>>                if (event->attr.exclude_user)
>> -                       cpuhw->mmcr[0] |= MMCR0_FCP;
>> +                       cpuhw->mmcr.mmcr0 |= MMCR0_FCP;
>>                if (event->attr.exclude_kernel)
>> -                       cpuhw->mmcr[0] |= freeze_events_kernel;
>> +                       cpuhw->mmcr.mmcr0 |= freeze_events_kernel;
>>                if (event->attr.exclude_hv)
>> -                       cpuhw->mmcr[0] |= MMCR0_FCHV;
>> +                       cpuhw->mmcr.mmcr0 |= MMCR0_FCHV;
>>        }
>> 
>>        /*
>> @@ -1346,12 +1341,12 @@ static void power_pmu_enable(struct pmu *pmu)
>>         * Then unfreeze the events.
>>         */
>>        ppc_set_pmu_inuse(1);
>> -       mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
>> -       mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
>> -       mtspr(SPRN_MMCR0, (cpuhw->mmcr[0] & ~(MMCR0_PMC1CE | MMCR0_PMCjCE))
>> +       mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>> +       mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
>> +       mtspr(SPRN_MMCR0, (cpuhw->mmcr.mmcr0 & ~(MMCR0_PMC1CE | MMCR0_PMCjCE))
>>                                | MMCR0_FC);
>>        if (ppmu->flags & PPMU_ARCH_207S)
>> -               mtspr(SPRN_MMCR2, cpuhw->mmcr[3]);
>> +               mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2);
>> 
>>        /*
>>         * Read off any pre-existing events that need to move
>> @@ -1402,7 +1397,7 @@ static void power_pmu_enable(struct pmu *pmu)
>>                perf_event_update_userpage(event);
>>        }
>>        cpuhw->n_limited = n_lim;
>> -       cpuhw->mmcr[0] |= MMCR0_PMXE | MMCR0_FCECE;
>> +       cpuhw->mmcr.mmcr0 |= MMCR0_PMXE | MMCR0_FCECE;
>> 
>>  out_enable:
>>        pmao_restore_workaround(ebb);
>> @@ -1418,9 +1413,9 @@ static void power_pmu_enable(struct pmu *pmu)
>>        /*
>>         * Enable instruction sampling if necessary
>>         */
>> -       if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) {
>> +       if (cpuhw->mmcr.mmcra & MMCRA_SAMPLE_ENABLE) {
>>                mb();
>> -               mtspr(SPRN_MMCRA, cpuhw->mmcr[2]);
>> +               mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra);
>>        }
>> 
>>  out:
>> @@ -1550,7 +1545,7 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>>                                cpuhw->flags[i-1] = cpuhw->flags[i];
>>                        }
>>                        --cpuhw->n_events;
>> -                       ppmu->disable_pmc(event->hw.idx - 1, cpuhw->mmcr);
>> +                       ppmu->disable_pmc(event->hw.idx - 1, &cpuhw->mmcr);
>>                        if (event->hw.idx) {
>>                                write_pmc(event->hw.idx, 0);
>>                                event->hw.idx = 0;
>> @@ -1571,7 +1566,7 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>>        }
>>        if (cpuhw->n_events == 0) {
>>                /* disable exceptions if no events are running */
>> -               cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
>> +               cpuhw->mmcr.mmcr0 &= ~(MMCR0_PMXE | MMCR0_FCECE);
>>        }
>> 
>>        if (has_branch_stack(event))
>> @@ -2240,7 +2235,7 @@ static void __perf_event_interrupt(struct pt_regs *regs)
>>         * XXX might want to use MSR.PM to keep the events frozen until
>>         * we get back out of this interrupt.
>>         */
>> -       write_mmcr0(cpuhw, cpuhw->mmcr[0]);
>> +       write_mmcr0(cpuhw, cpuhw->mmcr.mmcr0);
>> 
>>        if (nmi)
>>                nmi_exit();
>> @@ -2262,7 +2257,7 @@ static int power_pmu_prepare_cpu(unsigned int cpu)
>> 
>>        if (ppmu) {
>>                memset(cpuhw, 0, sizeof(*cpuhw));
>> -               cpuhw->mmcr[0] = MMCR0_FC;
>> +               cpuhw->mmcr.mmcr0 = MMCR0_FC;
>>        }
>>        return 0;
>> }
>> diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
>> index 4c86da5..2fe63f2 100644
>> --- a/arch/powerpc/perf/isa207-common.c
>> +++ b/arch/powerpc/perf/isa207-common.c
>> @@ -363,7 +363,7 @@ int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp)
>> }
>> 
>> int isa207_compute_mmcr(u64 event[], int n_ev,
>> -                              unsigned int hwc[], unsigned long mmcr[],
>> +                              unsigned int hwc[], struct mmcr_regs *mmcr,
>>                               struct perf_event *pevents[])
>> {
>>        unsigned long mmcra, mmcr1, mmcr2, unit, combine, psel, cache, val;
>> @@ -464,30 +464,30 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
>>        }
>> 
>>        /* Return MMCRx values */
>> -       mmcr[0] = 0;
>> +       mmcr->mmcr0 = 0;
>> 
>>        /* pmc_inuse is 1-based */
>>        if (pmc_inuse & 2)
>> -               mmcr[0] = MMCR0_PMC1CE;
>> +               mmcr->mmcr0 = MMCR0_PMC1CE;
>> 
>>        if (pmc_inuse & 0x7c)
>> -               mmcr[0] |= MMCR0_PMCjCE;
>> +               mmcr->mmcr0 |= MMCR0_PMCjCE;
>> 
>>        /* If we're not using PMC 5 or 6, freeze them */
>>        if (!(pmc_inuse & 0x60))
>> -               mmcr[0] |= MMCR0_FC56;
>> +               mmcr->mmcr0 |= MMCR0_FC56;
>> 
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcra;
>> -       mmcr[3] = mmcr2;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcra = mmcra;
>> +       mmcr->mmcr2 = mmcr2;
>> 
>>        return 0;
>> }
>> 
>> -void isa207_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +void isa207_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>>        if (pmc <= 3)
>> -               mmcr[1] &= ~(0xffUL << MMCR1_PMCSEL_SHIFT(pmc + 1));
>> +               mmcr->mmcr1 &= ~(0xffUL << MMCR1_PMCSEL_SHIFT(pmc + 1));
>> }
>> 
>> static int find_alternative(u64 event, const unsigned int ev_alt[][MAX_ALT], int size)
>> diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
>> index 63fd4f3..df968fd 100644
>> --- a/arch/powerpc/perf/isa207-common.h
>> +++ b/arch/powerpc/perf/isa207-common.h
>> @@ -217,9 +217,9 @@
>> 
>> int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp);
>> int isa207_compute_mmcr(u64 event[], int n_ev,
>> -                               unsigned int hwc[], unsigned long mmcr[],
>> +                               unsigned int hwc[], struct mmcr_regs *mmcr,
>>                                struct perf_event *pevents[]);
>> -void isa207_disable_pmc(unsigned int pmc, unsigned long mmcr[]);
>> +void isa207_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr);
>> int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
>>                                        const unsigned int ev_alt[][MAX_ALT]);
>> void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>> diff --git a/arch/powerpc/perf/mpc7450-pmu.c b/arch/powerpc/perf/mpc7450-pmu.c
>> index 4d5ef92..826de25 100644
>> --- a/arch/powerpc/perf/mpc7450-pmu.c
>> +++ b/arch/powerpc/perf/mpc7450-pmu.c
>> @@ -257,7 +257,7 @@ static int mpc7450_get_alternatives(u64 event, unsigned int flags, u64 alt[])
>>  * Compute MMCR0/1/2 values for a set of events.
>>  */
>> static int mpc7450_compute_mmcr(u64 event[], int n_ev, unsigned int hwc[],
>> -                               unsigned long mmcr[],
>> +                               struct mmcr_regs *mmcr,
>>                                struct perf_event *pevents[])
>> {
>>        u8 event_index[N_CLASSES][N_COUNTER];
>> @@ -321,9 +321,16 @@ static int mpc7450_compute_mmcr(u64 event[], int n_ev, unsigned int hwc[],
>>                mmcr0 |= MMCR0_PMCnCE;
>> 
>>        /* Return MMCRx values */
>> -       mmcr[0] = mmcr0;
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcr2;
>> +       mmcr->mmcr0 = mmcr0;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcr2 = mmcr2;
> Will this mmcr->mmcr2 be used anywere, or will it always use mmcr->mmcra?

Hi Jordan

We will be actually using mmcr->mmcra in the core-book3s for mpc7450-pmu.c
I have still assigned mmcr->mmcr2 so that it will work for any future changes if made in corebook

>> +       /*
>> +        * 32-bit doesn't have an MMCRA and uses SPRN_MMCR2 to define
>> +        * SPRN_MMCRA. So assign mmcra of cpu_hw_events with `mmcr2`
>> +        * value to ensure that any write to this SPRN_MMCRA will
>> +        * use mmcr2 value.
>> +        */
> Something like this comment would probably be useful to near the struct mmcr.

Sure, I will add this information
Thanks for your suggestion

Athira
>> +       mmcr->mmcra = mmcr2;
>>        return 0;
>> }
>> 
>> @@ -331,12 +338,12 @@ static int mpc7450_compute_mmcr(u64 event[], int n_ev, unsigned int hwc[],
>>  * Disable counting by a PMC.
>>  * Note that the pmc argument is 0-based here, not 1-based.
>>  */
>> -static void mpc7450_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +static void mpc7450_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>>        if (pmc <= 1)
>> -               mmcr[0] &= ~(pmcsel_mask[pmc] << pmcsel_shift[pmc]);
>> +               mmcr->mmcr0 &= ~(pmcsel_mask[pmc] << pmcsel_shift[pmc]);
>>        else
>> -               mmcr[1] &= ~(pmcsel_mask[pmc] << pmcsel_shift[pmc]);
>> +               mmcr->mmcr1 &= ~(pmcsel_mask[pmc] << pmcsel_shift[pmc]);
>> }
>> 
>> static int mpc7450_generic_events[] = {
>> diff --git a/arch/powerpc/perf/power5+-pmu.c b/arch/powerpc/perf/power5+-pmu.c
>> index f857454..5f0821e 100644
>> --- a/arch/powerpc/perf/power5+-pmu.c
>> +++ b/arch/powerpc/perf/power5+-pmu.c
>> @@ -448,7 +448,8 @@ static int power5p_marked_instr_event(u64 event)
>> }
>> 
>> static int power5p_compute_mmcr(u64 event[], int n_ev,
>> -                               unsigned int hwc[], unsigned long mmcr[], struct perf_event *pevents[])
>> +                               unsigned int hwc[], struct mmcr_regs *mmcr,
>> +                               struct perf_event *pevents[])
>> {
>>        unsigned long mmcr1 = 0;
>>        unsigned long mmcra = 0;
>> @@ -586,20 +587,20 @@ static int power5p_compute_mmcr(u64 event[], int n_ev,
>>        }
>> 
>>        /* Return MMCRx values */
>> -       mmcr[0] = 0;
>> +       mmcr->mmcr0 = 0;
>>        if (pmc_inuse & 1)
>> -               mmcr[0] = MMCR0_PMC1CE;
>> +               mmcr->mmcr0 = MMCR0_PMC1CE;
>>        if (pmc_inuse & 0x3e)
>> -               mmcr[0] |= MMCR0_PMCjCE;
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcra;
>> +               mmcr->mmcr0 |= MMCR0_PMCjCE;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcra = mmcra;
>>        return 0;
>> }
>> 
>> -static void power5p_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +static void power5p_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>>        if (pmc <= 3)
>> -               mmcr[1] &= ~(0x7fUL << MMCR1_PMCSEL_SH(pmc));
>> +               mmcr->mmcr1 &= ~(0x7fUL << MMCR1_PMCSEL_SH(pmc));
>> }
>> 
>> static int power5p_generic_events[] = {
>> diff --git a/arch/powerpc/perf/power5-pmu.c b/arch/powerpc/perf/power5-pmu.c
>> index da52eca..426021d 100644
>> --- a/arch/powerpc/perf/power5-pmu.c
>> +++ b/arch/powerpc/perf/power5-pmu.c
>> @@ -379,7 +379,8 @@ static int power5_marked_instr_event(u64 event)
>> }
>> 
>> static int power5_compute_mmcr(u64 event[], int n_ev,
>> -                              unsigned int hwc[], unsigned long mmcr[], struct perf_event *pevents[])
>> +                              unsigned int hwc[], struct mmcr_regs *mmcr,
>> +                              struct perf_event *pevents[])
>> {
>>        unsigned long mmcr1 = 0;
>>        unsigned long mmcra = MMCRA_SDAR_DCACHE_MISS | MMCRA_SDAR_ERAT_MISS;
>> @@ -528,20 +529,20 @@ static int power5_compute_mmcr(u64 event[], int n_ev,
>>        }
>> 
>>        /* Return MMCRx values */
>> -       mmcr[0] = 0;
>> +       mmcr->mmcr0 = 0;
>>        if (pmc_inuse & 1)
>> -               mmcr[0] = MMCR0_PMC1CE;
>> +               mmcr->mmcr0 = MMCR0_PMC1CE;
>>        if (pmc_inuse & 0x3e)
>> -               mmcr[0] |= MMCR0_PMCjCE;
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcra;
>> +               mmcr->mmcr0 |= MMCR0_PMCjCE;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcra = mmcra;
>>        return 0;
>> }
>> 
>> -static void power5_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +static void power5_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>>        if (pmc <= 3)
>> -               mmcr[1] &= ~(0x7fUL << MMCR1_PMCSEL_SH(pmc));
>> +               mmcr->mmcr1 &= ~(0x7fUL << MMCR1_PMCSEL_SH(pmc));
>> }
>> 
>> static int power5_generic_events[] = {
>> diff --git a/arch/powerpc/perf/power6-pmu.c b/arch/powerpc/perf/power6-pmu.c
>> index 3929cac..e343a51 100644
>> --- a/arch/powerpc/perf/power6-pmu.c
>> +++ b/arch/powerpc/perf/power6-pmu.c
>> @@ -171,7 +171,7 @@ static int power6_marked_instr_event(u64 event)
>>  * Assign PMC numbers and compute MMCR1 value for a set of events
>>  */
>> static int p6_compute_mmcr(u64 event[], int n_ev,
>> -                          unsigned int hwc[], unsigned long mmcr[], struct perf_event *pevents[])
>> +                          unsigned int hwc[], struct mmcr_regs *mmcr, struct perf_event *pevents[])
>> {
>>        unsigned long mmcr1 = 0;
>>        unsigned long mmcra = MMCRA_SDAR_DCACHE_MISS | MMCRA_SDAR_ERAT_MISS;
>> @@ -243,13 +243,13 @@ static int p6_compute_mmcr(u64 event[], int n_ev,
>>                if (pmc < 4)
>>                        mmcr1 |= (unsigned long)psel << MMCR1_PMCSEL_SH(pmc);
>>        }
>> -       mmcr[0] = 0;
>> +       mmcr->mmcr0 = 0;
>>        if (pmc_inuse & 1)
>> -               mmcr[0] = MMCR0_PMC1CE;
>> +               mmcr->mmcr0 = MMCR0_PMC1CE;
>>        if (pmc_inuse & 0xe)
>> -               mmcr[0] |= MMCR0_PMCjCE;
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcra;
>> +               mmcr->mmcr0 |= MMCR0_PMCjCE;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcra = mmcra;
>>        return 0;
>> }
>> 
>> @@ -457,11 +457,11 @@ static int p6_get_alternatives(u64 event, unsigned int flags, u64 alt[])
>>        return nalt;
>> }
>> 
>> -static void p6_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +static void p6_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>>        /* Set PMCxSEL to 0 to disable PMCx */
>>        if (pmc <= 3)
>> -               mmcr[1] &= ~(0xffUL << MMCR1_PMCSEL_SH(pmc));
>> +               mmcr->mmcr1 &= ~(0xffUL << MMCR1_PMCSEL_SH(pmc));
>> }
>> 
>> static int power6_generic_events[] = {
>> diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
>> index a137813..3152336 100644
>> --- a/arch/powerpc/perf/power7-pmu.c
>> +++ b/arch/powerpc/perf/power7-pmu.c
>> @@ -242,7 +242,8 @@ static int power7_marked_instr_event(u64 event)
>> }
>> 
>> static int power7_compute_mmcr(u64 event[], int n_ev,
>> -                              unsigned int hwc[], unsigned long mmcr[], struct perf_event *pevents[])
>> +                              unsigned int hwc[], struct mmcr_regs *mmcr,
>> +                              struct perf_event *pevents[])
>> {
>>        unsigned long mmcr1 = 0;
>>        unsigned long mmcra = MMCRA_SDAR_DCACHE_MISS | MMCRA_SDAR_ERAT_MISS;
>> @@ -298,20 +299,20 @@ static int power7_compute_mmcr(u64 event[], int n_ev,
>>        }
>> 
>>        /* Return MMCRx values */
>> -       mmcr[0] = 0;
>> +       mmcr->mmcr0 = 0;
>>        if (pmc_inuse & 1)
>> -               mmcr[0] = MMCR0_PMC1CE;
>> +               mmcr->mmcr0 = MMCR0_PMC1CE;
>>        if (pmc_inuse & 0x3e)
>> -               mmcr[0] |= MMCR0_PMCjCE;
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcra;
>> +               mmcr->mmcr0 |= MMCR0_PMCjCE;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcra = mmcra;
>>        return 0;
>> }
>> 
>> -static void power7_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +static void power7_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>>        if (pmc <= 3)
>> -               mmcr[1] &= ~(0xffUL << MMCR1_PMCSEL_SH(pmc));
>> +               mmcr->mmcr1 &= ~(0xffUL << MMCR1_PMCSEL_SH(pmc));
>> }
>> 
>> static int power7_generic_events[] = {
>> diff --git a/arch/powerpc/perf/ppc970-pmu.c b/arch/powerpc/perf/ppc970-pmu.c
>> index 4035d93..89a90ab 100644
>> --- a/arch/powerpc/perf/ppc970-pmu.c
>> +++ b/arch/powerpc/perf/ppc970-pmu.c
>> @@ -253,7 +253,8 @@ static int p970_get_alternatives(u64 event, unsigned int flags, u64 alt[])
>> }
>> 
>> static int p970_compute_mmcr(u64 event[], int n_ev,
>> -                            unsigned int hwc[], unsigned long mmcr[], struct perf_event *pevents[])
>> +                            unsigned int hwc[], struct mmcr_regs *mmcr,
>> +                            struct perf_event *pevents[])
>> {
>>        unsigned long mmcr0 = 0, mmcr1 = 0, mmcra = 0;
>>        unsigned int pmc, unit, byte, psel;
>> @@ -393,27 +394,26 @@ static int p970_compute_mmcr(u64 event[], int n_ev,
>>        mmcra |= 0x2000;        /* mark only one IOP per PPC instruction */
>> 
>>        /* Return MMCRx values */
>> -       mmcr[0] = mmcr0;
>> -       mmcr[1] = mmcr1;
>> -       mmcr[2] = mmcra;
>> +       mmcr->mmcr0 = mmcr0;
>> +       mmcr->mmcr1 = mmcr1;
>> +       mmcr->mmcra = mmcra;
>>        return 0;
>> }
>> 
>> -static void p970_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>> +static void p970_disable_pmc(unsigned int pmc, struct mmcr_regs *mmcr)
>> {
>> -       int shift, i;
>> +       int shift;
>> 
>> +       /*
>> +        * Setting the PMCxSEL field to 0x08 disables PMC x.
>> +        */
>>        if (pmc <= 1) {
>>                shift = MMCR0_PMC1SEL_SH - 7 * pmc;
>> -               i = 0;
>> +               mmcr->mmcr0 = (mmcr->mmcr0 & ~(0x1fUL << shift)) | (0x08UL << shift);
>>        } else {
>>                shift = MMCR1_PMC3SEL_SH - 5 * (pmc - 2);
>> -               i = 1;
>> +               mmcr->mmcr1 = (mmcr->mmcr1 & ~(0x1fUL << shift)) | (0x08UL << shift);
>>        }
>> -       /*
>> -        * Setting the PMCxSEL field to 0x08 disables PMC x.
>> -        */
>> -       mmcr[i] = (mmcr[i] & ~(0x1fUL << shift)) | (0x08UL << shift);
>> }
>> 
>> static int ppc970_generic_events[] = {
>> --
>> 1.8.3.1


[-- Attachment #2: Type: text/html, Size: 56934 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] powerpc/numa: Remove a redundant variable
From: Nathan Lynch @ 2020-07-22  3:28 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Tyrel Datwyler, linuxppc-dev
In-Reply-To: <20200715120534.3673-2-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> In of_drconf_to_nid_single() default_nid always refers to NUMA_NO_NODE.
> Hence replace it with NUMA_NO_NODE.
>
> No functional changes.
>

...

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index a2c5fe0d0cad..b066d89c2975 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -432,16 +432,15 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
>  static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  {
>  	struct assoc_arrays aa = { .arrays = NULL };
> -	int default_nid = NUMA_NO_NODE;
> -	int nid = default_nid;
> +	int nid = NUMA_NO_NODE;
>  	int rc, index;
>  
>  	if ((min_common_depth < 0) || !numa_enabled)
> -		return default_nid;
> +		return NUMA_NO_NODE;
>  
>  	rc = of_get_assoc_arrays(&aa);
>  	if (rc)
> -		return default_nid;
> +		return NUMA_NO_NODE;
>  
>  	if (min_common_depth <= aa.array_sz &&
>  	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
> @@ -449,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  		nid = of_read_number(&aa.arrays[index], 1);
>  
>  		if (nid == 0xffff || nid >= num_possible_nodes())
> -			nid = default_nid;
> +			nid = NUMA_NO_NODE;
>  
>  		if (nid > 0) {
>  			index = lmb->aa_index * aa.array_sz;

Doesn't seem like an improvement to me, sorry. Admittedly it's a small
point, but a local variable named "default_nid" communicates that it's a
default or fallback value. That information is lost with this change.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox