LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 2/3] powerpc/pci: unmap legacy INTx interrupts of passthrough IO adapters
From: Michael Ellerman @ 2020-05-28 13:25 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <44126659-0490-4466-7f08-1726a7f0ce6e@kaod.org>

Cédric Le Goater <clg@kaod.org> writes:
> On 4/29/20 9:51 AM, Cédric Le Goater wrote:
>> When a passthrough IO adapter is removed from a pseries machine using
>> hash MMU and the XIVE interrupt mode, the POWER hypervisor, pHyp,
>> expects the guest OS to have cleared all page table entries related to
>> the adapter. If some are still present, the RTAS call which isolates
>> the PCI slot returns error 9001 "valid outstanding translations" and
>> the removal of the IO adapter fails.
>> 
>> INTx interrupt numbers need special care because Linux maps the
>> interrupts automatically in the Linux interrupt number space if they
>> are presented in the device tree node describing the IO adapter. These
>> interrupts are not un-mapped automatically and in case of an hot-plug
>> adapter, the PCI hot-plug layer needs to handle the cleanup to make
>> sure that all the page table entries of the XIVE ESB pages are
>> cleared.
>
> It seems this patch needs more digging to make sure we are handling
> the IRQ unmapping in the correct PCI handler. Could you please keep
> it back for the moment ? 

Yep no worries.

cheers

^ permalink raw reply

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
From: Alex Ghiti @ 2020-05-28 13:07 UTC (permalink / raw)
  To: Zong Li
  Cc: Albert Ou, Anup Patel, linux-kernel@vger.kernel.org List,
	Atish Patra, Paul Mackerras, Paul Walmsley, Palmer Dabbelt,
	linux-riscv, linuxppc-dev
In-Reply-To: <28e9d2ab-074c-90c2-73b5-a85d30f828cc@ghiti.fr>

Hi Zong,

Le 5/27/20 à 3:29 AM, Alex Ghiti a écrit :
> Le 5/27/20 à 2:05 AM, Zong Li a écrit :
>> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>>> Hi Zong,
>>>
>>> Le 5/26/20 à 5:43 AM, Zong Li a écrit :
>>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>>> This is a preparatory patch for relocatable kernel.
>>>>>
>>>>> The kernel used to be linked at PAGE_OFFSET address and used to be 
>>>>> loaded
>>>>> physically at the beginning of the main memory. Therefore, we 
>>>>> could use
>>>>> the linear mapping for the kernel mapping.
>>>>>
>>>>> But the relocated kernel base address will be different from 
>>>>> PAGE_OFFSET
>>>>> and since in the linear mapping, two different virtual addresses 
>>>>> cannot
>>>>> point to the same physical address, the kernel mapping needs to 
>>>>> lie outside
>>>>> the linear mapping.
>>>>>
>>>>> In addition, because modules and BPF must be close to the kernel 
>>>>> (inside
>>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone 
>>>>> minus
>>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>>> allocations from the kernel could get all the +-2GB window around the
>>>>> kernel which would prevent new modules and BPF programs to be loaded.
>>>>>
>>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>>> ---
>>>>>    arch/riscv/boot/loader.lds.S     |  3 +-
>>>>>    arch/riscv/include/asm/page.h    | 10 +++++-
>>>>>    arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>>>>    arch/riscv/kernel/head.S         |  3 +-
>>>>>    arch/riscv/kernel/module.c       |  4 +--
>>>>>    arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>>    arch/riscv/mm/init.c             | 58 
>>>>> +++++++++++++++++++++++++-------
>>>>>    arch/riscv/mm/physaddr.c         |  2 +-
>>>>>    8 files changed, 87 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/boot/loader.lds.S 
>>>>> b/arch/riscv/boot/loader.lds.S
>>>>> index 47a5003c2e28..62d94696a19c 100644
>>>>> --- a/arch/riscv/boot/loader.lds.S
>>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>>> @@ -1,13 +1,14 @@
>>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>>
>>>>>    #include <asm/page.h>
>>>>> +#include <asm/pgtable.h>
>>>>>
>>>>>    OUTPUT_ARCH(riscv)
>>>>>    ENTRY(_start)
>>>>>
>>>>>    SECTIONS
>>>>>    {
>>>>> -       . = PAGE_OFFSET;
>>>>> +       . = KERNEL_LINK_ADDR;
>>>>>
>>>>>           .payload : {
>>>>>                   *(.payload)
>>>>> diff --git a/arch/riscv/include/asm/page.h 
>>>>> b/arch/riscv/include/asm/page.h
>>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>>> --- a/arch/riscv/include/asm/page.h
>>>>> +++ b/arch/riscv/include/asm/page.h
>>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>>    extern unsigned long va_pa_offset;
>>>>> +extern unsigned long va_kernel_pa_offset;
>>>>>    extern unsigned long pfn_base;
>>>>>    #define ARCH_PFN_OFFSET                (pfn_base)
>>>>>    #else
>>>>>    #define va_pa_offset           0
>>>>> +#define va_kernel_pa_offset    0
>>>>>    #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>>>>    #endif /* CONFIG_MMU */
>>>>>
>>>>>    extern unsigned long max_low_pfn;
>>>>>    extern unsigned long min_low_pfn;
>>>>> +extern unsigned long kernel_virt_addr;
>>>>>
>>>>>    #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + 
>>>>> va_pa_offset))
>>>>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>>>>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - 
>>>>> va_pa_offset)
>>>>> +#define kernel_mapping_va_to_pa(x)     \
>>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>>> +#define __va_to_pa_nodebug(x)          \
>>>>> +       (((x) >= PAGE_OFFSET) ?         \
>>>>> +               linear_mapping_va_to_pa(x) : 
>>>>> kernel_mapping_va_to_pa(x))
>>>>>
>>>>>    #ifdef CONFIG_DEBUG_VIRTUAL
>>>>>    extern phys_addr_t __virt_to_phys(unsigned long x);
>>>>> diff --git a/arch/riscv/include/asm/pgtable.h 
>>>>> b/arch/riscv/include/asm/pgtable.h
>>>>> index 35b60035b6b0..25213cfaf680 100644
>>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>>> @@ -11,23 +11,29 @@
>>>>>
>>>>>    #include <asm/pgtable-bits.h>
>>>>>
>>>>> -#ifndef __ASSEMBLY__
>>>>> -
>>>>> -/* Page Upper Directory not used in RISC-V */
>>>>> -#include <asm-generic/pgtable-nopud.h>
>>>>> -#include <asm/page.h>
>>>>> -#include <asm/tlbflush.h>
>>>>> -#include <linux/mm_types.h>
>>>>> -
>>>>> -#ifdef CONFIG_MMU
>>>>> +#ifndef CONFIG_MMU
>>>>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>>>>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>>>>> +#else
>>>>> +/*
>>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range 
>>>>> around
>>>>> + * the kernel.
>>>>> + */
>>>>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>>>>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>>>>
>>>>>    #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>>    #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>>    #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>>
>>>>>    #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>>>>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>>>>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + 
>>>>> BPF_JIT_REGION_SIZE)
>>>> It seems to have a potential risk here, the region of bpf is
>>>> overlapping with kernel mapping, so if kernel size is bigger than
>>>> 128MB, bpf region would be occupied and run out by kernel mapping.
>> Is there the risk as I mentioned?
>
>
> Sorry I forgot to answer this one: I was confident that 128MB was 
> large enough for kernel
> and BPF. But I see no reason to leave this risk so I'll change 
> kernel_virt_addr for _end so
> BPF will have its 128MB reserved.
>
> Thanks !
>
> Alex
>
>
>>
>>>>> +
>>>>> +#ifdef CONFIG_64BIT
>>>>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>>>>> +#define VMALLOC_MODULE_END     VMALLOC_END
>>>>> +#endif
>>>>>
>>>> Although kernel_virt_addr is a fixed address now, I think it could be
>>>> changed for the purpose of relocatable or KASLR, so if
>>>> kernel_virt_addr is moved to far from VMALLOC_END than 2G, the region
>>>> of module would be too big.
>>>
>>> Yes you're right, that's wrong to allow modules to lie outside
>>> the 2G window, thanks for noticing.
>>>
>>>
>>>> In addition, the region of module could be
>>>> +-2G around the kernel, so we don't be limited in one direction as
>>>> before. It seems to me that the region of the module could be decided
>>>> at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
>>>> VMLLOC_MODULE_END is "&_start + 2G".
>>>
>>> I had tried that, but as we need to make sure BPF region is different
>>> from the module's
>>> that makes the macro definitions really cumbersome. I'll give a try
>>> again anyway. And
>>> I tried to use _end and _start here but it failed, I have to debug 
>>> this.


I gave more thought about that and it is actually not possible to use 
the 2GB
before and after the kernel: modules can call exported functions from 
each other,
so we need to make sure that the "distance" between 2 modules is at most 
2GB.
And I assume BPF comes with the same restrictions with respect to 
modules so the
kernel + BPF + modules must live in the same 2GB region.

I'll come with a v4 quickly,

Thanks,

Alex


>>>
>>>
>>>>    I'm not sure whether the size of
>>>> region of bpf has to be 128MB for some particular reason, if not,
>>>> maybe the region of bpf could be the same with module to avoid being
>>>> run out by module.
>>>
>>> On the contrary, BPF region must not be the same as module's since in
>>> that case,
>>> modules could take all the space and make BPF fail.
>> ok, I got it. Thanks for the explaining.
>>
>>
>>>
>>> Thanks for your review Zong,
>>>
>>>
>>> Alex
>>>
>>>
>>>>>    /*
>>>>>     * Roughly size the vmemmap space to be large enough to fit enough
>>>>> @@ -57,9 +63,16 @@
>>>>>    #define FIXADDR_SIZE     PGDIR_SIZE
>>>>>    #endif
>>>>>    #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>>> -
>>>>>    #endif
>>>>>
>>>>> +#ifndef __ASSEMBLY__
>>>>> +
>>>>> +/* Page Upper Directory not used in RISC-V */
>>>>> +#include <asm-generic/pgtable-nopud.h>
>>>>> +#include <asm/page.h>
>>>>> +#include <asm/tlbflush.h>
>>>>> +#include <linux/mm_types.h>
>>>>> +
>>>>>    #ifdef CONFIG_64BIT
>>>>>    #include <asm/pgtable-64.h>
>>>>>    #else
>>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>>> index 98a406474e7d..8f5bb7731327 100644
>>>>> --- a/arch/riscv/kernel/head.S
>>>>> +++ b/arch/riscv/kernel/head.S
>>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>>    #ifdef CONFIG_MMU
>>>>>    relocate:
>>>>>           /* Relocate return address */
>>>>> -       li a1, PAGE_OFFSET
>>>>> +       la a1, kernel_virt_addr
>>>>> +       REG_L a1, 0(a1)
>>>>>           la a2, _start
>>>>>           sub a1, a1, a2
>>>>>           add ra, ra, a1
>>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>>> --- a/arch/riscv/kernel/module.c
>>>>> +++ b/arch/riscv/kernel/module.c
>>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, 
>>>>> const char *strtab,
>>>>>    }
>>>>>
>>>>>    #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>>> -#define VMALLOC_MODULE_START \
>>>>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>>    void *module_alloc(unsigned long size)
>>>>>    {
>>>>>           return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>>> -                                   VMALLOC_END, GFP_KERNEL,
>>>>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>>>>                                       PAGE_KERNEL_EXEC, 0, 
>>>>> NUMA_NO_NODE,
>>>>> __builtin_return_address(0));
>>>>>    }
>>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S 
>>>>> b/arch/riscv/kernel/vmlinux.lds.S
>>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>>> @@ -4,7 +4,8 @@
>>>>>     * Copyright (C) 2017 SiFive
>>>>>     */
>>>>>
>>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>>> +#include <asm/pgtable.h>
>>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>>    #include <asm/vmlinux.lds.h>
>>>>>    #include <asm/page.h>
>>>>>    #include <asm/cache.h>
>>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>>> index 27a334106708..17f108baec4f 100644
>>>>> --- a/arch/riscv/mm/init.c
>>>>> +++ b/arch/riscv/mm/init.c
>>>>> @@ -22,6 +22,9 @@
>>>>>
>>>>>    #include "../kernel/head.h"
>>>>>
>>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>>> +
>>>>>    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>> __page_aligned_bss;
>>>>>    EXPORT_SYMBOL(empty_zero_page);
>>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>>    }
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>> +/* Offset between linear mapping virtual address and kernel load 
>>>>> address */
>>>>>    unsigned long va_pa_offset;
>>>>>    EXPORT_SYMBOL(va_pa_offset);
>>>>> +/* Offset between kernel mapping virtual address and kernel load 
>>>>> address */
>>>>> +unsigned long va_kernel_pa_offset;
>>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>>    unsigned long pfn_base;
>>>>>    EXPORT_SYMBOL(pfn_base);
>>>>>
>>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>>           if (mmu_enabled)
>>>>>                   return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>>
>>>>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>>           BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>>           return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>>    }
>>>>> @@ -372,14 +379,30 @@ static uintptr_t __init 
>>>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>>>    #error "setup_vm() is called from head.S before relocate so it 
>>>>> should not use absolute addressing."
>>>>>    #endif
>>>>>
>>>>> +static uintptr_t load_pa, load_sz;
>>>>> +
>>>>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>>>>> +{
>>>>> +       uintptr_t va, end_va;
>>>>> +
>>>>> +       end_va = kernel_virt_addr + load_sz;
>>>>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>>> +               create_pgd_mapping(pgdir, va,
>>>>> +                                  load_pa + (va - kernel_virt_addr),
>>>>> +                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +}
>>>>> +
>>>>>    asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>    {
>>>>>           uintptr_t va, end_va;
>>>>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>>>>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>>           uintptr_t map_size = best_map_size(load_pa, 
>>>>> MAX_EARLY_MAPPING_SIZE);
>>>>>
>>>>> +       load_pa = (uintptr_t)(&_start);
>>>>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>>>>> +
>>>>>           va_pa_offset = PAGE_OFFSET - load_pa;
>>>>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>>> +
>>>>>           pfn_base = PFN_DOWN(load_pa);
>>>>>
>>>>>           /*
>>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t 
>>>>> dtb_pa)
>>>>>           create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>>                              (uintptr_t)fixmap_pte, PMD_SIZE, 
>>>>> PAGE_TABLE);
>>>>>           /* Setup trampoline PGD and PMD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              (uintptr_t)trampoline_pmd, 
>>>>> PGDIR_SIZE, PAGE_TABLE);
>>>>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>>                              load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #else
>>>>>           /* Setup trampoline PGD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #endif
>>>>>
>>>>>           /*
>>>>> -        * Setup early PGD covering entire kernel which will allows
>>>>> +        * Setup early PGD covering entire kernel which will allow
>>>>>            * us to reach paging_init(). We map all memory banks later
>>>>>            * in setup_vm_final() below.
>>>>>            */
>>>>> -       end_va = PAGE_OFFSET + load_sz;
>>>>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>>> -               create_pgd_mapping(early_pg_dir, va,
>>>>> -                                  load_pa + (va - PAGE_OFFSET),
>>>>> -                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +       create_kernel_page_table(early_pg_dir, map_size);
>>>>>
>>>>>           /* Create fixed mapping for early FDT parsing */
>>>>>           end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>>           uintptr_t va, map_size;
>>>>>           phys_addr_t pa, start, end;
>>>>>           struct memblock_region *reg;
>>>>> +       static struct vm_struct vm_kernel = { 0 };
>>>>>
>>>>>           /* Set mmu_enabled flag */
>>>>>           mmu_enabled = true;
>>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>>                   for (pa = start; pa < end; pa += map_size) {
>>>>>                           va = (uintptr_t)__va(pa);
>>>>> create_pgd_mapping(swapper_pg_dir, va, pa,
>>>>> -                                          map_size, 
>>>>> PAGE_KERNEL_EXEC);
>>>>> +                                          map_size, PAGE_KERNEL);
>>>>>                   }
>>>>>           }
>>>>>
>>>>> +       /* Map the kernel */
>>>>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>>> +
>>>>> +       /* Reserve the vmalloc area occupied by the kernel */
>>>>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>>>>> +       vm_kernel.phys_addr = load_pa;
>>>>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>>>>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>>> +       vm_kernel.caller = __builtin_return_address(0);
>>>>> +
>>>>> +       vm_area_add_early(&vm_kernel);
>>>>> +
>>>>>           /* Clear fixmap PTE and PMD mappings */
>>>>>           clear_fixmap(FIX_PTE);
>>>>>           clear_fixmap(FIX_PMD);
>>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>>> --- a/arch/riscv/mm/physaddr.c
>>>>> +++ b/arch/riscv/mm/physaddr.c
>>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>>
>>>>>    phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>>    {
>>>>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>>           unsigned long kernel_end = (unsigned long)_end;
>>>>>
>>>>>           /*
>>>>> -- 
>>>>> 2.20.1
>>>>>

^ permalink raw reply

* Re: [PATCH] powerpc/64: Remove unused generic_secondary_thread_init()
From: Michael Ellerman @ 2020-05-28 13:15 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev
In-Reply-To: <CACzsE9oxzdXOip7fnUF8H943FCrdEDqKRaF67YbF_MDguHWEag@mail.gmail.com>

Jordan Niethe <jniethe5@gmail.com> writes:
> On Tue, May 26, 2020 at 4:36 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> The last caller was removed in 2014 in commit fb5a515704d7 ("powerpc:
>> Remove platforms/wsp and associated pieces").
>>
>> Once generic_secondary_thread_init() is removed there are no longer
>> any uses of book3e_secondary_thread_init() or
>> generic_secondary_common_init so remove them too.
>>
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> ---
>>  arch/powerpc/include/asm/smp.h       |  1 -
>>  arch/powerpc/kernel/exceptions-64e.S |  4 ----
>>  arch/powerpc/kernel/head_64.S        | 18 ------------------
>>  3 files changed, 23 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
>> index 49a25e2400f2..81a49566ccd8 100644
>> --- a/arch/powerpc/include/asm/smp.h
>> +++ b/arch/powerpc/include/asm/smp.h
>> @@ -243,7 +243,6 @@ extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
>>   * 64-bit but defining them all here doesn't harm
>>   */
>>  extern void generic_secondary_smp_init(void);
>> -extern void generic_secondary_thread_init(void);
>>  extern unsigned long __secondary_hold_spinloop;
>>  extern unsigned long __secondary_hold_acknowledge;
>>  extern char __secondary_hold;
>> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
>> index d9ed79415100..9f9e8686798b 100644
>> --- a/arch/powerpc/kernel/exceptions-64e.S
>> +++ b/arch/powerpc/kernel/exceptions-64e.S
>> @@ -1814,10 +1814,6 @@ _GLOBAL(book3e_secondary_core_init)
>>  1:     mtlr    r28
>>         blr
>>
>> -_GLOBAL(book3e_secondary_thread_init)
>> -       mflr    r28
>> -       b       3b
>> -
>>         .globl init_core_book3e
>>  init_core_book3e:
>>         /* Establish the interrupt vector base */
>> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
>> index 0e05a9a47a4b..4ae2c18c5fc6 100644
>> --- a/arch/powerpc/kernel/head_64.S
>> +++ b/arch/powerpc/kernel/head_64.S
>> @@ -302,23 +302,6 @@ _GLOBAL(fsl_secondary_thread_init)
>>  1:
>>  #endif
>
> Nothing directly calls generic_secondary_thread_init() but I think
> fsl_secondary_thread_init() which is directly above "falls through"
> into it. fsl_secondary_thread_init() still has callers.

Damnit, you're right, I love deleting code! Thanks for reviewing.

I'll send a v2.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/book3s64/kvm: Fix secondary page table walk warning during migration
From: Aneesh Kumar K.V @ 2020-05-28 13:08 UTC (permalink / raw)
  To: Michael Ellerman, kvm-ppc, paulus; +Cc: linuxppc-dev
In-Reply-To: <87a71sjk4o.fsf@mpe.ellerman.id.au>

On 5/28/20 6:23 PM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> This patch fix the below warning reported during migration.
>>
>>   find_kvm_secondary_pte called with kvm mmu_lock not held
>>   CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G        W         5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
>>   NIP:  c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
>>   REGS: c000001e19f077e0 TRAP: 0700   Tainted: G        W          (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
>>   MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42222422  XER: 20040000
>>   CFAR: c00000000012f5ac IRQMASK: 0
>>   GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
>>   GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
>>   GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
>>   GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
>>   GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
>>   GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
>>   GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
>>   GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
>>   NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
>>   LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
>>   Call Trace:
>>   [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
>>   [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
>>   [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
>>   [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
>>   [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
>>   [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
>>   [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
>>   [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
>>   Instruction dump:
>>   7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
>>   e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/include/asm/kvm_book3s_64.h |  9 ++++++
>>   arch/powerpc/kvm/book3s_64_mmu_radix.c   | 35 ++++++++++++++++++++----
>>   2 files changed, 38 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
>> index c58e64a0a74f..cd5bad08b8d3 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
>> @@ -635,6 +635,15 @@ extern void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
>>   				unsigned long gpa, unsigned long hpa,
>>   				unsigned long nbytes);
>>   
>> +static inline pte_t *__find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
>> +					      unsigned *hshift)
>> +{
>> +	pte_t *pte;
>> +
>> +	pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
>> +	return pte;
>> +}
> 
> Why not just open code this in the single caller?

We could do that. But I though it is confusing and we want to avoid 
using linux page table walker (__find_linux_pte()) directly from within 
kvm code to walk partition scoped table.

> 
> Leaving it here someone will invariably decide to call it one day.
> 

I was looking at documenting it at the call site.  We can possibly add a 
comment here explaining avoid calling this directly and if used should 
have a  comments around explaining why it is safe.


> If you think it's worth keeping then it should have a comment explaining
> why it doesn't check the lock, and find_kvm_secondary_pte() should call
> it no?
> 

Was trying to avoid that indirection. I guess we should add a comment 
which suggest to avoid using __find_kvm_secondary_pte() and if used we 
should ask for comment at the call site explaining why it is safe?

-aneesh


^ permalink raw reply

* Re: [PATCH V4 1/2] powerpc/perf: Add support for outputting extended regs in perf intr_regs
From: Nageswara R Sastry @ 2020-05-28  9:32 UTC (permalink / raw)
  To: Athira Rajeev; +Cc: maddy, linuxppc-dev, linux-kernel
In-Reply-To: <1590573018-5201-2-git-send-email-atrajeev@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 9080 bytes --]


"Linuxppc-dev" <linuxppc-dev-bounces
+maddy=linux.vnet.ibm.com@lists.ozlabs.org> wrote on 27/05/2020 03:20:17
PM:

> From: "Athira Rajeev" <atrajeev@linux.vnet.ibm.com>
> To: linuxppc-dev@lists.ozlabs.org
> Cc: ravi.bangoria@linux.ibm.com, atrajeev@linux.vnet.ibm.com,
> maddy@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
> acme@kernel.org, anju@linux.vnet.ibm.com, jolsa@kernel.org
> Date: 28/05/2020 02:46 PM
> Subject: [PATCH V4 1/2] powerpc/perf: Add support for outputting
> extended regs in perf intr_regs
> Sent by: "Linuxppc-dev" <linuxppc-dev-bounces
> +maddy=linux.vnet.ibm.com@lists.ozlabs.org>
>
> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
>
> Add support for perf extended register capability in powerpc.
> The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the
> PMU which support extended registers. The generic code define the mask
> of extended registers as 0 for non supported architectures.
>
> Patch adds extended regs support for power9 platform by
> exposing MMCR0, MMCR1 and MMCR2 registers.
>
> REG_RESERVED mask needs update to include extended regs.
> `PERF_REG_EXTENDED_MASK`, contains mask value of the supported registers,
> is defined at runtime in the kernel based on platform since the supported
> registers may differ from one processor version to another and hence the
> MASK value.
>
> with patch
> ----------
>
> available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
> r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
> r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
> trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2
>
> PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
> ... intr regs: mask 0xffffffffffff ABI 64-bit
> .... r0    0xc00000000012b77c
> .... r1    0xc000003fe5e03930
> .... r2    0xc000000001b0e000
> .... r3    0xc000003fdcddf800
> .... r4    0xc000003fc7880000
> .... r5    0x9c422724be
> .... r6    0xc000003fe5e03908
> .... r7    0xffffff63bddc8706
> .... r8    0x9e4
> .... r9    0x0
> .... r10   0x1
> .... r11   0x0
> .... r12   0xc0000000001299c0
> .... r13   0xc000003ffffc4800
> .... r14   0x0
> .... r15   0x7fffdd8b8b00
> .... r16   0x0
> .... r17   0x7fffdd8be6b8
> .... r18   0x7e7076607730
> .... r19   0x2f
> .... r20   0xc00000001fc26c68
> .... r21   0xc0002041e4227e00
> .... r22   0xc00000002018fb60
> .... r23   0x1
> .... r24   0xc000003ffec4d900
> .... r25   0x80000000
> .... r26   0x0
> .... r27   0x1
> .... r28   0x1
> .... r29   0xc000000001be1260
> .... r30   0x6008010
> .... r31   0xc000003ffebb7218
> .... nip   0xc00000000012b910
> .... msr   0x9000000000009033
> .... orig_r3 0xc00000000012b86c
> .... ctr   0xc0000000001299c0
> .... link  0xc00000000012b77c
> .... xer   0x0
> .... ccr   0x28002222
> .... softe 0x1
> .... trap  0xf00
> .... dar   0x0
> .... dsisr 0x80000000000
> .... sier  0x0
> .... mmcra 0x80000000000
> .... mmcr0 0x82008090
> .... mmcr1 0x1e000000
> .... mmcr2 0x0
>  ... thread: perf:4784
>
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> [Defined PERF_REG_EXTENDED_MASK at run time to add support for
> different platforms ]
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

Tested with 5.7.0-rc2
Tested-by: Nageswara R Sastry <nasastry@in.ibm.com>


> ---
>  arch/powerpc/include/asm/perf_event_server.h |  8 +++++++
>  arch/powerpc/include/uapi/asm/perf_regs.h    | 14 +++++++++++-
>  arch/powerpc/perf/core-book3s.c              |  1 +
>  arch/powerpc/perf/perf_regs.c                | 34 +++++++++++++++++
> ++++++++---
>  arch/powerpc/perf/power9-pmu.c               |  6 +++++
>  5 files changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/
> powerpc/include/asm/perf_event_server.h
> index 3e9703f..1458e1a 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -15,6 +15,9 @@
>  #define MAX_EVENT_ALTERNATIVES   8
>  #define MAX_LIMITED_HWCOUNTERS   2
>
> +extern u64 mask_var;
> +#define PERF_REG_EXTENDED_MASK          mask_var
> +
>  struct perf_event;
>
>  /*
> @@ -55,6 +58,11 @@ struct power_pmu {
>     int       *blacklist_ev;
>     /* BHRB entries in the PMU */
>     int      bhrb_nr;
> +   /*
> +    * set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
> +    * the pmu supports extended perf regs capability
> +    */
> +   int      capabilities;
>  };
>
>  /*
> diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/
> powerpc/include/uapi/asm/perf_regs.h
> index f599064..485b1d5 100644
> --- a/arch/powerpc/include/uapi/asm/perf_regs.h
> +++ b/arch/powerpc/include/uapi/asm/perf_regs.h
> @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
>     PERF_REG_POWERPC_DSISR,
>     PERF_REG_POWERPC_SIER,
>     PERF_REG_POWERPC_MMCRA,
> -   PERF_REG_POWERPC_MAX,
> +   /* Extended registers */
> +   PERF_REG_POWERPC_MMCR0,
> +   PERF_REG_POWERPC_MMCR1,
> +   PERF_REG_POWERPC_MMCR2,
> +   /* Max regs without the extended regs */
> +   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>  };
> +
> +#define PERF_REG_PMU_MASK   ((1ULL << PERF_REG_POWERPC_MAX) - 1)
> +
> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
> +#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2
> + 1)) - 1) \
> +            - PERF_REG_PMU_MASK)
> +
>  #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
> diff --git a/arch/powerpc/perf/core-book3s.c
b/arch/powerpc/perf/core-book3s.c
> index 3dcfecf..7f63edf 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2275,6 +2275,7 @@ int register_power_pmu(struct power_pmu *pmu)
>        pmu->name);
>
>     power_pmu.attr_groups = ppmu->attr_groups;
> +   power_pmu.capabilities |= (ppmu->capabilities &
> PERF_PMU_CAP_EXTENDED_REGS);
>
>  #ifdef MSR_HV
>     /*
> diff --git a/arch/powerpc/perf/perf_regs.c
b/arch/powerpc/perf/perf_regs.c
> index a213a0a..c8a7e8c 100644
> --- a/arch/powerpc/perf/perf_regs.c
> +++ b/arch/powerpc/perf/perf_regs.c
> @@ -13,9 +13,11 @@
>  #include <asm/ptrace.h>
>  #include <asm/perf_regs.h>
>
> +u64 mask_var;
> +
>  #define PT_REGS_OFFSET(id, r) [id] = offsetof(struct pt_regs, r)
>
> -#define REG_RESERVED (~((1ULL << PERF_REG_POWERPC_MAX) - 1))
> +#define REG_RESERVED (~(PERF_REG_EXTENDED_MASK | PERF_REG_PMU_MASK))
>
>  static unsigned int pt_regs_offset[PERF_REG_POWERPC_MAX] = {
>     PT_REGS_OFFSET(PERF_REG_POWERPC_R0,  gpr[0]),
> @@ -69,10 +71,26 @@
>     PT_REGS_OFFSET(PERF_REG_POWERPC_MMCRA, dsisr),
>  };
>
> +/* Function to return the extended register values */
> +static u64 get_ext_regs_value(int idx)
> +{
> +   switch (idx) {
> +   case PERF_REG_POWERPC_MMCR0:
> +      return mfspr(SPRN_MMCR0);
> +   case PERF_REG_POWERPC_MMCR1:
> +      return mfspr(SPRN_MMCR1);
> +   case PERF_REG_POWERPC_MMCR2:
> +      return mfspr(SPRN_MMCR2);
> +   default: return 0;
> +   }
> +}
> +
>  u64 perf_reg_value(struct pt_regs *regs, int idx)
>  {
> -   if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
> -      return 0;
> +   u64 PERF_REG_EXTENDED_MAX;
> +
> +   if (cpu_has_feature(CPU_FTR_ARCH_300))
> +      PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_MMCR2 + 1;
>
>     if (idx == PERF_REG_POWERPC_SIER &&
>        (IS_ENABLED(CONFIG_FSL_EMB_PERF_EVENT) ||
> @@ -85,6 +103,16 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>         IS_ENABLED(CONFIG_PPC32)))
>        return 0;
>
> +   if (idx >= PERF_REG_POWERPC_MAX && idx < PERF_REG_EXTENDED_MAX)
> +      return get_ext_regs_value(idx);
> +
> +   /*
> +    * If the idx is referring to value beyond the
> +    * supported registers, return 0 with a warning
> +    */
> +   if (WARN_ON_ONCE(idx >= PERF_REG_EXTENDED_MAX))
> +      return 0;
> +
>     return regs_get_register(regs, pt_regs_offset[idx]);
>  }
>
> diff --git a/arch/powerpc/perf/power9-pmu.c
b/arch/powerpc/perf/power9-pmu.c
> index 08c3ef7..4525090 100644
> --- a/arch/powerpc/perf/power9-pmu.c
> +++ b/arch/powerpc/perf/power9-pmu.c
> @@ -90,6 +90,8 @@ enum {
>  #define POWER9_MMCRA_IFM3      0x00000000C0000000UL
>  #define POWER9_MMCRA_BHRB_MASK      0x00000000C0000000UL
>
> +extern u64 mask_var;
> +
>  /* Nasty Power9 specific hack */
>  #define PVR_POWER9_CUMULUS      0x00002000
>
> @@ -434,6 +436,7 @@ static void power9_config_bhrb(u64 pmu_bhrb_filter)
>     .cache_events      = &power9_cache_events,
>     .attr_groups      = power9_pmu_attr_groups,
>     .bhrb_nr      = 32,
> +   .capabilities           = PERF_PMU_CAP_EXTENDED_REGS,
>  };
>
>  int init_power9_pmu(void)
> @@ -457,6 +460,9 @@ int init_power9_pmu(void)
>        }
>     }
>
> +   /* Set the PERF_REG_EXTENDED_MASK here */
> +   mask_var = PERF_REG_PMU_MASK_300;
> +
>     rc = register_power_pmu(&power9_pmu);
>     if (rc)
>        return rc;
> --
> 1.8.3.1
>

[-- Attachment #2: Type: text/html, Size: 12313 bytes --]

^ permalink raw reply

* Re: [PATCH V4 2/2] tools/perf: Add perf tools support for extended register capability in powerpc
From: Nageswara R Sastry @ 2020-05-28  9:37 UTC (permalink / raw)
  To: Athira Rajeev; +Cc: maddy, linuxppc-dev, linux-kernel
In-Reply-To: <1590573018-5201-3-git-send-email-atrajeev@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 6079 bytes --]


"Athira Rajeev" <atrajeev@linux.vnet.ibm.com> wrote on 27/05/2020 03:20:18
PM:

> From: "Athira Rajeev" <atrajeev@linux.vnet.ibm.com>
> To: linuxppc-dev@lists.ozlabs.org
> Cc: linux-kernel@vger.kernel.org, ravi.bangoria@linux.ibm.com,
> maddy@linux.vnet.ibm.com, acme@kernel.org, anju@linux.vnet.ibm.com,
> jolsa@kernel.org, mpe@ellerman.id.au, atrajeev@linux.vnet.ibm.com
> Date: 28/05/2020 02:46 PM
> Subject: [PATCH V4 2/2] tools/perf: Add perf tools support for
> extended register capability in powerpc
>
> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
>
> Add extended regs to sample_reg_mask in the tool side to use
> with `-I?` option. Perf tools side uses extended mask to display
> the platform supported register names (with -I? option) to the user
> and also send this mask to the kernel to capture the extended registers
> in each sample. Hence decide the mask value based on the processor
> version.
>
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> [Decide extended mask at run time based on platform]
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

Tested-by: Nageswara R Sastry <nasastry@in.ibm.com>
Tested with 5.7.0-rc2
Tested the following scenarios
1. perf record -I
2. perf report -D  # in output check for the registers
3. perf record -I<register name>
4. perf record -I<non existing register name>
5. perf record -I<non existing register name with special characters>
6. perf record -I<register name> -e <different event names>

> ---
>  tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++++++-
>  tools/perf/arch/powerpc/include/perf_regs.h     |  5 ++-
>  tools/perf/arch/powerpc/util/perf_regs.c        | 55 ++++++++++++++
> +++++++++++
>  3 files changed, 72 insertions(+), 2 deletions(-)
>
> diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/
> tools/arch/powerpc/include/uapi/asm/perf_regs.h
> index f599064..485b1d5 100644
> --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
>     PERF_REG_POWERPC_DSISR,
>     PERF_REG_POWERPC_SIER,
>     PERF_REG_POWERPC_MMCRA,
> -   PERF_REG_POWERPC_MAX,
> +   /* Extended registers */
> +   PERF_REG_POWERPC_MMCR0,
> +   PERF_REG_POWERPC_MMCR1,
> +   PERF_REG_POWERPC_MMCR2,
> +   /* Max regs without the extended regs */
> +   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>  };
> +
> +#define PERF_REG_PMU_MASK   ((1ULL << PERF_REG_POWERPC_MAX) - 1)
> +
> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
> +#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2
> + 1)) - 1) \
> +            - PERF_REG_PMU_MASK)
> +
>  #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
> diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/
> perf/arch/powerpc/include/perf_regs.h
> index e18a355..46ed00d 100644
> --- a/tools/perf/arch/powerpc/include/perf_regs.h
> +++ b/tools/perf/arch/powerpc/include/perf_regs.h
> @@ -64,7 +64,10 @@
>     [PERF_REG_POWERPC_DAR] = "dar",
>     [PERF_REG_POWERPC_DSISR] = "dsisr",
>     [PERF_REG_POWERPC_SIER] = "sier",
> -   [PERF_REG_POWERPC_MMCRA] = "mmcra"
> +   [PERF_REG_POWERPC_MMCRA] = "mmcra",
> +   [PERF_REG_POWERPC_MMCR0] = "mmcr0",
> +   [PERF_REG_POWERPC_MMCR1] = "mmcr1",
> +   [PERF_REG_POWERPC_MMCR2] = "mmcr2",
>  };
>
>  static inline const char *perf_reg_name(int id)
> diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/
> arch/powerpc/util/perf_regs.c
> index 0a52429..9179230 100644
> --- a/tools/perf/arch/powerpc/util/perf_regs.c
> +++ b/tools/perf/arch/powerpc/util/perf_regs.c
> @@ -6,9 +6,14 @@
>
>  #include "../../../util/perf_regs.h"
>  #include "../../../util/debug.h"
> +#include "../../../util/event.h"
> +#include "../../../util/header.h"
> +#include "../../../perf-sys.h"
>
>  #include <linux/kernel.h>
>
> +#define PVR_POWER9      0x004E
> +
>  const struct sample_reg sample_reg_masks[] = {
>     SMPL_REG(r0, PERF_REG_POWERPC_R0),
>     SMPL_REG(r1, PERF_REG_POWERPC_R1),
> @@ -55,6 +60,9 @@
>     SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
>     SMPL_REG(sier, PERF_REG_POWERPC_SIER),
>     SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
> +   SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
> +   SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
> +   SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
>     SMPL_REG_END
>  };
>
> @@ -163,3 +171,50 @@ int arch_sdt_arg_parse_op(char *old_op, char
**new_op)
>
>     return SDT_ARG_VALID;
>  }
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> +   struct perf_event_attr attr = {
> +      .type                   = PERF_TYPE_HARDWARE,
> +      .config                 = PERF_COUNT_HW_CPU_CYCLES,
> +      .sample_type            = PERF_SAMPLE_REGS_INTR,
> +      .precise_ip             = 1,
> +      .disabled               = 1,
> +      .exclude_kernel         = 1,
> +   };
> +   int fd, ret;
> +   char buffer[64];
> +   u32 version;
> +   u64 extended_mask = 0;
> +
> +   /* Get the PVR value to set the extended
> +    * mask specific to platform
> +    */
> +   get_cpuid(buffer, sizeof(buffer));
> +   ret = sscanf(buffer, "%u,", &version);
> +
> +   if (ret != 1) {
> +      pr_debug("Failed to get the processor version, unable to
> output extended registers\n");
> +      return PERF_REGS_MASK;
> +   }
> +
> +   if (version == PVR_POWER9)
> +      extended_mask = PERF_REG_PMU_MASK_300;
> +   else
> +      return PERF_REGS_MASK;
> +
> +   attr.sample_regs_intr = extended_mask;
> +   attr.sample_period = 1;
> +   event_attr_init(&attr);
> +
> +   /*
> +    * check if the pmu supports perf extended regs, before
> +    * returning the register mask to sample.
> +    */
> +   fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +   if (fd != -1) {
> +      close(fd);
> +      return (extended_mask | PERF_REGS_MASK);
> +   }
> +   return PERF_REGS_MASK;
> +}
> --
> 1.8.3.1
>

[-- Attachment #2: Type: text/html, Size: 8458 bytes --]

^ permalink raw reply

* Re: [PATCH] powerpc/book3s64/kvm: Fix secondary page table walk warning during migration
From: Michael Ellerman @ 2020-05-28 12:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V, kvm-ppc, paulus; +Cc: Aneesh Kumar K.V, linuxppc-dev
In-Reply-To: <20200528080456.87797-1-aneesh.kumar@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> This patch fix the below warning reported during migration.
>
>  find_kvm_secondary_pte called with kvm mmu_lock not held
>  CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G        W         5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
>  NIP:  c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
>  REGS: c000001e19f077e0 TRAP: 0700   Tainted: G        W          (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
>  MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42222422  XER: 20040000
>  CFAR: c00000000012f5ac IRQMASK: 0
>  GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
>  GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
>  GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
>  GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
>  GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
>  GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
>  GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
>  GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
>  NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
>  LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
>  Call Trace:
>  [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
>  [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
>  [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
>  [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
>  [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
>  [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
>  [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
>  [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
>  Instruction dump:
>  7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
>  e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h |  9 ++++++
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   | 35 ++++++++++++++++++++----
>  2 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index c58e64a0a74f..cd5bad08b8d3 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -635,6 +635,15 @@ extern void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
>  				unsigned long gpa, unsigned long hpa,
>  				unsigned long nbytes);
>  
> +static inline pte_t *__find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
> +					      unsigned *hshift)
> +{
> +	pte_t *pte;
> +
> +	pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
> +	return pte;
> +}

Why not just open code this in the single caller?

Leaving it here someone will invariably decide to call it one day.

If you think it's worth keeping then it should have a comment explaining
why it doesn't check the lock, and find_kvm_secondary_pte() should call
it no?

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
From: Michael Ellerman @ 2020-05-28 12:23 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Daniel Borkmann, linux-kernel, Alexei Starovoitov, Paul Mackerras,
	Masami Hiramatsu, Brendan Gregg, Miroslav Benes, linuxppc-dev,
	Christoph Hellwig
In-Reply-To: <20200528091351.GE3529@linux-b0ei>

Petr Mladek <pmladek@suse.com> writes:
> On Thu 2020-05-28 11:03:43, Michael Ellerman wrote:
>> Petr Mladek <pmladek@suse.com> writes:
>> > The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
>> > to archs where they work") caused that bpf_probe_read{, str}() functions
>> > were not longer available on architectures where the same logical address
>> > might have different content in kernel and user memory mapping. These
>> > architectures should use probe_read_{user,kernel}_str helpers.
>> >
>> > For backward compatibility, the problematic functions are still available
>> > on architectures where the user and kernel address spaces are not
>> > overlapping. This is defined CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
>> >
>> > At the moment, these backward compatible functions are enabled only
>> > on x86_64, arm, and arm64. Let's do it also on powerpc that has
>> > the non overlapping address space as well.
>> >
>> > Signed-off-by: Petr Mladek <pmladek@suse.com>
>> 
>> This seems like it should have a Fixes: tag and go into v5.7?
>
> Good point:
>
> Fixes: commit 0ebeea8ca8a4d1d4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work")
>
> And yes, it should ideally go into v5.7 either directly or via stable.
>
> Should I resend the patch with Fixes and
> Cc: stable@vger.kernel.org #v45.7 lines, please?

If it goes into v5.7 then it doesn't need a Cc: stable, and I guess a
Fixes: tag is nice to have but not so important as it already mentions
the commit that caused the problem. So a resend probably isn't
necessary.

Acked-by: Michael Ellerman <mpe@ellerman.id.au>


Daniel can you pick this up, or should I?

cheers


^ permalink raw reply

* [PATCH AUTOSEL 4.4 4/7] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:58 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200528115811.1406810-1-sashal@kernel.org>

From: Valentin Longchamp <valentin@longchamp.me>

[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]

ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.

The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].

A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.

The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.

[1] https://lists.openwall.net/netdev/2020/01/08/201

Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 55ac00055977..96a1f62cc148 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
 #include <asm/ucc.h>
 #include <asm/ucc_fast.h>
 #include <asm/machdep.h>
+#include <net/sch_generic.h>
 
 #include "ucc_geth.h"
 
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
-	/* Prevent any further xmits, plus detach the device. */
-	netif_device_detach(ugeth->ndev);
-
-	/* Wait for any current xmits to finish. */
-	netif_tx_disable(ugeth->ndev);
+	/* Prevent any further xmits */
+	netif_tx_stop_all_queues(ugeth->ndev);
 
 	/* Disable the interrupt to avoid NAPI rescheduling. */
 	disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_device_attach(ugeth->ndev);
+
+	/* allow to xmit again  */
+	netif_tx_wake_all_queues(ugeth->ndev);
+	__netdev_watchdog_up(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made
-- 
2.25.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.9 5/9] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:57 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200528115800.1406703-1-sashal@kernel.org>

From: Valentin Longchamp <valentin@longchamp.me>

[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]

ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.

The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].

A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.

The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.

[1] https://lists.openwall.net/netdev/2020/01/08/201

Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 714593023bbc..af922bac19ae 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
 #include <soc/fsl/qe/ucc.h>
 #include <soc/fsl/qe/ucc_fast.h>
 #include <asm/machdep.h>
+#include <net/sch_generic.h>
 
 #include "ucc_geth.h"
 
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
-	/* Prevent any further xmits, plus detach the device. */
-	netif_device_detach(ugeth->ndev);
-
-	/* Wait for any current xmits to finish. */
-	netif_tx_disable(ugeth->ndev);
+	/* Prevent any further xmits */
+	netif_tx_stop_all_queues(ugeth->ndev);
 
 	/* Disable the interrupt to avoid NAPI rescheduling. */
 	disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_device_attach(ugeth->ndev);
+
+	/* allow to xmit again  */
+	netif_tx_wake_all_queues(ugeth->ndev);
+	__netdev_watchdog_up(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made
-- 
2.25.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.14 08/13] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:57 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200528115744.1406533-1-sashal@kernel.org>

From: Valentin Longchamp <valentin@longchamp.me>

[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]

ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.

The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].

A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.

The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.

[1] https://lists.openwall.net/netdev/2020/01/08/201

Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index bddf4c25ee6e..7c2a9fd4dc1a 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
 #include <soc/fsl/qe/ucc.h>
 #include <soc/fsl/qe/ucc_fast.h>
 #include <asm/machdep.h>
+#include <net/sch_generic.h>
 
 #include "ucc_geth.h"
 
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
-	/* Prevent any further xmits, plus detach the device. */
-	netif_device_detach(ugeth->ndev);
-
-	/* Wait for any current xmits to finish. */
-	netif_tx_disable(ugeth->ndev);
+	/* Prevent any further xmits */
+	netif_tx_stop_all_queues(ugeth->ndev);
 
 	/* Disable the interrupt to avoid NAPI rescheduling. */
 	disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_device_attach(ugeth->ndev);
+
+	/* allow to xmit again  */
+	netif_tx_wake_all_queues(ugeth->ndev);
+	__netdev_watchdog_up(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made
-- 
2.25.1


^ permalink raw reply related

* [PATCH AUTOSEL 4.19 12/17] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:57 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200528115724.1406376-1-sashal@kernel.org>

From: Valentin Longchamp <valentin@longchamp.me>

[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]

ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.

The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].

A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.

The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.

[1] https://lists.openwall.net/netdev/2020/01/08/201

Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index a5bf02ae4bc5..5de6f7c73c1f 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
 #include <soc/fsl/qe/ucc.h>
 #include <soc/fsl/qe/ucc_fast.h>
 #include <asm/machdep.h>
+#include <net/sch_generic.h>
 
 #include "ucc_geth.h"
 
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
-	/* Prevent any further xmits, plus detach the device. */
-	netif_device_detach(ugeth->ndev);
-
-	/* Wait for any current xmits to finish. */
-	netif_tx_disable(ugeth->ndev);
+	/* Prevent any further xmits */
+	netif_tx_stop_all_queues(ugeth->ndev);
 
 	/* Disable the interrupt to avoid NAPI rescheduling. */
 	disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_device_attach(ugeth->ndev);
+
+	/* allow to xmit again  */
+	netif_tx_wake_all_queues(ugeth->ndev);
+	__netdev_watchdog_up(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made
-- 
2.25.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.4 15/26] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:56 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200528115654.1406165-1-sashal@kernel.org>

From: Valentin Longchamp <valentin@longchamp.me>

[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]

ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.

The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].

A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.

The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.

[1] https://lists.openwall.net/netdev/2020/01/08/201

Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index f839fa94ebdd..d3b8ce734c1b 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -42,6 +42,7 @@
 #include <soc/fsl/qe/ucc.h>
 #include <soc/fsl/qe/ucc_fast.h>
 #include <asm/machdep.h>
+#include <net/sch_generic.h>
 
 #include "ucc_geth.h"
 
@@ -1548,11 +1549,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
-	/* Prevent any further xmits, plus detach the device. */
-	netif_device_detach(ugeth->ndev);
-
-	/* Wait for any current xmits to finish. */
-	netif_tx_disable(ugeth->ndev);
+	/* Prevent any further xmits */
+	netif_tx_stop_all_queues(ugeth->ndev);
 
 	/* Disable the interrupt to avoid NAPI rescheduling. */
 	disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1565,7 +1563,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_device_attach(ugeth->ndev);
+
+	/* allow to xmit again  */
+	netif_tx_wake_all_queues(ugeth->ndev);
+	__netdev_watchdog_up(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made
-- 
2.25.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.6 30/47] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:55 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200528115600.1405808-1-sashal@kernel.org>

From: Valentin Longchamp <valentin@longchamp.me>

[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]

ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.

The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].

A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.

The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.

[1] https://lists.openwall.net/netdev/2020/01/08/201

Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 0d101c00286f..ab1b4a77b4a3 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -42,6 +42,7 @@
 #include <soc/fsl/qe/ucc.h>
 #include <soc/fsl/qe/ucc_fast.h>
 #include <asm/machdep.h>
+#include <net/sch_generic.h>
 
 #include "ucc_geth.h"
 
@@ -1548,11 +1549,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
-	/* Prevent any further xmits, plus detach the device. */
-	netif_device_detach(ugeth->ndev);
-
-	/* Wait for any current xmits to finish. */
-	netif_tx_disable(ugeth->ndev);
+	/* Prevent any further xmits */
+	netif_tx_stop_all_queues(ugeth->ndev);
 
 	/* Disable the interrupt to avoid NAPI rescheduling. */
 	disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1565,7 +1563,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_device_attach(ugeth->ndev);
+
+	/* allow to xmit again  */
+	netif_tx_wake_all_queues(ugeth->ndev);
+	__netdev_watchdog_up(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made
-- 
2.25.1


^ permalink raw reply related

* [PATCH] powerpc/32: disable KASAN with pages bigger than 16k
From: Christophe Leroy @ 2020-05-28 10:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

Mapping of early shadow area is implemented by using a single static
page table having all entries pointing to the same early shadow page.
The shadow area must therefore occupy full PGD entries.

The shadow area has a size of 128Mbytes starting at 0xf8000000.
With 4k pages, a PGD entry is 4Mbytes
With 16k pages, a PGD entry is 64Mbytes
With 64k pages, a PGD entry is 256Mbytes which is too big.

Until we rework the early shadow mapping, disable KASAN when the
page size is too big.

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1dfa59126fcf..066b36bf9efa 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -169,8 +169,8 @@ config PPC
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMAP		if PPC_BOOK3S_64 && PPC_RADIX_MMU
 	select HAVE_ARCH_JUMP_LABEL
-	select HAVE_ARCH_KASAN			if PPC32
-	select HAVE_ARCH_KASAN_VMALLOC		if PPC32
+	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
+	select HAVE_ARCH_KASAN_VMALLOC		if PPC32 && PPC_PAGE_SHIFT <= 14
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
From: Petr Mladek @ 2020-05-28  9:13 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Daniel Borkmann, linux-kernel, Alexei Starovoitov, Paul Mackerras,
	Masami Hiramatsu, Brendan Gregg, Miroslav Benes, linuxppc-dev,
	Christoph Hellwig
In-Reply-To: <87ftbkkh00.fsf@mpe.ellerman.id.au>

On Thu 2020-05-28 11:03:43, Michael Ellerman wrote:
> Petr Mladek <pmladek@suse.com> writes:
> > The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
> > to archs where they work") caused that bpf_probe_read{, str}() functions
> > were not longer available on architectures where the same logical address
> > might have different content in kernel and user memory mapping. These
> > architectures should use probe_read_{user,kernel}_str helpers.
> >
> > For backward compatibility, the problematic functions are still available
> > on architectures where the user and kernel address spaces are not
> > overlapping. This is defined CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
> >
> > At the moment, these backward compatible functions are enabled only
> > on x86_64, arm, and arm64. Let's do it also on powerpc that has
> > the non overlapping address space as well.
> >
> > Signed-off-by: Petr Mladek <pmladek@suse.com>
> 
> This seems like it should have a Fixes: tag and go into v5.7?

Good point:

Fixes: commit 0ebeea8ca8a4d1d4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work")

And yes, it should ideally go into v5.7 either directly or via stable.

Should I resend the patch with Fixes and
Cc: stable@vger.kernel.org #v45.7 lines, please?

Best Regards,
Petr

^ permalink raw reply

* Re: [RFC PATCH 1/4] powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
From: Nicholas Piggin @ 2020-05-28  8:09 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: alistair, mikey, jniethe5
In-Reply-To: <20200527145843.2761782-1-mpe@ellerman.id.au>

Excerpts from Michael Ellerman's message of May 28, 2020 12:58 am:
> __init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc:
> Add support for context switching the TAR register") (Feb 2013), and
> only set FSCR_TAR.
> 
> At that point FSCR (Facility Status and Control Register) was not
> context switched, so the setting was permanent after boot.
> 
> Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit
> 54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again
> that was permanent after boot.
> 
> Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on
> POWER8") (Aug 2013) added a limited context switch of FSCR, just the
> FSCR_DSCR bit was context switched based on thread.dscr_inherit. That
> commit said "This clears the H/FSCR DSCR bit initially", but it
> didn't, it left the initialisation of FSCR_DSCR in __init_FSCR().
> However the initial context switch from init_task to pid 1 would clear
> FSCR_DSCR because thread.dscr_inherit was 0.
> 
> That commit also introduced the requirement that FSCR_DSCR be clear
> for user processes, so that we can take the facility unavailable
> interrupt in order to manage dscr_inherit.
> 
> Then in commit 152d523e6307 ("powerpc: Create context switch helpers
> save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to
> thread_struct. However it still wasn't fully context switched, we just
> took the existing value and set FSCR_DSCR if the new thread had
> dscr_inherit set. FSCR was still initialised at boot to FSCR_DSCR |
> FSCR_TAR, but that value was not propagated into the thread_struct, so
> the initial context switch set FSCR_DSCR back to 0.
> 
> Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context
> switching") (Jun 2016) added a full context switch of the FSCR, and
> added an initialisation of init_task.thread.fscr to FSCR_TAR |
> FSCR_EBB, but omitted FSCR_DSCR.
> 
> The end result is that swapper runs with FSCR_DSCR set because of the
> initialisation in __init_FSCR(), but no other processes do, they use
> the value from init_task.thread.fscr.
> 
> Having FSCR_DSCR set for swapper allows it to access SPR 3 from
> userspace, but swapper never runs userspace, so it has no useful
> effect. It's also confusing to have the value initialised in two
> places to two different values.
> 
> So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the
> point where there's a single value of FSCR, even if it's still set in
> two places.

Thanks for sorting out this mess, everything looks good to me.

Thanks,
Nick

^ permalink raw reply

* [PATCH] powerpc/book3s64/kvm: Fix secondary page table walk warning during migration
From: Aneesh Kumar K.V @ 2020-05-28  8:04 UTC (permalink / raw)
  To: kvm-ppc, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V

This patch fix the below warning reported during migration.

 find_kvm_secondary_pte called with kvm mmu_lock not held
 CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G        W         5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
 NIP:  c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
 REGS: c000001e19f077e0 TRAP: 0700   Tainted: G        W          (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
 MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42222422  XER: 20040000
 CFAR: c00000000012f5ac IRQMASK: 0
 GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
 GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
 GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
 GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
 GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
 GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
 GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
 GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
 NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
 LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
 Call Trace:
 [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
 [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
 [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
 [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
 [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
 [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
 [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
 [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
 Instruction dump:
 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
 e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  9 ++++++
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 35 ++++++++++++++++++++----
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index c58e64a0a74f..cd5bad08b8d3 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -635,6 +635,15 @@ extern void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
 				unsigned long gpa, unsigned long hpa,
 				unsigned long nbytes);
 
+static inline pte_t *__find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
+					      unsigned *hshift)
+{
+	pte_t *pte;
+
+	pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
+	return pte;
+}
+
 static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
 					    unsigned *hshift)
 {
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 271f1c3d8443..a328db6a5ef8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 {
 	unsigned long gfn = memslot->base_gfn + pagenum;
 	unsigned long gpa = gfn << PAGE_SHIFT;
-	pte_t *ptep;
+	pte_t *ptep, pte;
 	unsigned int shift;
 	int ret = 0;
 	unsigned long old, *rmapp;
@@ -1048,12 +1048,35 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
 		return ret;
 
-	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
-	if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
-		ret = 1;
-		if (shift)
-			ret = 1 << (shift - PAGE_SHIFT);
+	/*
+	 * For performance reason we don't hold kvm->mmu_lock
+	 * while walking the partition scoped table.
+	 */
+	ptep = __find_kvm_secondary_pte(kvm, gpa, &shift);
+	if (!ptep)
+		return 0;
+
+	pte = READ_ONCE(*ptep);
+	if (pte_present(pte) && pte_dirty(pte)) {
 		spin_lock(&kvm->mmu_lock);
+		/*
+		 * Recheck the pte again
+		 */
+		if (pte_val(pte) != pte_val(*ptep)) {
+			/*
+			 * We have KVM_MEM_LOG_DIRTY_PAGES enabled. Hence we
+			 * can only find PAGE_SIZE pte entries here. We can
+			 * continue to use the pte addr returned by above
+			 * page table walk.
+			 */
+			if (!pte_present(*ptep) || !pte_dirty(*ptep)) {
+				spin_unlock(&kvm->mmu_lock);
+				return 0;
+			}
+		}
+
+		ret = 1;
+		VM_BUG_ON(shift);
 		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
 					      gpa, shift);
 		kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
From: Paul Mackerras @ 2020-05-28  7:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin
In-Reply-To: <e732e386-4a8c-2a7d-220c-e22e85b7a6c3@linux.ibm.com>

On Thu, May 28, 2020 at 11:31:04AM +0530, Aneesh Kumar K.V wrote:
> On 5/28/20 7:13 AM, Paul Mackerras wrote:
> > On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> > > The locking rules for walking partition scoped table is different from process
> > > scoped table. Hence add a helper for secondary linux page table walk and also
> > > add check whether we are holding the right locks.
> > 
> > This patch is causing new warnings to appear when testing migration,
> > like this:
> > 
> > [  142.090159] ------------[ cut here ]------------
> > [  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
> > [  142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
> > [  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G        W         5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
> > [  142.090189] NIP:  c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
> > [  142.090190] REGS: c000001e19f077e0 TRAP: 0700   Tainted: G        W          (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
> > [  142.090191] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42222422  XER: 20040000
> > [  142.090196] CFAR: c00000000012f5ac IRQMASK: 0
> >                 GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
> >                 GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
> >                 GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
> >                 GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
> >                 GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
> >                 GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
> >                 GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
> >                 GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
> > [  142.090217] NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [  142.090223] LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
> > [  142.090224] Call Trace:
> > [  142.090230] [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
> > [  142.090236] [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
> > [  142.090292] [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
> > [  142.090300] [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
> > [  142.090302] [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
> > [  142.090305] [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
> > [  142.090307] [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
> > [  142.090309] [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
> > [  142.090310] Instruction dump:
> > [  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
> > [  142.090317] e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
> > [  142.090322] ---[ end trace 619d45057b6919e0 ]---
> > 
> > Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
> > locklessly, and only takes the kvm->mmu_lock once it finds a dirty
> > PTE.  I think that is important for performance, since on any given
> > scan of the guest real address space we may only find a small
> > proportion of the guest pages to be dirty.
> > 
> > Are you now relying on the kvm->mmu_lock to protect the existence of
> > the PTEs, or just their content?
> > 
> 
> The patch series should not change any rules w.r.t kvm partition scoped page
> table walk. We only added helpers to make it explicit that this is different
> from regular linux page table walk. And kvm->mmu_lock is what was used to
> protect the partition scoped table walk earlier.
> 
> In this specific case, what we need probably is an open coded kvm partition
> scoped walk with a comment around explaining why is it ok to walk that
> partition scoped table without taking kvm->mmu_lock.
> 
> What happens when a parallel invalidate happens to Qemu address space? Since
> we are not holding kvm->mmu_lock mmu notifier will complete and we will go
> ahead with unmapping partition scoped table.
> 
> Do we need a change like below?
> 
> @@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
>  {
>  	unsigned long gfn = memslot->base_gfn + pagenum;
>  	unsigned long gpa = gfn << PAGE_SHIFT;
> -	pte_t *ptep;
> +	pte_t *ptep, pte;
>  	unsigned int shift;
>  	int ret = 0;
>  	unsigned long old, *rmapp;
> @@ -1049,11 +1049,23 @@ static int kvm_radix_test_clear_dirty(struct kvm
> *kvm,
>  		return ret;
> 
>  	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);

We need something different from find_kvm_secondary_pte here, since
that is what is generating the warning.

> -	if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
> +	if (!ptep)
> +		return 0;
> +
> +	pte = READ_ONCE(*ptep);
> +	if (pte_present(pte) && pte_dirty(pte)) {
>  		ret = 1;
>  		if (shift)
>  			ret = 1 << (shift - PAGE_SHIFT);
>  		spin_lock(&kvm->mmu_lock);
> +		/*
> +		 * Recheck the pte again
> +		 */
> +		if (pte_val(pte) != pte_val(*ptep)) {

I don't think this is quite right.  I think it should be something
like:

		pte = *ptep;
		if (!(pte_present(pte) && pte_dirty(pte))) {


> +			spin_unlock(&kvm->mmu_lock);
> +			return 0;
> +		}
> +
>  		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
>  					      gpa, shift);
>  		kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);

Paul.

^ permalink raw reply

* [PATCH v5] KVM: PPC: clean up redundant kvm_run parameters in assembly
From: Tianjia Zhang @ 2020-05-28  6:24 UTC (permalink / raw)
  To: paulus, mpe, benh; +Cc: tianjia.zhang, linuxppc-dev, linux-kernel, kvm-ppc

In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/powerpc/include/asm/kvm_ppc.h    |  2 +-
 arch/powerpc/kvm/book3s_interrupts.S  | 22 ++++++++++------------
 arch/powerpc/kvm/book3s_pr.c          |  9 ++++-----
 arch/powerpc/kvm/booke.c              |  9 ++++-----
 arch/powerpc/kvm/booke_interrupts.S   |  9 ++++-----
 arch/powerpc/kvm/bookehv_interrupts.S | 10 +++++-----
 6 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index ccf66b3a4c1d..0a056c64c317 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -59,7 +59,7 @@ enum xlate_readwrite {
 };
 
 extern int kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
-extern int __kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu);
+extern int __kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
 extern void kvmppc_handler_highmem(void);
 
 extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S
index f7ad99d972ce..a3674f6b8d3d 100644
--- a/arch/powerpc/kvm/book3s_interrupts.S
+++ b/arch/powerpc/kvm/book3s_interrupts.S
@@ -55,8 +55,7 @@
  ****************************************************************************/
 
 /* Registers:
- *  r3: kvm_run pointer
- *  r4: vcpu pointer
+ *  r3: vcpu pointer
  */
 _GLOBAL(__kvmppc_vcpu_run)
 
@@ -68,8 +67,8 @@ kvm_start_entry:
 	/* Save host state to the stack */
 	PPC_STLU r1, -SWITCH_FRAME_SIZE(r1)
 
-	/* Save r3 (kvm_run) and r4 (vcpu) */
-	SAVE_2GPRS(3, r1)
+	/* Save r3 (vcpu) */
+	SAVE_GPR(3, r1)
 
 	/* Save non-volatile registers (r14 - r31) */
 	SAVE_NVGPRS(r1)
@@ -82,14 +81,13 @@ kvm_start_entry:
 	PPC_STL	r0, _LINK(r1)
 
 	/* Load non-volatile guest state from the vcpu */
-	VCPU_LOAD_NVGPRS(r4)
+	VCPU_LOAD_NVGPRS(r3)
 
 kvm_start_lightweight:
 	/* Copy registers into shadow vcpu so we can access them in real mode */
-	mr	r3, r4
 	bl	FUNC(kvmppc_copy_to_svcpu)
 	nop
-	REST_GPR(4, r1)
+	REST_GPR(3, r1)
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	/* Get the dcbz32 flag */
@@ -146,7 +144,7 @@ after_sprg3_load:
 	 *
 	 */
 
-	PPC_LL	r3, GPR4(r1)		/* vcpu pointer */
+	PPC_LL	r3, GPR3(r1)		/* vcpu pointer */
 
 	/*
 	 * kvmppc_copy_from_svcpu can clobber volatile registers, save
@@ -190,11 +188,11 @@ after_sprg3_load:
 	PPC_STL	r30, VCPU_GPR(R30)(r7)
 	PPC_STL	r31, VCPU_GPR(R31)(r7)
 
-	/* Pass the exit number as 3rd argument to kvmppc_handle_exit */
-	lwz	r5, VCPU_TRAP(r7)
+	/* Pass the exit number as 2nd argument to kvmppc_handle_exit */
+	lwz	r4, VCPU_TRAP(r7)
 
-	/* Restore r3 (kvm_run) and r4 (vcpu) */
-	REST_2GPRS(3, r1)
+	/* Restore r3 (vcpu) */
+	REST_GPR(3, r1)
 	bl	FUNC(kvmppc_handle_exit_pr)
 
 	/* If RESUME_GUEST, get back in the loop */
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ef54f917bdaf..01c8fe5abe0d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1151,9 +1151,9 @@ static int kvmppc_exit_pr_progint(struct kvm_vcpu *vcpu, unsigned int exit_nr)
 	return r;
 }
 
-int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
-			  unsigned int exit_nr)
+int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr)
 {
+	struct kvm_run *run = vcpu->run;
 	int r = RESUME_HOST;
 	int s;
 
@@ -1826,7 +1826,6 @@ static void kvmppc_core_vcpu_free_pr(struct kvm_vcpu *vcpu)
 
 static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
 {
-	struct kvm_run *run = vcpu->run;
 	int ret;
 #ifdef CONFIG_ALTIVEC
 	unsigned long uninitialized_var(vrsave);
@@ -1834,7 +1833,7 @@ static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
 
 	/* Check if we can run the vcpu at all */
 	if (!vcpu->arch.sane) {
-		run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1861,7 +1860,7 @@ static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
 
 	kvmppc_fix_ee_before_entry();
 
-	ret = __kvmppc_vcpu_run(run, vcpu);
+	ret = __kvmppc_vcpu_run(vcpu);
 
 	kvmppc_clear_debug(vcpu);
 
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index c0d62a917e20..3e1c9f08e302 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -731,12 +731,11 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu)
 
 int kvmppc_vcpu_run(struct kvm_vcpu *vcpu)
 {
-	struct kvm_run *run = vcpu->run;
 	int ret, s;
 	struct debug_reg debug;
 
 	if (!vcpu->arch.sane) {
-		run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
 		return -EINVAL;
 	}
 
@@ -778,7 +777,7 @@ int kvmppc_vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.pgdir = vcpu->kvm->mm->pgd;
 	kvmppc_fix_ee_before_entry();
 
-	ret = __kvmppc_vcpu_run(run, vcpu);
+	ret = __kvmppc_vcpu_run(vcpu);
 
 	/* No need for guest_exit. It's done in handle_exit.
 	   We also get here with interrupts enabled. */
@@ -982,9 +981,9 @@ static int kvmppc_resume_inst_load(struct kvm_vcpu *vcpu,
  *
  * Return value is in the form (errcode<<2 | RESUME_FLAG_HOST | RESUME_FLAG_NV)
  */
-int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
-                       unsigned int exit_nr)
+int kvmppc_handle_exit(struct kvm_vcpu *vcpu, unsigned int exit_nr)
 {
+	struct kvm_run *run = vcpu->run;
 	int r = RESUME_HOST;
 	int s;
 	int idx;
diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
index 2e56ab5a5f55..6fa82efe833b 100644
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -237,7 +237,7 @@ _GLOBAL(kvmppc_resume_host)
 	/* Switch to kernel stack and jump to handler. */
 	LOAD_REG_ADDR(r3, kvmppc_handle_exit)
 	mtctr	r3
-	lwz	r3, HOST_RUN(r1)
+	mr	r3, r4
 	lwz	r2, HOST_R2(r1)
 	mr	r14, r4 /* Save vcpu pointer. */
 
@@ -337,15 +337,14 @@ heavyweight_exit:
 
 
 /* Registers:
- *  r3: kvm_run pointer
- *  r4: vcpu pointer
+ *  r3: vcpu pointer
  */
 _GLOBAL(__kvmppc_vcpu_run)
 	stwu	r1, -HOST_STACK_SIZE(r1)
-	stw	r1, VCPU_HOST_STACK(r4)	/* Save stack pointer to vcpu. */
+	stw	r1, VCPU_HOST_STACK(r3)	/* Save stack pointer to vcpu. */
 
 	/* Save host state to stack. */
-	stw	r3, HOST_RUN(r1)
+	mr	r4, r3
 	mflr	r3
 	stw	r3, HOST_STACK_LR(r1)
 	mfcr	r5
diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index c577ba4b3169..8262c14fc9e6 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -434,9 +434,10 @@ _GLOBAL(kvmppc_resume_host)
 #endif
 
 	/* Switch to kernel stack and jump to handler. */
-	PPC_LL	r3, HOST_RUN(r1)
+	mr	r3, r4
 	mr	r5, r14 /* intno */
 	mr	r14, r4 /* Save vcpu pointer. */
+	mr	r4, r5
 	bl	kvmppc_handle_exit
 
 	/* Restore vcpu pointer and the nonvolatiles we used. */
@@ -525,15 +526,14 @@ heavyweight_exit:
 	blr
 
 /* Registers:
- *  r3: kvm_run pointer
- *  r4: vcpu pointer
+ *  r3: vcpu pointer
  */
 _GLOBAL(__kvmppc_vcpu_run)
 	stwu	r1, -HOST_STACK_SIZE(r1)
-	PPC_STL	r1, VCPU_HOST_STACK(r4)	/* Save stack pointer to vcpu. */
+	PPC_STL	r1, VCPU_HOST_STACK(r3)	/* Save stack pointer to vcpu. */
 
 	/* Save host state to stack. */
-	PPC_STL	r3, HOST_RUN(r1)
+	mr	r4, r3
 	mflr	r3
 	mfcr	r5
 	PPC_STL	r3, HOST_STACK_LR(r1)
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
From: Aneesh Kumar K.V @ 2020-05-28  6:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, npiggin
In-Reply-To: <20200528014338.GC307798@thinks.paulus.ozlabs.org>

On 5/28/20 7:13 AM, Paul Mackerras wrote:
> On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
>> The locking rules for walking partition scoped table is different from process
>> scoped table. Hence add a helper for secondary linux page table walk and also
>> add check whether we are holding the right locks.
> 
> This patch is causing new warnings to appear when testing migration,
> like this:
> 
> [  142.090159] ------------[ cut here ]------------
> [  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
> [  142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> [  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
> [  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G        W         5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
> [  142.090189] NIP:  c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
> [  142.090190] REGS: c000001e19f077e0 TRAP: 0700   Tainted: G        W          (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
> [  142.090191] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42222422  XER: 20040000
> [  142.090196] CFAR: c00000000012f5ac IRQMASK: 0
>                 GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
>                 GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
>                 GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
>                 GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
>                 GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
>                 GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
>                 GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
>                 GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
> [  142.090217] NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> [  142.090223] LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
> [  142.090224] Call Trace:
> [  142.090230] [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
> [  142.090236] [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
> [  142.090292] [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
> [  142.090300] [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
> [  142.090302] [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
> [  142.090305] [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
> [  142.090307] [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
> [  142.090309] [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
> [  142.090310] Instruction dump:
> [  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
> [  142.090317] e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
> [  142.090322] ---[ end trace 619d45057b6919e0 ]---
> 
> Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
> locklessly, and only takes the kvm->mmu_lock once it finds a dirty
> PTE.  I think that is important for performance, since on any given
> scan of the guest real address space we may only find a small
> proportion of the guest pages to be dirty.
> 
> Are you now relying on the kvm->mmu_lock to protect the existence of
> the PTEs, or just their content?
> 

The patch series should not change any rules w.r.t kvm partition scoped 
page table walk. We only added helpers to make it explicit that this is 
different from regular linux page table walk. And kvm->mmu_lock is what 
was used to protect the partition scoped table walk earlier.

In this specific case, what we need probably is an open coded kvm 
partition scoped walk with a comment around explaining why is it ok to 
walk that partition scoped table without taking kvm->mmu_lock.

What happens when a parallel invalidate happens to Qemu address space? 
Since we are not holding kvm->mmu_lock mmu notifier will complete and we 
will go ahead with unmapping partition scoped table.

Do we need a change like below?

@@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
  {
  	unsigned long gfn = memslot->base_gfn + pagenum;
  	unsigned long gpa = gfn << PAGE_SHIFT;
-	pte_t *ptep;
+	pte_t *ptep, pte;
  	unsigned int shift;
  	int ret = 0;
  	unsigned long old, *rmapp;
@@ -1049,11 +1049,23 @@ static int kvm_radix_test_clear_dirty(struct kvm 
*kvm,
  		return ret;

  	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
-	if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
+	if (!ptep)
+		return 0;
+
+	pte = READ_ONCE(*ptep);
+	if (pte_present(pte) && pte_dirty(pte)) {
  		ret = 1;
  		if (shift)
  			ret = 1 << (shift - PAGE_SHIFT);
  		spin_lock(&kvm->mmu_lock);
+		/*
+		 * Recheck the pte again
+		 */
+		if (pte_val(pte) != pte_val(*ptep)) {
+			spin_unlock(&kvm->mmu_lock);
+			return 0;
+		}
+
  		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
  					      gpa, shift);
  		kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);


^ permalink raw reply

* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: Paolo Bonzini @ 2020-05-28  5:22 UTC (permalink / raw)
  To: David Ahern, Jakub Kicinski, Emanuele Giuseppe Esposito
  Cc: linux-s390, kvm, linux-doc, netdev, Emanuele Giuseppe Esposito,
	linux-kernel, kvm-ppc, Jonathan Adams, Christian Borntraeger,
	Andrew Lunn, Alexander Viro, David Rientjes, linux-fsdevel,
	linux-mips, linuxppc-dev, linux-arm-kernel, Jim Mattson
In-Reply-To: <b6fa4439-c6b8-63a4-84fd-fbac3d4f10fd@gmail.com>

On 28/05/20 00:21, David Ahern wrote:
> On 5/27/20 3:07 PM, Paolo Bonzini wrote:
>> I see what you meant now.  statsfs can also be used to enumerate objects
>> if one is so inclined (with the prototype in patch 7, for example, each
>> network interface becomes a directory).
> 
> there are many use cases that have 100's to 1000's have network devices.
> Having a sysfs entry per device already bloats memory usage for these
> use cases; another filesystem with an entry per device makes that worse.
> Really the wrong direction for large scale systems.

Hi David,

IMO the important part for now is having a flexible kernel API for
exposing statistics across multiple subsystems, so that they can be
harvested in an efficient way.  The userspace API is secondary, and
multiple APIs can be added to cater for different usecases.

For example, as of the first five patches the memory usage is the same
as what is now in the mainline kernel, since all the patchset does is
take existing debugfs inodes and move them to statsfs.  I agree that, if
the concept is extended to the whole kernel, scalability and memory
usage becomes an issue; and indeed, the long-term plan is to support a
binary format that is actually _more_ efficient than the status quo for
large scale systems.

In the meanwhile, the new filesystem can be disabled (see the difference
between "STATS_FS" and "STATS_FS_API") if it imposes undesirable overhead.

Thanks,

Paolo

^ permalink raw reply

* Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
From: Paul Mackerras @ 2020-05-28  1:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin
In-Reply-To: <20200505071729.54912-10-aneesh.kumar@linux.ibm.com>

On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> The locking rules for walking partition scoped table is different from process
> scoped table. Hence add a helper for secondary linux page table walk and also
> add check whether we are holding the right locks.

This patch is causing new warnings to appear when testing migration,
like this:

[  142.090159] ------------[ cut here ]------------
[  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held 
[  142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
[  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G        W         5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
[  142.090189] NIP:  c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
[  142.090190] REGS: c000001e19f077e0 TRAP: 0700   Tainted: G        W          (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
[  142.090191] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42222422  XER: 20040000
[  142.090196] CFAR: c00000000012f5ac IRQMASK: 0 
               GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039 
               GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007 
               GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001 
               GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000 
               GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000 
               GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000 
               GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410 
               GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe 
[  142.090217] NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[  142.090223] LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
[  142.090224] Call Trace:
[  142.090230] [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
[  142.090236] [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
[  142.090292] [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
[  142.090300] [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
[  142.090302] [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
[  142.090305] [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
[  142.090307] [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
[  142.090309] [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
[  142.090310] Instruction dump:
[  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000 
[  142.090317] e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000 
[  142.090322] ---[ end trace 619d45057b6919e0 ]---

Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
locklessly, and only takes the kvm->mmu_lock once it finds a dirty
PTE.  I think that is important for performance, since on any given
scan of the guest real address space we may only find a small
proportion of the guest pages to be dirty.

Are you now relying on the kvm->mmu_lock to protect the existence of
the PTEs, or just their content?

Paul.

^ permalink raw reply

* Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
From: Michael Ellerman @ 2020-05-28  1:03 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Petr Mladek, Daniel Borkmann, linux-kernel, Alexei Starovoitov,
	Paul Mackerras, Masami Hiramatsu, Brendan Gregg, Miroslav Benes,
	linuxppc-dev, Christoph Hellwig
In-Reply-To: <20200527122844.19524-1-pmladek@suse.com>

Petr Mladek <pmladek@suse.com> writes:
> The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
> to archs where they work") caused that bpf_probe_read{, str}() functions
> were not longer available on architectures where the same logical address
> might have different content in kernel and user memory mapping. These
> architectures should use probe_read_{user,kernel}_str helpers.
>
> For backward compatibility, the problematic functions are still available
> on architectures where the user and kernel address spaces are not
> overlapping. This is defined CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
>
> At the moment, these backward compatible functions are enabled only
> on x86_64, arm, and arm64. Let's do it also on powerpc that has
> the non overlapping address space as well.
>
> Signed-off-by: Petr Mladek <pmladek@suse.com>

This seems like it should have a Fixes: tag and go into v5.7?

cheers

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index d13b5328ca10..b29d7cb38368 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -126,6 +126,7 @@ config PPC
>  	select ARCH_HAS_MMIOWB			if PPC64
>  	select ARCH_HAS_PHYS_TO_DMA
>  	select ARCH_HAS_PMEM_API
> +	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>  	select ARCH_HAS_PTE_DEVMAP		if PPC_BOOK3S_64
>  	select ARCH_HAS_PTE_SPECIAL
>  	select ARCH_HAS_MEMBARRIER_CALLBACKS
> -- 
> 2.26.2

^ permalink raw reply

* Re: [PATCH] powerpc/64: Remove unused generic_secondary_thread_init()
From: Jordan Niethe @ 2020-05-28  0:49 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <20200526063446.2476336-1-mpe@ellerman.id.au>

On Tue, May 26, 2020 at 4:36 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> The last caller was removed in 2014 in commit fb5a515704d7 ("powerpc:
> Remove platforms/wsp and associated pieces").
>
> Once generic_secondary_thread_init() is removed there are no longer
> any uses of book3e_secondary_thread_init() or
> generic_secondary_common_init so remove them too.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/include/asm/smp.h       |  1 -
>  arch/powerpc/kernel/exceptions-64e.S |  4 ----
>  arch/powerpc/kernel/head_64.S        | 18 ------------------
>  3 files changed, 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..81a49566ccd8 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -243,7 +243,6 @@ extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
>   * 64-bit but defining them all here doesn't harm
>   */
>  extern void generic_secondary_smp_init(void);
> -extern void generic_secondary_thread_init(void);
>  extern unsigned long __secondary_hold_spinloop;
>  extern unsigned long __secondary_hold_acknowledge;
>  extern char __secondary_hold;
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index d9ed79415100..9f9e8686798b 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1814,10 +1814,6 @@ _GLOBAL(book3e_secondary_core_init)
>  1:     mtlr    r28
>         blr
>
> -_GLOBAL(book3e_secondary_thread_init)
> -       mflr    r28
> -       b       3b
> -
>         .globl init_core_book3e
>  init_core_book3e:
>         /* Establish the interrupt vector base */
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index 0e05a9a47a4b..4ae2c18c5fc6 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -302,23 +302,6 @@ _GLOBAL(fsl_secondary_thread_init)
>  1:
>  #endif

Nothing directly calls generic_secondary_thread_init() but I think
fsl_secondary_thread_init() which is directly above "falls through"
into it. fsl_secondary_thread_init() still has callers.

>
> -_GLOBAL(generic_secondary_thread_init)
> -       mr      r24,r3
> -
> -       /* turn on 64-bit mode */
> -       bl      enable_64b_mode
> -
> -       /* get a valid TOC pointer, wherever we're mapped at */
> -       bl      relative_toc
> -       tovirt(r2,r2)
> -
> -#ifdef CONFIG_PPC_BOOK3E
> -       /* Book3E initialization */
> -       mr      r3,r24
> -       bl      book3e_secondary_thread_init
> -#endif
> -       b       generic_secondary_common_init
> -
>  /*
>   * On pSeries and most other platforms, secondary processors spin
>   * in the following code.
> @@ -385,7 +368,6 @@ _GLOBAL(generic_secondary_smp_init)
>  20:
>  #endif
>
> -generic_secondary_common_init:
>         /* Set up a paca value for this processor. Since we have the
>          * physical cpu id in r24, we need to search the pacas to find
>          * which logical id maps to our physical one.
> --
> 2.25.1
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox