LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] crypto: vmx - Document CTR mode counter width quirks
From: Daniel Axtens @ 2019-06-11  1:54 UTC (permalink / raw)
  To: mpe, ebiggers, linux-crypto, Herbert Xu
  Cc: leo.barbosa, Stephan Mueller, nayna, omosnacek, leitao, pfsmorigo,
	marcelo.cerri, gcwilson, linuxppc-dev

The CTR code comes from OpenSSL, where it does a 32-bit counter.
The kernel has a 128-bit counter. This difference has lead to
issues.

Document it.

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 drivers/crypto/vmx/aesp8-ppc.pl | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/vmx/aesp8-ppc.pl b/drivers/crypto/vmx/aesp8-ppc.pl
index 9c6b5c1d6a1a..db874367b602 100644
--- a/drivers/crypto/vmx/aesp8-ppc.pl
+++ b/drivers/crypto/vmx/aesp8-ppc.pl
@@ -1286,6 +1286,24 @@ ___
 
 #########################################################################
 {{{	# CTR procedure[s]						#
+
+####################### WARNING: Here be dragons! #######################
+#
+# This code is written as 'ctr32', based on a 32-bit counter used
+# upstream. The kernel does *not* use a 32-bit counter. The kernel uses
+# a 128-bit counter.
+#
+# This leads to subtle changes from the upstream code: the counter
+# is incremented with vaddu_q_m rather than vaddu_w_m. This occurs in
+# both the bulk (8 blocks at a time) path, and in the individual block
+# path. Be aware of this when doing updates.
+#
+# See:
+# 1d4aa0b4c181 ("crypto: vmx - Fixing AES-CTR counter bug")
+# 009b30ac7444 ("crypto: vmx - CTR: always increment IV as quadword")
+# https://github.com/openssl/openssl/pull/8942
+#
+#########################################################################
 my ($inp,$out,$len,$key,$ivp,$x10,$rounds,$idx)=map("r$_",(3..10));
 my ($rndkey0,$rndkey1,$inout,$tmp)=		map("v$_",(0..3));
 my ($ivec,$inptail,$inpperm,$outhead,$outperm,$outmask,$keyperm,$one)=
@@ -1357,7 +1375,7 @@ Loop_ctr32_enc:
 	addi		$idx,$idx,16
 	bdnz		Loop_ctr32_enc
 
-	vadduqm		$ivec,$ivec,$one
+	vadduqm		$ivec,$ivec,$one	# Kernel change for 128-bit
 	 vmr		$dat,$inptail
 	 lvx		$inptail,0,$inp
 	 addi		$inp,$inp,16
@@ -1501,7 +1519,7 @@ Load_ctr32_enc_key:
 	$SHL		$len,$len,4
 
 	vadduqm		$out1,$ivec,$one	# counter values ...
-	vadduqm		$out2,$ivec,$two
+	vadduqm		$out2,$ivec,$two	# (do all ctr adds as 128-bit)
 	vxor		$out0,$ivec,$rndkey0	# ... xored with rndkey[0]
 	 le?li		$idx,8
 	vadduqm		$out3,$out1,$two
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v2] powerpc: Add force enable of DAWR on P9 option
From: Cédric Le Goater @ 2019-06-10 17:31 UTC (permalink / raw)
  To: Michael Neuling, mpe; +Cc: linuxppc-dev, Cameron Kaiser
In-Reply-To: <20190401060312.22670-1-mikey@neuling.org>

Hello Michael,


> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -822,18 +822,21 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>  	mtspr	SPRN_IAMR, r5
>  	mtspr	SPRN_PSPB, r6
>  	mtspr	SPRN_FSCR, r7
> -	ld	r5, VCPU_DAWR(r4)
> -	ld	r6, VCPU_DAWRX(r4)
> -	ld	r7, VCPU_CIABR(r4)
> -	ld	r8, VCPU_TAR(r4)
>  	/*
>  	 * Handle broken DAWR case by not writing it. This means we
>  	 * can still store the DAWR register for migration.
>  	 */
> -BEGIN_FTR_SECTION
> +	LOAD_REG_ADDR(r5, dawr_force_enable)
> +	lbz	r5, 0(r5)
> +	cmpdi	r5, 0
> +	beq	1f
> +	ld	r5, VCPU_DAWR(r4)
> +	ld	r6, VCPU_DAWRX(r4)
>  	mtspr	SPRN_DAWR, r5
>  	mtspr	SPRN_DAWRX, r6
> -END_FTR_SECTION_IFSET(CPU_FTR_DAWR)
> +1:
> +	ld	r7, VCPU_CIABR(r4)
> +	ld	r8, VCPU_TAR(r4)
>  	mtspr	SPRN_CIABR, r7
>  	mtspr	SPRN_TAR, r8
>  	ld	r5, VCPU_IC(r4)
> @@ -2513,11 +2516,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	blr
>  
>  2:
> -BEGIN_FTR_SECTION
> -	/* POWER9 with disabled DAWR */
> +	LOAD_REG_ADDR(r11, dawr_force_enable)
> +	lbz	r11, 0(r11)
> +	cmpdi	r11, 0
>  	li	r3, H_HARDWARE
> -	blr
> -END_FTR_SECTION_IFCLR(CPU_FTR_DAWR)
> +	beqlr

Why is this a 'beqlr' ? Shouldn't it be a blr ? 

C.

>  	/* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */
>  	rlwimi	r5, r4, 5, DAWRX_DR | DAWRX_DW
>  	rlwimi	r5, r4, 2, DAWRX_WT
> 


^ permalink raw reply

* Re: [RFC PATCH] powerpc/book3e: KASAN Full support for 64bit
From: Daniel Axtens @ 2019-06-11  1:21 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <a0c04cb8-a19b-2d73-5725-4868556e2b47@c-s.fr>

Christophe Leroy <christophe.leroy@c-s.fr> writes:

> On 06/03/2019 11:50 PM, Daniel Axtens wrote:
>> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>> 
>>> Hi,
>>>
>>> Ok, can you share your .config ?
>> 
>> Sure! This one is with kasan off as the last build I did was testing to
>> see if the code reorgisation was the cause of the issues. (it was not)
>> 
>> 
>> 
>> 
>> This was the kasan-enabled config that failed to boot:
>> 
>> 
>
> Same issue with your .config under QEMU:
>
> A go with gdb shows:
>
> Breakpoint 3, 0xc000000000027b6c in exc_0x700_common ()
> => 0xc000000000027b6c <exc_0x700_common+0>:	f8 01 00 70	std     r0,112(r1)
> (gdb) bt
> #0  0xc000000000027b6c in exc_0x700_common ()
> #1  0xc00000000136f80c in .udbg_init_memcons ()
>

Thanks for debugging this!

> Without CONFIG_PPC_EARLY_DEBUG, it boots fine for me. Can you check on 
> your side ?

Yes, that works on my side.

> Deactivating KASAN for arch/powerpc/kernel/udbg.o and 
> arch/powerpc/sysdev/udbg_memcons.o is not enough, we hit a call to 
> strstr() in register_early_udbg_console(), and once we get rid of it (in 
> the same way as in prom_init.c) the next issue is register_console() and 
> I don't know what to do about that one.

Disabling early debug seems like a reasonable restriction to add.

I'll have a look at modules across this and book3s next.

Regards,
Daniel

>
> Christophe
>
>> 
>> 
>> Regards,
>> Daniel
>> 
>>>
>>> Christophe
>>>
>>> Le 31/05/2019 à 03:29, Daniel Axtens a écrit :
>>>> Hi Christophe,
>>>>
>>>> I tried this on the t4240rdb and it fails to boot if KASAN is
>>>> enabled. It does boot with the patch applied but KASAN disabled, so that
>>>> narrows it down a little bit.
>>>>
>>>> I need to focus on 3s first so I'll just drop 3e from my patch set for
>>>> now.
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>>> The KASAN shadow area is mapped into vmemmap space:
>>>>> 0x8000 0400 0000 0000 to 0x8000 0600 0000 0000.
>>>>> For this vmemmap has to be disabled.
>>>>>
>>>>> Cc: Daniel Axtens <dja@axtens.net>
>>>>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>>>> ---
>>>>>    arch/powerpc/Kconfig                  |   1 +
>>>>>    arch/powerpc/Kconfig.debug            |   3 +-
>>>>>    arch/powerpc/include/asm/kasan.h      |  11 +++
>>>>>    arch/powerpc/kernel/Makefile          |   2 +
>>>>>    arch/powerpc/kernel/head_64.S         |   3 +
>>>>>    arch/powerpc/kernel/setup_64.c        |  20 +++---
>>>>>    arch/powerpc/mm/kasan/Makefile        |   1 +
>>>>>    arch/powerpc/mm/kasan/kasan_init_64.c | 129 ++++++++++++++++++++++++++++++++++
>>>>>    8 files changed, 159 insertions(+), 11 deletions(-)
>>>>>    create mode 100644 arch/powerpc/mm/kasan/kasan_init_64.c
>>>>>
>>>>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>>>>> index 1a2fb50126b2..e0b7c45e4dc7 100644
>>>>> --- a/arch/powerpc/Kconfig
>>>>> +++ b/arch/powerpc/Kconfig
>>>>> @@ -174,6 +174,7 @@ config PPC
>>>>>    	select HAVE_ARCH_AUDITSYSCALL
>>>>>    	select HAVE_ARCH_JUMP_LABEL
>>>>>    	select HAVE_ARCH_KASAN			if PPC32
>>>>> +	select HAVE_ARCH_KASAN			if PPC_BOOK3E_64 && !SPARSEMEM_VMEMMAP
>>>>>    	select HAVE_ARCH_KGDB
>>>>>    	select HAVE_ARCH_MMAP_RND_BITS
>>>>>    	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
>>>>> diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
>>>>> index 61febbbdd02b..b4140dd6b4e4 100644
>>>>> --- a/arch/powerpc/Kconfig.debug
>>>>> +++ b/arch/powerpc/Kconfig.debug
>>>>> @@ -370,4 +370,5 @@ config PPC_FAST_ENDIAN_SWITCH
>>>>>    config KASAN_SHADOW_OFFSET
>>>>>    	hex
>>>>>    	depends on KASAN
>>>>> -	default 0xe0000000
>>>>> +	default 0xe0000000 if PPC32
>>>>> +	default 0x6800040000000000 if PPC64
>>>>> diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h
>>>>> index 296e51c2f066..756b3d58f921 100644
>>>>> --- a/arch/powerpc/include/asm/kasan.h
>>>>> +++ b/arch/powerpc/include/asm/kasan.h
>>>>> @@ -23,10 +23,21 @@
>>>>>    
>>>>>    #define KASAN_SHADOW_OFFSET	ASM_CONST(CONFIG_KASAN_SHADOW_OFFSET)
>>>>>    
>>>>> +#ifdef CONFIG_PPC32
>>>>>    #define KASAN_SHADOW_END	0UL
>>>>>    
>>>>>    #define KASAN_SHADOW_SIZE	(KASAN_SHADOW_END - KASAN_SHADOW_START)
>>>>>    
>>>>> +#else
>>>>> +
>>>>> +#include <asm/pgtable.h>
>>>>> +
>>>>> +#define KASAN_SHADOW_SIZE	(KERN_VIRT_SIZE >> KASAN_SHADOW_SCALE_SHIFT)
>>>>> +
>>>>> +#define KASAN_SHADOW_END	(KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
>>>>> +
>>>>> +#endif /* CONFIG_PPC32 */
>>>>> +
>>>>>    #ifdef CONFIG_KASAN
>>>>>    void kasan_early_init(void);
>>>>>    void kasan_mmu_init(void);
>>>>> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
>>>>> index 0ea6c4aa3a20..7f232c06f11d 100644
>>>>> --- a/arch/powerpc/kernel/Makefile
>>>>> +++ b/arch/powerpc/kernel/Makefile
>>>>> @@ -35,6 +35,8 @@ KASAN_SANITIZE_early_32.o := n
>>>>>    KASAN_SANITIZE_cputable.o := n
>>>>>    KASAN_SANITIZE_prom_init.o := n
>>>>>    KASAN_SANITIZE_btext.o := n
>>>>> +KASAN_SANITIZE_paca.o := n
>>>>> +KASAN_SANITIZE_setup_64.o := n
>>>>>    
>>>>>    ifdef CONFIG_KASAN
>>>>>    CFLAGS_early_32.o += -DDISABLE_BRANCH_PROFILING
>>>>> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
>>>>> index 3fad8d499767..80fbd8024fb2 100644
>>>>> --- a/arch/powerpc/kernel/head_64.S
>>>>> +++ b/arch/powerpc/kernel/head_64.S
>>>>> @@ -966,6 +966,9 @@ start_here_multiplatform:
>>>>>    	 * and SLB setup before we turn on relocation.
>>>>>    	 */
>>>>>    
>>>>> +#ifdef CONFIG_KASAN
>>>>> +	bl	kasan_early_init
>>>>> +#endif
>>>>>    	/* Restore parameters passed from prom_init/kexec */
>>>>>    	mr	r3,r31
>>>>>    	bl	early_setup		/* also sets r13 and SPRG_PACA */
>>>>> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
>>>>> index ba404dd9ce1d..d2bf860dd966 100644
>>>>> --- a/arch/powerpc/kernel/setup_64.c
>>>>> +++ b/arch/powerpc/kernel/setup_64.c
>>>>> @@ -311,6 +311,16 @@ void __init early_setup(unsigned long dt_ptr)
>>>>>     	DBG(" -> early_setup(), dt_ptr: 0x%lx\n", dt_ptr);
>>>>>    
>>>>>    	/*
>>>>> +	 * Configure exception handlers. This include setting up trampolines
>>>>> +	 * if needed, setting exception endian mode, etc...
>>>>> +	 */
>>>>> +	configure_exceptions();
>>>>> +
>>>>> +	/* Apply all the dynamic patching */
>>>>> +	apply_feature_fixups();
>>>>> +	setup_feature_keys();
>>>>> +
>>>>> +	/*
>>>>>    	 * Do early initialization using the flattened device
>>>>>    	 * tree, such as retrieving the physical memory map or
>>>>>    	 * calculating/retrieving the hash table size.
>>>>> @@ -325,16 +335,6 @@ void __init early_setup(unsigned long dt_ptr)
>>>>>    	setup_paca(paca_ptrs[boot_cpuid]);
>>>>>    	fixup_boot_paca();
>>>>>    
>>>>> -	/*
>>>>> -	 * Configure exception handlers. This include setting up trampolines
>>>>> -	 * if needed, setting exception endian mode, etc...
>>>>> -	 */
>>>>> -	configure_exceptions();
>>>>> -
>>>>> -	/* Apply all the dynamic patching */
>>>>> -	apply_feature_fixups();
>>>>> -	setup_feature_keys();
>>>>> -
>>>>>    	/* Initialize the hash table or TLB handling */
>>>>>    	early_init_mmu();
>>>>>    
>>>>> diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile
>>>>> index 6577897673dd..0bfbe3892808 100644
>>>>> --- a/arch/powerpc/mm/kasan/Makefile
>>>>> +++ b/arch/powerpc/mm/kasan/Makefile
>>>>> @@ -3,3 +3,4 @@
>>>>>    KASAN_SANITIZE := n
>>>>>    
>>>>>    obj-$(CONFIG_PPC32)           += kasan_init_32.o
>>>>> +obj-$(CONFIG_PPC64)	+= kasan_init_64.o
>>>>> diff --git a/arch/powerpc/mm/kasan/kasan_init_64.c b/arch/powerpc/mm/kasan/kasan_init_64.c
>>>>> new file mode 100644
>>>>> index 000000000000..7fd71b8e883b
>>>>> --- /dev/null
>>>>> +++ b/arch/powerpc/mm/kasan/kasan_init_64.c
>>>>> @@ -0,0 +1,129 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +
>>>>> +#define DISABLE_BRANCH_PROFILING
>>>>> +
>>>>> +#include <linux/kasan.h>
>>>>> +#include <linux/printk.h>
>>>>> +#include <linux/memblock.h>
>>>>> +#include <linux/sched/task.h>
>>>>> +#include <asm/pgalloc.h>
>>>>> +
>>>>> +static void __init kasan_populate_pte(pte_t *ptep, pgprot_t prot)
>>>>> +{
>>>>> +	unsigned long va = (unsigned long)kasan_early_shadow_page;
>>>>> +	phys_addr_t pa = __pa(kasan_early_shadow_page);
>>>>> +	int i;
>>>>> +
>>>>> +	for (i = 0; i < PTRS_PER_PTE; i++, ptep++)
>>>>> +		__set_pte_at(&init_mm, va, ptep, pfn_pte(PHYS_PFN(pa), prot), 0);
>>>>> +}
>>>>> +
>>>>> +static void __init kasan_populate_pmd(pmd_t *pmdp)
>>>>> +{
>>>>> +	int i;
>>>>> +
>>>>> +	for (i = 0; i < PTRS_PER_PMD; i++)
>>>>> +		pmd_populate_kernel(&init_mm, pmdp + i, kasan_early_shadow_pte);
>>>>> +}
>>>>> +
>>>>> +static void __init kasan_populate_pud(pud_t *pudp)
>>>>> +{
>>>>> +	int i;
>>>>> +
>>>>> +	for (i = 0; i < PTRS_PER_PUD; i++)
>>>>> +		pud_populate(&init_mm, pudp + i, kasan_early_shadow_pmd);
>>>>> +}
>>>>> +
>>>>> +static void __init *kasan_alloc_pgtable(unsigned long size)
>>>>> +{
>>>>> +	void *ptr = memblock_alloc_try_nid(size, size, MEMBLOCK_LOW_LIMIT,
>>>>> +					   __pa(MAX_DMA_ADDRESS), NUMA_NO_NODE);
>>>>> +
>>>>> +	if (!ptr)
>>>>> +		panic("%s: Failed to allocate %lu bytes align=0x%lx max_addr=%lx\n",
>>>>> +		      __func__, size, size, __pa(MAX_DMA_ADDRESS));
>>>>> +
>>>>> +	return ptr;
>>>>> +}
>>>>> +
>>>>> +static int __init kasan_map_page(unsigned long va, unsigned long pa, pgprot_t prot)
>>>>> +{
>>>>> +	pgd_t *pgdp = pgd_offset_k(va);
>>>>> +	pud_t *pudp;
>>>>> +	pmd_t *pmdp;
>>>>> +	pte_t *ptep;
>>>>> +
>>>>> +	if (pgd_none(*pgdp) || (void *)pgd_page_vaddr(*pgdp) == kasan_early_shadow_pud) {
>>>>> +		pudp = kasan_alloc_pgtable(PUD_TABLE_SIZE);
>>>>> +		kasan_populate_pud(pudp);
>>>>> +		pgd_populate(&init_mm, pgdp, pudp);
>>>>> +	}
>>>>> +	pudp = pud_offset(pgdp, va);
>>>>> +	if (pud_none(*pudp) || (void *)pud_page_vaddr(*pudp) == kasan_early_shadow_pmd) {
>>>>> +		pmdp = kasan_alloc_pgtable(PMD_TABLE_SIZE);
>>>>> +		kasan_populate_pmd(pmdp);
>>>>> +		pud_populate(&init_mm, pudp, pmdp);
>>>>> +	}
>>>>> +	pmdp = pmd_offset(pudp, va);
>>>>> +	if (!pmd_present(*pmdp) || (void *)pmd_page_vaddr(*pmdp) == kasan_early_shadow_pte) {
>>>>> +		ptep = kasan_alloc_pgtable(PTE_TABLE_SIZE);
>>>>> +		kasan_populate_pte(ptep, PAGE_KERNEL);
>>>>> +		pmd_populate_kernel(&init_mm, pmdp, ptep);
>>>>> +	}
>>>>> +	ptep = pte_offset_kernel(pmdp, va);
>>>>> +
>>>>> +	__set_pte_at(&init_mm, va, ptep, pfn_pte(pa >> PAGE_SHIFT, prot), 0);
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static void __init kasan_init_region(struct memblock_region *reg)
>>>>> +{
>>>>> +	void *start = __va(reg->base);
>>>>> +	void *end = __va(reg->base + reg->size);
>>>>> +	unsigned long k_start, k_end, k_cur;
>>>>> +
>>>>> +	if (start >= end)
>>>>> +		return;
>>>>> +
>>>>> +	k_start = (unsigned long)kasan_mem_to_shadow(start);
>>>>> +	k_end = (unsigned long)kasan_mem_to_shadow(end);
>>>>> +
>>>>> +	for (k_cur = k_start; k_cur < k_end; k_cur += PAGE_SIZE) {
>>>>> +		void *va = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>> +
>>>>> +		kasan_map_page(k_cur, __pa(va), PAGE_KERNEL);
>>>>> +	}
>>>>> +	flush_tlb_kernel_range(k_start, k_end);
>>>>> +}
>>>>> +
>>>>> +void __init kasan_init(void)
>>>>> +{
>>>>> +	struct memblock_region *reg;
>>>>> +
>>>>> +	for_each_memblock(memory, reg)
>>>>> +		kasan_init_region(reg);
>>>>> +
>>>>> +	/* It's too early to use clear_page() ! */
>>>>> +	memset(kasan_early_shadow_page, 0, sizeof(kasan_early_shadow_page));
>>>>> +
>>>>> +	/* Enable error messages */
>>>>> +	init_task.kasan_depth = 0;
>>>>> +	pr_info("KASAN init done\n");
>>>>> +}
>>>>> +
>>>>> +/* The early shadow maps everything to a single page of zeroes */
>>>>> +asmlinkage void __init kasan_early_init(void)
>>>>> +{
>>>>> +	unsigned long addr = KASAN_SHADOW_START;
>>>>> +	unsigned long end = KASAN_SHADOW_END;
>>>>> +	pgd_t *pgdp = pgd_offset_k(addr);
>>>>> +
>>>>> +	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
>>>>> +	kasan_populate_pmd(kasan_early_shadow_pmd);
>>>>> +	kasan_populate_pud(kasan_early_shadow_pud);
>>>>> +
>>>>> +	do {
>>>>> +		pgd_populate(&init_mm, pgdp, kasan_early_shadow_pud);
>>>>> +	} while (pgdp++, addr = pgd_addr_end(addr, end), addr != end);
>>>>> +}
>>>>> -- 
>>>>> 2.13.3

^ permalink raw reply

* Re: crash after NX error
From: Haren Myneni @ 2019-06-11  0:44 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Stewart Smith
In-Reply-To: <87zhmwmgv7.fsf@concordia.ellerman.id.au>

On 06/05/2019 04:06 AM, Michael Ellerman wrote:
> Stewart Smith <stewart@linux.ibm.com> writes:
>> On my two socket POWER9 system (powernv) with 842 zwap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [    2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
>> [16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc000000000391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
>>     pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
>>     lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
>>     sp: c000001c9d98bc20
>>    msr: 900000000280b033
>>    dar: 6655f67da816cdb8
>>   current = 0xc000001ad43cb400
>>   paca    = 0xc00000000fac7800   softe: 0        irq_happened: 0x01
>>     pid   = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
>> [c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
>> [c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 00007afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni <haren@linux.vnet.ibm.com>
>> Date:   Wed Jun 13 00:32:40 2018 -0700
>>
>>     crypto/nx: Initialize 842 high and normal RxFIFO control registers
>>     
>>     NX increments readOffset by FIFO size in receive FIFO control register
>>     when CRB is read. But the index in RxFIFO has to match with the
>>     corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>>     may be processing incorrect CRBs and can cause CRB timeout.
>>     
>>     VAS FIFO offset is 0 when the receive window is opened during
>>     initialization. When the module is reloaded or in kexec boot, readOffset
>>     in FIFO control register may not match with VAS entry. This patch adds
>>     nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>>     control register for both high and normal FIFOs.
>>     
>>     Signed-off-by: Haren Myneni <haren@us.ibm.com>
>>     [mpe: Fixup uninitialized variable warning]
>>     Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
> 
> Yeah the P9 NX support went in in:
>   b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
> 
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
> 
> 
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed.
> 
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corrupting in zswap. Haren have you done any testing of error injection?

The code path returns error code from wait_for_csb() properly to upper level APIs. In the case of decompression case, upon failure the request will fall back to SW 842. 

If NX is involved in this crash, the compression request may be successful with invalid CRB (mismatch FIFO entries in NX and VAS). Then SW 842 may be decompressed invalid data which might cause corruption later when accessing it. 

I will try to reproduce the issue with 4.14 kernel,

Thanks
Haren
  
> 
> cheers
> 


^ permalink raw reply

* [Bug 203839] Kernel 5.2-rc3 fails to boot on a PowerMac G4 3,6: systemd[1]: Failed to bump fs.file-max, ignoring: invalid argument
From: bugzilla-daemon @ 2019-06-11  0:34 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203839-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203839

Erhard F. (erhard_f@mailbox.org) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #283139|0                           |1
        is obsolete|                            |

--- Comment #7 from Erhard F. (erhard_f@mailbox.org) ---
Created attachment 283185
  --> https://bugzilla.kernel.org/attachment.cgi?id=283185&action=edit
kernel .config (5.1.0-rc3+, G4 MDD)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* [Bug 203839] Kernel 5.2-rc3 fails to boot on a PowerMac G4 3,6: systemd[1]: Failed to bump fs.file-max, ignoring: invalid argument
From: bugzilla-daemon @ 2019-06-11  0:32 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203839-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203839

--- Comment #6 from Erhard F. (erhard_f@mailbox.org) ---
Created attachment 283183
  --> https://bugzilla.kernel.org/attachment.cgi?id=283183&action=edit
bisect.log

bisect took me a while due to quite some skips. Cherry-picking
397d2300b08cdee052053e362018cdb6dd65eea2 and
305d60012304684bd59ea1f67703e51662e4906a helped me complete it.

# git bisect good | tee -a /root/bisect02.log
215b823707ce4e8e52b106915f70357fa474c669 is the first bad commit
commit 215b823707ce4e8e52b106915f70357fa474c669
Author: Christophe Leroy <christophe.leroy@c-s.fr>
Date:   Fri Apr 26 16:23:36 2019 +0000

    powerpc/32s: set up an early static hash table for KASAN.

    KASAN requires early activation of hash table, before memblock()
    functions are available.

    This patch implements an early hash_table statically defined in
    __initdata.

    During early boot, a single page table is used.

    For hash32, when doing the final init, one page table is allocated
    for each PGD entry because of the _PAGE_HASHPTE flag which can't be
    common to several virt pages. This is done after memblock get
    available but before switching to the final hash table, otherwise
    there are issues with TLB flushing due to the shared entries.

    Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

:040000 040000 abc24eb3c4ad3e4f2b1eb7b52c295c8b95d79a78
c3b6114c26eb8e181abb3f1abc9b6ecc12292f4d M      arch

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
From: Naoya Horiguchi @ 2019-06-10 23:50 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Wanpeng Li, kvm, Punit Agrawal, Xiao Guangrong,
	linux-kernel@vger.kernel.org, Michal Hocko, linux-mm@kvack.org,
	yongkaiwu@tencent.com, Aneesh Kumar K.V, Paolo Bonzini,
	Andrew Morton, lidongchen@tencent.com,
	linuxppc-dev@lists.ozlabs.org, Anshuman Khandual
In-Reply-To: <f79d828c-b0b4-8a20-c316-a13430cfb13c@oracle.com>

On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > Cc Paolo,
> > Hi all,
> > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> >>> Andrew Morton <akpm@linux-foundation.org> writes:
> >>>
> >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> >>>>
> >>>>>>
> >>>>>> So I don't think that the above test result means that errors are properly
> >>>>>> handled, and the proposed patch should help for arm64.
> >>>>>
> >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> >>>>> would be easier to maintain and reason about if arm64 helpers are
> >>>>> consistent with expectations by core code.
> >>>>>
> >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>>>> it would be helpful if there was a clear expression of semantics for
> >>>>> pud_huge() for various cases. Is there any version that can be used as
> >>>>> reference?
> >>>>
> >>>> Is that an ack or tested-by?
> >>>>
> >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> >>>> but they remain steadfastly in hiding.
> >>>
> >>> Cc'ing linuxppc-dev is always a good idea :)
> >>>
> >>
> >> Thanks Michael,
> >>
> >> I was mostly concerned about use cases for soft/hard offline of huge pages
> >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> >> huge pages, and soft/hard offline support was specifically added for this.
> >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> >> at PGD level"
> >>
> >> This patch will disable that functionality.  So, at a minimum this is a
> >> 'heads up'.  If there are actual use cases that depend on this, then more
> >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> >> support, I can not tell if there is a real use case or this is just a
> >> 'nice to have'.
> > 
> > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > encounter gup_pud_range() panic several times in product environment.
> > Is there any plan to reenable and fix arch codes?
> 
> I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> Intel MMU capacity increases to handle more TLB entries there will be more
> and more interest.
> 
> Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> he know most about this code.

Thanks for forwarding this to me, I'm feeling that memory error handling
on 1GB hugepage is demanded as real use case.

> 
> > In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > hwpoison entry. The guest will vmexit and retry endlessly when
> > accessing any memory in the guest which is backed by this 1GB poisoned
> > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > will be delivered to VM at page fault next time for the offensive
> > SPTE. Is this proposal acceptable?
> 
> I am not sure of the error handling design, but this does sound reasonable.

I agree that that's better.

> That block of code which potentially dissolves a huge page on memory error
> is hard to understand and I'm not sure if that is even the 'normal'
> functionality.  Certainly, we would hate to waste/poison an entire 1G page
> for an error on a small subsection.

Yes, that's not practical, so we need at first establish the code base for
2GB hugetlb splitting and then extending it to 1GB next.

Thanks,
Naoya Horiguchi

^ permalink raw reply

* Re: [PATCH 4/4] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2019-06-11  0:16 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm; +Cc: linuxppc-dev, linux-arm-kernel
In-Reply-To: <a3b2dcb1-148e-b2f1-e181-92c16d868bc9@arm.com>

Anshuman Khandual's on June 10, 2019 6:53 pm:
> On 06/10/2019 10:08 AM, Nicholas Piggin wrote:
>> For platforms that define HAVE_ARCH_HUGE_VMAP, have vmap allow vmalloc to
>> allocate huge pages and map them.
> 
> IIUC that extends HAVE_ARCH_HUGE_VMAP from iormap to vmalloc. 
> 
>> 
>> This brings dTLB misses for linux kernel tree `git diff` from 45,000 to
>> 8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
>> (performance is in the noise, under 1% difference, page tables are likely
>> to be well cached for this workload). Similar numbers are seen on POWER9.
> 
> Sure will try this on arm64.
> 
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>  include/asm-generic/4level-fixup.h |   1 +
>>  include/asm-generic/5level-fixup.h |   1 +
>>  include/linux/vmalloc.h            |   1 +
>>  mm/vmalloc.c                       | 132 +++++++++++++++++++++++------
>>  4 files changed, 107 insertions(+), 28 deletions(-)
>> 
>> diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
>> index e3667c9a33a5..3cc65a4dd093 100644
>> --- a/include/asm-generic/4level-fixup.h
>> +++ b/include/asm-generic/4level-fixup.h
>> @@ -20,6 +20,7 @@
>>  #define pud_none(pud)			0
>>  #define pud_bad(pud)			0
>>  #define pud_present(pud)		1
>> +#define pud_large(pud)			0
>>  #define pud_ERROR(pud)			do { } while (0)
>>  #define pud_clear(pud)			pgd_clear(pud)
>>  #define pud_val(pud)			pgd_val(pud)
>> diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
>> index bb6cb347018c..c4377db09a4f 100644
>> --- a/include/asm-generic/5level-fixup.h
>> +++ b/include/asm-generic/5level-fixup.h
>> @@ -22,6 +22,7 @@
>>  #define p4d_none(p4d)			0
>>  #define p4d_bad(p4d)			0
>>  #define p4d_present(p4d)		1
>> +#define p4d_large(p4d)			0
>>  #define p4d_ERROR(p4d)			do { } while (0)
>>  #define p4d_clear(p4d)			pgd_clear(p4d)
>>  #define p4d_val(p4d)			pgd_val(p4d)
> 
> Both of these are required from vmalloc_to_page() which as per a later comment
> should be part of a prerequisite patch before this series.

I'm not sure what you mean. This patch is where they get used.

Possibly I could split this and the vmalloc_to_page change out. I'll
consider it.

>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 812bea5866d6..4c92dc608928 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -42,6 +42,7 @@ struct vm_struct {
>>  	unsigned long		size;
>>  	unsigned long		flags;
>>  	struct page		**pages;
>> +	unsigned int		page_shift;
> 
> So the entire vm_struct will be mapped with a single page_shift. It cannot have
> mix and match mappings with PAGE_SIZE, PMD_SIZE, PUD_SIZE etc in case the
> allocation fails for larger ones, falling back etc what over other reasons.

For now, yes. I have a bit of follow up work to improve that and make
it able to fall back, but it's a bit more churn and not a significant
benefit just yet because there are not a lot of very large vmallocs
(except the early hashes which can be satisfied with large allocs).

> 
>>  	unsigned int		nr_pages;
>>  	phys_addr_t		phys_addr;
>>  	const void		*caller;
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index dd27cfb29b10..0cf8e861caeb 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -36,6 +36,7 @@
>>  #include <linux/rbtree_augmented.h>
>>  
>>  #include <linux/uaccess.h>
>> +#include <asm/pgtable.h>
>>  #include <asm/tlbflush.h>
>>  #include <asm/shmparam.h>
>>  
>> @@ -440,6 +441,41 @@ static int vmap_pages_range(unsigned long start, unsigned long end,
>>  	return ret;
>>  }
>>  
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +static int vmap_hpages_range(unsigned long start, unsigned long end,
> 
> A small nit (if you agree) s/hpages/huge_pages/

Hmm. It's not actually a good function name because it can do small
pages as well. vmap_pages_size_range or something may be better.

> 
>> +				   pgprot_t prot, struct page **pages,
> 
> Re-order (prot <---> pages) just to follow the standard like before.

Will do.

>> +				   unsigned int page_shift)
>> +{
>> +	unsigned long addr = start;
>> +	unsigned int i, nr = (end - start) >> (PAGE_SHIFT + page_shift);
> 
> s/nr/nr_huge_pages ?

Sure.

> Also should not we check for the alignment of the range [start...end] with
> respect to (1UL << [PAGE_SHIFT + page_shift]).

The caller should if it specifies large page. Could check and -EINVAL
for incorrect alignment.

>> +
>> +	for (i = 0; i < nr; i++) {
>> +		int err;
>> +
>> +		err = vmap_range_noflush(addr,
>> +					addr + (PAGE_SIZE << page_shift),
>> +					__pa(page_address(pages[i])), prot,
>> +					PAGE_SHIFT + page_shift);
>> +		if (err)
>> +			return err;
>> +
>> +		addr += PAGE_SIZE << page_shift;
>> +	}
>> +	flush_cache_vmap(start, end);
>> +
>> +	return nr;
>> +}
>> +#else
>> +static int vmap_hpages_range(unsigned long start, unsigned long end,
>> +			   pgprot_t prot, struct page **pages,
>> +			   unsigned int page_shift)
>> +{
>> +	BUG_ON(page_shift != PAGE_SIZE);
>> +	return vmap_pages_range(start, end, prot, pages);
>> +}
>> +#endif
>> +
>> +
>>  int is_vmalloc_or_module_addr(const void *x)
>>  {
>>  	/*
>> @@ -462,7 +498,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  {
>>  	unsigned long addr = (unsigned long) vmalloc_addr;
>>  	struct page *page = NULL;
>> -	pgd_t *pgd = pgd_offset_k(addr);
>> +	pgd_t *pgd;
>>  	p4d_t *p4d;
>>  	pud_t *pud;
>>  	pmd_t *pmd;
>> @@ -474,27 +510,38 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  	 */
>>  	VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>  
>> +	pgd = pgd_offset_k(addr);
>>  	if (pgd_none(*pgd))
>>  		return NULL;
>> +
> 
> Small nit. Stray line here.
> 
> 'pgd' related changes here seem to be just cleanups and should not part of this patch.

Yeah I figure it doesn't matter to make small changes close by, but
maybe that's more frowned upon now for git blame?

>>  	p4d = p4d_offset(pgd, addr);
>>  	if (p4d_none(*p4d))
>>  		return NULL;
>> -	pud = pud_offset(p4d, addr);
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +	if (p4d_large(*p4d))
>> +		return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> +#endif
>> +	if (WARN_ON_ONCE(p4d_bad(*p4d)))
>> +		return NULL;
>>  
>> -	/*
>> -	 * Don't dereference bad PUD or PMD (below) entries. This will also
>> -	 * identify huge mappings, which we may encounter on architectures
>> -	 * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
>> -	 * identified as vmalloc addresses by is_vmalloc_addr(), but are
>> -	 * not [unambiguously] associated with a struct page, so there is
>> -	 * no correct value to return for them.
>> -	 */
> 
> What changed the situation so that we could return struct page for a huge
> mapping now ?

For the PUD case? Nothing changed, per se, we I just calculate the
correct struct page now, so I may return it.

> AFAICT even after this patch, PUD/P4D level huge pages can only
> be created with ioremap_page_range() not with vmalloc() which creates PMD
> sized mappings only. Hence if it's okay to dereference struct page of a huge
> mapping (not withstanding the comment here) it should be part of an earlier
> patch fixing it first for existing ioremap_page_range() huge mappings.

Possibly yes, we can consider 029c54b095995 to be a band-aid for huge
vmaps which is fixed properly by this change, in which case it could
make sense to break this into its own patch.

> 
>> -	WARN_ON_ONCE(pud_bad(*pud));
>> -	if (pud_none(*pud) || pud_bad(*pud))
>> +	pud = pud_offset(p4d, addr);
>> +	if (pud_none(*pud))
>> +		return NULL;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +	if (pud_large(*pud))
>> +		return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>> +#endif
>> +	if (WARN_ON_ONCE(pud_bad(*pud)))
>>  		return NULL;
>> +
>>  	pmd = pmd_offset(pud, addr);
>> -	WARN_ON_ONCE(pmd_bad(*pmd));
>> -	if (pmd_none(*pmd) || pmd_bad(*pmd))
>> +	if (pmd_none(*pmd))
>> +		return NULL;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +	if (pmd_large(*pmd))
>> +		return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
>> +#endif
>> +	if (WARN_ON_ONCE(pmd_bad(*pmd)))
>>  		return NULL;
> 
> At each page table level, we are checking in this order
> 
> pXX_none() --> pXX_large() --> pXX_bad()
> 
> Are not these alternative orders bit better
> 
> pXX_bad() --> pXX_none() --> pXX_large()
> 
> Or
> 
> pXX_none() --> pXX_bad() --> pXX_large()
> 
> Checking for pXX_bad() at the end does not make much sense.

Yeah the order tends to go none->bad. It's not 100% clear we can
test bad before none (at least it goes against convention). But good
point should be changed to your last sequence I think.

> 
>>  
>>  	ptep = pte_offset_map(pmd, addr);
>> @@ -502,6 +549,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  	if (pte_present(pte))
>>  		page = pte_page(pte);
>>  	pte_unmap(ptep);
>> +
> 
> Small nit. Stray line here.

I don't mind adding some lines here and there, like here. It is an
unrelated (alleged-)cleanup though.

> 
>>  	return page;
>>  }
>>  EXPORT_SYMBOL(vmalloc_to_page);
>> @@ -2185,8 +2233,9 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
>>  		return NULL;
>>  
>>  	if (flags & VM_IOREMAP)
>> -		align = 1ul << clamp_t(int, get_count_order_long(size),
>> -				       PAGE_SHIFT, IOREMAP_MAX_ORDER);
>> +		align = max(align,
>> +				1ul << clamp_t(int, get_count_order_long(size),
>> +				       PAGE_SHIFT, IOREMAP_MAX_ORDER));
>>  
>>  	area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
>>  	if (unlikely(!area))
>> @@ -2398,7 +2447,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
>>  			struct page *page = area->pages[i];
>>  
>>  			BUG_ON(!page);
>> -			__free_pages(page, 0);
>> +			__free_pages(page, area->page_shift);
> 
> area->page_shift' turns out to be effective page order. I think the name here is bit
> misleading. s/page_shift/page_order or nr_pages should be better IMHO. page_shift is
> not actual shift (1UL << area->shift to get size) nor does it sound like page 'order'.

Yeah good point.

>>  		}
>>  
>>  		kvfree(area->pages);
>> @@ -2541,14 +2590,17 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>>  				 pgprot_t prot, int node)
>>  {
>>  	struct page **pages;
>> +	unsigned long addr = (unsigned long)area->addr;
>> +	unsigned long size = get_vm_area_size(area);
>> +	unsigned int page_shift = area->page_shift;
>> +	unsigned int shift = page_shift + PAGE_SHIFT;
>>  	unsigned int nr_pages, array_size, i;
>>  	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>>  	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
>>  	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
>> -					0 :
>> -					__GFP_HIGHMEM;
>> +					0 : __GFP_HIGHMEM;
>>  
>> -	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
>> +	nr_pages = size >> shift;
>>  	array_size = (nr_pages * sizeof(struct page *));
>>  
>>  	area->nr_pages = nr_pages;
>> @@ -2569,10 +2621,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>>  	for (i = 0; i < area->nr_pages; i++) {
>>  		struct page *page;
>>  
>> -		if (node == NUMA_NO_NODE)
>> -			page = alloc_page(alloc_mask|highmem_mask);
>> -		else
>> -			page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
>> +		page = alloc_pages_node(node,
>> +				alloc_mask|highmem_mask, page_shift);
> 
> alloc_mask remains the exact same like before even for these high order pages.

Is there a problem there? I don't see.

>>  
>>  		if (unlikely(!page)) {
>>  			/* Successfully allocated i pages, free them in __vunmap() */
>> @@ -2584,8 +2634,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>>  			cond_resched();
>>  	}
>>  
>> -	if (map_vm_area(area, prot, pages))
>> +	if (vmap_hpages_range(addr, addr + size, prot, pages, page_shift) < 0)
>>  		goto fail;
>> +
>>  	return area->addr;
>>  
>>  fail:
>> @@ -2619,22 +2670,39 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>>  			pgprot_t prot, unsigned long vm_flags, int node,
>>  			const void *caller)
>>  {
>> -	struct vm_struct *area;
>> +	struct vm_struct *area = NULL;
>>  	void *addr;
>>  	unsigned long real_size = size;
>> +	unsigned long real_align = align;
>> +	unsigned int shift = PAGE_SHIFT;
>>  
>>  	size = PAGE_ALIGN(size);
>>  	if (!size || (size >> PAGE_SHIFT) > totalram_pages())
>>  		goto fail;
>>  
>> +	if (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) {
>> +		unsigned long size_per_node;
>> +
>> +		size_per_node = size;
>> +		if (node == NUMA_NO_NODE)
>> +			size_per_node /= num_online_nodes();
>> +		if (size_per_node >= PMD_SIZE)
>> +			shift = PMD_SHIFT;
> 
> There are two problems here.
> 
> 1. Should not size_per_node be aligned with PMD_SIZE to avoid wasting memory later
>    because of alignment upwards (making it worse for NUMA_NO_NODE)

I'm not sure what you mean, it's just a heuristic to check for node
interleaving, and use small pages if large can not interleave well.

> 2. What about PUD_SIZE which is not considered here at all

Yeah, not doing PUD pages at all. It would be pretty trivial to add 
after PMD is working, but would it actually get used anywhere?

> 3. We should have similar knobs like ioremap controlling different size huge mappings
> 
> static int __read_mostly ioremap_p4d_capable;
> static int __read_mostly ioremap_pud_capable;
> static int __read_mostly ioremap_pmd_capable;
> static int __read_mostly ioremap_huge_disabled;
> 
> while also giving arch a chance to weigh in through similar overrides like these.
> 
> arch_ioremap_[pud|pmd]_supported() ---> probably unifying it for vmalloc() 

I'm not sure if that makes sense. IO mappings I could see maybe having
some quirks or bugs or support issues. Cacheable mappings should be the
"base" for supporting larger pages, if the platform has trouble with
those then it should just avoid the feature I think.

Or is there a good reason to add the option? I don't mind, I just want
to avoid proliferation.

>> +	}
>> +again:
>> +	align = max(real_align, 1UL << shift);
>> +	size = ALIGN(real_size, align);
>> +
>>  	area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
>>  				vm_flags, start, end, node, gfp_mask, caller);
>>  	if (!area)
>>  		goto fail;
>>  
>> +	area->page_shift = shift - PAGE_SHIFT;
>> +
>>  	addr = __vmalloc_area_node(area, gfp_mask, prot, node);
>>  	if (!addr)
>> -		return NULL;
>> +		goto fail;
>>  
>>  	/*
>>  	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
>> @@ -2648,8 +2716,16 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>>  	return addr;
>>  
>>  fail:
>> -	warn_alloc(gfp_mask, NULL,
>> +	if (shift == PMD_SHIFT) {
>> +		shift = PAGE_SHIFT;
>> +		goto again;
>> +	}
> 
> PUD_SHIFT should be accommodated here as well while falling back to lower
> mapping sizes in case previous allocation attempt fails.
> 

This also would be part of PUD support.

Thanks,
Nick

^ permalink raw reply

* [PATCH 1/1] PPC32: fix ptrace() access to FPU registers
From: Radu Rendec @ 2019-06-10 23:27 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Oleg Nesterov, Radu Rendec, Paul Mackerras
In-Reply-To: <20190610232758.19010-1-radu.rendec@gmail.com>

This patch addresses several issues with ptrace() access to FPU
registers through PTRACE_PEEKUSR/PTRACE_POKEUSR.

Standard CPU registers are of course the size of the machine word on
both PPC32/PPC64, but FPU registers are always 64-bit. Because the
ptrace() can only transfer one `long` at a time with PTRACE_PEEKUSR and
PTRACE_POKEUSR, on PPC32 userspace must do two separate ptrace() calls
to access a whole FPU register.

This patch fixes the code that translates between ptrace() addresses and
indexes into (struct thread_fp_state).fpr, taking into account all cases
for both PPC32/PPC64. In the previous version, on PPC32, the index was
double the correct value, allowing memory to be accessed beyond the
register array. This had the following side effects:
* Access to all FPU registers (except for FPR0) was broken.
* PTRACE_POKEUSR could corrupt memory following the FPU register array.
  That means the remainder of thread_struct, which is by design the last
  field of task_struct. For large enough ptrace() addresses, memory
  access could go even outside task_struct, corrupting the adjacent
  task_struct.

Note that gdb (which is probably the most frequent user of ptrace() with
PTRACE_PEEKUSR/PTRACE_POKEUSR) seems to always read/write all FPU
registers whenever a traced task stops.

Signed-off-by: Radu Rendec <radu.rendec@gmail.com>
---
 arch/powerpc/kernel/ptrace.c | 85 ++++++++++++++++++++++--------------
 1 file changed, 52 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 684b0b315c32..060e5ed0fad9 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2991,69 +2991,88 @@ long arch_ptrace(struct task_struct *child, long request,
 	switch (request) {
 	/* read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
-		unsigned long index, tmp;
+		unsigned long index, fpidx, tmp = 0;
 
 		ret = -EIO;
 		/* convert to index and check */
+		index = addr / sizeof(long);
+		if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
+			break;
 #ifdef CONFIG_PPC32
-		index = addr >> 2;
-		if ((addr & 3) || (index > PT_FPSCR)
-		    || (child->thread.regs == NULL))
-#else
-		index = addr >> 3;
-		if ((addr & 7) || (index > PT_FPSCR))
-#endif
+		if (child->thread.regs == NULL)
 			break;
+#endif
 
 		CHECK_FULL_REGS(child->thread.regs);
 		if (index < PT_FPR0) {
 			ret = ptrace_get_reg(child, (int) index, &tmp);
 			if (ret)
 				break;
-		} else {
-			unsigned int fpidx = index - PT_FPR0;
+			goto out_peekusr;
+		}
 
-			flush_fp_to_thread(child);
-			if (fpidx < (PT_FPSCR - PT_FPR0))
-				memcpy(&tmp, &child->thread.TS_FPR(fpidx),
-				       sizeof(long));
-			else
-				tmp = child->thread.fp_state.fpscr;
+		flush_fp_to_thread(child);
+#ifdef CONFIG_PPC32
+		if (index == PT_FPSCR - 1)
+			/* corner case for PPC32; do nothing */
+			goto out_peekusr;
+#endif
+		if (index == PT_FPSCR) {
+			tmp = child->thread.fp_state.fpscr;
+			goto out_peekusr;
 		}
+		/*
+		 * FPR is always 64-bit; on PPC32, userspace does two 32-bit
+		 * accesses. Add bit2 to allow accessing the upper half on
+		 * 32-bit; on 64-bit, bit2 is always 0 (we validate it above).
+		 */
+		fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
+		memcpy(&tmp, (void *)&child->thread.TS_FPR(fpidx) + (addr & 4),
+			sizeof(long));
+out_peekusr:
 		ret = put_user(tmp, datalp);
 		break;
 	}
 
 	/* write the word at location addr in the USER area */
 	case PTRACE_POKEUSR: {
-		unsigned long index;
+		unsigned long index, fpidx;
 
 		ret = -EIO;
 		/* convert to index and check */
+		index = addr / sizeof(long);
+		if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
+			break;
 #ifdef CONFIG_PPC32
-		index = addr >> 2;
-		if ((addr & 3) || (index > PT_FPSCR)
-		    || (child->thread.regs == NULL))
-#else
-		index = addr >> 3;
-		if ((addr & 7) || (index > PT_FPSCR))
-#endif
+		if (child->thread.regs == NULL)
 			break;
+#endif
 
 		CHECK_FULL_REGS(child->thread.regs);
 		if (index < PT_FPR0) {
 			ret = ptrace_put_reg(child, index, data);
-		} else {
-			unsigned int fpidx = index - PT_FPR0;
+			break;
+		}
 
-			flush_fp_to_thread(child);
-			if (fpidx < (PT_FPSCR - PT_FPR0))
-				memcpy(&child->thread.TS_FPR(fpidx), &data,
-				       sizeof(long));
-			else
-				child->thread.fp_state.fpscr = data;
-			ret = 0;
+		ret = 0;
+		flush_fp_to_thread(child);
+#ifdef CONFIG_PPC32
+		if (index == PT_FPSCR - 1)
+			/* corner case for PPC32; do nothing */
+			break;
+#endif
+		if (index == PT_FPSCR) {
+			child->thread.fp_state.fpscr = data;
+			break;
 		}
+		/*
+		 * FPR is always 64-bit; on PPC32, userspace does two 32-bit
+		 * accesses. Add bit2 to allow accessing the upper half on
+		 * 32-bit; on 64-bit, bit2 is always 0 (we validate it above).
+		 */
+		fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
+		memcpy((void *)&child->thread.TS_FPR(fpidx) + (addr & 4),
+			&data, sizeof(long));
 		break;
 	}
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH 0/1] PPC32: fix ptrace() access to FPU registers
From: Radu Rendec @ 2019-06-10 23:27 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Oleg Nesterov, Radu Rendec, Paul Mackerras

Hi Everyone,

I'm following up on the ptrace() problem that I reported a few days ago.
I believe my version of the code handles all cases correctly. While the
problem essentially boils down to dividing the fpidx by 2 on PPC32, it
becomes tricky when the same code must work correctly on both PPC32 and
PPC64.

One other thing that I believe was handled incorrectly in the previous
version is the unused half of fpscr on PPC32. Note that while PT_FPSCR
is defined as (PT_FPR0 + 2*32 + 1), making only the upper half visible,
PT_FPR0 + 2*32 still corresponds to a possible address that userspace
can pass. In that case, comparing fpidx to (PT_FPSCR - PT_FPR0) would
cause an invalid access to the FPU registers array.

I tested the patch on 4.9.179, but that part of the code is identical in
recent kernels so it should work just the same.

I wrote a simple test program than can be used to quickly test (on an
x86_64 host) that all cases are handled correctly for both PPC32/PPC64.
The code is included below.

I also tested with gdbserver (test patch included below) and verified
that it generates two ptrace() calls for each FPU register, with
addresses between 0xc0 and 0x1bc.

8<--------------- Makefile ---------------------------------------------
.PHONY: all clean

all: ptrace-fpregs-32 ptrace-fpregs-64

ptrace-fpregs-32: ptrace-fpregs.c
	$(CC) -o ptrace-fpregs-32 -Wall -O2 -m32 ptrace-fpregs.c

ptrace-fpregs-64: ptrace-fpregs.c
	$(CC) -o ptrace-fpregs-64 -Wall -O2 ptrace-fpregs.c

clean:
	rm -f ptrace-fpregs-32 ptrace-fpregs-64
8<--------------- ptrace-fpregs.c --------------------------------------
#include <stdio.h>
#include <errno.h>

#define PT_FPR0	48

#ifndef __x86_64

#define PT_FPR31 (PT_FPR0 + 2*31)
#define PT_FPSCR (PT_FPR0 + 2*32 + 1)

#else

#define PT_FPSCR (PT_FPR0 + 32)

#endif

int test_access(unsigned long addr)
{
	int ret;

	do {
		unsigned long index, fpidx;

		ret = -EIO;

		/* convert to index and check */
		index = addr / sizeof(long);
		if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
			break;

		if (index < PT_FPR0) {
			ret = printf("ptrace_put_reg(%lu)", index);
			break;
		}

		ret = 0;
#ifndef __x86_64
		if (index == PT_FPSCR - 1) {
			/* corner case for PPC32; do nothing */
			printf("corner_case");
			break;
		}
#endif
		if (index == PT_FPSCR) {
			printf("fpscr");
			break;
		}

		/*
		 * FPR is always 64-bit; on PPC32, userspace does two 32-bit
		 * accesses. Add bit2 to allow accessing the upper half on
		 * 32-bit; on 64-bit, bit2 is always 0 (we validate it above).
		 */
		fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
		printf("TS_FPR[%lu] + %lu", fpidx, addr & 4);
		break;
	} while (0);

	return ret;
}

int main(void)
{
	unsigned long addr;
	int rc;

	for (addr = 0; addr < PT_FPSCR * sizeof(long) + 16; addr++) {
		printf("0x%04lx: ", addr);
		rc = test_access(addr);
		if (rc < 0)
			printf("!err!");
		printf("\t<%d>\n", rc);
	}

	return 0;
}
8<--------------- gdb.patch --------------------------------------------
--- gdb/gdbserver/linux-low.c.orig	2019-06-10 11:45:53.810882669 -0400
+++ gdb/gdbserver/linux-low.c	2019-06-10 11:49:32.272929766 -0400
@@ -4262,6 +4262,8 @@ store_register (struct regcache *regcach
   pid = lwpid_of (get_thread_lwp (current_inferior));
   for (i = 0; i < size; i += sizeof (PTRACE_XFER_TYPE))
     {
+      printf("writing register #%d offset %d at address %#x\n",
+             regno, i, (unsigned int)regaddr);
       errno = 0;
       ptrace (PTRACE_POKEUSER, pid,
 	    /* Coerce to a uintptr_t first to avoid potential gcc warning
8<----------------------------------------------------------------------

Radu Rendec (1):
  PPC32: fix ptrace() access to FPU registers

 arch/powerpc/kernel/ptrace.c | 85 ++++++++++++++++++++++--------------
 1 file changed, 52 insertions(+), 33 deletions(-)

-- 
2.20.1


^ permalink raw reply

* Re: [PATCH v3 14/33] docs: kbuild: convert docs to ReST and rename to *.rst
From: Federico Vaga @ 2019-06-10 16:48 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-wireless, linux-fbdev, Emmanuel Grumbach, Stanislaw Gruszka,
	Vignesh Raghavendra, Linux Doc Mailing List, bridge,
	Palmer Dabbelt, alsa-devel, dri-devel, Ofer Levi, Masahiro Yamada,
	Harry Wei, Paul Mackerras, Miquel Raynal, linux-kbuild,
	linux-riscv, Vincent Chen, Aurelien Jacquiot, Jonas Bonn,
	Alex Shi, linux-c6x-dev, linux-scsi, Jonathan Corbet,
	Bartlomiej Zolnierkiewicz, netdev, Marek Vasut, coreteam,
	Mark Salter, Al exey Kuznetsov, linux-snps-arc,
	James E.J. Bottomley, Pablo Neira Ayuso, devel, Albert Ou,
	Johannes Berg, Intel Linux Wireless, Nikolay Aleksandrov,
	David Woodhouse, Jozsef Kadlecsik, Mauro Carvalho Chehab,
	openrisc, Greg Kroah-Hartman, Greentime Hu, linux-mtd,
	Takashi Iwai, Jaroslav Kysela, Stafford Horne,
	Stefan Kristiansson, Kalle Valo, Teddy Wang, Jon Maloy,
	Michal Simek, Michal Marek, netfilter-devel, Martin K. Petersen,
	g, Hideaki YOSHIFUJI, Vineet Gupta, linux-usb, Florian Westphal,
	linux-kernel, tipc-discussion, Sudip Mukherjee,
	Miguel Ojeda Sandonis, Roopa Prabhu, Richard Weinberger, Ying Xue,
	Luca Coelho, Brian Norris, linuxppc-dev, David S. Miller
In-Reply-To: <3d40d111d0512d785b6a67573772f532f88d2359.1560045490.git.mchehab+samsung@kernel.org>

In data Sunday, June 9, 2019 4:27:04 AM CEST, Mauro Carvalho Chehab ha 
scritto:
> The kbuild documentation clearly shows that the documents
> there are written at different times: some use markdown,
> some use their own peculiar logic to split sections.
> 
> Convert everything to ReST without affecting too much
> the author's style and avoiding adding uneeded markups.
> 
> The conversion is actually:
>   - add blank lines and identation in order to identify paragraphs;
>   - fix tables markups;
>   - add some lists markups;
>   - mark literal blocks;
>   - adjust title markups.
> 
> At its new index.rst, let's add a :orphan: while this is not linked to
> the main index.rst file, in order to avoid build warnings.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
>  Documentation/admin-guide/README.rst          |   2 +-
>  ...eaders_install.txt => headers_install.rst} |   5 +-
>  Documentation/kbuild/index.rst                |  27 +
>  Documentation/kbuild/issues.rst               |  11 +
>  .../kbuild/{kbuild.txt => kbuild.rst}         | 119 ++--
>  ...nfig-language.txt => kconfig-language.rst} | 232 ++++----
>  ...anguage.txt => kconfig-macro-language.rst} |  37 +-
>  .../kbuild/{kconfig.txt => kconfig.rst}       | 136 +++--
>  .../kbuild/{makefiles.txt => makefiles.rst}   | 530 +++++++++++-------
>  .../kbuild/{modules.txt => modules.rst}       | 168 +++---
>  Documentation/kernel-hacking/hacking.rst      |   4 +-
>  Documentation/process/coding-style.rst        |   2 +-
>  Documentation/process/submit-checklist.rst    |   2 +-
>  .../it_IT/kernel-hacking/hacking.rst          |   4 +-
>  .../it_IT/process/coding-style.rst            |   2 +-
>  .../it_IT/process/submit-checklist.rst        |   2 +-

Limited to translations/it_IT

Acked-by: Federico Vaga <federico.vaga@vaga.pv.it>

>  .../zh_CN/process/coding-style.rst            |   2 +-
>  .../zh_CN/process/submit-checklist.rst        |   2 +-
>  Kconfig                                       |   2 +-
>  arch/arc/plat-eznps/Kconfig                   |   2 +-
>  arch/c6x/Kconfig                              |   2 +-
>  arch/microblaze/Kconfig.debug                 |   2 +-
>  arch/microblaze/Kconfig.platform              |   2 +-
>  arch/nds32/Kconfig                            |   2 +-
>  arch/openrisc/Kconfig                         |   2 +-
>  arch/powerpc/sysdev/Kconfig                   |   2 +-
>  arch/riscv/Kconfig                            |   2 +-
>  drivers/auxdisplay/Kconfig                    |   2 +-
>  drivers/firmware/Kconfig                      |   2 +-
>  drivers/mtd/devices/Kconfig                   |   2 +-
>  drivers/net/ethernet/smsc/Kconfig             |   6 +-
>  drivers/net/wireless/intel/iwlegacy/Kconfig   |   4 +-
>  drivers/net/wireless/intel/iwlwifi/Kconfig    |   2 +-
>  drivers/parport/Kconfig                       |   2 +-
>  drivers/scsi/Kconfig                          |   4 +-
>  drivers/staging/sm750fb/Kconfig               |   2 +-
>  drivers/usb/misc/Kconfig                      |   4 +-
>  drivers/video/fbdev/Kconfig                   |  14 +-
>  net/bridge/netfilter/Kconfig                  |   2 +-
>  net/ipv4/netfilter/Kconfig                    |   2 +-
>  net/ipv6/netfilter/Kconfig                    |   2 +-
>  net/netfilter/Kconfig                         |  16 +-
>  net/tipc/Kconfig                              |   2 +-
>  scripts/Kbuild.include                        |   4 +-
>  scripts/Makefile.host                         |   2 +-
>  scripts/kconfig/symbol.c                      |   2 +-
>  .../tests/err_recursive_dep/expected_stderr   |  14 +-
>  sound/oss/dmasound/Kconfig                    |   6 +-
>  48 files changed, 840 insertions(+), 561 deletions(-)
>  rename Documentation/kbuild/{headers_install.txt => headers_install.rst}
> (96%) create mode 100644 Documentation/kbuild/index.rst
>  create mode 100644 Documentation/kbuild/issues.rst
>  rename Documentation/kbuild/{kbuild.txt => kbuild.rst} (72%)
>  rename Documentation/kbuild/{kconfig-language.txt => kconfig-language.rst}
> (85%) rename Documentation/kbuild/{kconfig-macro-language.txt =>
> kconfig-macro-language.rst} (94%) rename Documentation/kbuild/{kconfig.txt
> => kconfig.rst} (80%)
>  rename Documentation/kbuild/{makefiles.txt => makefiles.rst} (83%)
>  rename Documentation/kbuild/{modules.txt => modules.rst} (84%)




^ permalink raw reply

* [PATCH v3 2/3] powerpc/powernv: detect the secure boot mode of the system
From: Nayna Jain @ 2019-06-10 20:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-efi, linux-integrity, linux-kernel
  Cc: Ard Biesheuvel, Nayna Jain, Claudio Carvalho, Mimi Zohar,
	Matthew Garret, Paul Mackerras, Jeremy Kerr
In-Reply-To: <1560198837-18857-1-git-send-email-nayna@linux.ibm.com>

PowerNV secure boot defines different IMA policies based on the secure
boot state of the system.

This patch defines a function to detect the secure boot state of the
system.

Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
---
 arch/powerpc/include/asm/secboot.h       | 21 ++++++++
 arch/powerpc/platforms/powernv/Makefile  |  3 +-
 arch/powerpc/platforms/powernv/secboot.c | 61 ++++++++++++++++++++++++
 3 files changed, 84 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/secboot.h
 create mode 100644 arch/powerpc/platforms/powernv/secboot.c

diff --git a/arch/powerpc/include/asm/secboot.h b/arch/powerpc/include/asm/secboot.h
new file mode 100644
index 000000000000..1904fb4a3352
--- /dev/null
+++ b/arch/powerpc/include/asm/secboot.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PowerPC secure boot definitions
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain <nayna@linux.ibm.com>
+ *
+ */
+#ifndef POWERPC_SECBOOT_H
+#define POWERPC_SECBOOT_H
+
+#if defined(CONFIG_OPAL_SECVAR)
+extern bool get_powerpc_sb_mode(void);
+#else
+static inline bool get_powerpc_sb_mode(void)
+{
+	return false;
+}
+#endif
+
+#endif
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 6651c742e530..74705a3fc474 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -4,6 +4,7 @@ obj-y			+= idle.o opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y			+= rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
 obj-y			+= opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
 obj-y			+= opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o
+obj-y			+= secboot.o
 
 obj-$(CONFIG_SMP)	+= smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)	+= pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
@@ -16,4 +17,4 @@ obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
 obj-$(CONFIG_PPC_MEMTRACE)	+= memtrace.o
 obj-$(CONFIG_PPC_VAS)	+= vas.o vas-window.o vas-debug.o
 obj-$(CONFIG_OCXL_BASE)	+= ocxl.o
-obj-$(CONFIG_OPAL_SECVAR) += opal-secvar.o
+obj-$(CONFIG_OPAL_SECVAR) += opal-secvar.o secboot.o
diff --git a/arch/powerpc/platforms/powernv/secboot.c b/arch/powerpc/platforms/powernv/secboot.c
new file mode 100644
index 000000000000..9199e520ebed
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/secboot.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain <nayna@linux.ibm.com>
+ *
+ * secboot.c
+ *      - util function to get powerpc secboot state
+ */
+#include <linux/uuid.h>
+#include <asm/opal.h>
+#include <asm/secboot.h>
+#include <asm/opal-secvar.h>
+
+bool get_powerpc_sb_mode(void)
+{
+	u8 secure_boot_name[] = "SecureBoot";
+	u8 setup_mode_name[] = "SetupMode";
+	u8 secboot, setupmode;
+	unsigned long size = sizeof(secboot);
+	int status;
+	unsigned long version;
+
+	status = opal_variable_version(&version);
+	if ((status != OPAL_SUCCESS) || (version != BACKEND_TC_COMPAT_V1)) {
+		pr_info("secboot: error retrieving compatible backend\n");
+		return false;
+	}
+
+	status = opal_get_variable(secure_boot_name, sizeof(secure_boot_name),
+				   NULL, NULL, &secboot, &size);
+
+	/*
+	 * For now assume all failures reading the SecureBoot variable implies
+	 * secure boot is not enabled. Later differentiate failure types.
+	 */
+	if (status != OPAL_SUCCESS) {
+		secboot = 0;
+		setupmode = 0;
+		goto out;
+	}
+
+	size = sizeof(setupmode);
+	status = opal_get_variable(setup_mode_name, sizeof(setup_mode_name),
+				   NULL, NULL,  &setupmode, &size);
+
+	/*
+	 * Failure to read the SetupMode variable does not prevent
+	 * secure boot mode
+	 */
+	if (status != OPAL_SUCCESS)
+		setupmode = 0;
+
+out:
+	if ((secboot == 0) || (setupmode == 1)) {
+		pr_info("secboot: secureboot mode disabled\n");
+		return false;
+	}
+
+	pr_info("secboot: secureboot mode enabled\n");
+	return true;
+}
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 3/3] powerpc: Add support to initialize ima policy rules
From: Nayna Jain @ 2019-06-10 20:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-efi, linux-integrity, linux-kernel
  Cc: Ard Biesheuvel, Nayna Jain, Claudio Carvalho, Mimi Zohar,
	Matthew Garret, Paul Mackerras, Jeremy Kerr
In-Reply-To: <1560198837-18857-1-git-send-email-nayna@linux.ibm.com>

PowerNV secure boot relies on the kernel IMA security subsystem to
perform the OS kernel image signature verification. Since each secure
boot mode has different IMA policy requirements, dynamic definition of
the policy rules based on the runtime secure boot mode of the system is
required. On systems that support secure boot, but have it disabled,
only measurement policy rules of the kernel image and modules are
defined.

This patch defines the arch-specific implementation to retrieve the
secure boot mode of the system and accordingly configures the IMA policy
rules.

This patch provides arch-specific IMA policies if PPC_SECURE_BOOT
config is enabled.

Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
---
 arch/powerpc/Kconfig           | 14 +++++++++
 arch/powerpc/kernel/Makefile   |  1 +
 arch/powerpc/kernel/ima_arch.c | 54 ++++++++++++++++++++++++++++++++++
 include/linux/ima.h            |  3 +-
 4 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/kernel/ima_arch.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8c1c636308c8..9de77bb14f54 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -902,6 +902,20 @@ config PPC_MEM_KEYS
 
 	  If unsure, say y.
 
+config PPC_SECURE_BOOT
+	prompt "Enable PowerPC Secure Boot"
+	bool
+	default n
+	depends on PPC64
+	depends on OPAL_SECVAR
+	depends on IMA
+	depends on IMA_ARCH_POLICY
+	help
+	  Linux on POWER with firmware secure boot enabled needs to define
+	  security policies to extend secure boot to the OS.This config
+	  allows user to enable OS Secure Boot on PowerPC systems that
+	  have firmware secure boot support.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 0ea6c4aa3a20..75c929b41341 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -131,6 +131,7 @@ ifdef CONFIG_IMA
 obj-y				+= ima_kexec.o
 endif
 endif
+obj-$(CONFIG_PPC_SECURE_BOOT)	+= ima_arch.o
 
 obj-$(CONFIG_AUDIT)		+= audit.o
 obj64-$(CONFIG_AUDIT)		+= compat_audit.o
diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
new file mode 100644
index 000000000000..1767bf6e6550
--- /dev/null
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain <nayna@linux.ibm.com>
+ *
+ * ima_arch.c
+ *      - initialize ima policies for PowerPC Secure Boot
+ */
+
+#include <linux/ima.h>
+#include <asm/secboot.h>
+
+bool arch_ima_get_secureboot(void)
+{
+	bool sb_mode;
+
+	sb_mode = get_powerpc_sb_mode();
+	if (sb_mode)
+		return true;
+	else
+		return false;
+}
+
+/*
+ * File signature verification is not needed, include only measurements
+ */
+static const char *const default_arch_rules[] = {
+	"measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+	"measure func=MODULE_CHECK template=ima-modsig",
+	NULL
+};
+
+/* Both file signature verification and measurements are needed */
+static const char *const sb_arch_rules[] = {
+	"measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+	"measure func=MODULE_CHECK template=ima-modsig",
+	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig template=ima-modsig",
+#if !IS_ENABLED(CONFIG_MODULE_SIG)
+	"appraise func=MODULE_CHECK appraise_type=imasig|modsig template=ima-modsig",
+#endif
+	NULL
+};
+
+/*
+ * On PowerPC, file measurements are to be added to the IMA measurement list
+ * irrespective of the secure boot state of the system. Signature verification
+ * is conditionally enabled based on the secure boot state.
+ */
+const char *const *arch_get_ima_policy(void)
+{
+	if (IS_ENABLED(CONFIG_IMA_ARCH_POLICY) && arch_ima_get_secureboot())
+		return sb_arch_rules;
+	return default_arch_rules;
+}
diff --git a/include/linux/ima.h b/include/linux/ima.h
index fd9f7cf4cdf5..a01df076ecae 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -31,7 +31,8 @@ extern void ima_post_path_mknod(struct dentry *dentry);
 extern void ima_add_kexec_buffer(struct kimage *image);
 #endif
 
-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390)
+#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
+	|| defined(CONFIG_PPC_SECURE_BOOT)
 extern bool arch_ima_get_secureboot(void);
 extern const char * const *arch_get_ima_policy(void);
 #else
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3 0/3] powerpc: Enabling IMA arch specific secure boot policies
From: Nayna Jain @ 2019-06-10 20:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-efi, linux-integrity, linux-kernel
  Cc: Ard Biesheuvel, Nayna Jain, Claudio Carvalho, Mimi Zohar,
	Matthew Garret, Paul Mackerras, Jeremy Kerr

This patch set, previously named "powerpc: Enabling secure boot on powernv
systems - Part 1", is part of a series that implements secure boot on
PowerNV systems.

In order to verify the OS kernel on PowerNV, secure boot requires X.509
certificates trusted by the platform, the secure boot modes, and several
other pieces of information. These are stored in secure variables
controlled by OPAL, also known as OPAL secure variables.

The IMA architecture specific policy support on POWER is dependent on OPAL
runtime services to access secure variables. OPAL APIs in skiboot are
modified to define generic interface compatible to any backend. This
patchset is consequently updated to be compatible with new OPAL API
interface. This has cleaned up any EFIsms in the arch specific code.
Further, the ima arch specific policies are updated to be able to support
appended signatures. They also now use per policy template.

Exposing the OPAL secure variables to userspace will be posted as a
separate patch set, allowing the IMA architecture specific policy on POWER
to be upstreamed independently.

This patch set adds the following features:

1. Add support for OPAL Runtime API to access secure variables controlled
by OPAL.
2. Define IMA arch-specific policies based on the secure boot state and
mode of the system. On secure boot enabled PowerNV systems, the OS kernel
signature will be verified by IMA appraisal.

Pre-requisites for this patchset are:
1. OPAL APIs in Skiboot[1]
2. Appended signature support in IMA [2]
3. Per policy template support in IMA [3]

[1] https://patchwork.ozlabs.org/project/skiboot/list/?series=112868 
[2] https://patchwork.ozlabs.org/cover/1087361/. Updated version will be
posted soon
[3] Repo: https://kernel.googlesource.com/pub/scm/linux/kernel/git/zohar/linux-integrity
Branch: next-queued-testing. Commit: f241bb1f42aa95

----------------------------------------------------------------------------------

Original Cover Letter:

This patch set is part of a series that implements secure boot on PowerNV
systems.

In order to verify the OS kernel on PowerNV, secure boot requires X.509
certificates trusted by the platform, the secure boot modes, and several
other pieces of information. These are stored in secure variables
controlled by OPAL, also known as OPAL secure variables.

The IMA architecture specific policy support on Power is dependent on OPAL
runtime services to access secure variables. Instead of directly accessing
the OPAL runtime services, version 3 of this patch set relied upon the
EFI hooks. This version drops that dependency and calls the OPAL runtime
services directly. Skiboot OPAL APIs are due to be posted soon.

Exposing the OPAL secure variables to userspace will be posted as a
separate patch set, allowing the IMA architecture specific policy on Power
to be upstreamed independently.

This patch set adds the following features:

1. Add support for OPAL Runtime API to access secure variables controlled
by OPAL.
2. Define IMA arch-specific policies based on the secure boot state and
mode of the system. On secure boot enabled powernv systems, the OS kernel
signature will be verified by IMA appraisal.

[1] https://patchwork.kernel.org/cover/10882149/

Changelog:

v3:
* OPAL APIs in Patch 1 are updated to provide generic interface based on
key/keylen. This patchset updates kernel OPAL APIs to be compatible with
generic interface.
* Patch 2 is cleaned up to use new OPAL APIs. 
* Since OPAL can support different types of backend which can vary in the
variable interpretation, the Patch 2 is updated to add a check for the
backend version
* OPAL API now expects consumer to first check the supported backend version
before calling other secvar OPAL APIs. This check is now added in patch 2.
* IMA policies in Patch 3 is updated to specify appended signature and
per policy template.
* The patches now are free of any EFIisms.

v2:

* Removed Patch 1: powerpc/include: Override unneeded early ioremap
functions
* Updated Subject line and patch description of the Patch 1 of this series
* Removed dependency of OPAL_SECVAR on EFI, CPU_BIG_ENDIAN and UCS2_STRING
* Changed OPAL APIs from static to non-static. Added opal-secvar.h for the
same
* Removed EFI hooks from opal_secvar.c
* Removed opal_secvar_get_next(), opal_secvar_enqueue() and
opal_query_variable_info() function
* get_powerpc_sb_mode() in secboot.c now directly calls OPAL Runtime API
rather than via EFI hooks.
* Fixed log messages in get_powerpc_sb_mode() function.
* Added dependency for PPC_SECURE_BOOT on configs PPC64 and OPAL_SECVAR
* Replaced obj-$(CONFIG_IMA) with obj-$(CONFIG_PPC_SECURE_BOOT) in
arch/powerpc/kernel/Makefile

Claudio Carvalho (1):
  powerpc/powernv: Add OPAL API interface to get secureboot state

Nayna Jain (2):
  powerpc/powernv: detect the secure boot mode of the system
  powerpc: Add support to initialize ima policy rules

 arch/powerpc/Kconfig                         | 14 ++++
 arch/powerpc/include/asm/opal-api.h          |  4 +-
 arch/powerpc/include/asm/opal-secvar.h       | 23 ++++++
 arch/powerpc/include/asm/opal.h              |  6 ++
 arch/powerpc/include/asm/secboot.h           | 21 +++++
 arch/powerpc/kernel/Makefile                 |  1 +
 arch/powerpc/kernel/ima_arch.c               | 54 +++++++++++++
 arch/powerpc/platforms/powernv/Kconfig       |  6 ++
 arch/powerpc/platforms/powernv/Makefile      |  2 +
 arch/powerpc/platforms/powernv/opal-call.c   |  2 +
 arch/powerpc/platforms/powernv/opal-secvar.c | 85 ++++++++++++++++++++
 arch/powerpc/platforms/powernv/secboot.c     | 61 ++++++++++++++
 include/linux/ima.h                          |  3 +-
 13 files changed, 280 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/opal-secvar.h
 create mode 100644 arch/powerpc/include/asm/secboot.h
 create mode 100644 arch/powerpc/kernel/ima_arch.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c
 create mode 100644 arch/powerpc/platforms/powernv/secboot.c

-- 
2.20.1


^ permalink raw reply

* [PATCH v3 1/3] powerpc/powernv: Add OPAL API interface to get secureboot state
From: Nayna Jain @ 2019-06-10 20:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-efi, linux-integrity, linux-kernel
  Cc: Ard Biesheuvel, Nayna Jain, Claudio Carvalho, Mimi Zohar,
	Matthew Garret, Paul Mackerras, Jeremy Kerr
In-Reply-To: <1560198837-18857-1-git-send-email-nayna@linux.ibm.com>

From: Claudio Carvalho <cclaudio@linux.ibm.com>

The X.509 certificates trusted by the platform and other information
required to secure boot the OS kernel are wrapped in secure variables,
which are controlled by OPAL.

This patch adds support to read OPAL secure variables through
OPAL_SECVAR_GET call. It returns the metadata and data for a given secure
variable based on the unique key.

Since OPAL can support different types of backend which can vary in the
variable interpretation, a new OPAL API call named OPAL_SECVAR_BACKEND, is
added to retrieve the supported backend version. This helps the consumer
to know how to interpret the variable.

This support can be enabled using CONFIG_OPAL_SECVAR

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
---
This patch depends on a new OPAL call that is being added to skiboot.
The patch set that implements the new call has been posted to
https://patchwork.ozlabs.org/project/skiboot/list/?series=112868

 arch/powerpc/include/asm/opal-api.h          |  4 +-
 arch/powerpc/include/asm/opal-secvar.h       | 23 ++++++
 arch/powerpc/include/asm/opal.h              |  6 ++
 arch/powerpc/platforms/powernv/Kconfig       |  6 ++
 arch/powerpc/platforms/powernv/Makefile      |  1 +
 arch/powerpc/platforms/powernv/opal-call.c   |  2 +
 arch/powerpc/platforms/powernv/opal-secvar.c | 85 ++++++++++++++++++++
 7 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/opal-secvar.h
 create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index e1577cfa7186..a505e669b4b6 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -212,7 +212,9 @@
 #define OPAL_HANDLE_HMI2			166
 #define	OPAL_NX_COPROC_INIT			167
 #define OPAL_XIVE_GET_VP_STATE			170
-#define OPAL_LAST				170
+#define OPAL_SECVAR_GET                         173
+#define OPAL_SECVAR_BACKEND                     177
+#define OPAL_LAST				177
 
 #define QUIESCE_HOLD			1 /* Spin all calls at entry */
 #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal-secvar.h b/arch/powerpc/include/asm/opal-secvar.h
new file mode 100644
index 000000000000..b677171a0368
--- /dev/null
+++ b/arch/powerpc/include/asm/opal-secvar.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PowerNV definitions for secure variables OPAL API.
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Claudio Carvalho <cclaudio@linux.ibm.com>
+ *
+ */
+#ifndef OPAL_SECVAR_H
+#define OPAL_SECVAR_H
+
+enum {
+	BACKEND_NONE = 0,
+	BACKEND_TC_COMPAT_V1,
+};
+
+extern int opal_get_variable(u8 *key, unsigned long ksize,
+			     u8 *metadata, unsigned long *mdsize,
+			     u8 *data, unsigned long *dsize);
+
+extern int opal_variable_version(unsigned long *backend);
+
+#endif
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 4cc37e708bc7..57d2c2356eda 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -394,6 +394,12 @@ void opal_powercap_init(void);
 void opal_psr_init(void);
 void opal_sensor_groups_init(void);
 
+extern int opal_secvar_get(uint64_t k_key, uint64_t k_key_len,
+			   uint64_t k_metadata, uint64_t k_metadata_size,
+			   uint64_t k_data, uint64_t k_data_size);
+
+extern int opal_secvar_backend(uint64_t k_backend);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_OPAL_H */
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 850eee860cf2..65b060539b5c 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -47,3 +47,9 @@ config PPC_VAS
 	  VAS adapters are found in POWER9 based systems.
 
 	  If unsure, say N.
+
+config OPAL_SECVAR
+	bool "OPAL Secure Variables"
+	depends on PPC_POWERNV
+	help
+	  This enables the kernel to access OPAL secure variables.
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index da2e99efbd04..6651c742e530 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
 obj-$(CONFIG_PPC_MEMTRACE)	+= memtrace.o
 obj-$(CONFIG_PPC_VAS)	+= vas.o vas-window.o vas-debug.o
 obj-$(CONFIG_OCXL_BASE)	+= ocxl.o
+obj-$(CONFIG_OPAL_SECVAR) += opal-secvar.o
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index 36c8fa3647a2..0445980f294f 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -288,3 +288,5 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,			OPAL_SENSOR_READ_U64);
 OPAL_CALL(opal_sensor_group_enable,		OPAL_SENSOR_GROUP_ENABLE);
 OPAL_CALL(opal_nx_coproc_init,			OPAL_NX_COPROC_INIT);
+OPAL_CALL(opal_secvar_get,                      OPAL_SECVAR_GET);
+OPAL_CALL(opal_secvar_backend,                  OPAL_SECVAR_BACKEND);
diff --git a/arch/powerpc/platforms/powernv/opal-secvar.c b/arch/powerpc/platforms/powernv/opal-secvar.c
new file mode 100644
index 000000000000..dba441dd5af1
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-secvar.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PowerNV code for secure variables
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Claudio Carvalho <cclaudio@linux.ibm.com>
+ *
+ */
+
+/*
+ * The opal wrappers in this file treat the @name, @vendor, and @data
+ * parameters as little endian blobs.
+ * @name is a ucs2 string
+ * @vendor is the vendor GUID. It is converted to LE in the kernel
+ * @data variable data, which layout may be different for each variable
+ */
+
+#define pr_fmt(fmt) "secvar: "fmt
+
+#include <linux/types.h>
+#include <asm/opal.h>
+#include <asm/opal-secvar.h>
+
+static bool is_opal_secvar_supported(void)
+{
+	static bool opal_secvar_supported;
+	static bool initialized;
+
+	if (initialized)
+		return opal_secvar_supported;
+
+	if (!opal_check_token(OPAL_SECVAR_GET)
+	    || !opal_check_token(OPAL_SECVAR_BACKEND)) {
+		pr_err("OPAL doesn't support secure variables\n");
+		opal_secvar_supported = false;
+	} else {
+		opal_secvar_supported = true;
+	}
+
+	initialized = true;
+
+	return opal_secvar_supported;
+}
+
+int opal_get_variable(u8 *key, unsigned long ksize, u8 *metadata,
+		      unsigned long *mdsize, u8 *data, unsigned long *dsize)
+{
+	int rc;
+
+	if (!is_opal_secvar_supported())
+		return OPAL_UNSUPPORTED;
+
+	if (mdsize)
+		*mdsize = cpu_to_be64(*mdsize);
+	if (dsize)
+		*dsize = cpu_to_be64(*dsize);
+
+	rc = opal_secvar_get(__pa(key), ksize, __pa(metadata), __pa(mdsize),
+			     __pa(data), __pa(dsize));
+
+	if (mdsize)
+		*mdsize = be64_to_cpu(*mdsize);
+	if (dsize)
+		*dsize = be64_to_cpu(*dsize);
+
+	return rc;
+}
+
+int opal_variable_version(unsigned long *backend)
+{
+	int rc;
+
+	if (!is_opal_secvar_supported())
+		return OPAL_UNSUPPORTED;
+
+	if (backend)
+		*backend = cpu_to_be64(*backend);
+
+	rc = opal_secvar_backend(__pa(backend));
+
+	if (backend)
+		*backend = be64_to_cpu(*backend);
+
+	return rc;
+}
-- 
2.20.1


^ permalink raw reply related

* Re: Latest Git kernel: Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup()
From: Christophe Leroy @ 2019-06-10 20:09 UTC (permalink / raw)
  To: Michael Ellerman, Christian Zigotzky; +Cc: linuxppc-dev, Christian Zigotzky
In-Reply-To: <461FD07C-F683-4CDB-9894-5A3E8D2E0411@xenosoft.de>

Michael ?

Christian Zigotzky <chzigotzky@xenosoft.de> a écrit :

> Hello Christophe,
>
> Could you please add this patch to the GIT kernel because the issue  
> still exists.
>
> Thanks,
> Christian
>
> On 15. May 2019, at 12:15, Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>
> Hi,
>
> Le 15/05/2019 à 12:09, Christian Zigotzky a écrit :
> Hi All,
> I got the following error messages with the latest Git kernel today:
> GEN     .version
>   CHK     include/generated/compile.h
>   LD      vmlinux.o
>   MODPOST vmlinux.o
> WARNING: vmlinux.o(.text+0x302a): Section mismatch in reference from  
> the variable start_here_multiplatform to the function  
> .init.text:.early_setup()
> The function start_here_multiplatform() references
> the function __init .early_setup().
> This is often because start_here_multiplatform lacks a __init
> annotation or the annotation of .early_setup is wrong.
>   MODINFO modules.builtin.modinfo
>   KSYM    .tmp_kallsyms1.o
>   KSYM    .tmp_kallsyms2.o
>   LD      vmlinux
>   SORTEX  vmlinux
>   SYSMAP  System.map
>   CHKHEAD vmlinux
> What does it mean?
>
> ————————-
>
> I proposed a patch for it at https://patchwork.ozlabs.org/patch/1097845/
>
> Christophe



^ permalink raw reply

* Re: [PATCH] mm/nvdimm: Fix endian conversion issues 
From: Verma, Vishal L @ 2019-06-10 18:00 UTC (permalink / raw)
  To: aneesh.kumar@linux.ibm.com, Williams, Dan J
  Cc: linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
	linux-nvdimm@lists.01.org
In-Reply-To: <20190607064732.30384-1-aneesh.kumar@linux.ibm.com>

On Fri, 2019-06-07 at 12:17 +0530, Aneesh Kumar K.V wrote:
> nd_label->dpa issue was observed when trying to enable the namespace created
> with little-endian kernel on a big-endian kernel. That made me run
> `sparse` on the rest of the code and other changes are the result of that.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/nvdimm/btt.c            | 8 ++++----
>  drivers/nvdimm/namespace_devs.c | 7 ++++---
>  2 files changed, 8 insertions(+), 7 deletions(-)

The two BTT fixes seem like they may apply to stable as well, the
problematic code was introduced in relatively recent reworks/fixes.
Perhaps -

Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing")
Fixes: 9dedc73a4658 ("libnvdimm/btt: Fix LBA masking during 'free list' population")

Other than that, these look good to me.
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index 4671776f5623..4ac0f5dde467 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -400,9 +400,9 @@ static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
>  	arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
>  	if (++(arena->freelist[lane].seq) == 4)
>  		arena->freelist[lane].seq = 1;
> -	if (ent_e_flag(ent->old_map))
> +	if (ent_e_flag(le32_to_cpu(ent->old_map)))
>  		arena->freelist[lane].has_err = 1;
> -	arena->freelist[lane].block = le32_to_cpu(ent_lba(ent->old_map));
> +	arena->freelist[lane].block = ent_lba(le32_to_cpu(ent->old_map));
>  
>  	return ret;
>  }
> @@ -568,8 +568,8 @@ static int btt_freelist_init(struct arena_info *arena)
>  		 * FIXME: if error clearing fails during init, we want to make
>  		 * the BTT read-only
>  		 */
> -		if (ent_e_flag(log_new.old_map) &&
> -				!ent_normal(log_new.old_map)) {
> +		if (ent_e_flag(le32_to_cpu(log_new.old_map)) &&
> +		    !ent_normal(le32_to_cpu(log_new.old_map))) {
>  			arena->freelist[i].has_err = 1;
>  			ret = arena_clear_freelist_error(arena, i);
>  			if (ret)
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index c4c5a191b1d6..500c37db496a 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1995,7 +1995,7 @@ static struct device *create_namespace_pmem(struct nd_region *nd_region,
>  		nd_mapping = &nd_region->mapping[i];
>  		label_ent = list_first_entry_or_null(&nd_mapping->labels,
>  				typeof(*label_ent), list);
> -		label0 = label_ent ? label_ent->label : 0;
> +		label0 = label_ent ? label_ent->label : NULL;
>  
>  		if (!label0) {
>  			WARN_ON(1);
> @@ -2330,8 +2330,9 @@ static struct device **scan_labels(struct nd_region *nd_region)
>  			continue;
>  
>  		/* skip labels that describe extents outside of the region */
> -		if (nd_label->dpa < nd_mapping->start || nd_label->dpa > map_end)
> -			continue;
> +		if (__le64_to_cpu(nd_label->dpa) < nd_mapping->start ||
> +		    __le64_to_cpu(nd_label->dpa) > map_end)
> +				continue;
>  
>  		i = add_namespace_resource(nd_region, nd_label, devs, count);
>  		if (i < 0)


^ permalink raw reply

* Re: [PATCH v3 02/11] s390x/mm: Fail when an altmap is used for arch_add_memory()
From: Oscar Salvador @ 2019-06-10 17:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Vasily Gorbik, linux-sh, Heiko Carstens, linux-kernel, Wei Yang,
	linux-mm, Mike Rapoport, Martin Schwidefsky, Igor Mammedov, akpm,
	linuxppc-dev, Dan Williams, linux-arm-kernel
In-Reply-To: <20190527111152.16324-3-david@redhat.com>

On Mon, May 27, 2019 at 01:11:43PM +0200, David Hildenbrand wrote:
> ZONE_DEVICE is not yet supported, fail if an altmap is passed, so we
> don't forget arch_add_memory()/arch_remove_memory() when unlocking
> support.
> 
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Oscar Salvador <osalvador@suse.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply

* Re: [PATCH v3 11/11] mm/memory_hotplug: Remove "zone" parameter from sparse_remove_one_section
From: Oscar Salvador @ 2019-06-10 16:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-s390, linux-ia64, linux-sh, linux-kernel, Wei Yang,
	linux-mm, Igor Mammedov, akpm, linuxppc-dev, Dan Williams,
	linux-arm-kernel
In-Reply-To: <20190527111152.16324-12-david@redhat.com>

On Mon, May 27, 2019 at 01:11:52PM +0200, David Hildenbrand wrote:
> The parameter is unused, so let's drop it. Memory removal paths should
> never care about zones. This is the job of memory offlining and will
> require more refactorings.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply

* Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail
From: Oscar Salvador @ 2019-06-10 16:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-s390, linux-ia64, linux-sh, Greg Kroah-Hartman, Mark Brown,
	Rafael J. Wysocki, linux-kernel, Wei Yang, linux-mm,
	David S. Miller, Jonathan Cameron, Alex Deucher, Igor Mammedov,
	akpm, Chris Wilson, linuxppc-dev, Dan Williams, linux-arm-kernel
In-Reply-To: <20190527111152.16324-11-david@redhat.com>

On Mon, May 27, 2019 at 01:11:51PM +0200, David Hildenbrand wrote:
> We really don't want anything during memory hotunplug to fail.
> We always pass a valid memory block device, that check can go. Avoid
> allocating memory and eventually failing. As we are always called under
> lock, we can use a static piece of memory. This avoids having to put
> the structure onto the stack, having to guess about the stack size
> of callers.
> 
> Patch inspired by a patch from Oscar Salvador.
> 
> In the future, there might be no need to iterate over nodes at all.
> mem->nid should tell us exactly what to remove. Memory block devices
> with mixed nodes (added during boot) should properly fenced off and never
> removed.
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply

* Re: [PATCH v3 01/11] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
From: Oscar Salvador @ 2019-06-10 16:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-s390, Michal Hocko, linux-ia64, Pavel Tatashin, linux-sh,
	Mathieu Malaterre, linux-kernel, Wei Yang, linux-mm, Arun KS,
	Qian Cai, Wei Yang, Igor Mammedov, akpm, linuxppc-dev,
	Dan Williams, linux-arm-kernel
In-Reply-To: <20190527111152.16324-2-david@redhat.com>

On Mon, May 27, 2019 at 01:11:42PM +0200, David Hildenbrand wrote:
> By converting start and size to page granularity, we actually ignore
> unaligned parts within a page instead of properly bailing out with an
> error.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Arun KS <arunks@codeaurora.org>
> Cc: Mathieu Malaterre <malat@debian.org>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply

* Re: [PATCH v3 14/33] docs: kbuild: convert docs to ReST and rename to *.rst
From: Federico Vaga @ 2019-06-10 16:25 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-wireless, linux-fbdev, Emmanuel Grumbach, Stanislaw Gruszka,
	Vignesh Raghavendra, Linux Doc Mailing List, bridge,
	Palmer Dabbelt, alsa-devel, dri-devel, Ofer Levi, Masahiro Yamada,
	Harry Wei, Paul Mackerras, Miquel Raynal, linux-kbuild,
	linux-riscv, Vincent Chen, Aurelien Jacquiot, Jonas Bonn,
	Alex Shi, linux-c6x-dev, linux-scsi, Jonathan Corbet,
	Bartlomiej Zolnierkiewicz, netdev, Marek Vasut, coreteam,
	Mark Salter, Al exey Kuznetsov, linux-snps-arc,
	James E.J. Bottomley, Pablo Neira Ayuso, devel, Albert Ou,
	Johannes Berg, Intel Linux Wireless, Nikolay Aleksandrov,
	David Woodhouse, Jozsef Kadlecsik, Mauro Carvalho Chehab,
	openrisc, Greg Kroah-Hartman, Greentime Hu, linux-mtd,
	Takashi Iwai, Jaroslav Kysela, Stafford Horne,
	Stefan Kristiansson, Kalle Valo, Teddy Wang, Jon Maloy,
	Michal Simek, Michal Marek, netfilter-devel, Martin K. Petersen,
	g, Hideaki YOSHIFUJI, Vineet Gupta, linux-usb, Florian Westphal,
	linux-kernel, tipc-discussion, Sudip Mukherjee,
	Miguel Ojeda Sandonis, Roopa Prabhu, Richard Weinberger, Ying Xue,
	Luca Coelho, Brian Norris, linuxppc-dev, David S. Miller
In-Reply-To: <3d40d111d0512d785b6a67573772f532f88d2359.1560045490.git.mchehab+samsung@kernel.org>

In data Sunday, June 9, 2019 4:27:04 AM CEST, Mauro Carvalho Chehab ha 
scritto:
> The kbuild documentation clearly shows that the documents
> there are written at different times: some use markdown,
> some use their own peculiar logic to split sections.
> 
> Convert everything to ReST without affecting too much
> the author's style and avoiding adding uneeded markups.
> 
> The conversion is actually:
>   - add blank lines and identation in order to identify paragraphs;
>   - fix tables markups;
>   - add some lists markups;
>   - mark literal blocks;
>   - adjust title markups.
> 
> At its new index.rst, let's add a :orphan: while this is not linked to
> the main index.rst file, in order to avoid build warnings.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
>  Documentation/admin-guide/README.rst          |   2 +-
>  ...eaders_install.txt => headers_install.rst} |   5 +-
>  Documentation/kbuild/index.rst                |  27 +
>  Documentation/kbuild/issues.rst               |  11 +
>  .../kbuild/{kbuild.txt => kbuild.rst}         | 119 ++--
>  ...nfig-language.txt => kconfig-language.rst} | 232 ++++----
>  ...anguage.txt => kconfig-macro-language.rst} |  37 +-
>  .../kbuild/{kconfig.txt => kconfig.rst}       | 136 +++--
>  .../kbuild/{makefiles.txt => makefiles.rst}   | 530 +++++++++++-------
>  .../kbuild/{modules.txt => modules.rst}       | 168 +++---
>  Documentation/kernel-hacking/hacking.rst      |   4 +-
>  Documentation/process/coding-style.rst        |   2 +-
>  Documentation/process/submit-checklist.rst    |   2 +-
>  .../it_IT/kernel-hacking/hacking.rst          |   4 +-
>  .../it_IT/process/coding-style.rst            |   2 +-
>  .../it_IT/process/submit-checklist.rst        |   2 +-

Limited to translations/it_IT

Acked-by: Federico Vaga <federico.vaga@vaga.pv.it>

>  .../zh_CN/process/coding-style.rst            |   2 +-
>  .../zh_CN/process/submit-checklist.rst        |   2 +-
>  Kconfig                                       |   2 +-
>  arch/arc/plat-eznps/Kconfig                   |   2 +-
>  arch/c6x/Kconfig                              |   2 +-
>  arch/microblaze/Kconfig.debug                 |   2 +-
>  arch/microblaze/Kconfig.platform              |   2 +-
>  arch/nds32/Kconfig                            |   2 +-
>  arch/openrisc/Kconfig                         |   2 +-
>  arch/powerpc/sysdev/Kconfig                   |   2 +-
>  arch/riscv/Kconfig                            |   2 +-
>  drivers/auxdisplay/Kconfig                    |   2 +-
>  drivers/firmware/Kconfig                      |   2 +-
>  drivers/mtd/devices/Kconfig                   |   2 +-
>  drivers/net/ethernet/smsc/Kconfig             |   6 +-
>  drivers/net/wireless/intel/iwlegacy/Kconfig   |   4 +-
>  drivers/net/wireless/intel/iwlwifi/Kconfig    |   2 +-
>  drivers/parport/Kconfig                       |   2 +-
>  drivers/scsi/Kconfig                          |   4 +-
>  drivers/staging/sm750fb/Kconfig               |   2 +-
>  drivers/usb/misc/Kconfig                      |   4 +-
>  drivers/video/fbdev/Kconfig                   |  14 +-
>  net/bridge/netfilter/Kconfig                  |   2 +-
>  net/ipv4/netfilter/Kconfig                    |   2 +-
>  net/ipv6/netfilter/Kconfig                    |   2 +-
>  net/netfilter/Kconfig                         |  16 +-
>  net/tipc/Kconfig                              |   2 +-
>  scripts/Kbuild.include                        |   4 +-
>  scripts/Makefile.host                         |   2 +-
>  scripts/kconfig/symbol.c                      |   2 +-
>  .../tests/err_recursive_dep/expected_stderr   |  14 +-
>  sound/oss/dmasound/Kconfig                    |   6 +-
>  48 files changed, 840 insertions(+), 561 deletions(-)
>  rename Documentation/kbuild/{headers_install.txt => headers_install.rst}
> (96%) create mode 100644 Documentation/kbuild/index.rst
>  create mode 100644 Documentation/kbuild/issues.rst
>  rename Documentation/kbuild/{kbuild.txt => kbuild.rst} (72%)
>  rename Documentation/kbuild/{kconfig-language.txt => kconfig-language.rst}
> (85%) rename Documentation/kbuild/{kconfig-macro-language.txt =>
> kconfig-macro-language.rst} (94%) rename Documentation/kbuild/{kconfig.txt
> => kconfig.rst} (80%)
>  rename Documentation/kbuild/{makefiles.txt => makefiles.rst} (83%)
>  rename Documentation/kbuild/{modules.txt => modules.rst} (84%)






^ permalink raw reply

* Re: [PATCH RESEND 1/2] tools/perf: Add arch neutral function to choose event for perf kvm record
From: Arnaldo Carvalho de Melo @ 2019-06-10 15:16 UTC (permalink / raw)
  To: Anju T Sudhakar
  Cc: ravi.bangoria, maddy, peterz, linuxppc-dev, linux-kernel,
	alexander.shishkin, namhyung, jolsa
In-Reply-To: <20190610064518.949-1-anju@linux.vnet.ibm.com>

Em Mon, Jun 10, 2019 at 12:15:17PM +0530, Anju T Sudhakar escreveu:
> 'perf kvm record' uses 'cycles'(if the user did not specify any event) as
> the default event to profile the guest.
> This will not provide any proper samples from the guest incase of
> powerpc architecture, since in powerpc the PMUs are controlled by
> the guest rather than the host.
> 
> Patch adds a function to pick an arch specific event for 'perf kvm record',
> instead of selecting 'cycles' as a default event for all architectures.
> 
> For powerpc this function checks for any user specified event, and if there
> isn't any it returns invalid instead of proceeding with 'cycles' event.

Michael, Ravi, Maddy, could you please provide an Acked-by, Reviewed-by
or Tested-by?

- Arnaldo
 
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> ---
>  tools/perf/arch/powerpc/util/kvm-stat.c | 37 +++++++++++++++++++++++++
>  tools/perf/builtin-kvm.c                | 12 +++++++-
>  tools/perf/util/kvm-stat.h              |  2 +-
>  3 files changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/arch/powerpc/util/kvm-stat.c b/tools/perf/arch/powerpc/util/kvm-stat.c
> index f9db341c47b6..66f8fe500945 100644
> --- a/tools/perf/arch/powerpc/util/kvm-stat.c
> +++ b/tools/perf/arch/powerpc/util/kvm-stat.c
> @@ -8,6 +8,7 @@
>  
>  #include "book3s_hv_exits.h"
>  #include "book3s_hcalls.h"
> +#include <subcmd/parse-options.h>
>  
>  #define NR_TPS 4
>  
> @@ -172,3 +173,39 @@ int cpu_isa_init(struct perf_kvm_stat *kvm, const char *cpuid __maybe_unused)
>  
>  	return ret;
>  }
> +
> +/*
> + * Incase of powerpc architecture, pmu registers are programmable
> + * by guest kernel. So monitoring guest via host may not provide
> + * valid samples. It is better to fail the "perf kvm record"
> + * with default "cycles" event to monitor guest in powerpc.
> + *
> + * Function to parse the arguments and return appropriate values.
> + */
> +int kvm_add_default_arch_event(int *argc, const char **argv)
> +{
> +	const char **tmp;
> +	bool event = false;
> +	int i, j = *argc;
> +
> +	const struct option event_options[] = {
> +		OPT_BOOLEAN('e', "event", &event, NULL),
> +		OPT_END()
> +	};
> +
> +	tmp = calloc(j + 1, sizeof(char *));
> +	if (!tmp)
> +		return -EINVAL;
> +
> +	for (i = 0; i < j; i++)
> +		tmp[i] = argv[i];
> +
> +	parse_options(j, tmp, event_options, NULL, 0);
> +	if (!event) {
> +		free(tmp);
> +		return -EINVAL;
> +	}
> +
> +	free(tmp);
> +	return 0;
> +}
> diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> index dbb6f737a3e2..fe33b3ec55c9 100644
> --- a/tools/perf/builtin-kvm.c
> +++ b/tools/perf/builtin-kvm.c
> @@ -1510,11 +1510,21 @@ static int kvm_cmd_stat(const char *file_name, int argc, const char **argv)
>  }
>  #endif /* HAVE_KVM_STAT_SUPPORT */
>  
> +int __weak kvm_add_default_arch_event(int *argc __maybe_unused,
> +					const char **argv __maybe_unused)
> +{
> +	return 0;
> +}
> +
>  static int __cmd_record(const char *file_name, int argc, const char **argv)
>  {
> -	int rec_argc, i = 0, j;
> +	int rec_argc, i = 0, j, ret;
>  	const char **rec_argv;
>  
> +	ret = kvm_add_default_arch_event(&argc, argv);
> +	if (ret)
> +		return -EINVAL;
> +
>  	rec_argc = argc + 2;
>  	rec_argv = calloc(rec_argc + 1, sizeof(char *));
>  	rec_argv[i++] = strdup("record");
> diff --git a/tools/perf/util/kvm-stat.h b/tools/perf/util/kvm-stat.h
> index 1403dec189b4..da38b56c46cb 100644
> --- a/tools/perf/util/kvm-stat.h
> +++ b/tools/perf/util/kvm-stat.h
> @@ -144,5 +144,5 @@ extern const int decode_str_len;
>  extern const char *kvm_exit_reason;
>  extern const char *kvm_entry_trace;
>  extern const char *kvm_exit_trace;
> -
> +extern int kvm_add_default_arch_event(int *argc, const char **argv);
>  #endif /* __PERF_KVM_STAT_H */
> -- 
> 2.17.2

-- 

- Arnaldo

^ permalink raw reply

* Re: [PATCH v2] powerpc/perf: Use cpumask_last() to determine the designated cpu for nest/core units.
From: Leonardo Bras @ 2019-06-10 18:47 UTC (permalink / raw)
  To: Anju T Sudhakar, mpe; +Cc: ego, maddy, linuxppc-dev
In-Reply-To: <20190610063229.32560-1-anju@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 3817 bytes --]

On Mon, 2019-06-10 at 12:02 +0530, Anju T Sudhakar wrote:
> Nest and core imc(In-memory Collection counters) assigns a particular
> cpu as the designated target for counter data collection.
> During system boot, the first online cpu in a chip gets assigned as
> the designated cpu for that chip(for nest-imc) and the first online cpu
> in a core gets assigned as the designated cpu for that core(for core-imc).
> 
> If the designated cpu goes offline, the next online cpu from the same
> chip(for nest-imc)/core(for core-imc) is assigned as the next target,
> and the event context is migrated to the target cpu.
> Currently, cpumask_any_but() function is used to find the target cpu.
> Though this function is expected to return a `random` cpu, this always
> returns the next online cpu.
> 
> If all cpus in a chip/core is offlined in a sequential manner, starting
> from the first cpu, the event migration has to happen for all the cpus
> which goes offline. Since the migration process involves a grace period,
> the total time taken to offline all the cpus will be significantly high.
> 
> Example:
> In a system which has 2 sockets, with
> NUMA node0 CPU(s):     0-87
> NUMA node8 CPU(s):     88-175
> 
> Time taken to offline cpu 88-175:
> real    2m56.099s
> user    0m0.191s
> sys     0m0.000s
> 
> Use cpumask_last() to choose the target cpu, when the designated cpu
> goes online, so the migration will happen only when the last_cpu in the
> mask goes offline. This way the time taken to offline all cpus in a
> chip/core can be reduced.
> 
> With the patch, 
> 
> Time taken  to offline cpu 88-175:
> real    0m12.207s
> user    0m0.171s
> sys     0m0.000s
> 
> 
> Offlining all cpus in reverse order is also taken care because,
> cpumask_any_but() is used to find the designated cpu if the last cpu in
> the mask goes offline. Since cpumask_any_but() always return the first
> cpu in the mask, that becomes the designated cpu and migration will happen
> only when the first_cpu in the mask goes offline.
> 
> Example:
> With the patch,
> 
> Time taken to offline cpu from 175-88:
> real    0m9.330s
> user    0m0.110s
> sys     0m0.000s

Seems like a very interesting work.
Out of curiosity, have you used 'chcpu -d' to create your benchmark?

> 
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
> ---
> 
> Changes from v1:
> 	Modified the commit log with more info.
> ---
> 
>  arch/powerpc/perf/imc-pmu.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index 31fa753..fbfd6e7 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -366,7 +366,14 @@ static int ppc_nest_imc_cpu_offline(unsigned int cpu)
>  	 */
>  	nid = cpu_to_node(cpu);
>  	l_cpumask = cpumask_of_node(nid);
> -	target = cpumask_any_but(l_cpumask, cpu);
> +	target = cpumask_last(l_cpumask);
> +
> +	/*
> +	 * If this(target) is the last cpu in the cpumask for this chip,
> +	 * check for any possible online cpu in the chip.
> +	 */
> +	if (unlikely(target == cpu))
> +		target = cpumask_any_but(l_cpumask, cpu);
>  
>  	/*
>  	 * Update the cpumask with the target cpu and
> @@ -671,7 +678,10 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
>  		return 0;
>  
>  	/* Find any online cpu in that core except the current "cpu" */
> -	ncpu = cpumask_any_but(cpu_sibling_mask(cpu), cpu);
> +	ncpu = cpumask_last(cpu_sibling_mask(cpu));
> +
> +	if (unlikely(ncpu == cpu))
> +		ncpu = cpumask_any_but(cpu_sibling_mask(cpu), cpu);
>  
>  	if (ncpu >= 0 && ncpu < nr_cpu_ids) {
>  		cpumask_set_cpu(ncpu, &core_imc_cpumask);

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook
From: Larry Finger @ 2019-06-10 18:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Aaro Koskinen, Christoph Hellwig,
	Christian Zigotzky, Michael Ellerman
  Cc: linux-wireless, linuxppc-dev, linux-kernel
In-Reply-To: <7697a9d10777b28ae79fdffdde6d0985555f6310.camel@kernel.crashing.org>

On 6/7/19 11:21 PM, Benjamin Herrenschmidt wrote:
> 
>> Please try the attached patch. I'm not really pleased with it and I will
>> continue to determine why the fallback to a 30-bit mask fails, but at least this
>> one works for me.
> 
> Your patch only makes sense if the device is indeed capable of
> addressing 31-bits.
> 
> So either the driver is buggy and asks for a too small mask in which
> case your patch is ok, or it's not and you're just going to cause all
> sort of interesting random problems including possible memory
> corruption.

Of course the driver may be buggy, but it asks for the correct mask.

This particular device is not capable of handling 32-bit DMA. The driver detects 
the 32-bit failure and falls back to 30 bits. It works on x86, and did on PPC32 
until 5.1. As Christoph said, it should always be possible to use fewer bits 
than the maximum.

Similar devices that are new enough to use b43 rather than b43legacy work with 
new kernels; however, they have and use 32-bit DMA.

Larry

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox