Linux-mm Archive on lore.kernel.org

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v6 05/10] mm: thp: enable thp migration in generic path
From: Andrew Morton @ 2017-05-25 22:43 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild test robot, kbuild-all, n-horiguchi, kirill.shutemov,
	linux-kernel, linux-mm, minchan, vbabka, mgorman, mhocko,
	khandual, dnellans, dave.hansen
In-Reply-To: <138B8C07-2A41-40AA-9B4C-5F85FEFD6F0D@cs.rutgers.edu>

On Thu, 25 May 2017 13:19:54 -0400 "Zi Yan" <zi.yan@cs.rutgers.edu> wrote:

> On 25 May 2017, at 13:06, kbuild test robot wrote:
> 
> > Hi Zi,
> >
> > [auto build test WARNING on mmotm/master]
> > [also build test WARNING on v4.12-rc2 next-20170525]
> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> >
> > url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170526-003749
> > base:   git://git.cmpxchg.org/linux-mmotm.git master
> > config: i386-randconfig-x016-201721 (attached as .config)
> > compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> > reproduce:
> >         # save the attached .config to linux build tree
> >         make ARCH=i386
> >
> > All warnings (new ones prefixed by >>):
> >
> >    In file included from fs/proc/task_mmu.c:15:0:
> >    include/linux/swapops.h: In function 'swp_entry_to_pmd':
> >>> include/linux/swapops.h:222:16: warning: missing braces around initializer [-Wmissing-braces]
> >      return (pmd_t){{ 0 }};
> >                    ^
> 
> The braces are added to eliminate the warning from "m68k-linux-gcc (GCC) 4.9.0",
> which has the bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119.

I think we'd prefer to have a warning on m68k than on i386!  Is there
something smarter we can do here?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v5 29/32] x86/mm: Add support to encrypt the kernel in-place
From: Tom Lendacky @ 2017-05-25 22:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-arch, linux-efi, kvm, linux-doc, x86, kexec, linux-kernel,
	kasan-dev, linux-mm, iommu, Rik van Riel,
	Radim Krčmář, Toshimitsu Kani, Arnd Bergmann,
	Jonathan Corbet, Matt Fleming, Michael S. Tsirkin, Joerg Roedel,
	Konrad Rzeszutek Wilk, Paolo Bonzini, Larry Woodman,
	Brijesh Singh, Ingo Molnar, Andy Lutomirski, H. Peter Anvin,
	Andrey Ryabinin, Alexander Potapenko, Dave Young, Thomas Gleixner,
	Dmitry Vyukov
In-Reply-To: <20170518124626.hqyqqbjpy7hmlpqc@pd.tnic>

On 5/18/2017 7:46 AM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 04:21:49PM -0500, Tom Lendacky wrote:
>> Add the support to encrypt the kernel in-place. This is done by creating
>> new page mappings for the kernel - a decrypted write-protected mapping
>> and an encrypted mapping. The kernel is encrypted by copying it through
>> a temporary buffer.
>>
>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
>> ---
>>  arch/x86/include/asm/mem_encrypt.h |    6 +
>>  arch/x86/mm/Makefile               |    2
>>  arch/x86/mm/mem_encrypt.c          |  262 ++++++++++++++++++++++++++++++++++++
>>  arch/x86/mm/mem_encrypt_boot.S     |  151 +++++++++++++++++++++
>>  4 files changed, 421 insertions(+)
>>  create mode 100644 arch/x86/mm/mem_encrypt_boot.S
>>
>> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
>> index b406df2..8f6f9b4 100644
>> --- a/arch/x86/include/asm/mem_encrypt.h
>> +++ b/arch/x86/include/asm/mem_encrypt.h
>> @@ -31,6 +31,12 @@ static inline u64 sme_dma_mask(void)
>>  	return ((u64)sme_me_mask << 1) - 1;
>>  }
>>
>> +void sme_encrypt_execute(unsigned long encrypted_kernel_vaddr,
>> +			 unsigned long decrypted_kernel_vaddr,
>> +			 unsigned long kernel_len,
>> +			 unsigned long encryption_wa,
>> +			 unsigned long encryption_pgd);
>> +
>>  void __init sme_early_encrypt(resource_size_t paddr,
>>  			      unsigned long size);
>>  void __init sme_early_decrypt(resource_size_t paddr,
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index 9e13841..0633142 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -38,3 +38,5 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
>>  obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
>>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
>>  obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
>> +
>> +obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
>> index 30b07a3..0ff41a4 100644
>> --- a/arch/x86/mm/mem_encrypt.c
>> +++ b/arch/x86/mm/mem_encrypt.c
>> @@ -24,6 +24,7 @@
>>  #include <asm/setup.h>
>>  #include <asm/bootparam.h>
>>  #include <asm/cacheflush.h>
>> +#include <asm/sections.h>
>>
>>  /*
>>   * Since SME related variables are set early in the boot process they must
>> @@ -216,8 +217,269 @@ void swiotlb_set_mem_attributes(void *vaddr, unsigned long size)
>>  	set_memory_decrypted((unsigned long)vaddr, size >> PAGE_SHIFT);
>>  }
>>
>> +void __init sme_clear_pgd(pgd_t *pgd_base, unsigned long start,
>
> static

Yup.

>
>> +			  unsigned long end)
>> +{
>> +	unsigned long addr = start;
>> +	pgdval_t *pgd_p;
>> +
>> +	while (addr < end) {
>> +		unsigned long pgd_end;
>> +
>> +		pgd_end = (addr & PGDIR_MASK) + PGDIR_SIZE;
>> +		if (pgd_end > end)
>> +			pgd_end = end;
>> +
>> +		pgd_p = (pgdval_t *)pgd_base + pgd_index(addr);
>> +		*pgd_p = 0;
>
> Hmm, so this is a contiguous range from [start:end] which translates to
> 8-byte PGD pointers in the PGD page so you can simply memset that range,
> no?
>
> Instead of iterating over each one?

I guess I could do that, but this will probably only end up clearing a
single PGD entry anyway since it's highly doubtful the address range
would cross a 512GB boundary.

>
>> +
>> +		addr = pgd_end;
>> +	}
>> +}
>> +
>> +#define PGD_FLAGS	_KERNPG_TABLE_NOENC
>> +#define PUD_FLAGS	_KERNPG_TABLE_NOENC
>> +#define PMD_FLAGS	(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL)
>> +
>> +static void __init *sme_populate_pgd(pgd_t *pgd_base, void *pgtable_area,
>> +				     unsigned long vaddr, pmdval_t pmd_val)
>> +{
>> +	pgdval_t pgd, *pgd_p;
>> +	pudval_t pud, *pud_p;
>> +	pmdval_t pmd, *pmd_p;
>
> You should use the enclosing type, not the underlying one. I.e.,
>
> 	pgd_t *pgd;
> 	pud_t *pud;
> 	...
>
> and then the macros native_p*d_val(), p*d_offset() and so on. I say
> native_* because we don't want to have any paravirt nastyness here.
> I believe your previous version was using the proper interfaces.

I won't be able to use the p*d_offset() macros since they use __va()
and we're identity mapped during this time (which is why I would guess
the proposed changes for the 5-level pagetables in
arch/x86/kernel/head64.c, __startup_64, don't use these macros
either). I should be able to use the native_set_p*d() and others though,
I'll look into that.

>
> And the kernel has gotten 5-level pagetables support in
> the meantime, so this'll need to start at p4d AFAICT.
> arch/x86/mm/fault.c::dump_pagetable() looks like a good example to stare
> at.

Yeah, I accounted for that in the other parts of the code but I need
to do that here also.

>
>> +	pgd_p = (pgdval_t *)pgd_base + pgd_index(vaddr);
>> +	pgd = *pgd_p;
>> +	if (pgd) {
>> +		pud_p = (pudval_t *)(pgd & ~PTE_FLAGS_MASK);
>> +	} else {
>> +		pud_p = pgtable_area;
>> +		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
>> +		pgtable_area += sizeof(*pud_p) * PTRS_PER_PUD;
>> +
>> +		*pgd_p = (pgdval_t)pud_p + PGD_FLAGS;
>> +	}
>> +
>> +	pud_p += pud_index(vaddr);
>> +	pud = *pud_p;
>> +	if (pud) {
>> +		if (pud & _PAGE_PSE)
>> +			goto out;
>> +
>> +		pmd_p = (pmdval_t *)(pud & ~PTE_FLAGS_MASK);
>> +	} else {
>> +		pmd_p = pgtable_area;
>> +		memset(pmd_p, 0, sizeof(*pmd_p) * PTRS_PER_PMD);
>> +		pgtable_area += sizeof(*pmd_p) * PTRS_PER_PMD;
>> +
>> +		*pud_p = (pudval_t)pmd_p + PUD_FLAGS;
>> +	}
>> +
>> +	pmd_p += pmd_index(vaddr);
>> +	pmd = *pmd_p;
>> +	if (!pmd || !(pmd & _PAGE_PSE))
>> +		*pmd_p = pmd_val;
>> +
>> +out:
>> +	return pgtable_area;
>> +}
>> +
>> +static unsigned long __init sme_pgtable_calc(unsigned long len)
>> +{
>> +	unsigned long pud_tables, pmd_tables;
>> +	unsigned long total = 0;
>> +
>> +	/*
>> +	 * Perform a relatively simplistic calculation of the pagetable
>> +	 * entries that are needed. That mappings will be covered by 2MB
>> +	 * PMD entries so we can conservatively calculate the required
>> +	 * number of PUD and PMD structures needed to perform the mappings.
>> +	 * Incrementing the count for each covers the case where the
>> +	 * addresses cross entries.
>> +	 */
>> +	pud_tables = ALIGN(len, PGDIR_SIZE) / PGDIR_SIZE;
>> +	pud_tables++;
>> +	pmd_tables = ALIGN(len, PUD_SIZE) / PUD_SIZE;
>> +	pmd_tables++;
>> +
>> +	total += pud_tables * sizeof(pud_t) * PTRS_PER_PUD;
>> +	total += pmd_tables * sizeof(pmd_t) * PTRS_PER_PMD;
>> +
>> +	/*
>> +	 * Now calculate the added pagetable structures needed to populate
>> +	 * the new pagetables.
>> +	 */
>
> Nice commenting, helps following what's going on.
>
>> +	pud_tables = ALIGN(total, PGDIR_SIZE) / PGDIR_SIZE;
>> +	pmd_tables = ALIGN(total, PUD_SIZE) / PUD_SIZE;
>> +
>> +	total += pud_tables * sizeof(pud_t) * PTRS_PER_PUD;
>> +	total += pmd_tables * sizeof(pmd_t) * PTRS_PER_PMD;
>> +
>> +	return total;
>> +}
>> +
>>  void __init sme_encrypt_kernel(void)
>>  {
>> +	pgd_t *pgd;
>> +	void *pgtable_area;
>> +	unsigned long kernel_start, kernel_end, kernel_len;
>> +	unsigned long workarea_start, workarea_end, workarea_len;
>> +	unsigned long execute_start, execute_end, execute_len;
>> +	unsigned long pgtable_area_len;
>> +	unsigned long decrypted_base;
>> +	unsigned long paddr, pmd_flags;
>
>
> Please sort function local variables declaration in a reverse christmas
> tree order:
>
> 	<type> longest_variable_name;
> 	<type> shorter_var_name;
> 	<type> even_shorter;
> 	<type> i;
>

Will do.

>> +
>> +	if (!sme_active())
>> +		return;
>
> ...
>
>> diff --git a/arch/x86/mm/mem_encrypt_boot.S b/arch/x86/mm/mem_encrypt_boot.S
>> new file mode 100644
>> index 0000000..fb58f9f
>> --- /dev/null
>> +++ b/arch/x86/mm/mem_encrypt_boot.S
>> @@ -0,0 +1,151 @@
>> +/*
>> + * AMD Memory Encryption Support
>> + *
>> + * Copyright (C) 2016 Advanced Micro Devices, Inc.
>> + *
>> + * Author: Tom Lendacky <thomas.lendacky@amd.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/pgtable.h>
>> +#include <asm/page.h>
>> +#include <asm/processor-flags.h>
>> +#include <asm/msr-index.h>
>> +
>> +	.text
>> +	.code64
>> +ENTRY(sme_encrypt_execute)
>> +
>> +	/*
>> +	 * Entry parameters:
>> +	 *   RDI - virtual address for the encrypted kernel mapping
>> +	 *   RSI - virtual address for the decrypted kernel mapping
>> +	 *   RDX - length of kernel
>> +	 *   RCX - virtual address of the encryption workarea, including:
>> +	 *     - stack page (PAGE_SIZE)
>> +	 *     - encryption routine page (PAGE_SIZE)
>> +	 *     - intermediate copy buffer (PMD_PAGE_SIZE)
>> +	 *    R8 - physcial address of the pagetables to use for encryption
>> +	 */
>> +
>> +	push	%rbp
>> +	push	%r12
>> +
>> +	/* Set up a one page stack in the non-encrypted memory area */
>> +	movq	%rsp, %rbp		/* Save current stack pointer */
>> +	movq	%rcx, %rax		/* Workarea stack page */
>> +	movq	%rax, %rsp		/* Set new stack pointer */
>> +	addq	$PAGE_SIZE, %rsp	/* Stack grows from the bottom */
>> +	addq	$PAGE_SIZE, %rax	/* Workarea encryption routine */
>> +
>> +	movq	%rdi, %r10		/* Encrypted kernel */
>> +	movq	%rsi, %r11		/* Decrypted kernel */
>> +	movq	%rdx, %r12		/* Kernel length */
>> +
>> +	/* Copy encryption routine into the workarea */
>> +	movq	%rax, %rdi		/* Workarea encryption routine */
>> +	leaq	.Lenc_start(%rip), %rsi	/* Encryption routine */
>> +	movq	$(.Lenc_stop - .Lenc_start), %rcx	/* Encryption routine length */
>> +	rep	movsb
>> +
>> +	/* Setup registers for call */
>> +	movq	%r10, %rdi		/* Encrypted kernel */
>> +	movq	%r11, %rsi		/* Decrypted kernel */
>> +	movq	%r8, %rdx		/* Pagetables used for encryption */
>> +	movq	%r12, %rcx		/* Kernel length */
>> +	movq	%rax, %r8		/* Workarea encryption routine */
>> +	addq	$PAGE_SIZE, %r8		/* Workarea intermediate copy buffer */
>> +
>> +	call	*%rax			/* Call the encryption routine */
>> +
>> +	movq	%rbp, %rsp		/* Restore original stack pointer */
>> +
>> +	pop	%r12
>> +	pop	%rbp
>> +
>> +	ret
>> +ENDPROC(sme_encrypt_execute)
>> +
>> +.Lenc_start:
>> +ENTRY(sme_enc_routine)
>
> A function called a "routine"? Why do we need the global symbol?
> Nothing's referencing it AFAICT.

I can change the name. As for the use of ENTRY... without the
ENTRY/ENDPROC combination I was receiving a warning about a return
instruction outside of a callable function. It looks like I can just
define the "sme_enc_routine:" label with the ENDPROC and the warning
goes away and the global is avoided. It doesn't like the local labels
(.L...) so I'll use the new name.

>
>> +/*
>> + * Routine used to encrypt kernel.
>> + *   This routine must be run outside of the kernel proper since
>> + *   the kernel will be encrypted during the process. So this
>> + *   routine is defined here and then copied to an area outside
>> + *   of the kernel where it will remain and run decrypted
>> + *   during execution.
>> + *
>> + *   On entry the registers must be:
>> + *     RDI - virtual address for the encrypted kernel mapping
>> + *     RSI - virtual address for the decrypted kernel mapping
>> + *     RDX - address of the pagetables to use for encryption
>> + *     RCX - length of kernel
>> + *      R8 - intermediate copy buffer
>> + *
>> + *     RAX - points to this routine
>> + *
>> + * The kernel will be encrypted by copying from the non-encrypted
>> + * kernel space to an intermediate buffer and then copying from the
>> + * intermediate buffer back to the encrypted kernel space. The physical
>> + * addresses of the two kernel space mappings are the same which
>> + * results in the kernel being encrypted "in place".
>> + */
>> +	/* Enable the new page tables */
>> +	mov	%rdx, %cr3
>> +
>> +	/* Flush any global TLBs */
>> +	mov	%cr4, %rdx
>> +	andq	$~X86_CR4_PGE, %rdx
>> +	mov	%rdx, %cr4
>> +	orq	$X86_CR4_PGE, %rdx
>> +	mov	%rdx, %cr4
>> +
>> +	/* Set the PAT register PA5 entry to write-protect */
>> +	push	%rcx
>> +	movl	$MSR_IA32_CR_PAT, %ecx
>> +	rdmsr
>> +	push	%rdx			/* Save original PAT value */
>> +	andl	$0xffff00ff, %edx	/* Clear PA5 */
>> +	orl	$0x00000500, %edx	/* Set PA5 to WP */
>
> Maybe check first whether PA5 is already set correctly and avoid the
> WRMSR and the restoring below too?

In the overall scheme of things it's probably not that big a deal when
compared to everything that's about to happen below.

>
>> +	wrmsr
>> +	pop	%rdx			/* RDX contains original PAT value */
>> +	pop	%rcx
>> +
>> +	movq	%rcx, %r9		/* Save kernel length */
>> +	movq	%rdi, %r10		/* Save encrypted kernel address */
>> +	movq	%rsi, %r11		/* Save decrypted kernel address */
>> +
>> +	wbinvd				/* Invalidate any cache entries */
>> +
>> +	/* Copy/encrypt 2MB at a time */
>> +1:
>> +	movq	%r11, %rsi		/* Source - decrypted kernel */
>> +	movq	%r8, %rdi		/* Dest   - intermediate copy buffer */
>> +	movq	$PMD_PAGE_SIZE, %rcx	/* 2MB length */
>> +	rep	movsb
>
> not movsQ?

The hardware will try to optimize rep movsb into large chunks assuming
things are aligned, sizes are large enough, etc. so we don't have to
explicitly specify and setup for a rep movsq.

Thanks,
Tom

>
>> +	movq	%r8, %rsi		/* Source - intermediate copy buffer */
>> +	movq	%r10, %rdi		/* Dest   - encrypted kernel */
>> +	movq	$PMD_PAGE_SIZE, %rcx	/* 2MB length */
>> +	rep	movsb
>> +
>> +	addq	$PMD_PAGE_SIZE, %r11
>> +	addq	$PMD_PAGE_SIZE, %r10
>> +	subq	$PMD_PAGE_SIZE, %r9	/* Kernel length decrement */
>> +	jnz	1b			/* Kernel length not zero? */
>> +
>> +	/* Restore PAT register */
>> +	push	%rdx			/* Save original PAT value */
>> +	movl	$MSR_IA32_CR_PAT, %ecx
>> +	rdmsr
>> +	pop	%rdx			/* Restore original PAT value */
>> +	wrmsr
>> +
>> +	ret
>> +ENDPROC(sme_enc_routine)
>> +.Lenc_stop:
>>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCHv1, RFC 7/8] x86/mm: Hacks for boot-time switching between 4- and 5-level paging
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

There're bunch of workaround to make switching between 4- and 5-level
paging compile.

All of them need to be addressed properly before upstreaming.

Not-yet-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig          | 4 ++--
 arch/x86/entry/entry_64.S | 5 +++++
 arch/x86/kernel/head_64.S | 6 ++++--
 arch/x86/xen/Kconfig      | 2 +-
 4 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0bf81e837cbf..c795207d8a3c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -100,7 +100,7 @@ config X86
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMAP		if X86_64 || X86_PAE
 	select HAVE_ARCH_JUMP_LABEL
-	select HAVE_ARCH_KASAN			if X86_64 && SPARSEMEM_VMEMMAP
+	select HAVE_ARCH_KASAN			if X86_64 && SPARSEMEM_VMEMMAP && !X86_5LEVEL
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_KMEMCHECK
 	select HAVE_ARCH_MMAP_RND_BITS		if MMU
@@ -1980,7 +1980,7 @@ config RELOCATABLE
 
 config RANDOMIZE_BASE
 	bool "Randomize the address of the kernel image (KASLR)"
-	depends on RELOCATABLE
+	depends on RELOCATABLE && !X86_5LEVEL
 	default y
 	---help---
 	  In support of Kernel Address Space Layout Randomization (KASLR),
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index edec30584eb8..9e868fd6d792 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -269,6 +269,11 @@ return_from_SYSCALL_64:
 	 * Change top bits to match most significant bit (47th or 56th bit
 	 * depending on paging mode) in the address.
 	 */
+#ifdef CONFIG_X86_5LEVEL
+#warning FIXME
+#undef __VIRTUAL_MASK_SHIFT
+#define __VIRTUAL_MASK_SHIFT 56
+#endif
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 2009d9849e98..9dcf7a4d8612 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,11 +37,13 @@
  *
  */
 
-#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+#ifdef CONFIG_XEN
+/* FIXME */
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE48)
 PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 1be9667bd476..c1714cac7595 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -4,7 +4,7 @@
 
 config XEN
 	bool "Xen guest support"
-	depends on PARAVIRT
+	depends on PARAVIRT && !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	depends on X86_64 || (X86_32 && X86_PAE)
 	depends on X86_LOCAL_APIC && X86_TSC
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 2/8] x86/mm: Make virtual memory layout movable for CONFIG_X86_5LEVEL
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

We need to be able to adjust virtual memory layout at runtime to be able
to switch between 4- and 5-level paging at boot-time.

KASLR already has movable __VMALLOC_BASE, __VMEMMAP_BASE and __PAGE_OFFSET.
Let's re-use it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/kaslr.h            | 4 ----
 arch/x86/include/asm/page_64.h          | 4 ++++
 arch/x86/include/asm/page_64_types.h    | 2 +-
 arch/x86/include/asm/pgtable_64_types.h | 2 +-
 arch/x86/kernel/head64.c                | 9 +++++++++
 arch/x86/mm/kaslr.c                     | 8 --------
 6 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h
index 1052a797d71d..683c9d736314 100644
--- a/arch/x86/include/asm/kaslr.h
+++ b/arch/x86/include/asm/kaslr.h
@@ -4,10 +4,6 @@
 unsigned long kaslr_get_random_long(const char *purpose);
 
 #ifdef CONFIG_RANDOMIZE_MEMORY
-extern unsigned long page_offset_base;
-extern unsigned long vmalloc_base;
-extern unsigned long vmemmap_base;
-
 void kernel_randomize_memory(void);
 #else
 static inline void kernel_randomize_memory(void) { }
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index b4a0d43248cf..a12fb4dcdd15 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -10,6 +10,10 @@
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
 
+extern unsigned long page_offset_base;
+extern unsigned long vmalloc_base;
+extern unsigned long vmemmap_base;
+
 static inline unsigned long __phys_addr_nodebug(unsigned long x)
 {
 	unsigned long y = x - __START_KERNEL_map;
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 3f5f08b010d0..0126d6bc2eb1 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -42,7 +42,7 @@
 #define __PAGE_OFFSET_BASE      _AC(0xffff880000000000, UL)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_MEMORY
+#if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
 #define __PAGE_OFFSET           page_offset_base
 #else
 #define __PAGE_OFFSET           __PAGE_OFFSET_BASE
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 06470da156ba..a9f77ead7088 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -85,7 +85,7 @@ typedef struct { pteval_t pte; } pte_t;
 #define __VMALLOC_BASE	_AC(0xffffc90000000000, UL)
 #define __VMEMMAP_BASE	_AC(0xffffea0000000000, UL)
 #endif
-#ifdef CONFIG_RANDOMIZE_MEMORY
+#if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
 #define VMALLOC_START	vmalloc_base
 #define VMEMMAP_START	vmemmap_base
 #else
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 9403633f4c7c..408ed402db1a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -38,6 +38,15 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+#if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
+unsigned long page_offset_base = __PAGE_OFFSET_BASE;
+EXPORT_SYMBOL(page_offset_base);
+unsigned long vmalloc_base = __VMALLOC_BASE;
+EXPORT_SYMBOL(vmalloc_base);
+unsigned long vmemmap_base = __VMEMMAP_BASE;
+EXPORT_SYMBOL(vmemmap_base);
+#endif
+
 static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
 {
 	return ptr - (void *)_text + (void *)physaddr;
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index af599167fe3c..e6420b18f6e0 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -53,14 +53,6 @@ static const unsigned long vaddr_end = EFI_VA_END;
 static const unsigned long vaddr_end = __START_KERNEL_map;
 #endif
 
-/* Default values */
-unsigned long page_offset_base = __PAGE_OFFSET_BASE;
-EXPORT_SYMBOL(page_offset_base);
-unsigned long vmalloc_base = __VMALLOC_BASE;
-EXPORT_SYMBOL(vmalloc_base);
-unsigned long vmemmap_base = __VMEMMAP_BASE;
-EXPORT_SYMBOL(vmemmap_base);
-
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
  * earlier during boot). The list is ordered based on virtual addresses. This
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 5/8] x86/mm: Fold p4d page table layer at runtime
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

This patch changes page table helpers to fold p4d at runtime.
The logic is the same as in <asm-generic/pgtable-nop4d.h>.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/paravirt.h |  3 ++-
 arch/x86/include/asm/pgalloc.h  |  5 ++++-
 arch/x86/include/asm/pgtable.h  | 10 +++++++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 55fa56fe4e45..e934ed6dc036 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -615,7 +615,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 
 static inline void pgd_clear(pgd_t *pgdp)
 {
-	set_pgd(pgdp, __pgd(0));
+	if (!p4d_folded)
+		set_pgd(pgdp, __pgd(0));
 }
 
 #endif  /* CONFIG_PGTABLE_LEVELS == 5 */
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b2d0cd8288aa..5c42262169d0 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -155,6 +155,8 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 #if CONFIG_PGTABLE_LEVELS > 4
 static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
 {
+	if (p4d_folded)
+		return;
 	paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
 	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
 }
@@ -179,7 +181,8 @@ extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d);
 static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
 				  unsigned long address)
 {
-	___p4d_free_tlb(tlb, p4d);
+	if (!p4d_folded)
+		___p4d_free_tlb(tlb, p4d);
 }
 
 #endif	/* CONFIG_PGTABLE_LEVELS > 4 */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 77037b6f1caa..4516a1bdcc31 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -53,7 +53,7 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #ifndef __PAGETABLE_P4D_FOLDED
 #define set_pgd(pgdp, pgd)		native_set_pgd(pgdp, pgd)
-#define pgd_clear(pgd)			native_pgd_clear(pgd)
+#define pgd_clear(pgd)			(!p4d_folded ? native_pgd_clear(pgd) : 0)
 #endif
 
 #ifndef set_p4d
@@ -847,6 +847,8 @@ static inline unsigned long p4d_index(unsigned long address)
 #if CONFIG_PGTABLE_LEVELS > 4
 static inline int pgd_present(pgd_t pgd)
 {
+	if (p4d_folded)
+		return 1;
 	return pgd_flags(pgd) & _PAGE_PRESENT;
 }
 
@@ -864,16 +866,22 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd)
 /* to find an entry in a page-table-directory. */
 static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 {
+	if (p4d_folded)
+		return (p4d_t *)pgd;
 	return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
 }
 
 static inline int pgd_bad(pgd_t pgd)
 {
+	if (p4d_folded)
+		return 0;
 	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)
 {
+	if (p4d_folded)
+		return 0;
 	/*
 	 * There is no need to do a workaround for the KNL stray
 	 * A/D bit erratum here.  PGDs only point to page tables
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 6/8] x86/mm: Replace compile-time checks for 5-level with runtime-time
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

This patch converts the of CONFIG_X86_5LEVEL check to runtime checks for
p4d folding.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/ident_map.c       |  2 +-
 arch/x86/mm/init_64.c         | 28 +++++++++++++++++-----------
 arch/x86/mm/kaslr.c           |  6 +++---
 arch/x86/power/hibernate_64.c |  4 ++--
 arch/x86/xen/mmu_pv.c         |  2 +-
 5 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index adab1595f4bd..d2df33a2cbfb 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -115,7 +115,7 @@ int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
 		result = ident_p4d_init(info, p4d, addr, next);
 		if (result)
 			return result;
-		if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		if (!p4d_folded) {
 			set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
 		} else {
 			/*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d135c613bf7b..b1b70a79fa14 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -88,12 +88,7 @@ static int __init nonx32_setup(char *str)
 }
 __setup("noexec32=", nonx32_setup);
 
-/*
- * When memory was added make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
- */
-#ifdef CONFIG_X86_5LEVEL
-void sync_global_pgds(unsigned long start, unsigned long end)
+static void sync_global_pgds_57(unsigned long start, unsigned long end)
 {
 	unsigned long addr;
 
@@ -129,8 +124,8 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
-#else
-void sync_global_pgds(unsigned long start, unsigned long end)
+
+static void sync_global_pgds_48(unsigned long start, unsigned long end)
 {
 	unsigned long addr;
 
@@ -173,7 +168,18 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
-#endif
+
+/*
+ * When memory was added make sure all the processes MM have
+ * suitable PGD entries in the local PGD level page.
+ */
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	if (!p4d_folded)
+		sync_global_pgds_57(start, end);
+	else
+		sync_global_pgds_48(start, end);
+}
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
@@ -632,7 +638,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 	unsigned long vaddr = (unsigned long)__va(paddr);
 	int i = p4d_index(vaddr);
 
-	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+	if (p4d_folded)
 		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
 
 	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
@@ -712,7 +718,7 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		if (!p4d_folded)
 			pgd_populate(&init_mm, pgd, p4d);
 		else
 			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 55433f2d1957..a691ff07d825 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -134,7 +134,7 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		if (!p4d_folded)
 			entropy = (rand % (entropy + 1)) & P4D_MASK;
 		else
 			entropy = (rand % (entropy + 1)) & PUD_MASK;
@@ -146,7 +146,7 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		if (!p4d_folded)
 			vaddr = round_up(vaddr + 1, P4D_SIZE);
 		else
 			vaddr = round_up(vaddr + 1, PUD_SIZE);
@@ -222,7 +222,7 @@ void __meminit init_trampoline(void)
 		return;
 	}
 
-	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+	if (!p4d_folded)
 		init_trampoline_p4d();
 	else
 		init_trampoline_pud();
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index a6e21fee22ea..86696ff275b9 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -66,7 +66,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
 	 * tables used by the image kernel.
 	 */
 
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+	if (!p4d_folded) {
 		p4d = (p4d_t *)get_safe_page(GFP_ATOMIC);
 		if (!p4d)
 			return -ENOMEM;
@@ -84,7 +84,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
 		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
 	set_pud(pud + pud_index(restore_jump_address),
 		__pud(__pa(pmd) | _KERNPG_TABLE));
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+	if (!p4d_folded) {
 		set_p4d(p4d + p4d_index(restore_jump_address), __p4d(__pa(pud) | _KERNPG_TABLE));
 		set_pgd(pgd + pgd_index(restore_jump_address), __pgd(__pa(p4d) | _KERNPG_TABLE));
 	} else {
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index d9ee946559c9..e39054fca812 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1214,7 +1214,7 @@ static void __init xen_cleanmfnmap(unsigned long vaddr)
 			continue;
 		xen_cleanmfnmap_p4d(p4d + i, unpin);
 	}
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+	if (!p4d_folded) {
 		set_pgd(pgd, __pgd(0));
 		xen_cleanmfnmap_free_pgtbl(p4d, unpin);
 	}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 3/8] x86/mm: Make PGDIR_SHIFT and PTRS_PER_P4D variable
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

For boot-time switching between 4- and 5-level paging we need to be able
to fold p4d page table level at runtime. It requires variable
PGDIR_SHIFT and PTRS_PER_P4D.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable_32.h       |  2 ++
 arch/x86/include/asm/pgtable_64_types.h |  7 +++++--
 arch/x86/kernel/head64.c                |  9 ++++++++-
 arch/x86/mm/dump_pagetables.c           | 11 +++--------
 arch/x86/mm/init_64.c                   |  2 +-
 arch/x86/platform/efi/efi_64.c          |  4 ++--
 6 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_32.h b/arch/x86/include/asm/pgtable_32.h
index bfab55675c16..9c3c811347b0 100644
--- a/arch/x86/include/asm/pgtable_32.h
+++ b/arch/x86/include/asm/pgtable_32.h
@@ -32,6 +32,8 @@ static inline void pgtable_cache_init(void) { }
 static inline void check_pgt_cache(void) { }
 void paging_init(void);
 
+static inline int pgd_large(pgd_t pgd) { return 0; }
+
 /*
  * Define this if things work differently on an i386 and an i486:
  * it will (on an i486) warn about kernel memory accesses that are
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index a9f77ead7088..a09f2fa91e09 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -19,6 +19,9 @@ typedef unsigned long	pgprotval_t;
 
 typedef struct { pteval_t pte; } pte_t;
 
+extern unsigned int pgdir_shift;
+extern unsigned int ptrs_per_p4d;
+
 #endif	/* !__ASSEMBLY__ */
 
 #define SHARED_KERNEL_PMD	0
@@ -28,14 +31,14 @@ typedef struct { pteval_t pte; } pte_t;
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
-#define PGDIR_SHIFT	48
+#define PGDIR_SHIFT	pgdir_shift
 #define PTRS_PER_PGD	512
 
 /*
  * 4th level page in 5-level paging case
  */
 #define P4D_SHIFT	39
-#define PTRS_PER_P4D	512
+#define PTRS_PER_P4D	ptrs_per_p4d
 #define P4D_SIZE	(_AC(1, UL) << P4D_SHIFT)
 #define P4D_MASK	(~(P4D_SIZE - 1))
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 408ed402db1a..d4e8d4beeb62 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -38,6 +38,13 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+#ifdef CONFIG_X86_5LEVEL
+unsigned int pgdir_shift = 48;
+EXPORT_SYMBOL(pgdir_shift);
+unsigned int ptrs_per_p4d = 512;
+EXPORT_SYMBOL(ptrs_per_p4d);
+#endif
+
 #if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
 unsigned long page_offset_base = __PAGE_OFFSET_BASE;
 EXPORT_SYMBOL(page_offset_base);
@@ -273,7 +280,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
 	BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
 	BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
-	BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
+	MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
 				(__START_KERNEL & PGDIR_MASK)));
 	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
 
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 0470826d2bdc..d7b3cf2320fd 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -380,14 +380,15 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
 #define p4d_none(a)  pud_none(__pud(p4d_val(a)))
 #endif
 
-#if PTRS_PER_P4D > 1
-
 static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr, unsigned long P)
 {
 	int i;
 	p4d_t *start;
 	pgprotval_t prot;
 
+	if (PTRS_PER_P4D > 1)
+		return walk_pud_level(m, st, __p4d(pgd_val(addr)), P);
+
 	start = (p4d_t *)pgd_page_vaddr(addr);
 
 	for (i = 0; i < PTRS_PER_P4D; i++) {
@@ -407,12 +408,6 @@ static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
 	}
 }
 
-#else
-#define walk_p4d_level(m,s,a,p) walk_pud_level(m,s,__p4d(pgd_val(a)),p)
-#define pgd_large(a) p4d_large(__p4d(pgd_val(a)))
-#define pgd_none(a)  p4d_none(__p4d(pgd_val(a)))
-#endif
-
 static inline bool is_hypervisor_range(int idx)
 {
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 124f1a77c181..d135c613bf7b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -143,7 +143,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		 * With folded p4d, pgd_none() is always false, we need to
 		 * handle synchonization on p4d level.
 		 */
-		BUILD_BUG_ON(pgd_none(*pgd_ref));
+		MAYBE_BUILD_BUG_ON(pgd_none(*pgd_ref));
 		p4d_ref = p4d_offset(pgd_ref, addr);
 
 		if (p4d_none(*p4d_ref))
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index c488625c9712..d6cfba3e164f 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -186,8 +186,8 @@ void efi_sync_low_kernel_mappings(void)
 	 * only span a single PGD entry and that the entry also maps
 	 * other important kernel regions.
 	 */
-	BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
-	BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
+	MAYBE_BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
+	MAYBE_BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
 			(EFI_VA_END & PGDIR_MASK));
 
 	pgd_efi = efi_pgd + pgd_index(PAGE_OFFSET);
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 1/8] x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

This patch prepare decompression code to boot-time switching between 4-
and 5-level paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 3ed26769810b..89d886c95afc 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -109,6 +109,31 @@ ENTRY(startup_32)
 	movl	$LOAD_PHYSICAL_ADDR, %ebx
 1:
 
+#ifdef CONFIG_X86_5LEVEL
+	pushl	%ebx
+
+	/* Check if leaf 7 is supported*/
+	movl	$0, %eax
+	cpuid
+	cmpl	$7, %eax
+	jb	1f
+
+	/*
+	 * Check if la57 is supported.
+	 * The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16]
+	 */
+	movl	$7, %eax
+	movl	$0, %ecx
+	cpuid
+	andl	$(1 << 16), %ecx
+	jz	1f
+
+	/* p4d page table is not folded if la57 is present */
+	movl	$0, p4d_folded(%ebp)
+1:
+	popl %ebx
+#endif
+
 	/* Target address to relocate to for decompression */
 	movl	BP_init_size(%esi), %eax
 	subl	$_end, %eax
@@ -125,9 +150,14 @@ ENTRY(startup_32)
 	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+
 #ifdef CONFIG_X86_5LEVEL
+	testl	$1, p4d_folded(%ebp)
+	jnz	1f
 	orl	$X86_CR4_LA57, %eax
+1:
 #endif
+
 	movl	%eax, %cr4
 
  /*
@@ -147,11 +177,15 @@ ENTRY(startup_32)
 	movl	%eax, 0(%edi)
 
 #ifdef CONFIG_X86_5LEVEL
+	testl	$1, p4d_folded(%ebp)
+	jnz	1f
+
 	/* Build Level 4 */
 	addl	$0x1000, %edx
 	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+1:
 #endif
 
 	/* Build Level 3 */
@@ -464,6 +498,9 @@ gdt:
 	.quad   0x0000000000000000	/* TS continued */
 gdt_end:
 
+p4d_folded:
+	.word	1
+
 #ifdef CONFIG_EFI_STUB
 efi_config:
 	.quad	0
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 8/8] x86/mm: Allow to boot without la57 if CONFIG_X86_5LEVEL=y
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

All pieces of the puzzle are in place and we can now allow to boot with
CONFIG_X86_5LEVEL=y on a machine without la57 support.

Kernel will detect that la57 is missing and fold p4d at runtime.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/required-features.h | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index d91ba04dd007..fac9a5c0abe9 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -53,12 +53,6 @@
 # define NEED_MOVBE	0
 #endif
 
-#ifdef CONFIG_X86_5LEVEL
-# define NEED_LA57	(1<<(X86_FEATURE_LA57 & 31))
-#else
-# define NEED_LA57	0
-#endif
-
 #ifdef CONFIG_X86_64
 #ifdef CONFIG_PARAVIRT
 /* Paravirtualized systems may not have PSE or PGE available */
@@ -104,7 +98,7 @@
 #define REQUIRED_MASK13	0
 #define REQUIRED_MASK14	0
 #define REQUIRED_MASK15	0
-#define REQUIRED_MASK16	(NEED_LA57)
+#define REQUIRED_MASK16	0
 #define REQUIRED_MASK17	0
 #define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 4/8] x86/mm: Handle boot-time paging mode switching at early boot
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov
In-Reply-To: <20170525203334.867-1-kirill.shutemov@linux.intel.com>

This patch adds detection of 5-level paging at boot-time and adjusts
virtual memory layout and folds p4d page table layer if needed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page_64_types.h    | 13 +++----
 arch/x86/include/asm/pgtable_64_types.h | 37 +++++++++++++-------
 arch/x86/include/asm/processor.h        |  2 +-
 arch/x86/kernel/head64.c                | 62 +++++++++++++++++++++++++--------
 arch/x86/kernel/head_64.S               | 16 +++++----
 arch/x86/mm/kaslr.c                     |  2 +-
 6 files changed, 90 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 0126d6bc2eb1..26056ef366b8 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -36,24 +36,21 @@
  * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
  * what Xen requires.
  */
-#ifdef CONFIG_X86_5LEVEL
-#define __PAGE_OFFSET_BASE      _AC(0xff10000000000000, UL)
-#else
-#define __PAGE_OFFSET_BASE      _AC(0xffff880000000000, UL)
-#endif
+#define __PAGE_OFFSET_BASE57	_AC(0xff10000000000000, UL)
+#define __PAGE_OFFSET_BASE48	_AC(0xffff880000000000, UL)
 
 #if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
 #define __PAGE_OFFSET           page_offset_base
 #else
-#define __PAGE_OFFSET           __PAGE_OFFSET_BASE
+#define __PAGE_OFFSET           __PAGE_OFFSET_BASE48
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 
 #define __START_KERNEL_map	_AC(0xffffffff80000000, UL)
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #ifdef CONFIG_X86_5LEVEL
-#define __PHYSICAL_MASK_SHIFT	52
-#define __VIRTUAL_MASK_SHIFT	56
+#define __PHYSICAL_MASK_SHIFT	(p4d_folded ? 52 : 46)
+#define __VIRTUAL_MASK_SHIFT	(p4d_folded ? 47 : 56)
 #else
 #define __PHYSICAL_MASK_SHIFT	46
 #define __VIRTUAL_MASK_SHIFT	47
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index a09f2fa91e09..46f52da75e16 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -19,6 +19,12 @@ typedef unsigned long	pgprotval_t;
 
 typedef struct { pteval_t pte; } pte_t;
 
+#ifdef CONFIG_X86_5LEVEL
+extern unsigned int p4d_folded;
+#else
+#define p4d_folded 1
+#endif
+
 extern unsigned int pgdir_shift;
 extern unsigned int ptrs_per_p4d;
 
@@ -79,23 +85,30 @@ extern unsigned int ptrs_per_p4d;
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #define MAXMEM		_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
-#ifdef CONFIG_X86_5LEVEL
-#define VMALLOC_SIZE_TB _AC(16384, UL)
-#define __VMALLOC_BASE	_AC(0xff92000000000000, UL)
-#define __VMEMMAP_BASE	_AC(0xffd4000000000000, UL)
-#else
-#define VMALLOC_SIZE_TB	_AC(32, UL)
-#define __VMALLOC_BASE	_AC(0xffffc90000000000, UL)
-#define __VMEMMAP_BASE	_AC(0xffffea0000000000, UL)
-#endif
+
+#ifndef __ASSEMBLY__
+#define __VMALLOC_BASE48	0xffffc90000000000
+#define __VMALLOC_BASE57	0xff92000000000000
+
+#define VMALLOC_SIZE_TB48	32UL
+#define VMALLOC_SIZE_TB57	16384UL
+
+#define __VMEMMAP_BASE48	0xffffea0000000000
+#define __VMEMMAP_BASE57	0xffd4000000000000
+
 #if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
 #define VMALLOC_START	vmalloc_base
+#define VMALLOC_SIZE_TB	(!p4d_folded ? VMALLOC_SIZE_TB57 : VMALLOC_SIZE_TB48)
 #define VMEMMAP_START	vmemmap_base
 #else
-#define VMALLOC_START	__VMALLOC_BASE
-#define VMEMMAP_START	__VMEMMAP_BASE
+#define VMALLOC_START	__VMALLOC_BASE48
+#define VMALLOC_SIZE_TB	VMALLOC_SIZE_TB48
+#define VMEMMAP_START	__VMEMMAP_BASE48
 #endif /* CONFIG_RANDOMIZE_MEMORY */
-#define VMALLOC_END	(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
+
+#define VMALLOC_END	(VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
+#endif
+
 #define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
 /* The module sections ends with the start of the fixmap */
 #define MODULES_END   __fix_to_virt(__end_of_fixed_addresses + 1)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 65663de9287b..92c3f33f7682 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -854,7 +854,7 @@ static inline void spin_lock_prefetch(const void *x)
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
 #define STACK_TOP		TASK_SIZE_LOW
-#define STACK_TOP_MAX		TASK_SIZE_MAX
+#define STACK_TOP_MAX		(!p4d_folded ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW)
 
 #define INIT_THREAD  {						\
 	.sp0			= TOP_OF_INIT_STACK,		\
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index d4e8d4beeb62..47629f3e32aa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -39,26 +39,54 @@ static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
 #ifdef CONFIG_X86_5LEVEL
-unsigned int pgdir_shift = 48;
+unsigned int pgdir_shift = 39;
 EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d = 512;
+unsigned int ptrs_per_p4d = 1;
 EXPORT_SYMBOL(ptrs_per_p4d);
 #endif
 
-#if defined(CONFIG_RANDOMIZE_MEMORY) || defined(CONFIG_X86_5LEVEL)
-unsigned long page_offset_base = __PAGE_OFFSET_BASE;
+unsigned long page_offset_base = __PAGE_OFFSET_BASE48;
 EXPORT_SYMBOL(page_offset_base);
-unsigned long vmalloc_base = __VMALLOC_BASE;
+unsigned long vmalloc_base = __VMALLOC_BASE48;
 EXPORT_SYMBOL(vmalloc_base);
-unsigned long vmemmap_base = __VMEMMAP_BASE;
+unsigned long vmemmap_base = __VMEMMAP_BASE48;
 EXPORT_SYMBOL(vmemmap_base);
-#endif
 
 static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
 {
 	return ptr - (void *)_text + (void *)physaddr;
 }
 
+static unsigned long __init *fixup_long(void *ptr, unsigned long physaddr)
+{
+	return fixup_pointer(ptr, physaddr);
+}
+
+#ifdef CONFIG_X86_5LEVEL
+static unsigned int __init *fixup_int(void *ptr, unsigned long physaddr)
+{
+	return fixup_pointer(ptr, physaddr);
+}
+
+static void __init check_la57_support(unsigned long physaddr)
+{
+	if (native_cpuid_eax(0) < 7)
+		return;
+
+	if (!(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
+		return;
+
+	*fixup_int(&p4d_folded, physaddr) = 0;
+	*fixup_int(&pgdir_shift, physaddr) = 48;
+	*fixup_int(&ptrs_per_p4d, physaddr) = 512;
+	*fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE57;
+	*fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE57;
+	*fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE57;
+}
+#else
+static void __init check_la57_support(unsigned long physaddr) {}
+#endif
+
 void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
@@ -68,6 +96,8 @@ void __init __startup_64(unsigned long physaddr)
 	pmdval_t *pmd, pmd_entry;
 	int i;
 
+	check_la57_support(physaddr);
+
 	/* Is the address too large? */
 	if (physaddr >> MAX_PHYSMEM_BITS)
 		for (;;);
@@ -85,9 +115,14 @@ void __init __startup_64(unsigned long physaddr)
 	/* Fixup the physical addresses in the page table */
 
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
-	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
-
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+	p = pgd + pgd_index(__START_KERNEL_map);
+	if (p4d_folded)
+		*p = (unsigned long)level3_kernel_pgt;
+	else
+		*p = (unsigned long)level4_kernel_pgt;
+	*p += _PAGE_TABLE - __START_KERNEL_map + load_delta;
+
+	if (!p4d_folded) {
 		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
 		p4d[511] += load_delta;
 	}
@@ -109,7 +144,7 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+	if (!p4d_folded) {
 		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
 		i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
@@ -151,8 +186,7 @@ void __init __startup_64(unsigned long physaddr)
 	}
 
 	/* Fixup phys_base */
-	p = fixup_pointer(&phys_base, physaddr);
-	*p += load_delta;
+	*fixup_long(&phys_base, physaddr) += load_delta;
 }
 
 /* Wipe all early page tables except for the kernel symbol map */
@@ -185,7 +219,7 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+	if (p4d_folded)
 		p4d_p = pgd_p;
 	else if (pgd)
 		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 6225550883df..2009d9849e98 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -104,7 +104,10 @@ ENTRY(secondary_startup_64)
 	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
+	testl	$1, p4d_folded(%rip)
+	jnz	1f
 	orl	$X86_CR4_LA57, %ecx
+1:
 #endif
 	movq	%rcx, %cr4
 
@@ -333,12 +336,7 @@ GLOBAL(name)
 
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
-	.fill	511,8,0
-#ifdef CONFIG_X86_5LEVEL
-	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
-#else
-	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
-#endif
+	.fill	512,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -417,6 +415,12 @@ ENTRY(phys_base)
 	.quad   0x0000000000000000
 EXPORT_SYMBOL(phys_base)
 
+#ifdef CONFIG_X86_5LEVEL
+ENTRY(p4d_folded)
+	.word	1
+EXPORT_SYMBOL(p4d_folded)
+#endif
+
 #include "../../x86/xen/xen-head.S"
 	
 	__PAGE_ALIGNED_BSS
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index e6420b18f6e0..55433f2d1957 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -43,7 +43,7 @@
  * before. You also need to add a BUILD_BUG_ON() in kernel_randomize_memory() to
  * ensure that this order is correct and won't be changed.
  */
-static const unsigned long vaddr_start = __PAGE_OFFSET_BASE;
+static const unsigned long vaddr_start = __PAGE_OFFSET_BASE48;
 
 #if defined(CONFIG_X86_ESPFIX64)
 static const unsigned long vaddr_end = ESPFIX_BASE_ADDR;
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCHv1, RFC 0/8] Boot-time switching between 4- and 5-level paging
From: Kirill A. Shutemov @ 2017-05-25 20:33 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here' my first attempt to bring boot-time between 4- and 5-level paging.
It looks not too terrible to me. I've expected it to be worse.

The basic idea is to implement the same logic as pgtable-nop4d.h provides,
but at runtime.

Runtime folding is only implemented for CONFIG_X86_5LEVEL=y case. With the
option disabled, we do compile-time folding.

Initially, I tried to fold pgd instread. I've got to shell, but it
required a lot of hacks as kernel threats pgd in a special way.

Few things are broken (see patch 7/8) and many things are not yet tested.
So more work is required.

I also haven't evaluated performance impact. We can look into some form of
boot-time code patching later if required.

Please review. Any feedback is welcome.

Kirill A. Shutemov (8):
  x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
  x86/mm: Make virtual memory layout movable for CONFIG_X86_5LEVEL
  x86/mm: Make PGDIR_SHIFT and PTRS_PER_P4D variable
  x86/mm: Handle boot-time paging mode switching at early boot
  x86/mm: Fold p4d page table layer at runtime
  x86/mm: Replace compile-time checks for 5-level with runtime-time
  x86/mm: Hacks for boot-time switching between 4- and 5-level paging
  x86/mm: Allow to boot without la57 if CONFIG_X86_5LEVEL=y

 arch/x86/Kconfig                         |  4 +-
 arch/x86/boot/compressed/head_64.S       | 37 ++++++++++++++++++
 arch/x86/entry/entry_64.S                |  5 +++
 arch/x86/include/asm/kaslr.h             |  4 --
 arch/x86/include/asm/page_64.h           |  4 ++
 arch/x86/include/asm/page_64_types.h     | 15 +++-----
 arch/x86/include/asm/paravirt.h          |  3 +-
 arch/x86/include/asm/pgalloc.h           |  5 ++-
 arch/x86/include/asm/pgtable.h           | 10 ++++-
 arch/x86/include/asm/pgtable_32.h        |  2 +
 arch/x86/include/asm/pgtable_64_types.h  | 46 ++++++++++++++--------
 arch/x86/include/asm/processor.h         |  2 +-
 arch/x86/include/asm/required-features.h |  8 +---
 arch/x86/kernel/head64.c                 | 66 ++++++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                | 22 +++++++----
 arch/x86/mm/dump_pagetables.c            | 11 ++----
 arch/x86/mm/ident_map.c                  |  2 +-
 arch/x86/mm/init_64.c                    | 30 +++++++++------
 arch/x86/mm/kaslr.c                      | 16 ++------
 arch/x86/platform/efi/efi_64.c           |  4 +-
 arch/x86/power/hibernate_64.c            |  4 +-
 arch/x86/xen/Kconfig                     |  2 +-
 arch/x86/xen/mmu_pv.c                    |  2 +-
 23 files changed, 208 insertions(+), 96 deletions(-)

-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration
From: Marcelo Tosatti @ 2017-05-25 19:35 UTC (permalink / raw)
  To: Luiz Capitulino, Christoph Lameter
  Cc: linux-kernel, linux-mm, Rik van Riel, Linux RT Users, cmetcalf
In-Reply-To: <20170519134934.0c298882@redhat.com>

On Fri, May 19, 2017 at 01:49:34PM -0400, Luiz Capitulino wrote:
> On Fri, 19 May 2017 12:13:26 -0500 (CDT)
> Christoph Lameter <cl@linux.com> wrote:
> 
> > > So why are you against integrating this simple, isolated patch which
> > > does not affect how current logic works?  
> > 
> > Frankly the argument does not make sense. Vmstat updates occur very
> > infrequently (probably even less than you IPIs and the other OS stuff that
> > also causes additional latencies that you seem to be willing to tolerate).
> 
> Infrequently is not good enough. It only has to happen once to
> cause a problem.
> 
> Also, IPIs take a few us, usually less. That's not a problem. In our
> testing we see the preemption caused by the kworker take 10us or
> even more. I've never seeing it take 3us. I'm not saying this is not
> true, I'm saying if this is causing a problem to us it will cause
> a problem to other people too.

Christoph, 

Some data:

 qemu-system-x86-12902 [003] ....1..  6517.621557: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fc
 qemu-system-x86-12902 [003] d...2..  6517.621557: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] ....1..  6517.621560: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fc
 qemu-system-x86-12902 [003] d...2..  6517.621561: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] ....1..  6517.621563: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fc
 qemu-system-x86-12902 [003] d...2..  6517.621564: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] d..h1..  6517.622037: empty_smp_call_func:
empty_smp_call_func ran
 qemu-system-x86-12902 [003] ....1..  6517.622040: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000fb
 qemu-system-x86-12902 [003] d...2..  6517.622041: kvm_entry: vcpu 2

empty_smp_function_call: 3us.

 qemu-system-x86-12902 [003] ....1..  6517.702739: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6517.702741: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] d..h1..  6517.702758: scheduler_tick
<-update_process_times
 qemu-system-x86-12902 [003] ....1..  6517.702760: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6517.702760: kvm_entry: vcpu 2

scheduler_tick: 2us.

 qemu-system-x86-12902 [003] ....1..  6518.194570: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6518.194571: kvm_entry: vcpu 2
 qemu-system-x86-12902 [003] ....1..  6518.194591: kvm_exit: reason
EXTERNAL_INTERRUPT rip 0x4004f1 info 0 800000ef
 qemu-system-x86-12902 [003] d...2..  6518.194593: kvm_entry: vcpu 2

That, and the 10us number for kworker mentioned above changes your
point of view of your 
"Frankly the argument does not make sense. Vmstat updates occur very
infrequently (probably even less than you IPIs and the other OS stuff that
also causes additional latencies that you seem to be willing to tolerate).
And you can configure the interval of vmstat updates freely.... Set
 the vmstat_interval to 60 seconds instead of 2 for a try? Is that rare
enough?" 

Argument? We're showing you the data that this is causing a latency
problem for us.

Is there anything you'd like to be improved on the patch?
Is there anything you dislike about it?

> No, we'd have to set it high enough to disable it and this will
> affect all CPUs.
> 
> Something that crossed my mind was to add a new tunable to set
> the vmstat_interval for each CPU, this way we could essentially
> disable it to the CPUs where DPDK is running. What's the implications
> of doing this besides not getting up to date stats in /proc/vmstat
> (which I still have to confirm would be OK)? Can this break anything
> in the kernel for example?

Well, you get incorrect statistics. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Matthias Kaehlcke @ 2017-05-25 17:49 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ingo Molnar, David Rientjes, Andrew Morton, Christoph Lameter,
	Pekka Enberg, Joonsoo Kim, linux-mm, linux-kernel,
	Douglas Anderson, Mark Brown, David Miller
In-Reply-To: <1495730933.29207.6.camel@perches.com>

Hi Joe,

El Thu, May 25, 2017 at 09:48:53AM -0700 Joe Perches ha dit:

> On Thu, 2017-05-25 at 09:14 -0700, Matthias Kaehlcke wrote:
> > clang doesn't raise
> > warnings about unused static inline functions in headers.
> 
> Is any "#include" file a "header" to clang or only "*.h" files?
> 
> For instance:
> 
> The kernel has ~500 .c files that other .c files #include.
> Are unused inline functions in those .c files reported?

Any "#include" file is a "header" to clang, no warnings are generated
for unused inline functions in included .c files.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v2] mm/hugetlb: Report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
From: Punit Agrawal @ 2017-05-25 17:33 UTC (permalink / raw)
  To: James Morse; +Cc: linux-mm, Kirill A . Shutemov, Andrew Morton, Naoya Horiguchi
In-Reply-To: <20170525171035.16359-1-james.morse@arm.com>

James Morse <james.morse@arm.com> writes:

> KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
> FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
> finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a special
> case.
>
> When huge pages are involved, this doesn't work so well. get_user_pages()
> calls follow_hugetlb_page(), which stops early if it receives
> VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning -EFAULT to
> the caller. The step to map this to -EHWPOISON based on the FOLL_ flags
> is missing. The hwpoison special case is skipped, and -EFAULT is returned
> to user-space, causing Qemu or kvmtool to exit.
>
> Instead, move this VM_FAULT_ to errno mapping code into a header file
> and use it from faultin_page(), follow_hugetlb_page() and faultin_page().

You've got faultin_page() listed twice - I think you meant to add
fixup_user_fault().

With that addressed,

Acked-by: Punit Agrawal <punit.agrawal@arm.com>

>
> With this, KVM works as expected.
>
> CC: Punit Agrawal <punit.agrawal@arm.com>
> CC: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since v1:
>  * Fixed checkpatch warnings (oops, wrong file)
>  * Added vm_fault_to_errno() call to faultin_page() as Naoya Horiguchi
>    suggested.
>
> This isn't a problem for arm64 today as we haven't enabled MEMORY_FAILURE,
> but I can't see any reason this doesn't happen on x86 too, so I think this
> should be a fix. This doesn't apply earlier than stable's v4.11.1 due to
> all sorts of cleanup. My best offer is:
> Cc: <stable@vger.kernel.org> # 4.11.1
>
>  include/linux/mm.h | 11 +++++++++++
>  mm/gup.c           | 20 ++++++++------------
>  mm/hugetlb.c       |  5 +++++
>  3 files changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cb17c6b97de..b892e95d4929 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2327,6 +2327,17 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
>  #define FOLL_COW	0x4000	/* internal GUP flag */
>  
> +static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
> +{
> +	if (vm_fault & VM_FAULT_OOM)
> +		return -ENOMEM;
> +	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> +		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
> +	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
> +		return -EFAULT;
> +	return 0;
> +}
> +
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
> diff --git a/mm/gup.c b/mm/gup.c
> index d9e6fddcc51f..b3c7214d710d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -407,12 +407,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  
>  	ret = handle_mm_fault(vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> -		if (ret & VM_FAULT_OOM)
> -			return -ENOMEM;
> -		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> -			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
> -		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
> -			return -EFAULT;
> +		int err = vm_fault_to_errno(ret, *flags);
> +
> +		if (err)
> +			return err;
>  		BUG();
>  	}
>  
> @@ -723,12 +721,10 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>  	ret = handle_mm_fault(vma, address, fault_flags);
>  	major |= ret & VM_FAULT_MAJOR;
>  	if (ret & VM_FAULT_ERROR) {
> -		if (ret & VM_FAULT_OOM)
> -			return -ENOMEM;
> -		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> -			return -EHWPOISON;
> -		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
> -			return -EFAULT;
> +		int err = vm_fault_to_errno(ret, 0);
> +
> +		if (err)
> +			return err;
>  		BUG();
>  	}
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e5828875f7bb..3eedb187e549 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4170,6 +4170,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			}
>  			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
>  			if (ret & VM_FAULT_ERROR) {
> +				int err = vm_fault_to_errno(ret, flags);
> +
> +				if (err)
> +					return err;
> +
>  				remainder = 0;
>  				break;
>  			}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v6 06/10] mm: thp: check pmd migration entry in common path
From: kbuild test robot @ 2017-05-25 17:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild-all, n-horiguchi, kirill.shutemov, linux-kernel, linux-mm,
	akpm, minchan, vbabka, mgorman, mhocko, khandual, zi.yan,
	dnellans, dave.hansen
In-Reply-To: <20170525141945.56028-7-zi.yan@sent.com>

[-- Attachment #1: Type: text/plain, Size: 4237 bytes --]

Hi Zi,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc2 next-20170525]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170526-003749
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x016-201721 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from fs//proc/task_mmu.c:15:0:
   include/linux/swapops.h: In function 'swp_entry_to_pmd':
   include/linux/swapops.h:222:16: warning: missing braces around initializer [-Wmissing-braces]
     return (pmd_t){{ 0 }};
                   ^
   include/linux/swapops.h:222:16: note: (near initialization for '(anonymous)')
   In file included from include/asm-generic/bug.h:4:0,
                    from arch/x86/include/asm/bug.h:81,
                    from include/linux/bug.h:4,
                    from include/linux/mmdebug.h:4,
                    from include/linux/mm.h:8,
                    from fs//proc/task_mmu.c:1:
   In function 'pmd_to_swp_entry.isra.34',
       inlined from 'pagemap_pmd_range' at fs//proc/task_mmu.c:1242:16:
>> include/linux/compiler.h:525:38: error: call to '__compiletime_assert_215' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:508:4: note: in definition of macro '__compiletime_assert'
       prefix ## suffix();    \
       ^~~~~~
   include/linux/compiler.h:525:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
>> include/linux/swapops.h:215:2: note: in expansion of macro 'BUILD_BUG'
     BUILD_BUG();
     ^~~~~~~~~

vim +/BUILD_BUG +215 include/linux/swapops.h

2182e56bf Zi Yan 2017-05-25  199  static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
2182e56bf Zi Yan 2017-05-25  200  		struct page *page)
2182e56bf Zi Yan 2017-05-25  201  {
2182e56bf Zi Yan 2017-05-25  202  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25  203  }
2182e56bf Zi Yan 2017-05-25  204  
2182e56bf Zi Yan 2017-05-25  205  static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
2182e56bf Zi Yan 2017-05-25  206  		struct page *new)
2182e56bf Zi Yan 2017-05-25  207  {
2182e56bf Zi Yan 2017-05-25  208  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25  209  }
2182e56bf Zi Yan 2017-05-25  210  
2182e56bf Zi Yan 2017-05-25  211  static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
2182e56bf Zi Yan 2017-05-25  212  
2182e56bf Zi Yan 2017-05-25  213  static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
2182e56bf Zi Yan 2017-05-25  214  {
2182e56bf Zi Yan 2017-05-25 @215  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25  216  	return swp_entry(0, 0);
2182e56bf Zi Yan 2017-05-25  217  }
2182e56bf Zi Yan 2017-05-25  218  
2182e56bf Zi Yan 2017-05-25  219  static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
2182e56bf Zi Yan 2017-05-25  220  {
2182e56bf Zi Yan 2017-05-25  221  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25  222  	return (pmd_t){{ 0 }};
2182e56bf Zi Yan 2017-05-25  223  }

:::::: The code at line 215 was first introduced by commit
:::::: 2182e56bfe80b55d1a9a8e8de04e6a5578b1dce0 mm: thp: enable thp migration in generic path

:::::: TO: Zi Yan <zi.yan@cs.rutgers.edu>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31611 bytes --]

^ permalink raw reply

* Re: [PATCH v4 8/8] mm: rmap: Use correct helper when poisoning hugepages
From: Punit Agrawal @ 2017-05-25 17:22 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, akpm, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar
In-Reply-To: <201705250342.fHpDVCsZ%fengguang.wu@intel.com>

kbuild test robot <lkp@intel.com> writes:

> Hi Punit,
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.12-rc2 next-20170524]
> [cannot apply to mmotm/master]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Punit-Agrawal/Support-for-contiguous-pte-hugepages/20170524-221905
> config: x86_64-kexec (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
>
> All errors (new ones prefixed by >>):
>
>    mm/rmap.c: In function 'try_to_unmap_one':
>>> mm/rmap.c:1386:5: error: implicit declaration of function 'set_huge_swap_pte_at' [-Werror=implicit-function-declaration]
>         set_huge_swap_pte_at(mm, address,
>         ^~~~~~~~~~~~~~~~~~~~
>    cc1: some warnings being treated as errors
>
> vim +/set_huge_swap_pte_at +1386 mm/rmap.c
>
>   1380	
>   1381			if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
>   1382				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
>   1383				if (PageHuge(page)) {
>   1384					int nr = 1 << compound_order(page);
>   1385					hugetlb_count_sub(nr, mm);
>> 1386					set_huge_swap_pte_at(mm, address,
>   1387							     pvmw.pte, pteval,
>   1388							     vma_mmu_pagesize(vma));
>   1389				} else {
>

Thanks for the report. The build failure is caused due to missing
function definition for set_huge_swap_pte_at() when CONFIG_HUGETLB_PAGE
is disabled. I've posted an update to Patch 7 where the function is
introduced to fix this issue.

> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v6 05/10] mm: thp: enable thp migration in generic path
From: Zi Yan @ 2017-05-25 17:19 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, n-horiguchi, kirill.shutemov, linux-kernel, linux-mm,
	akpm, minchan, vbabka, mgorman, mhocko, khandual, dnellans,
	dave.hansen
In-Reply-To: <201705260111.PCjyEyr4%fengguang.wu@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1156 bytes --]

On 25 May 2017, at 13:06, kbuild test robot wrote:

> Hi Zi,
>
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on v4.12-rc2 next-20170525]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170526-003749
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-randconfig-x016-201721 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All warnings (new ones prefixed by >>):
>
>    In file included from fs/proc/task_mmu.c:15:0:
>    include/linux/swapops.h: In function 'swp_entry_to_pmd':
>>> include/linux/swapops.h:222:16: warning: missing braces around initializer [-Wmissing-braces]
>      return (pmd_t){{ 0 }};
>                    ^

The braces are added to eliminate the warning from "m68k-linux-gcc (GCC) 4.9.0",
which has the bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119.

--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply

* [PATCH v4.1 7/8] mm/hugetlb: Introduce set_huge_swap_pte_at() helper
From: Punit Agrawal @ 2017-05-25 17:13 UTC (permalink / raw)
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar
In-Reply-To: <20170524115409.31309-8-punit.agrawal@arm.com>

set_huge_pte_at(), an architecture callback to populate hugepage ptes,
does not provide the range of virtual memory that is targeted. This
leads to ambiguity when dealing with swap entries on architectures that
support hugepages consisting of contiguous ptes.

Fix the problem by introducing an overridable helper for architectures
needing this support. The helper is called when populating the page
tables with swap entries. The size of the targeted region is provided to
the helper to help determine the number of entries to be updated.

Provide a default implementation that maintains the current behaviour.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
---
Hi Andrew,

This update fixes the build failure reported by 0-day when
CONFIG_HUGETLB_PAGE is disabled.

Thanks,
Punit

Change from v4:

* Added an empty definition for set_huge_swap_pte_at() when
  CONFIG_HUGETLB_PAGE is disabled

 include/linux/hugetlb.h | 13 +++++++++++++
 mm/hugetlb.c            |  8 +++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 23010a3b2047..af859564509e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -502,6 +502,14 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
 {
 	atomic_long_sub(l, &mm->hugetlb_usage);
 }
+
+#ifndef set_huge_swap_pte_at
+static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned long sz)
+{
+	set_huge_pte_at(mm, addr, ptep, pte);
+}
+#endif
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
@@ -546,6 +554,11 @@ static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
 static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
 {
 }
+
+static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned long sz)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ddfed20cd637..e3052c16d29a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3263,9 +3263,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				 */
 				make_migration_entry_read(&swp_entry);
 				entry = swp_entry_to_pte(swp_entry);
-				set_huge_pte_at(src, addr, src_pte, entry);
+				set_huge_swap_pte_at(src, addr, src_pte,
+						     entry, sz);
 			}
-			set_huge_pte_at(dst, addr, dst_pte, entry);
+			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -4277,7 +4278,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 				make_migration_entry_read(&entry);
 				newpte = swp_entry_to_pte(entry);
-				set_huge_pte_at(mm, address, ptep, newpte);
+				set_huge_swap_pte_at(mm, address, ptep,
+						     newpte, huge_page_size(h));
 				pages++;
 			}
 			spin_unlock(ptl);
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v2] mm/hugetlb: Report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
From: James Morse @ 2017-05-25 17:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Andrew Morton, Punit Agrawal, James Morse,
	Naoya Horiguchi

KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a special
case.

When huge pages are involved, this doesn't work so well. get_user_pages()
calls follow_hugetlb_page(), which stops early if it receives
VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning -EFAULT to
the caller. The step to map this to -EHWPOISON based on the FOLL_ flags
is missing. The hwpoison special case is skipped, and -EFAULT is returned
to user-space, causing Qemu or kvmtool to exit.

Instead, move this VM_FAULT_ to errno mapping code into a header file
and use it from faultin_page(), follow_hugetlb_page() and faultin_page().

With this, KVM works as expected.

CC: Punit Agrawal <punit.agrawal@arm.com>
CC: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
Changes since v1:
 * Fixed checkpatch warnings (oops, wrong file)
 * Added vm_fault_to_errno() call to faultin_page() as Naoya Horiguchi
   suggested.

This isn't a problem for arm64 today as we haven't enabled MEMORY_FAILURE,
but I can't see any reason this doesn't happen on x86 too, so I think this
should be a fix. This doesn't apply earlier than stable's v4.11.1 due to
all sorts of cleanup. My best offer is:
Cc: <stable@vger.kernel.org> # 4.11.1

 include/linux/mm.h | 11 +++++++++++
 mm/gup.c           | 20 ++++++++------------
 mm/hugetlb.c       |  5 +++++
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6b97de..b892e95d4929 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2327,6 +2327,17 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 #define FOLL_COW	0x4000	/* internal GUP flag */
 
+static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
+{
+	if (vm_fault & VM_FAULT_OOM)
+		return -ENOMEM;
+	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
+		return -EFAULT;
+	return 0;
+}
+
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
diff --git a/mm/gup.c b/mm/gup.c
index d9e6fddcc51f..b3c7214d710d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -407,12 +407,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 
 	ret = handle_mm_fault(vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
-		if (ret & VM_FAULT_OOM)
-			return -ENOMEM;
-		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
-			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
-		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
-			return -EFAULT;
+		int err = vm_fault_to_errno(ret, *flags);
+
+		if (err)
+			return err;
 		BUG();
 	}
 
@@ -723,12 +721,10 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	ret = handle_mm_fault(vma, address, fault_flags);
 	major |= ret & VM_FAULT_MAJOR;
 	if (ret & VM_FAULT_ERROR) {
-		if (ret & VM_FAULT_OOM)
-			return -ENOMEM;
-		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
-			return -EHWPOISON;
-		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
-			return -EFAULT;
+		int err = vm_fault_to_errno(ret, 0);
+
+		if (err)
+			return err;
 		BUG();
 	}
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e5828875f7bb..3eedb187e549 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4170,6 +4170,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
+				int err = vm_fault_to_errno(ret, flags);
+
+				if (err)
+					return err;
+
 				remainder = 0;
 				break;
 			}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
From: Johannes Weiner @ 2017-05-25 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm
In-Reply-To: <20170525153819.GA7349@dhcp22.suse.cz>

On Thu, May 25, 2017 at 05:38:19PM +0200, Michal Hocko wrote:
> On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> > On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> [...]
> > > > How to react on an OOM - is definitely a policy, which depends
> > > > on the workload. Nothing is changing here from how it's working now,
> > > > except now kernel will choose a victim cgroup, and kill the victim cgroup
> > > > rather than a process.
> > > 
> > > There is a _big_ difference. The current implementation just tries
> > > to recover from the OOM situation without carying much about the
> > > consequences on the workload. This is the last resort and a services for
> > > the _system_ to get back to sane state. You are trying to make it more
> > > clever and workload aware and that is inevitable going to depend on the
> > > specific workload. I really do think we cannot simply hardcode any
> > > policy into the kernel for this purpose and that is why I would like to
> > > see a discussion about how to do that in a more extensible way. This
> > > might be harder to implement now but it I believe it will turn out
> > > better longerm.
> > 
> > And that's where I still maintain that this isn't really a policy
> > change. Because what this code does ISN'T more clever, and the OOM
> > killer STILL IS a last-resort thing.
> 
> The thing I wanted to point out is that what and how much to kill
> definitely depends on the usecase. We currently kill all tasks which
> share the mm struct because that is the smallest unit that can unpin
> user memory. And that makes a lot of sense to me as a general default.
> I would call any attempt to guess tasks belonging to the same
> workload/job as a "more clever".

Yeah, I agree it needs to be configurable. But a memory domain is not
a random guess. It's a core concept of the VM at this point. The fact
that the OOM killer cannot handle it is pretty weird and goes way
beyond "I wish we could have some smarter heuristics to choose from."

> > We don't need any elaborate
> > just-in-time evaluation of what each entity is worth. We just want to
> > kill the biggest job, not the biggest MM. Just like you wouldn't want
> > just the biggest VMA unmapped and freed, since it leaves your process
> > incoherent, killing one process leaves a job incoherent.
> > 
> > I understand that making it fully configurable is a tempting thought,
> > because you'd offload all responsibility to userspace.
> 
> It is not only tempting it is also the only place which can define
> a more advanced OOM semantic sanely IMHO.

Why do you think that?

Everything the user would want to dynamically program in the kernel,
say with bpf, they could do in userspace and then update the scores
for each group and task periodically.

The only limitation is that you have to recalculate and update the
scoring tree every once in a while, whereas a bpf program could
evaluate things just-in-time. But for that to matter in practice, OOM
kills would have to be a fairly hot path.

> > > > > And both kinds of workloads (services/applications and individual
> > > > > processes run by users) can co-exist on the same host - consider the
> > > > > default systemd setup, for instance.
> > > > > 
> > > > > IMHO it would be better to give users a choice regarding what they
> > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > 
> > > > The last thing we want to do, is to compare processes with cgroups.
> > > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > > mostly for backward-compatibility. But I don't think it should be a
> > > > per-cgroup configuration option, which we will support forever.
> > > 
> > > I can clearly see a demand for "this is definitely more important
> > > container than others so do not kill" usecases. I can also see demand
> > > for "do not kill this container running for X days". And more are likely
> > > to pop out.
> > 
> > That can all be done with scoring.
> 
> Maybe. But that requires somebody to tweak the scoring which can be hard
> from trivial.

Why is sorting and picking in userspace harder than sorting and
picking in the kernel?

> > This was 10 years ago, and nobody has missed anything critical enough
> > to implement something beyond scoring. So I don't see why we'd need to
> > do it for cgroups all of a sudden.
> > 
> > They're nothing special, they just group together things we have been
> > OOM killing for ages. So why shouldn't we use the same config model?
> > 
> > It seems to me, what we need for this patch is 1) a way to toggle
> > whether the processes and subgroups of a group are interdependent or
> > independent and 2) configurable OOM scoring per cgroup analogous to
> > what we have per process already. If a group is marked interdependent
> > we stop descending into it and evaluate it as one entity. Otherwise,
> > we go look for victims in its subgroups and individual processes.
> 
> This would be an absolute minimum, yes.
> 
> But I am still not convinced we should make this somehow "hardcoded" in
> the core oom killer handler.  Why cannot we allow a callback for modules
> and implement all these non-default OOM strategies in modules? We have
> oom_notify_list already but that doesn't get the full oom context which
> could be fixable but I suspect this is not the greatest interface at
> all. We do not really need multiple implementations of the OOM handling
> at the same time and a simple callback should be sufficient
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 04c9143a8625..926a36625322 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -995,6 +995,13 @@ bool out_of_memory(struct oom_control *oc)
>  	}
>  
>  	/*
> +	 * Try a registered oom handler to run and fallback to the default
> +	 * implementation if it cannot handle the current oom context
> +	 */
> +	if (oom_handler && oom_handler(oc))
> +		return true;

I think this would take us back to the dark days where memcg entry
points where big opaque branches in the generic VM code, which then
implemented their own thing, redundant locking, redundant LRU lists,
which was all very hard to maintain.

> +	/*
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> 
> Please note that I haven't explored how much of the infrastructure
> needed for the OOM decision making is available to modules. But we can
> export a lot of what we currently have in oom_kill.c. I admit it might
> turn out that this is simply not feasible but I would like this to be at
> least explored before we go and implement yet another hardcoded way to
> handle (see how I didn't use policy ;)) OOM situation.

;)

My doubt here is mainly that we'll see many (or any) real-life cases
materialize that cannot be handled with cgroups and scoring. These are
powerful building blocks on which userspace can implement all kinds of
policy and sorting algorithms.

So this seems like a lot of churn and complicated code to handle one
extension. An extension that implements basic functionality.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v6 05/10] mm: thp: enable thp migration in generic path
From: kbuild test robot @ 2017-05-25 17:06 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild-all, n-horiguchi, kirill.shutemov, linux-kernel, linux-mm,
	akpm, minchan, vbabka, mgorman, mhocko, khandual, zi.yan,
	dnellans, dave.hansen
In-Reply-To: <20170525141945.56028-6-zi.yan@sent.com>

[-- Attachment #1: Type: text/plain, Size: 1704 bytes --]

Hi Zi,

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.12-rc2 next-20170525]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170526-003749
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x016-201721 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from fs/proc/task_mmu.c:15:0:
   include/linux/swapops.h: In function 'swp_entry_to_pmd':
>> include/linux/swapops.h:222:16: warning: missing braces around initializer [-Wmissing-braces]
     return (pmd_t){{ 0 }};
                   ^
   include/linux/swapops.h:222:16: note: (near initialization for '(anonymous)')

vim +222 include/linux/swapops.h

   206			struct page *new)
   207	{
   208		BUILD_BUG();
   209	}
   210	
   211	static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
   212	
   213	static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
   214	{
   215		BUILD_BUG();
   216		return swp_entry(0, 0);
   217	}
   218	
   219	static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
   220	{
   221		BUILD_BUG();
 > 222		return (pmd_t){{ 0 }};
   223	}
   224	
   225	static inline int is_pmd_migration_entry(pmd_t pmd)
   226	{
   227		return 0;
   228	}
   229	#endif
   230	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31611 bytes --]

^ permalink raw reply

* Re: [PATCH v6 06/10] mm: thp: check pmd migration entry in common path
From: kbuild test robot @ 2017-05-25 17:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild-all, n-horiguchi, kirill.shutemov, linux-kernel, linux-mm,
	akpm, minchan, vbabka, mgorman, mhocko, khandual, zi.yan,
	dnellans, dave.hansen
In-Reply-To: <20170525141945.56028-7-zi.yan@sent.com>

[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]

Hi Zi,

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.12-rc2 next-20170525]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170526-003749
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x017-201721 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from arch/x86/mm/gup.c:12:0:
   include/linux/swapops.h: In function 'swp_entry_to_pmd':
>> include/linux/swapops.h:222:2: warning: braces around scalar initializer
     return (pmd_t){{ 0 }};
     ^~~~~~
   include/linux/swapops.h:222:2: note: (near initialization for '(anonymous).pmd')

vim +222 include/linux/swapops.h

2182e56bf Zi Yan 2017-05-25  206  		struct page *new)
2182e56bf Zi Yan 2017-05-25  207  {
2182e56bf Zi Yan 2017-05-25  208  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25  209  }
2182e56bf Zi Yan 2017-05-25  210  
2182e56bf Zi Yan 2017-05-25  211  static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
2182e56bf Zi Yan 2017-05-25  212  
2182e56bf Zi Yan 2017-05-25  213  static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
2182e56bf Zi Yan 2017-05-25  214  {
2182e56bf Zi Yan 2017-05-25  215  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25  216  	return swp_entry(0, 0);
2182e56bf Zi Yan 2017-05-25  217  }
2182e56bf Zi Yan 2017-05-25  218  
2182e56bf Zi Yan 2017-05-25  219  static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
2182e56bf Zi Yan 2017-05-25  220  {
2182e56bf Zi Yan 2017-05-25  221  	BUILD_BUG();
2182e56bf Zi Yan 2017-05-25 @222  	return (pmd_t){{ 0 }};
2182e56bf Zi Yan 2017-05-25  223  }
2182e56bf Zi Yan 2017-05-25  224  
2182e56bf Zi Yan 2017-05-25  225  static inline int is_pmd_migration_entry(pmd_t pmd)
2182e56bf Zi Yan 2017-05-25  226  {
2182e56bf Zi Yan 2017-05-25  227  	return 0;
2182e56bf Zi Yan 2017-05-25  228  }
2182e56bf Zi Yan 2017-05-25  229  #endif
2182e56bf Zi Yan 2017-05-25  230  

:::::: The code at line 222 was first introduced by commit
:::::: 2182e56bfe80b55d1a9a8e8de04e6a5578b1dce0 mm: thp: enable thp migration in generic path

:::::: TO: Zi Yan <zi.yan@cs.rutgers.edu>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31023 bytes --]

^ permalink raw reply

* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Joe Perches @ 2017-05-25 16:48 UTC (permalink / raw)
  To: Matthias Kaehlcke, Ingo Molnar
  Cc: David Rientjes, Andrew Morton, Christoph Lameter, Pekka Enberg,
	Joonsoo Kim, linux-mm, linux-kernel, Douglas Anderson, Mark Brown,
	David Miller
In-Reply-To: <20170525161406.GT141096@google.com>

On Thu, 2017-05-25 at 09:14 -0700, Matthias Kaehlcke wrote:
> clang doesn't raise
> warnings about unused static inline functions in headers.

Is any "#include" file a "header" to clang or only "*.h" files?

For instance:

The kernel has ~500 .c files that other .c files #include.
Are unused inline functions in those .c files reported?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Matthias Kaehlcke @ 2017-05-25 16:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Rientjes, Andrew Morton, Christoph Lameter, Pekka Enberg,
	Joonsoo Kim, linux-mm, linux-kernel, Douglas Anderson, Mark Brown,
	David Miller
In-Reply-To: <20170525055207.udcphnshuzl2gkps@gmail.com>

El Thu, May 25, 2017 at 07:52:07AM +0200 Ingo Molnar ha dit:

> 
> * Matthias Kaehlcke <mka@chromium.org> wrote:
> 
> > El Wed, May 24, 2017 at 02:01:15PM -0700 David Rientjes ha dit:
> > 
> > > GCC explicitly does not warn for unused static inline functions for
> > > -Wunused-function.  The manual states:
> > > 
> > > 	Warn whenever a static function is declared but not defined or
> > > 	a non-inline static function is unused.
> > > 
> > > Clang does warn for static inline functions that are unused.
> > > 
> > > It turns out that suppressing the warnings avoids potentially complex
> > > #ifdef directives, which also reduces LOC.
> > > 
> > > Supress the warning for clang.
> > > 
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > 
> > As expressed earlier in other threads, I don't think gcc's behavior is
> > preferable in this case. The warning on static inline functions (only
> > in .c files) allows to detect truly unused code. About 50% of the
> > warnings I have looked into so far fall into this category.
> > 
> > In my opinion it is more valuable to detect dead code than not having
> > a few more __maybe_unused attributes (there aren't really that many
> > instances, at least with x86 and arm64 defconfig). In most cases it is
> > not necessary to use #ifdef, it is an option which is preferred by
> > some maintainers. The reduced LOC is arguable, since dectecting dead
> > code allows to remove it.
> 
> Static inline functions in headers are often not dead code.

Sure, there is no intention to delete these and clang doesn't raise
warnings about unused static inline functions in headers.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: add counters for different page fault types
From: Luigi Semenzato @ 2017-05-25 15:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linux Memory Management List, Douglas Anderson, Dmitry Torokhov,
	Sonny Rao
In-Reply-To: <20170525001915.GA14999@bbox>

Thank you Minchan, that's certainly simpler and I am annoyed that I
didn't consider that :/

By a quick look, there are a few differences but maybe they don't matter?

1. can a major (anon) fault result in a hit in the swap cache?  So
pswpin will not get incremented and the fault will be counted as a
file fault.

2. pswpin also counts swapins from readahead --- which however I think
we have turned off (at least I hope so, since readahead isn't useful
with zram, in fact maybe zram should log a warning when readahead is
greater than 0 because I think that's the default).

Incidentally, I understand anon and file faults, but what's a shmem fault?

Thanks!





On Wed, May 24, 2017 at 5:19 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Luigi,
>
> On Wed, May 24, 2017 at 12:41:26PM -0700, Luigi Semenzato wrote:
>> VM event counters are added to keep track of anonymous
>> vs. file vs. shmem page faults.  They are: pgmajfault_a,
>> pgmajfault_f and pgmajfault_s.  These are useful to
>> analyze system performance, particularly when the cost
>> of a fault for a file page is very different from that
>> of an anonymous page, as would happen, for instance, in
>> the presence of zram.
>
> Yeb, it's useful with zram and the way I have used is
>
>         PGMAJFAULT - PSWPIN
>
> With that, I can get how many portion in majfault stems from
> file-backed pages while others are from swap.
>
> Can't it meet for your requirement?
>
> Thanks.
>
>>
>> The PGMAJFAULT counter is no longer directly maintained.
>> Instead the three new counters are added whenever the
>> total count is needed.
>>
>> Signed-off-by: Luigi Semenzato <semenzato@google.com>
>> ---
>>  arch/s390/appldata/appldata_mem.c | 9 ++++++++-
>>  drivers/virtio/virtio_balloon.c   | 5 ++++-
>>  fs/dax.c                          | 5 +++--
>>  fs/ncpfs/mmap.c                   | 4 ++--
>>  include/linux/vm_event_item.h     | 1 +
>>  mm/filemap.c                      | 4 ++--
>>  mm/memcontrol.c                   | 7 ++++++-
>>  mm/memory.c                       | 4 ++--
>>  mm/shmem.c                        | 4 ++--
>>  mm/vmstat.c                       | 5 +++++
>>  10 files changed, 35 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
>> index 598df5708501..adb8b6412ffa 100644
>> --- a/arch/s390/appldata/appldata_mem.c
>> +++ b/arch/s390/appldata/appldata_mem.c
>> @@ -62,6 +62,9 @@ struct appldata_mem_data {
>>       u64 pgalloc;            /* page allocations */
>>       u64 pgfault;            /* page faults (major+minor) */
>>       u64 pgmajfault;         /* page faults (major only) */
>> +     u64 pgmajfault_s;       /* shmem page faults (major only) */
>> +     u64 pgmajfault_a;       /* anonymous page faults (major only) */
>> +     u64 pgmajfault_f;       /* file page faults (major only) */
>>  // <-- New in 2.6
>>
>>  } __packed;
>> @@ -93,7 +96,11 @@ static void appldata_get_mem_data(void *data)
>>       mem_data->pgalloc    = ev[PGALLOC_NORMAL];
>>       mem_data->pgalloc    += ev[PGALLOC_DMA];
>>       mem_data->pgfault    = ev[PGFAULT];
>> -     mem_data->pgmajfault = ev[PGMAJFAULT];
>> +     mem_data->pgmajfault =
>> +             ev[PGMAJFAULT_S] + ev[PGMAJFAULT_A] + ev[PGMAJFAULT_F];
>> +     mem_data->pgmajfault_s = ev[PGMAJFAULT_S];
>> +     mem_data->pgmajfault_a = ev[PGMAJFAULT_A];
>> +     mem_data->pgmajfault_f = ev[PGMAJFAULT_F];
>>
>>       si_meminfo(&val);
>>       mem_data->sharedram = val.sharedram;
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 408c174ef0d5..ed7100645d25 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -259,7 +259,10 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
>>                               pages_to_bytes(events[PSWPIN]));
>>       update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
>>                               pages_to_bytes(events[PSWPOUT]));
>> -     update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>> +     update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
>> +                 events[PGMAJFAULT_S] +
>> +                 events[PGMAJFAULT_A] +
>> +                 events[PGMAJFAULT_F]);
>>       update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
>>  #endif
>>       update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>> diff --git a/fs/dax.c b/fs/dax.c
>> index c22eaf162f95..3c92f2af0514 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1200,8 +1200,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>>       switch (iomap.type) {
>>       case IOMAP_MAPPED:
>>               if (iomap.flags & IOMAP_F_NEW) {
>> -                     count_vm_event(PGMAJFAULT);
>> -                     mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
>> +                     count_vm_event(PGMAJFAULT_F);
>> +                     mem_cgroup_count_vm_event(vmf->vma->vm_mm,
>> +                                               PGMAJFAULT_F);
>>                       major = VM_FAULT_MAJOR;
>>               }
>>               error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
>> diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
>> index 0c3905e0542e..ae04b9d86288 100644
>> --- a/fs/ncpfs/mmap.c
>> +++ b/fs/ncpfs/mmap.c
>> @@ -88,8 +88,8 @@ static int ncp_file_mmap_fault(struct vm_fault *vmf)
>>        * fetches from the network, here the analogue of disk.
>>        * -- nyc
>>        */
>> -     count_vm_event(PGMAJFAULT);
>> -     mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
>> +     count_vm_event(PGMAJFAULT_F);
>> +     mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT_F);
>>       return VM_FAULT_MAJOR;
>>  }
>>
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index d84ae90ccd5c..2d2df45d4520 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -27,6 +27,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>               FOR_ALL_ZONES(PGSCAN_SKIP),
>>               PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
>>               PGFAULT, PGMAJFAULT,
>> +             PGMAJFAULT_S, PGMAJFAULT_A, PGMAJFAULT_F,
>>               PGLAZYFREED,
>>               PGREFILL,
>>               PGSTEAL_KSWAPD,
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 6f1be573a5e6..d2b187b648b3 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2225,8 +2225,8 @@ int filemap_fault(struct vm_fault *vmf)
>>       } else if (!page) {
>>               /* No page in the page cache at all */
>>               do_sync_mmap_readahead(vmf->vma, ra, file, offset);
>> -             count_vm_event(PGMAJFAULT);
>> -             mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
>> +             count_vm_event(PGMAJFAULT_F);
>> +             mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT_F);
>>               ret = VM_FAULT_MAJOR;
>>  retry_find:
>>               page = find_get_page(mapping, offset);
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 94172089f52f..045361f2b8fa 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3122,6 +3122,8 @@ unsigned int memcg1_events[] = {
>>       PGPGOUT,
>>       PGFAULT,
>>       PGMAJFAULT,
>> +     PGMAJFAULT_A,
>> +     PGMAJFAULT_F,
>>  };
>>
>>  static const char *const memcg1_event_names[] = {
>> @@ -3129,6 +3131,8 @@ static const char *const memcg1_event_names[] = {
>>       "pgpgout",
>>       "pgfault",
>>       "pgmajfault",
>> +     "pgmajfault_a",
>> +     "pgmajfault_f",
>>  };
>>
>>  static int memcg_stat_show(struct seq_file *m, void *v)
>> @@ -5229,7 +5233,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
>>       /* Accumulated memory events */
>>
>>       seq_printf(m, "pgfault %lu\n", events[PGFAULT]);
>> -     seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT]);
>> +     seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT_S] +
>> +                     events[PGMAJFAULT_A] + events[PGMAJFAULT_F]);
>>
>>       seq_printf(m, "workingset_refault %lu\n",
>>                  stat[WORKINGSET_REFAULT]);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 6ff5d729ded0..2c2b7b3ffe7f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2718,8 +2718,8 @@ int do_swap_page(struct vm_fault *vmf)
>>
>>               /* Had to read the page from swap area: Major fault */
>>               ret = VM_FAULT_MAJOR;
>> -             count_vm_event(PGMAJFAULT);
>> -             mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
>> +             count_vm_event(PGMAJFAULT_A);
>> +             mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT_A);
>>       } else if (PageHWPoison(page)) {
>>               /*
>>                * hwpoisoned dirty swapcache pages are kept for killing
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index e67d6ba4e98e..5eea045575c4 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1644,9 +1644,9 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>>                       /* Or update major stats only when swapin succeeds?? */
>>                       if (fault_type) {
>>                               *fault_type |= VM_FAULT_MAJOR;
>> -                             count_vm_event(PGMAJFAULT);
>> +                             count_vm_event(PGMAJFAULT_S);
>>                               mem_cgroup_count_vm_event(charge_mm,
>> -                                                       PGMAJFAULT);
>> +                                                       PGMAJFAULT_S);
>>                       }
>>                       /* Here we actually start the io */
>>                       page = shmem_swapin(swap, gfp, info, index);
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 76f73670200a..741bb14761cd 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -995,6 +995,9 @@ const char * const vmstat_text[] = {
>>
>>       "pgfault",
>>       "pgmajfault",
>> +     "pgmajfault_s",
>> +     "pgmajfault_a",
>> +     "pgmajfault_f",
>>       "pglazyfreed",
>>
>>       "pgrefill",
>> @@ -1511,6 +1514,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>>       all_vm_events(v);
>>       v[PGPGIN] /= 2;         /* sectors -> kbytes */
>>       v[PGPGOUT] /= 2;
>> +     /* Add up page faults */
>> +     v[PGMAJFAULT] = v[PGMAJFAULT_S] + v[PGMAJFAULT_A] + v[PGMAJFAULT_F];
>>  #endif
>>       return (unsigned long *)m->private + *pos;
>>  }
>> --
>> 2.13.0.219.gdb65acc882-goog
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox