linux-mm.kvack.org archive mirror
* [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
@ 2017-10-09 16:09 Kirill A. Shutemov
  2017-10-09 16:54 ` Dave Hansen
  0 siblings, 1 reply; 5+ messages in thread
From: Kirill A. Shutemov @ 2017-10-09 16:09 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov, Andi Kleen,
	linux-mm, linux-kernel, Kirill A. Shutemov

[
  The patch is based on my boot-time switching patchset and would not apply
  directly to current upstream, but I would appreciate early feedback.
]

This patch addresses a shortcoming in the current boot process on machines
that support 5-level paging.

If the bootloader enables 64-bit mode with 4-level paging, we need to
switch over to 5-level paging. The switch requires disabling paging.
This works fine if the kernel itself is loaded below 4G.

If the bootloader puts the kernel above 4G (I'm not sure anybody does
this), we lose control as soon as paging is disabled, because the code
becomes unreachable.
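To make the failure mode concrete, here is a minimal C sketch of why code above 4G cannot survive the switch: once the CPU runs 32-bit code with paging disabled, linear addresses equal physical ones and instruction fetches go through a 32-bit EIP, so anything at or above 4G is out of reach. The helper names are hypothetical and only illustrate the arithmetic; they are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: 4G is the limit of a 32-bit EIP with paging off. */
#define FOUR_GB (UINT64_C(1) << 32)

/* Can [phys, phys + size) still be fetched once paging is disabled? */
bool reachable_without_paging(uint64_t phys, uint64_t size)
{
	return size <= FOUR_GB && phys <= FOUR_GB - size;
}

/* The 32-bit EIP that a 64-bit physical address would collapse to. */
uint32_t eip_for(uint64_t phys)
{
	return (uint32_t)phys;	/* bits 63:32 are simply lost */
}
```

For example, an image at 0x100001000 would leave EIP pointing at 0x1000, which is unrelated memory.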

This patch implements a trampoline in lower memory to handle this
situation.

I use MBR memory (0x7c00) to store the trampoline code.

Apart from the trampoline itself, we also need a place to store the top-level
page table in lower memory, as we don't have a way to load a 64-bit value
into CR3 from 32-bit mode. We only really need 8 bytes there, as we only use
the very first entry of the page table.

For this I use 0x7000.
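The resulting low-memory layout can be sketched in C as below. This is an illustration under stated assumptions, not kernel code: the helper names are hypothetical, and it assumes CR3 holds a page-aligned address with its low flag bits clear, in which case the patch's "leaq 0x7(%rdi), %rax" is the same as ORing in Present | Read/Write | User. With LA57, bits 56:48 of a linear address select the top-level entry, so entry 0 covers everything below 2^48; pointing it at the existing 4-level table keeps those addresses translating exactly as before.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fixed low-memory addresses chosen in the patch. */
#define LVL5_TRAMPOLINE	0x7c00u
#define LVL5_PGTABLE	0x7000u

/* Present | Read/Write | User: the 0x7 added to CR3 by the patch. */
#define PTE_PRW_USER	0x7u

/* Entry 0 of the new top-level table: current CR3 plus flag bits. */
uint64_t lvl5_entry0(uint64_t cr3)
{
	/* Assumption: CR3 is page aligned with PWT/PCD clear, so add == or. */
	assert((cr3 & 0xfff) == 0);
	return cr3 | PTE_PRW_USER;
}

/* A 32-bit "movl %eax, %cr3" can only name a table below 4G. */
bool cr3_loadable_from_32bit(uint64_t table_phys)
{
	return table_phys < (UINT64_C(1) << 32);
}

/* With LA57, bits 56:48 of a linear address select the top-level entry. */
unsigned int pml5_index(uint64_t va)
{
	return (unsigned int)((va >> 48) & 0x1ff);
}

/* The table must be page aligned, below 4G, and its 8 written bytes
 * must not run into the trampoline copy. */
bool lvl5_layout_ok(void)
{
	return (LVL5_PGTABLE & 0xfffu) == 0 &&
	       cr3_loadable_from_32bit(LVL5_PGTABLE) &&
	       LVL5_PGTABLE + 8 <= LVL5_TRAMPOLINE;
}
```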

I'm not sure this placement is entirely safe, but I don't see a better
spot for them.

We only need them for a very short time, until the main kernel image sets
up its own page tables.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S | 68 +++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index cefe4958fda9..049a289342bd 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -288,6 +288,22 @@ ENTRY(startup_64)
 	leaq	boot_stack_end(%rbx), %rsp
 
 #ifdef CONFIG_X86_5LEVEL
+/*
+ * We need a trampoline in lower memory to switch from 4- to 5-level paging
+ * for cases when the bootloader put the kernel above 4G but didn't enable
+ * 5-level paging for us.
+ *
+ * Here we use MBR memory to store trampoline code.
+ *
+ * We also need the top-level page table in lower memory, as we don't have a way
+ * to load a 64-bit value into CR3 from 32-bit mode. We only need 8 bytes there
+ * as we only use the very first entry of the page table.
+ *
+ * Here we use 0x7000 as top-level page table.
+ */
+#define LVL5_TRAMPOLINE	0x7c00
+#define LVL5_PGTABLE	0x7000
+
 	/* Preserve RBX across CPUID */
 	movq	%rbx, %r8
 
@@ -323,29 +339,37 @@ ENTRY(startup_64)
 	 * long mode would trigger #GP. So we need to switch off long mode
 	 * first.
 	 *
-	 * NOTE: This is not going to work if bootloader put us above 4G
-	 * limit.
+	 * We use trampoline in lower memory to handle situation when
+	 * bootloader put the kernel image above 4G.
 	 *
 	 * The first step is go into compatibility mode.
 	 */
 
-	/* Clear additional page table */
-	leaq	lvl5_pgtable(%rbx), %rdi
-	xorq	%rax, %rax
-	movq	$(PAGE_SIZE/8), %rcx
-	rep	stosq
+	/* Copy trampoline code in place */
+	movq	%rsi, %r9
+	leaq	lvl5_trampoline(%rip), %rsi
+	movq	$LVL5_TRAMPOLINE, %rdi
+	movq	$(lvl5_trampoline_end - lvl5_trampoline), %rcx
+	rep	movsb
+	movq	%r9, %rsi
 
 	/*
-	 * Setup current CR3 as the first and only entry in a new top level
+	 * Setup current CR3 as the first and the only entry in a new top level
 	 * page table.
 	 */
 	movq	%cr3, %rdi
 	leaq	0x7 (%rdi), %rax
-	movq	%rax, lvl5_pgtable(%rbx)
+	movq	%rax, LVL5_PGTABLE
+
+	/*
+	 * Load the address of lvl5 into RDI.
+	 * It will be used as the return address from the trampoline.
+	 */
+	leaq	lvl5(%rip), %rdi
 
 	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
 	pushq	$__KERNEL32_CS
-	leaq	compatible_mode(%rip), %rax
+	movq	$LVL5_TRAMPOLINE, %rax
 	pushq	%rax
 	lretq
 lvl5:
@@ -488,9 +512,9 @@ relocated:
  */
 	jmp	*%rax
 
-	.code32
 #ifdef CONFIG_X86_5LEVEL
-compatible_mode:
+	.code32
+lvl5_trampoline:
 	/* Setup data and stack segments */
 	movl	$__KERNEL_DS, %eax
 	movl	%eax, %ds
@@ -502,7 +526,7 @@ compatible_mode:
 	movl	%eax, %cr0
 
 	/* Point CR3 to 5-level paging */
-	leal	lvl5_pgtable(%ebx), %eax
+	movl	$LVL5_PGTABLE, %eax
 	movl	%eax, %cr3
 
 	/* Enable PAE and LA57 mode */
@@ -510,14 +534,9 @@ compatible_mode:
 	orl	$(X86_CR4_PAE | X86_CR4_LA57), %eax
 	movl	%eax, %cr4
 
-	/* Calculate address we are running at */
-	call	1f
-1:	popl	%edi
-	subl	$1b, %edi
-
 	/* Prepare stack for far return to Long Mode */
 	pushl	$__KERNEL_CS
-	leal	lvl5(%edi), %eax
+	movl	$(lvl5_enabled - lvl5_trampoline + LVL5_TRAMPOLINE), %eax
 	push	%eax
 
 	/* Enable paging back */
@@ -525,8 +544,15 @@ compatible_mode:
 	movl	%eax, %cr0
 
 	lret
+
+	.code64
+lvl5_enabled:
+	/* Return from trampoline */
+	jmp	*%rdi
+lvl5_trampoline_end:
 #endif
 
+	.code32
 no_longmode:
 	/* This isn't an x86-64 CPU so hang */
 1:
@@ -584,7 +610,3 @@ boot_stack_end:
 	.balign 4096
 pgtable:
 	.fill BOOT_PGT_SIZE, 1, 0
-#ifdef CONFIG_X86_5LEVEL
-lvl5_pgtable:
-	.fill PAGE_SIZE, 1, 0
-#endif
-- 
2.14.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
  2017-10-09 16:09 [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G Kirill A. Shutemov
@ 2017-10-09 16:54 ` Dave Hansen
  2017-10-09 17:09   ` Kirill A. Shutemov
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Hansen @ 2017-10-09 16:54 UTC
  To: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin
  Cc: Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov, Andi Kleen,
	linux-mm, linux-kernel

On 10/09/2017 09:09 AM, Kirill A. Shutemov wrote:
> Apart from the trampoline itself, we also need a place to store the top-level
> page table in lower memory, as we don't have a way to load a 64-bit value
> into CR3 from 32-bit mode. We only really need 8 bytes there, as we only use
> the very first entry of the page table.

Oh, and this is why you have to move "lvl5_pgtable" out of the kernel image?

> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index cefe4958fda9..049a289342bd 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -288,6 +288,22 @@ ENTRY(startup_64)
>  	leaq	boot_stack_end(%rbx), %rsp
>  
>  #ifdef CONFIG_X86_5LEVEL
> +/*
> + * We need a trampoline in lower memory to switch from 4- to 5-level paging
> + * for cases when the bootloader put the kernel above 4G but didn't enable
> + * 5-level paging for us.
> + *
> + * Here we use MBR memory to store trampoline code.
> + *
> + * We also need the top-level page table in lower memory, as we don't have a way
> + * to load a 64-bit value into CR3 from 32-bit mode. We only need 8 bytes there
> + * as we only use the very first entry of the page table.
> + *
> + * Here we use 0x7000 as top-level page table.
> + */
> +#define LVL5_TRAMPOLINE	0x7c00
> +#define LVL5_PGTABLE	0x7000
> +
>  	/* Preserve RBX across CPUID */
>  	movq	%rbx, %r8
>  
> @@ -323,29 +339,37 @@ ENTRY(startup_64)
>  	 * long mode would trigger #GP. So we need to switch off long mode
>  	 * first.
>  	 *
> -	 * NOTE: This is not going to work if bootloader put us above 4G
> -	 * limit.
> +	 * We use trampoline in lower memory to handle situation when
> +	 * bootloader put the kernel image above 4G.
>  	 *
>  	 * The first step is go into compatibility mode.
>  	 */
>  
> -	/* Clear additional page table */
> -	leaq	lvl5_pgtable(%rbx), %rdi
> -	xorq	%rax, %rax
> -	movq	$(PAGE_SIZE/8), %rcx
> -	rep	stosq
> +	/* Copy trampoline code in place */
> +	movq	%rsi, %r9
> +	leaq	lvl5_trampoline(%rip), %rsi
> +	movq	$LVL5_TRAMPOLINE, %rdi
> +	movq	$(lvl5_trampoline_end - lvl5_trampoline), %rcx
> +	rep	movsb
> +	movq	%r9, %rsi

This needs to get more heavily commented, like the use of r9 to stash
%rsi.  Why do you do that, btw?  I don't see it getting reused at first
glance.

I think it will also be really nice to differentiate "lvl5_trampoline"
from "LVL5_TRAMPOLINE".  Maybe add "src" and "dst" to them or something.

>  	/*
> -	 * Setup current CR3 as the first and only entry in a new top level
> +	 * Setup current CR3 as the first and the only entry in a new top level
>  	 * page table.
>  	 */
>  	movq	%cr3, %rdi
>  	leaq	0x7 (%rdi), %rax
> -	movq	%rax, lvl5_pgtable(%rbx)
> +	movq	%rax, LVL5_PGTABLE
> +
> +	/*
> +	 * Load the address of lvl5 into RDI.
> +	 * It will be used as the return address from the trampoline.
> +	 */
> +	leaq	lvl5(%rip), %rdi

Is there a reason to do a 'lea' here instead of just shoving the address
in directly?  Is this a shorter instruction or something?

>  	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
>  	pushq	$__KERNEL32_CS
> -	leaq	compatible_mode(%rip), %rax
> +	movq	$LVL5_TRAMPOLINE, %rax
>  	pushq	%rax
>  	lretq
>  lvl5:
> @@ -488,9 +512,9 @@ relocated:
>   */
>  	jmp	*%rax
>  
> -	.code32
>  #ifdef CONFIG_X86_5LEVEL
> -compatible_mode:
> +	.code32
> +lvl5_trampoline:
>  	/* Setup data and stack segments */
>  	movl	$__KERNEL_DS, %eax
>  	movl	%eax, %ds
> @@ -502,7 +526,7 @@ compatible_mode:
>  	movl	%eax, %cr0
>  
>  	/* Point CR3 to 5-level paging */
> -	leal	lvl5_pgtable(%ebx), %eax
> +	movl	$LVL5_PGTABLE, %eax
>  	movl	%eax, %cr3
>  
>  	/* Enable PAE and LA57 mode */
> @@ -510,14 +534,9 @@ compatible_mode:
>  	orl	$(X86_CR4_PAE | X86_CR4_LA57), %eax
>  	movl	%eax, %cr4
>  
> -	/* Calculate address we are running at */
> -	call	1f
> -1:	popl	%edi
> -	subl	$1b, %edi
> -
>  	/* Prepare stack for far return to Long Mode */
>  	pushl	$__KERNEL_CS
> -	leal	lvl5(%edi), %eax
> +	movl	$(lvl5_enabled - lvl5_trampoline + LVL5_TRAMPOLINE), %eax

This loads the trampoline address of "lvl5_enabled", right?  That'd be
handy to spell out explicitly.

>  	push	%eax
>  
>  	/* Enable paging back */
> @@ -525,8 +544,15 @@ compatible_mode:
>  	movl	%eax, %cr0
>  
>  	lret
> +
> +	.code64
> +lvl5_enabled:
> +	/* Return from trampoline */
> +	jmp	*%rdi
> +lvl5_trampoline_end:
>  #endif
>  
> +	.code32
>  no_longmode:
>  	/* This isn't an x86-64 CPU so hang */
>  1:
> @@ -584,7 +610,3 @@ boot_stack_end:
>  	.balign 4096
>  pgtable:
>  	.fill BOOT_PGT_SIZE, 1, 0
> -#ifdef CONFIG_X86_5LEVEL
> -lvl5_pgtable:
> -	.fill PAGE_SIZE, 1, 0
> -#endif
> 

* Re: [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
  2017-10-09 16:54 ` Dave Hansen
@ 2017-10-09 17:09   ` Kirill A. Shutemov
  2017-10-12 23:07     ` Eric W. Biederman
  0 siblings, 1 reply; 5+ messages in thread
From: Kirill A. Shutemov @ 2017-10-09 17:09 UTC
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, Andi Kleen, linux-mm, linux-kernel

On Mon, Oct 09, 2017 at 09:54:53AM -0700, Dave Hansen wrote:
> On 10/09/2017 09:09 AM, Kirill A. Shutemov wrote:
> > Apart from the trampoline itself, we also need a place to store the top-level
> > page table in lower memory, as we don't have a way to load a 64-bit value
> > into CR3 from 32-bit mode. We only really need 8 bytes there, as we only use
> > the very first entry of the page table.
> 
> Oh, and this is why you have to move "lvl5_pgtable" out of the kernel image?

Right. I initialize the new location of the top-level page table directly.

> > diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> > index cefe4958fda9..049a289342bd 100644
> > --- a/arch/x86/boot/compressed/head_64.S
> > +++ b/arch/x86/boot/compressed/head_64.S
> > @@ -288,6 +288,22 @@ ENTRY(startup_64)
> >  	leaq	boot_stack_end(%rbx), %rsp
> >  
> >  #ifdef CONFIG_X86_5LEVEL
> > +/*
> > + * We need a trampoline in lower memory to switch from 4- to 5-level paging
> > + * for cases when the bootloader put the kernel above 4G but didn't enable
> > + * 5-level paging for us.
> > + *
> > + * Here we use MBR memory to store trampoline code.
> > + *
> > + * We also need the top-level page table in lower memory, as we don't have a way
> > + * to load a 64-bit value into CR3 from 32-bit mode. We only need 8 bytes there
> > + * as we only use the very first entry of the page table.
> > + *
> > + * Here we use 0x7000 as top-level page table.
> > + */
> > +#define LVL5_TRAMPOLINE	0x7c00
> > +#define LVL5_PGTABLE	0x7000
> > +
> >  	/* Preserve RBX across CPUID */
> >  	movq	%rbx, %r8
> >  
> > @@ -323,29 +339,37 @@ ENTRY(startup_64)
> >  	 * long mode would trigger #GP. So we need to switch off long mode
> >  	 * first.
> >  	 *
> > -	 * NOTE: This is not going to work if bootloader put us above 4G
> > -	 * limit.
> > +	 * We use trampoline in lower memory to handle situation when
> > +	 * bootloader put the kernel image above 4G.
> >  	 *
> >  	 * The first step is go into compatibility mode.
> >  	 */
> >  
> > -	/* Clear additional page table */
> > -	leaq	lvl5_pgtable(%rbx), %rdi
> > -	xorq	%rax, %rax
> > -	movq	$(PAGE_SIZE/8), %rcx
> > -	rep	stosq
> > +	/* Copy trampoline code in place */
> > +	movq	%rsi, %r9
> > +	leaq	lvl5_trampoline(%rip), %rsi
> > +	movq	$LVL5_TRAMPOLINE, %rdi
> > +	movq	$(lvl5_trampoline_end - lvl5_trampoline), %rcx
> > +	rep	movsb
> > +	movq	%r9, %rsi
> 
> This needs to get more heavily commented, like the use of r9 to stash
> %rsi.  Why do you do that, btw?  I don't see it getting reused at first
> glance.

%rsi holds a pointer to real_mode_data. It needs to be preserved.

I'll add more comments.
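For reference, "rep movsb" takes its source from RSI, its destination from RDI and its byte count from RCX, and leaves RSI/RDI advanced and RCX at zero. The C model below is a sketch that assumes the direction flag is clear; "struct movsb_regs", "rep_movsb" and "copy_trampoline" are made up for illustration. It shows why RSI, which carries the real_mode_data pointer, has to be parked in R9 around the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file: just the three registers rep movsb uses. */
struct movsb_regs {
	uintptr_t rsi;	/* source pointer (real_mode_data before the copy) */
	uintptr_t rdi;	/* destination pointer */
	size_t    rcx;	/* byte count */
};

/* Models "rep movsb" with DF=0: copies RCX bytes, advancing RSI and RDI. */
void rep_movsb(struct movsb_regs *r)
{
	while (r->rcx) {
		*(uint8_t *)r->rdi = *(const uint8_t *)r->rsi;
		r->rsi++;
		r->rdi++;
		r->rcx--;
	}
}

/* Same register dance as the patch: stash RSI in "r9", copy, restore. */
void copy_trampoline(struct movsb_regs *r, const void *src, void *dst,
		     size_t len)
{
	uintptr_t r9 = r->rsi;		/* movq %rsi, %r9 */

	r->rsi = (uintptr_t)src;	/* leaq lvl5_trampoline(%rip), %rsi */
	r->rdi = (uintptr_t)dst;	/* movq $LVL5_TRAMPOLINE, %rdi */
	r->rcx = len;			/* movq $(end - start), %rcx */
	rep_movsb(r);			/* rep movsb */
	r->rsi = r9;			/* movq %r9, %rsi */
}
```

Without the save and restore, RSI would end up pointing just past the source copy instead of at real_mode_data.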

> I think it will also be really nice to differentiate "lvl5_trampoline"
> from "LVL5_TRAMPOLINE".  Maybe add "src" and "dst" to them or something.

Makes sense. Thanks.

> >  	/*
> > -	 * Setup current CR3 as the first and only entry in a new top level
> > +	 * Setup current CR3 as the first and the only entry in a new top level
> >  	 * page table.
> >  	 */
> >  	movq	%cr3, %rdi
> >  	leaq	0x7 (%rdi), %rax
> > -	movq	%rax, lvl5_pgtable(%rbx)
> > +	movq	%rax, LVL5_PGTABLE
> > +
> > +	/*
> > +	 * Load the address of lvl5 into RDI.
> > +	 * It will be used as the return address from the trampoline.
> > +	 */
> > +	leaq	lvl5(%rip), %rdi
> 
> Is there a reason to do a 'lea' here instead of just shoving the address
> in directly?  Is this a shorter instruction or something?

This code can be loaded anywhere in memory, and we need to calculate the
absolute address of the label here.
AFAIK, "lea <label>(%rip), <register>" is the idiomatic way to do this.
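A RIP-relative "lea" works because the instruction encodes only the distance from the end of the instruction to the label. The CPU adds that distance to the run-time RIP, so the result follows the image wherever it is loaded, while an absolute immediate would be fixed at link time. A small C sketch of the arithmetic (hypothetical helper names; the addresses in the checks are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Link time: the stored disp32 is label minus the next instruction. */
int32_t riprel_disp(uint64_t label_link, uint64_t next_insn_link)
{
	int64_t d = (int64_t)(label_link - next_insn_link);

	assert(d >= INT32_MIN && d <= INT32_MAX);	/* must fit in disp32 */
	return (int32_t)d;
}

/* Run time: effective address = RIP of the next instruction + disp32. */
uint64_t riprel_ea(uint64_t next_insn_runtime, int32_t disp)
{
	return next_insn_runtime + (uint64_t)(int64_t)disp;
}
```

Because the displacement is the same no matter where the image sits, loading the code 4G higher moves the computed address of "lvl5" by exactly 4G too.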

> >  	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
> >  	pushq	$__KERNEL32_CS
> > -	leaq	compatible_mode(%rip), %rax
> > +	movq	$LVL5_TRAMPOLINE, %rax
> >  	pushq	%rax
> >  	lretq
> >  lvl5:
> > @@ -488,9 +512,9 @@ relocated:
> >   */
> >  	jmp	*%rax
> >  
> > -	.code32
> >  #ifdef CONFIG_X86_5LEVEL
> > -compatible_mode:
> > +	.code32
> > +lvl5_trampoline:
> >  	/* Setup data and stack segments */
> >  	movl	$__KERNEL_DS, %eax
> >  	movl	%eax, %ds
> > @@ -502,7 +526,7 @@ compatible_mode:
> >  	movl	%eax, %cr0
> >  
> >  	/* Point CR3 to 5-level paging */
> > -	leal	lvl5_pgtable(%ebx), %eax
> > +	movl	$LVL5_PGTABLE, %eax
> >  	movl	%eax, %cr3
> >  
> >  	/* Enable PAE and LA57 mode */
> > @@ -510,14 +534,9 @@ compatible_mode:
> >  	orl	$(X86_CR4_PAE | X86_CR4_LA57), %eax
> >  	movl	%eax, %cr4
> >  
> > -	/* Calculate address we are running at */
> > -	call	1f
> > -1:	popl	%edi
> > -	subl	$1b, %edi
> > -
> >  	/* Prepare stack for far return to Long Mode */
> >  	pushl	$__KERNEL_CS
> > -	leal	lvl5(%edi), %eax
> > +	movl	$(lvl5_enabled - lvl5_trampoline + LVL5_TRAMPOLINE), %eax
> 
> This loads the trampoline address of "lvl5_enabled", right?  That'd be
> handy to spell out explicitly.

Yep, will do.
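Spelled out, the expression takes the offset of lvl5_enabled within the copied block and adds it to the copy's destination, LVL5_TRAMPOLINE. That address is then consumed by a 32-bit "lret": CS is pushed first and the target second, so "lret" pops the target into EIP and then CS. A minimal C sketch (the function and struct names are hypothetical; the addresses and selector value used with it are invented):

```c
#include <assert.h>
#include <stdint.h>

#define LVL5_TRAMPOLINE	0x7c00u	/* copy destination used by the patch */

/* Where "label" lands once [src_start, src_end) is copied to "dst". */
uint32_t relocated_label(uint64_t label, uint64_t src_start,
			 uint64_t src_end, uint32_t dst)
{
	assert(label >= src_start && label < src_end);	/* inside the copy */
	return dst + (uint32_t)(label - src_start);
}

/* Stack image consumed by a 32-bit "lret": EIP on top, CS above it. */
struct far_ret32 {
	uint32_t eip;
	uint32_t cs;
};

/* Mirrors "pushl $cs; push %eax; lret" on a downward-growing stack. */
struct far_ret32 lret_frame(uint32_t cs, uint32_t eip)
{
	uint32_t stack[2];
	uint32_t *esp = stack + 2;
	struct far_ret32 f;

	*--esp = cs;		/* pushl $__KERNEL_CS */
	*--esp = eip;		/* push %eax */
	f.eip = esp[0];		/* lret pops EIP first ... */
	f.cs = esp[1];		/* ... then CS */
	return f;
}
```

With lvl5_trampoline linked at 0x1000500 and lvl5_enabled at 0x1000580, the far return would land at 0x7c80.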

-- 
 Kirill A. Shutemov

* Re: [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
  2017-10-09 17:09   ` Kirill A. Shutemov
@ 2017-10-12 23:07     ` Eric W. Biederman
  2017-10-13  4:03       ` Kirill A. Shutemov
  0 siblings, 1 reply; 5+ messages in thread
From: Eric W. Biederman @ 2017-10-12 23:07 UTC
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, Andi Kleen, linux-mm, linux-kernel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Mon, Oct 09, 2017 at 09:54:53AM -0700, Dave Hansen wrote:
>> On 10/09/2017 09:09 AM, Kirill A. Shutemov wrote:
>> > Apart from trampoline itself we also need place to store top level page
>> > table in lower memory as we don't have a way to load 64-bit value into
>> > CR3 from 32-bit mode. We only really need 8-bytes there as we only use
>> > the very first entry of the page table.
>> 
>> Oh, and this is why you have to move "lvl5_pgtable" out of the kernel image?
>
> Right. I initialize the new location of top level page table directly.

So just a quick note.  I have a fuzzy memory of people loading their
kernels above 4G physical because they did not have any memory below
4G.

That might be a very specialized case if my memory is correct because
cpu startup has to have a trampoline below 1MB.  So I don't know how
that works.  But I do seem to remember someone mentioning it.
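The below-1MB constraint on CPU startup comes from how application processors are woken: a STARTUP IPI carries only an 8-bit vector VV, and the AP begins executing in real mode at CS:IP = VV00:0000, i.e. physical address 0x000VV000. A small C sketch of that arithmetic (hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode physical address: segment * 16 + offset. */
uint32_t realmode_phys(uint16_t cs, uint16_t ip)
{
	return ((uint32_t)cs << 4) + ip;
}

/* A STARTUP IPI with vector VV starts the AP at CS:IP = VV00:0000. */
uint32_t sipi_start_address(uint8_t vector)
{
	return realmode_phys((uint16_t)(vector << 8), 0);
}
```

Even the largest vector, 0xff, yields 0xff000, so the AP entry code always has to live in the first megabyte.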

Is there really no way to switch to 5 level paging other than to drop to
32bit mode and disable paging?    The x86 architecture does some very
bizarre things so I can believe it but that seems like a lot of work to
get somewhere.

Eric


* Re: [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
  2017-10-12 23:07     ` Eric W. Biederman
@ 2017-10-13  4:03       ` Kirill A. Shutemov
  0 siblings, 0 replies; 5+ messages in thread
From: Kirill A. Shutemov @ 2017-10-13  4:03 UTC
  To: Eric W. Biederman
  Cc: Dave Hansen, Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, Andi Kleen, linux-mm, linux-kernel

On Thu, Oct 12, 2017 at 06:07:36PM -0500, Eric W. Biederman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > On Mon, Oct 09, 2017 at 09:54:53AM -0700, Dave Hansen wrote:
> >> On 10/09/2017 09:09 AM, Kirill A. Shutemov wrote:
> >> > Apart from the trampoline itself, we also need a place to store the top-level
> >> > page table in lower memory, as we don't have a way to load a 64-bit value
> >> > into CR3 from 32-bit mode. We only really need 8 bytes there, as we only use
> >> > the very first entry of the page table.
> >> 
> >> Oh, and this is why you have to move "lvl5_pgtable" out of the kernel image?
> >
> > Right. I initialize the new location of the top-level page table directly.
> 
> So just a quick note.  I have a fuzzy memory of people loading their
> kernels above 4G physical because they did not have any memory below
> 4G.
> 
> That might be a very specialized case if my memory is correct because
> cpu startup has to have a trampoline below 1MB.  So I don't know how
> that works.  But I do seem to remember someone mentioning it.
> 
> Is there really no way to switch to 5 level paging other than to drop to
> 32bit mode and disable paging?    The x86 architecture does some very
> bizarre things so I can believe it but that seems like a lot of work to
> get somewhere.

The spec[1] is pretty clear on this, see section 2.2.2:

	The processor allows software to modify CR4.LA57 only outside of
	IA-32e mode. In IA-32e mode, an attempt to modify CR4.LA57 using
	the MOV CR instruction causes a general-protection exception
	(#GP).

[1] https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
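The quoted rule, together with the fact that IA-32e mode is only active while both EFER.LME and CR0.PG are set, explains the order the trampoline follows: drop to compatibility mode, clear CR0.PG, set CR4.PAE and CR4.LA57, then set CR0.PG again. The toy C model below walks through that order. "struct cpu" and the mov_to_* helpers are hypothetical, a false return stands in for #GP, and every other architectural check is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG		(UINT64_C(1) << 31)
#define CR4_PAE		(UINT64_C(1) << 5)
#define CR4_LA57	(UINT64_C(1) << 12)
#define EFER_LME	(UINT64_C(1) << 8)
#define EFER_LMA	(UINT64_C(1) << 10)

/* Toy CPU state: only the bits this rule cares about. */
struct cpu {
	uint64_t cr0, cr4, efer;
};

/* IA-32e mode (EFER.LMA) is active when LME and CR0.PG are both set. */
static void update_lma(struct cpu *c)
{
	if ((c->efer & EFER_LME) && (c->cr0 & CR0_PG))
		c->efer |= EFER_LMA;
	else
		c->efer &= ~EFER_LMA;
}

/* MOV to CR4: changing LA57 while IA-32e mode is active raises #GP. */
bool mov_to_cr4(struct cpu *c, uint64_t val)
{
	if ((c->efer & EFER_LMA) && ((c->cr4 ^ val) & CR4_LA57))
		return false;	/* stands in for #GP */
	c->cr4 = val;
	return true;
}

/* MOV to CR0: toggling PG enters or leaves IA-32e mode. */
void mov_to_cr0(struct cpu *c, uint64_t val)
{
	c->cr0 = val;
	update_lma(c);
}
```

Starting from long mode, a direct LA57 write is rejected; after clearing CR0.PG it succeeds, and setting CR0.PG again brings IA-32e mode back with 5-level paging enabled.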

-- 
 Kirill A. Shutemov
