From mboxrd@z Thu Jan 1 00:00:00 1970
From: ebiederm@xmission.com (Eric W. Biederman)
To: "Kirill A. Shutemov"
Cc: Dave Hansen, "Kirill A. Shutemov", Ingo Molnar, Linus Torvalds, x86@kernel.org, Thomas Gleixner, "H. Peter Anvin", Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov, Andi Kleen, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20171009160924.68032-1-kirill.shutemov@linux.intel.com> <20171009170900.gyl5sizwnd54ridc@node.shutemov.name>
Date: Thu, 12 Oct 2017 18:07:36 -0500
In-Reply-To: <20171009170900.gyl5sizwnd54ridc@node.shutemov.name> (Kirill A. Shutemov's message of "Mon, 9 Oct 2017 20:09:00 +0300")
Message-ID: <87k200vubr.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [PATCH, RFC] x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
X-Mailing-List: linux-kernel@vger.kernel.org

"Kirill A. Shutemov" writes:

> On Mon, Oct 09, 2017 at 09:54:53AM -0700, Dave Hansen wrote:
>> On 10/09/2017 09:09 AM, Kirill A. Shutemov wrote:
>> > Apart from trampoline itself we also need place to store top level page
>> > table in lower memory as we don't have a way to load 64-bit value into
>> > CR3 from 32-bit mode. We only really need 8-bytes there as we only use
>> > the very first entry of the page table.
>>
>> Oh, and this is why you have to move "lvl5_pgtable" out of the kernel image?
>
> Right. I initialize the new location of top level page table directly.

So just a quick note. I have a fuzzy memory of people loading their
kernels above 4G physical because they did not have any memory below 4G.
That might be a very specialized case, if my memory is correct, because
CPU startup has to have a trampoline below 1MB. So I don't know how
that works. But I do seem to remember someone mentioning it.

Is there really no way to switch to 5-level paging other than to drop to
32-bit mode and disable paging? The x86 architecture does some very
bizarre things, so I can believe it, but that seems like a lot of work to
get somewhere.

Eric

>
>> > diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
>> > index cefe4958fda9..049a289342bd 100644
>> > --- a/arch/x86/boot/compressed/head_64.S
>> > +++ b/arch/x86/boot/compressed/head_64.S
>> > @@ -288,6 +288,22 @@ ENTRY(startup_64)
>> >          leaq    boot_stack_end(%rbx), %rsp
>> >
>> >  #ifdef CONFIG_X86_5LEVEL
>> > +/*
>> > + * We need trampoline in lower memory switch from 4- to 5-level paging for
>> > + * cases when bootloader put kernel above 4G, but didn't enable 5-level paging
>> > + * for us.
>> > + *
>> > + * Here we use MBR memory to store trampoline code.
>> > + *
>> > + * We also have to have top page table in lower memory as we don't have a way
>> > + * to load 64-bit value into CR3 from 32-bit mode. We only need 8-bytes there
>> > + * as we only use the very first entry of the page table.
>> > + *
>> > + * Here we use 0x7000 as top-level page table.
>> > + */
>> > +#define LVL5_TRAMPOLINE 0x7c00
>> > +#define LVL5_PGTABLE    0x7000
>> > +
>> >          /* Preserve RBX across CPUID */
>> >          movq    %rbx, %r8
>> >
>> > @@ -323,29 +339,37 @@ ENTRY(startup_64)
>> >           * long mode would trigger #GP. So we need to switch off long mode
>> >           * first.
>> >           *
>> > -         * NOTE: This is not going to work if bootloader put us above 4G
>> > -         * limit.
>> > +         * We use trampoline in lower memory to handle situation when
>> > +         * bootloader put the kernel image above 4G.
>> >           *
>> >           * The first step is go into compatibility mode.
>> >           */
>> >
>> > -        /* Clear additional page table */
>> > -        leaq    lvl5_pgtable(%rbx), %rdi
>> > -        xorq    %rax, %rax
>> > -        movq    $(PAGE_SIZE/8), %rcx
>> > -        rep     stosq
>> > +        /* Copy trampoline code in place */
>> > +        movq    %rsi, %r9
>> > +        leaq    lvl5_trampoline(%rip), %rsi
>> > +        movq    $LVL5_TRAMPOLINE, %rdi
>> > +        movq    $(lvl5_trampoline_end - lvl5_trampoline), %rcx
>> > +        rep     movsb
>> > +        movq    %r9, %rsi
>>
>> This needs to get more heavily commented, like the use of r9 to stash
>> %rsi. Why do you do that, btw? I don't see it getting reused at first
>> glance.
>
> %rsi holds pointer to real_mode_data. It needs to be preserved.
>
> I'll add more comments.
>
>> I think it will also be really nice to differentiate "lvl5_trampoline"
>> from "LVL5_TRAMPOLINE". Maybe add "src" and "dst" to them or something.
>
> Makes sense. Thanks.
>
>> >          /*
>> > -         * Setup current CR3 as the first and only entry in a new top level
>> > +         * Setup current CR3 as the first and the only entry in a new top level
>> >           * page table.
>> >           */
>> >          movq    %cr3, %rdi
>> >          leaq    0x7 (%rdi), %rax
>> > -        movq    %rax, lvl5_pgtable(%rbx)
>> > +        movq    %rax, LVL5_PGTABLE
>> > +
>> > +        /*
>> > +         * Load address of lvl5 into RDI.
>> > +         * It will be used to return address from trampoline.
>> > +         */
>> > +        leaq    lvl5(%rip), %rdi
>>
>> Is there a reason to do a 'lea' here instead of just shoving the address
>> in directly? Is this a shorter instruction or something?
>
> This code can be loaded anywhere in memory and we need to calculate
> absolute address of the label here.
> AFAIK, "lea