From: Baoquan He <bhe@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Lianbo Jiang <lijiang@redhat.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-efi <linux-efi@vger.kernel.org>,
Platform Driver <platform-driver-x86@vger.kernel.org>,
X86 ML <x86@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Darren Hart <dvhart@infradead.org>,
Andy Shevchenko <andy@infradead.org>,
kexec@lists.infradead.org, Dave Young <dyoung@redhat.com>
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified
Date: Tue, 13 Apr 2021 17:45:15 +0800 [thread overview]
Message-ID: <20210413094515.GD4282@MiWiFi-R3L-srv> (raw)
In-Reply-To: <CALCETrV0dgn1=7CoB+BSHdDuzqtfpKGOPvjJg+sNo74VrcJE=A@mail.gmail.com>
On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> > >
> > >
> > > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > > >> Why don't we do this unconditionally? At the very best we gain half a megabyte of memory (except the trampoline, which has to live there, but it is only a few kilobytes.)
> > > >
> > > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > > make code simpler. Then the specific caring of real mode in
> > > > efi_free_boot_services() can be removed too.
> > > >
> > >
> > > This whole situation makes me think that the code is buggy before and buggy after.
> > >
> > > The issue here (I think) is that various pieces of code want to reserve specific pieces of otherwise-available low memory for their own nefarious uses. I don’t know *why* crash kernel needs this, but that doesn’t matter too much.
> >
> > Kdump kernel also need go through real mode code path during bootup. It
> > is not different than normal kernel except that it skips the firmware
> > resetting. So kdump kernel needs low 1M as system RAM just as normal
> > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> > to avoid any later kernel or driver data reside in this area. Otherwise,
> > we need dump the content of this area to vmcore. As we know, when crash
> > happened, the old memory of 1st kernel should be untouched until vmcore
> > dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> > In the past, we used a back up region to copy out the low 1M area, and
> > map the back up region into the low 1M area in vmcore elf file. In
> > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> > option is specified"), we changed to lock the whole low 1M to avoid
> > writting any kernel data into, like this we can skip this area when
> > dumping vmcore.
> >
> > Above is why we try to memblock reserve the whole low 1M. We don't want
> > to use it, just don't want anyone to use it in 1st kernel.
> >
> > >
> > > I propose that the right solution is to give low-memory-reserving code paths two chances to do what they need: once at the very beginning and once after EFI boot services are freed.
> > >
> > > Alternatively, just reserve *all* otherwise unused sub 1M memory up front, then release it right after releasing boot services, and then invoke the special cases exactly once.
> >
> > I am not sure if I got both suggested ways clearly. They look a little
> > complicated in our case. As I explained at above, we want the whole low
> > 1M locked up, not one piece or some pieces of it.
>
> My second suggestion is probably the better one. Here it is, concretely:
>
> The early (pre-free_efi_boot_services) code just reserves all
> available sub-1M memory unconditionally, but it specially marks it as
> reserved-but-available-later. We stop allocating the trampoline page
> at this stage.
>
> In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> stick it in the pile of reserved memory created in the early step.
> This may involve splitting a block, kind of like the current
> trampoline late allocation works.
>
> Then, *after* free_efi_boot_services(), we run a single block of code
> that lets everything that wants sub-1M code claim some. This means
> that the trampoline gets allocated and, if crashkernel wants to claim
> everything else, it can. After that, everything still unclaimed gets
> freed.
void __init setup_arch(char **cmdline_p)
{
...
efi_reserve_boot_services();
e820__memblock_alloc_reserved_mpc_new();
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check();
#endif
reserve_real_mode();
trim_platform_memory_ranges();
trim_low_memory_range();
...
}
After efi_reserve_boot_services(), there are several function calling to
require memory reservation under low 1M.
asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
{
...
setup_arch(&command_line);
...
mm_init();
--> mem_init();
-->memblock_free_all();
...
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
-->efi_free_boot_services();
-->memblock_free_late();
#endif
...
}
So from the code flow, we can see that buddy allocator is built in
mm_init() which puts all memory from memblock.memory excluding
memblock.reserved into buddy. And much later, we call
efi_free_boot_services() to release those reserved efi boot memory into
buddy too.
Are you suggesting we should do the memory reservation from low 1M
after efi_free_boot_services()? To require memory pages from buddy for
them? Please help point out my misunderstanding if have any.
With my understanding, in non-efi case, we have done the memory
reservation with memblock_reserve(), e.g
e820__memblock_alloc_reserved_mpc_new, reserve_real_mode() are calling
to do. Just efi_reserve|free_boot_services() break them when efi is
enabled. We can do them again in efi_free_boot_services() just like the
real_mode reservation does.
Thanks
Baoquan
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
WARNING: multiple messages have this Message-ID (diff)
From: Baoquan He <bhe@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Lianbo Jiang <lijiang@redhat.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-efi <linux-efi@vger.kernel.org>,
Platform Driver <platform-driver-x86@vger.kernel.org>,
X86 ML <x86@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Darren Hart <dvhart@infradead.org>,
Andy Shevchenko <andy@infradead.org>,
kexec@lists.infradead.org, Dave Young <dyoung@redhat.com>
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified
Date: Tue, 13 Apr 2021 17:45:15 +0800 [thread overview]
Message-ID: <20210413094515.GD4282@MiWiFi-R3L-srv> (raw)
In-Reply-To: <CALCETrV0dgn1=7CoB+BSHdDuzqtfpKGOPvjJg+sNo74VrcJE=A@mail.gmail.com>
On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> > >
> > >
> > > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > > >> Why don't we do this unconditionally? At the very best we gain half a megabyte of memory (except the trampoline, which has to live there, but it is only a few kilobytes.)
> > > >
> > > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > > make code simpler. Then the specific caring of real mode in
> > > > efi_free_boot_services() can be removed too.
> > > >
> > >
> > > This whole situation makes me think that the code is buggy before and buggy after.
> > >
> > > The issue here (I think) is that various pieces of code want to reserve specific pieces of otherwise-available low memory for their own nefarious uses. I don’t know *why* crash kernel needs this, but that doesn’t matter too much.
> >
> > Kdump kernel also need go through real mode code path during bootup. It
> > is not different than normal kernel except that it skips the firmware
> > resetting. So kdump kernel needs low 1M as system RAM just as normal
> > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> > to avoid any later kernel or driver data reside in this area. Otherwise,
> > we need dump the content of this area to vmcore. As we know, when crash
> > happened, the old memory of 1st kernel should be untouched until vmcore
> > dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> > In the past, we used a back up region to copy out the low 1M area, and
> > map the back up region into the low 1M area in vmcore elf file. In
> > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> > option is specified"), we changed to lock the whole low 1M to avoid
> > writting any kernel data into, like this we can skip this area when
> > dumping vmcore.
> >
> > Above is why we try to memblock reserve the whole low 1M. We don't want
> > to use it, just don't want anyone to use it in 1st kernel.
> >
> > >
> > > I propose that the right solution is to give low-memory-reserving code paths two chances to do what they need: once at the very beginning and once after EFI boot services are freed.
> > >
> > > Alternatively, just reserve *all* otherwise unused sub 1M memory up front, then release it right after releasing boot services, and then invoke the special cases exactly once.
> >
> > I am not sure if I got both suggested ways clearly. They look a little
> > complicated in our case. As I explained at above, we want the whole low
> > 1M locked up, not one piece or some pieces of it.
>
> My second suggestion is probably the better one. Here it is, concretely:
>
> The early (pre-free_efi_boot_services) code just reserves all
> available sub-1M memory unconditionally, but it specially marks it as
> reserved-but-available-later. We stop allocating the trampoline page
> at this stage.
>
> In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> stick it in the pile of reserved memory created in the early step.
> This may involve splitting a block, kind of like the current
> trampoline late allocation works.
>
> Then, *after* free_efi_boot_services(), we run a single block of code
> that lets everything that wants sub-1M code claim some. This means
> that the trampoline gets allocated and, if crashkernel wants to claim
> everything else, it can. After that, everything still unclaimed gets
> freed.
void __init setup_arch(char **cmdline_p)
{
...
efi_reserve_boot_services();
e820__memblock_alloc_reserved_mpc_new();
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check();
#endif
reserve_real_mode();
trim_platform_memory_ranges();
trim_low_memory_range();
...
}
After efi_reserve_boot_services(), there are several function calling to
require memory reservation under low 1M.
asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
{
...
setup_arch(&command_line);
...
mm_init();
--> mem_init();
-->memblock_free_all();
...
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
-->efi_free_boot_services();
-->memblock_free_late();
#endif
...
}
So from the code flow, we can see that buddy allocator is built in
mm_init() which puts all memory from memblock.memory excluding
memblock.reserved into buddy. And much later, we call
efi_free_boot_services() to release those reserved efi boot memory into
buddy too.
Are you suggesting we should do the memory reservation from low 1M
after efi_free_boot_services()? To require memory pages from buddy for
them? Please help point out my misunderstanding if have any.
With my understanding, in non-efi case, we have done the memory
reservation with memblock_reserve(), e.g
e820__memblock_alloc_reserved_mpc_new, reserve_real_mode() are calling
to do. Just efi_reserve|free_boot_services() break them when efi is
enabled. We can do them again in efi_free_boot_services() just like the
real_mode reservation does.
Thanks
Baoquan
next prev parent reply other threads:[~2021-04-13 9:45 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-07 14:03 [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified Lianbo Jiang
2021-04-07 14:03 ` Lianbo Jiang
2021-04-09 12:44 ` Baoquan He
2021-04-09 12:44 ` Baoquan He
2021-04-10 2:56 ` lijiang
2021-04-10 2:56 ` lijiang
[not found] ` <D7D32C89-4F99-434A-B7AF-7BEBDA494172@zytor.com>
2021-04-12 1:13 ` Baoquan He
2021-04-12 1:13 ` Baoquan He
2021-04-12 1:49 ` Andy Lutomirski
2021-04-12 1:49 ` Andy Lutomirski
2021-04-12 9:52 ` Baoquan He
2021-04-12 9:52 ` Baoquan He
2021-04-12 10:49 ` lijiang
2021-04-12 10:49 ` lijiang
2021-04-12 15:24 ` Andy Lutomirski
2021-04-12 15:24 ` Andy Lutomirski
2021-04-13 9:45 ` Baoquan He [this message]
2021-04-13 9:45 ` Baoquan He
[not found] ` <CANU+ZydgWTSg+iUix8ggn-cSPpg8qtShaUQ47cOeeMxFmXp_zQ@mail.gmail.com>
[not found] ` <CANU+ZydyKsctuFjPfBz7PuS=FaUtK0gs5Lq06pL5nuRJKe+J0w@mail.gmail.com>
2021-05-24 8:32 ` Baoquan He
2021-05-24 8:32 ` Baoquan He
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210413094515.GD4282@MiWiFi-R3L-srv \
--to=bhe@redhat.com \
--cc=andy@infradead.org \
--cc=ardb@kernel.org \
--cc=bp@alien8.de \
--cc=dvhart@infradead.org \
--cc=dyoung@redhat.com \
--cc=hpa@zytor.com \
--cc=kexec@lists.infradead.org \
--cc=lijiang@redhat.com \
--cc=linux-efi@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=mingo@redhat.com \
--cc=platform-driver-x86@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.