From: Baoquan He <bhe@redhat.com>
To: lijiang <lijiang@redhat.com>, andy@infradead.org
Cc: linux-kernel@vger.kernel.org, linux-efi@vger.kernel.org,
platform-driver-x86@vger.kernel.org, x86@kernel.org,
ardb@kernel.org, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dvhart@infradead.org, kexec@lists.infradead.org,
hpa@zytor.com, Dave Young <dyoung@redhat.com>
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified
Date: Mon, 24 May 2021 16:32:39 +0800 [thread overview]
Message-ID: <20210524083239.GA2872@MiWiFi-R3L-srv> (raw)
In-Reply-To: <CANU+ZydyKsctuFjPfBz7PuS=FaUtK0gs5Lq06pL5nuRJKe+J0w@mail.gmail.com>
Hi Lianbo,
On 05/24/21 at 11:00am, lijiang wrote:
> Also add mail lists and more people in the cc list.
>
> Thanks.
> Lianbo
>
>
> On Fri, May 21, 2021 at 8:36 PM lijiang <lijiang@redhat.com> wrote:
>
> > Hi, Baoquan, Andy
> > Sorry for the late reply.
> >
> > On Tue, Apr 13, 2021 at 5:45 PM Baoquan He <bhe@redhat.com> wrote:
> >
> >> On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> >> > On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@redhat.com> wrote:
> >> > >
> >> > > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> >> > > >
> >> > > >
> >> > > > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@redhat.com> wrote:
> >> > > > >
> >> > > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> >> > > > >> Why don't we do this unconditionally? At the very best we gain
> >> half a megabyte of memory (except the trampoline, which has to live there,
> >> but it is only a few kilobytes.)
> >> > > > >
> >> > > > > This is a great suggestion, thanks. I think we can fix it in this
> >> way to
> >> > > > > make code simpler. Then the specific caring of real mode in
> >> > > > > efi_free_boot_services() can be removed too.
> >> > > > >
> >> > > >
> >> > > > This whole situation makes me think that the code is buggy before
> >> and buggy after.
> >> > > >
> >> > > > The issue here (I think) is that various pieces of code want to
> >> reserve specific pieces of otherwise-available low memory for their own
> >> nefarious uses. I don’t know *why* crash kernel needs this, but that
> >> doesn’t matter too much.
> >> > >
> >> > > Kdump kernel also need go through real mode code path during bootup.
> >> It
> >> > > is not different than normal kernel except that it skips the firmware
> >> > > resetting. So kdump kernel needs low 1M as system RAM just as normal
> >> > > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> >> > > to avoid any later kernel or driver data reside in this area.
> >> Otherwise,
> >> > > we need dump the content of this area to vmcore. As we know, when
> >> crash
> >> > > happened, the old memory of 1st kernel should be untouched until
> >> vmcore
> >> > > dumping read out its content. Meanwhile, kdump kernel need reuse low
> >> 1M.
> >> > > In the past, we used a back up region to copy out the low 1M area, and
> >> > > map the back up region into the low 1M area in vmcore elf file. In
> >> > > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
> >> crashkernel
> >> > > option is specified"), we changed to lock the whole low 1M to avoid
> >> > > writting any kernel data into, like this we can skip this area when
> >> > > dumping vmcore.
> >> > >
> >> > > Above is why we try to memblock reserve the whole low 1M. We don't
> >> want
> >> > > to use it, just don't want anyone to use it in 1st kernel.
> >> > >
> >> > > >
> >> > > > I propose that the right solution is to give low-memory-reserving
> >> code paths two chances to do what they need: once at the very beginning and
> >> once after EFI boot services are freed.
> >> > > >
> >> > > > Alternatively, just reserve *all* otherwise unused sub 1M memory up
> >> front, then release it right after releasing boot services, and then invoke
> >> the special cases exactly once.
> >> > >
> >> > > I am not sure if I got both suggested ways clearly. They look a little
> >> > > complicated in our case. As I explained at above, we want the whole
> >> low
> >> > > 1M locked up, not one piece or some pieces of it.
> >> >
> >> > My second suggestion is probably the better one. Here it is,
> >> concretely:
> >> >
> >> > The early (pre-free_efi_boot_services) code just reserves all
> >> > available sub-1M memory unconditionally, but it specially marks it as
> >> > reserved-but-available-later. We stop allocating the trampoline page
> >> > at this stage.
> >> >
> >> > In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> >> > stick it in the pile of reserved memory created in the early step.
> >> > This may involve splitting a block, kind of like the current
> >> > trampoline late allocation works.
> >> >
> >> > Then, *after* free_efi_boot_services(), we run a single block of code
> >> > that lets everything that wants sub-1M code claim some. This means
> >> > that the trampoline gets allocated and, if crashkernel wants to claim
> >> > everything else, it can. After that, everything still unclaimed gets
> >> > freed.
> >>
> >> void __init setup_arch(char **cmdline_p)
> >> {
> >> ...
> >> efi_reserve_boot_services();
> >> e820__memblock_alloc_reserved_mpc_new();
> >> #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
> >> setup_bios_corruption_check();
> >> #endif
> >> reserve_real_mode();
> >>
> >>
> >> trim_platform_memory_ranges();
> >> trim_low_memory_range();
> >> ...
> >> }
> >>
> >> After efi_reserve_boot_services(), there are several function calling to
> >> require memory reservation under low 1M.
> >>
> >>
> >> asmlinkage __visible void __init __no_sanitize_address
> >> start_kernel(void)
> >>
> >> {
> >> ...
> >> setup_arch(&command_line);
> >> ...
> >> mm_init();
> >> --> mem_init();
> >> -->memblock_free_all();
> >>
> >> ...
> >> #ifdef CONFIG_X86
> >> if (efi_enabled(EFI_RUNTIME_SERVICES))
> >> efi_enter_virtual_mode();
> >> -->efi_free_boot_services();
> >> -->memblock_free_late();
> >> #endif
> >> ...
> >> }
> >>
> >> So from the code flow, we can see that buddy allocator is built in
> >> mm_init() which puts all memory from memblock.memory excluding
> >> memblock.reserved into buddy. And much later, we call
> >> efi_free_boot_services() to release those reserved efi boot memory into
> >> buddy too.
> >>
> >> Are you suggesting we should do the memory reservation from low 1M
> >> after efi_free_boot_services()? To require memory pages from buddy for
> >> them? Please help point out my misunderstanding if have any.
> >>
> >> With my understanding, in non-efi case, we have done the memory
> >> reservation with memblock_reserve(), e.g
> >> e820__memblock_alloc_reserved_mpc_new, reserve_real_mode() are calling
> >> to do. Just efi_reserve|free_boot_services() break them when efi is
> >>
> >
> > Yes. But Andy also suggested to reserve all available sub-1M memory
> > **unconditionally**
> > in the early code.
> >
> > enabled. We can do them again in efi_free_boot_services() just like the
> >> real_mode reservation does.
> >>
> >
> > Do you mean to call the memblock_reserve() in
> > the efi_free_boot_services()? Or anything else?
> > Would you mind sharing more details about this?
Hmm, maybe no. There's no chance to call memblock_reserve() after
memblock_free_late().
I suggested making change in efi_free_boot_services() because the
similar problem has been encountered by Andy himself and fixed in
efi_free_boot_services(). Please check below commit for reference.
The patch described the phenemenon, while explained why in code
comment.
commit 5bc653b7318217c54244a14f248f1f07abe0a865
Author: Andy Lutomirski <luto@kernel.org>
Date: Wed Aug 10 02:29:17 2016 -0700
x86/efi: Allocate a trampoline if needed in efi_free_boot_services()
So we could handle it in the same place by extending the area to the
whole low-1M unconditionally. This is 1st way of three I can think of.
The other two are:
2) Move efi_reserve_boot_services() down to be above init_mem_mapping().
until init_mem_mapping(), we stop reserving memory from low-1M with
memblock_reserve(). The change is like below:
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 72920af0b3c0..93b2106d2050 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1067,7 +1067,6 @@ void __init setup_arch(char **cmdline_p)
* The EFI specification says that boot service code won't be
* called after ExitBootServices(). This is, in fact, a lie.
*/
- efi_reserve_boot_services();
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
@@ -1090,6 +1089,8 @@ void __init setup_arch(char **cmdline_p)
*/
trim_snb_memory();
+ efi_reserve_boot_services();
+
init_mem_mapping();
idt_setup_early_pf();
3) Do it in efi_reserve_boot_services() as you are doing in this patch.
I prefer the 1st and 2nd way.
And by the way, the original code of commit 5bc653b731821("x86/efi: Allocate
a trampoline if needed in efi_free_boot_services()") could be buggy. I
remember you ever pasted the boot log of system where you reproduced
issue and tested the tested patch, there are three separate pages from
low-1M reserved by boot services. If any of them falls into real mode
area, it will break the fix of commit 5bc653b731821.
Thanks
Baoquan
> >
> > diff --git a/arch/x86/platform/efi/quirks.c
> > b/arch/x86/platform/efi/quirks.c
> > index 7850111008a8..d02f12a60457 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -453,6 +453,8 @@ void __init efi_free_boot_services(void)
> > memblock_free_late(start, size);
> > }
> >
> > + memblock_reserve(0, 1<<20);
> > +
> > if (!num_entries)
> > return;
> >
> > Thanks.
> > Lianbo
> >
> >
> > Thanks
> >> Baoquan
> >>
> >
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
WARNING: multiple messages have this Message-ID (diff)
From: Baoquan He <bhe@redhat.com>
To: lijiang <lijiang@redhat.com>, andy@infradead.org
Cc: linux-kernel@vger.kernel.org, linux-efi@vger.kernel.org,
platform-driver-x86@vger.kernel.org, x86@kernel.org,
ardb@kernel.org, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dvhart@infradead.org, kexec@lists.infradead.org,
hpa@zytor.com, Dave Young <dyoung@redhat.com>
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified
Date: Mon, 24 May 2021 16:32:39 +0800 [thread overview]
Message-ID: <20210524083239.GA2872@MiWiFi-R3L-srv> (raw)
In-Reply-To: <CANU+ZydyKsctuFjPfBz7PuS=FaUtK0gs5Lq06pL5nuRJKe+J0w@mail.gmail.com>
Hi Lianbo,
On 05/24/21 at 11:00am, lijiang wrote:
> Also add mail lists and more people in the cc list.
>
> Thanks.
> Lianbo
>
>
> On Fri, May 21, 2021 at 8:36 PM lijiang <lijiang@redhat.com> wrote:
>
> > Hi, Baoquan, Andy
> > Sorry for the late reply.
> >
> > On Tue, Apr 13, 2021 at 5:45 PM Baoquan He <bhe@redhat.com> wrote:
> >
> >> On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> >> > On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@redhat.com> wrote:
> >> > >
> >> > > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> >> > > >
> >> > > >
> >> > > > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@redhat.com> wrote:
> >> > > > >
> >> > > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> >> > > > >> Why don't we do this unconditionally? At the very best we gain
> >> half a megabyte of memory (except the trampoline, which has to live there,
> >> but it is only a few kilobytes.)
> >> > > > >
> >> > > > > This is a great suggestion, thanks. I think we can fix it in this
> >> way to
> >> > > > > make code simpler. Then the specific caring of real mode in
> >> > > > > efi_free_boot_services() can be removed too.
> >> > > > >
> >> > > >
> >> > > > This whole situation makes me think that the code is buggy before
> >> and buggy after.
> >> > > >
> >> > > > The issue here (I think) is that various pieces of code want to
> >> reserve specific pieces of otherwise-available low memory for their own
> >> nefarious uses. I don’t know *why* crash kernel needs this, but that
> >> doesn’t matter too much.
> >> > >
> >> > > Kdump kernel also need go through real mode code path during bootup.
> >> It
> >> > > is not different than normal kernel except that it skips the firmware
> >> > > resetting. So kdump kernel needs low 1M as system RAM just as normal
> >> > > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> >> > > to avoid any later kernel or driver data reside in this area.
> >> Otherwise,
> >> > > we need dump the content of this area to vmcore. As we know, when
> >> crash
> >> > > happened, the old memory of 1st kernel should be untouched until
> >> vmcore
> >> > > dumping read out its content. Meanwhile, kdump kernel need reuse low
> >> 1M.
> >> > > In the past, we used a back up region to copy out the low 1M area, and
> >> > > map the back up region into the low 1M area in vmcore elf file. In
> >> > > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
> >> crashkernel
> >> > > option is specified"), we changed to lock the whole low 1M to avoid
> >> > > writting any kernel data into, like this we can skip this area when
> >> > > dumping vmcore.
> >> > >
> >> > > Above is why we try to memblock reserve the whole low 1M. We don't
> >> want
> >> > > to use it, just don't want anyone to use it in 1st kernel.
> >> > >
> >> > > >
> >> > > > I propose that the right solution is to give low-memory-reserving
> >> code paths two chances to do what they need: once at the very beginning and
> >> once after EFI boot services are freed.
> >> > > >
> >> > > > Alternatively, just reserve *all* otherwise unused sub 1M memory up
> >> front, then release it right after releasing boot services, and then invoke
> >> the special cases exactly once.
> >> > >
> >> > > I am not sure if I got both suggested ways clearly. They look a little
> >> > > complicated in our case. As I explained at above, we want the whole
> >> low
> >> > > 1M locked up, not one piece or some pieces of it.
> >> >
> >> > My second suggestion is probably the better one. Here it is,
> >> concretely:
> >> >
> >> > The early (pre-free_efi_boot_services) code just reserves all
> >> > available sub-1M memory unconditionally, but it specially marks it as
> >> > reserved-but-available-later. We stop allocating the trampoline page
> >> > at this stage.
> >> >
> >> > In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> >> > stick it in the pile of reserved memory created in the early step.
> >> > This may involve splitting a block, kind of like the current
> >> > trampoline late allocation works.
> >> >
> >> > Then, *after* free_efi_boot_services(), we run a single block of code
> >> > that lets everything that wants sub-1M code claim some. This means
> >> > that the trampoline gets allocated and, if crashkernel wants to claim
> >> > everything else, it can. After that, everything still unclaimed gets
> >> > freed.
> >>
> >> void __init setup_arch(char **cmdline_p)
> >> {
> >> ...
> >> efi_reserve_boot_services();
> >> e820__memblock_alloc_reserved_mpc_new();
> >> #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
> >> setup_bios_corruption_check();
> >> #endif
> >> reserve_real_mode();
> >>
> >>
> >> trim_platform_memory_ranges();
> >> trim_low_memory_range();
> >> ...
> >> }
> >>
> >> After efi_reserve_boot_services(), there are several function calling to
> >> require memory reservation under low 1M.
> >>
> >>
> >> asmlinkage __visible void __init __no_sanitize_address
> >> start_kernel(void)
> >>
> >> {
> >> ...
> >> setup_arch(&command_line);
> >> ...
> >> mm_init();
> >> --> mem_init();
> >> -->memblock_free_all();
> >>
> >> ...
> >> #ifdef CONFIG_X86
> >> if (efi_enabled(EFI_RUNTIME_SERVICES))
> >> efi_enter_virtual_mode();
> >> -->efi_free_boot_services();
> >> -->memblock_free_late();
> >> #endif
> >> ...
> >> }
> >>
> >> So from the code flow, we can see that buddy allocator is built in
> >> mm_init() which puts all memory from memblock.memory excluding
> >> memblock.reserved into buddy. And much later, we call
> >> efi_free_boot_services() to release those reserved efi boot memory into
> >> buddy too.
> >>
> >> Are you suggesting we should do the memory reservation from low 1M
> >> after efi_free_boot_services()? To require memory pages from buddy for
> >> them? Please help point out my misunderstanding if have any.
> >>
> >> With my understanding, in non-efi case, we have done the memory
> >> reservation with memblock_reserve(), e.g
> >> e820__memblock_alloc_reserved_mpc_new, reserve_real_mode() are calling
> >> to do. Just efi_reserve|free_boot_services() break them when efi is
> >>
> >
> > Yes. But Andy also suggested to reserve all available sub-1M memory
> > **unconditionally**
> > in the early code.
> >
> > enabled. We can do them again in efi_free_boot_services() just like the
> >> real_mode reservation does.
> >>
> >
> > Do you mean to call the memblock_reserve() in
> > the efi_free_boot_services()? Or anything else?
> > Would you mind sharing more details about this?
Hmm, maybe no. There's no chance to call memblock_reserve() after
memblock_free_late().
I suggested making change in efi_free_boot_services() because the
similar problem has been encountered by Andy himself and fixed in
efi_free_boot_services(). Please check below commit for reference.
The patch described the phenemenon, while explained why in code
comment.
commit 5bc653b7318217c54244a14f248f1f07abe0a865
Author: Andy Lutomirski <luto@kernel.org>
Date: Wed Aug 10 02:29:17 2016 -0700
x86/efi: Allocate a trampoline if needed in efi_free_boot_services()
So we could handle it in the same place by extending the area to the
whole low-1M unconditionally. This is 1st way of three I can think of.
The other two are:
2) Move efi_reserve_boot_services() down to be above init_mem_mapping().
until init_mem_mapping(), we stop reserving memory from low-1M with
memblock_reserve(). The change is like below:
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 72920af0b3c0..93b2106d2050 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1067,7 +1067,6 @@ void __init setup_arch(char **cmdline_p)
* The EFI specification says that boot service code won't be
* called after ExitBootServices(). This is, in fact, a lie.
*/
- efi_reserve_boot_services();
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
@@ -1090,6 +1089,8 @@ void __init setup_arch(char **cmdline_p)
*/
trim_snb_memory();
+ efi_reserve_boot_services();
+
init_mem_mapping();
idt_setup_early_pf();
3) Do it in efi_reserve_boot_services() as you are doing in this patch.
I prefer the 1st and 2nd way.
And by the way, the original code of commit 5bc653b731821("x86/efi: Allocate
a trampoline if needed in efi_free_boot_services()") could be buggy. I
remember you ever pasted the boot log of system where you reproduced
issue and tested the tested patch, there are three separate pages from
low-1M reserved by boot services. If any of them falls into real mode
area, it will break the fix of commit 5bc653b731821.
Thanks
Baoquan
> >
> > diff --git a/arch/x86/platform/efi/quirks.c
> > b/arch/x86/platform/efi/quirks.c
> > index 7850111008a8..d02f12a60457 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -453,6 +453,8 @@ void __init efi_free_boot_services(void)
> > memblock_free_late(start, size);
> > }
> >
> > + memblock_reserve(0, 1<<20);
> > +
> > if (!num_entries)
> > return;
> >
> > Thanks.
> > Lianbo
> >
> >
> > Thanks
> >> Baoquan
> >>
> >
next prev parent reply other threads:[~2021-05-24 8:32 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-07 14:03 [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified Lianbo Jiang
2021-04-07 14:03 ` Lianbo Jiang
2021-04-09 12:44 ` Baoquan He
2021-04-09 12:44 ` Baoquan He
2021-04-10 2:56 ` lijiang
2021-04-10 2:56 ` lijiang
[not found] ` <D7D32C89-4F99-434A-B7AF-7BEBDA494172@zytor.com>
2021-04-12 1:13 ` Baoquan He
2021-04-12 1:13 ` Baoquan He
2021-04-12 1:49 ` Andy Lutomirski
2021-04-12 1:49 ` Andy Lutomirski
2021-04-12 9:52 ` Baoquan He
2021-04-12 9:52 ` Baoquan He
2021-04-12 10:49 ` lijiang
2021-04-12 10:49 ` lijiang
2021-04-12 15:24 ` Andy Lutomirski
2021-04-12 15:24 ` Andy Lutomirski
2021-04-13 9:45 ` Baoquan He
2021-04-13 9:45 ` Baoquan He
[not found] ` <CANU+ZydgWTSg+iUix8ggn-cSPpg8qtShaUQ47cOeeMxFmXp_zQ@mail.gmail.com>
[not found] ` <CANU+ZydyKsctuFjPfBz7PuS=FaUtK0gs5Lq06pL5nuRJKe+J0w@mail.gmail.com>
2021-05-24 8:32 ` Baoquan He [this message]
2021-05-24 8:32 ` Baoquan He
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210524083239.GA2872@MiWiFi-R3L-srv \
--to=bhe@redhat.com \
--cc=andy@infradead.org \
--cc=ardb@kernel.org \
--cc=bp@alien8.de \
--cc=dvhart@infradead.org \
--cc=dyoung@redhat.com \
--cc=hpa@zytor.com \
--cc=kexec@lists.infradead.org \
--cc=lijiang@redhat.com \
--cc=linux-efi@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=platform-driver-x86@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.