From: Baoquan He <bhe@redhat.com>
To: Li Huafei <lihuafei1@huawei.com>
Cc: Dave Young <dyoung@redhat.com>,
"chenhaixiang (A)" <chenhaixiang3@huawei.com>,
"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"chenhuacai@kernel.org" <chenhuacai@kernel.org>,
"x86@kernel.org" <x86@kernel.org>,
Louhongxiang <louhongxiang@huawei.com>,
"wangbin (A)" <wangbin224@huawei.com>,
"Fangchuangchuang(Fcc,Euler)" <fangchuangchuang@huawei.com>,
"wanghai (M)" <wanghai38@huawei.com>,
"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>
Subject: Re: Question about Address Range Validation in Crash Kernel Allocation
Date: Fri, 22 Mar 2024 09:16:15 +0800
Message-ID: <Zfzb394SvIBggtS2@MiWiFi-R3L-srv>
In-Reply-To: <f286282d-a3ee-980e-565a-bf0c401ca529@huawei.com>
On 03/21/24 at 08:37pm, Li Huafei wrote:
>
>
> On 2024/3/21 18:06, Dave Young wrote:
> > Hi,
> >
> > On Thu, 21 Mar 2024 at 17:49, Li Huafei <lihuafei1@huawei.com> wrote:
> >>
> >> Hi Baoquan,
> >>
> >> On 2024/3/21 17:17, chenhaixiang (A) wrote:
> >>>
> >>>>> I'm sorry for the delay. Here are some details from the boot log and
> >>>> /proc/iomem:
> >>>>> The Boot log:
> >>>>> [ 0.000000] Linux version 6.8.0 (root@localhost.localdomain) (gcc (GCC)
> >>>> 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20
> >>>> 11:46:11 UTC 2024
> >>>>> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0
> >>>> root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap
> >>>> rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1
> >>>> reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3
> >>>> nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug
> >>>> console=ttyS0,115200n8 console=tty0
> >>>> ......snip...
> >>>>> [ 0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x1000000
> >>>> from=0x0000000000000000 max_addr=0x0000000100000000
> >>>> reserve_crashkernel_generic+0x7c/0x220
> >>>>> [ 0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x1000000
> >>>> from=0x0000000100000000 max_addr=0x0000400000000000
> >>>> reserve_crashkernel_generic+0x7c/0x220
> >>>>> [ 0.022632] memblock_reserve: [0x000000c01f000000-0x000000c03effffff]
> >>>> memblock_alloc_range_nid+0xee/0x170
> >>>>> [ 0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x1000000
> >>>> from=0x0000000000000000 max_addr=0x0000000100000000
> >>>> reserve_crashkernel_generic+0x11d/0x220
> >>>>> [ 0.022638] memblock_reserve: [0x0000000049000000-0x0000000058ffffff]
> >>>> memblock_alloc_range_nid+0xee/0x170
> >>>>> [ 0.022640] crashkernel low memory reserved: 0x49000000 - 0x59000000
> >>>> (256 MB)
> >>>>> [ 0.022641] crashkernel reserved: 0x000000c01f000000 -
> >>>> 0x000000c03f000000 (512 MB)
> >>>>
> >>>> Here, crashkernel,low is reserved in region: [0x49000000 - 0x59000000] (256
> >>>> MB)
> >>>> crashkernel,high is reserved in region: [0x000000c01f000000 -
> >>>> 0x000000c03f000000] (512 MB) ......
> >>>>> [ 0.029839] memblock_reserve: [0x000000c03ffff740-0x000000c03fffff7f]
> >>>> memblock_alloc_range_nid+0xee/0x170
> >>>>> [ 0.029843] e820: update [mem 0x53cbd000-0x53ccffff] usable ==>
> >>>> reserved
> >>>>> [ 0.029861] TSC deadline timer available
> >>>>
> >>>> Then here, region [0x53cbd000-0x53ccffff] is reserved in e820, which prints the above
> >>>> "usable ==> reserved". This should be the step which prevents earlier reserved
> >>>> crashkernel,low from being added to iomem tree. I am not sure what triggered
> >>>> the e820 update.
> >>
> >> We added a dump_stack() call in efi_mem_reserve() and found that
> >> [0x53cbd000-0x53ccffff] was reserved by BGRT:
> >>
> >> [ 0.032259] e820: update [mem 0x53cbd000-0x53ccffff] usable ==>
> >> reserved
> >> [ 0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted
> >> 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
> >> [ 0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25
> >> 08/30/2022
> >> [ 0.032264] Call Trace:
> >> [ 0.032265] ? dump_stack+0x57/0x6e
> >> [ 0.032267] ? bgrt_init+0xc2/0xc2
> >> [ 0.032268] ? __e820__range_update+0x7a/0x1d6
> >> [ 0.032270] ? bgrt_init+0xc2/0xc2
> >> [ 0.032272] ? bgrt_init+0xc2/0xc2
> >> [ 0.032274] ? efi_arch_mem_reserve+0x1a3/0x1d0
> >> [ 0.032276] ? efi_mem_reserve+0x2d/0x42
> >> [ 0.032278] ? acpi_parse_bgrt+0xa/0x11
> >> [ 0.032279] ? acpi_table_parse+0x86/0xbc
> >> [ 0.032281] ? acpi_boot_init+0x79/0xad
> >> [ 0.032282] ? setup_arch+0x835/0x954
> >> [ 0.032284] ? start_kernel+0x5d/0x455
> >> [ 0.032286] ? secondary_startup_64_no_verify+0xc2/0xcb
> >>
> >> efi_reserve_boot_services() has reserved memory of type
> >> EFI_BOOT_SERVICES_CODE & EFI_BOOT_SERVICES_DATA before crashkernel.
> >> efi_bgrt_init() assumes that EFI_BOOT_SERVICES_DATA is not reserved by
> >> other modules. Then, the e820_table is directly updated, and the BGRT
> >> memory is reserved.
> >>
> >> However, memblock_is_region_reserved() in efi_reserve_boot_services()
> >> returns true even when the ranges only partially overlap.
> >>
> >> already_reserved = memblock_is_region_reserved(start, size);
> >
> > Do you mean efi_reserve_boot_services is supposed to reserve the bgrt
> > memory but it does not reserve it due to the region overlapping with
> > some other reserved region? If so can you debug and find what exact
> > memblock reserved region overlaps with the bgrt?
>
> Yes. I added the following debug print to efi_reserve_boot_services():
>
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -339,6 +339,10 @@ void __init efi_reserve_boot_services(void)
>
> already_reserved = memblock_is_region_reserved(start, size);
>
> + pr_info("kdumpdebug: efi_reserve_boot_services start 0x%llx, "
> + "size 0x%llx, type 0x%x, already_reserved %d\n",
> + start, size, md->type, already_reserved);
> +
> /*
> * Because the following memblock_reserve() is paired
> * with memblock_free_late() for this region in
>
Great debugging and analysis, thank you guys. Now there are
several questions:
1) Why is memory region [0x5976a018-0x5976abc7] reserved by memblock
for efi_mem_attr_table? Isn't it supposed to be outside of the
EFI_BOOT_SERVICES_DATA area? We may need to check whether this is a bug.
[ 0.000000] random: crng init done
[ 0.000000] memblock_reserve: [0x000000005976a018-0x000000005976abc7] efi_memattr_init+0x51/0xa0
> This memory [0x000000005976a018-0x000000005976abc7] is reserved here, and it belongs to EFI_BOOT_SERVICES_DATA.
> [ 0.000000] memblock_reserve: [0x000000005976a018-0x000000005976abc7] efi_memattr_init+0x51/0xa0
> It falls in the following range
> [ 0.000000] efi: mem22: [Boot Data | | | | | | | | | | |WB|WT|WC|UC] range=[0x0000000051329000-0x000000005cefefff] (187MB)
>
> in efi_reserve_boot_services(), [0x0000000051329000-0x000000005cefefff] will not be reserved, because [0x000000005976a018-0x000000005976abc7]
> has already been reserved and overlaps with it.
2) Because efi_mem_attr_table memblock reserved [0x5976a018-0x5976abc7],
the whole EFI_BOOT_SERVICES_DATA area [0x51329000-0x5cefefff] is not
memblock reserved for later free. Except for that small area, do we
still need to memblock reserve the remaining area? We may need to check
whether this is a bug.
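To make question 2) concrete, below is a small userspace sketch, not
kernel code. struct range, ranges_overlap() and range_subtract() are
names I made up for illustration. It only models the any-overlap check
and computes which parts of the Boot Data region would remain if the
efi_memattr reservation were cut out of it. A real fix would also have
to think about page alignment of the leftover pieces, which this
sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive physical address range, as printed in the boot log. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* True if a and b share at least one byte (any-overlap semantics). */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/*
 * Subtract one already-reserved range from region. Writes up to two
 * leftover pieces into out[] and returns how many were written.
 */
static size_t range_subtract(struct range region, struct range reserved,
			     struct range out[2])
{
	size_t n = 0;

	assert(region.start <= region.end && reserved.start <= reserved.end);

	if (!ranges_overlap(region, reserved)) {
		out[n++] = region;	/* nothing to cut out */
		return n;
	}
	/* Piece below the reserved range, if any. */
	if (reserved.start > region.start)
		out[n++] = (struct range){ region.start, reserved.start - 1 };
	/* Piece above the reserved range, if any. */
	if (reserved.end < region.end)
		out[n++] = (struct range){ reserved.end + 1, region.end };
	return n;
}
```

With region = [0x51329000-0x5cefefff] and reserved =
[0x5976a018-0x5976abc7], this leaves two pieces, one below and one
above the efi_memattr reservation. The lower piece is the one that
crashkernel,low later lands in.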
>
> [ 0.021316] efi: kdumpdebug: efi_reserve_boot_services start 0x51329000, size 0xbbd6000, type 0x4, already_reserved 1
>
> Since it is not reserved by memblock, this free memory region is then allocated by crashkernel:
>
> [ 0.022597] crashkernel low memory reserved: 0x49000000 - 0x59000000 (256 MB)
> [ 0.022599] crashkernel reserved: 0x000000c01f000000 - 0x000000c03f000000 (512 MB)
>
> In efi_bgrt_init(), it is assumed that the memory of the EFI_BOOT_SERVICES_DATA type has been successfully
> reserved. Therefore, the address in the range is directly used. As a result, the memory overlaps with
> the crashkernel region.
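The three-way conflict described above can be checked with a few lines
of userspace C. This is only a sketch under my own assumptions: struct
range, range_intersect() and range_contains() are hypothetical helpers,
and the addresses are the ones from the log in this thread. It computes
the part of the Boot Data region that crashkernel,low took, then tests
whether the BGRT range sits inside that part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive physical address range, as printed in the boot log. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Compute the overlap of a and b; returns false if they are disjoint. */
static bool range_intersect(struct range a, struct range b, struct range *out)
{
	uint64_t start = a.start > b.start ? a.start : b.start;
	uint64_t end = a.end < b.end ? a.end : b.end;

	assert(a.start <= a.end && b.start <= b.end);

	if (start > end)
		return false;
	*out = (struct range){ start, end };
	return true;
}

/* True if inner lies entirely inside outer. */
static bool range_contains(struct range outer, struct range inner)
{
	return outer.start <= inner.start && inner.end <= outer.end;
}
```

Intersecting Boot Data [0x51329000-0x5cefefff] with crashkernel,low
[0x49000000-0x58ffffff] gives [0x51329000-0x58ffffff], and the BGRT
range [0x53cbd000-0x53ccffff] falls entirely inside it.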
3) efi_bgrt_init() should be innocent, because it's supposed to be able
to safely use the area according to the existing EFI quirk handling.
4) Deferring the insertion of crashk_low_res into iomem exposed the above
EFI issue. When we cancel the deferring of inserting crashk_res into
iomem, we can see that the BGRT area is reserved inside the crashkernel
region, which is problematic.
2d4fd058-60efefff : System RAM
2d4fd058-58ffffff : System RAM
49000000-58ffffff : Crash kernel
53cbd000-53ccffff : Reserved <---
60eff000-704fefff : Reserved
--
93dd424000-93dd9fffff : Kernel bss
c01f000000-c03effffff : Crash kernel
d0000000000-d0fffffffff : PCI Bus 0000:00
d0000000000-d00001fffff : PCI Bus 0000:01
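If someone wants to check other machines for the same symptom, a rough
userspace sketch like the one below could scan /proc/iomem style lines
for a "Reserved" entry nested inside a "Crash kernel" entry. It assumes
every line has the plain "start-end : name" form with hex addresses and
no trailing annotation. struct iomem_entry and both helper names are
made up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct iomem_entry {
	uint64_t start;
	uint64_t end;
	char name[64];
};

/* Parse one "start-end : name" line; returns true on success. */
static bool parse_iomem_line(const char *line, struct iomem_entry *e)
{
	return sscanf(line, " %" SCNx64 "-%" SCNx64 " : %63[^\n]",
		      &e->start, &e->end, e->name) == 3;
}

/* Count "Reserved" entries that lie inside any "Crash kernel" entry. */
static int count_reserved_in_crashkernel(const char *const lines[], size_t n)
{
	int hits = 0;

	assert(lines != NULL);

	for (size_t i = 0; i < n; i++) {
		struct iomem_entry ck;

		if (!parse_iomem_line(lines[i], &ck) ||
		    strcmp(ck.name, "Crash kernel") != 0)
			continue;

		for (size_t j = 0; j < n; j++) {
			struct iomem_entry r;

			if (parse_iomem_line(lines[j], &r) &&
			    strcmp(r.name, "Reserved") == 0 &&
			    ck.start <= r.start && r.end <= ck.end)
				hits++;
		}
	}
	return hits;
}
```

Fed the listing above (without the "<---" marker), it should report one
hit, for 53cbd000-53ccffff inside 49000000-58ffffff.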
>
> [ 0.029694] e820: update [mem 0x53cbd000-0x53ccffff] usable ==> reserved
> >
> > BTW, the previous email threads are weird, and not threading
> > correctly, hard to find information.
>
> It should be because the log content is too large and has been put on hold. In my previous email, I received a prompt:
>
> The reason it is being held:
>
> Message body is too big: 248998 bytes with a limit of 40 KB
>
>
> >
> >>
> >> /*
> >> * Because the following memblock_reserve() is paired
> >> * with memblock_free_late() for this region in
> >> * efi_free_boot_services(), we must be extremely
> >> * careful not to reserve, and subsequently free,
> >> * critical regions of memory (like the kernel image) or
> >> * those regions that somebody else has already
> >> * reserved.
> >> *
> >> * A good example of a critical region that must not be
> >> * freed is page zero (first 4Kb of memory), which may
> >> * contain boot services code/data but is marked
> >> * E820_TYPE_RESERVED by trim_bios_range().
> >> */
> >> if (!already_reserved) {
> >> memblock_reserve(start, size);
> >>
> >> /*
> >> * If we are the first to reserve the region, no
> >> * one else cares about it. We own it and can
> >> * free it later.
> >> */
> >> if (can_free_region(start, size))
> >> continue;
> >> }
> >>
> >> As a result, some memory of EFI_BOOT_SERVICES_DATA is not reserved in
> >> advance. The subsequent crashkernel happens to reserve this portion of
> >> memory, which conflicts with BGRT.
> >>
> >>> Current analysis suggests that efi_reserve_boot_services() is causing the update of the e820 table.
> >>>
> >>>>
> >>>> How do you boot into your new 6.8.0 kernel? Used kexec -l to jump into the 2nd
> >>>> kernel, or reboot from bios/firmware boot up into 6.8.0?
> >>> It was a reboot from the BIOS into 6.8.0. I attempted to revert the below patch,
> >>> and this time the conflicting segment "53cbd000-53ccffff" also appeared in the /proc/iomem
> >>> of the 6.8 kernel.
> >>>
> >>> 2d4fd058-60efefff : System RAM
> >>> 2d4fd058-58ffffff : System RAM
> >>> 49000000-58ffffff : Crash kernel
> >>> 53cbd000-53ccffff : Reserved
> >>> 60eff000-704fefff : Reserved
> >>> --
> >>> 93dd424000-93dd9fffff : Kernel bss
> >>> c01f000000-c03effffff : Crash kernel
> >>> d0000000000-d0fffffffff : PCI Bus 0000:00
> >>> d0000000000-d00001fffff : PCI Bus 0000:01
> >>>>
> >>>> Reverting below commit should fix your problem, can you try it?
> >>>>
> >>>> commit 4a693ce65b186fddc1a73621bd6f941e6e3eca21
> >>>> Author: Huacai Chen <chenhuacai@kernel.org>
> >>>> Date: Fri Dec 29 16:02:13 2023 +0800
> >>>>
> >>>> kdump: defer the insertion of crashkernel resources
> >>>
> >>> .
> >>>
> >>
> >> _______________________________________________
> >> kexec mailing list
> >> kexec@lists.infradead.org
> >> http://lists.infradead.org/mailman/listinfo/kexec
> >
> > .
> >
>