Re: Question about Address Range Validation in Crash Kernel Allocation

From: Baoquan He <bhe@redhat.com>
To: Li Huafei <lihuafei1@huawei.com>
Cc: Dave Young <dyoung@redhat.com>,
	"chenhaixiang (A)" <chenhaixiang3@huawei.com>,
	"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
	"chenhuacai@kernel.org" <chenhuacai@kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	Louhongxiang <louhongxiang@huawei.com>,
	"wangbin (A)" <wangbin224@huawei.com>,
	"Fangchuangchuang(Fcc,Euler)" <fangchuangchuang@huawei.com>,
	"wanghai (M)" <wanghai38@huawei.com>,
	"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>
Subject: Re: Question about Address Range Validation in Crash Kernel Allocation
Date: Fri, 22 Mar 2024 09:16:15 +0800
Message-ID: <Zfzb394SvIBggtS2@MiWiFi-R3L-srv>
In-Reply-To: <f286282d-a3ee-980e-565a-bf0c401ca529@huawei.com>

On 03/21/24 at 08:37pm, Li Huafei wrote:
> 
> 
> On 2024/3/21 18:06, Dave Young wrote:
> > Hi,
> > 
> > On Thu, 21 Mar 2024 at 17:49, Li Huafei <lihuafei1@huawei.com> wrote:
> >>
> >> Hi Baoquan,
> >>
> >> On 2024/3/21 17:17, chenhaixiang (A) wrote:
> >>>
> >>>>> I'm sorry for the delay. Here are some details from the boot log and
> >>>> /proc/iomem:
> >>>>> The Boot log:
> >>>>> [    0.000000] Linux version 6.8.0 (root@localhost.localdomain) (gcc (GCC)
> >>>> 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20
> >>>> 11:46:11 UTC 2024
> >>>>> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0
> >>>> root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap
> >>>> rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1
> >>>> reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3
> >>>> nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug
> >>>> console=ttyS0,115200n8 console=tty0
> >>>> ......snip...
> >>>>> [    0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x1000000
> >>>> from=0x0000000000000000 max_addr=0x0000000100000000
> >>>> reserve_crashkernel_generic+0x7c/0x220
> >>>>> [    0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x1000000
> >>>> from=0x0000000100000000 max_addr=0x0000400000000000
> >>>> reserve_crashkernel_generic+0x7c/0x220
> >>>>> [    0.022632] memblock_reserve: [0x000000c01f000000-0x000000c03effffff]
> >>>> memblock_alloc_range_nid+0xee/0x170
> >>>>> [    0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x1000000
> >>>> from=0x0000000000000000 max_addr=0x0000000100000000
> >>>> reserve_crashkernel_generic+0x11d/0x220
> >>>>> [    0.022638] memblock_reserve: [0x0000000049000000-0x0000000058ffffff]
> >>>> memblock_alloc_range_nid+0xee/0x170
> >>>>> [    0.022640] crashkernel low memory reserved: 0x49000000 - 0x59000000
> >>>> (256 MB)
> >>>>> [    0.022641] crashkernel reserved: 0x000000c01f000000 -
> >>>> 0x000000c03f000000 (512 MB)
> >>>>
> >>>> Here, crashkernel,low is reserved in region:  [0x49000000 - 0x59000000] (256
> >>>> MB)
> >>>>       crashkernel,high is reserved in region: [0x000000c01f000000 -
> >>>> 0x000000c03f000000] (512 MB) ......
> >>>>> [    0.029839] memblock_reserve: [0x000000c03ffff740-0x000000c03fffff7f]
> >>>> memblock_alloc_range_nid+0xee/0x170
> >>>>> [    0.029843] e820: update [mem 0x53cbd000-0x53ccffff] usable ==>
> >>>> reserved
> >>>>> [    0.029861] TSC deadline timer available
> >>>>
> >>>> Then here, region [0x53cbd000-0x53ccffff] is reserved in e820, which prints the above
> >>>> "usable ==> reserved". This should be the step which prevents earlier reserved
> >>>> crashkernel,low from being added to iomem tree. I am not sure what triggered
> >>>> the e820 update.
> >>
> >> We added dump_stack () printing in efi_mem_reserve () and found that
> >> [0x53cbd000-0x53ccffff] was reserved by BGRT:
> >>
> >>   [    0.032259] e820: update [mem 0x53cbd000-0x53ccffff] usable ==>
> >> reserved
> >>   [    0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted
> >> 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
> >>   [    0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25
> >> 08/30/2022
> >>   [    0.032264] Call Trace:
> >>   [    0.032265]  ? dump_stack+0x57/0x6e
> >>   [    0.032267]  ? bgrt_init+0xc2/0xc2
> >>   [    0.032268]  ? __e820__range_update+0x7a/0x1d6
> >>   [    0.032270]  ? bgrt_init+0xc2/0xc2
> >>   [    0.032272]  ? bgrt_init+0xc2/0xc2
> >>   [    0.032274]  ? efi_arch_mem_reserve+0x1a3/0x1d0
> >>   [    0.032276]  ? efi_mem_reserve+0x2d/0x42
> >>   [    0.032278]  ? acpi_parse_bgrt+0xa/0x11
> >>   [    0.032279]  ? acpi_table_parse+0x86/0xbc
> >>   [    0.032281]  ? acpi_boot_init+0x79/0xad
> >>   [    0.032282]  ? setup_arch+0x835/0x954
> >>   [    0.032284]  ? start_kernel+0x5d/0x455
> >>   [    0.032286]  ? secondary_startup_64_no_verify+0xc2/0xcb
> >>
> >> efi_reserve_boot_services() has reserved memory of type
> >> EFI_BOOT_SERVICES_CODE & EFI_BOOT_SERVICES_DATA  before crashkernel.
> >> efi_bgrt_init() assumes that EFI_BOOT_SERVICES_DATA is not reserved by
> >> other modules. Then, the e820_table is directly updated, and the BGRT
> >> memory is reserved.
> >>
> >> However, memblock_is_region_reserved() in efi_reserve_boot_services()
> >> returns true when the ranges only overlap.
> >>
> >>      already_reserved = memblock_is_region_reserved(start, size);
> > 
> > Do you mean efi_reserve_boot_services is supposed to reserve the bgrt
> > memory but it does not reserve it due to the region overlapping with
> > some other reserved region?  If so can you debug and find what exact
> > memblock reserved region overlaps with the bgrt?
> 
> Yes. I added the following debug print to efi_reserve_boot_services():
> 
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -339,6 +339,10 @@ void __init efi_reserve_boot_services(void)
> 
>                 already_reserved = memblock_is_region_reserved(start, size);
> 
> +               pr_info("kdumpdebug: efi_reserve_boot_services start 0x%lx, "
> +                       "size 0x%lx, type 0x%lx, already_reserved %d\n",
> +                       start, size, md->type, already_reserved);
> +
>                 /*
>                  * Because the following memblock_reserve() is paired
>                  * with memblock_free_late() for this region in
> 

That's great debugging and analysis, thank you guys. Now there are
several questions:

1) Why is memory region [0x5976a018-0x5976abc7] reserved by memblock
for efi_mem_attr_table? Isn't it supposed to be outside of the
EFI_BOOT_SERVICES_DATA area? We may need to check whether this is a bug.

[    0.000000] random: crng init done
[    0.000000] memblock_reserve: [0x000000005976a018-0x000000005976abc7] efi_memattr_init+0x51/0xa0

> This memory [0x000000005976a018-0x000000005976abc7] is reserved here, which belongs to EFI_BOOT_SERVICES_DATA.
>     [    0.000000] memblock_reserve: [0x000000005976a018-0x000000005976abc7] efi_memattr_init+0x51/0xa0
> It falls in the following range
>     [    0.000000] efi: mem22: [Boot Data   |   |  |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000051329000-0x000000005cefefff] (187MB)
> 
> In efi_reserve_boot_services(), [0x0000000051329000-0x000000005cefefff] will not be fully reserved because [0x000000005976a018-0x000000005976abc7]
> has already been reserved and overlaps with that range.

2) Because efi_mem_attr_table memblock reserved [0x5976a018-0x5976abc7],
the whole EFI_BOOT_SERVICES_DATA area [0x51329000-0x5cefefff] is not
memblock reserved for later free. Except for that small area, do we
still need to memblock reserve the remaining area? We may need to check
whether this is a bug.

> 
>     [    0.021316] efi: kdumpdebug: efi_reserve_boot_services start 0x51329000, size 0xbbd6000, type 0x4, already_reserved 1
> 
> It is not reserved by memblock, this free memory region is allocated by crashkernel
> 
>     [    0.022597] crashkernel low memory reserved: 0x49000000 - 0x59000000 (256 MB)
>     [    0.022599] crashkernel reserved: 0x000000c01f000000 - 0x000000c03f000000 (512 MB)
> 
> In efi_bgrt_init (), it is assumed that the memory of the EFI_BOOT_SERVICES_DATA type has been successfully 
> reserved. Therefore, the address in the range is directly used. As a result, the memory overlaps with
> the crashkernel region. 

3) efi_bgrt_init() should be innocent because it's supposed to safely
use the area according to the existing efi quirk handling.

4) The deferring of adding crashk_low_res to iomem exposed the above
efi issue. When we cancel the deferring of inserting crashk_res into
iomem, we can see that the bgrt area is reserved inside the crashkernel
region, which is problematic.

2d4fd058-60efefff : System RAM
  2d4fd058-58ffffff : System RAM
    49000000-58ffffff : Crash kernel
      53cbd000-53ccffff : Reserved     <--- 
60eff000-704fefff : Reserved
--
  93dd424000-93dd9fffff : Kernel bss
  c01f000000-c03effffff : Crash kernel
d0000000000-d0fffffffff : PCI Bus 0000:00
  d0000000000-d00001fffff : PCI Bus 0000:01

> 
>     [    0.029694] e820: update [mem 0x53cbd000-0x53ccffff] usable ==> reserved
> > 
> > BTW, the previous email threads are weird, and not threading
> > correctly, hard to find information.
> 
> It should be because the log content is too large and has been put on hold. In my previous email, I received a prompt:
> 
>  The reason it is being held:
> 
>     Message body is too big: 248998 bytes with a limit of 40 KB
> 
> 
> > 
> >>
> >>      /*
> >>       * Because the following memblock_reserve() is paired
> >>       * with memblock_free_late() for this region in
> >>       * efi_free_boot_services(), we must be extremely
> >>       * careful not to reserve, and subsequently free,
> >>       * critical regions of memory (like the kernel image) or
> >>       * those regions that somebody else has already
> >>       * reserved.
> >>       *
> >>       * A good example of a critical region that must not be
> >>       * freed is page zero (first 4Kb of memory), which may
> >>       * contain boot services code/data but is marked
> >>       * E820_TYPE_RESERVED by trim_bios_range().
> >>       */
> >>      if (!already_reserved) {
> >>              memblock_reserve(start, size);
> >>
> >>              /*
> >>               * If we are the first to reserve the region, no
> >>               * one else cares about it. We own it and can
> >>               * free it later.
> >>               */
> >>              if (can_free_region(start, size))
> >>                      continue;
> >>      }
> >>
> >> As a result, some memory of EFI_BOOT_SERVICES_DATA is not reserved in
> >> advance. The subsequent crashkernel happens to reserve this portion of
> >> memory, which conflicts with BGRT.
> >>
> >>> Current analysis suggests that efi_reserve_boot_services() is causing the update of the e820 table.
> >>>
> >>>>
> >>>> How do you boot into your new 6.8.0 kernel? Used kexec -l to jump into the 2nd
> >>>> kernel, or reboot from bios/firmware boot up into 6.8.0?
> >>> It's reboot from bios boot up into 6.8.0. I attempted to revert the below patch,
> >>>  and this time the conflicting segment "53cbd000-53ccffff" also appeared in the /proc/iomem
> >>>  of the 6.8 kernel.
> >>>
> >>> 2d4fd058-60efefff : System RAM
> >>>   2d4fd058-58ffffff : System RAM
> >>>     49000000-58ffffff : Crash kernel
> >>>       53cbd000-53ccffff : Reserved
> >>> 60eff000-704fefff : Reserved
> >>> --
> >>>   93dd424000-93dd9fffff : Kernel bss
> >>>   c01f000000-c03effffff : Crash kernel
> >>> d0000000000-d0fffffffff : PCI Bus 0000:00
> >>>   d0000000000-d00001fffff : PCI Bus 0000:01
> >>>>
> >>>> Reverting below commit should fix your problem, can you try it?
> >>>>
> >>>> commit 4a693ce65b186fddc1a73621bd6f941e6e3eca21
> >>>> Author: Huacai Chen <chenhuacai@kernel.org>
> >>>> Date:   Fri Dec 29 16:02:13 2023 +0800
> >>>>
> >>>>     kdump: defer the insertion of crashkernel resources
> >>>
> >>> .
> >>>
> >>
> >> _______________________________________________
> >> kexec mailing list
> >> kexec@lists.infradead.org
> >> http://lists.infradead.org/mailman/listinfo/kexec
> > 
> > .
> > 
> 
