[PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()

public inbox for linux-kernel@vger.kernel.org

* [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
@ 2015-09-28  6:41 Lee, Chun-Yi
  2015-09-28  7:16 ` Baoquan He
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Lee, Chun-Yi @ 2015-09-28  6:41 UTC
  To: Vivek Goyal
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, x86,
	Stephen Rothwell, Viresh Kumar, Takashi Iwai, Jiang Liu,
	Andy Lutomirski, Baoquan He, linux-kernel, Lee, Chun-Yi

On big machines, the number of CPUs can be large enough that the program
headers nearly fill the page-aligned ELF headers buffer (4096, 8192, ...
bytes). A page fault then happens randomly.

This patch changes fill_up_crash_elf_data() to use walk_system_ram_res()
instead of walk_system_ram_range() to count the max number of crash memory
ranges. That's because walk_system_ram_range() filters out small memory
regions that reside within a single page, but walk_system_ram_res() does not.

The original page fault sometimes happened on big machines when
preparing ELF headers:

[  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
[  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
[  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
[  305.315393] Oops: 0002 [#1] SMP
[...snip]
[  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
[  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
[...snip]

Tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback()
shows that the code uses walk_system_ram_res() to fill the crash memory
region information into the program headers, so it includes the small memory
regions that reside within a single page. But fill_up_crash_elf_data() was
using walk_system_ram_range() to count the number of crash memory regions,
and that function filters out those small regions.

I printed those small memory regions, for example:

kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0

Based on the logic of walk_system_ram_range(), this memory region will be
filtered out:

pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
end_pfn = (0x77592258 + 0xdc0 - 1 + 1) >> 12 = 0x77593
end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
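
The rounding can be checked with a small standalone C sketch. This is
user-space code for illustration only, not the kernel implementation: the
helper name whole_pages_in_range() is hypothetical, and PAGE_SHIFT is assumed
to be 12 (4 KiB pages) as in the arithmetic above. It rounds the start up and
the end down to page boundaries, the same way the calculation above does.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/*
 * Number of whole pages inside the inclusive byte range [start, end].
 * The start is rounded up to a page boundary and the end is rounded down,
 * so a region that lies inside a single page yields 0.
 */
static uint64_t whole_pages_in_range(uint64_t start, uint64_t end)
{
	uint64_t pfn, end_pfn;

	assert(end >= start);
	pfn = (start + PAGE_SIZE - 1) >> PAGE_SHIFT;
	end_pfn = (end + 1) >> PAGE_SHIFT;

	/* Same condition as the "if (end_pfn > pfn)" check above. */
	return end_pfn > pfn ? end_pfn - pfn : 0;
}
```

For the region printed above (paddr=0x77592258, sz=0xdc0), calling
whole_pages_in_range(0x77592258, 0x77592258 + 0xdc0 - 1) returns 0, so a
page-based walker would never report that region to its callback.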

So, the max_nr_ranges counted by the kernel doesn't include the small memory
regions. That causes the page fault in the later code path that prepares the
ELF headers.

This issue was hidden on small machines that don't have many CPUs, because
the free space in the ELF headers buffer can absorb the headers for the small
memory regions. But when the machine has more CPUs, or the number of memory
regions is large enough to nearly fill the page-aligned buffer (e.g. 4096,
8192, ... bytes), the issue happens randomly.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
---
 arch/x86/kernel/crash.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..ad273b3d 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -185,8 +185,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 }

 #ifdef CONFIG_KEXEC_FILE
-static int get_nr_ram_ranges_callback(unsigned long start_pfn,
-				unsigned long nr_pfn, void *arg)
+static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
 {
 	int *nr_ranges = arg;

@@ -214,7 +213,7 @@ static void fill_up_crash_elf_data(struct crash_elf_data *ced,

 	ced->image = image;

-	walk_system_ram_range(0, -1, &nr_ranges,
+	walk_system_ram_res(0, -1, &nr_ranges,
 				get_nr_ram_ranges_callback);

 	ced->max_nr_ranges = nr_ranges;
-- 
2.1.4


* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-28  6:41 [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load() Lee, Chun-Yi
@ 2015-09-28  7:16 ` Baoquan He
  2015-09-28  9:35   ` joeyli
  2015-09-28  8:07 ` Baoquan He
  2015-09-29  3:50 ` Minfei Huang
  2 siblings, 1 reply; 8+ messages in thread
From: Baoquan He @ 2015-09-28  7:16 UTC
  To: Lee, Chun-Yi
  Cc: Vivek Goyal, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, x86,
	Stephen Rothwell, Viresh Kumar, Takashi Iwai, Jiang Liu,
	Andy Lutomirski, linux-kernel, Lee, Chun-Yi

Hi Chun-Yi,

On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> On big machines have CPU number that's very nearly to consume whole ELF
> headers buffer that's page aligned, 4096, 8192... Then the page fault error
> randomly happened.
> 
> This patch modified the code in fill_up_crash_elf_data() by using
> walk_system_ram_res() instead of walk_system_ram_range() to count the max
> number of crash memory ranges. That's because the walk_system_ram_range()
> filters out small memory regions that reside the same page, but
> walk_system_ram_res() does not.
> 
> The oringial page fault issue sometimes happened on big machines when
> preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ra
> m_headers_callback+0x165/0x260
> [...snip]
> 
> After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
> the code uses walk_system_ram_res() to fill-in crash memory regions information
> to program header, so it counts those small memory regions that reside in a
> page area. But, when kernel was using walk_system_ram_range() in
> fill_up_crash_elf_data() to count the number of crash memory regions, it
> filters out small regions.
> 
> I printed those small memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> 
> Base on the logic of walk_system_ram_range(), this memory region will be
> filter out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
> 
> So, the max_nr_ranges that counted by kernel doesn't include small memory
> regions. That causes the page fault issue happened in later code path for
> preparing EFL headers,
> 
> This issue was hided on small machine that doesn't have too many CPU because
> the free space of ELF headers buffer can cover the number of small memory
> regions. But, when the machine has more CPUs or the number of memory regions
> very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> issue will happen randomly.

It's a good finding and the fix sounds reasonable. I didn't get why too many
CPUs would cause this bug. On your big machine, can you check which regions
these are and what they are used for? I guess you mean the crash_notes
region, but I'm not sure.



* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-28  6:41 [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load() Lee, Chun-Yi
  2015-09-28  7:16 ` Baoquan He
@ 2015-09-28  8:07 ` Baoquan He
  2015-09-28  9:39   ` joeyli
  2015-09-29  3:50 ` Minfei Huang
  2 siblings, 1 reply; 8+ messages in thread
From: Baoquan He @ 2015-09-28  8:07 UTC
  To: Lee, Chun-Yi
  Cc: Vivek Goyal, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, x86,
	Stephen Rothwell, Viresh Kumar, Takashi Iwai, Jiang Liu,
	Andy Lutomirski, linux-kernel, Lee, Chun-Yi, akpm

On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> On big machines have CPU number that's very nearly to consume whole ELF
> headers buffer that's page aligned, 4096, 8192... Then the page fault error
> randomly happened.
> 
> This patch modified the code in fill_up_crash_elf_data() by using
> walk_system_ram_res() instead of walk_system_ram_range() to count the max
> number of crash memory ranges. That's because the walk_system_ram_range()
> filters out small memory regions that reside the same page, but
> walk_system_ram_res() does not.
> 
> The oringial page fault issue sometimes happened on big machines when
> preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ra
> m_headers_callback+0x165/0x260
> [...snip]
> 
> After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
> the code uses walk_system_ram_res() to fill-in crash memory regions information
> to program header, so it counts those small memory regions that reside in a
> page area. But, when kernel was using walk_system_ram_range() in
> fill_up_crash_elf_data() to count the number of crash memory regions, it
> filters out small regions.
> 
> I printed those small memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> 
> Base on the logic of walk_system_ram_range(), this memory region will be
> filter out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
> 
> So, the max_nr_ranges that counted by kernel doesn't include small memory
> regions. That causes the page fault issue happened in later code path for
> preparing EFL headers,
> 
> This issue was hided on small machine that doesn't have too many CPU because
> the free space of ELF headers buffer can cover the number of small memory
> regions. But, when the machine has more CPUs or the number of memory regions
> very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> issue will happen randomly.

CC akpm too.

I read the code again and I think it makes sense to use walk_system_ram_res().
prepare_elf64_headers() also uses walk_system_ram_res(); that's why you could
find this bug. Otherwise we would never have found it, and those small
regions that fit within one page would be lost from the vmcore.

Besides, could you please rearrange your patch log? It's not easy to see
what this patch does.



* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-28  7:16 ` Baoquan He
@ 2015-09-28  9:35   ` joeyli
  0 siblings, 0 replies; 8+ messages in thread
From: joeyli @ 2015-09-28  9:35 UTC
  To: Baoquan He
  Cc: Lee, Chun-Yi, Vivek Goyal, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, x86, Stephen Rothwell, Viresh Kumar, Takashi Iwai,
	Jiang Liu, Andy Lutomirski, linux-kernel

Hi,

On Mon, Sep 28, 2015 at 03:16:41PM +0800, Baoquan He wrote:
> Hi Chun-Yi,
> 
> On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> > On big machines have CPU number that's very nearly to consume whole ELF
> > headers buffer that's page aligned, 4096, 8192... Then the page fault error
> > randomly happened.
> > 
> > This patch modified the code in fill_up_crash_elf_data() by using
> > walk_system_ram_res() instead of walk_system_ram_range() to count the max
> > number of crash memory ranges. That's because the walk_system_ram_range()
> > filters out small memory regions that reside the same page, but
> > walk_system_ram_res() does not.
> > 
> > The oringial page fault issue sometimes happened on big machines when
> > preparing ELF headers:
> > 
> > [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> > [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> > [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> > [  305.315393] Oops: 0002 [#1] SMP
> > [...snip]
> > [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> > [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ra
> > m_headers_callback+0x165/0x260
> > [...snip]
> > 
> > After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
> > the code uses walk_system_ram_res() to fill-in crash memory regions information
> > to program header, so it counts those small memory regions that reside in a
> > page area. But, when kernel was using walk_system_ram_range() in
> > fill_up_crash_elf_data() to count the number of crash memory regions, it
> > filters out small regions.
> > 
> > I printed those small memory regions, for example:
> > 
> > kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> > 
> > Base on the logic of walk_system_ram_range(), this memory region will be
> > filter out:
> > 
> > pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> > end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> > end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
> > 
> > So, the max_nr_ranges that counted by kernel doesn't include small memory
> > regions. That causes the page fault issue happened in later code path for
> > preparing EFL headers,
> > 
> > This issue was hided on small machine that doesn't have too many CPU because
> > the free space of ELF headers buffer can cover the number of small memory
> > regions. But, when the machine has more CPUs or the number of memory regions
> > very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> > issue will happen randomly.
> 
> It's a good finding and fix sounds reasonable. I didn't get why too many
> CPUs will cause this bug. From your big machine can you check which
> regions they are and what they are used for? I guess you mean the
> crash_notes region, but not very sure.
> 

In prepare_elf64_headers(), the logic to size the ELF headers buffer is:

        /* extra phdr for vmcoreinfo elf note */
        nr_phdr = nr_cpus + 1;
        nr_phdr += ced->max_nr_ranges;

        /*
         * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
         * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
         * I think this is required by tools like gdb. So same physical
         * memory will be mapped in two elf headers. One will contain kernel
         * text virtual addresses and other will have __va(physical) addresses.
         */
        nr_phdr++;
        elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
        elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);

So the whole buffer is consumed as follows:
0                                                                                                                       4096
+------------+--------------------+--------------------+---------------------------+---------------------------+---------+
| ELF header | per-CPU PT_NOTE... | vmcoreinfo PT_NOTE | kernel text region PT_LOAD| PT_LOAD for memory regions|   free  |
| (64 bytes) | (n * 56 bytes)     | (56 bytes)         | (56 bytes)                | (m * 56 bytes)            |         |
+------------+--------------------+--------------------+---------------------------+---------------------------+---------+
(n = number of CPUs, m = number of memory regions)

When the free space can hold the headers for the small memory regions (that
is, the difference between what walk_system_ram_range() and
walk_system_ram_res() report), the issue does not trigger.

But when the number of CPUs grows until the headers nearly fill the whole
4096-byte buffer, the issue happens.
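
The sizing logic above can be modelled in a few lines of standalone C. This
is only a sketch under stated assumptions, not the kernel code: the helper
names elf_buf_size() and headers_overflow() are hypothetical, the 64- and
56-byte sizes are the ones from the diagram, and ELF_CORE_HEADER_ALIGN is
assumed to be 4096.

```c
#include <assert.h>
#include <stddef.h>

#define EHDR_SIZE 64UL		/* sizeof(Elf64_Ehdr) */
#define PHDR_SIZE 56UL		/* sizeof(Elf64_Phdr) */
#define HDR_ALIGN 4096UL	/* assumed value of ELF_CORE_HEADER_ALIGN */

/* Buffer size allocated for nr_cpus CPUs and the counted RAM ranges. */
static size_t elf_buf_size(size_t nr_cpus, size_t counted_ranges)
{
	/* per-CPU notes + vmcoreinfo note + kernel text + RAM ranges */
	size_t nr_phdr = nr_cpus + 1 + 1 + counted_ranges;
	size_t sz = EHDR_SIZE + nr_phdr * PHDR_SIZE;

	/* Round up to the next multiple of HDR_ALIGN (a power of two). */
	return (sz + HDR_ALIGN - 1) & ~(HDR_ALIGN - 1);
}

/*
 * Nonzero if writing headers for actual_ranges regions would run past a
 * buffer that was sized for counted_ranges regions.
 */
static int headers_overflow(size_t nr_cpus, size_t counted_ranges,
			    size_t actual_ranges)
{
	size_t needed = EHDR_SIZE + (nr_cpus + 2 + actual_ranges) * PHDR_SIZE;

	assert(actual_ranges >= counted_ranges);
	return needed > elf_buf_size(nr_cpus, counted_ranges);
}
```

With these made-up numbers, 60 CPUs and 10 counted ranges need exactly 4096
bytes, so two uncounted small regions run past the buffer; with 8 CPUs the
same two regions fit in the free space.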


Thanks a lot!
Joey Lee




* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-28  8:07 ` Baoquan He
@ 2015-09-28  9:39   ` joeyli
  2015-09-28  9:52     ` Baoquan He
  0 siblings, 1 reply; 8+ messages in thread
From: joeyli @ 2015-09-28  9:39 UTC
  To: Baoquan He
  Cc: Lee, Chun-Yi, Vivek Goyal, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, x86, Stephen Rothwell, Viresh Kumar, Takashi Iwai,
	Jiang Liu, Andy Lutomirski, linux-kernel, akpm

On Mon, Sep 28, 2015 at 04:07:57PM +0800, Baoquan He wrote:
> On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> > On big machines have CPU number that's very nearly to consume whole ELF
> > headers buffer that's page aligned, 4096, 8192... Then the page fault error
> > randomly happened.
> > 
> > This patch modified the code in fill_up_crash_elf_data() by using
> > walk_system_ram_res() instead of walk_system_ram_range() to count the max
> > number of crash memory ranges. That's because the walk_system_ram_range()
> > filters out small memory regions that reside the same page, but
> > walk_system_ram_res() does not.
> > 
> > The oringial page fault issue sometimes happened on big machines when
> > preparing ELF headers:
> > 
> > [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> > [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> > [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> > [  305.315393] Oops: 0002 [#1] SMP
> > [...snip]
> > [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> > [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ra
> > m_headers_callback+0x165/0x260
> > [...snip]
> > 
> > After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
> > the code uses walk_system_ram_res() to fill-in crash memory regions information
> > to program header, so it counts those small memory regions that reside in a
> > page area. But, when kernel was using walk_system_ram_range() in
> > fill_up_crash_elf_data() to count the number of crash memory regions, it
> > filters out small regions.
> > 
> > I printed those small memory regions, for example:
> > 
> > kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> > 
> > Base on the logic of walk_system_ram_range(), this memory region will be
> > filter out:
> > 
> > pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> > end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> > end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
> > 
> > So, the max_nr_ranges that counted by kernel doesn't include small memory
> > regions. That causes the page fault issue happened in later code path for
> > preparing EFL headers,
> > 
> > This issue was hided on small machine that doesn't have too many CPU because
> > the free space of ELF headers buffer can cover the number of small memory
> > regions. But, when the machine has more CPUs or the number of memory regions
> > very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> > issue will happen randomly.
> 
> CC akpm too.
> 
> Read code again and I think it makes sense to use walk_system_ram_res.
> And in prepare_elf64_headers it also uses walk_system_ram_res. That's
> why you can find this bug. Otherwise we never find this and those small
> regions which only spread in one page will be lost in vmcore.
> 
> Besides could you please rearrange your patch log? It's not easy to get
> what this patch have done.
>

To avoid confusion, I will simplify the patch description: remove the
discussion of the CPU number, but keep the difference between
walk_system_ram_res() and walk_system_ram_range().


Thanks a lot!
Joey Lee
 


* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-28  9:39   ` joeyli
@ 2015-09-28  9:52     ` Baoquan He
  0 siblings, 0 replies; 8+ messages in thread
From: Baoquan He @ 2015-09-28  9:52 UTC
  To: joeyli
  Cc: Lee, Chun-Yi, Vivek Goyal, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, x86, Stephen Rothwell, Viresh Kumar, Takashi Iwai,
	Jiang Liu, Andy Lutomirski, linux-kernel, akpm

On 09/28/15 at 05:39pm, joeyli wrote:
> On Mon, Sep 28, 2015 at 04:07:57PM +0800, Baoquan He wrote:
> > On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> > > This issue was hided on small machine that doesn't have too many CPU because
> > > the free space of ELF headers buffer can cover the number of small memory
> > > regions. But, when the machine has more CPUs or the number of memory regions
> > > very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> > > issue will happen randomly.
> > 
> > CC akpm too.
> > 
> > Read code again and I think it makes sense to use walk_system_ram_res.
> > And in prepare_elf64_headers it also uses walk_system_ram_res. That's
> > why you can find this bug. Otherwise we never find this and those small
> > regions which only spread in one page will be lost in vmcore.
> > 
> > Besides could you please rearrange your patch log? It's not easy to get
> > what this patch have done.
> >
> 
> To avoid confusing, I will simplify the patch description.
> Removing things about CPU number but keep the difference between
> walk_system_ram_res and walk_system_ram_range.

Yeah, that is good. You can briefly mention why it wasn't found before but
happens now because of the many CPUs. The focus should be on the root cause:
small regions residing inside one page are ignored by
walk_system_ram_range().

Thanks for your effort!

Baoquan



* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-28  6:41 [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load() Lee, Chun-Yi
  2015-09-28  7:16 ` Baoquan He
  2015-09-28  8:07 ` Baoquan He
@ 2015-09-29  3:50 ` Minfei Huang
  2015-09-29  8:52   ` joeyli
  2 siblings, 1 reply; 8+ messages in thread
From: Minfei Huang @ 2015-09-29  3:50 UTC
  To: Lee, Chun-Yi
  Cc: Vivek Goyal, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, x86,
	Stephen Rothwell, Viresh Kumar, Takashi Iwai, Jiang Liu,
	Andy Lutomirski, Baoquan He, linux-kernel, Lee, Chun-Yi, kexec

On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> On big machines have CPU number that's very nearly to consume whole ELF
> headers buffer that's page aligned, 4096, 8192... Then the page fault error
> randomly happened.
> 
> This patch modified the code in fill_up_crash_elf_data() by using
> walk_system_ram_res() instead of walk_system_ram_range() to count the max
> number of crash memory ranges. That's because the walk_system_ram_range()
> filters out small memory regions that reside the same page, but
> walk_system_ram_res() does not.
> 
> The oringial page fault issue sometimes happened on big machines when
> preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ra
> m_headers_callback+0x165/0x260
> [...snip]
> 
> After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
> the code uses walk_system_ram_res() to fill-in crash memory regions information
> to program header, so it counts those small memory regions that reside in a
> page area. But, when kernel was using walk_system_ram_range() in
> fill_up_crash_elf_data() to count the number of crash memory regions, it
> filters out small regions.
> 
> I printed those small memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> 
> Base on the logic of walk_system_ram_range(), this memory region will be
> filter out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
> 
> So, the max_nr_ranges that counted by kernel doesn't include small memory
> regions. That causes the page fault issue happened in later code path for
> preparing EFL headers,
> 
> This issue was hided on small machine that doesn't have too many CPU because
> the free space of ELF headers buffer can cover the number of small memory
> regions. But, when the machine has more CPUs or the number of memory regions
> very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> issue will happen randomly.
> 
> Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
> ---
>  arch/x86/kernel/crash.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index e068d66..ad273b3d 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -185,8 +185,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  }
>  
>  #ifdef CONFIG_KEXEC_FILE
> -static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> -				unsigned long nr_pfn, void *arg)
> +static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
>  {
>  	int *nr_ranges = arg;

CCing the kexec mailing list.

Good catch.

It would be appreciated if you could change the above type to unsigned int
accordingly.

Thanks
Minfei



* Re: [PATCH] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-29  3:50 ` Minfei Huang
@ 2015-09-29  8:52   ` joeyli
  0 siblings, 0 replies; 8+ messages in thread
From: joeyli @ 2015-09-29  8:52 UTC
  To: Minfei Huang
  Cc: Lee, Chun-Yi, Vivek Goyal, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, x86, Stephen Rothwell, Viresh Kumar, Takashi Iwai,
	Jiang Liu, Andy Lutomirski, Baoquan He, linux-kernel, kexec

Hi Minfei, 

On Tue, Sep 29, 2015 at 11:50:44AM +0800, Minfei Huang wrote:
> On 09/28/15 at 02:41pm, Lee, Chun-Yi wrote:
> > On big machines have CPU number that's very nearly to consume whole ELF
> > headers buffer that's page aligned, 4096, 8192... Then the page fault error
> > randomly happened.
> > 
> > This patch modified the code in fill_up_crash_elf_data() by using
> > walk_system_ram_res() instead of walk_system_ram_range() to count the max
> > number of crash memory ranges. That's because the walk_system_ram_range()
> > filters out small memory regions that reside the same page, but
> > walk_system_ram_res() does not.
> > 
> > The oringial page fault issue sometimes happened on big machines when
> > preparing ELF headers:
> > 
> > [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> > [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> > [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> > [  305.315393] Oops: 0002 [#1] SMP
> > [...snip]
> > [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> > [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ra
> > m_headers_callback+0x165/0x260
> > [...snip]
> > 
> > After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
> > the code uses walk_system_ram_res() to fill-in crash memory regions information
> > to program header, so it counts those small memory regions that reside in a
> > page area. But, when kernel was using walk_system_ram_range() in
> > fill_up_crash_elf_data() to count the number of crash memory regions, it
> > filters out small regions.
> > 
> > I printed those small memory regions, for example:
> > 
> > kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> > 
> > Base on the logic of walk_system_ram_range(), this memory region will be
> > filter out:
> > 
> > pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> > end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> > end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn)      [FAIL]
> > 
> > So, the max_nr_ranges that counted by kernel doesn't include small memory
> > regions. That causes the page fault issue happened in later code path for
> > preparing EFL headers,
> > 
> > This issue was hided on small machine that doesn't have too many CPU because
> > the free space of ELF headers buffer can cover the number of small memory
> > regions. But, when the machine has more CPUs or the number of memory regions
> > very nearly to consume whole page aligned buffer, e.g. 4096, 8192... Then
> > issue will happen randomly.
> > 
> > Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
> > ---
> >  arch/x86/kernel/crash.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index e068d66..ad273b3d 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -185,8 +185,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >  }
> >  
> >  #ifdef CONFIG_KEXEC_FILE
> > -static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> > -				unsigned long nr_pfn, void *arg)
> > +static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
> >  {
> >  	int *nr_ranges = arg;
> 
> Ccing kexec maillist.
> 
> Good cacthing.
> 
> It is appreciate if you can change the above type to unsigned int
> accordingly.
> 
> Thanks
> Minfei
> 

It looks like unsigned int * is better; I will change it in the next version.
Thanks for your review.

Joey Lee

