linux-kernel.vger.kernel.org archive mirror
* /proc/vmcore mmap() failure issue
@ 2013-11-13 20:41 Vivek Goyal
  2013-11-13 21:04 ` Vivek Goyal
  2013-11-14 10:31 ` HATAYAMA Daisuke
  0 siblings, 2 replies; 30+ messages in thread
From: Vivek Goyal @ 2013-11-13 20:41 UTC (permalink / raw)
  To: linux kernel mailing list, HATAYAMA Daisuke
  Cc: Kexec Mailing List, Baoquan He, WANG Chao, Dave Young,
	Eric W. Biederman

Hi Hatayama,

We are facing /proc/vmcore mmap() failure issues, after which makedumpfile
exits without saving the dump and the system reboots.

I tried latest makedumpfile (devel branch) with 3.12 kernel.

I think this issue happens only on some machines, and it looks like it
happens when the end of a System RAM chunk in the first kernel is not page
aligned. For example, I have one machine where I noticed it, and this is
what the System RAM map looks like.

00100000-dafa57ff : System RAM
  01000000-015892fa : Kernel code
  015892fb-0195c9ff : Kernel data
  01ae6000-01d31fff : Kernel bss
  24000000-33ffffff : Crash kernel
dafa5800-dbffffff : reserved

Notice that dafa57ff does not end at page boundary and next reserved
range does not start at page boundary. I think that next reserved
range is referenced through some ACPI data. More on this later.

So we put in some printk() messages to get more info. In a nutshell,
remap_pfn_range() fails when we try to map the last section of System
RAM, the one not ending on a page boundary.

remap_pfn_range()
   track_pfn_remap() {
        /*
         * For anything smaller than the vma size we set prot based on the
         * lookup.
         */ 
        flags = lookup_memtype(paddr);
        
        /* Check memtype for the remaining pages */
        while (size > PAGE_SIZE) {
                size -= PAGE_SIZE;
                paddr += PAGE_SIZE;
                if (flags != lookup_memtype(paddr))
                        return -EINVAL; <---------------- Failure.
        }
	
   }
     

So we pass a range to track_pfn_remap(), say pfn=0xdad62 size=0x244000.
We then call lookup_memtype() on every page in the range and make sure
they are all the same, otherwise we fail. Guess what: all are the same
except the last page (the one which does not end at a page boundary).
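
To make the arithmetic concrete, here is a tiny standalone sketch (not
kernel code; the numbers are taken from the map above) showing why the
last page of that range is the partial one:

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
	unsigned long paddr = 0xdad62UL << PAGE_SHIFT; /* pfn=0xdad62 */
	unsigned long size  = 0x244000UL;
	unsigned long ram_end = 0xdafa57ffUL;          /* from /proc/iomem */
	/* start of the last page that track_pfn_remap() looks up */
	unsigned long last_page = paddr + size - PAGE_SIZE;

	printf("last page starts at %#lx\n", last_page); /* 0xdafa5000 */
	/* RAM ends mid-page, so that page is only partially System RAM */
	printf("page is partial: %d\n", ram_end < last_page + PAGE_SIZE - 1);
	return 0;
}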

I dived deeper into lookup_memtype() and noticed that the regular
ranges are not registered anywhere and their flags are _PAGE_CACHE_UC_MINUS.
But the last unaligned page/range is registered in the memtype rbtree
and has the attribute _PAGE_CACHE_WB.

Then I hooked into reserve_memtype() to figure out who is registering
page 0xdafa5000 and it is acpi_init() which does it.
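
The hook was nothing fancy; a minimal sketch of this kind of debug hack
(hypothetical, not an upstream patch) inside reserve_memtype() in
arch/x86/mm/pat.c, with the physical address hard-coded from the map
above:

	/* dump the call chain whenever someone registers the partial page */
	if (start == 0xdafa5000ULL)
		dump_stack();

Here is the trace it produced: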

[    0.721655] Hardware name: <edited>
[    0.730590]  ffff8800340f3830 ffff8800340f37c0 ffffffff81575509 00000000dafa5000
[    0.738010]  ffff8800340f3800 ffffffff810566cc 00000000000dafa5 00000000dafa5000
[    0.745428]  00000000dafa6000 00000000dafa5000 0000000000000000 0000000000001000
[    0.752845] Call Trace:
[    0.755288]  [<ffffffff81575509>] dump_stack+0x45/0x56
[    0.760414]  [<ffffffff810566cc>] reserve_memtype+0x31c/0x3f0
[    0.766144]  [<ffffffff810537ef>] __ioremap_caller+0x12f/0x360
[    0.771963]  [<ffffffff8130ad56>] ? acpi_os_release_object+0xe/0x12
[    0.778217]  [<ffffffff815686ba>] ? acpi_os_map_memory+0xf6/0x14e
[    0.784295]  [<ffffffff81053a54>] ioremap_cache+0x14/0x20
[    0.789679]  [<ffffffff815686ba>] acpi_os_map_memory+0xf6/0x14e
[    0.795582]  [<ffffffff81322ac9>] acpi_ex_system_memory_space_handler+0xdd/0x1ca
[    0.802961]  [<ffffffff8131ca48>] acpi_ev_address_space_dispatch+0x1b0/0x208
[    0.809993]  [<ffffffff8131fd49>] acpi_ex_access_region+0x20e/0x2a2
[    0.816244]  [<ffffffff81149464>] ? __alloc_pages_nodemask+0x134/0x300
[    0.822754]  [<ffffffff813200e4>] acpi_ex_field_datum_io+0xf6/0x171
[    0.829004]  [<ffffffff81320301>] acpi_ex_extract_from_field+0xd7/0x20a
[    0.835602]  [<ffffffff81331d80>] ? acpi_ut_create_internal_object_dbg+0x23/0x8a
[    0.842981]  [<ffffffff8131f8e7>] acpi_ex_read_data_from_field+0x10f/0x14b
[    0.849838]  [<ffffffff81322e16>] acpi_ex_resolve_node_to_value+0x18e/0x21c
[    0.856780]  [<ffffffff813230a6>] acpi_ex_resolve_to_value+0x202/0x209
[    0.863291]  [<ffffffff81319486>] acpi_ds_evaluate_name_path+0x7b/0xf5
[    0.869803]  [<ffffffff81319834>] acpi_ds_exec_end_op+0x98/0x3e8
[    0.875793]  [<ffffffff8132aca4>] acpi_ps_parse_loop+0x514/0x560
[    0.881784]  [<ffffffff8132b738>] acpi_ps_parse_aml+0x98/0x28c
[    0.887601]  [<ffffffff8132bf8d>] acpi_ps_execute_method+0x1c1/0x26c
[    0.893939]  [<ffffffff813269c5>] acpi_ns_evaluate+0x1c1/0x258
[    0.899755]  [<ffffffff8131cb98>] acpi_ev_execute_reg_method+0xca/0x112
[    0.906353]  [<ffffffff8131cd6e>] acpi_ev_reg_run+0x48/0x52
[    0.911910]  [<ffffffff81328fad>] acpi_ns_walk_namespace+0xc8/0x17f
[    0.918160]  [<ffffffff8131cd26>] ? acpi_ev_detach_region+0x146/0x146
[    0.924585]  [<ffffffff8131cdbc>] acpi_ev_execute_reg_methods+0x44/0xf7
[    0.931184]  [<ffffffff819b2324>] ? acpi_sleep_proc_init+0x2a/0x2a
[    0.937349]  [<ffffffff8130ac66>] ? acpi_os_wait_semaphore+0x43/0x57
[    0.943686]  [<ffffffff81331a3f>] ? acpi_ut_acquire_mutex+0x48/0x88
[    0.949938]  [<ffffffff8131ceb8>] acpi_ev_initialize_op_regions+0x49/0x71
[    0.956709]  [<ffffffff819b2324>] ? acpi_sleep_proc_init+0x2a/0x2a
[    0.962873]  [<ffffffff81333310>] acpi_initialize_objects+0x23/0x4f
[    0.969125]  [<ffffffff819b23b4>] acpi_init+0x90/0x268

So basically, this split page seems to be the problem. Some other code
thinks that it has access to the full page, goes ahead and registers it
with the PAT rbtree, and this causes problems in the mmap() code.

I suspect we might have to go back to the idea of copying the first and
last non-page-aligned ranges into the new kernel's memory and reading them
from there to solve this issue. Do you have other ideas?
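
For reference, a rough sketch of what the copy-based approach could look
like on the /proc/vmcore side. read_from_oldmem() is the real helper in
fs/proc/vmcore.c; everything else here is made up for illustration:

/*
 * Hypothetical sketch: at vmcore init time, copy a non-page-aligned
 * head/tail of a System RAM chunk into a page-aligned buffer in the
 * second kernel, so reads and mappings never touch the partial page
 * in oldmem directly.
 */
static char *copy_partial_range(unsigned long long paddr, size_t size)
{
	char *buf = (char *)get_zeroed_page(GFP_KERNEL);
	u64 pos = paddr;

	if (!buf)
		return NULL;
	/* keep the same offset within the page as in old memory */
	if (read_from_oldmem(buf + (paddr & ~PAGE_MASK), size, &pos, 0) < 0) {
		free_page((unsigned long)buf);
		return NULL;
	}
	return buf;	/* serve this range from the copy instead */
}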

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-13 20:41 /proc/vmcore mmap() failure issue Vivek Goyal
@ 2013-11-13 21:04 ` Vivek Goyal
  2013-11-13 21:14   ` H. Peter Anvin
  2013-11-14 10:31 ` HATAYAMA Daisuke
  1 sibling, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-13 21:04 UTC (permalink / raw)
  To: linux kernel mailing list, HATAYAMA Daisuke
  Cc: Kexec Mailing List, Baoquan He, WANG Chao, Dave Young,
	Eric W. Biederman, H. Peter Anvin

[CC hpa ]

And this issue brings me to the question of why we allow System RAM
ranges which do not start or end on a page boundary. Can't we truncate
the BIOS-reported RAM ranges so that they start and end at a PAGE
boundary? Then the rest of the kernel would never see the unaligned
portion of RAM, and that would make life so much simpler for other tools.
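
To illustrate with the map above, trimming would just mean rounding each
System RAM range inward; a sketch of the arithmetic, assuming the
kernel's usual round_up()/round_down() semantics:

	u64 start = round_up(0x00100000ULL, PAGE_SIZE);       /* unchanged  */
	u64 end   = round_down(0xdafa57ffULL + 1, PAGE_SIZE); /* 0xdafa5000 */
	/* the partial page 0xdafa5000-0xdafa57ff simply disappears
	   from the System RAM map */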

Thanks
Vivek

On Wed, Nov 13, 2013 at 03:41:30PM -0500, Vivek Goyal wrote:
> Hi Hatayama,
> 
> We are facing /proc/vmcore mmap() failure issues, after which makedumpfile
> exits without saving the dump and the system reboots.
> 
> I tried latest makedumpfile (devel branch) with 3.12 kernel.
> 
> I think this issue happens only on some machines, and it looks like it
> happens when the end of a System RAM chunk in the first kernel is not page
> aligned. For example, I have one machine where I noticed it, and this is
> what the System RAM map looks like.
> 
> 00100000-dafa57ff : System RAM
>   01000000-015892fa : Kernel code
>   015892fb-0195c9ff : Kernel data
>   01ae6000-01d31fff : Kernel bss
>   24000000-33ffffff : Crash kernel
> dafa5800-dbffffff : reserved
> 
> Notice that dafa57ff does not end at page boundary and next reserved
> range does not start at page boundary. I think that next reserved
> range is referenced through some ACPI data. More on this later.
> 
> So we put in some printk() messages to get more info. In a nutshell,
> remap_pfn_range() fails when we try to map the last section of System
> RAM, the one not ending on a page boundary.
> 
> remap_pfn_range()
>    track_pfn_remap() {
>         /*
>          * For anything smaller than the vma size we set prot based on the
>          * lookup.
>          */ 
>         flags = lookup_memtype(paddr);
>         
>         /* Check memtype for the remaining pages */
>         while (size > PAGE_SIZE) {
>                 size -= PAGE_SIZE;
>                 paddr += PAGE_SIZE;
>                 if (flags != lookup_memtype(paddr))
>                         return -EINVAL; <---------------- Failure.
>         }
> 	
>    }
>      
> 
> So we pass a range to track_pfn_remap(), say pfn=0xdad62 size=0x244000.
> We then call lookup_memtype() on every page in the range and make sure
> they are all the same, otherwise we fail. Guess what: all are the same
> except the last page (the one which does not end at a page boundary).
> 
> I dived deeper into lookup_memtype() and noticed that the regular
> ranges are not registered anywhere and their flags are _PAGE_CACHE_UC_MINUS.
> But the last unaligned page/range is registered in the memtype rbtree
> and has the attribute _PAGE_CACHE_WB.
> 
> Then I hooked into reserve_memtype() to figure out who is registering
> page 0xdafa5000 and it is acpi_init() which does it.
> 
> [..]
> 
> So basically, this split page seems to be the problem. Some other code
> thinks that it has access to the full page, goes ahead and registers it
> with the PAT rbtree, and this causes problems in the mmap() code.
> 
> I suspect we might have to go back to the idea of copying the first and
> last non-page-aligned ranges into the new kernel's memory and reading them
> from there to solve this issue. Do you have other ideas?
> 
> Thanks
> Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-13 21:04 ` Vivek Goyal
@ 2013-11-13 21:14   ` H. Peter Anvin
  2013-11-13 22:41     ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: H. Peter Anvin @ 2013-11-13 21:14 UTC (permalink / raw)
  To: Vivek Goyal, linux kernel mailing list, HATAYAMA Daisuke
  Cc: Kexec Mailing List, Baoquan He, WANG Chao, Dave Young,
	Eric W. Biederman

On 11/13/2013 01:04 PM, Vivek Goyal wrote:
> [CC hpa ]
> 
> And this issue brings me to the question of why we allow System RAM
> ranges which do not start or end on a page boundary. Can't we truncate
> the BIOS-reported RAM ranges so that they start and end at a PAGE
> boundary? Then the rest of the kernel would never see the unaligned
> portion of RAM, and that would make life so much simpler for other tools.
> 

That is a bit of a headache to do in the memblock space.  We do, in
fact, truncate partial pages, but later in the game.  It is possible we
should push that sooner in the stack.  The fact that it makes it into the
rbtrees is fishy, but it also makes me wonder if we're doing something
totally stupid with regards to the memory mappings -- if this means
we're mapping ACPI data as noncacheable, that is not just a performance
problem but just plain wrong.  I don't even think the MTRRs can
represent different caching attributes for different parts of a page, so
this is something that we ourselves are doing.

	-hpa




* Re: /proc/vmcore mmap() failure issue
  2013-11-13 21:14   ` H. Peter Anvin
@ 2013-11-13 22:41     ` Vivek Goyal
  2013-11-13 22:44       ` H. Peter Anvin
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-13 22:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux kernel mailing list, HATAYAMA Daisuke, Kexec Mailing List,
	Baoquan He, WANG Chao, Dave Young, Eric W. Biederman

On Wed, Nov 13, 2013 at 01:14:53PM -0800, H. Peter Anvin wrote:
> On 11/13/2013 01:04 PM, Vivek Goyal wrote:
> > [CC hpa ]
> > 
> > And this issue brings me to the question of why we allow System RAM
> > ranges which do not start or end on a page boundary. Can't we truncate
> > the BIOS-reported RAM ranges so that they start and end at a PAGE
> > boundary? Then the rest of the kernel would never see the unaligned
> > portion of RAM, and that would make life so much simpler for other tools.
> > 
> 
> That is a bit of a headache to do in the memblock space.  We do, in
> fact, truncate partial pages, but later in the game.  It is possible we
> should push that sooner in the stack.

Hi Peter,

I noticed we seem to be trimming away partial pages in memblock.

memblock_x86_fill() {
	/* throw away partial pages */
        memblock_trim_memory(PAGE_SIZE);
}

But not in e820, hence the partial pages still show up in /proc/iomem.

How about something along the lines of the patch below? It fixes my
/proc/vmcore mmap() issue.

Thanks
Vivek


---
 arch/x86/kernel/e820.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c	2013-11-13 10:46:24.938057251 -0500
+++ linux-2.6/arch/x86/kernel/e820.c	2013-11-13 17:36:17.042681842 -0500
@@ -169,6 +169,33 @@ void __init e820_print_map(char *who)
 	}
 }
 
+static int e820_trim_memory(struct e820entry *map, unsigned int nr_entries,
+				unsigned int align)
+{
+	int i;
+	struct e820entry *ei;
+	u64 start, end, orig_start, orig_end;
+
+	for (i = 0; i < nr_entries; i++) {
+		ei = &map[i];
+		if (ei->type != E820_RAM)
+			continue;
+		orig_start = ei->addr;
+		orig_end = ei->addr + ei->size;
+
+		start = round_up(orig_start, align);
+		end = round_down(orig_end, align);
+
+		if (start == orig_start && end == orig_end)
+			continue;
+
+		ei->addr = start;
+		ei->size = end - start;
+	}
+
+	return 0;
+}
+
 /*
  * Sanitize the BIOS e820 map.
  *
@@ -267,6 +294,8 @@ int __init sanitize_e820_map(struct e820
 	int old_nr, new_nr, chg_nr;
 	int i;
 
+	e820_trim_memory(biosmap, *pnr_map, PAGE_SIZE);
+
 	/* if there's only one memory region, don't bother */
 	if (*pnr_map < 2)
 		return -1;



* Re: /proc/vmcore mmap() failure issue
  2013-11-13 22:41     ` Vivek Goyal
@ 2013-11-13 22:44       ` H. Peter Anvin
  2013-11-13 23:00         ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: H. Peter Anvin @ 2013-11-13 22:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux kernel mailing list, HATAYAMA Daisuke, Kexec Mailing List,
	Baoquan He, WANG Chao, Dave Young, Eric W. Biederman

On 11/13/2013 02:41 PM, Vivek Goyal wrote:
> 
> Hi Peter,
> 
> I noticed we seem to be trimming away partial pages in memblock.
> 
> memblock_x86_fill() {
> 	/* throw away partial pages */
>         memblock_trim_memory(PAGE_SIZE);
> }
> 
> But not in e820, hence the partial pages still show up in /proc/iomem.
> 

Why does /proc/iomem matter?

> How about something along the lines of the patch below? It fixes my
> /proc/vmcore mmap() issue.

I'm not sure if what you're seeing is something that is better handled
in userspace.

	-hpa




* Re: /proc/vmcore mmap() failure issue
  2013-11-13 22:44       ` H. Peter Anvin
@ 2013-11-13 23:00         ` Vivek Goyal
  2013-11-13 23:08           ` H. Peter Anvin
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-13 23:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux kernel mailing list, HATAYAMA Daisuke, Kexec Mailing List,
	Baoquan He, WANG Chao, Dave Young, Eric W. Biederman

On Wed, Nov 13, 2013 at 02:44:45PM -0800, H. Peter Anvin wrote:
> On 11/13/2013 02:41 PM, Vivek Goyal wrote:
> > 
> > Hi Peter,
> > 
> > I noticed we seem to be trimming away partial pages in memblock.
> > 
> > memblock_x86_fill() {
> > 	/* throw away partial pages */
> >         memblock_trim_memory(PAGE_SIZE);
> > }
> > 
> > But not in e820, hence the partial pages still show up in /proc/iomem.
> > 
> 
> Why does /proc/iomem matter?

Kexec-tools parses /proc/iomem and prepares PT_LOAD ELF headers for all
the RAM regions which need to be dumped out, including the partial-page
ones. The second kernel parses these headers and exports them to user
space in /proc/vmcore. Makedumpfile then mmap()s vmcore, tries to map the
partial-page pfn too, and we run into the issues above.
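
Conceptually it boils down to this (a minimal sketch, not kexec-tools'
actual code): the tool takes the "System RAM" lines from /proc/iomem
verbatim, partial pages included.

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/iomem", "r");
	char line[128];
	unsigned long long start, end;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (!strstr(line, "System RAM"))
			continue;
		if (sscanf(line, "%llx-%llx", &start, &end) == 2)
			/*
			 * becomes a PT_LOAD with p_paddr=start and
			 * p_memsz=end-start+1; if 'end' is not xxxfff,
			 * the PT_LOAD ends mid-page.
			 */
			printf("PT_LOAD %llx-%llx\n", start, end);
	}
	fclose(f);
	return 0;
}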

> 
> > How about something along the lines of the patch below? It fixes my
> > /proc/vmcore mmap() issue.
> 
> I'm not sure if what you're seeing is something that is better handled
> in userspace.

I think it would be easy to truncate the ranges in kexec-tools too, when
we are preparing the ELF headers, but I am not sure it is the right thing
to do. If an entry shows up in /proc/iomem as RAM, then kexec-tools has to
assume that the kernel might be using that pfn and that it needs to be
dumped out in vmcore. IMHO, the kernel should fix this issue.

Secondly, I am writing in-kernel kexec support and prepare ELF headers
there as well, where I face the same problem. So if we truncate partial
pages in e820, neither kexec-tools nor the in-kernel kexec implementation
has to do anything.

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-13 23:00         ` Vivek Goyal
@ 2013-11-13 23:08           ` H. Peter Anvin
  0 siblings, 0 replies; 30+ messages in thread
From: H. Peter Anvin @ 2013-11-13 23:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux kernel mailing list, HATAYAMA Daisuke, Kexec Mailing List,
	Baoquan He, WANG Chao, Dave Young, Eric W. Biederman

On 11/13/2013 03:00 PM, Vivek Goyal wrote:
> I think it would be easy to truncate the ranges in kexec-tools too, when
> we are preparing the ELF headers, but I am not sure it is the right thing
> to do. If an entry shows up in /proc/iomem as RAM, then kexec-tools has to
> assume that the kernel might be using that pfn and that it needs to be
> dumped out in vmcore. IMHO, the kernel should fix this issue.

The kernel will never use a fractional page, so that is not an issue.

> Secondly, I am writing in-kernel kexec support and prepare ELF headers
> there as well, where I face the same problem. So if we truncate partial
> pages in e820, neither kexec-tools nor the in-kernel kexec implementation
> has to do anything.

I'm mostly worried about truncation upon truncation causing problems.
As long as the trimming is done once and with proper consideration for
abutting regions, I guess I'm okay with it, although it feels wrong to me.
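
To illustrate the concern, a hypothetical map with two RAM entries
splitting one page between them:

	/*
	 * A: 0xd0000000-0xdafa57ff  ->  end   rounds down to 0xdafa5000
	 * B: 0xdafa5800-0xdbffffff  ->  start rounds up   to 0xdafa6000
	 *
	 * Trimmed independently, the page at 0xdafa5000 vanishes even
	 * though A and B together cover all of it.  Trimming once, after
	 * merging adjacent same-type ranges, avoids that.
	 */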

	-hpa



* Re: /proc/vmcore mmap() failure issue
  2013-11-13 20:41 /proc/vmcore mmap() failure issue Vivek Goyal
  2013-11-13 21:04 ` Vivek Goyal
@ 2013-11-14 10:31 ` HATAYAMA Daisuke
  2013-11-14 15:13   ` Vivek Goyal
  1 sibling, 1 reply; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-14 10:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux kernel mailing list, Kexec Mailing List, Baoquan He,
	WANG Chao, Dave Young, Eric W. Biederman, Atsushi Kumagai

(2013/11/14 5:41), Vivek Goyal wrote:
> Hi Hatayama,
>
> We are facing /proc/vmcore mmap() failure issues, after which makedumpfile
> exits without saving the dump and the system reboots.
>
> I tried latest makedumpfile (devel branch) with 3.12 kernel.
>
> I think this issue happens only on some machines, and it looks like it
> happens when the end of a System RAM chunk in the first kernel is not page
> aligned. For example, I have one machine where I noticed it, and this is
> what the System RAM map looks like.
>
> 00100000-dafa57ff : System RAM
>    01000000-015892fa : Kernel code
>    015892fb-0195c9ff : Kernel data
>    01ae6000-01d31fff : Kernel bss
>    24000000-33ffffff : Crash kernel
> dafa5800-dbffffff : reserved
>
> Notice that dafa57ff does not end at page boundary and next reserved
> range does not start at page boundary. I think that next reserved
> range is referenced through some ACPI data. More on this later.
>
> So we put in some printk() messages to get more info. In a nutshell,
> remap_pfn_range() fails when we try to map the last section of System
> RAM, the one not ending on a page boundary.
>
> remap_pfn_range()
>     track_pfn_remap() {
>          /*
>           * For anything smaller than the vma size we set prot based on the
>           * lookup.
>           */
>          flags = lookup_memtype(paddr);
>
>          /* Check memtype for the remaining pages */
>          while (size > PAGE_SIZE) {
>                  size -= PAGE_SIZE;
>                  paddr += PAGE_SIZE;
>                  if (flags != lookup_memtype(paddr))
>                          return -EINVAL; <---------------- Failure.
>          }
> 	
>     }
>
>
> So we pass a range to track_pfn_remap(), say pfn=0xdad62 size=0x244000.
> We then call lookup_memtype() on every page in the range and make sure
> they are all the same, otherwise we fail. Guess what: all are the same
> except the last page (the one which does not end at a page boundary).
>
> I dived deeper into lookup_memtype() and noticed that the regular
> ranges are not registered anywhere and their flags are _PAGE_CACHE_UC_MINUS.
> But the last unaligned page/range is registered in the memtype rbtree
> and has the attribute _PAGE_CACHE_WB.
>
> Then I hooked into reserve_memtype() to figure out who is registering
> page 0xdafa5000 and it is acpi_init() which does it.
>
> [..]
>
> So basically, this split page seems to be the problem. Some other code
> thinks that it has access to the full page, goes ahead and registers it
> with the PAT rbtree, and this causes problems in the mmap() code.
>
> I suspect we might have to go back to the idea of copying the first and
> last non-page-aligned ranges into the new kernel's memory and reading them
> from there to solve this issue. Do you have other ideas?
>

Sorry for the delayed response, although it looks like you have already
found a way to fix this issue.

BTW, I previously found the part of makedumpfile that truncated the first and
last pages of a PT_LOAD entry if they were not page aligned. Discussing this
with Kumagai-san: the truncation was hit on some ia64 system, he found valid
data in the truncated area, and the latest makedumpfile no longer does such
truncation.

The commit is:

commit f854b37adba223d5b4801accbedd17b447266d51
Author: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date:   Fri Jun 21 15:25:31 2013 +0900

     [PATCH 2/2] Fix the handling of the pages correspond to border of PT_LOAD.

     The pages corresponding to the border of a PT_LOAD were removed as
     holes. For example, pfn:N shown below was removed, but we know even
     an odd region like [0x40ffda7000 - 0x40ffda8000] can include valid
     data, so we shouldn't remove it as a hole.

                                phys_start
                                = 0x40ffda7000
              |<-- frac_head -->|------------- PT_LOAD -------------
          ----+-----------------------+---------------------+----
              |         pfn:N         |       pfn:N+1       | ...
          ----+-----------------------+---------------------+----
              |
          pfn_to_paddr(pfn:N)               # page size = 16k
          = 0x40ffda4000

     This patch handles such odd regions correctly: pfn:N is read and
     written to disk, and the ranges not covered by any PT_LOAD entry
     are filled with 0.

     Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
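
A worked version of the commit's ia64 example (16 KB pages), using the
same names as the diagram; just a sketch, not makedumpfile code:

	unsigned long long phys_start = 0x40ffda7000ULL; /* PT_LOAD start   */
	unsigned long long paddr      = 0x40ffda4000ULL; /* pfn_to_paddr(N) */
	unsigned long long frac_head  = phys_start - paddr;   /* 0x3000    */
	/*
	 * Before the commit: pfn:N was dropped entirely.  After it: the
	 * first frac_head bytes are zero-filled and the remaining
	 * 0x4000 - 0x3000 = 0x1000 bytes are read from the PT_LOAD.
	 */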

The log on the web is:

http://lists.infradead.org/pipermail/kexec/2013-May/008875.html

So, without this change, you would not have seen this issue. The original
reason the truncation code was implemented might have been issues similar
to this one.

Next, I think we need to consider whether or not to revert the above
commit, since makedumpfile now fails on the kind of system you reported.

-- 
Thanks.
HATAYAMA, Daisuke



* Re: /proc/vmcore mmap() failure issue
  2013-11-14 10:31 ` HATAYAMA Daisuke
@ 2013-11-14 15:13   ` Vivek Goyal
  2013-11-15  9:41     ` HATAYAMA Daisuke
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-14 15:13 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: linux kernel mailing list, Kexec Mailing List, Baoquan He,
	WANG Chao, Dave Young, Eric W. Biederman, Atsushi Kumagai

On Thu, Nov 14, 2013 at 07:31:37PM +0900, HATAYAMA Daisuke wrote:

[..]
> BTW, I previously found the part of makedumpfile that truncated the first and
> last pages of a PT_LOAD entry if they were not page aligned. Discussing this
> with Kumagai-san: the truncation was hit on some ia64 system, he found valid
> data in the truncated area, and the latest makedumpfile no longer does such
> truncation.

I went through the mail thread link you posted. Looks like the
bootloader had put the command line in that area, and it would be truncated
if we excluded partial pages from vmcore.

I don't know about IA64, but on x86 I see that we trim partial
pages before they are added to memblock, so no memory allocations after
that point should land in a partial-page area. The bootloader still might
place some things in those partial pages, I guess.

So do we care about that little bit of bootloader data if it happens to be
there? The dump mechanism works only after the kdump service has been
loaded. That means the first kernel is up and running, and any relevant
data passed to us by the bootloader has already been copied into kernel
memory without the kernel crashing. So to me, we don't have a strong
need to look at exactly how the bootloader-passed data looked originally.
We can just look at the kernel's copy of the associated data structures
(bootparams, command line, etc.).

I think being able to mmap() vmcore is much more important. BTW, is
it possible to change makedumpfile so that it uses mmap() for reading
page-aligned areas and falls back to the read() interface for reading
partial pages? (Though it is beginning to sound complicated
to me).
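
Something like this, perhaps (a minimal userspace sketch of the idea,
not makedumpfile's actual code; only mmap()/pread() are real interfaces
here):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int read_range(int fd, off_t offset, void *buf, size_t len)
{
	long ps = sysconf(_SC_PAGESIZE);

	if ((offset % ps) == 0 && (len % ps) == 0) {
		void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);

		if (p != MAP_FAILED) {
			memcpy(buf, p, len);
			munmap(p, len);
			return 0;
		}
		/* fall through to read() if mmap() refuses */
	}
	/* partial page, or mmap() failed: use the read path */
	return pread(fd, buf, len, offset) == (ssize_t)len ? 0 : -1;
}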

Given the fact that hpa does not like fixing it in the kernel, we are left
with the option of fixing it in one of the following places:

- Drop partial pages in kexec-tools.
- Drop partial pages in makedumpfile.
- Read partial pages using the read() interface in makedumpfile.
- Modify /proc/vmcore to copy partial pages into the second kernel's memory.

It is not clear to me that partial pages are really useful, so I want
to avoid modifying /proc/vmcore to deal with partial pages and adding
complexity there.

So fixing makedumpfile (either option 2 or option 3) seems least risky
to me. In fact, I would say let us keep it simple and just truncate
partial pages in makedumpfile, and look at option 3 once we have a
strong use case for partial pages.

What do you think?

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-14 15:13   ` Vivek Goyal
@ 2013-11-15  9:41     ` HATAYAMA Daisuke
  2013-11-15 14:26       ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-15  9:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Baoquan He, Kexec Mailing List, linux kernel mailing list,
	Atsushi Kumagai, Eric W. Biederman, Dave Young, WANG Chao

(2013/11/15 0:13), Vivek Goyal wrote:
> On Thu, Nov 14, 2013 at 07:31:37PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> BTW, I previously found the part of makedumpfile that truncated the first and
>> last pages of a PT_LOAD entry if they were not page aligned. Discussing this
>> with Kumagai-san: the truncation was hit on some ia64 system, he found valid
>> data in the truncated area, and the latest makedumpfile no longer does such
>> truncation.
>
> I went through the mail thread link you posted. Looks like the
> bootloader had put the command line in that area, and it would be truncated
> if we excluded partial pages from vmcore.
>
> I don't know about IA64, but on x86 I see that we trim partial
> pages before they are added to memblock, so no memory allocations after
> that point should land in a partial-page area. The bootloader still might
> place some things in those partial pages, I guess.
>
> So do we care about that little bit of bootloader data if it happens to be
> there? The dump mechanism works only after the kdump service has been
> loaded. That means the first kernel is up and running, and any relevant
> data passed to us by the bootloader has already been copied into kernel
> memory without the kernel crashing. So to me, we don't have a strong
> need to look at exactly how the bootloader-passed data looked originally.
> We can just look at the kernel's copy of the associated data structures
> (bootparams, command line, etc.).
>
> I think being able to mmap() vmcore is much more important. BTW, is
> it possible to change makedumpfile so that it uses mmap() for reading
> page-aligned areas and falls back to the read() interface for reading
> partial pages? (Though it is beginning to sound complicated
> to me).
>
> Given the fact that hpa does not like fixing it in the kernel, we are left
> with the option of fixing it in one of the following places:
> 
> - Drop partial pages in kexec-tools.
> - Drop partial pages in makedumpfile.
> - Read partial pages using the read() interface in makedumpfile.
> - Modify /proc/vmcore to copy partial pages into the second kernel's memory.
>
> It is not clear to me that partial pages are really useful, so I want
> to avoid modifying /proc/vmcore to deal with partial pages and adding
> complexity there.
>
> So fixing makedumpfile (either option 2 or option 3) seems least risky
> to me. In fact, I would say let us keep it simple and just truncate
> partial pages in makedumpfile, and look at option 3 once we have a
> strong use case for partial pages.
>
> What do you think?
>

As you say, it's not clear that partial pages are really useful, but on
the other hand, it's also not clear to me that they are really useless.
I think we should grab them as long as we have access to them.

Option 3) seems best to me. Switching between read() and mmap() would
not be so complex, and it's far more flexible to do this in makedumpfile
than in the kernel.

Also, I think it would be better for /proc/vmcore to disable mmap() on
partial pages in order to avoid the issue here.
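
For the /proc/vmcore side, a hypothetical sketch of what I mean (the
helper is made up; struct vmcore and vmcore_list are the real things in
fs/proc/vmcore.c):

	static bool range_is_page_aligned(struct vmcore *m)
	{
		return PAGE_ALIGNED(m->paddr) &&
		       PAGE_ALIGNED(m->paddr + m->size);
	}

	/*
	 * ... and in mmap_vmcore(), while walking vmcore_list:
	 * if (!range_is_page_aligned(m))
	 *         return -EINVAL;   (force the read() path for it)
	 */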

-- 
Thanks.
HATAYAMA, Daisuke



* Re: /proc/vmcore mmap() failure issue
  2013-11-15  9:41     ` HATAYAMA Daisuke
@ 2013-11-15 14:26       ` Vivek Goyal
  2013-11-18  0:51         ` Atsushi Kumagai
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-15 14:26 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: Baoquan He, Kexec Mailing List, linux kernel mailing list,
	Atsushi Kumagai, Eric W. Biederman, Dave Young, WANG Chao

On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:

[..]
> >Given the fact that hpa does not like fixing it in the kernel, we are left
> >with the option of fixing it in one of the following places:
> >
> >- Drop partial pages in kexec-tools.
> >- Drop partial pages in makedumpfile.
> >- Read partial pages using the read() interface in makedumpfile.
> >- Modify /proc/vmcore to copy partial pages into the second kernel's memory.
> >
> >It is not clear to me that partial pages are really useful, so I want
> >to avoid modifying /proc/vmcore to deal with partial pages and adding
> >complexity there.
> >
> >So fixing makedumpfile (either option 2 or option 3) seems least risky
> >to me. In fact, I would say let us keep it simple and just truncate
> >partial pages in makedumpfile, and look at option 3 once we have a
> >strong use case for partial pages.
> >
> >What do you think?
> >
> 
> As you say, it's not clear that partial pages are really useful, but on
> the other hand, it's also not clear to me that they are really useless.
> I think we should grab them as long as we have access to them.
> 
> Option 3) seems best to me. Switching between read() and mmap() would
> not be so complex, and it's far more flexible to do this in makedumpfile
> than in the kernel.

OK, I am fine with option 3. It is the more complicated option, but the
safe one.

Is there any chance that you could look into fixing this. I have no
experience writing code for makedumpfile.

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-15 14:26       ` Vivek Goyal
@ 2013-11-18  0:51         ` Atsushi Kumagai
  2013-11-18 13:55           ` Vivek Goyal
  2013-11-19  9:55           ` HATAYAMA Daisuke
  0 siblings, 2 replies; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-18  0:51 UTC (permalink / raw)
  To: vgoyal@redhat.com
  Cc: d.hatayama@jp.fujitsu.com, bhe@redhat.com,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
	ebiederm@xmission.com, dyoung@redhat.com, chaowang@redhat.com

(2013/11/15 23:26), Vivek Goyal wrote:
> On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>>> Given the fact that hpa does not like fixing it in the kernel, we are left
>>> with the option of fixing it in one of the following places:
>>>
>>> - Drop partial pages in kexec-tools.
>>> - Drop partial pages in makedumpfile.
>>> - Read partial pages using the read() interface in makedumpfile.
>>> - Modify /proc/vmcore to copy partial pages into the second kernel's memory.
>>>
>>> It is not clear to me that partial pages are really useful, so I want
>>> to avoid modifying /proc/vmcore to deal with partial pages and adding
>>> complexity there.
>>>
>>> So fixing makedumpfile (either option 2 or option 3) seems least risky
>>> to me. In fact, I would say let us keep it simple and just truncate
>>> partial pages in makedumpfile, and look at option 3 once we have a
>>> strong use case for partial pages.
>>>
>>> What do you think?
>>>
>>
>> As you say, it's not clear that partial pages are really useful, but on
>> the other hand, it's also not clear to me that they are really useless.
>> I think we should grab them as long as we have access to them.
>>
>> Option 3) seems best to me. Switching between read() and mmap() would
>> not be so complex, and it's far more flexible to do this in makedumpfile
>> than in the kernel.
>
>> OK, I am fine with option 3. It is the more complicated option, but the
>> safe one.

It sounds reasonable to me, too.

> Is there any chance that you could look into fixing this. I have no
> experience writing code for makedumpfile.

I'll send a patch to fix this soon.


Thanks
Atsushi Kumagai


* Re: /proc/vmcore mmap() failure issue
  2013-11-18  0:51         ` Atsushi Kumagai
@ 2013-11-18 13:55           ` Vivek Goyal
  2013-11-20  5:29             ` Atsushi Kumagai
  2013-11-19  9:55           ` HATAYAMA Daisuke
  1 sibling, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-18 13:55 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: d.hatayama@jp.fujitsu.com, bhe@redhat.com,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
	ebiederm@xmission.com, dyoung@redhat.com, chaowang@redhat.com

On Mon, Nov 18, 2013 at 12:51:39AM +0000, Atsushi Kumagai wrote:

[..]
> > Is there any chance that you could look into fixing this. I have no
> > experience writing code for makedumpfile.
> 
> I'll send a patch to fix this soon.

Thanks Atsushi.

Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-18  0:51         ` Atsushi Kumagai
  2013-11-18 13:55           ` Vivek Goyal
@ 2013-11-19  9:55           ` HATAYAMA Daisuke
  2013-11-20  5:27             ` Atsushi Kumagai
  1 sibling, 1 reply; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-19  9:55 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: vgoyal@redhat.com, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

(2013/11/18 9:51), Atsushi Kumagai wrote:
> (2013/11/15 23:26), Vivek Goyal wrote:
>> On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:
>>
>> [..]
>>>> Given the fact that hpa does not like fixing it in the kernel, we are left
>>>> with the option of fixing it in one of the following places:
>>>>
>>>> - Drop partial pages in kexec-tools.
>>>> - Drop partial pages in makedumpfile.
>>>> - Read partial pages using the read() interface in makedumpfile.
>>>> - Modify /proc/vmcore to copy partial pages into the second kernel's memory.
>>>>
>>>> It is not clear to me that partial pages are really useful, so I want
>>>> to avoid modifying /proc/vmcore to deal with partial pages and adding
>>>> complexity there.
>>>>
>>>> So fixing makedumpfile (either option 2 or option 3) seems least risky
>>>> to me. In fact, I would say let us keep it simple and just truncate
>>>> partial pages in makedumpfile, and look at option 3 once we have a
>>>> strong use case for partial pages.
>>>>
>>>> What do you think?
>>>>
>>>
>>> As you say, it's not clear that partial pages are really useful, but on
>>> the other hand, it's also not clear to me that they are really useless.
>>> I think we should grab them as long as we have access to them.
>>>
>>> Option 3) seems best to me. Switching between read() and mmap() would
>>> not be so complex, and it's far more flexible to do this in makedumpfile
>>> than in the kernel.
>>
>> OK, I am fine with option 3. It is the more complicated option, but the
>> safe one.
> 
> It sounds reasonable to me, too.
> 
>> Is there any chance that you could look into fixing this. I have no
>> experience writing code for makedumpfile.
> 
> I'll send a patch to fix this soon.
> 

Thanks.

BTW, the following patch has now been applied on top of makedumpfile in the
kexec-tools package on Fedora in order to avoid the issue.

https://lists.fedoraproject.org/pipermail/kexec/2013-November/000254.html

I remember the prototype version of the mmap patch implemented a kind of
--no-mmap option; we could use that to disable mmap() and fall back to
read() instead, which I think would be useful when we face this kind of
issue.

-- 
Thanks.
HATAYAMA, Daisuke



* Re: /proc/vmcore mmap() failure issue
  2013-11-19  9:55           ` HATAYAMA Daisuke
@ 2013-11-20  5:27             ` Atsushi Kumagai
  2013-11-20  6:43               ` HATAYAMA Daisuke
  2013-11-21  7:14               ` chaowang
  0 siblings, 2 replies; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-20  5:27 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe@redhat.com, chaowang@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, vgoyal@redhat.com

On 2013/11/19 18:56:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> (2013/11/18 9:51), Atsushi Kumagai wrote:
> > (2013/11/15 23:26), Vivek Goyal wrote:
> >> On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:
> >>
> >> [..]
> >>>> Given the fact that hpa does not like fixing it in the kernel, we are
> >>>> left with the option of fixing it in one of the following places:
> >>>>
> >>>> - Drop partial pages in kexec-tools.
> >>>> - Drop partial pages in makedumpfile.
> >>>> - Read partial pages using the read() interface in makedumpfile.
> >>>> - Modify /proc/vmcore to copy partial pages into the second kernel's memory.
> >>>>
> >>>> It is not clear to me that partial pages are really useful, so I
> >>>> want to avoid modifying /proc/vmcore to deal with partial pages and
> >>>> adding complexity there.
> >>>>
> >>>> So fixing makedumpfile (either option 2 or option 3) seems least
> >>>> risky to me. In fact, I would say let us keep it simple and just
> >>>> truncate partial pages in makedumpfile, and look at option 3 once we
> >>>> have a strong use case for partial pages.
> >>>>
> >>>> What do you think?
> >>>>
> >>>
> >>> As you say, it's not clear that partial pages are really useful, but
> >>> on the other hand, it's also not clear to me that they are really useless.
> >>> I think we should grab them as long as we have access to them.
> >>>
> >>> Option 3) seems best to me. Switching between read() and mmap() would
> >>> not be so complex, and it's far more flexible to do this in
> >>> makedumpfile than in the kernel.
> >>
> >> OK, I am fine with option 3. It is the more complicated option, but
> >> the safe one.
> > 
> > It sounds reasonable to me, too.
> > 
> >> Is there any chance that you could look into fixing this. I have no 
> >> experience writing code for makedumpfile.
> > 
> > I'll send a patch to fix this soon.
> > 
> 
> Thanks.
> 
> BTW, the following patch has now been applied on top of makedumpfile in the kexec-tools package on Fedora in order to avoid the issue.
> 
> https://lists.fedoraproject.org/pipermail/kexec/2013-November/000254.html
> 
> I remember the prototype version of the mmap patch implemented a kind of --no-mmap option; we could use that to disable mmap() and fall back to read() instead, which I think would be useful when we face this kind of issue.

How about this fallback structure instead of such an extra option?

Thanks
Atsushi Kumagai

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: Wed, 20 Nov 2013 14:10:19 +0900
Subject: [PATCH] Fall back to read() when mmap() fails.

Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
---
 makedumpfile.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index ca03440..f583602 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -324,7 +324,15 @@ read_from_vmcore(off_t offset, void *bufptr, unsigned long size)
 		if (!read_with_mmap(offset, bufptr, size)) {
 			ERRMSG("Can't read the dump memory(%s) with mmap().\n",
 			       info->name_memory);
-			return FALSE;
+
+			ERRMSG("This kernel might have some problems about mmap().\n");
+			ERRMSG("read() will be used instead of mmap() from now.\n");
+
+			/*
+			 * Fall back to read().
+			 */
+			info->flag_usemmap = FALSE;
+			return read_from_vmcore(offset, bufptr, size);
 		}
 	} else {
 		if (lseek(info->fd_memory, offset, SEEK_SET) == failed) {
-- 
1.8.0.2


* Re: /proc/vmcore mmap() failure issue
  2013-11-18 13:55           ` Vivek Goyal
@ 2013-11-20  5:29             ` Atsushi Kumagai
  2013-11-20 14:59               ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-20  5:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, d.hatayama@jp.fujitsu.com,
	ebiederm@xmission.com, dyoung@redhat.com, chaowang@redhat.com

On 2013/11/18 22:56:10, kexec <kexec-bounces@lists.infradead.org> wrote:
> On Mon, Nov 18, 2013 at 12:51:39AM +0000, Atsushi Kumagai wrote:
> 
> [..]
> > > Is there any chance that you could look into fixing this. I have no 
> > > experience writing code for makedumpfile.
> > 
> > I'll send a patch to fix this soon.
> 
> Thanks Atsushi.
> 
> Vivek

Vivek, could you test this patch?

Thanks
Atsushi Kumagai


From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: Wed, 20 Nov 2013 10:05:03 +0900
Subject: [PATCH] Disable mmap() for reading fractional pages.

Since mmap() was introduced on /proc/vmcore, it fails
for fractional pages which don't start or end at a page boundary,
due to a kernel issue.
This patch disables mmap() temporarily for fractional pages
to avoid this issue, so mmap() will be used only for aligned pages.

Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
---
 makedumpfile.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 3746cf6..ca03440 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -368,6 +368,7 @@ readpage_elf(unsigned long long paddr, void *bufptr)
 	off_t offset1, offset2;
 	size_t size1, size2;
 	unsigned long long phys_start, phys_end, frac_head = 0;
+	int original_usemmap = info->flag_usemmap;
 
 	offset1 = paddr_to_offset(paddr);
 	offset2 = paddr_to_offset(paddr + info->page_size);
@@ -392,6 +393,7 @@ readpage_elf(unsigned long long paddr, void *bufptr)
 		offset1 = paddr_to_offset(phys_start);
 		frac_head = phys_start - paddr;
 		memset(bufptr, 0, frac_head);
+		info->flag_usemmap = FALSE;
 	}
 
 	/*
@@ -402,6 +404,7 @@ readpage_elf(unsigned long long paddr, void *bufptr)
 		phys_end = page_head_to_phys_end(paddr);
 		offset2 = paddr_to_offset(phys_end);
 		memset(bufptr + (phys_end - paddr), 0, info->page_size - (phys_end - paddr));
+		info->flag_usemmap = FALSE;
 	}
 
 	/*
@@ -420,7 +423,7 @@ readpage_elf(unsigned long long paddr, void *bufptr)
 	if(!read_from_vmcore(offset1, bufptr + frac_head, size1)) {
 		ERRMSG("Can't read the dump memory(%s).\n",
 		       info->name_memory);
-		return FALSE;
+		goto error;
 	}
 
 	if (size1 + frac_head != info->page_size) {
@@ -429,11 +432,16 @@ readpage_elf(unsigned long long paddr, void *bufptr)
 		if(!read_from_vmcore(offset2, bufptr + frac_head + size1, size2)) {
 			ERRMSG("Can't read the dump memory(%s).\n",
 			       info->name_memory);
-			return FALSE;
+			goto error;
 		}
 	}
 
+	info->flag_usemmap = original_usemmap;
 	return TRUE;
+
+error:
+	info->flag_usemmap = original_usemmap;
+	return FALSE;
 }
 
 static int
-- 
1.8.0.2


* Re: /proc/vmcore mmap() failure issue
  2013-11-20  5:27             ` Atsushi Kumagai
@ 2013-11-20  6:43               ` HATAYAMA Daisuke
  2013-11-26  1:52                 ` Atsushi Kumagai
  2013-11-21  7:14               ` chaowang
  1 sibling, 1 reply; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-20  6:43 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe@redhat.com, chaowang@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, vgoyal@redhat.com

(2013/11/20 14:27), Atsushi Kumagai wrote:
> On 2013/11/19 18:56:21, kexec <kexec-bounces@lists.infradead.org> wrote:
>> (2013/11/18 9:51), Atsushi Kumagai wrote:
>>> (2013/11/15 23:26), Vivek Goyal wrote:
>>>> On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:
>>>>
>>>> [..]
>>>>>> Given the fact that hpa does not like fixing it in the kernel, we are
>>>>>> left with the option of fixing it in one of the following places:
>>>>>>
>>>>>> - Drop partial pages in kexec-tools.
>>>>>> - Drop partial pages in makedumpfile.
>>>>>> - Read partial pages using the read() interface in makedumpfile.
>>>>>> - Modify /proc/vmcore to copy partial pages into the second kernel's memory.
>>>>>>
>>>>>> It is not clear to me that partial pages are really useful, so I
>>>>>> want to avoid modifying /proc/vmcore to deal with partial pages and
>>>>>> adding complexity there.
>>>>>>
>>>>>> So fixing makedumpfile (either option 2 or option 3) seems least
>>>>>> risky to me. In fact, I would say let us keep it simple and just
>>>>>> truncate partial pages in makedumpfile, and look at option 3 once
>>>>>> we have a strong use case for partial pages.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>
>>>>> As you say, it's not clear that partial pages are really useful, but
>>>>> on the other hand, it's also not clear to me that they are really useless.
>>>>> I think we should grab them as long as we have access to them.
>>>>>
>>>>> Option 3) seems best to me. Switching between read() and mmap() would
>>>>> not be so complex, and it's far more flexible to do this in
>>>>> makedumpfile than in the kernel.
>>>>
>>>> OK, I am fine with option 3. It is the more complicated option, but
>>>> the safe one.
>>>
>>> It sounds reasonable to me, too.
>>>
>>>> Is there any chance that you could look into fixing this. I have no
>>>> experience writing code for makedumpfile.
>>>
>>> I'll send a patch to fix this soon.
>>>
>>
>> Thanks.
>>
>> BTW, the following patch has now been applied on top of makedumpfile in the kexec-tools package on Fedora in order to avoid the issue.
>>
>> https://lists.fedoraproject.org/pipermail/kexec/2013-November/000254.html
>>
>> I remember the prototype version of the mmap patch implemented a kind of --no-mmap option; we could use that to disable mmap() and fall back to read() instead, which I think would be useful when we face this kind of issue.
> 
> How about this fallback structure instead of such an extra option?
> 

I think this logic is useful and should be merged together with this fix.

However, I still think a kind of --no-mmap option is needed. A worse case
due to mmap() could happen in the future on some system; of course, I don't
know what that system would actually be, but at least it would be behaving
differently from typical systems... An option is more flexible than patching.

It would also be useful for debugging. read() is simpler than mmap(), and
read() is the baseline in the sense that makedumpfile initially didn't use
mmap(). There might be situations where we want to avoid using mmap(); for
example, when makedumpfile misbehaves and it looks like the cause is the
mmap() code in the kernel, we would want to see whether makedumpfile works
well with mmap() disabled.

-- 
Thanks.
HATAYAMA, Daisuke



* Re: /proc/vmcore mmap() failure issue
  2013-11-20  5:29             ` Atsushi Kumagai
@ 2013-11-20 14:59               ` Vivek Goyal
  2013-11-21  5:00                 ` Atsushi Kumagai
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-20 14:59 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, d.hatayama@jp.fujitsu.com,
	ebiederm@xmission.com, dyoung@redhat.com, chaowang@redhat.com

On Wed, Nov 20, 2013 at 05:29:16AM +0000, Atsushi Kumagai wrote:
> On 2013/11/18 22:56:10, kexec <kexec-bounces@lists.infradead.org> wrote:
> > On Mon, Nov 18, 2013 at 12:51:39AM +0000, Atsushi Kumagai wrote:
> > 
> > [..]
> > > > Is there any chance that you could look into fixing this. I have no 
> > > > experience writing code for makedumpfile.
> > > 
> > > I'll send a patch to fix this soon.
> > 
> > Thanks Atsushi.
> > 
> > Vivek
> 
> Vivek, could you test this patch?
> 
> Thanks
> Atsushi Kumagai
> 
> 
> From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> Date: Wed, 20 Nov 2013 10:05:03 +0900
> Subject: [PATCH] Disable mmap() for reading fractional pages.
> 
> Since mmap() was introduced on /proc/vmcore, it fails
> for fractional pages which don't start or end at a page boundary,
> due to a kernel issue.
> This patch disables mmap() temporarily for fractional pages
> to avoid this issue, so mmap() will be used only for aligned pages.
> 
> Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>

Hi Atsushi,

Even with this patch applied I see mmap() failure.

mem_map (39)
  mem_map    : ffffea0004e00000
  pfn_start  : 138000
  pfn_end    : 140000
read /proc/vmcore with mmap()
Excluding unnecessary pages        : [100.0 %] |STEP [Excluding unnecessary pages] : 0.035925 seconds
Excluding unnecessary pages        : [100.0 %] \STEP [Excluding unnecessary pages] : 0.035774 seconds
Excluding unnecessary pages        : [100.0 %] -STEP [Excluding unnecessary pages] : 0.035229 seconds
Copying data                       : [ 40.9 %] -Can't map [b98fd000-b9cfd000] with mmap()
read_from_vmcore: Can't read the dump memory(/proc/vmcore) with mmap().
readpage_elf: Can't read the dump memory(/proc/vmcore).
readmem: type_addr: 1, addr:bffba000, size:4096
read_pfn: Can't get the page data.
 Resource temporarily unavailable
makedumpfile Failed.
kdump: saving vmcore failed

Following is part of /proc/iomem on my system.

00100000-bffc283f : System RAM
  01000000-018c551d : Kernel code
  018c551e-01ef3f3f : Kernel data
  0204a000-02984fff : Kernel bss
  2e000000-35ffffff : Crash kernel
bffc2840-bfffffff : reserved

This is a different system than the one I used last time, so I am not sure
if this is the same error or something else. But one thing is clear: the last
System RAM page is partial (0xbffc283f + 1 = 0xbffc2840 is not a multiple of
the 0x1000 page size), so we would expect to hit the mmap() failure.

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-20 14:59               ` Vivek Goyal
@ 2013-11-21  5:00                 ` Atsushi Kumagai
  2013-11-21  8:31                   ` HATAYAMA Daisuke
  0 siblings, 1 reply; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-21  5:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, d.hatayama@jp.fujitsu.com,
	ebiederm@xmission.com, dyoung@redhat.com, chaowang@redhat.com

Hello Vivek,

On 2013/11/21 0:00:01, kexec <kexec-bounces@lists.infradead.org> wrote:
> > > > > Is there any chance that you could look into fixing this. I 
> > > > > have no experience writing code for makedumpfile.
> > > > 
> > > > I'll send a patch to fix this soon.
> > > 
> > > Thanks Atsushi.
> > > 
> > > Vivek
> > 
> > Vivek, could you test this patch ?
> > 
> > Thanks
> > Atsushi Kumagai
> > 
> > 
> > From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> > Date: Wed, 20 Nov 2013 10:05:03 +0900
> > Subject: [PATCH] Disable mmap() for reading fractional pages.
> > 
> > Since mmap() was introduced on /proc/vmcore, it fails for fractional 
> > pages which don't start or end at a page boundary, due to a kernel issue.
> > This patch disables mmap() temporarily for fractional pages to avoid 
> > this issue, so mmap() will be used only for aligned pages.
> > 
> > Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> 
> Hi Atsushi,
> 
> Even with this patch applied I see mmap() failure.
> 
> mem_map (39)
>   mem_map    : ffffea0004e00000
>   pfn_start  : 138000
>   pfn_end    : 140000
> read /proc/vmcore with mmap()
> Excluding unnecessary pages        : [100.0 %] |STEP [Excluding
> unnecessary pages] : 0.035925 seconds
> Excluding unnecessary pages        : [100.0 %] \STEP [Excluding
> unnecessary pages] : 0.035774 seconds
> Excluding unnecessary pages        : [100.0 %] -STEP [Excluding
> unnecessary pages] : 0.035229 seconds
> Copying data                       : [ 40.9 %] -Can't map
> [b98fd000-b9cfd000] with mmap()
> read_from_vmcore: Can't read the dump memory(/proc/vmcore) with mmap().
> readpage_elf: Can't read the dump memory(/proc/vmcore).
> readmem: type_addr: 1, addr:bffba000, size:4096
> read_pfn: Can't get the page data.
>  Resource temporarily unavailable
> makedumpfile Failed.
> kdump: saving vmcore failed
> 
> Following is part of /proc/iomem on my system.
> 
> 00100000-bffc283f : System RAM
>   01000000-018c551d : Kernel code
>   018c551e-01ef3f3f : Kernel data
>   0204a000-02984fff : Kernel bss
>   2e000000-35ffffff : Crash kernel
> bffc2840-bfffffff : reserved
> 
> This is a different system than the one I used last time, so I am not sure if this is the same error or something else. But one thing is clear: the last System RAM page is partial, so we would expect to hit the mmap() failure.

Thanks for your testing; I've found my mistake.

My patch tries to disable mmap() when a partial page is found, but by then
mmap() has already been called, because update_mmap_range() calls mmap()
for every 4MB region in advance.
If we want to keep using mmap() as much as possible, update_mmap_range() has
to check whether the target region of mmap() includes the partial pages
before calling mmap(), but that is too heavy for a workaround.
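
To illustrate the "in advance" part, here is a minimal standalone sketch of
that kind of windowed mapping -- the names and the window handling below are
mine, for illustration only, not the actual makedumpfile code:

#include <sys/mman.h>
#include <sys/types.h>

#define WINDOW (4UL << 20)		/* 4MB window, as described above */

static char *win;
static off_t win_start = (off_t)-1;

/* Map the whole window covering 'off' before any byte in it is read. */
static int update_window(int fd, off_t off)
{
	off_t aligned = off & ~((off_t)WINDOW - 1);

	if (win && win_start == aligned)
		return 0;			/* window already mapped */
	if (win)
		munmap(win, WINDOW);
	/*
	 * The map is taken up front, so one partial page anywhere in
	 * the window fails the whole mmap() before that page is read.
	 */
	win = mmap(NULL, WINDOW, PROT_READ, MAP_PRIVATE, fd, aligned);
	if (win == MAP_FAILED) {
		win = NULL;
		return -1;
	}
	win_start = aligned;
	return 0;
}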

So I think the patch I sent is enough; the policy becomes the simpler
"Don't use mmap() for buggy kernels".

[PATCH] Fall back to read() when mmap() fails.
http://lists.infradead.org/pipermail/kexec/2013-November/010199.html


Thanks
Atsushi Kumagai


* Re: /proc/vmcore mmap() failure issue
  2013-11-20  5:27             ` Atsushi Kumagai
  2013-11-20  6:43               ` HATAYAMA Daisuke
@ 2013-11-21  7:14               ` chaowang
  2013-11-25  8:09                 ` Atsushi Kumagai
  1 sibling, 1 reply; 30+ messages in thread
From: chaowang @ 2013-11-21  7:14 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: HATAYAMA Daisuke, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, vgoyal@redhat.com

On 11/20/13 at 05:27am, Atsushi Kumagai wrote:
> On 2013/11/19 18:56:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> > (2013/11/18 9:51), Atsushi Kumagai wrote:
> > > (2013/11/15 23:26), Vivek Goyal wrote:
> > >> On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:
> > >>
> > >> [..]
> > >>>> Given the fact that hpa does not like fixing it in the kernel, we are
> > >>>> left with the option of fixing it in one of the following places.
> > >>>>
> > >>>> - Drop partial pages in kexec-tools
> > >>>> - Drop partial pages in makedumpfile.
> > >>>> - Read partial pages using read() interface in makedumpfile
> > >>>> - Modify /proc/vmcore to copy partial pages in second kernel's memory.
> > >>>>
> > >>>> It is not clear to me that partial pages are really useful.  So I 
> > >>>> want to avoid modifying /proc/vmcore to deal with partial pages and 
> > >>>> increase complexity.
> > >>>>
> > >>>> So fixing makedumpfile (either option 2 or option 3) seems least 
> > >>>> risky to me. In fact I would say let us keep it simple and truncate 
> > >>>> partial pages in makedumpfile to keep it simple. And look at option 
> > >>>> 3 once we have a strong use case for partial pages.
> > >>>>
> > >>>> What do you think?
> > >>>>
> > >>>
> > >>> As you say, it's not clear that partial pages are really useful, but 
> > >>> on the other hand, it seems to me not clear that they are really useless.
> > >>> I think we should get them as long as we have access to them.
> > >>>
> > >>> It seems best to me the option 3). Switching between read and mmap
> > >>> would be not so complex, and it's also far more flexible in
> > >>> makedumpfile than in the kernel.
> > >>
> > >> Ok, I am fine with option 3. It is more complicated option but safe 
> > >> option.
> > > 
> > > It sounds reasonable also to me.
> > > 
> > >> Is there any chance that you could look into fixing this. I have no 
> > >> experience writing code for makedumpfile.
> > > 
> > > I'll send a patch to fix this soon.
> > > 
> > 
> > Thanks.
> > 
> > BTW, the following patch has now been applied on top of makedumpfile in the kexec-tools package on Fedora in order to avoid the issue.
> > 
> > https://lists.fedoraproject.org/pipermail/kexec/2013-November/000254.html
> > 
> > I remember the prototype version of the mmap patch implemented a kind of --no-mmap option that we could use to disable mmap() and use read() instead; I think that would be useful when we face this kind of issue.
> 
> How about this fall-back structure instead of such an extra option?
> 
> Thanks
> Atsushi Kumagai
> 
> From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> Date: Wed, 20 Nov 2013 14:10:19 +0900
> Subject: [PATCH] Fall back to read() when mmap() fails.
> 
> Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> ---
>  makedumpfile.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/makedumpfile.c b/makedumpfile.c
> index ca03440..f583602 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -324,7 +324,15 @@ read_from_vmcore(off_t offset, void *bufptr, unsigned long size)
>  		if (!read_with_mmap(offset, bufptr, size)) {
>  			ERRMSG("Can't read the dump memory(%s) with mmap().\n",
>  			       info->name_memory);
> -			return FALSE;
> +
> +			ERRMSG("This kernel might have some problems about mmap().\n");
> +			ERRMSG("read() will be used instead of mmap() from now.\n");
> +
> +			/*
> +			 * Fall back to read().
> +			 */
> +			info->flag_usemmap = FALSE;
> > +			return read_from_vmcore(offset, bufptr, size);

Hi, Atsushi

I've got such a workstation too, and I confirm this patch works for me.

However, I have a question:
Why not switch back to mmap() after the read()?

Thanks
WANG Chao

>  		}
>  	} else {
>  		if (lseek(info->fd_memory, offset, SEEK_SET) == failed) {
> -- 
> 1.8.0.2


* Re: /proc/vmcore mmap() failure issue
  2013-11-21  5:00                 ` Atsushi Kumagai
@ 2013-11-21  8:31                   ` HATAYAMA Daisuke
  2013-11-21 16:52                     ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-21  8:31 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: Vivek Goyal, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

(2013/11/21 14:00), Atsushi Kumagai wrote:
> Hello Vivek,
> 
> On 2013/11/21 0:00:01, kexec <kexec-bounces@lists.infradead.org> wrote:
>>>>>> Is there any chance that you could look into fixing this. I
>>>>>> have no experience writing code for makedumpfile.
>>>>>
>>>>> I'll send a patch to fix this soon.
>>>>
>>>> Thanks Atsushi.
>>>>
>>>> Vivek
>>>
>>> Vivek, could you test this patch ?
>>>
>>> Thanks
>>> Atsushi Kumagai
>>>
>>>
>>> From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
>>> Date: Wed, 20 Nov 2013 10:05:03 +0900
>>> Subject: [PATCH] Disable mmap() for reading fractional pages.
>>>
>>> Since mmap() was introduced on /proc/vmcore, it fails for fractional
>>> pages which don't start or end at a page boundary, due to a kernel issue.
>>> This patch disables mmap() temporarily for fractional pages to avoid
>>> this issue, so mmap() will be used only for aligned pages.
>>>
>>> Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
>>
>> Hi Atsushi,
>>
>> Even with this patch applied I see mmap() failure.
>>
>> mem_map (39)
>>    mem_map    : ffffea0004e00000
>>    pfn_start  : 138000
>>    pfn_end    : 140000
>> read /proc/vmcore with mmap()
>> Excluding unnecessary pages        : [100.0 %] |STEP [Excluding
>> unnecessary pages] : 0.035925 seconds
>> Excluding unnecessary pages        : [100.0 %] \STEP [Excluding
>> unnecessary pages] : 0.035774 seconds
>> Excluding unnecessary pages        : [100.0 %] -STEP [Excluding
>> unnecessary pages] : 0.035229 seconds
>> Copying data                       : [ 40.9 %] -Can't map
>> [b98fd000-b9cfd000] with mmap()
>> read_from_vmcore: Can't read the dump memory(/proc/vmcore) with mmap().
>> readpage_elf: Can't read the dump memory(/proc/vmcore).
>> readmem: type_addr: 1, addr:bffba000, size:4096
>> read_pfn: Can't get the page data.
>>   Resource temporarily unavailable
>> makedumpfile Failed.
>> kdump: saving vmcore failed
>>
>> Following is part of /proc/iomem on my system.
>>
>> 00100000-bffc283f : System RAM
>>    01000000-018c551d : Kernel code
>>    018c551e-01ef3f3f : Kernel data
>>    0204a000-02984fff : Kernel bss
>>    2e000000-35ffffff : Crash kernel
>> bffc2840-bfffffff : reserved
>>
>> This is a different system than the one I used last time, so I am not sure if this is the same error or something else. But one thing is clear: the last System RAM page is partial, so we would expect to hit the mmap() failure.
> 
> Thanks for your testing; I've found my mistake.
> 
> My patch tries to disable mmap() when a partial page is found, but by then
> mmap() has already been called, because update_mmap_range() calls mmap()
> for every 4MB region in advance.
> If we want to keep using mmap() as much as possible, update_mmap_range() has
> to check whether the target region of mmap() includes the partial pages
> before calling mmap(), but that is too heavy for a workaround.
> 
> So I think the patch I sent is enough; the policy becomes the simpler
> "Don't use mmap() for buggy kernels".
> 
> [PATCH] Fall back to read() when mmap() fails.
> http://lists.infradead.org/pipermail/kexec/2013-November/010199.html
> 

I don't think the logic becomes so complex. For example, if the input vmcore
format is ELF, then:

o in update_mmap_range():
  - first calculate the range of the corresponding PT_LOAD entry truncated to
    PAGE_SIZE boundaries.
  - Then, truncate the mmap() range by the truncated range of the corresponding
    PT_LOAD entry, i.e., exclude partial pages from the mmap() target range.
  - Then determine the offsets of the two partial pages; the number of partial
    pages is always at most two. The offsets can easily be calculated from the
    original range of the corresponding PT_LOAD entry.

o in read_from_vmcore(), if a given offset belongs to either of the two partial
  pages, then go to the read() path; if not, go to the mmap() path.
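
Something like the following untested sketch, for the ELF case (the helper
names and the two-element table are illustrative, not existing makedumpfile
symbols):

#include <sys/types.h>

#define PAGE_SIZE 4096UL

/* File offsets of the at-most-two partial pages of one PT_LOAD entry;
 * -1 means that edge is already page aligned. */
static off_t partial_page[2] = { -1, -1 };

/* Truncate one PT_LOAD entry's range to page boundaries for mmap() and
 * remember the partial pages at its edges for the read() path.  (A
 * PT_LOAD smaller than one page ends up with *mmap_start > *mmap_end,
 * i.e. nothing to mmap at all.) */
static void truncate_pt_load(off_t pload_start, off_t pload_end,
			     off_t *mmap_start, off_t *mmap_end)
{
	const off_t mask = (off_t)PAGE_SIZE - 1;

	*mmap_start = (pload_start + mask) & ~mask;
	*mmap_end   = pload_end & ~mask;

	partial_page[0] = (pload_start & mask) ? (pload_start & ~mask) : -1;
	partial_page[1] = (pload_end & mask) ? *mmap_end : -1;
}

/* Called from read_from_vmcore(): partial pages take the read() path,
 * everything else takes the mmap() path. */
static int offset_is_partial(off_t offset)
{
	off_t page = offset & ~((off_t)PAGE_SIZE - 1);

	return page == partial_page[0] || page == partial_page[1];
}

read_from_vmcore() would then just branch on offset_is_partial() before
choosing read() or the mapped window.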

-- 
Thanks.
HATAYAMA, Daisuke



* Re: /proc/vmcore mmap() failure issue
  2013-11-21  8:31                   ` HATAYAMA Daisuke
@ 2013-11-21 16:52                     ` Vivek Goyal
  2013-11-25  8:10                       ` Atsushi Kumagai
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2013-11-21 16:52 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: Atsushi Kumagai, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

On Thu, Nov 21, 2013 at 05:31:46PM +0900, HATAYAMA Daisuke wrote:

[..]
> > So I think the patch I sent is enough; the policy becomes the simpler
> > "Don't use mmap() for buggy kernels".
> > 
> > [PATCH] Fall back to read() when mmap() fails.
> > http://lists.infradead.org/pipermail/kexec/2013-November/010199.html
> > 
> 
> I don't think the logic becomes so complex. For example, if the input vmcore
> format is ELF, then:
> 
> o in update_mmap_range():
>   - first calculate the range of the corresponding PT_LOAD entry truncated to
>     PAGE_SIZE boundaries.
>   - Then, truncate the mmap() range by the truncated range of the corresponding
>     PT_LOAD entry, i.e., exclude partial pages from the mmap() target range.
>   - Then determine the offsets of the two partial pages; the number of partial
>     pages is always at most two. The offsets can easily be calculated from the
>     original range of the corresponding PT_LOAD entry.
> 
> o in read_from_vmcore(), if a given offset belongs to either of the two partial
>   pages, then go to the read() path; if not, go to the mmap() path.

I agree that we should do mmap() on all non-partial pages and read() on
all partial pages. Otherwise we lose the speed benefit of mmap().

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-21  7:14               ` chaowang
@ 2013-11-25  8:09                 ` Atsushi Kumagai
  2013-11-26  3:29                   ` chaowang
  0 siblings, 1 reply; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-25  8:09 UTC (permalink / raw)
  To: chaowang@redhat.com
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, HATAYAMA Daisuke,
	ebiederm@xmission.com, dyoung@redhat.com, vgoyal@redhat.com

Hello WANG,

On 2013/11/21 16:15:22, kexec <kexec-bounces@lists.infradead.org> wrote:
> > How about this fall-back structure instead of such an extra option?
> > 
> > Thanks
> > Atsushi Kumagai
> > 
> > From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> > Date: Wed, 20 Nov 2013 14:10:19 +0900
> > Subject: [PATCH] Fall back to read() when mmap() fails.
> > 
> > Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> > ---
> >  makedumpfile.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/makedumpfile.c b/makedumpfile.c
> > index ca03440..f583602 100644
> > --- a/makedumpfile.c
> > +++ b/makedumpfile.c
> > @@ -324,7 +324,15 @@ read_from_vmcore(off_t offset, void *bufptr, unsigned long size)
> >  		if (!read_with_mmap(offset, bufptr, size)) {
> >  			ERRMSG("Can't read the dump memory(%s) with mmap().\n",
> >  			       info->name_memory);
> > -			return FALSE;
> > +
> > +			ERRMSG("This kernel might have some problems about mmap().\n");
> > +			ERRMSG("read() will be used instead of mmap() from now.\n");
> > +
> > +			/*
> > +			 * Fall back to read().
> > +			 */
> > +			info->flag_usemmap = FALSE;
> > +			return read_from_vmcore(offset, bufptr, size);
> 
> Hi, Atsushi
> 
> I've got such a workstation too, and I confirm this patch works for me.

Thanks for your testing!

> However, I have a question:
> Why not switch back to mmap() after the read()?

I made this patch as a general safety net, not only for the partial page
issue.
When facing unknown issues related to mmap(), the kernel may have some bugs
and mmap() can fail for every page. In the worst case, almost all mmap()
calls would fail and then retry with read(), printing an error message after
every failure; this patch prevents that chattering of the switch and the
flood of error messages.


Thanks
Atsushi Kumagai

> Thanks
> WANG Chao
> 
> >  		}
> >  	} else {
> >  		if (lseek(info->fd_memory, offset, SEEK_SET) == failed) {
> > -- 
> > 1.8.0.2
> 
> 


* Re: /proc/vmcore mmap() failure issue
  2013-11-21 16:52                     ` Vivek Goyal
@ 2013-11-25  8:10                       ` Atsushi Kumagai
  2013-11-25  9:01                         ` HATAYAMA Daisuke
  0 siblings, 1 reply; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-25  8:10 UTC (permalink / raw)
  To: Vivek Goyal, HATAYAMA Daisuke
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

On 2013/11/22 1:53:14, kexec <kexec-bounces@lists.infradead.org> wrote:
> On Thu, Nov 21, 2013 at 05:31:46PM +0900, HATAYAMA Daisuke wrote:
> 
> [..]
> > > So I think the patch I sent is enough; the policy becomes the simpler
> > > "Don't use mmap() for buggy kernels".
> > > 
> > > [PATCH] Fall back to read() when mmap() fails.
> > > http://lists.infradead.org/pipermail/kexec/2013-November/010199.html
> > > 
> > 
> > I don't think the logic becomes so complex. For example, if the input vmcore
> > format is ELF, then:
> > 
> > o in update_mmap_range():
> >   - first calculate the range of the corresponding PT_LOAD entry truncated to
> >     PAGE_SIZE boundaries.
> >   - Then, truncate the mmap() range by the truncated range of the corresponding
> >     PT_LOAD entry, i.e., exclude partial pages from the mmap() target range.
> >   - Then determine the offsets of the two partial pages; the number of partial
> >     pages is always at most two. The offsets can easily be calculated from the
> >     original range of the corresponding PT_LOAD entry.
> > 
> > o in read_from_vmcore(), if a given offset belongs to either of the two partial
> >   pages, then go to the read() path; if not, go to the mmap() path.
> 
> I agree that we should do mmap() on all non-partial pages and read() on
> all partial pages. Otherwise we lose the speed benefit of mmap().

I agree with avoiding this issue by fixing makedumpfile as a workaround,
since fixing the kernel is tough and risky. However, it sounds strange to me
to fix the userspace side elaborately for such a definite kernel issue whose
cause is known, so we should also fix the kernel itself.

Otherwise, will you continue to add specific fixes to user tools to address
kernel issues like this case?


Thanks
Atsushi Kumagai

> Thanks
> Vivek
> 
> 


* Re: /proc/vmcore mmap() failure issue
  2013-11-25  8:10                       ` Atsushi Kumagai
@ 2013-11-25  9:01                         ` HATAYAMA Daisuke
  2013-11-25 14:41                           ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-25  9:01 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: Vivek Goyal, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

(2013/11/25 17:10), Atsushi Kumagai wrote:
> On 2013/11/22 1:53:14, kexec <kexec-bounces@lists.infradead.org> wrote:
>> On Thu, Nov 21, 2013 at 05:31:46PM +0900, HATAYAMA Daisuke wrote:
>>
>> [..]
>>>> So I think the patch I sent is enough; the policy becomes the simpler
>>>> "Don't use mmap() for buggy kernels".
>>>>
>>>> [PATCH] Fall back to read() when mmap() fails.
>>>> http://lists.infradead.org/pipermail/kexec/2013-November/010199.html
>>>>
>>>
>>> I don't think the logic becomes so complex. For example, if the input vmcore
>>> format is ELF, then:
>>>
>>> o in update_mmap_range():
>>>    - first calculate the range of the corresponding PT_LOAD entry truncated
>>>      to PAGE_SIZE boundaries.
>>>    - Then, truncate the mmap() range by the truncated range of the
>>>      corresponding PT_LOAD entry, i.e., exclude partial pages from the
>>>      mmap() target range.
>>>    - Then determine the offsets of the two partial pages; the number of
>>>      partial pages is always at most two. The offsets can easily be
>>>      calculated from the original range of the corresponding PT_LOAD entry.
>>>
>>> o in read_from_vmcore(), if a given offset belongs to either of the two
>>>    partial pages, then go to the read() path; if not, go to the mmap() path.
>>
>> I agree that we should do mmap() on all non-partial pages and read() on
>> all partial pages. Otherwise we lose the speed benefit of mmap().
> 
> I agree with avoiding this issue by fixing makedumpfile as a workaround,
> since fixing the kernel is tough and risky. However, it sounds strange to me
> to fix the userspace side elaborately for such a definite kernel issue whose
> cause is known, so we should also fix the kernel itself.
> 

> Otherwise, will you continue to add specific fixes to user tools to address
> kernel issues like this case?
> 

makedumpfile supports a wide range of kernel versions and needs to maintain
backward compatibility. mmap() on /proc/vmcore might be backported to some
old kernel versions on some distributions if necessary, and it's hard to fix
each old kernel at each backport. A method that can be applied to all kernels
in general is necessary.

Also, looking at the ia64 case, where there's boot loader data on partial
pages, there could be other environments where partial pages contain other
important data owned by other components. So the issue depends not only on
kernels but also on other components, such as boot loaders and firmware,
that can put data on partial pages. We should retrieve that data as long as
it is important and we have access to it.

-- 
Thanks.
HATAYAMA, Daisuke



* Re: /proc/vmcore mmap() failure issue
  2013-11-25  9:01                         ` HATAYAMA Daisuke
@ 2013-11-25 14:41                           ` Vivek Goyal
  2013-11-26  1:51                             ` Atsushi Kumagai
  2013-11-26  5:16                             ` HATAYAMA Daisuke
  0 siblings, 2 replies; 30+ messages in thread
From: Vivek Goyal @ 2013-11-25 14:41 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: Atsushi Kumagai, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

On Mon, Nov 25, 2013 at 06:01:37PM +0900, HATAYAMA Daisuke wrote:

[..]
> > I agree with avoiding this issue by fixing makedumpfile as a workaround,
> > since fixing the kernel is tough and risky. However, it sounds strange to me
> > to fix the userspace side elaborately for such a definite kernel issue whose
> > cause is known, so we should also fix the kernel itself.
> > 
> 
> > Otherwise, will you continue to add specific fixes to user tools to address
> > kernel issues like this case?
> > 
> 
> makedumpfile supports a wide range of kernel versions and needs to maintain
> backward compatibility. mmap() on /proc/vmcore might be backported to some
> old kernel versions on some distributions if necessary, and it's hard to fix
> each old kernel at each backport. A method that can be applied to all kernels
> in general is necessary.
> 
> Also, looking at the ia64 case, where there's boot loader data on partial
> pages, there could be other environments where partial pages contain other
> important data owned by other components. So the issue depends not only on
> kernels but also on other components, such as boot loaders and firmware,
> that can put data on partial pages. We should retrieve that data as long as
> it is important and we have access to it.

Hi Atsushi, Hatayama,

So even if we fix the mmap() issue in the kernel, it looks like it will be a
good idea to ship the fix in makedumpfile as well, as there have already been
kernel releases where mmap() causes issues.

Having said that, I think we need to fix it in the kernel too. I was not sure
what the right fix is: should we truncate partial pages, or should we just
copy partial pages from old memory to the new kernel's memory and fill the
rest of each partial page with zeros? That's why I was hoping that
makedumpfile could fill the gap.

Copying partial pages to new memory seems like the safer approach. So maybe
we can take one fix in makedumpfile and another in the kernel.
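
For illustration only, a rough, untested sketch of that copying idea --
copy_oldmem_page() is the existing /proc/vmcore helper, while
stage_partial_page() and its use are hypothetical:

#include <linux/crash_dump.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Copy the valid bytes of a partial page from old memory into a
 * zero-filled page of the second kernel, so that /proc/vmcore can
 * later serve a fully page-aligned copy of it.
 */
static void *stage_partial_page(unsigned long pfn, size_t valid_bytes)
{
	char *buf = (char *)get_zeroed_page(GFP_KERNEL);

	if (!buf)
		return NULL;
	/* Copy only the bytes backed by System RAM; the tail stays zero. */
	if (copy_oldmem_page(pfn, buf, valid_bytes, 0, 0) < 0) {
		free_page((unsigned long)buf);
		return NULL;
	}
	return buf;
}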

Hatayama, I know that in the past your initial mmap() patches were copying
partial pages to the new kernel's memory. Would you like to resurrect that
patch?

Thanks
Vivek


* Re: /proc/vmcore mmap() failure issue
  2013-11-25 14:41                           ` Vivek Goyal
@ 2013-11-26  1:51                             ` Atsushi Kumagai
  2013-11-26  5:16                             ` HATAYAMA Daisuke
  1 sibling, 0 replies; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-26  1:51 UTC (permalink / raw)
  To: Vivek Goyal, HATAYAMA Daisuke
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

On 2013/11/25 23:42:31, kexec <kexec-bounces@lists.infradead.org> wrote:
> On Mon, Nov 25, 2013 at 06:01:37PM +0900, HATAYAMA Daisuke wrote:
> 
> [..]
> > > I agree with avoiding this issue by fixing makedumpfile as a workaround,
> > > since fixing the kernel is tough and risky. However, it sounds strange to me
> > > to fix the userspace side elaborately for such a definite kernel issue whose
> > > cause is known, so we should also fix the kernel itself.
> > > 
> > 
> > > Otherwise, will you continue to add specific fixes to user tools to address
> > > kernel issues like this case?
> > > 
> > 
> > makedumpfile supports a wide range of kernel versions and needs to maintain
> > backward compatibility. mmap() on /proc/vmcore might be backported to some
> > old kernel versions on some distributions if necessary, and it's hard to fix
> > each old kernel at each backport. A method that can be applied to all kernels
> > in general is necessary.
> > 
> > Also, looking at the ia64 case, where there's boot loader data on partial
> > pages, there could be other environments where partial pages contain other
> > important data owned by other components. So the issue depends not only on
> > kernels but also on other components, such as boot loaders and firmware,
> > that can put data on partial pages. We should retrieve that data as long as
> > it is important and we have access to it.
> 
> Hi Atsushi, Hatayama,
> 
> So even if we fix the mmap() issue in the kernel, it looks like it will be a
> good idea to ship the fix in makedumpfile as well, as there have already been
> kernel releases where mmap() causes issues.

OK, I'll make a patch set for makedumpfile to address the issues around mmap():

  1. Fix the partial page issue
  2. Introduce the general fall-back structure (already posted):
     http://lists.infradead.org/pipermail/kexec/2013-November/010199.html
  3. Add a --non-mmap option
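
With 3. in place, disabling mmap() for a test run would be just, for example
(assuming the option keeps the --non-mmap spelling above):

  makedumpfile --non-mmap -l -d 31 /proc/vmcore dumpfile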


Thanks
Atsushi Kumagai

> Having said that, I think we need to fix it in the kernel too. I was not sure
> what the right fix is: should we truncate partial pages, or should we just
> copy partial pages from old memory to the new kernel's memory and fill the
> rest of each partial page with zeros? That's why I was hoping that
> makedumpfile could fill the gap.
> 
> Copying partial pages to new memory seems like the safer approach. So maybe
> we can take one fix in makedumpfile and another in the kernel.
> 
> Hatayama, I know that in the past your initial mmap() patches were copying
> partial pages to the new kernel's memory. Would you like to resurrect that
> patch?
> 
> Thanks
> Vivek
> 
> 


* Re: /proc/vmcore mmap() failure issue
  2013-11-20  6:43               ` HATAYAMA Daisuke
@ 2013-11-26  1:52                 ` Atsushi Kumagai
  0 siblings, 0 replies; 30+ messages in thread
From: Atsushi Kumagai @ 2013-11-26  1:52 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, vgoyal@redhat.com,
	ebiederm@xmission.com, dyoung@redhat.com, chaowang@redhat.com

On 2013/11/20 15:44:45, kexec <kexec-bounces@lists.infradead.org> wrote:
> (2013/11/20 14:27), Atsushi Kumagai wrote:
> > On 2013/11/19 18:56:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> >> (2013/11/18 9:51), Atsushi Kumagai wrote:
> >>> (2013/11/15 23:26), Vivek Goyal wrote:
> >>>> On Fri, Nov 15, 2013 at 06:41:52PM +0900, HATAYAMA Daisuke wrote:
> >>>>
> >>>> [..]
> > >>>>>> Given the fact that hpa does not like fixing it in the kernel, we are
> > >>>>>> left with the option of fixing it in one of the following places.
> >>>>>>
> >>>>>> - Drop partial pages in kexec-tools
> > >>>>>> - Drop partial pages in makedumpfile.
> >>>>>> - Read partial pages using read() interface in makedumpfile
> >>>>>> - Modify /proc/vmcore to copy partial pages in second kernel's memory.
> >>>>>>
> >>>>>> It is not clear to me that partial pages are really useful.  So I
> >>>>>> want to avoid modifying /proc/vmcore to deal with partial pages and
> >>>>>> increase complexity.
> >>>>>>
> > >>>>>> So fixing makedumpfile (either option 2 or option 3) seems least
> >>>>>> risky to me. In fact I would say let us keep it simple and truncate
> >>>>>> partial pages in makedumpfile to keep it simple. And look at option
> >>>>>> 3 once we have a strong use case for partial pages.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>
> >>>>> As you say, it's not clear that partial pages are really useful, but
> >>>>> on the other hand, it seems to me not clear that they are really useless.
> >>>>> I think we should get them as long as we have access to them.
> >>>>>
> >>>>> It seems best to me the option 3). Switching between read and mmap
> >>>>> would be not so complex, and it's also far more flexible in
> >>>>> makedumpfile than in the kernel.
> >>>>
> >>>> Ok, I am fine with option 3. It is more complicated option but safe
> >>>> option.
> >>>
> >>> It sounds reasonable also to me.
> >>>
> >>>> Is there any chance that you could look into fixing this. I have no
> >>>> experience writing code for makedumpfile.
> >>>
> >>> I'll send a patch to fix this soon.
> >>>
> >>
> >> Thanks.
> >>
> >> BTW, the following patch has now been applied on top of makedumpfile in the kexec-tools package on Fedora in order to avoid the issue.
> >>
> >> https://lists.fedoraproject.org/pipermail/kexec/2013-November/000254.html
> >>
> >> I remember the prototype version of the mmap patch implemented a kind of --no-mmap option that we could use to disable mmap() and use read() instead; I think that would be useful when we face this kind of issue.
> > 
> > How about this fall-back structure instead of such an extra option?
> > 
> 
> I think this logic is useful and should be merged as part of this fix.
> 
> However, I still think a kind of --no-mmap option is needed. A worse case
> could arise from mmap() in the future on some system; of course, I don't know
> what that system would actually be, but at least it would have to behave
> differently from typical systems. An option is more flexible than patching.
> 
> It would also be useful for debugging. read() is simpler than mmap(), and
> read() is the baseline in the sense that makedumpfile initially didn't use
> mmap(). There might be situations where we want to avoid mmap(); for example,
> when makedumpfile misbehaves and the cause looks like the mmap() code in the
> kernel, we would want to see whether makedumpfile works correctly with mmap()
> disabled.

Thanks for your explanation.
Additionally, an option to disable mmap() manually will also help my testing,
so I'll introduce the option upstream.


Thanks
Atsushi Kumagai

> -- 
> Thanks.
> HATAYAMA, Daisuke
> 
> 
> 


* Re: /proc/vmcore mmap() failure issue
  2013-11-25  8:09                 ` Atsushi Kumagai
@ 2013-11-26  3:29                   ` chaowang
  0 siblings, 0 replies; 30+ messages in thread
From: chaowang @ 2013-11-26  3:29 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, HATAYAMA Daisuke,
	ebiederm@xmission.com, dyoung@redhat.com, vgoyal@redhat.com

On 11/25/13 at 08:09am, Atsushi Kumagai wrote:
> Hello WANG,
> 
> On 2013/11/21 16:15:22, kexec <kexec-bounces@lists.infradead.org> wrote:
> > > How about this fall-back structure instead of such an extra option?
> > > 
> > > Thanks
> > > Atsushi Kumagai
> > > 
> > > From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> > > Date: Wed, 20 Nov 2013 14:10:19 +0900
> > > Subject: [PATCH] Fall back to read() when mmap() fails.
> > > 
> > > Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> > > ---
> > >  makedumpfile.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/makedumpfile.c b/makedumpfile.c
> > > index ca03440..f583602 100644
> > > --- a/makedumpfile.c
> > > +++ b/makedumpfile.c
> > > @@ -324,7 +324,15 @@ read_from_vmcore(off_t offset, void *bufptr, unsigned long size)
> > >  		if (!read_with_mmap(offset, bufptr, size)) {
> > >  			ERRMSG("Can't read the dump memory(%s) with mmap().\n",
> > >  			       info->name_memory);
> > > -			return FALSE;
> > > +
> > > +			ERRMSG("This kernel might have some problems about mmap().\n");
> > > +			ERRMSG("read() will be used instead of mmap() from now.\n");
> > > +
> > > +			/*
> > > +			 * Fall back to read().
> > > +			 */
> > > +			info->flag_usemmap = FALSE;
> > > +			return read_from_vmcore(offset, bufptr, size);
> > 
> > Hi, Atsushi
> > 
> > I've got such a workstation too, and I confirm this patch works for me.
> 
> Thanks for your testing!
> 
> > However, I have a question:
> > Why not switch back to mmap() after the read()?
> 
> I made this patch as a general safety net, not only for the partial page
> issue.
> When facing unknown issues related to mmap(), the kernel may have some bugs
> and mmap() can fail for every page. In the worst case, almost all mmap()
> calls would fail and then retry with read(), printing an error message after
> every failure; this patch prevents that chattering of the switch and the
> flood of error messages.

Thanks for your explanation. I agree with you. Since mmap() is error prone
after the first mmap() failure, using read() instead as a fail-safe makes
much sense to me.

WANG Chao
> 
> 
> Thanks
> Atsushi Kumagai
> 
> > Thanks
> > WANG Chao
> > 
> > >  		}
> > >  	} else {
> > >  		if (lseek(info->fd_memory, offset, SEEK_SET) == failed) {
> > > -- 
> > > 1.8.0.2
> > 
> > 


* Re: /proc/vmcore mmap() failure issue
  2013-11-25 14:41                           ` Vivek Goyal
  2013-11-26  1:51                             ` Atsushi Kumagai
@ 2013-11-26  5:16                             ` HATAYAMA Daisuke
  1 sibling, 0 replies; 30+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-26  5:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Atsushi Kumagai, bhe@redhat.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, ebiederm@xmission.com,
	dyoung@redhat.com, chaowang@redhat.com

(2013/11/25 23:41), Vivek Goyal wrote:
> On Mon, Nov 25, 2013 at 06:01:37PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>>> I agree with avoiding this issue by fixing makedumpfile as a workaround,
>>> since fixing the kernel is tough and risky. However, it sounds strange to me
>>> to fix the userspace side elaborately for such a definite kernel issue whose
>>> cause is known, so we should also fix the kernel itself.
>>>
>>
>>> Otherwise, will you continue to add specific fixes to user tools to address
>>> kernel issues like this case?
>>>
>>
>> makedumpfile supports a wide range of kernel versions and needs to maintain
>> backward compatibility. mmap() on /proc/vmcore might be backported to some
>> old kernel versions on some distributions if necessary, and it's hard to fix
>> each old kernel at each backport. A method that can be applied to all kernels
>> in general is necessary.
>>
>> Also, looking at the ia64 case, where there's boot loader data on partial
>> pages, there could be other environments where partial pages contain other
>> important data owned by other components. So the issue depends not only on
>> kernels but also on other components, such as boot loaders and firmware,
>> that can put data on partial pages. We should retrieve that data as long as
>> it is important and we have access to it.
>
> Hi Atsushi, Hatayama,
>
> So even if we fix the mmap() issue in the kernel, it looks like it will be a
> good idea to ship the fix in makedumpfile as well, as there have already been
> kernel releases where mmap() causes issues.
>
> Having said that, I think we need to fix it in the kernel too. I was not sure
> what the right fix is: should we truncate partial pages, or should we just
> copy partial pages from old memory to the new kernel's memory and fill the
> rest of each partial page with zeros? That's why I was hoping that
> makedumpfile could fill the gap.
>
> Copying partial pages to new memory seems like the safer approach. So maybe
> we can take one fix in makedumpfile and another in the kernel.
>
> Hatayama, I know that in the past your initial mmap() patches were copying
> partial pages to the new kernel's memory. Would you like to resurrect that
> patch?
>

(Oh, yesterday I had totally forgotten the partial page handling that was
dropped from the mmap() patch set in the middle of development...)

Yes, but please wait until next week.

-- 
Thanks.
HATAYAMA, Daisuke



end of thread

Thread overview: 30+ messages
2013-11-13 20:41 /proc/vmcore mmap() failure issue Vivek Goyal
2013-11-13 21:04 ` Vivek Goyal
2013-11-13 21:14   ` H. Peter Anvin
2013-11-13 22:41     ` Vivek Goyal
2013-11-13 22:44       ` H. Peter Anvin
2013-11-13 23:00         ` Vivek Goyal
2013-11-13 23:08           ` H. Peter Anvin
2013-11-14 10:31 ` HATAYAMA Daisuke
2013-11-14 15:13   ` Vivek Goyal
2013-11-15  9:41     ` HATAYAMA Daisuke
2013-11-15 14:26       ` Vivek Goyal
2013-11-18  0:51         ` Atsushi Kumagai
2013-11-18 13:55           ` Vivek Goyal
2013-11-20  5:29             ` Atsushi Kumagai
2013-11-20 14:59               ` Vivek Goyal
2013-11-21  5:00                 ` Atsushi Kumagai
2013-11-21  8:31                   ` HATAYAMA Daisuke
2013-11-21 16:52                     ` Vivek Goyal
2013-11-25  8:10                       ` Atsushi Kumagai
2013-11-25  9:01                         ` HATAYAMA Daisuke
2013-11-25 14:41                           ` Vivek Goyal
2013-11-26  1:51                             ` Atsushi Kumagai
2013-11-26  5:16                             ` HATAYAMA Daisuke
2013-11-19  9:55           ` HATAYAMA Daisuke
2013-11-20  5:27             ` Atsushi Kumagai
2013-11-20  6:43               ` HATAYAMA Daisuke
2013-11-26  1:52                 ` Atsushi Kumagai
2013-11-21  7:14               ` chaowang
2013-11-25  8:09                 ` Atsushi Kumagai
2013-11-26  3:29                   ` chaowang
